
Weihua Li · Xiaoli Zhang · Ruqiang Yan

Intelligent Fault Diagnosis and Health Assessment
for Complex Electro-Mechanical Systems
Weihua Li
School of Mechanical and Automotive Engineering
South China University of Technology
Guangzhou, Guangdong, China

Xiaoli Zhang
School of Construction Machinery
Chang'an University
Xi'an, Shaanxi, China

Ruqiang Yan
School of Mechanical Engineering
Xi’an Jiaotong University
Xi’an, Shaanxi, China

ISBN 978-981-99-3536-9 ISBN 978-981-99-3537-6 (eBook)


https://doi.org/10.1007/978-981-99-3537-6

Jointly published with National Defense Industry Press


The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the
print book from: National Defense Industry Press.

© National Defense Industry Press 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

Production equipment in process industries such as energy, petrochemical, and metallurgy, as well as in other sectors of the modern national economy such as national defense, usually runs under complex and harsh working conditions, including high temperatures, heavy loads, severe corrosion, and serious fatigue under alternating stress. Faults of varying severity inevitably occur in the key components of such complex electromechanical systems during operation, posing hidden dangers to production safety or even leading to disastrous casualties. Fault diagnosis and prognostics are therefore of great significance for ensuring the safe operation of equipment. Liangsheng Qu, an academician of the Chinese Academy of Sciences, pointed out that equipment fault diagnosis is essentially a pattern recognition problem. Accordingly, important development directions such as intelligent fault diagnosis, prognostics, and health assessment have grown out of the successful application of pattern recognition methods based on artificial intelligence (AI) and machine learning in the field of fault diagnosis. Starting from this point, this book combines the authors' latest research achievements in intelligent fault diagnosis, prognostics, and health assessment, and mainly focuses on the application of novel machine learning methods, such as enhanced support vector machines, semi-supervised learning, manifold learning, and deep belief networks, to signal feature extraction and fault diagnosis. In addition, the application of performance degradation assessment methods for mechanical components based on phase space reconstruction theory is introduced and illustrated through simulation analysis and typical engineering cases.
The book consists of seven chapters. Chapter 1 introduces the definition of a
complex electromechanical system and the research contents and development status
of intelligent fault diagnosis, prognostics, and health assessment. Chapter 2 presents
supervised support vector machines (SVM)-based algorithms and their applications
in machinery fault diagnosis. Semi-supervised intelligent fault diagnosis methods,
such as kernel principal component analysis (KPCA), fuzzy kernel clustering algo-
rithms, self-organizing map (SOM) neural networks, and relevance vector machines
(RVM) are systematically described in Chap. 3. In Chap. 4, fault feature selection and
dimension reduction algorithms based on manifold learning, including spectral clustering, locally linear embedding (LLE), and distance-preserving projection are addressed, followed by the introduction of deep learning theories and deep
belief network (DBN)-based signal reconstruction and fault diagnosis methods in
Chap. 5. Chapter 6 introduces the basic theory of phase space reconstruction and
the degradation performance assessment and remaining useful life (RUL) prediction
research of the electromechanical system based on recurrence quantification analysis
(RQA) and the Kalman filter (KF). Finally, Chap. 7 discusses the reliability assessment
problems of typical complex electromechanical systems, such as turbine generator
sets, compressor gearboxes, and aero-engine rotors.
This book is a summary of the authors' long-term research; most of the listed
examples are research findings on intelligent fault diagnosis and prognostics
of complex electromechanical systems. Chapters 1, 3, 4, and 5 were written and
compiled by Prof. Weihua Li. Prof. Xiaoli Zhang mainly contributed to Chaps. 2 and
7, and Prof. Ruqiang Yan wrote Chap. 6.
The work in this book has been supported by many research projects and individuals. I would
like to express sincere appreciation for the support of the National Natural Science
Foundation of China under Grants 50605021, 51075150, 51175080, 51405028, and
51475170 (in order of funding time) and the China Postdoctoral Science Foundation. I
am grateful to Prof. Shuzi Yang, an academician of the Chinese Academy of Sciences,
and Prof. Tielin Shi, for their guidance and encouragement. I would also like to
acknowledge Prof. Rui Kang from Beihang University and Editor Tianming Bai
from National Defense Industry Press for their great support and help. Last, I would
like to thank the graduate students who have contributed a lot to the proofreading
and typesetting of the book, including Yixiao Liao, Can Pan, Bin Zhang, Lanxin Liu,
and Qiuli Chen.
No book is perfect, and this one inevitably has shortcomings; we welcome
criticism and generous advice from readers.

Guangzhou, China Weihua Li


March 2020
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Intelligent Fault Diagnosis, Prognostics, and Health
Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Significance of Complex Electro-Mechanical System
Fault Diagnosis, Prognostics, and Health Assessment . . . . . . . . . . . . 2
1.3 The Contents of Complex Electro-Mechanical System Fault
Diagnosis, Prognostics, and Health Assessment . . . . . . . . . . . . . . . . . 4
1.4 Overview of Intelligent Fault Diagnosis, Prognostics,
and Health Assessment (IFDPHA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Shallow Machine Learning-Based Methods . . . . . . . . . . . . . . 6
1.4.2 Deep Learning-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Organization and Characteristics of the Book . . . . . . . . . . . . . . . . . . . 9
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Supervised SVM Based Intelligent Fault Diagnosis Methods . . . . . . . . 13
2.1 The Theory of Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 The General Model of Supervised Learning . . . . . . . . . . . . . . 14
2.1.2 Risk Minimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Primary Learning Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Linear Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Nonlinear Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Kernel Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 The Applications of SVM in Machinery Fault
Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 The Parameters Optimization Method for SVM . . . . . . . . . . . . . . . . . 24
2.3.1 Ant Colony Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Ant Colony Optimization Based Parameters
Optimization Method for SVM . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.3 Verification and Analysis on Public Datasets . . . . . . . . . . . . . 36

2.3.4 The Application in Electrical Locomotive Rolling
Bearing Single Fault Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4 Feature Selection and Parameters Optimization Method
for SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4.1 Ant Colony Optimization Based Feature Selection
and Parameters Optimization Method for SVM . . . . . . . . . . . 52
2.4.2 The Application in Rotor Multi Fault Diagnosis
of Bently Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.4.3 The Application in Electrical Locomotive Rolling
Bearing Multi Fault Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.5 Ensemble-Based Incremental Support Vector Machines . . . . . . . . . . 69
2.5.1 The Theory of Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . 71
2.5.2 The Theory of Reinforcement Learning . . . . . . . . . . . . . . . . . 73
2.5.3 Ensemble-Based Incremental Support Vector
Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.5.4 The Comparison Experiment Based on Rolling
Bearing Incipient Fault Diagnosis . . . . . . . . . . . . . . . . . . . . . . 79
2.5.5 The Application in Electrical Locomotive Rolling
Bearing Compound Fault Diagnosis . . . . . . . . . . . . . . . . . . . . 88
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3 Semi-supervised Learning Based Intelligent Fault Diagnosis
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.1 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.2 Fault Detection and Classification Based on Semi-supervised
Kernel Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.2.1 Kernel Principal Component Analysis . . . . . . . . . . . . . . . . . . . 96
3.2.2 Semi-supervised Kernel Principal Component Analysis . . . . . 97
3.2.3 Semi-supervised KPCA Classification Algorithms . . . . . . . . 109
3.2.4 Application of Semi-supervised KPCA Method
in Transmission Fault Detection and Classification . . . . . . . . 115
3.3 Outlier Detection Based on Semi-supervised Fuzzy Kernel
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.3.1 Correlation of Outlier Detection and Early Fault . . . . . . . . . . 128
3.3.2 Semi-supervised Fuzzy Kernel Clustering . . . . . . . . . . . . . . . 130
3.3.3 Semi-supervised Hypersphere-Based Fuzzy Kernel
Clustering Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.3.4 Transmission Early Fault Detection . . . . . . . . . . . . . . . . . . . . . 137
3.4 Semi-supervised SOM Neural Network Based Fault Diagnosis . . . . 150
3.4.1 Semi-supervised SOM Fault Diagnosis . . . . . . . . . . . . . . . . . . 150
3.4.2 Semi-supervised GNSOM Fault Diagnosis . . . . . . . . . . . . . . . 155
3.4.3 Semi-supervised DPSOM Fault Diagnosis . . . . . . . . . . . . . . . 156
3.4.4 Example Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
3.5 Relevance Vector Machine Diagnosis Method . . . . . . . . . . . . . . . . . . 165

3.5.1 Introduction to RVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165


3.5.2 RVM Classifier Construction Method . . . . . . . . . . . . . . . . . . . 166
3.5.3 Application of RVM in Fault Detection
and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
4 Manifold Learning Based Intelligent Fault Diagnosis
and Prognosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.1 Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.2 Spectral Clustering Manifold Based Fault Feature Selection . . . . . . 194
4.2.1 Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.2.2 Spectral Clustering Based Feature Selection . . . . . . . . . . . . . 196
4.2.3 DSTSVM Based Feature Extraction . . . . . . . . . . . . . . . . . . . . 208
4.2.4 Machinery Incipient Fault Diagnosis . . . . . . . . . . . . . . . . . . . . 214
4.3 LLE Based Fault Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.3.1 Local Linear Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.3.2 Classification Based on LLE . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.3.3 Dimension Reduction Performance Comparison
Between LLE and Other Manifold Methods . . . . . . . . . . . . . . 226
4.3.4 LLE Based Fault Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
4.3.5 VKLLE Based Bearing Health State Recognition . . . . . . . . . 235
4.4 Fault Classification Based on Distance Preserving Projection . . . . . 252
4.4.1 Locality Preserving Projections . . . . . . . . . . . . . . . . . . . . . . . . 252
4.4.2 NFDPP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4.4.3 Experiment Analysis for Engine Misfire . . . . . . . . . . . . . . . . . 258
4.4.4 Local and Global Spectral Regression Method . . . . . . . . . . . 262
4.4.5 Application of Method Based on Distance Preserving
Projections and Its Spectral Regression in Fault
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
5 Deep Learning Based Machinery Fault Diagnosis . . . . . . . . . . . . . . . . . 273
5.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.2 DBN Based Machinery Fault Diagnosis . . . . . . . . . . . . . . . . . . . . . . . 274
5.2.1 Deep Belief Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
5.2.2 DBN Based Vibration Signal Diagnosis . . . . . . . . . . . . . . . . . 284
5.2.3 DBN Based Fault Classification . . . . . . . . . . . . . . . . . . . . . . . . 297
5.3 CNN Based Fault Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
5.3.1 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 328
5.3.2 CNN Based Fault Diagnosis Method . . . . . . . . . . . . . . . . . . . . 331
5.3.3 Transmission Fault Diagnosis Under Variable Speed . . . . . . 349
5.4 Deep Learning Based Equipment Degradation State
Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
5.4.1 Stacked Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
5.4.2 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
5.4.3 DAE-LSTM Based Tool Degradation State
Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6 Phase Space Reconstruction Based Machinery System
Degradation Tracking and Fault Prognostics . . . . . . . . . . . . . . . . . . . . 371
6.1 Phase Space Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
6.1.1 Takens Embedding Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
6.1.2 Determination of Delay Time . . . . . . . . . . . . . . . . . . . . . . . . . . 373
6.1.3 Determination of Embedding Dimensions . . . . . . . . . . . . . . . 374
6.2 Recurrence Quantification Analysis Based Machinery
Fault Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
6.2.1 Phase Space Reconstruction Based RQA . . . . . . . . . . . . . . . . 375
6.2.2 RQA Based Multi-parameter Fault Recognition . . . . . . . . . . . 378
6.3 Kalman Filter Based Machinery Degradation Tracking . . . . . . . . . . . 384
6.3.1 Standard Deviation Based RQA Threshold Selection . . . . . . 384
6.3.2 Selection of Degradation Tracking Threshold . . . . . . . . . . . . 386
6.4 Improved RQA Based Degradation Tracking . . . . . . . . . . . . . . . . . . . 387
6.5 Kalman Filter Based Incipient Fault Prognostics . . . . . . . . . . . . . . . . 389
6.6 Particle Filter Based Machinery Fault Prognostics . . . . . . . . . . . . . . . 399
6.7 Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
6.8 Enhanced Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
6.9 Enhanced Particle Filter Based Machinery Components
Residual Useful Life Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
7 Complex Electro-Mechanical System Operational Reliability
Assessment and Health Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
7.1 Complex Electro-Mechanical System Operational Reliability
Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
7.1.1 Definitions of Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
7.1.2 Operational Reliability Assessment . . . . . . . . . . . . . . . . . . . . . 421
7.2 Reliability Assessment and Health Maintenance of Turbo
Generator Set in Power Plant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
7.2.1 Condition Monitoring and Vibration Signal
Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
7.2.2 Vibration Signal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.2.3 Operational Reliability Assessment and Health
Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
7.2.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
7.3 Reliability Assessment and Health Maintenance
of Compressor Gearbox in Steel Mill . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.3.1 Condition Monitoring and Vibration Signal
Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
7.3.2 Vibration Signal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
7.3.3 Operational Reliability Assessment and Health
Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
7.3.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
7.4 Aero-Engine Rotor Assembly Reliability Assessment
and Health Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
7.4.1 The Structure Characteristics of Aero-Engine Rotor . . . . . . . 456
7.4.2 Aero-Engine Rotor Assembly Reliability Assessment
Test System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
7.4.3 Experiment and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
7.4.4 In-Service Aero-Engine Rotor Assembly Reliability
Assessment and Health Maintenance . . . . . . . . . . . . . . . . . . . . 463
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
Chapter 1
Introduction

1.1 Intelligent Fault Diagnosis, Prognostics, and Health Assessment

Intelligent fault diagnosis, prognostics, and health assessment (IFDPHA) refers to the technology that leverages artificial intelligence and machine learning algorithms
the technology that leverages artificial intelligence and machine learning algorithms
to conduct an array of tasks for a complex electro-mechanical system, such as
health monitoring, fault diagnosis, and remaining useful life prediction. From the
perspective of artificial intelligence, intelligent fault diagnostics and prognostics of
the complex electro-mechanical system is a typical process of pattern recognition
in which machine learning algorithms learn pattern knowledge from historical data
and establish an end-to-end model for predicting results to guide the fault diagnosis,
prognostics, and health assessment.
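As a minimal illustration of this pattern-recognition view, the sketch below uses synthetic data and a simple nearest-centroid rule standing in for the SVM and deep models treated in later chapters; the feature values, class names, and classifier choice are all illustrative assumptions, not material from this book. A model learns class "patterns" from labeled feature vectors of historical data and then diagnoses new samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "historical data": 2-D feature vectors (e.g., RMS and kurtosis
# of a vibration signal) collected under two health conditions.
normal = rng.normal(loc=[1.0, 3.0], scale=0.1, size=(50, 2))
faulty = rng.normal(loc=[2.5, 6.0], scale=0.1, size=(50, 2))

# "Training" = learning pattern knowledge: one centroid per class.
centroids = {"normal": normal.mean(axis=0), "faulty": faulty.mean(axis=0)}

def diagnose(x):
    """Assign a new feature vector to the class with the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(diagnose(np.array([1.1, 3.2])))   # sample near the normal cluster
print(diagnose(np.array([2.4, 5.8])))   # sample near the faulty cluster
```

The same end-to-end shape — features in, health label out — carries over when the centroid rule is replaced by the SVM, semi-supervised, manifold, and deep learning methods of Chaps. 2–5.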
The main contents of intelligent fault diagnosis, prognostics, and health assessment include three aspects: (1) fault detection and identification (fault diagnosis); (2) fault severity assessment (health condition assessment); (3) remaining useful life prediction (prognosis). Fault detection and fault identification are the two major steps in fault diagnosis. Fault detection aims to judge whether a fault has occurred and to ensure that faults are detected in time, avoiding the serious consequences of fault evolution. Fault identification aims to identify and locate faulty components so that downtime and maintenance costs can be reduced. In contrast to fault detection and identification, health condition assessment focuses on quantitatively analyzing fault severity and the performance degradation of the complex electro-mechanical system. Usually, an incipient fault has little impact on the performance and operation of the system, so maintenance at this stage may waste maintenance resources and increase costs. However, when the fault evolves to an extent that may cause serious production accidents, maintenance should be scheduled and conducted in time. Based on the evaluation results of health condition assessment, fault prognosis
time. Based on the evaluation results of health condition assessment, fault prognosis
aims to forecast future states of the fault types, fault severity, and degradation trends
for the complex electro-mechanical system, and to predict the remaining useful life


of components or systems. IFDPHA is an extremely valuable asset for improving the overall maintenance and reliability of the complex electro-mechanical system, guiding predictive maintenance, extending remaining useful life, and minimizing unplanned downtime.
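To make the prognosis task concrete, here is a hedged sketch that fits a trend to a monitored health indicator and extrapolates it to a failure threshold; the linear degradation model, the indicator values, and the threshold are illustrative assumptions, not a method prescribed by this book:

```python
import numpy as np

# Assumed indicator history: degradation grows roughly linearly with
# operating hours. All values, including the threshold, are illustrative.
rng = np.random.default_rng(1)
hours = np.arange(0, 100, 10)
indicator = 1.0 + 0.02 * hours + rng.normal(0, 0.01, hours.size)
failure_threshold = 4.0

# Fit a first-order trend and solve for when it crosses the threshold.
slope, intercept = np.polyfit(hours, indicator, 1)
t_failure = (failure_threshold - intercept) / slope

# Remaining useful life = predicted failure time minus current time.
rul = t_failure - hours[-1]
print(f"predicted remaining useful life: {rul:.0f} h")
```

Real prognostics replaces the straight-line fit with the state-space and filtering models of Chap. 6, but the logic — track an indicator, project it forward, read off the time to threshold — is the same.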

1.2 The Significance of Complex Electro-Mechanical System Fault Diagnosis, Prognostics, and Health Assessment

Complex electro-mechanical systems epitomize the human pursuit of product performance and engineering design, and are an inevitable stage in the development of mechanical equipment. With the rapid development of information technology, computer science, and artificial intelligence, complex electro-mechanical systems have been endowed with richer connotations and more complex functions. Modern industrial production imposes strict requirements on the production process and product quality, so traditional mechanical systems are gradually being replaced by various complex electro-mechanical systems.
According to the definition in [1], a modern complex electro-mechanical system is a complex physical system that integrates mechanical, electrical, hydraulic, and optical processes. A variety of functional units are integrated into different electro-mechanical systems, such as aero engines, high-speed trains, precision machine tools, and modern production equipment, through the fusion and driving of information flows. Complex electro-mechanical systems usually consist of a large number of parts and components among which complex coupling relationships exist. Thus, it is necessary to consider both the independent behavior of each subsystem and the complex coupling relationships between subsystems. Owing to the complexity of functions, structures, and coupling relationships, and the diversity of physical processes, the fault diagnosis, prognostics, and health assessment of complex electro-mechanical systems faces great challenges.
Complex electro-mechanical systems inevitably trend toward larger scale, higher complexity, and higher precision. In industrial fields such as petrochemicals, metallurgy, electric power, and machinery, equipment typically operates in harsh and varying environments with high temperatures, high speeds, and heavy loads. Mechanical and electrical system failures are a major cause of catastrophic accidents. For example, in 2011, the drive chain of an escalator in the Beijing subway broke and the escalator ran in the wrong direction, resulting in a stampede of passengers. In 2012, transmission system faults occurred at wind power plants in both Hebei and Jilin Provinces. The National Transportation Safety Board (NTSB) report on the Harrison Ford plane crash, issued on August 06, 2015, indicated that a mechanical failure in the carburetor caused the plane's engine to lose power. Similar examples are too numerous to enumerate here. The failure of an electro-mechanical system may cause huge economic losses, environmental pollution, and

even casualties; therefore, it is urgent to conduct health condition monitoring and assessment for electro-mechanical systems in long-term service.
assessment for electro-mechanical systems in the long-term services.
The fault diagnosis, prognosis, and health assessment of complex electro-mechanical systems has become an important technical means of ensuring safe, stable, and reliable production, and has attracted growing attention from industry, academic institutions, and government departments. Technology for ensuring the reliability, safety, and maintainability of major products, major equipment, and key parts has been listed among the key technologies targeted for breakthroughs in both The National Medium- and Long-Term Plan for the Development of Science and Technology (2006–2020) and the Mechanical Engineering Disciplines Development Strategy Report (2011–2020). According to the China Intelligent Manufacturing Engineering Implementation Guide (2016–2020) released by the Ministry of Industry and Information Technology, developing online fault diagnosis and analysis methods based on big data is a crucial technology deserving greater attention, as it can help avoid catastrophic accidents, improve equipment utilization, shorten downtime and maintenance time, and ensure product quality.
In industrial production, the ultimate goal of mechanical maintenance is to keep machines operating safely and efficiently. The development of maintenance strategies has gone through four stages: (1) Corrective Maintenance; (2) Planned Maintenance; (3) Condition-based Maintenance; (4) Predictive Maintenance. Corrective maintenance refers to the strategy in which maintenance is carried out after an anomaly or failure is uncovered. This strategy may be cost-effective in regular operation, but once catastrophic faults occur, the costs of downtime and of repairs to bring the machines back to normal are extremely high. Planned maintenance follows a plan of action, created by the equipment manufacturer according to prescribed criteria, to replace and overhaul key parts of the equipment regularly. Compared with corrective maintenance, planned maintenance can reduce the risk of failure or performance degradation, but may increase production costs because still-serviceable parts are wastefully replaced. With the development of fault diagnosis technology, equipment maintenance is shifting from corrective or planned maintenance to condition-based maintenance, which is performed after one or more observed indicators show that equipment is going to fail or that its performance is deteriorating. Ideally, condition-based maintenance is scheduled only when necessary and, in the long term, drastically reduces maintenance costs and avoids the occurrence of serious faults, thereby minimizing system downtime and time spent on maintenance. Thanks to the development of prognosis techniques, a more powerful and advanced maintenance strategy, called predictive maintenance, can be achieved by continuously monitoring the condition parameters of the system, establishing a health assessment model, recognizing fault types, fault severity, and degradation trends, and predicting the remaining useful life of key components. Predictive maintenance can guide maintainers in selecting the appropriate maintenance strategy and scheduling the plan at optimized cost.
4 1 Introduction

Overall, fault diagnosis, prognosis, and health assessment are of great significance
for ensuring equipment reliability, improving system utilization, reducing maintenance
costs, and avoiding safety accidents when it comes to electro-mechanical systems
with extremely complex functions, structures, and coupling relationships.

1.3 The Contents of Complex Electro-Mechanical System
Fault Diagnosis, Prognostics, and Health Assessment
A complex electro-mechanical system works under complicated conditions in which
various physical processes are coupled with each other. It is generally hard to
establish a precise physical model for such a system, since not only must each
physical process be modeled, but the coupling relationships among these processes
must also be considered; such models are therefore ill-suited to fault diagnosis
and prognosis for complex systems. Fortunately, IFDPHA methods, which leverage
machine learning algorithms to learn pattern knowledge from historical data, show
powerful performance and solid advantages for establishing fault diagnosis and
prognostics models in complex systems engineering.
The IFDPHA methodology mainly includes four steps: (1) Signal acquisition; (2) Signal
processing; (3) Feature extraction and selection; (4) Fault diagnosis, prognosis, and
health assessment. Signal acquisition is to capture the measured signals related to
the health conditions of equipment, including vibration, pressure, speed, temper-
ature, acoustic signals, etc. Since the actual working environment of machines is
harsh, the collected signals are often noisy. Thus, signal processing is applied to
preprocess the collected signals to reduce or even eliminate the effect of
noise. After signal acquisition and processing, feature extraction and selection is
another crucial step for fault diagnosis and prognosis, where various features, such
as time-, frequency- and time–frequency domain features, are extracted from the
preprocessed signals using advanced signal processing techniques or algorithms.
Then discriminative features are selected to reduce feature redundancy, which may
otherwise degrade the final performance, and the selected features can
be regarded as comprehensive representations of the health condition of complex
electro-mechanical systems. Taking such features as inputs, fault diagnosis, prog-
nosis, and health assessment is to establish a classification or regression model which
can be trained with the historical data. With all the steps mentioned above, the moni-
toring signals can be input into the trained model to obtain the corresponding health
status information of the complex electro-mechanical system.
On the one hand, IFDPHA methods can be categorized into the following three
groups by considering whether labeled data and unlabeled data are available for
model training: supervised learning, unsupervised learning, and semi-supervised
learning. In the case of supervised learning, the samples annotated with labels are
1.3 The Contents of Complex Electro-Mechanical System Fault Diagnosis … 5

used to train the model. Supervised learning works well, and a model with high
accuracy and good generalization performance can be obtained, when the labeled
samples are sufficient. However, the trained model may overfit when the
labeled samples are insufficient. For unsupervised learning, there are only unlabeled
samples available for model training. Unsupervised learning methods learn useful
information about health conditions by analyzing the inherent similarity relationships
between samples; clustering algorithms, for example, divide samples into
clusters of similar ones. Generally, since there is no supervision information, the
performance of unsupervised models is often not good enough. At present, unsupervised learning is mainly used for anomaly detection. In semi-supervised learning,
sufficient unlabeled data and a few labeled data are assumed to be available for
model training, which has been proven to be an effective solution to improve the
performance of the model under the situation of lacking sufficient labeled samples.
All three kinds of algorithms can be used to develop IFDPHA methods, but they
are suitable for different application scenarios. Specifically, supervised learning
can obtain a model with satisfactory performance when the labeled samples are sufficient;
unsupervised learning is a good choice when labeled samples are
unavailable; and semi-supervised learning aims to address the
overfitting caused by the scarcity of labeled data.
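The semi-supervised idea can be illustrated with a minimal self-training sketch: a model fitted on a few labeled samples pseudo-labels the unlabeled pool and is then refit on both. The two-cluster data and the nearest-centroid base learner below are illustrative assumptions, not a method from this book.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian "health condition" clusters; only 2 samples per class are labeled.
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, (n, 2)), rng.normal(2.0, 1.0, (n, 2))])
y_true = np.array([0] * n + [1] * n)
labeled_idx = np.array([0, 1, n, n + 1])               # the few labeled samples
X_lab, y_lab = X[labeled_idx], y_true[labeled_idx]
X_unlab = np.delete(X, labeled_idx, axis=0)

def nearest_centroid_predict(centroids, X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Self-training loop: fit centroids on labeled data, pseudo-label the
# unlabeled pool, then refit using both labeled and pseudo-labeled samples.
centroids = np.vstack([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
for _ in range(5):
    pseudo = nearest_centroid_predict(centroids, X_unlab)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    centroids = np.vstack([X_all[y_all == c].mean(axis=0) for c in (0, 1)])

accuracy = np.mean(nearest_centroid_predict(centroids, X) == y_true)
```

Even with only four labeled samples, the pseudo-labeled pool pulls the class centroids toward the true cluster centers, which is the mechanism semi-supervised methods exploit when labeled data are scarce.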
On the other hand, IFDPHA methods can be categorized into the following two
groups by considering the architecture of the models: shallow machine learning-
based and deep learning-based. Shallow machine learning methods, such as Artifi-
cial Neural Networks (ANN), Support Vector Machines (SVM), Clustering Algo-
rithms (CA), Hidden Markov Models (HMM), Random Forest (RF), and Manifold
Learning, are widely applied for IFDPHA. These methods apply only one or two
nonlinear transformations to the input data, so the computational cost is small. In addition,
their simple structures involve few parameters, and good generalization performance
can be achieved even with few training samples. However, the feature
extraction ability of shallow structures is limited, so the input features must be
extracted and selected manually. In contrast, deep learning, a branch of machine
learning that has become increasingly popular since Hinton and Salakhutdinov [2]
used a greedy learning algorithm to solve the problem of gradient disappearance in
deep neural network training, attempts to abstract data at a higher level using multiple
processing layers (deep neural networks) consisting of complex structures or multiple
non-linear transformations. Features can be extracted and selected automatically by
deep learning algorithms, eliminating manual feature engineering. However, a deep
neural network with multiple layers has a large number of parameters to
optimize, so a large amount of fault data is required for training; otherwise,
overfitting may occur when the training samples are
insufficient. Deep learning methods also require much more training time
than shallow machine learning methods.
In conclusion, the contents of intelligent diagnosis, prognostics, and health assess-
ment of complex electro-mechanical systems are to use machine learning methods
to learn relevant knowledge from the data and establish the corresponding diagnosis
and prognostics model to evaluate the health status of equipment.

1.4 Overview of Intelligent Fault Diagnosis, Prognostics,
and Health Assessment (IFDPHA)

IFDPHA is of great significance for improving production efficiency and reducing
accident rates. Industry and academia have paid much attention to related methods and
application research, and have proposed a large number of IFDPHA methods. At present,
the research on IFDPHA of complex electro-mechanical systems mainly focuses on
the key components, such as gears and bearings. In this section, the research status
of intelligent diagnosis, prognosis, and health assessment is reviewed according to
shallow machine learning-based and deep learning-based methods, respectively.

1.4.1 Shallow Machine Learning-Based Methods

Many scholars have carried out much research on intelligent diagnosis, prognosis,
and health assessment based on shallow machine learning methods. Lei et al. [3]
extracted the same two features from multi-sensor signals measured at different
positions of a planetary gearbox: the root mean square value of the signal's
normal meshing components, and the normalized sum of all positive values of
the spectrum difference between the measured and healthy signals. The
adaptive neuro-fuzzy inference system was then used to fuse the above features and
diagnose the fault mode and fault degree of the planetary gearbox. Unal et al. [4]
used Envelope Analysis (EA), Hilbert Transform (HT), and Fast Fourier Transform
(FFT) to extract features from vibration signals as the input of the ANN for fault
diagnosis. The structure of the network was then optimized by the genetic algorithm
(GA). You et al. [5] proposed a wind turbine gearbox fault diagnosis method based
on ensemble empirical mode decomposition (EEMD) and Back Propagation (BP)
neural network. Wavelet transform was first adopted to denoise the collected vibration
signals, then EEMD was utilized to decompose the denoised signal and extract energy
characteristic parameters from the selected intrinsic mode function (IMF). These
features were finally normalized and fed into the BP neural network for gearbox fault
diagnosis. Chang et al. [6] proposed a fault diagnosis method for rotating machinery
based on shaft orbit analysis and fractal theory, which extracted the shaft orbit
from vibration signals and then used fractal theory to extract features as the input
of the BP neural network for fault diagnosis. Tian et al. [7] proposed a bearing fault
diagnosis method based on manifold dynamic time warping (DTW) by measuring
the similarity between the test and template samples. Compared with the traditional
DTW-based method, it replaced the Euclidean distance (ED)-based similarity with
manifold similarity.
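Since DTW underlies the method of Tian et al. [7], a minimal sketch of the classic DTW distance may help: it aligns two sequences of different lengths by warping the time axis. The sine/cosine test sequences are illustrative, not data from the cited work.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three admissible moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.sin(np.linspace(0, 2 * np.pi, 60))
stretched = np.sin(np.linspace(0, 2 * np.pi, 80))   # same shape, different speed
other = np.cos(np.linspace(0, 2 * np.pi, 60))       # genuinely different shape

d_same = dtw_distance(template, stretched)
d_diff = dtw_distance(template, other)
```

Because warping absorbs the speed difference, the stretched copy of the template stays much closer (in DTW distance) than the differently shaped sequence, which is what makes DTW a useful template-matching similarity for fault signatures.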
In addition, shallow machine learning methods are also widely used in signal
denoising, dimensionality reduction, and feature extraction. Widodo et al. [8] used
principal component analysis (PCA), independent component analysis (ICA), kernel
principal component analysis (KPCA), and kernel independent component analysis

(KICA) to extract features from acoustic emission signals and vibration signals,
and constructed the classifier with relevance vector machine (RVM) and support
vector machine (SVM) respectively. The above feature extractors and classifiers were
tested in the six-class bearing fault diagnosis task, and the diagnosis performances of
different combinations were compared. Zarei et al. [9] trained the neural network with
normal bearings data to establish a filter for removing non-bearing fault components
(RNFC). The filtered signal was subtracted from the original signal to remove the
non-bearing fault component, and then time-domain features were extracted from the
subtracted signal before they were fed into another neural network for bearing health
state recognition of the induction motor. Jiang et al. [10] extracted 29 commonly
used features from the vibration signal and selected useful features with Nuisance
Attribute Projection (NAP). An HMM was then used to process the selected features
for bearing degradation assessment. Yu [11] extracted 14 time-domain features and 5
time–frequency domain features from the vibration signals, and the feature dimension
was reduced by the PCA. A series of historical HMMs were established adaptively,
and the health state of the bearing was evaluated by the overlap rate of the historical
HMM and the current one.
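Dimensionality reduction with PCA, as used in several of the works above, can be sketched via the SVD of the mean-centered feature matrix. The correlated synthetic features below are an illustrative stand-in for redundant condition indicators extracted from the same signal.

```python
import numpy as np

rng = np.random.default_rng(2)

# 200 samples of 10 correlated "features": 2 latent factors plus small noise,
# mimicking redundant indicators that all reflect the same underlying state.
latent = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 10))
X = latent @ mixing + 0.05 * rng.standard_normal((200, 10))

# PCA by SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)          # variance ratio per component
X_reduced = Xc @ Vt[:2].T                    # project onto top 2 components
```

Because only two latent factors generated the data, the first two principal components capture nearly all of the variance, so the 10-dimensional feature vector can be compressed to 2 dimensions with little information loss.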
To improve the generalization ability of machine learning models, ensemble
learning (EL), which completes learning tasks by constructing multiple learning
machines, is also widely used in IFDPHA of complex electro-mechanical systems.
For example, Khazaee et al. [12] fused vibration and acoustic data according to
the Dempster–Shafer theory and proposed an effective fault diagnosis method for
planetary gearboxes based on EL. First, the vibration and acoustic signals were
transformed from the time domain to the time–frequency one by wavelet transform,
and the time–frequency features were extracted as the input of the neural network.
Then, two neural network classifiers were constructed, and the vibration and acoustic
features were fed into different neural networks respectively. Finally, the output of
the two neural networks was fused to obtain the final classification result. Wang
et al. [13] proposed a particle swarm optimization-based selective ensemble learning
(PSOSEN) for fault diagnosis of rotating machinery. First, time- and frequency-
domain features were extracted from vibration signals, and a series of probabilistic
neural networks (PNNs) were trained. Then, the adaptive particle swarm optimiza-
tion (APSO) algorithm was proposed to select the network suitable for fault diagnosis
from the above PNNs, and singular value decomposition (SVD) was utilized to obtain
the best-weighted vector from the output of these networks. The final diagnosis result
is the inner product of the output vectors of the PNNs and the best-weighted vector.
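The core benefit of ensemble learning, fusing several weak learners into a stronger one, can be seen in a small majority-voting simulation. The simulated base learners (independent, each 70% accurate) are an illustrative assumption, much simpler than the PNN ensembles and Dempster–Shafer fusion described above.

```python
import numpy as np

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 1000)            # ground-truth fault labels

def weak_classifier(y, accuracy, rng):
    """Simulate a base learner that is correct with the given probability."""
    flip = rng.random(y.size) > accuracy
    return np.where(flip, 1 - y, y)

# Five independent base learners, each only 70% accurate on its own.
preds = np.stack([weak_classifier(y_true, 0.7, rng) for _ in range(5)])

# Majority-vote fusion of the five outputs.
votes = preds.sum(axis=0)
fused = (votes >= 3).astype(int)

acc_single = np.mean(preds[0] == y_true)
acc_fused = np.mean(fused == y_true)
```

With independent errors, at least three of five 70%-accurate learners agree on the correct label about 84% of the time, so the fused decision is noticeably better than any single learner, which is the rationale behind ensemble diagnosis.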
Shallow machine learning methods have simple structures, requiring few trainable
parameters and only a small amount of computation. Therefore,
if the number of training samples and computing power is insufficient, IFDPHA
models with good accuracy and generalization ability can be quickly and effectively
established by shallow machine learning. However, due to its simple structures and
limited feature extraction capability, manual feature extraction and selection are
usually required.

1.4.2 Deep Learning-Based Methods

With the rise of deep learning techniques and the rapid development of computing
facilities, intelligent diagnosis, prognosis, and health assessment methods based
on deep learning are emerging in the field of fault diagnosis. For example, Shao
et al. [14] extracted time-domain features from vibration signals as input of the deep
neural network (DNN), which was trained by a particle swarm-based optimization
algorithm. Qi et al. [15] used empirical mode decomposition (EMD)
and an autoregressive model to extract features from vibration signals. The extracted
features were the input of a stacked sparse autoencoder network for fault diagnosis of
rotating machinery. Chen and Li [16] extracted time and frequency-domain features
from vibration signals collected by different sensors, and input these features into
different two-layer autoencoder networks for further feature extraction. Finally, the
outputs of all autoencoder networks were combined and lined up as the input of a
deep belief network for bearing fault diagnosis. Guo et al. [17] proposed a bearing
remaining useful life (RUL) prediction method based on a long short-term memory-
recurrent neural network (LSTM-RNN). Six proposed similarity-based features
were first combined with eight traditional time–frequency domain features to form the
original feature space, and then monotonicity and correlation measurements were used
for feature selection. Finally, the selected features were used as the input of the
LSTM-RNN to perform RUL prediction of the bearing.
It can be concluded that in early deep learning-based diagnosis models, the
input features are manually extracted, and the deep network is used as the classi-
fier to identify fault modes or evaluate health status. To make full use of the feature
learning ability of the deep learning algorithms and extract fault features automati-
cally, scholars have carried out further research. Heydarzadeh et al. [18] preprocessed
the collected vibration acceleration signals, torque signals, and acoustic signals by
wavelet transform, and used the preprocessed wavelet coefficients to train three
different DNNs for gear fault diagnosis. Experimental results showed that all three
kinds of signals can be used for the effective gear fault diagnosis with the proposed
method. Janssens et al. [19] proposed a fault diagnosis method for rotating machinery
based on a convolutional neural network (CNN), where the spectra of vibration
signals collected from two different locations were obtained by Fourier transforms
(FT). The two spectra were placed in the same two-dimensional matrix, which was the
input of the CNN for rotating machinery fault diagnosis. Zhang [20] used the sparse
autoencoder network to fuse the vibration signals collected by multiple sensors and
evaluated the operating status of the equipment using the squared prediction error (SPE).
Guo et al. [21] proposed a CNN-based method with adaptive learning rate adjustment
to identify the type and severity of bearing faults. The vibration signal was first input
into one convolutional neural network to identify the fault type and then into a second
one to identify the fault degree of the bearing. To solve the problem of bearing fault
diagnosis under load fluctuation and noise environment, Lu et al. [22] utilized the
powerful feature extraction capability of deep learning by using the original signal
directly as the input of the stacked denoising autoencoder (SDA). The extracted

features were then used as the input of the classifier to diagnose bearing faults, and
different feature extraction methods were compared and analyzed. Jing et al. [23]
proposed a fault diagnosis method for planetary gearboxes based on data fusion and
CNN. The standardized vibration, sound, current, and speed signals were used as the
input of the deep CNN, which extracted features, selected the degree of data fusion,
and realized the fault diagnosis of the planetary gearbox. Shao et al. [24] proposed a
deep autoencoder network for fault diagnosis of gearboxes and electric locomotive
bearings. The proposed method used raw data as the input of the deep autoencoder
network, where the key parameters were optimized by an artificial fish swarm algo-
rithm rather than manual adjustment, and a new loss function based on maximum
correntropy was proposed to avoid the negative impact of noise. Zhang et al. [25]
proposed a bearing fault diagnosis method combining a one-dimensional CNN and
EL, which realized an end-to-end mapping from the original data to the fault state
and avoided the uncertainty introduced by manual feature extraction.
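The building blocks of the 1D CNNs cited above, strided convolution, ReLU activation, and pooling, can be sketched in a few lines of numpy. The random filters and random "signal" are illustrative; a real network would learn its filters by backpropagation rather than sample them.

```python
import numpy as np

def conv1d(x, kernels, stride=2):
    """Valid-mode strided 1-D convolution: (channels_out, kernel_len) filters."""
    k = kernels.shape[1]
    windows = np.stack([x[i:i + k] for i in range(0, len(x) - k + 1, stride)])
    return windows @ kernels.T                  # (n_windows, channels_out)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(4)
signal = rng.standard_normal(1024)              # stand-in for a raw vibration snippet
kernels = rng.standard_normal((8, 16)) * 0.1    # 8 filters of length 16

feature_maps = relu(conv1d(signal, kernels))    # local feature detection
features = feature_maps.max(axis=0)             # global max pooling -> 8 values
```

Stacking several such convolution–activation–pooling stages is what lets the networks above consume raw signals directly and replace manual feature engineering with learned feature extraction.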
DL-based methods automatically extract and select appropriate features from the
original data for IFDPHA, which avoids the shortcomings of manual feature extraction
and enhances the intelligence of the methods. Effective fault diagnosis, prognosis,
and health assessment models, which hardly rely on expert knowledge and have low
requirements for users, can be established when the training samples are sufficient and
the computing power is strong enough. However, the complex structures and large-
scale training parameters of the DNNs usually require a large number of training
samples and training time. DL-based methods cannot meet practical needs when
the number of training samples is insufficient, since overfitting will occur.
Research on fault diagnosis, prognosis, and health assessment of complex electro-
mechanical systems has achieved great success in industry. In particular, for
data-driven intelligent diagnosis, various machine learning algorithms have
been widely used in fault diagnosis, prognosis, and health assessment of mechanical
systems, and research on deep learning continues to deepen.

1.5 Organization and Characteristics of the Book

From the perspective of machine learning, this book details the applications in fault
feature extraction and selection, incipient fault prediction, fault mode classifica-
tion, and performance degradation evaluation based on supervised learning, semi-
supervised learning, manifold learning, phase space reconstruction, and other related
algorithms. Besides, the applications of deep learning in intelligent prognosis and
health assessment are explored and analyzed, which is one of the current research
hotspots of machine learning. Focusing on IFDPHA, this book is divided into 7
chapters and structured as follows.
Chapter 2 introduces supervised SVM-based algorithms and their applications in
machinery fault diagnosis. To fully improve the generalization performance of SVM,
issues such as parameter optimization, feature selection, and ensemble-based
incremental methods are discussed. The effectiveness of the SVM-based algorithms
is validated in several fault diagnosis tasks on electric locomotive rolling bearings,
a Bently rotor, and a motor bearing testbench.
Chapter 3 discusses semi-supervised intelligent fault diagnosis methods. Consid-
ering the difficulty of obtaining labeled data for supervised learning and the inade-
quate generalization ability of traditional unsupervised learning, the idea of
semi-supervised learning is integrated into mature supervised and unsupervised learning
algorithms. Semi-supervised intelligent fault diagnosis methods, such as Kernel
Principal Component Analysis (KPCA), fuzzy kernel clustering algorithms, Self-
organizing Map (SOM) neural networks, and Relevance Vector Machines (RVM), are
introduced. These methods are validated in incipient fault diagnosis of transmissions
and bearings and have achieved successful results.
Chapter 4 addresses a variety of intelligent fault diagnosis and prognosis methods
based on manifold learning, including spectral clustering manifold-based fault
feature selection, locally linear embedding (LLE)-based fault recognition, and
distance-preserving projection-based fault classification. These methods are applied
to the fault diagnosis and prognosis of gearbox gears, rolling bearings, engines, etc.,
and the effectiveness is verified.
Chapter 5 mainly introduces four deep learning-based network models, including
convolutional neural network (CNN), deep belief network (DBN), stacked auto-
encoder (SAE), and recurrent neural network (RNN). With cases of automotive trans-
mission fault diagnosis and tool degradation assessment, this chapter gives detailed
descriptions of how to apply deep neural networks (DNNs) for machinery fault
diagnosis and equipment degradation assessment, and verifies the effectiveness of
DNNs.
Chapter 6 analyzes the application prospects of Recurrence Quantification Analysis
(RQA), the Kalman Filter (KF), the Particle Filter (PF), and other algorithms in fault
identification and prognosis based on phase space reconstruction theory. Furthermore,
this chapter introduces KF-based incipient fault prediction and enhanced PF-based
remaining useful life (RUL) prediction methods. Experiments show the great perfor-
mance of the proposed algorithms in multi-parameter identification of bearing faults,
degradation tracking, and RUL prediction of transmission systems.
Chapter 7 proposes an operation reliability assessment method that realizes
health monitoring-based reliability assessment and condition-based maintenance for
complex electro-mechanical systems, such as the turbine generator set, compressor
gearbox, and aero-engine rotor.

References

1. Zhong, J.: Coupling Design Theory and Methods of Complex Electromechanical Systems (in
Chinese). China Machine Press, Beijing (2007)
2. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks.
Science 313(5786), 504–507 (2006)
3. Lei, Y., Lin, J., He, Z., et al.: A method based on multi-sensor data fusion for fault detection
of planetary gearboxes. Sensors 12(2), 2005–2017 (2012)

4. Unal, M., Onat, M., Demetgul, M., et al.: Fault diagnosis of rolling bearings using a genetic
algorithm optimized neural network. Measurement 58, 187–196 (2014)
5. You, Z., Wang, N., Li, M., et al.: Method of fan fault diagnosis of gearbox based on EEMD
and BP neural network (in Chinese). J. Northeast Dianli Univ. 35(01), 64–72 (2015)
6. Chang, H.C., Lin, S.C., Kuo, C.C., et al.: Using neural network based on the shaft orbit
feature for online rotating machinery fault diagnosis. In: Proceeding of 2016 IEEE International
Conference on System Science and Engineering (ICSSE), 07–09 July 2016
7. Tian, Y., Wang, Z., Lu, C.: Self-adaptive bearing fault diagnosis based on permutation entropy
and manifold-based dynamic time warping. Mech. Syst. Signal Process. 114, 658–673 (2019)
8. Widodo, A., Kim, E.Y., Son, J.D., et al.: Fault diagnosis of low speed bearing based on relevance
vector machine and support vector machine. Expert Syst. Appl. 36(3), 7252–7261 (2009)
9. Zarei, J., Tajeddini, M.A., Karimi, H.R.: Vibration analysis for bearing fault detection and
classification using an intelligent filter. Mechatronics 24(2), 151–157 (2014)
10. Jiang, H., Chen, J., Dong, G.: Hidden Markov model and nuisance attribute projection based
bearing performance degradation assessment. Mech. Syst. Signal Process. 72, 184–205 (2016)
11. Yu, J.: Adaptive hidden Markov model-based online learning framework for bearing faulty
detection and performance degradation monitoring. Mech. Syst. Signal Process. 83, 149–162
(2017)
12. Khazaee, M., Ahmadi, H., Omid, M., et al.: Classifier fusion of vibration and acoustic signals
for fault diagnosis and classification of planetary gears based on Dempster–Shafer evidence
theory. Proc. Inst. Mech. Eng. Part E J. Process Mech. Eng. 228(1), 21–32 (2014)
13. Wang, Z.Y., Lu, C., Zhou, B.: Fault diagnosis for rotary machinery with selective ensemble
neural networks. Mech. Syst. Signal Process. 113, 112–130 (2018)
14. Shao, H., Jiang, H., Zhang, X., et al.: Rolling bearing fault diagnosis using an optimization
deep belief network. Meas. Sci. Technol. 26(11), 115002 (2015)
15. Qi, Y., Shen, C., Wang, D., et al.: Stacked sparse autoencoder-based deep network for fault
diagnosis of rotating machinery. IEEE Access 5, 15066–15079 (2017)
16. Chen, Z., Li, W.: Multisensor feature fusion for bearing fault diagnosis using sparse autoencoder
and deep belief network. IEEE Trans. Instrum. Meas. 66(7), 1693–1702 (2017)
17. Guo, L., Li, N., Lei, Y., et al.: A recurrent neural network based health indicator for remaining
useful life prediction of bearings. Neurocomputing 240, 98–109 (2017)
18. Heydarzadeh, M., Kia, S.H., Nourani, M., et al.: Gear fault diagnosis using discrete wavelet
transform and deep neural networks. In: Proceeding of 2016 42nd Annual Conference of the
IEEE Industrial Electronics Society (IECON), Florence, Italy, 23–26 Oct 2016
19. Janssens, O., Slavkovikj, V., Vervisch, B., et al.: Convolutional neural network based fault
detection for rotating machinery. J. Sound Vib. 377, 331–345 (2016)
20. Zhang, S.: Bearing condition dynamic monitoring based on multi-way sparse autocoder (in
Chinese). J. Vib. Shock 35(19), 125–131 (2016)
21. Guo, X., Chen, L., Shen, C.: Hierarchical adaptive deep convolution neural network and its
application to bearing fault diagnosis. Measurement 93, 490–502 (2016)
22. Lu, C., Wang, Z.Y., Qin, W.L., et al.: Fault diagnosis of rotary machinery components using a
stacked denoising autoencoder-based health state identification. Signal Process. 130, 377–388
(2017)
23. Jing, L., Wang, T., Zhao, M., et al.: An adaptive multi-sensor data fusion method based on
deep convolutional neural networks for fault diagnosis of planetary gearbox. Sensors 17(2),
414 (2017)
24. Shao, H., Jiang, H., Zhao, H., et al.: A novel deep autoencoder feature learning method for
rotating machinery fault diagnosis. Mech. Syst. Signal Process. 95, 187–204 (2017)
25. Zhang, W., Li, C., Peng, G., et al.: A deep convolutional neural network with new training
methods for bearing fault diagnosis under noisy environment and different working load. Mech.
Syst. Signal Process. 100, 439–453 (2018)
Chapter 2
Supervised SVM Based Intelligent Fault
Diagnosis Methods

2.1 The Theory of Supervised Learning

An important aspect of human intelligence is the ability to learn from examples: to
predict facts that cannot be directly observed by analyzing known facts and summarizing patterns
[1]. In this kind of learning, it is important to be able to draw inferences from one
case to another, that is, to use the rules learned from sample data not only to better
explain the known examples, but also to make correct predictions and judgments about
future or unobservable phenomena. This learning ability is usually called
generalization ability. With the advent of the information age, data and information
fill every corner of human production and life, but human beings are still very
limited in their ability to process and use these data, and have not yet reached
the stage of deeply mining the laws hidden in them.
In machine intelligence research, we hope to simulate this generalization ability
with a machine (computer), enabling it to discover the latent laws hidden in data by
studying sample data, and to make predictions about future data or unobservable
phenomena. Statistical reasoning theory infers a functional dependency from an
empirical sample of data and plays a fundamental role in solving machine
intelligence problems [1].
There are two kinds of statistical reasoning theories: one is traditional statistical
theory, which studies the asymptotic behavior of infinite samples, and the other is
statistical learning theory under the condition of finite samples. Traditional
statistical theory studies the statistical properties of large samples on the premise
of asymptotic theory, i.e., as the sample size tends to infinity; the form of the
parameters is assumed known, and empirical samples are used to estimate their values.
This is the most basic and commonly used analytical method in the absence of a
theoretical physical model. In practical applications, however, it is difficult to
satisfy the premise that the sample size tends to infinity. Therefore, traditional
statistical theory, being based on this asymptotic premise, can hardly achieve the
ideal effect when the sample data are limited.
Statistical learning theory was proposed by Vapnik of AT&T Bell Labs in the 1960s;
after three decades of painstaking research, it matured in the mid-1990s into a new
theory of statistics and learning methods for limited samples, making up for the
deficiency of traditional statistical theory.

© National Defense Industry Press 2023
W. Li et al., Intelligent Fault Diagnosis and Health Assessment for Complex
Electro-Mechanical Systems, https://doi.org/10.1007/978-981-99-3537-6_2

The statistical reasoning under this theoretical system
not only considers the requirement of asymptotic performance, but also seeks the
optimal solution under the currently available, limited information. Because it treats
the finite-sample case more systematically and is more practical than traditional
statistical theory, it has received wide attention from the machine learning community
worldwide [1]. The support vector machine, developed within this rigorous theoretical
framework, uses kernel methods and mathematical optimization to map real problems into
a high-dimensional feature space through nonlinear transformations; a linear decision
function constructed in that high-dimensional space realizes a nonlinear decision
function in the original space, skillfully sidestepping the dimensionality problem.
The structural risk minimization principle is then used to find a compromise between
approximation accuracy and the complexity of the approximating function, which
ensures generalization ability.
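The kernel idea can be demonstrated on the classic XOR problem, which no linear decision function in the original 2-D space can solve. The sketch below uses kernel ridge regression as a simplified stand-in for the SVM optimization; it is not the margin-maximizing SVM training itself, but it shows how a linear function in the RBF-induced feature space yields a nonlinear decision in the input space.

```python
import numpy as np

# The XOR problem: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

def rbf_kernel(A, B, gamma=2.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Kernel ridge "machine": a linear decision function in the implicit
# high-dimensional feature space induced by the RBF kernel.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-6 * np.eye(4), y)

def decision(Xnew):
    return rbf_kernel(Xnew, X) @ alpha

pred = np.sign(decision(X))
```

The decision function is linear in the kernel-induced feature space (it is a weighted sum of kernel evaluations), yet it separates the XOR labels perfectly in the original 2-D space, which is precisely the "dimension problem" the kernel trick sidesteps.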
Machine learning is not only a core research field of artificial intelligence, but also
one of the most active and promising areas in computing, and it plays an increasingly
important role in human production and life. In recent years, many countries in Europe
have devoted themselves to research on machine learning theory and applications, and
companies such as GE, Intel, IBM, Microsoft, and Boeing are also active in this field.
Supervised learning is a machine learning task that learns pattern recognition or
regression knowledge from labeled training data. This chapter mainly introduces
support vector machine-based intelligent diagnosis methods and their applications
under the supervised learning paradigm.

2.1.1 The General Model of Supervised Learning

A general model of a supervised learning problem consists of three components, as shown in Fig. 2.1: a data (instance) generator G, a target operator S, and a learning machine LM.
(1) Data (instance) generator G. The generator G is the source; it determines the environment in which the trainer and the learning machine work. The generator G generates random vectors x ∈ R^n independently and identically distributed according to some unknown (but fixed) probability distribution function F(x).

Fig. 2.1 A general model of learning from examples (the generator G produces x, the target operator S returns y, and the learning machine LM outputs an approximation ỹ)
(2) The target operator S (sometimes called the trainer operator, or simply the trainer). The target operator S returns an output value y for each input vector x according to the same fixed but unknown conditional distribution function F(y|x).
(3) Learning machine LM. The learning machine implements a set of functions f (x, α), α ∈ Λ (where α is a vector of real parameters and Λ is the set of such parameters), and for each input vector x produces an output value ỹ that approximates the value y generated by the target operator S.
The problem of supervised learning is to select, from the given set of functions f (x, α), α ∈ Λ, the function that best approximates the training objective y. This selection is based on a training set of l independent, identically distributed observations drawn from the joint distribution F(x, y) = F(x)F(y|x).

(x 1 , y1 ), (x 2 , y2 ), . . . , (x l , yl ) (2.1)

In the supervised learning process, the learning machine is trained on a series of observed pairs (x_i, y_i), i = 1, 2, . . . , l, and constructs an operator that predicts the trainer's response y_i on a specific vector x_i produced by the generator G. The goal of the learning machine is to construct an appropriate approximation: after training, the learning machine should return a value ỹ very close to the trainer's response y for any given x.
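As a concrete illustration of this model, the sketch below (a minimal example with an assumed linear target rule and a least-squares learning machine, not taken from the book) implements G, S, and LM in code:

```python
import random
import math

random.seed(0)

# Generator G: draws x from a fixed (here: uniform) distribution F(x).
def generator():
    return random.uniform(-1.0, 1.0)

# Target operator S: returns y for each x according to F(y|x)
# (here: a deterministic rule plus Gaussian noise; the rule is hypothetical).
def target(x):
    return 2.0 * x + 0.5 + random.gauss(0.0, 0.1)

# Learning machine LM: a parametric family f(x, alpha) = alpha[0]*x + alpha[1].
def f(x, alpha):
    return alpha[0] * x + alpha[1]

# Training set of l i.i.d. observations (x_i, y_i) produced by G and S.
l = 200
data = [(x, target(x)) for x in (generator() for _ in range(l))]

# LM selects alpha by least squares (one way to pick the best approximation).
sx = sum(x for x, _ in data); sy = sum(y for _, y in data)
sxx = sum(x * x for x, _ in data); sxy = sum(x * y for x, y in data)
a = (l * sxy - sx * sy) / (l * sxx - sx * sx)
b = (sy - a * sx) / l
print(round(a, 1), round(b, 1))  # close to the true rule (2.0, 0.5)
```

After training, f(x, (a, b)) returns a ỹ close to the trainer's response for any given x.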

2.1.2 Risk Minimization Problem

In order to obtain the best approximation to the trainer's response, the loss, or discrepancy, L(y, f (x, α)) between the trainer's response y and the response f (x, α) given by the learning machine for a given input x is measured. Consider the mathematical expectation of the loss:

R(α) = ∫ L(y, f (x, α)) dF(x, y) (2.2)

where R(α) is a risk functional, α is a vector of real numbers with α ∈ Λ, and Λ is a parameter set composed of real numbers. The goal of machine learning is to find the function f (x, α_0) that minimizes the risk functional R(α) (over the function set f (x, α), α ∈ Λ) when the joint probability distribution function F(x, y) is unknown and the training set (x_1, y_1), (x_2, y_2), . . . , (x_l, y_l) is known.
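Since F(x, y) is unknown, the risk functional of Eq. (2.2) cannot be computed directly; in practice it is approximated by the empirical risk, the average loss over the training set. A minimal sketch (squared loss, hypothetical data) illustrates this:

```python
import random

random.seed(1)

def f(x, alpha):          # the learning machine's function family
    return alpha * x

def loss(y, y_hat):       # squared loss L(y, f(x, alpha))
    return (y - y_hat) ** 2

# Training set drawn from an unknown F(x, y); here y = 3x + noise (assumed).
data = [(x, 3.0 * x + random.gauss(0, 0.1))
        for x in (random.uniform(0, 1) for _ in range(500))]

def empirical_risk(alpha):
    # R_emp(alpha) = (1/l) * sum of L(y_i, f(x_i, alpha)) over the sample
    return sum(loss(y, f(x, alpha)) for x, y in data) / len(data)

# The empirical risk is smallest near the parameter of the true rule.
risks = {alpha: empirical_risk(alpha) for alpha in (1.0, 2.0, 3.0, 4.0)}
best = min(risks, key=risks.get)
print(best)  # 3.0
```

Minimizing the empirical risk over α is the practical surrogate for minimizing R(α) itself.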

2.1.3 Primary Learning Problem

There are three main learning problems in supervised learning: pattern recognition, regression estimation, and probability density estimation, which are formulated as follows.

2.1.3.1 Pattern Recognition

Let the output y of the trainer take only two values, y ∈ {0, 1}, and let f (x, α), α ∈ Λ be a set of indicator functions (functions taking only the values 0 or 1). Consider the following loss function:

L(y, f (x, α)) = { 0, if y = f (x, α); 1, if y ≠ f (x, α) } (2.3)

For this loss function, the functional in Eq. (2.2) gives the probability that the answer of the indicator function f (x, α) differs from that of the trainer; such a disagreement is called a classification error. The learning problem thus becomes finding the function that minimizes the probability of classification error when the probability measure F(x, y) is unknown but the data of Eq. (2.1) are known.

2.1.3.2 Regression Estimation

Let the output y of the trainer be a real value, and let f (x, α), α ∈ Λ be a set of real functions containing the regression function

f (x, α_0) = ∫ y dF(y|x) (2.4)

The regression function is the function that minimizes the functional Eq. (2.2)
under the loss function

L(y, f (x, α)) = (y − f (x, α))² (2.5)

In this way, the problem of regression estimation is to minimize the risk functional
Eq. (2.2) that uses Eq. (2.5) as the loss function when the probability measure F(x, y)
is unknown but the data Eq. (2.1) is known.

2.1.3.3 Density Estimation

For the problem of estimating the density function from the set p(x, α), α ∈ Λ of
density functions, consider the following loss function:

L( p(x, α)) = − log p(x, α) (2.6)

Therefore, the problem of estimating the density function from data is to minimize
the risk functional (2.2) by using Eq. (2.6) as the loss function when the corresponding
probability measure F(x) is unknown and independent identically distributed data
(2.1) is given.
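The three loss functions of Eqs. (2.3), (2.5), and (2.6) can be made concrete with a few lines of code (the input values are illustrative only):

```python
import math

# 0-1 loss for pattern recognition, Eq. (2.3)
def loss_01(y, y_hat):
    return 0 if y == y_hat else 1

# Squared loss for regression estimation, Eq. (2.5)
def loss_sq(y, y_hat):
    return (y - y_hat) ** 2

# Negative log-likelihood loss for density estimation, Eq. (2.6)
def loss_density(p_x):
    return -math.log(p_x)

print(loss_01(1, 0))                 # 1 (a classification error)
print(loss_sq(2.0, 1.5))             # 0.25
print(round(loss_density(0.5), 4))   # 0.6931 (= ln 2)
```

In each case, the corresponding learning problem is to minimize the risk functional of Eq. (2.2) built from that loss.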

2.2 Support Vector Machine

Statistical learning theory is a theory of statistical estimation and prediction from finite samples. It adopts the principle of structural risk minimization, which trades off the empirical risk against the confidence range so as to minimize the actual risk. How to construct learning machines that realize the structural risk minimization principle, however, is one of the key problems in statistical learning theory.
The Support Vector Machine (SVM) [2] is a powerful tool developed within the framework of statistical learning theory to realize the principle of structural risk minimization. It does so mainly by keeping the empirical risk fixed and minimizing the confidence range, which makes it suitable for small-sample learning. In 2001, the American journal Science described the support vector machine as "a very popular method and successful example in the field of machine learning, and a very impressive development direction" [3]. The support vector machine integrates techniques such as the maximum-margin hyperplane, Mercer kernels, convex quadratic programming, sparse solutions, and slack variables, and has the following remarkable characteristics:
(1) Based on statistical learning theory and the inductive principle of structural risk minimization, the support vector machine seeks a learning machine that minimizes the sum of the empirical risk and the confidence range. Compared with traditional learning machines based on the inductive principle of empirical risk minimization, the support vector machine is better adapted to small samples and can obtain the global optimal solution under limited information, rather than the optimal solution only as the sample size goes to infinity.
(2) Through duality theory, the support vector machine algorithm transforms learning into a convex quadratic optimization problem, which guarantees that the global optimal solution is obtained and overcomes the tendency of neural networks and other methods to fall into local extrema.

(3) The support vector machine uses the kernel method to implicitly map the training data from the original space into a high-dimensional space and constructs a linear discriminant function there to realize a nonlinear discriminant function in the original space. The complexity of the algorithm is independent of the sample dimension, which solves the dimensionality problem and avoids the traditional curse of dimensionality.
(4) The complexity of the machine obtained by the SVM depends on the number of support vectors rather than on the dimension of the transformed space, so the phenomenon of "over-learning" (overfitting) is avoided.

2.2.1 Linear Support Vector Machine

The support vector machine was developed from the optimal classification surface in the linearly separable case; the basic idea can be illustrated in the two-dimensional plane shown below. In Fig. 2.2, the stars and dots represent two classes of samples, the thick solid line in the middle is the optimal classification hyperplane, and the two adjacent lines, parallel to it, pass through the samples closest to it; the distance between them is the classification margin. The optimal classification hyperplane is required not only to separate the two classes correctly (i.e., with a training error rate of 0) but also to maximize the classification margin. Maximizing the separation between categories is in fact a means of controlling generalization, which is one of the core ideas of the support vector machine. The linear support vector machine applies only to linearly separable samples; for the many practical problems that are not linearly separable, a way must be found to transform the original linearly inseparable problem into a simpler linearly separable one. See [2] for the algorithm.

Fig. 2.2 The optimal classification hyperplane that separates data at maximum intervals
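The margin-maximization idea can be sketched numerically (a toy illustration, not the SVM quadratic program itself): for a separating hyperplane w · x + b = 0, the geometric margin is min_i y_i (w · x_i + b)/||w||, and the optimal hyperplane among a set of candidates is the one that maximizes it. The data points and candidate hyperplanes below are hypothetical.

```python
import math

# Toy linearly separable 2-D samples: class +1 ("stars") and class -1 ("dots").
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((2.5, 3.5), 1),
           ((0.0, 0.0), -1), ((1.0, 0.5), -1), ((0.5, 1.5), -1)]

def geometric_margin(w, b):
    # Smallest signed distance y_i * (w . x_i + b) / ||w|| over all samples;
    # it is positive only if the hyperplane separates the classes correctly.
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in samples)

# A few hand-picked candidate separating hyperplanes (w, b).
candidates = [((1.0, 1.0), -3.5), ((1.0, 0.0), -1.5), ((0.0, 1.0), -1.75)]
best = max(candidates, key=lambda wb: geometric_margin(*wb))
print(best, round(geometric_margin(*best), 3))  # ((1.0, 0.0), -1.5) 0.5
```

The full SVM searches all possible (w, b) for the maximum margin via convex quadratic programming rather than comparing a finite candidate list.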

2.2.2 Nonlinear Support Vector Machine

The basic idea of the nonlinear support vector machine is to map the input variables x into a high-dimensional space through a nonlinear transformation and then find the optimal classification surface in that transformed space. Such a transformation is in general complex and difficult to realize directly. Note, however, that the problem of finding the optimal classification surface involves only inner-product operations between training samples; that is, only inner products are needed in the high-dimensional space, and these inner products can be computed by functions of the original-space inputs, without even knowing the form of the transformation. According to the theory of functionals, as long as a kernel function satisfies the Mercer condition, it corresponds to an inner product in some transformed space. See [4] for the algorithm.

2.2.3 Kernel Function

The basic principles of kernel space theory are shown in Fig. 2.3. For a classification problem P, let X stand for the set of classification samples, X ∈ R, where R is called the input space or measurement space. In this space, P is a nonlinear, or linearly inseparable, problem (as shown in Fig. 2.3a). By finding an appropriate nonlinear mapping function φ(x), the sample set X in the input space can be mapped to a high-dimensional space F in which the classification problem P becomes linearly separable (as shown in Fig. 2.3b). In essence this is the same as the optimal classification hyperplane of the support vector machine (Fig. 2.2). F is called the feature space and can have arbitrarily large, even infinite, dimension.
Using kernels is an attractive computational approach: the feature space is defined implicitly through the kernel function, so the explicit computation of the feature map can be avoided both when evaluating inner products and when designing the support vector machine. Using different kernel functions, and hence their different corresponding Hilbert spaces, is equivalent to using different criteria to evaluate the similarity of data samples.

Fig. 2.3 Basic principles of kernel space theory
To construct different types of support vector machines, one needs different kernels that satisfy Mercer's theorem; it is therefore very important to construct kernel functions that reflect the properties of the approximation function. An important feature of the support vector algorithm is attributable to the Mercer condition on the kernel, which makes the corresponding optimization problem convex and thus ensures that the optimal solution is global.
First, consider a finite input space X = {x_1, . . . , x_n}, and let K(x, z) be a symmetric function on X. Consider the matrix K = (K(x_i, x_j))_{i,j=1}^n. Since K is symmetric, there exists an orthogonal matrix V such that K = V Λ V', where Λ is the diagonal matrix of eigenvalues λ_t of K, whose corresponding eigenvectors v_t = (v_ti)_{i=1}^n are the columns of V. Now, assuming that all eigenvalues are non-negative, consider the feature map

φ : x_i → (√λ_t v_ti)_{t=1}^n ∈ R^n, i = 1, . . . , n (2.7)

Then

⟨φ(x_i) · φ(x_j)⟩ = Σ_{t=1}^n λ_t v_ti v_tj = (V Λ V')_ij = K_ij = K(x_i, x_j) (2.8)

This means that K(x, z) is indeed the kernel corresponding to the feature map φ. The requirement that the eigenvalues of K be non-negative is necessary: if there were a negative eigenvalue λ_s with eigenvector v_s, then the point in the feature space

z = Σ_{i=1}^n v_si φ(x_i) = √Λ V' v_s (2.9)

would have squared norm

||z||² = ⟨z · z⟩ = v_s' V √Λ √Λ V' v_s = v_s' V Λ V' v_s = v_s' K v_s = λ_s < 0 (2.10)

which contradicts the geometric properties of the space. This leads to the following Mercer theorem, which characterizes when a function is a kernel function.
Mercer's theorem: Let X be a finite input space and K(x, z) a symmetric function on X. Then K(x, z) is a kernel if and only if the matrix

K = (K(x_i, x_j))_{i,j=1}^n (2.11)

is positive semidefinite (i.e., its eigenvalues are non-negative).
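As a quick numeric check of the Mercer condition (an illustrative sketch, not part of the original text), the eigenvalues of a 2 × 2 Gram matrix of the Gaussian kernel can be computed from its trace and determinant and verified to be non-negative:

```python
import math

def gauss_kernel(x, z, sigma=1.0):
    # Gaussian radial basis kernel on 1-D points
    return math.exp(-(x - z) ** 2 / (2 * sigma ** 2))

# Gram matrix K = (K(x_i, x_j)) on two sample points.
x1, x2 = 0.0, 1.0
K = [[gauss_kernel(x1, x1), gauss_kernel(x1, x2)],
     [gauss_kernel(x2, x1), gauss_kernel(x2, x2)]]

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
tr = K[0][0] + K[1][1]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
disc = math.sqrt(tr * tr / 4 - det)
eig = (tr / 2 - disc, tr / 2 + disc)
print(all(l >= 0 for l in eig))  # True: the Gram matrix is PSD
```

Any choice of sample points gives the same conclusion for this kernel, which is consistent with its satisfying the Mercer condition.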


We can generalize the inner product in the Hilbert space by introducing a weight λ_i for each feature, obtaining

⟨φ(x) · φ(z)⟩ = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(z) = K(x, z) (2.12)

so that the feature vector becomes

φ(x) = (φ_1(x), φ_2(x), . . . , φ_n(x), . . .) (2.13)

Mercer's theorem gives an admissible representation of a continuous symmetric function K(x, z):

K(x, z) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(z) (2.14)

where the λ_i are non-negative. This is equivalent to K(x, z) being an inner product in the feature space F ⊇ φ(x), where F is the l2 space of all sequences

ψ = (ψ_1, ψ_2, . . . , ψ_i, . . .) (2.15)

for which Σ_{i=1}^∞ λ_i ψ_i² < ∞.
This implicitly determines a space defined by the feature vectors, and the decision function of the support vector machine is expressed as

f(x) = Σ_{j=1}^l α_j y_j K(x, x_j) + b (2.16)

The four most common kernels are:

(1) Polynomial kernel function

K(x_i, x_j) = [(x_i · x_j) + R]^d (2.17)

where R is a constant and d is the order of the polynomial.

(2) Exponential radial basis function

K(x_i, x_j) = exp(−||x_i − x_j|| / (2σ²)) (2.18)

where σ is the width of the exponential radial basis kernel function.

(3) Gaussian radial basis function

K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)) (2.19)

where σ is the width of the Gaussian radial basis kernel function.

(4) Sigmoid kernel function

K(x_i, x_j) = tanh[υ(x_i · x_j) + θ] (2.20)

where υ > 0 and θ < 0 are the sigmoid kernel function parameters.


Vapnik et al. argue that in a support vector machine, simply changing the kernel changes the type of learning machine (i.e., the type of approximation function) [1], so the kernel is a key factor in the generalization performance of the support vector machine. By using an appropriate kernel function, the training data can be implicitly mapped into a high-dimensional space, where the underlying regularities of the data can be found. So far, however, the roles and applicable scopes of the various kernel functions have been neither rigorously established nor generally agreed upon.
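The four kernels can be implemented directly, and the kernel trick can be verified for the polynomial kernel with R = 1 and d = 2, whose explicit feature map in R² is known in closed form (a pure-Python sketch; the parameter defaults are illustrative):

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, R=1.0, d=2):                 # Eq. (2.17)
    return (dot(x, z) + R) ** d

def exp_rbf_kernel(x, z, sigma=1.0):               # Eq. (2.18)
    return math.exp(-math.dist(x, z) / (2 * sigma ** 2))

def gauss_rbf_kernel(x, z, sigma=1.0):             # Eq. (2.19)
    return math.exp(-math.dist(x, z) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, z, v=1.0, theta=-1.0):       # Eq. (2.20)
    return math.tanh(v * dot(x, z) + theta)

# Kernel trick check for the polynomial kernel with R = 1, d = 2 in R^2:
# (x . z + 1)^2 equals the inner product under the explicit feature map
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1).
def phi(x):
    r2 = math.sqrt(2.0)
    return (x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0)

x, z = (1.0, 2.0), (3.0, -1.0)
assert abs(poly_kernel(x, z) - dot(phi(x), phi(z))) < 1e-9
print(poly_kernel(x, z))  # 4.0, computed without ever forming phi explicitly
```

The check makes the implicit-mapping claim of this section concrete: the kernel evaluates the 6-dimensional inner product using only original-space quantities.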

2.2.4 The Applications of SVM in Machinery Fault Diagnosis

The fault diagnosis of mechanical equipment is a typical small-sample problem, and traditional machine learning methods cannot achieve good generalization performance on small samples; the lack of sufficient learning samples has therefore always been a bottleneck in mechanical intelligent fault diagnosis. Statistical learning theory studies the statistical laws of, and learning methods for, limited sample data, making up for the deficiencies of traditional statistical theory, and the support vector machine method developed under this rigorous theoretical system is well suited to small-sample fault diagnosis. At present, support vector machines are widely used in mechanical condition monitoring and fault diagnosis of bearings, gears, motors, and so on, as shown in Table 2.1.
Table 2.1 Application of support vector machine in mechanical fault diagnosis

Diagnostic object | Equipment status | Method | Author
Bearing | Normal, outer ring fault, inner ring fault | Empirical mode decomposition + support vector machine | Yang et al. [5]
Gear | Normal, cracked, broken teeth | Empirical mode decomposition + support vector machine | Cheng et al. [6]
Gear | Normal, crack, broken tooth, tooth surface wear | Morlet wavelet + support vector machine | Saravanan et al. [7]
Motor | Rotor bar broken, rotor cage end ring broken, coil short circuit, stator winding coil short circuit | Welch method + support vector machine | Poyhonen et al. [8]
Pump | Structural resonance, rotor radial contact friction, rotor axial contact friction, shaft crack, gear breakage, bearing breakage, blade breakage, rotor eccentricity, shaft bending, main body connection loosening, bearing loosening, rotor partial loosening, air pressure pulsation, cavitation | Principal component analysis + support vector machine | Chu and Yuan [9]
Aero-engine | Bearing failure, oil failure, reducer failure | Support vector machine | Sun et al. [10]
Diesel engine | Normal, fuel injector nozzle enlarged, fuel injector nozzle plugged, fuel pump plugged, fuel system leakage, fuel mixed with impurities | Support vector machine | Zhu and Liu [11]

Although the support vector machine has made great progress in mechanical fault diagnosis applications, it has not yet achieved the ideal effect in practice:
(1) Optimization of support vector machine parameters. Parameter optimization, also known as parameter selection, has been a key factor restricting the support vector machine from delivering its good generalization performance in engineering practice, since the parameter values directly determine that performance. It is therefore a key problem to study the mechanism by which the support vector machine parameters affect generalization performance and to propose an effective parameter optimization method for mechanical fault diagnosis applications.
(2) The fault feature selection problem. In engineering practice, once the kernel and the algorithm structure of the support vector machine are fixed, the two key factors that determine whether it achieves good generalization performance are how to select the optimal parameters and how to select the features, relevant to the sample attributes, used for learning. Most current research analyzes and solves these two factors, feature selection and parameter optimization, separately, without considering their joint influence on generalization performance; one is inevitably sacrificed for the other, which prevents the support vector machine from fully realizing its generalization performance. Therefore, how to optimize feature selection and parameters simultaneously, improving generalization ability by obtaining mutually matched optimal features and parameters, is another problem the support vector machine faces in mechanical fault diagnosis applications.
(3) The improvement of the support vector machine algorithm structure. Establishing a support vector machine intelligent fault diagnosis model requires an algorithm structure with good generalization performance, that is, the ability to predict future samples and unknown events. The generalization performance of the support vector machine classifier (or predictor) depends mainly on the availability of limited samples and on how adequately the algorithm mines the sample information. However, collecting representative training samples in engineering practice is expensive and time-consuming, and it is not easy to accumulate a sufficient number of training samples within a given period. How to fully mine and exploit the knowledge and laws contained in limited sample information, improving the generalization ability of the support vector machine through both its theory and its algorithm construction, is a problem that needs further study in mechanical fault diagnosis applications.
Mechanical fault diagnosis technology can improve the reliability and uptime of mechanical products, but currently available fault diagnosis techniques usually require extensive professional knowledge and accumulated experience. In practical applications, technicians on the production line without professional training or domain expertise find it difficult to analyze the monitored data, and even more difficult to make accurate diagnoses. In addition, because of the heavy noise interference on the industrial site, the fault conclusions obtained by conventional diagnosis methods are sometimes insufficient, so intelligent diagnosis technology needs to be developed.

2.3 The Parameters Optimization Method for SVM

The performance of the support vector machine refers mainly to its ability, established by learning the attributes of the training samples, to predict unknown samples, usually called generalization ability. The parameter values of the support vector machine profoundly affect this generalization performance. In 2002, Chapelle, Vapnik, et al. published an article in the internationally renowned academic journal Machine Learning describing the important influence of the support vector machine parameters (the penalty factor C and the kernel parameters) [12]: the penalty factor C determines the compromise between maximizing the margin and minimizing the error, while the kernel parameters (such as the order d of the polynomial kernel, the exponential radial basis kernel width σ, the Gaussian radial basis kernel width σ, and the sigmoid kernel parameters υ and θ) determine the characteristics of the nonlinear mapping and the distribution of the samples from the input space to the high-dimensional feature space. If the selected parameter values are inappropriate, the predictive performance of a binary support vector machine can even fall below the 50% error probability of random guessing. To achieve good generalization performance, parameter optimization is therefore an indispensable step; it has been a focus of scholars at home and abroad, and also a key factor restricting the excellent generalization performance of the support vector machine in engineering practice.
The support vector machine parameter optimization methods proposed at home and abroad can be classified into four categories: the trial-and-error procedure, the cross-validation method, generalization error estimation with gradient descent, and artificial intelligence and evolutionary algorithms. The trial-and-error procedure, also known as the "cut-and-try" method, is used when the user has no prior knowledge or relies on little experience: a limited number of parameter values are tested, and the parameters with the minimum test error are retained as the optimal ones. Although the trial-and-error procedure is widely used in practice because of its simplicity, the parameter space is explored insufficiently, so the selected "optimal" parameters are neither rigorous nor convincing. Another commonly used method is cross-validation, which divides the sample data set into k equal subsets, uses k − 1 subsets for training and the remaining subset for testing, and takes the mean of the resulting k test errors as the test error under the current parameter combination; the parameters are then adjusted and the steps repeated, and when a satisfactory test error is obtained, the corresponding parameter combination is taken as the optimum. The cross-validation method is evidently laborious and time-consuming, and can search only a limited region of the parameter space. Generalization error estimation with gradient descent uses an error-bound estimate of the generalization performance together with gradient descent to search the parameter space for the optimal parameters; however, since estimating the generalization error bound of a support vector machine is a complex mathematical problem and the gradient of the error bound with respect to the parameters must be computable, this method is not commonly used in practical engineering applications.
Currently, the optimization of support vector machine parameters is strongly supported by artificial intelligence techniques and evolutionary algorithms. The process of natural evolution is in essence an optimization process of survival of the fittest, following wonderful laws and rules, and for thousands of years it has inspired human beings to learn from nature, to simulate natural and biological evolution, and to turn what they learned into inventions and practical activities (as shown in Table 2.2).

Table 2.2 Biological revelation and human invention

Creature | Invention
Bird | The plane
Bats | Sonar and radar
Dragonflies | Helicopters
Cobwebs | Fishing nets
Spider silk | New fibers
Fireflies | Artificial luminescence
Flies | Smell detector
The tail fin of a fish | Rudder
Pectoral fins of fishes | Oars
Frogs | Electronic frog eyes
Jellyfish | Storm forecasters
Flying squirrels | Parachutes
Shells | Tank
Butterflies | Satellite temperature control system

Scholars at home and abroad have proposed various intelligent and evolutionary algorithms for support vector machine parameter optimization, such as the covariance matrix adaptive evolution strategy, genetic algorithms, artificial immune algorithms, and particle swarm optimization.
This section focuses on the support vector machine parameter optimization problem. Based on the ant colony optimization algorithm, the mechanism by which the parameters influence the support vector machine is analyzed, and an ant colony optimization algorithm for optimizing the support vector machine parameters is proposed. Then, using international standard databases, the mechanism by which the parameters of the ant colony optimization algorithm affect the parameter optimization process is analyzed, and the feasibility and effectiveness of the proposed method are verified by comparison with other existing methods. Finally, the proposed ant colony optimization based parameter optimization method is applied to the analysis of an electric locomotive bearing fault case. This bearing fault diagnosis method does not rely on the accurate extraction of the bearing fault characteristic frequencies; instead, it simply extracts time-domain and frequency-domain statistical features of the signal and then applies the proposed parameter optimization method for fault pattern recognition, so as to highlight the effectiveness of the parameter optimization. The results show that this method can improve the generalization performance of the support vector machine and successfully identify the common single-fault modes of electric locomotive bearings.
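Extracting time-domain statistical features of a signal, as mentioned above, can be sketched as follows (the exact feature set used in the case study is not specified here, so the chosen statistics are assumptions, and the test signal is synthetic):

```python
import math

def time_domain_features(signal):
    """A few common time-domain statistics of a vibration signal."""
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)
    peak = max(abs(x) for x in signal)
    return {
        "mean": mean,
        "rms": rms,
        "peak": peak,
        "kurtosis": (sum((x - mean) ** 4 for x in signal) / n) / std ** 4,
        "crest_factor": peak / rms,
    }

# Synthetic signal: a sine carrier plus periodic impulses, loosely
# imitating a localized bearing fault (illustrative only).
fs = 1000
sig = [math.sin(2 * math.pi * 50 * t / fs) + (5.0 if t % 100 == 0 else 0.0)
       for t in range(fs)]
feats = time_domain_features(sig)
print(sorted(feats))  # feature names, ready to assemble into a vector
```

Impulsive fault signatures raise the kurtosis and crest factor well above their values for a smooth signal, which is why such statistics are useful inputs to a classifier even without extracting fault characteristic frequencies.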

2.3.1 Ant Colony Optimization

Dorigo et al. proposed the ant colony optimization algorithm in 1991 [13] and elaborated its model and theory in internationally famous journals such as Nature, laying a solid foundation for the construction of the theoretical system of the ant colony optimization algorithm. Aiming at the support vector machine parameter optimization problem, this section expounds the algorithm design of the ant colony optimization algorithm and analyzes examples; for the detailed derivation of the algorithm, see [13].

2.3.1.1 Algorithm Design

For continuous domain optimization problems, ANT colony optimization algorithms


usually need to discretize the continuous domain first, and the artificial ants can move
freely on the abstract points for the convenience of computer. Ant colony optimization
algorithm flow as shown in Fig. 2.4, mainly includes the following five key steps:

(1) Variable initialization: set the number of ants in the colony, the pheromone concentration at the initial moment, the pheromone intensity, and the initial values of the other variables.
(2) Discretization of the continuous domain: the value range x_i^lower ≤ x_i ≤ x_i^upper (i = 1, 2, . . . , n) of each variable to be optimized is discretized into N equal parts, so the interval between adjacent discrete points is

h_i = (x_i^upper − x_i^lower) / N, i = 1, 2, . . . , n (2.21)

where x_i^upper is the upper bound of the value range of variable x_i, x_i^lower is its lower bound, and n is the number of variables x_i.
(3) Pheromone construction and management: let τ_ij be the pheromone concentration on node (i, j), with a constant initial value so that each node has an equal probability of being selected by the ants at the initial moment. In the subsequent search, after all ants have completed one traversal, the pheromone on each node is modified according to the pheromone update equation

τ_ij^new = (1 − ρ) τ_ij^old + Q / e^{f_ij} (2.22)

where τ_ij^new is the current pheromone concentration on node (i, j); ρ is the pheromone volatility coefficient; τ_ij^old is the previous pheromone concentration on node (i, j); Q is the pheromone intensity, usually a constant; e is the mathematical constant; and f_ij is the value of the objective function f on node (i, j).
(4) Calculate, according to the state transition probability equation, the probability P_ij with which each ant chooses its next target node, thereby determining the moving direction of the ants:

P_ij = τ_ij / Σ_{i=1}^N τ_ij (2.23)

Fig. 2.4 Flow chart of ant colony optimization algorithm



When the number of ant colony movements N_c is less than the preset value N_c^max, the colony continues to optimize according to the pheromone update equation defined in Eq. (2.22) and the state transition probability equation defined in Eq. (2.23). When N_c reaches the preset value N_c^max, the coordinate m_i, i = 1, 2, . . . , n, corresponding to the maximum pheromone concentration τ_ij^new on each node is found, and the value range of each variable is narrowed:

xilower ← xilower + (m i − Δ) ∗ h i (2.24)

upper upper
xi ← xi + (m i − Δ) ∗ h i (2.25)

Δ is a constant. Then continue from the above step (2) to start the ant colony
optimization process.
(5) Algorithm termination condition: if the maximum interval max(h_i) between the discrete points in step (2) is less than the given precision ε, the optimization algorithm stops and outputs the optimal solution

    x_i* = (x_i^lower + x_i^upper) / 2,  i = 1, ..., n    (2.26)
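The five steps above can be condensed into a short pure-Python sketch (an illustration under stated assumptions, not the book's implementation: the parameter values such as `n_ants`, `n_moves`, and `delta = 1` are chosen arbitrarily for demonstration):

```python
import math
import random

def roulette(tau, rng):
    """Pick node i with probability tau_i / sum(tau) (Eq. 2.23)."""
    r, acc = rng.uniform(0, sum(tau)), 0.0
    for i, t in enumerate(tau):
        acc += t
        if r <= acc:
            return i
    return len(tau) - 1

def aco_minimize(f, lower, upper, N=10, n_ants=30, n_moves=50,
                 rho=0.5, Q=100.0, delta=1, eps=0.01, seed=0):
    """Grid-based ant colony optimization of a 1-D function following
    steps (2)-(5): discretize (Eq. 2.21), move ants and update pheromone
    (Eqs. 2.22-2.23), narrow the range (Eqs. 2.24-2.25), and stop at
    precision eps, returning the interval midpoint (Eq. 2.26)."""
    rng = random.Random(seed)
    while (upper - lower) / N >= eps:
        h = (upper - lower) / N                      # Eq. (2.21)
        tau = [1.0] * (N + 1)                        # equal initial pheromone
        for _ in range(n_moves):
            best_node, best_val = 0, float("inf")
            for _ in range(n_ants):
                node = roulette(tau, rng)            # Eq. (2.23)
                val = f(lower + node * h)
                if val < best_val:
                    best_node, best_val = node, val
            # evaporation plus deposit on the best node found (Eq. 2.22)
            tau = [(1 - rho) * t for t in tau]
            tau[best_node] += Q / math.exp(best_val)
        m = max(range(N + 1), key=lambda i: tau[i])  # strongest node
        lower, upper = (lower + max(m - delta, 0) * h,   # Eq. (2.24)
                        lower + min(m + delta, N) * h)   # Eq. (2.25)
    return (lower + upper) / 2                       # Eq. (2.26)

x_opt = aco_minimize(lambda x: (x - 1) ** 2 + 1, 0.0, 8.0)
```

Run on the quadratic f(x) = (x − 1)² + 1 used in Example 1 below, the sketch converges to a value close to the theoretical optimum x* = 1.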

2.3.1.2 Example Analysis

To verify the effectiveness of the ant colony optimization algorithm, it is first applied to three basic mathematical examples.
Example 1: consider the following typical univariate continuous-domain optimization problem:

    min f(x) = (x − 1)^2 + 1,  x ∈ [0, 8]    (2.27)

The function has a single variable and a theoretical minimum f(x*) = 1 at x* = 1. The ant colony optimization algorithm is used to optimize the function. The experimental results shown in Fig. 2.5 indicate that the algorithm obtains the optimal value f(x̃) = 1.0003 at x̃ = 0.9396, with a calculation time of t = 0.2190 s.
Fig. 2.5 Example 1 curve and ant colony optimization algorithm results

Example 2: consider the following typical univariate continuous-domain optimization problem:

    min f(x) = 5 sin(30x) e^{−0.5x} + sin(20x) e^{0.2x} + 6    (2.28)

This univariate function has several local extrema. The optimization result of the ant colony optimization algorithm (shown in Fig. 2.6) is that the function attains the minimum f(x̃) = 1.2728 at x̃ = 0.5754, with a calculation time of t = 0.0940 s.
Example 3: consider the following typical two-variable continuous-domain optimization problem:

    min f(x1, x2) = x1 e^{−x1² − x2²}    (2.29)

The function attains the theoretical minimum f(x1*, x2*) = −0.4289 at x1* = −1/√2, x2* = 0. The ant colony optimization algorithm finds f(x̃1, x̃2) = −0.4288 at x̃1 = −0.7022, x̃2 = 0.0058, as shown in Fig. 2.7. The calculation time is t = 0.3750 s.

2.3.2 Ant Colony Optimization Based Parameters Optimization Method for SVM

2.3.2.1 Support Vector Machine Parameters

The performance of a support vector machine lies in its ability to predict unknown samples after learning the attributes of the training samples, which is referred to as its generalization ability. The penalty factor C and kernel parameters

Fig. 2.6 Example 2 curves and results of ant colony optimization algorithm

Fig. 2.7 Example 3 curves and ant colony optimization algorithm results

have a significant impact on the generalization performance of the support vector machine. The penalty factor C determines the tradeoff between minimizing the fitting error and maximizing the classification margin. Kernel function parameters, such as the kernel width σ of the Gaussian radial basis function, affect the mapping of the data space and change the complexity of the sample distribution in the high-dimensional feature space. Setting either parameter too large or too small impairs the performance of the support vector machine. In practical engineering applications, it is therefore very important to obtain good generalization performance by optimizing the support vector machine parameters. This section takes the optimization of the penalty factor C and the Gaussian radial basis function parameter σ as an example to illustrate the proposed support vector machine parameter optimization method based on the ant colony optimization algorithm.
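To make the role of the kernel width concrete, the sketch below evaluates the Gaussian RBF kernel for one pair of points at several widths (an illustration assuming the common K(x, z) = exp(−‖x − z‖²/(2σ²)) parameterization; some libraries use γ = 1/(2σ²) instead):

```python
import math

def rbf_kernel(x, z, sigma):
    """Gaussian radial basis kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

x, z = [0.0, 0.0], [1.0, 1.0]          # two points at squared distance 2
for sigma in (0.1, 1.0, 10.0):
    print(sigma, rbf_kernel(x, z, sigma))
# tiny sigma  -> K ~ 0: every point looks dissimilar (overfitting risk)
# large sigma -> K ~ 1: every point looks similar  (underfitting risk)
```

The two extremes correspond to the over-complex and over-smooth feature-space mappings described above.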

2.3.2.2 Objective Function for Optimizing Support Vector Machine Parameters

The goal of optimizing the support vector machine parameters is to use an optimization program to explore a finite subset of the possible solutions and find the parameter values that minimize the generalization error of the support vector machine. Because the error estimated on an independent test set is unbiased, and its variance decreases as the test set grows, this section estimates the generalization error on the test samples.
Suppose a test sample set S' = {(x_i', y_i') | x_i' ∈ H, y_i' ∈ Y, i = 1, ..., l}, where H is the feature set and Y is the label set. The objective function for support vector machine parameter optimization is:

    min T = (1/l) Σ_{i=1}^{l} ψ(−y_i' f(x_i'))    (2.30)

where ψ is the step function, ψ(x) = 1 when x > 0 and ψ(x) = 0 otherwise, and f is the decision function of the support vector machine.
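The objective T of Eq. (2.30) is simply the misclassification rate on the test set, computable in a few lines (the signed decision values would come from a trained support vector machine, which this fragment does not include):

```python
def svm_test_error(decision_values, labels):
    """Objective T of Eq. (2.30): the fraction of test samples for which
    psi(-y_i' f(x_i')) = 1, i.e. for which the sign of the decision value
    disagrees with the label."""
    wrong = sum(1 for f_x, y in zip(decision_values, labels) if -y * f_x > 0)
    return wrong / len(labels)

# five test samples: signed decision values f(x_i') and labels y_i'
T = svm_test_error([0.8, -1.2, 0.3, -0.5, 2.0], [1, -1, -1, 1, 1])
# the 3rd and 4th samples are misclassified, so T = 2/5 = 0.4
```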

2.3.2.3 Pheromone Model of Ant Colony Optimization Algorithm

In the ant colony optimization algorithm, artificial ants construct solutions probabilistically by sensing dynamic artificial pheromone. The core component of the algorithm is its pheromone construction and management mechanism. The pheromone model for optimizing the support vector machine parameters is as follows:
(1) State transition rule

The state transition rule gives the ants the ability to find the optimal parameters through pheromone. In the ant colony optimization algorithm for support vector machine parameter optimization, the role of each ant is to build a candidate solution: an optimal solution is constructed by applying probabilistic decisions that move the ants through adjacent states. The state transition probability equation is:

    P_ij = τ_ij / Σ_{i=1}^{N} τ_ij    (2.31)

where i indexes the candidate values of one parameter to be optimized, j indexes the candidate values of the other parameter, τ_ij is the pheromone value of the parameter-value combination (i, j), and P_ij is the probability that combination (i, j) is selected by the ant colony.
(2) State update rule
The state update rule is designed to drive the ant colony toward the optimal solution. After all ants have built their own solutions, the state update rule is applied only to the subset of locally optimal parameter combinations obtained in the current iteration, so the pheromone of that subset is increased. Together, the state update rule and the state transition rule guide the colony to search for better solutions in the vicinity of good solutions found in the current iteration. Applying the state update rule, the updated pheromone is:

    τ_ij^new = (1 − ρ) τ_ij^old + Q / e^T    (2.32)

where T is the objective function value in Eq. (2.30), ρ is the volatility coefficient, and Q is the pheromone intensity. The state update rule assigns more pheromone to support vector machine solutions with smaller generalization error, making these parameters more likely to be selected by other ants in subsequent iterations.
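Both rules fit in a few lines of Python (a toy sketch: the 3×3 grid, the chosen node, and the error value 0.05 are invented for illustration):

```python
import math
import random

def select_node(tau, rng):
    """State transition rule (Eq. 2.31): pick node (i, j) with probability
    tau_ij / sum of all pheromone values."""
    items = list(tau.items())
    r, acc = rng.uniform(0, sum(t for _, t in items)), 0.0
    for key, t in items:
        acc += t
        if r <= acc:
            return key
    return items[-1][0]

def update_pheromone(tau, best_node, T, rho=0.5, Q=100.0):
    """State update rule (Eq. 2.32): evaporate everywhere, then deposit
    Q / e^T on the locally best parameter combination."""
    for key in tau:
        tau[key] *= (1.0 - rho)
    tau[best_node] += Q / math.exp(T)

# toy pheromone table over a 3x3 grid of (C-index, sigma-index) combinations
tau = {(i, j): 1.0 for i in range(3) for j in range(3)}
update_pheromone(tau, best_node=(1, 2), T=0.05)  # small test error, big deposit
rng = random.Random(1)
picks = [select_node(tau, rng) for _ in range(1000)]
# node (1, 2) now holds ~95% of the pheromone and dominates the draws
```

This is the positive feedback mechanism: a node rewarded for low test error attracts most subsequent selections.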

2.3.2.4 Steps of the Ant Colony Optimization Based Support Vector Machine Parameter Optimization Algorithm

The flow chart of the support vector machine parameter optimization method based on the ant colony optimization algorithm is shown in Fig. 2.8. It consists of the following three main steps:

Fig. 2.8 Flow chart of support vector machine parameter optimization based on ant colony optimization algorithm

(1) Initialize the parameters and variables, divide each variable interval to be optimized into N grid cells, and calculate the grid interval size:

    h_j = (v_j^upper − v_j^lower) / N,  j = 1, ..., m    (2.33)
where v_j^upper and v_j^lower are the upper and lower bounds of each parameter to be optimized, h_j is the grid interval size after division, and m is the number of parameters to be optimized. The grid interval is the same for each parameter, so each grid node represents one combination of parameters. The larger N is, the denser the grid, and the more ants are needed in the computation, which increases the computational load; if N is too small, the convergence rate of the ant colony optimization algorithm decreases. Therefore, considering calculation time and complexity, v_j^upper = 2^10, v_j^lower = 2^−10 and N = 10 are set in the following numerical examples.
(2) At the initial moment, the pheromone levels on all parameter-combination nodes are the same, i.e., all grid nodes have the same distribution, so all ants choose their initial positions randomly; thereafter, each ant selects its next position by the state transition rule described in Eq. (2.31). The support vector machine is then trained with the selected parameter combinations, and the objective function for optimizing the support vector machine parameters is calculated according to Eq. (2.30). When all ants have completed the iteration, the state update rule shown in Eq. (2.32) is applied to the parameter set (grid node) that produces the minimum support vector machine generalization error. The grid nodes with the smallest error in each iteration are rewarded with more pheromone, which attracts more ants to select them, forming a positive feedback search mechanism. Step 2 loops until the maximum number of iterations is reached.
(3) Find the node with the largest pheromone concentration, then search again near that node. The range of each parameter to be optimized is narrowed to:

    v_j^lower ← v_j^lower + (m_j − Δ) h_j    (2.34)

    v_j^upper ← v_j^lower + (m_j + Δ) h_j    (2.35)

where Δ is a constant and m_j is the node subscript of the maximum pheromone concentration.

The above three steps loop until the grid interval is less than the given precision ε, which is the iteration termination condition of the ant colony optimization based support vector machine parameter optimization method. Generally, the smaller ε is, the more accurate the optimal solution, but the calculation time increases greatly. As a compromise between solving accuracy and computational complexity, ε was set to 0.01 in the subsequent experiments of this section. The optimal parameters obtained are:

    v_j* = (v_j^lower + v_j^upper) / 2,  j = 1, ..., m    (2.36)

where v_j* denotes the optimal value of the jth parameter.
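The three-step loop can be sketched for two parameters. Since training a real support vector machine is outside the scope of this fragment, a bowl-shaped surrogate error function stands in for the test error, with its minimum placed arbitrarily at (log2 C, log2 σ) = (3, −1):

```python
import math
import random

def roulette(tau_axis, rng):
    """Pick a grid node with probability proportional to its pheromone (Eq. 2.31)."""
    r, acc = rng.uniform(0, sum(tau_axis)), 0.0
    for i, t in enumerate(tau_axis):
        acc += t
        if r <= acc:
            return i
    return len(tau_axis) - 1

def aco_tune(error_fn, bounds, N=10, n_ants=30, n_iter=40,
             rho=0.5, Q=100.0, delta=1, eps=0.01, seed=0):
    """Three-step loop of this subsection for m parameters: grid the ranges
    (Eq. 2.33), let ants pick nodes and reward the lowest-error combination
    (Eq. 2.32), then narrow around the strongest node (Eqs. 2.34-2.36)."""
    rng = random.Random(seed)
    lower = [lo for lo, _ in bounds]
    upper = [hi for _, hi in bounds]
    m = len(bounds)
    while max((upper[j] - lower[j]) / N for j in range(m)) >= eps:
        h = [(upper[j] - lower[j]) / N for j in range(m)]      # Eq. (2.33)
        tau = [[1.0] * (N + 1) for _ in range(m)]              # pheromone per axis
        for _ in range(n_iter):
            best_nodes, best_err = None, float("inf")
            for _ in range(n_ants):
                nodes = [roulette(tau[j], rng) for j in range(m)]
                err = error_fn([lower[j] + nodes[j] * h[j] for j in range(m)])
                if err < best_err:
                    best_nodes, best_err = nodes, err
            for j in range(m):                                 # Eq. (2.32)
                tau[j] = [(1 - rho) * t for t in tau[j]]
                tau[j][best_nodes[j]] += Q / math.exp(best_err)
        for j in range(m):                                     # Eqs. (2.34)-(2.35)
            mj = max(range(N + 1), key=lambda i: tau[j][i])
            lower[j], upper[j] = (lower[j] + max(mj - delta, 0) * h[j],
                                  lower[j] + min(mj + delta, N) * h[j])
    return [(lower[j] + upper[j]) / 2 for j in range(m)]       # Eq. (2.36)

# surrogate "test error", minimal at log2(C) = 3, log2(sigma) = -1
surrogate = lambda p: 0.1 * ((p[0] - 3) ** 2 + (p[1] + 1) ** 2)
log2C, log2sigma = aco_tune(surrogate, [(-10.0, 10.0), (-10.0, 10.0)])
```

In a real application, `error_fn` would train a support vector machine with the candidate (C, σ) and return its test error T from Eq. (2.30).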

Table 2.3 Dataset properties

Name          | Number of groups | Training samples per group | Test samples per group | Dimensionality
Breast cancer | 100              | 200                        | 77                     | 9
Diabetes      | 100              | 468                        | 300                    | 8
Heart         | 100              | 170                        | 100                    | 13
Thyroid       | 100              | 140                        | 75                     | 5
Titanic       | 100              | 150                        | 2051                   | 3

2.3.3 Verification and Analysis on Public Benchmark Datasets

To verify the effectiveness of the proposed support vector machine parameter optimization method based on the ant colony optimization algorithm, this section validates the method on datasets from commonly used international standard databases and compares it with other common support vector machine parameter optimization methods.

2.3.3.1 Data Description

The experimental data are the Breast Cancer, Diabetes, Heart, Thyroid and Titanic data sets from the well-known UCI and IDA standard benchmark repositories. Table 2.3 lists the attributes of these data sets: the name, the number of data groups, the number of training samples per group, the number of test samples per group, and the dimensionality.

2.3.3.2 Support Vector Machine Analysis

To analyze the effect of the parameters on the generalization performance of the support vector machine, test error surface charts and contours for the above five data sets are drawn, taking the Gaussian radial basis kernel function as an example, as shown in Figs. 2.9, 2.10, 2.11, 2.12 and 2.13. The interval of the support vector machine parameters C and σ is [2^−10, 2^10]. The (a) plots of Figs. 2.9–2.13 show the test error surfaces, with the x-axis and y-axis representing log2 σ and log2 C, respectively; each node in the (x, y)-plane represents one parameter combination, and the z-axis represents the percentage test error of the support vector machine under that combination. The (b) plots of Figs. 2.9–2.13 show the corresponding test error contours, with the same axes.
As can be seen from Figs. 2.9–2.13, there are many local minima on the test error surfaces (contours), and it is difficult to find the parameter combination that minimizes the error. The parameters of the Gaussian radial basis


Fig. 2.9 Test error surface and test error contour of breast cancer data set


Fig. 2.10 Diabetes data set test error surface and test error contour


Fig. 2.11 Test error surface and test error contour of Heart data set

kernel function that yield a small support vector machine test error generally lie near the point (0, 0). If the Gaussian radial basis function parameter is too large or too small, the test error of the support vector machine on the five data sets increases. As a result, the test error surfaces of the five data sets all present a bowl shape: the errors on both sides are large, and the error in the


Fig. 2.12 Thyroid data set test error surface and test error contour


Fig. 2.13 Test error surface and test error contour of Titanic data set

middle is small. This bowl-shaped error surface helps an optimization algorithm find the minimum-error point.

2.3.3.3 Parameter Influence Analysis of the Ant Colony Optimization Algorithm

When an optimization algorithm is used to optimize the parameters of the support vector machine, parameters belonging to the optimization algorithm itself are inevitably introduced. Whether these parameters are set properly has a great effect on the final solution, so setting the parameters of an optimization algorithm is at least as important as designing it. The main parameters of the ant colony optimization algorithm are the number of ants, the volatility coefficient ρ, the pheromone intensity Q, and the initial pheromone value τ. To set these parameters reasonably, their effect on the support vector machine optimization results is analyzed in detail below, taking the Thyroid data set as an example.


Fig. 2.14 The effect of the number of ants on the support vector machine parameter optimization
method based on ant colony optimization algorithm (Thyroid dataset)

(1) Number of ants


In the ant colony optimization algorithm, setting a reasonable number of ants has an important impact on the results. To explore potential optimal solutions in the feasible solution space in a short time, a sufficient number of ants is required. The performance of the algorithm is therefore tested with 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 ants. The effect of the number of ants on the ant colony optimization based support vector machine parameter optimization method is shown in Fig. 2.14. As the figure shows, when the number of ants is 20, 30, 70, 80 or 90, the performance of the support vector machine is best (the test error is smallest), while increasing the number of ants increases the running time of the program.
(2) Volatility coefficient
In the ant colony algorithm, the volatility coefficient ρ uniformly reduces the pheromone at all state points. From a practical point of view, volatility is needed to prevent the ant colony optimization algorithm from converging too quickly to a locally optimal region; its effect encourages exploration of new search space. Figure 2.15 shows the effect of the volatility coefficient ρ on the ant colony optimization based support vector machine parameter optimization method. The figure shows that the test error on the Thyroid data set is smallest when ρ is 0.2, 0.4, or 0.6–0.7, and the calculation time is shortest when ρ is 0.5.
(3) Pheromone intensity
In the ant colony algorithm, the pheromone intensity Q adjusts the speed at which the positive feedback process evolves toward the global optimal solution. Figure 2.16 shows the effect of the pheromone intensity Q on the ant colony optimization based support vector machine parameter optimization method. The figure shows that the test error on the Thyroid data set is small when Q is 20–100, 140, or 190–200.
(4) Initial value of pheromone
Because the ant colony optimization algorithm obtains suitable solutions mainly through indirect pheromone communication between artificial ants, the initial pheromone value is very important. Figure 2.17 shows the complex effect of the initial pheromone value τ on the ant colony optimization based support vector machine parameter optimization method. The figure shows that the test error on the Thyroid data set is minimal when the initial value τ is 160, and the computing time is shortest when τ is 80.
Concerning the parameters of the ant colony optimization algorithm, Samrout et al. [14] and Duan et al. [15] have systematically studied their effect mechanisms and settings. Samrout et al. [14] suggested that the ratio of the number of ants to the number of grid nodes should be approximately 1:0.7, and Duan et al. [15] suggested that the ant colony optimization algorithm achieves good global convergence when the volatility coefficient ρ = 0.5. Based on these results, Table 2.4 gives the ant colony optimization algorithm parameters used in this section.
(5) Numerical results and analysis

First, the first training set and test set of each of Breast Cancer, Diabetes, Heart, Thyroid and Titanic were analyzed. The support vector machine parameter optimization method based on the ant colony optimization algorithm proposed in this section is compared with the parameter optimization method of support vector


Fig. 2.15 Effect of the volatility coefficient ρ on the support vector machine parameter optimization
method based on ant colony optimization algorithm (Thyroid dataset)

Fig. 2.16 Effect of pheromone intensity Q on the support vector machine parameter optimization method based on ant colony optimization algorithm (Thyroid dataset)


Fig. 2.17 Effect of initial pheromone values τ on the support vector machine parameter optimiza-
tion method based on ant colony optimization algorithm (Thyroid dataset)

Table 2.4 Parameter settings

Parameter               | Value
Number of ants          | 80
Volatility coefficient  | 0.5
Pheromone intensity     | 100
Initial pheromone value | 100

machine based on the grid algorithm [16] in terms of the optimal parameter values
obtained, the test error rate, the calculation time, etc., as shown in Table 2.5.
The results show that the parameter optimization method of Support vector
machine based on ant colony optimization algorithm takes much less time than

Table 2.5 Comparison of the ant colony optimization based and grid based support vector machine parameter optimization methods

                Grid method                                       Ant colony optimization algorithm
Dataset       | Optimal C | Optimal σ | Test error (%) | Time (s)  | Optimal C | Optimal σ | Test error (%) | Time (s)
Breast cancer | 8.03      | 5.04      | 25.97          | 2547.30   | 1.98      | 14.41     | 25.97          | 1437.80
Diabetes      | 174.30    | 151.74    | 23.33          | 29,078.00 | 176.23    | 151.58    | 23.00          | 19,298.00
Heart         | 799.55    | 150.29    | 19.00          | 1446.40   | 1043.80   | 2.00      | 16.00          | 519.58
Thyroid       | 89.84     | 3.31      | 4.00           | 702.89    | 37.38     | 2.22      | 2.67           | 666.20
Titanic       | 1024.50   | 60.10     | 22.57          | 639.95    | 1045.00   | 54.50     | 22.57          | 429.22

the grid algorithm in the experiments on all five data sets; it obtains the same test errors as the grid algorithm on Breast Cancer and Titanic, and smaller test errors on Diabetes, Heart and Thyroid. This shows that the proposed ant colony optimization based method obtains the desired optimal parameters more easily than the grid algorithm based support vector machine parameter optimization method.
Second, the same experiments as in the literature [6, 17, 18] were set up using the Gaussian radial basis function, with the first five training sets and test sets of each data set. The results of the proposed ant colony optimization based support vector machine parameter optimization method are shown in Table 2.6 and compared with those of Refs. [6, 17, 18]. Each result is reported as mean test error ± test error variance. Compared with the other methods, the proposed method achieves the minimum average test error on the Breast Cancer, Diabetes, Thyroid and Titanic data sets, and on the Heart data set its average test error is close to the minimum obtained with the radius-margin bound. The test error variance describes the deviation of the individual test errors from the mean; on the Titanic data set the proposed method achieves the minimum variance in addition to the minimum average test error, and its test error variance is also competitive on Breast Cancer, Diabetes and Thyroid.
The mean test errors of the numerical experiments (Table 2.6) are represented graphically by the histogram in Fig. 2.18, which compares the test errors of the various methods on the five commonly used standard data sets. In the figure, CV denotes the five-fold cross validation method of [12], RM the radius-margin bound method of [12], SB the span bound method of [12], and M a fast algorithm for optimizing the support vector machine parameters based on
Table 2.6 Comparison of test errors (%) of support vector machine parameter optimization methods

Dataset       | 5-fold cross-validation [12] | Radius-margin bound [12] | Span bound [12] | Method of [17] | CMA-ES [18]  | Ant colony optimization
Breast cancer | 26.04 ± 4.74                 | 26.84 ± 4.71             | 25.59 ± 4.18    | 25.48 ± 4.38   | 26.00 ± 0.08 | 23.38 ± 4.00
Diabetes      | 23.53 ± 1.73                 | 23.25 ± 1.70             | 23.19 ± 1.67    | 23.41 ± 1.68   | 23.16 ± 0.11 | 22.80 ± 1.33
Heart         | 15.95 ± 3.26                 | 15.92 ± 3.18             | 16.13 ± 3.11    | 15.96 ± 3.13   | 16.19 ± 0.04 | 16.00 ± 5.70
Thyroid       | 4.80 ± 2.19                  | 4.62 ± 2.03              | 4.56 ± 1.97     | 4.70 ± 2.07    | 3.44 ± 0.08  | 3.20 ± 2.02
Titanic       | 22.42 ± 1.02                 | 22.88 ± 1.23             | 22.50 ± 0.88    | 22.90 ± 1.16   | –            | 21.63 ± 0.54

Fig. 2.18 Comparison of test errors of support vector machine parameter optimization methods

empirical error gradient estimation [17]; ES denotes the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [18], and ACO (ant colony optimization) denotes the support vector machine parameter optimization method proposed in this section.
The experimental comparisons on the international standard databases in Table 2.6 and Fig. 2.18 show that the proposed ant colony optimization based support vector machine parameter optimization method achieves satisfactory results compared with other common support vector machine parameter optimization methods. These numerical results show that using the ant colony optimization algorithm to optimize the support vector machine parameters is feasible and effective.

2.3.4 Application in Electric Locomotive Rolling Bearing Single Fault Diagnosis

2.3.4.1 Fault Diagnosis Method for Rolling Bearing of Electric Locomotive

In rolling bearing fault diagnosis, the fault characteristic frequencies of the bearing are usually analyzed with various signal processing methods. Although the fault characteristic frequencies can be calculated from the bearing parameters, in many cases it is very difficult to identify them in the signal, for the following reasons:
(1) The change of bearing geometry and assembly makes it difficult to determine
the characteristic frequency of bearing accurately.
(2) Bearing faults at different positions cause different transient responses in the signal; moreover, the transient response is easily submerged in the wideband response and noise, which makes it difficult to extract the bearing fault features.
(3) Even with the same fault, the signal characteristics are different in different
damage stages (different degree of damage).
(4) The running speed and load of the rotating shaft greatly affect the vibration
of the machine, which makes the monitored vibration signal show different
characteristics.
(5) The signal and parameters of special bearing are difficult to obtain, so the method
of analyzing the vibration signal of bearing based on the characteristic frequency
of bearing is not feasible in this situation.
Therefore, the mechanical fault diagnosis method of this section, which uses the ant colony optimization algorithm to optimize the support vector machine parameters, does not rely primarily on accurate extraction of the bearing fault characteristic frequencies. Instead, it simply extracts time-domain and frequency-domain statistical features of the signal and then applies the proposed ant colony optimization based support vector machine for fault pattern recognition, highlighting the effectiveness of the parameter optimization method.
(1) Feature extraction of signal
The time-domain and frequency-domain features extracted from the signal are listed in Table 2.7. Feature F1 is the waveform index, F2 the peak index, F3 the pulse index, F4 the margin index, F5 the kurtosis index, and F6 the skewness index; feature F7 represents the vibration energy in the frequency domain; features F8–F10, F12 and F16–F19 characterize the concentration of signal energy in the frequency domain; and features F11 and F13–F15 characterize changes in the position of the dominant frequency [19].
(2) Fault pattern recognition
Because mechanical failures exhibit multiple fault patterns, a multi-class support vector machine strategy is needed to identify them. Let the training set be S = {(x_i, y_i) | x_i ∈ H, y_i ∈ {1, 2, ..., M}, i = 1, ..., l}, where x_i is a training sample, H is the Hilbert space, y_i is the attribute label of x_i, M is the number of classes, and l is the number of samples. Multi-class support vector machine classification usually adopts one of two strategies: the “one-to-many” and the “one-to-one” multi-classification algorithms.
(1) The basic principle of the “one-to-many” multi-classification algorithm of support vector machine is as follows:
For j = 1, 2, ..., M − 1, the following operation is carried out: the jth class is treated as the positive class and the remaining M − 1 classes as the negative class. The decision function obtained by the support vector machine is
46 2 Supervised SVM Based Intelligent Fault Diagnosis Methods

Table 2.7 Statistical features in the time domain and frequency domain

Time-domain features, where x(n) is the time series, n = 1, 2, ..., N, N is the number of data points, and x̄ is the mean of x(n):
F1 (waveform index) = sqrt((1/N) Σ_{n=1}^{N} x(n)²) / ((1/N) Σ_{n=1}^{N} |x(n)|)
F2 (peak index) = max|x(n)| / sqrt((1/N) Σ_{n=1}^{N} x(n)²)
F3 (pulse index) = max|x(n)| / ((1/N) Σ_{n=1}^{N} |x(n)|)
F4 (margin index) = max|x(n)| / ((1/N) Σ_{n=1}^{N} sqrt(|x(n)|))²
F5 (kurtosis index) = ((1/N) Σ_{n=1}^{N} (x(n) − x̄)⁴) / (sqrt((1/N) Σ_{n=1}^{N} (x(n) − x̄)²))⁴
F6 (skewness index) = ((1/N) Σ_{n=1}^{N} (x(n) − x̄)³) / (sqrt((1/N) Σ_{n=1}^{N} (x(n) − x̄)²))³

Frequency-domain features, where s(k) is the spectrum, k = 1, 2, ..., K, K is the number of spectral lines, and f_k is the frequency value at the kth spectral line:
F7 = (1/K) Σ_{k=1}^{K} s(k)
F8 = Σ_{k=1}^{K} (s(k) − F7)² / (K − 1)
F9 = Σ_{k=1}^{K} (s(k) − F7)³ / (K (√F8)³)
F10 = Σ_{k=1}^{K} (s(k) − F7)⁴ / (K F8²)
F11 = Σ_{k=1}^{K} f_k s(k) / Σ_{k=1}^{K} s(k)
F12 = sqrt(Σ_{k=1}^{K} (f_k − F11)² s(k) / K)
F13 = sqrt(Σ_{k=1}^{K} f_k² s(k) / Σ_{k=1}^{K} s(k))
F14 = sqrt(Σ_{k=1}^{K} f_k⁴ s(k) / Σ_{k=1}^{K} f_k² s(k))
F15 = Σ_{k=1}^{K} f_k² s(k) / sqrt(Σ_{k=1}^{K} s(k) · Σ_{k=1}^{K} f_k⁴ s(k))
F16 = F11 / F12
F17 = Σ_{k=1}^{K} (f_k − F11)³ s(k) / (K F12³)
F18 = Σ_{k=1}^{K} (f_k − F11)⁴ s(k) / (K F12⁴)
F19 = Σ_{k=1}^{K} (f_k − F11)^{1/2} s(k) / (K √F12)
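A few of the time-domain features can be computed directly (a partial illustration; the test signal and the feature subset are chosen for convenience, and for a pure sine the expected values are known in closed form):

```python
import math

def time_domain_features(x):
    """F1, F2, F3, and F5 of Table 2.7 for a time series x(n)
    (a partial sketch; the remaining features follow the same pattern)."""
    N = len(x)
    mean = sum(x) / N
    rms = math.sqrt(sum(v * v for v in x) / N)
    mean_abs = sum(abs(v) for v in x) / N
    peak = max(abs(v) for v in x)
    var = sum((v - mean) ** 2 for v in x) / N
    F1 = rms / mean_abs                                    # waveform index
    F2 = peak / rms                                        # peak index
    F3 = peak / mean_abs                                   # pulse index
    F5 = (sum((v - mean) ** 4 for v in x) / N) / var ** 2  # kurtosis index
    return F1, F2, F3, F5

# for a pure sine: F1 = pi/(2*sqrt(2)), F2 = sqrt(2), F3 = pi/2, F5 = 1.5
sine = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
F1, F2, F3, F5 = time_domain_features(sine)
```

An impulsive bearing fault raises F2, F3 and F5 well above these baseline values, which is why such indices are useful as classifier inputs.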

Fig. 2.19 The “one-to-many” classification flow chart of support vector machine

    f^j(x) = sgn[ Σ_{i=1}^{l} y_i α_i^j K(x_i · x) + b^j ]    (2.37)

where α_i^j and b^j are the coefficients of the jth support vector machine, and K(x_i · x) is the kernel function.
If f^j(x) = 1, then x belongs to class j; otherwise x is input to the next support vector machine, until all the support vector machines have been tried. Figure 2.19 shows this classification strategy.

(2) The basic principle of the “one-to-one” multi-classification algorithm of support vector machine is as follows:
Among the M classes of samples, a support vector machine is constructed for every pair of classes, so C_M² = M(M − 1)/2 support vector machines are constructed in total. For the support vector machine constructed from the class-i and class-j samples, the decision function is

    f^{ij}(x) = sgn[ Σ_{n=1}^{m} y_n^{ij} α_n^{ij} K(x_n · x) + b^{ij} ]    (2.38)

where m is the total number of class-i and class-j samples; x_n is a sample of class i or class j; y_n^{ij} is the attribute label of sample x_n; α_n^{ij} and b^{ij} are the coefficients of the support vector machine constructed from the class-i and class-j data; and K(x_n · x) is the kernel function.
When identifying an unknown sample, the C_M² = M(M − 1)/2 support vector machines constructed above are applied in turn. If the support vector machine for classes i and j assigns the sample to class i, the vote count of class i is increased by 1; otherwise, the vote count of class j is increased by 1. Finally, the sample is assigned to the class with the most votes; the decision principle is the maximum voting method.
In this section, a “one-to-one” multi-class support vector machine is used as the basic classifier for mechanical fault pattern recognition, and the proposed ant colony optimization algorithm is then used to optimize the parameters of the support vector machine, realizing intelligent mechanical fault diagnosis.
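The “one-to-one” voting procedure can be sketched as follows; the pairwise decision function here is a toy stand-in for the trained pairwise SVMs of Eq. (2.38), not the book's actual classifiers:

```python
from itertools import combinations

def one_vs_one_predict(x, classes, pairwise_decide):
    """Maximum voting over all C(M, 2) pairwise classifiers.

    pairwise_decide(i, j, x) returns i or j: the class that the
    (i, j) support vector machine assigns to sample x.
    """
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        winner = pairwise_decide(i, j, x)
        votes[winner] += 1
    # the sample is assigned to the class with the most votes
    return max(votes, key=votes.get)

# Toy stand-in: classify a scalar by its distance to each class "center".
centers = {1: 0.0, 2: 5.0, 3: 10.0}

def decide(i, j, x):
    return i if abs(x - centers[i]) < abs(x - centers[j]) else j

print(one_vs_one_predict(4.2, [1, 2, 3], decide))  # → 2
```

Replacing the toy `decide` with the decision functions of Eq. (2.38) gives the full multi-class classifier.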

2.3.4.2 Description of Rolling Bearing Experimental System for Electric Locomotive

At present, the condition monitoring of in-service train equipment in China relies mostly on human senses and a few quantitative monitoring systems (such as the axle temperature alarm system). With this approach, the safety state of much of the key equipment on a train, such as the running gear, the braking system and the electrical equipment, cannot be monitored in real time; faults are only detected after a major accident occurs, when damage is unavoidable. Therefore, this section takes locomotive rolling bearings as an example to verify the effectiveness of the proposed ant colony optimization-based parameter optimization method for support vector machines in mechanical fault diagnosis.
The experimental system mainly comprises the locomotive rolling bearing test platform and its sensors. The test platform consists of a hydraulic motor, two supports (on which two healthy bearings are mounted), a test bearing (52732QT), a tachometer for measuring the rotating speed, and a loading module for loading the test bearing. A 608A11-type ICP accelerometer is mounted below the loading
48 2 Supervised SVM Based Intelligent Fault Diagnosis Methods

Fig. 2.20 Structure diagram of rolling bearing test platform for locomotive

Table 2.8 Locomotive rolling bearings state description

Status                  Normal   Outer ring fault   Inner ring fault   Rolling element fault
Label                   State 1  State 2            State 3            State 4
Rotating speed (r/min)  490      490                500                530

module adjacent to the outer ring of the test bearing. The sampling frequency is
12,800 Hz. The structural diagram of the test platform is shown in Fig. 2.20. The
load applied to the test bearing is 9800 N. The experimental data of rolling bearing
for locomotive include four classes: normal state, outer ring fault state, inner ring
fault state and rolling element fault state.
Table 2.8 briefly describes these four classes of experimental data, and the photographs in Fig. 2.21 record the bearing conditions under the three single-fault classes. Each of the four classes of experimental data contains 30 samples, each composed of 2048 data points; 20 samples per class are used to train the support vector machine and the remaining 10 are used to test its performance. For each sample, the time-domain and frequency-domain statistical features shown in Table 2.7 are extracted, the Gaussian radial basis function is used as the kernel function of the support vector machine, and the proposed ant colony optimization algorithm is then used to optimize the parameters of the “one-to-one” multi-classification support vector machine to achieve accurate fault pattern recognition.
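The sample organization described above (four classes, 30 samples of 2048 points each, split 20/10 per class) can be sketched as follows; `extract_features` is a stand-in computing only three of the 19 statistics of Table 2.7, and the data here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(42)

def extract_features(sample):
    # Stand-in for the 19 time/frequency-domain statistics of Table 2.7:
    # mean, RMS and kurtosis only, as placeholders.
    m = sample.mean()
    rms = np.sqrt((sample ** 2).mean())
    kurt = ((sample - m) ** 4).mean() / sample.var() ** 2
    return np.array([m, rms, kurt])

# 4 bearing states x 30 samples x 2048 points (synthetic stand-in data)
raw = rng.standard_normal((4, 30, 2048))
X = np.array([[extract_features(s) for s in state] for state in raw])

# first 20 samples of each state for training, last 10 for testing
X_train = X[:, :20].reshape(-1, 3)
X_test = X[:, 20:].reshape(-1, 3)
y_train = np.repeat(np.arange(4), 20)   # class labels 0..3
y_test = np.repeat(np.arange(4), 10)
```

The resulting (80, d) training and (40, d) test matrices are what the “one-to-one” SVM with an RBF kernel would be trained and evaluated on.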

2.3.4.3 Fault Diagnosis Results and Analysis

In order to analyze the effect of the support vector machine parameters on the fault diagnosis results, the range of the parameters (penalty factor C and Gaussian radial basis kernel parameter σ) is set to [2⁻¹⁰, 2¹⁰]. A “one-to-one” multi-classification support vector machine without parameter optimization is used as the

(a) outer ring fault (b) inner ring fault (c) rolling element fault

Fig. 2.21 Three classes of faults of rolling bearing of locomotive

basic classifier, and the fault diagnosis error surface and contour lines obtained are
shown in Fig. 2.22.
Figure 2.22a shows the fault diagnosis error surface, where the x-axis and y-axis
are log2 σ and log2 C respectively. Each node in the (x, y)-plane on the diagnostic
error surface represents a parameter combination, and the z-axis represents the diag-
nostic error percentage of the Support vector machine under each parameter combina-
tion. Figure 2.22b shows the diagnostic error contour lines, and the x-axis and y-axis
are log2 σ and log2 C respectively. As can be seen from the figure, the diagnostic
error presents the characteristics of high on the left and low on the right. Therefore,
parameter optimization of Support vector machine is an important step when using
“one-to-one” multi-classification Support vector machine for fault diagnosis.
The proposed ant colony optimization-based parameter optimization method for support vector machines is applied to the fault diagnosis experiments. The range of the support vector machine parameters (penalty factor C and Gaussian radial basis function parameter σ) is still [2⁻¹⁰, 2¹⁰]. A total of five experiments are carried out, and the optimal parameter values, fault diagnosis accuracy and computation time obtained in each experiment are shown in Table 2.9. The optimization parameters

(a) test error surface chart (b) test error contour

Fig. 2.22 Analysis of support vector machine parameters in rolling bearing fault diagnosis
(penalty factor C and Gaussian radial basis function parameter σ) obtained from the five experiments are rather scattered, because the ant colony optimization algorithm is a probabilistic optimization algorithm based on multiple ant agents and the randomly selected starting points of the ants differ from run to run. However, the test-error surface of the four classes of locomotive rolling bearings used in the experiment contains a large zero-error region (as shown in Fig. 2.22), so although the optimized parameters C and σ differ in each run, the accuracy is unaffected. As can be seen from the diagnosis results, the ant colony optimization-based parameter optimization method for support vector machines can accurately identify the four common bearing states in the locomotive rolling bearing experiments (normal, outer ring fault, inner ring fault and rolling element fault), with little computation time.
By analyzing the effect of the support vector machine parameters on its generalization performance, the method was validated on five commonly used data sets from international standard databases, and the ant colony optimization-based parameter optimization method for support vector machines was then applied to locomotive rolling bearing fault diagnosis. The main conclusions are as follows:
(1) The ant colony optimization-based parameter optimization method for support vector machines is a probabilistic global optimization algorithm that does not depend on strict mathematical properties of the optimization problem itself. It is an intelligent algorithm based on multiple ant agents, with parallelism, convergence, evolution and robustness. Compared with the gradient method and traditional evolutionary parameter optimization methods for support vector machines, it has the following advantages: it does not require the problem to be continuously defined; the algorithm is simple and easy to implement, involving only basic mathematical operations; it needs only the output value of the objective function, without gradient information; and it processes data quickly.
(2) Validation analysis on five commonly used data sets from international standard databases shows that the ant colony optimization-based parameter optimization method is more effective than other common parameter optimization methods for support vector
Table 2.9 Support vector machine fault diagnosis results based on ant colony optimization
algorithm
Number of experiments Optimal C Optimal σ Accuracy (%) Calculating time (s)
1 2.0481 20.4810 100 84.5717
2 630.2929 85.5254 100 104.5814
3 929.7921 224.4616 100 100.5674
4 105.6777 14.3367 100 127.4912
5 700.0886 421.8886 100 97.3418
Average – – 100 102.9107

machine. The research results show the feasibility and effectiveness of the
algorithm.
(3) The ant colony optimization-based parameter optimization method provides excellent parameters for fault diagnosis of electric locomotive rolling bearings and reduces the blindness of setting the support vector machine parameters manually. The four common rolling bearing states (normal, outer ring fault, inner ring fault and rolling element fault) can be accurately identified, which proves that the method is effective in mechanical fault diagnosis.

2.4 Feature Selection and Parameters Optimization Method for SVM

In engineering practice, once the kernel structure and algorithm of the support vector machine are determined, the two key factors that restrict the support vector machine from achieving good generalization performance are how to select its optimal parameters and how to select the features relevant to the sample attributes for it to learn. In 2002, Chapelle and Vapnik, while studying the parameter selection problem of support vector machines, pointed out three benefits of feature selection: it can improve the generalization performance of the support vector machine, identify the features relevant to the attributes, and reduce the dimension of the input space [12].
Most parameter optimization and feature selection problems for support vector machines have been studied and solved separately. For example, trial-and-error procedures, generalization error estimation, gradient descent methods, and artificial intelligence and evolutionary algorithms have been proposed for the optimization of support vector machine parameters [15]. Methods for support vector machine feature selection include the minimum error upper bound method, genetic algorithms and particle swarm optimization. When the two key factors that restrict the generalization performance (feature selection and parameter optimization) are analyzed and solved separately, their coupled effect on the generalization performance of the support vector machine is not taken into account; improving one side then inevitably sacrifices the other, preventing the generalization performance of the support vector machine from being brought into full play.
In addition, there is no simple one-to-one correspondence between the faults of mechanical equipment and their characteristic signs. Different faults can present the same signs, the characteristics of the same fault are not exactly alike under different conditions, and even for the same equipment the failure symptoms vary greatly under different installation and operating conditions. Usually, the sample features input to the support vector machine are redundant, and the features are
interrelated, which weakens the generalization performance of the support vector
machine. Selecting the optimal fault characteristics and optimizing the support vector
machine parameters simultaneously is a powerful way to improve the generalization
performance of Support vector machine.
The ant colony optimization algorithm-based feature selection and parameter optimization fusion method for support vector machines uses ACA to solve the feature selection and parameter optimization problems of the support vector machine in one procedure. By synchronously obtaining the best mutually matching features and parameters, the generalization ability of the support vector machine can be further improved and multi-class fault diagnosis of mechanical equipment can be realized.

2.4.1 Ant Colony Optimization Based Feature Selection and Parameters Optimization Method for SVM

The ant colony optimization algorithm-based feature selection and parameter optimization fusion method for support vector machines mainly uses the heuristic information within the ant colony to find the optimal feature subset and parameters. The algorithm consists of four parts: initialization, ant colony solving for feature subsets, feature evaluation, and pheromone update. The algorithm structure block diagram is shown in Fig. 2.23.

2.4.1.1 Initialization

Input the original feature set and set the initial parameters of the ant colony optimization algorithm-based feature selection and parameter optimization fusion method; for example, the size of the ant colony (the number of ants) should be selected according to the size of the input feature set.

2.4.1.2 Ant Colony Algorithm for Feature Subset

Ant colony solving of feature subsets is an important part of the ant colony optimization algorithm-based feature selection and parameter optimization fusion method. At the initial moment of the algorithm, each initialized ant freely selects a feature subset from the original feature set containing N features according to the random selection criterion; at all later moments, the ant colony selects feature subsets according to the state transition criterion. The feature subsets solved by the ants are s_1, s_2, ..., s_r, where r is the number of ants, and they contain n_1, n_2, ..., n_r features respectively. Ant colony solving of feature subsets involves two main elements: the random selection criterion and the state transition criterion.

Fig. 2.23 Flow chart of ant colony optimization algorithm and support vector machine fusion
method

(1) Random selection criterion
At the initial moment, because every feature has the same pheromone level and all features have the same distribution, all ants select features randomly to construct feature subsets.
(2) State transition criterion
Apart from constructing the solution subsets according to the random selection criterion at the initial moment, at all other moments the ant colony constructs feature subsets using a probabilistic decision-making strategy called the state transition criterion.

si = arg max{τ (u)}, i = 1, 2, . . . , r (2.39)

where s_i is the feature subset constructed by the ith ant, and τ(u) is the pheromone concentration on feature u. Formula (2.39) guides the ants to construct the feature subset s_i by selecting features u with higher pheromone concentrations. In general, a high pheromone concentration is associated with the optimal target solution, which ensures that the features selected by the ant colony are those able to produce the desired optimal solution. This is explained in more detail in the pheromone update section below.
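The subset-construction step can be sketched as follows; Eq. (2.39) gives only the greedy rule, so the probabilistic branch (roulette-wheel sampling with a mixing parameter q0) is an assumed, commonly used ACO variant rather than the book's exact rule:

```python
import random

def build_subset(pheromone, n_select, q0=0.8, rng=random.Random(0)):
    """Each ant picks n_select distinct features: with probability q0
    it greedily takes the feature with the highest pheromone (Eq. 2.39),
    otherwise it samples proportionally to pheromone concentration."""
    remaining = set(range(len(pheromone)))
    subset = []
    for _ in range(n_select):
        if rng.random() < q0:
            u = max(remaining, key=lambda f: pheromone[f])   # greedy rule
        else:
            feats = sorted(remaining)
            weights = [pheromone[f] for f in feats]
            u = rng.choices(feats, weights=weights, k=1)[0]  # roulette wheel
        subset.append(u)
        remaining.remove(u)
    return subset

tau = [0.1, 0.9, 0.3, 0.7, 0.2]          # pheromone on 5 features
print(build_subset(tau, 3))
```

With q0 = 1.0 the rule is purely greedy and always returns the highest-pheromone features; smaller q0 keeps exploration alive.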

2.4.1.3 Feature Assessment

The feature subsets obtained by the ant colony algorithm need to be input into the support vector machine to further evaluate their merits. At the same time, since the parameters (such as penalty factor C and Gaussian radial basis function parameter σ) affect the performance of the support vector machine, even for the same input feature subset, different parameter settings will produce different generalization performance. Therefore, while evaluating the feature subsets constructed by the ant colony, the parameters of the support vector machine are also optimized to ensure its optimal generalization performance. The feature evaluation part consists of the following three main elements:
(1) Input the feature subsets
The feature subsets s_1, s_2, ..., s_r solved by the ants, which contain n_1, n_2, ..., n_r features respectively, are input into the support vector machine, where r is the number of ants.
(2) Ant colony optimization algorithm optimizes the parameters of support vector
machine
For each feature subset s_i, i = 1, 2, ..., r, the optimal parameters of the support vector machine under that feature subset are obtained. The detailed implementation principles and methods were described in Sect. 2.3.2. The optimal parameters obtained are:

v_j* = (v_j^lower + v_j^upper) / 2,  j = 1, ..., m    (2.40)

where v_j* represents the optimal value of the jth parameter.
(3) Feature subsets and parameters evaluated by the support vector machine
The feature subsets s_i = (e_1^i, e_2^i, ..., e_{n_i}^i), i = 1, 2, ..., r constructed by the ant colony and the corresponding optimal parameters v_j*, j = 1, ..., m are input into the support vector machine. Assume the test sample set is V' = {(x_i', y_i') | x_i' ∈ s_r, y_i' ∈ Y, i = 1, ..., q}, where Y is the attribute label set and q is the number of samples in the test set. Then the evaluation error of the ith ant, based on its feature subset and the corresponding optimal parameters v_j*, j = 1, ..., m, is:

T_ant^i = (1/q) Σ_{j=1}^{q} ψ(−y_j' f_i(x_j'))    (2.41)

where ψ is the step function: ψ(x) = 1 when x > 0, and ψ(x) = 0 otherwise; f_i is the decision function of the support vector machine constructed by ant i. The resulting error value T_ant^i is the evaluation result for the feature subset and the support vector machine parameters.
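Equation (2.41) is simply the misclassification rate written with a step function. A direct transcription, with ±1 labels assumed and a toy stand-in decision function:

```python
def step(x):
    # psi(x) = 1 if x > 0, else 0
    return 1 if x > 0 else 0

def evaluation_error(decision_fn, test_set):
    """T_ant = (1/q) * sum psi(-y * f(x)): the fraction of test samples
    whose decision-function sign disagrees with their +/-1 label."""
    q = len(test_set)
    return sum(step(-y * decision_fn(x)) for x, y in test_set) / q

# Toy stand-in decision function: the sign of the sample itself.
f = lambda x: 1.0 if x >= 0 else -1.0
data = [(0.5, 1), (-2.0, -1), (1.2, -1), (-0.3, 1)]  # two are mislabeled
print(evaluation_error(f, data))  # → 0.5
```

In the fusion method, `decision_fn` would be the SVM trained by ant i on its feature subset with the parameters from Eq. (2.40).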

2.4.1.4 Pheromone Update

A pheromone update is performed once the colony has completed the construction and evaluation of its solutions. Pheromone update follows two criteria: a global update criterion and a local update criterion.
(1) Global update criterion
The global update criterion is applied if and only if all ants have completed the process of solving the feature subsets and the task of feature evaluation. The goal of the global update is to encourage ants to produce the optimal feature subset and the optimal support vector machine parameters: the pheromone concentration of each feature in the optimal feature subset is enhanced, attracting more ants to select the features that produce the optimal solution. The global pheromone update criterion is:

τ (k + 1) = (1 − ρ)τ (k) + QTmax (2.42)

where τ(k + 1) is the pheromone concentration at time k + 1; ρ is the volatility coefficient; τ(k) is the pheromone concentration at time k; Q is the pheromone intensity; and T_max is the optimal solution obtained by the ant colony, expressed as:

T_max = max{ T_ant^i }    (2.43)

where T_ant^i is the evaluation error of the feature subset s_i = (e_1^i, e_2^i, ..., e_{n_i}^i), i = 1, 2, ..., r and the corresponding optimal parameters v_j*, j = 1, ..., m evaluated by the support vector machine, as shown in Eq. (2.41).
(2) Local update criterion
The goal of the local update criterion is to reduce the pheromone concentration of features that were selected by the ant colony but did not achieve good results, and to maintain the pheromone concentration of features not selected by the ant colony. The local update criterion not only reduces the probability that ants select features that have not achieved good results, but also keeps the pheromone of unselected features from decaying, which increases the probability that ants will try features that have not been selected yet. The local pheromone update criterion is:

τ (k + 1) = (1 − α0 )τ (k) + α0 τ0 (2.44)

where α0 (0 < α0 < 1) is the local pheromone update coefficient, and τ0 is the initial pheromone value.
By using the global and local update criteria, the pheromone concentration of each feature that makes up the optimal feature subset is increased, while that of each feature selected by the ant colony but not producing the optimal solution is decreased; the pheromone concentration of features not selected by the colony remains constant. This encourages the ant colony to keep selecting the features that have produced the optimal solution in subsequent optimization, while continuing to construct new candidate optimal features from those not yet selected.
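Equations (2.42) and (2.44) can be applied feature by feature. A sketch in which ρ, Q, α0 and τ0 are arbitrary illustrative constants:

```python
def update_pheromone(tau, best_subset, visited, T_max,
                     rho=0.1, Q=1.0, alpha0=0.1, tau0=1.0):
    """Global rule (Eq. 2.42) for features in the best subset; local rule
    (Eq. 2.44) for other features the ants selected; features never
    selected keep their pheromone unchanged."""
    new_tau = list(tau)
    for f in range(len(tau)):
        if f in best_subset:
            new_tau[f] = (1 - rho) * tau[f] + Q * T_max         # Eq. (2.42)
        elif f in visited:
            new_tau[f] = (1 - alpha0) * tau[f] + alpha0 * tau0  # Eq. (2.44)
    return new_tau

tau = update_pheromone([1.0, 2.0, 2.0, 2.0],
                       best_subset={0}, visited={0, 1, 2}, T_max=0.95)
# feature 0 is reinforced, features 1-2 decay toward tau0, feature 3 is unchanged
```

The three branches mirror the three cases described in the text: reinforce, decay toward the initial value, or leave untouched.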

2.4.1.5 Termination Conditions

The ant colony optimization algorithm-based feature selection and parameter optimization fusion method for support vector machines terminates when a feature subset and an optimal parameter combination make the generalization accuracy of the SVM reach 100%, or when all features have been selected by the ant colony. When the termination condition is reached, the optimal feature subset and the optimal parameter combination are output.
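Putting the four parts together, the outer loop of the fusion method looks roughly like the skeleton below; `evaluate()` is a stub standing in for training and testing the SVM (with ACO-optimized parameters) on a candidate feature subset, and the subset-construction and reinforcement rules are deliberately simplified:

```python
def aco_feature_selection(n_features, evaluate, n_ants=5, max_iter=50):
    """Skeleton of the ACO-SVM fusion loop: construct subsets, evaluate,
    update pheromone, stop at 100% accuracy or the iteration budget."""
    tau = [1.0] * n_features
    best_subset, best_acc = None, -1.0
    for _ in range(max_iter):
        for ant in range(n_ants):
            # simplified stand-in for Eq. (2.39): top-k features by pheromone
            k = 1 + ant % n_features
            subset = sorted(range(n_features), key=lambda f: -tau[f])[:k]
            acc = evaluate(subset)        # SVM accuracy with optimized C, sigma
            if acc > best_acc:
                best_subset, best_acc = subset, acc
        for f in best_subset:             # simplified global reinforcement
            tau[f] = 0.9 * tau[f] + best_acc
        if best_acc >= 1.0:               # termination: perfect accuracy
            break
    return best_subset, best_acc

# Toy objective: accuracy is the Jaccard overlap with the target features {0, 2}.
target = {0, 2}
evaluate = lambda s: len(target & set(s)) / len(target | set(s))
best_subset, best_acc = aco_feature_selection(4, evaluate)
print(best_subset, best_acc)
```

A full implementation would replace the top-k rule with the random-selection and state-transition criteria, and the reinforcement line with Eqs. (2.42)–(2.44).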

2.4.2 The Application in Rotor Multi Fault Diagnosis of Bently Testbench

In order to verify the effectiveness of the proposed ant colony optimization algorithm-
based Support vector machine feature selection and parameter optimization fusion
method in the application of mechanical fault diagnosis, the Bently rotor multi-class
fault diagnosis experiment is carried out.

2.4.2.1 Description of Test System

The Bently rotor is a general and concise model of rotating machinery, which can
simulate multiple types of faults caused by vibration in large rotating machinery.
This section uses the Bently rotor experimental platform to carry out the simulation
experiment of rotor multi-class faults. The Bently rotor experiment platform is shown
in Fig. 2.24, Fig. 2.24a is the physical picture of the Bently rotor experiment platform,
and Fig. 2.24b is the structure diagram of the Bently rotor experiment platform.
The Bently rotor experiment system mainly consists of a Bently rotor test bench
(consisting of a motor, two sliding bearings, a spindle, a rotor mass disk, a speed
regulator, and a signal regulator), sensors, and a Sony EX data acquisition system.
The diameter of the shaft is 10 mm and the length of the shaft is 560 mm. The mass of
the rotor mass disc is 800 g and the diameter is 75 mm. The eddy current displacement
sensor is installed on the mounting frame in the radial direction of the spindle, and the
sampling frequency is 2000 Hz. The Bently rotor experiments are carried out under six different operating states and speeds (as shown in Table 2.10): mass unbalance (0.5 g eccentric mass), oil film whirl, slight radial friction of the rotor (the rotor is lightly rubbed near the right bearing by the friction rod supplied with the Bently kit), shaft crack (crack depth 0.5 mm), the compound fault of mass unbalance (0.5 g eccentric mass) and radial friction of the rotor, and the normal state.

(a) the physical diagram of the Bently rotor test stand

(b) the schematic diagram of the Bently rotor test bed

Fig. 2.24 Bently rotor test stand



Table 2.10 Bently rotor test bench fault status table

State type                                                    Speed (r/min)  Abbreviation  Label
Mass unbalance                                                1800           U             1
Oil film whirl                                                1900           W             2
Radial friction of rotor                                      1800           R             3
Shaft crack                                                   2000           S             4
Compound fault of mass unbalance and radial friction of rotor 1800           C             5
Normal                                                        2000           N             6

2.4.2.2 Bently Rotor Fault Diagnosis Method

(1) Signal acquisition

The sensor and the Sony EX data acquisition system are used in the experiment to
collect the signals of the Bently rotor under six different running states respectively.
The acquired time-domain signals are shown in Fig. 2.25. Although the time-domain
waveform in the figure reflects some abnormal characteristic information of fault
states, it is not enough to accurately reveal the fault characteristics of each state. In
the experiment, 32 vibration signal samples will be collected from each state, and
each signal sample contains 1024 sampling points.
(2) Feature extraction
From the rotor vibration signals collected on the Bently rotor experimental platform under the various running states (Fig. 2.25), it can be seen that the amplitude of the time-domain waveform of the Bently rotor under normal operation is small, while under the other fault states the amplitude increases or the waveform changes to some extent. For the vibration signals collected under each running state, the 19 time-domain and frequency-domain statistical features shown in Table 2.7 are extracted to characterize the vibration behavior of the Bently rotor under the different running states, and to verify the effectiveness of the proposed ant colony optimization-based fusion method of SVM feature selection and parameter optimization in intelligent rotor fault diagnosis.
(3) Fault diagnosis
The 19 statistical features shown in Table 2.7 are extracted from the vibration signal
samples in each state. The first 16 samples of each state are taken as training samples
of Support vector machine, and the last 16 samples are taken as test samples. A “one-
to-one” multi-classification Support vector machine is constructed as a basic learner,
and then the feature selection and parameter optimization of SVM are carried out
using the proposed ant colony optimization algorithm.

(a) Quality Imbalance

(b) Oil film whirling

(c) Radial friction of rotor

(d) Normal

(e) The combined fault of mass unbalance and radial friction of rotor

(f) Shaft crack

Fig. 2.25 Bently rotor vibration signal waveform in time domain under six operating states

2.4.2.3 Results and Analysis

The experimental results of parameter optimization method of Support vector


machine based on ant colony optimization algorithm (Method 1), feature selec-
tion method of Support vector machine based on ant colony optimization algorithm
(Method 2), and fusion method of feature selection and parameter optimization of
Support vector machine based on ant colony optimization algorithm (Method 3) are
shown in Table 2.11. Since the fault classes are numerous and difficult to distinguish, the support vector machine whose parameters are optimized only by the ant colony optimization algorithm (Method 1) obtains the optimal parameters (C = 63.49, σ = 37.64) within the parameter interval C, σ ∈ [2⁻¹⁰, 2¹⁰]. Its accuracy in identifying the six operating states of the Bently rotor experiment is 97.92%, and the normal state of the rotor is prone to misdiagnosis.
(Method 2) that uses the ant colony optimization algorithm for feature selection uses
the optimal parameters obtained in Method 1 and selects the optimal features, but
there is no consistent optimal matching relationship between the two, so the gener-
alization performance of the support vector machine is not improved, and its fault
diagnosis accuracy is the same as the result of Method 1. This shows that the sample
features and support vector machine parameters in support vector machine jointly
affect the performance of support vector machine. Only achieving the optimal perfor-
mance unilaterally cannot make the generalization performance of support vector
machine reach the optimal performance. Only by obtaining the best features and
parameters matching each other synchronously can the generalization performance
of support vector machine be further improved. Therefore, it is necessary to use
an optimization algorithm to optimize the features and parameters of the support
vector machine. The fusion method of feature selection and parameter optimization
of support vector machine (Method 3) based on ant colony optimization algorithm
obtained the best feature (F1 , F3 , F6 , F9 , F10 ∼ F14 , F19 ) and the best parameter
(C = 50.69, σ = 0.39) which matched each other at the same time, so the test
accuracy was 100%, and the fault diagnosis ability was better than that of Methods
1 and 2.
In 2007, Sun et al. from Shanghai Jiao Tong University used C4.5 decision tree
and principal component analysis to carry out fault diagnosis research on six rotor

Table 2.11 Bently rotor multi-class fault diagnosis experiment results comparison

Name      Optimal features                    Optimal parameters (C, σ)   Accuracy of each fault (%)                 Average accuracy (%)
                                                                          U     W     R     S     C     N
Method 1  –                                   63.49, 37.64                100   100   100   100   100   87.5         97.92
Method 2  F1, F3, F5, F7, F9, F14, F17, F19   –                           100   100   100   100   100   87.5         97.92
Method 3  F1, F3, F6, F9, F10–F14, F19        50.69, 0.39                 100   100   100   100   100   100          100

Table 2.12 Bently rotor multi-class fault diagnosis experimental results

Method name                                                   Accuracy of each fault (%)                 Average accuracy (%)
                                                              U     W     R     S     C     N
C4.5 decision tree (PCA for feature selection)                95    100   100   100   95    100          98.3
C4.5 decision tree (no feature selection)                     100   100   95    100   95    100          98.3
Back propagation neural network (PCA for feature selection)   100   95    100   85    95    100          95.8
Back propagation neural network (no feature selection)        100   85    100   90    95    100          95

operating states similar to this section (normal, mass unbalance, oil film vortex, radial
friction of rotor, compound fault of mass unbalance and radial friction of rotor, and
rotor shaft crack) on the Bently rotor test bench [20]. Seven time-domain statistics (peak-to-peak value, waveform index, pulse index, peak index, margin index, skewness index, kurtosis index) and eleven frequency-domain characteristics were extracted from the test signals under each state, and a C4.5 decision tree and a back propagation neural network were used to realize intelligent fault diagnosis; the experimental results are shown in Table 2.12 [21]. Comparing the results in Table 2.12 with those in Table 2.11 for the same six rotor operating states (normal, mass unbalance, oil film whirl, radial friction of the rotor, compound fault of mass unbalance and radial friction of the rotor, and shaft crack) shows that the C4.5 decision tree-based method also obtains good results, with an accuracy of 98.3%, but makes some diagnostic errors when identifying mass unbalance and the compound fault of mass unbalance and rotor radial friction. The method proposed in this section accurately identifies all six common rotor operating states and obtains the best diagnostic results.

2.4.3 The Application in Electrical Locomotive Rolling Bearing Multi Fault Diagnosis

The schematic diagram of the experimental test platform for electric locomotive rolling bearings is shown in Fig. 2.20, and the experimental system is described in Sect. 2.3.4. The vibration signals of the locomotive bearings are collected by an accelerometer mounted under the loading module adjacent to the outer ring of the test bearing in

nine states: normal, minor outer ring abrasion fault, outer ring damage fault, inner ring abrasion fault, rolling element abrasion fault, compound fault of outer ring damage and inner ring abrasion, compound fault of outer ring damage and rolling element abrasion, compound fault of inner ring abrasion and rolling element abrasion, and compound fault of outer ring damage, inner ring abrasion and rolling element abrasion. The nine states cover
common fault classes, including both complex compound fault classes and different
degrees of damage for the same type of fault. The nine states of locomotive rolling
bearings are shown in Table 2.13. The vibration signals of the locomotive bearing are collected in the nine states, with 32 vibration signal samples collected for each state, each sample containing 2048 points. Figure 2.26 shows the time-domain signals of the locomotive rolling bearings in the above nine states. Observing these time-domain waveforms, we can find that:

(1) The test signals of the locomotive bearing in all nine states are disturbed by noise, so the fault information is submerged to different degrees; the amplitude of the vibration signal in the normal state is smaller than that in the eight fault states.
(2) When the outer ring of the bearing fails, vibration shocks occur each time a rolling element passes the damaged position of the outer ring. Therefore, the time-domain waveforms of the vibration signals in Fig. 2.26b, c reflect certain shock characteristics. The vibration signal amplitude of the bearing with minor outer ring abrasion (Fig. 2.26b) is smaller than that of the bearing with the serious outer ring damage fault (Fig. 2.26c).
(3) When the inner ring of the bearing is damaged, the time-domain waveform of the vibration signal, as shown in Fig. 2.26d, shows some impact characteristics, but

Table 2.13 Description of the nine fault states of locomotive rolling bearings

State type                                                                              Abbreviation   Label
Normal                                                                                  N              1
Minor abrasion fault on outer ring                                                      O              2
Outer ring damage fault                                                                 S              3
Inner ring abrasion fault                                                               I              4
Rolling element abrasion fault                                                          R              5
Compound fault of outer ring damage and inner ring abrasion                             OI             6
Compound fault of outer ring damage and rolling element abrasion                        OR             7
Compound fault of inner ring abrasion and rolling element abrasion                      IR             8
Compound fault of outer ring damage, inner ring abrasion and rolling element abrasion   OIR            9

the amplitude modulation phenomenon is not obvious.
(4) When the rolling element fails, the time-domain waveform of the vibration signal in Fig. 2.26e shows shock characteristics, because during each rotation cycle the rolling element contacts the raceway surface of the inner ring and the raceway surface of the outer ring once each; the phenomenon of amplitude modulation also occurs.
(5) When compound faults occur among the outer ring, inner ring and rolling element of the bearing, the time-domain waveforms of the vibration signals show different impact characteristics. When the compound fault of outer ring damage and rolling element abrasion or the compound fault of inner ring abrasion and rolling element abrasion occurs, amplitude modulation appears in the time-domain waveform of the vibration signal, as shown in Fig. 2.26f–i.

The spectra of the vibration signals of the locomotive rolling bearings in the nine states are shown in Fig. 2.27. The characteristic frequency information of the different bearing fault classes is completely submerged in the signal, so it is not easy to recognize the characteristics of each state from the spectrum.
The data sets obtained in the nine states listed in Table 2.13 (normal, minor abrasion fault on outer ring, outer ring damage fault, inner ring abrasion fault, rolling element abrasion fault, and the four compound faults) are taken as the diagnostic objects. According to Table 2.7, 19 time-domain and frequency-domain statistical features are extracted from the vibration signals, and a “one-to-one” multi-classification support vector machine is constructed as the basic learning machine, in which 16 feature sample sets from each state are used to train the support vector machine and the remaining 16 feature sample sets are used to test its performance.
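As a concrete illustration of this kind of feature extraction, the sketch below computes a handful of the time-domain statistics named earlier in the chapter (peak-peak value, waveform, pulse, peak and margin indices, skewness and kurtosis) from one vibration sample; the exact 19-feature set of Table 2.7 is not reproduced here, and the function and variable names are our own.

```python
import numpy as np

def time_domain_features(x):
    """A few representative time-domain statistics of one vibration sample."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    rms = np.sqrt(np.mean(x ** 2))
    mean_abs = np.mean(abs_x)
    sqrt_amp = np.mean(np.sqrt(abs_x)) ** 2   # square-root amplitude
    peak = np.max(abs_x)
    return {
        "peak_peak": np.max(x) - np.min(x),
        "waveform_index": rms / mean_abs,      # shape factor
        "pulse_index": peak / mean_abs,        # impulse factor
        "peak_index": peak / rms,              # crest factor
        "margin_index": peak / sqrt_amp,       # clearance factor
        "skewness": np.mean((x - x.mean()) ** 3) / np.std(x) ** 3,
        "kurtosis": np.mean((x - x.mean()) ** 4) / np.std(x) ** 4,
    }

# One 2048-point sample, matching the sample length used in this section
rng = np.random.default_rng(0)
sample = rng.standard_normal(2048)
features = time_domain_features(sample)
```

In practice each of the 32 samples per state would be passed through such a function, and the resulting feature vectors stacked to form the training and testing sets.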
The fault diagnosis results of the parameter optimization method of the support vector machine based on the ant colony optimization algorithm (Method 1), the feature selection method of the support vector machine based on the ant colony optimization algorithm (Method 2), and the fusion method of feature selection and parameter optimization of the support vector machine based on the ant colony optimization algorithm (Method 3) are shown in Table 2.14. Method 1 uses the ant colony optimization algorithm to find the optimal parameters (C = 57.60, σ = 64.26) of the support vector machine within the parameter interval C, σ ∈ [2^{-10}, 2^{10}], which makes the average recognition accuracy over the nine locomotive rolling bearing states reach 89.58%. Although the optimal parameters obtained in Method 1 are used in Method 2, the generalization performance of the support vector machine is not improved, because these parameters are not matched to the feature subset solved in Method 2 synchronously. The average fault diagnosis accuracy of Method 2 is the same as that of Method 1, indicating that the sample features and the parameters jointly affect the performance of the support vector machine, and the best features and parameters

(a) Normal

(b) Minor abrasion fault on outer ring

(c) Outer ring damage fault

(d) Inner ring abrasion fault

(e) Rolling element abrasion fault

(f) Compound fault of outer ring damage and inner ring abrasion

Fig. 2.26 Time domain waveforms of vibration signals of locomotive rolling bearings in nine states

(g) Compound fault of outer ring damage and rolling element abrasion

(h) Compound fault of inner ring abrasion and rolling element abrasion

(i) Compound fault of outer ring damage, inner ring abrasion and rolling element abrasion

Fig. 2.26 (continued)

matching each other should be obtained synchronously to comprehensively improve the generalization performance of the support vector machine. Therefore, the proposed fusion method of feature selection and parameter optimization based on the ant colony optimization algorithm (Method 3) improves the diagnostic capability of the support vector machine by simultaneously selecting the optimal features F1 ∼ F18 and the optimal parameters (C = 1.02, σ = 0.04), and an average fault diagnosis accuracy of 95.83% is obtained.
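To make the fusion idea concrete, the following is a greatly simplified sketch of ant-colony-style joint feature selection and (C, σ) optimization for an RBF-kernel SVM; the pheromone representation, update rule and all numeric settings are illustrative assumptions rather than the exact ACA of this chapter, and synthetic data stands in for the bearing feature sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=120, n_features=19, n_informative=6,
                           random_state=1)

# Candidate values for C and the RBF width sigma (kernel gamma = 1 / (2 sigma^2));
# the chapter searches [2**-10, 2**10], discretized here for simplicity.
C_grid = 2.0 ** np.arange(-4, 5, 2)
sigma_grid = 2.0 ** np.arange(-4, 5, 2)

n_feat = X.shape[1]
# Pheromone trails: one per feature (select/skip) and per parameter value
tau_feat = np.ones(n_feat)
tau_C = np.ones(len(C_grid))
tau_sigma = np.ones(len(sigma_grid))

def evaluate(mask, C, sigma):
    """Fitness of one ant's solution: cross-validated SVM accuracy."""
    if not mask.any():
        return 0.0
    clf = SVC(C=C, gamma=1.0 / (2 * sigma ** 2))
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

best_score, best = 0.0, None
for _ in range(10):                       # iterations (generations)
    for _ant in range(5):                 # ants per iteration
        # Each ant picks features and parameters with pheromone-proportional odds
        mask = rng.random(n_feat) < tau_feat / (tau_feat + 1.0)
        iC = rng.choice(len(C_grid), p=tau_C / tau_C.sum())
        isig = rng.choice(len(sigma_grid), p=tau_sigma / tau_sigma.sum())
        score = evaluate(mask, C_grid[iC], sigma_grid[isig])
        if score >= best_score:
            best_score = score
            best = (mask.copy(), C_grid[iC], sigma_grid[isig])
        # Pheromone reinforcement proportional to accuracy (illustrative rule)
        tau_feat[mask] += score
        tau_C[iC] += score
        tau_sigma[isig] += score
    # Evaporation keeps early choices from dominating forever
    tau_feat *= 0.9
    tau_C *= 0.9
    tau_sigma *= 0.9

mask, C, sigma = best
```

The key point mirrored from the chapter is that one ant encodes a feature mask *and* a (C, σ) pair, so the pheromone reinforcement couples the two decisions instead of optimizing them separately.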
Table 2.14 also shows the following: Method 1, Method 2 and Method 3 identify the normal state, inner ring fault state and rolling element fault state of the locomotive bearings with the same capability, all achieving 100% test accuracy, which indicates that the three support-vector-machine-based methods have the same diagnostic capability for simple fault states. The proposed fusion method of feature selection and parameter optimization based on the ant colony optimization algorithm (Method 3) effectively improves the identification of the compound faults of the outer ring and inner ring, the outer ring and rolling element, the inner ring and rolling element,

(a) Normal

(b) Minor abrasion fault on outer ring

(c) Outer ring damage fault

(d) Inner ring abrasion fault

(e) Rolling element abrasion fault

Fig. 2.27 Vibration signal spectrum of locomotive rolling bearings in nine states

(f) Compound fault of outer ring damage and inner ring abrasion

(g) Compound fault of outer ring damage and rolling element abrasion

(h) Compound fault of inner ring abrasion and rolling element abrasion

(i) Compound fault of outer ring damage, inner ring abrasion and rolling element abrasion

Fig. 2.27 (continued)

outer ring and inner ring and rolling element. This shows that the fusion method of feature selection and parameter optimization based on the ant colony optimization algorithm can further improve the generalization performance of the support vector machine by solving for the matching optimal sample features and optimal parameters at one time, thus enhancing the diagnosis of complex faults. However, it is worth noting that Method 3 improves the identification ability of the severe outer ring fault, the compound fault of the outer ring and inner ring, the compound
Table 2.14 Comparison of multi-fault diagnosis results of locomotive rolling bearings

Name      Optimal features            Optimal parameters   Accuracy of each fault state (%)                                  Average
                                      (C, σ)               N     O      S      I     R     OI    OR     IR     OIR          accuracy (%)
Method 1  –                           57.60, 64.26         100   100    87.5   100   100   75    75     75     93.75        89.58
Method 2  F2, F4, F6, F13, F14, F19   –                    100   100    87.5   100   100   75    75     75     93.75        89.58
Method 3  F1 ∼ F18                    1.02, 0.04           100   93.75  93.75  100   100   87.5  93.75  93.75  100          95.83

fault of the outer ring and rolling element, and the compound fault of the outer ring, inner ring and rolling element. However, owing to the similarity between the minor outer ring abrasion fault and these faults, the ability of the support vector machine to identify the minor outer ring abrasion fault is slightly reduced.
To address the coupled effect of sample features and parameters on the support vector machine, a fusion method of support vector machine feature selection and parameter optimization based on the ant colony optimization algorithm has been proposed, and its algorithm structure and flow have been constructed. The method has been applied to the fault diagnosis of the Bently rotor and the rolling bearings of an electric locomotive.
The following conclusions can be drawn from the experimental results:
(1) The fusion method of feature selection and parameter optimization of the support vector machine based on the ant colony optimization algorithm uses the ACA to solve the feature selection and parameter optimization problems of the support vector machine simultaneously, so that the optimal features and parameters are obtained synchronously; this improves the generalization performance of the support vector machine and yields better fault diagnosis results.
(2) The fusion method can more effectively identify multiple complex fault states, including compound faults, whereas support vector machines that only optimize the parameters or only select the features with the ant colony optimization algorithm have limited diagnostic ability for complex fault states.
(3) Because extracting the time-domain and frequency-domain statistical characteristics of vibration signals requires little specialized knowledge or experience and is easy to implement, these statistical features of the Bently rotor and locomotive bearing vibration signals are extracted as the input sample features. If other advanced fault feature extraction techniques (such as wavelet analysis) are used to provide more effective fault features, the ability to diagnose complex fault states can be further improved.

2.5 Ensemble-Based Incremental Support Vector Machines

The generalization performance of a support vector machine classifier (or predictor) depends mainly on the availability of limited samples and the adequacy of the sample information. In practical applications, collecting representative training samples is expensive and time-consuming, and it is not easy to accumulate a sufficient number of training samples within a given period. At the same time, how to fully mine and use the knowledge and rules contained in limited sample information to improve the generalization performance of the classifier (or predictor) is the eternal pursuit of machine learning. According to statistical learning theory, when the training samples are very limited, there is a risk that the classifier (or predictor) will have low prediction accuracy on unknown samples. To improve the generalization ability of support vector machines, ensemble learning, which trains and combines multiple classifiers (or predictors), and reinforcement learning, which learns adaptively, have become the main research directions in the field of machine learning in recent years. Therefore, studying how to improve the structure of the support vector machine algorithm is an essential problem for improving its generalization performance.
At present, research results on ensemble learning emerge constantly. The construction strategies of ensemble learning algorithms mainly include Bagging, AdaBoost, Grading and Decorrelated. Research on the construction of the basic classifiers in ensemble learning algorithms mainly covers two different strategies: the ensemble of identical classifiers and the ensemble of different classifiers. In addition, the basic classifiers can be combined in different ways, mainly selection-oriented ensembles and combiner-oriented ensembles, and the construction mode of an ensemble can be serial, parallel or hierarchical. Research shows that ensemble learning can improve the generalization performance of classifiers (predictors) and, to some extent, overcome the following problems:
(1) The few-shot learning problem. When the number of training samples is sufficient, many machine learning methods can construct the optimal classifier (predictor) and show excellent generalization performance. But when the training samples are limited, a machine learning algorithm can only construct many classifiers (predictors) with inconsistent prediction accuracy. Although the structural complexity of the constructed classifier (predictor) is low, there is a great risk that its performance in predicting unknown samples will be poor. However, when multiple single basic classifiers (predictors) are integrated by an ensemble learning method, the generalization performance can be better than that of any single basic classifier (predictor).
(2) The generalization performance problem. When data are limited, the search space of a machine learning algorithm is a function of the available training data, and may be much smaller than the hypothesis space considered in the infinite-sample asymptotic case. Ensemble learning can expand the function space and obtain a more accurate approximation and prediction of the target function, thus improving the generalization performance of the classifier (predictor).
Since the support vector machine algorithm ultimately reduces to solving a linearly constrained quadratic programming (QP) problem, a kernel matrix whose size scales with the square of the number of training samples must be computed and stored. When an ensemble learning algorithm is applied to train several single basic classifiers (predictors) and construct ensemble-based classifiers (predictors), the training and learning tasks become more complex and larger in scale. This requires learning new knowledge and updating the trained classifier (predictor) in the manner of reinforcement learning, so as to ensure better generalization performance with higher learning efficiency. Reinforcement learning is an adaptive learning method with feedback as input, which obtains uncertain rewards and learns optimal behavior through interaction with the environment. Because of its online and adaptive learning characteristics, reinforcement learning generalizes well in large spaces and complex nonlinear systems; it is therefore becoming an effective tool for intelligent strategy optimization and is increasingly widely used in practice. In 2007, the American scholar Parikh et al. [20, 22] proposed a data fusion method based on ensemble reinforcement learning, which searches for the most discriminative information across multiple datasets through ensemble-based incremental classifiers and builds the ability to continuously learn new knowledge from a variety of data sources.
Therefore, combining the advantages and characteristics of ensemble learning and reinforcement learning, and aiming to improve the generalization performance of support vector machines, a method of ensemble-based incremental support vector machines is proposed on the basis of these two theories, so as to improve the generalization of support vector machines from the perspective of the machine learning architecture and algorithm construction.

2.5.1 The Theory of Ensemble Learning

There is an objective fact in the field of machine learning: it is much easier to find a
large number of rough empirical rules than to find a highly accurate prediction rule.
Although it is difficult to directly establish a highly accurate prediction rule, it is an
achievable goal to indirectly induce a more accurate prediction rule through a large
number of rough empirical rules. The basic idea of ensemble learning is to first find a large number of rough empirical rules with a weak learning algorithm; the weak learning algorithm is then applied cyclically, with training sets carrying different weight distribution coefficients fed to it so that a new empirical rule is generated in each cycle; after several cycles, the ensemble learning algorithm combines the empirical rules generated over the cycles into a final rule. In recent years, ensemble learning theory has been successfully applied in machine learning to improve the performance of a single classifier (learner). Ensemble support vector machines were first proposed by Vapnik [1]: boosting is used to train each single support vector machine, and these single support vector machines are then combined by another support vector machine. Suppose there is a set of n single support vector machines { f_1, f_2, ..., f_n }. If the performance of each single support vector machine is equal, the ensemble of these machines will perform the same as each single machine. However, if their performances differ and their errors are uncorrelated, then even when support vector machine f_i(x) misclassifies a sample x, the recognition results of most of the other support vector machines for x may still be correct. More precisely, for a binary classification problem, since the error probability of a random guess is 1/2, assuming that the error probability of each single support vector machine is p < 1/2,

then the error of the ensemble support vector machines established by the “majority voting method” is:

P_E = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} p^k (1-p)^{n-k}    (2.45)

Since p < 1/2, we have P_E < \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} (1/2)^k (1/2)^{n-k} = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} (1/2)^n. When the number of single support vector machines is large, the error of the ensemble support vector machines will be very small. Since the error probability of a single support vector machine making random guesses is 1/2, if single support vector machines better than random guessing can be established, the error of the final ensemble support vector machines will be greatly reduced.
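Equation (2.45) is easy to check numerically; the short sketch below (our own illustration) evaluates the majority-voting error for a few ensemble sizes and shows it shrinking as n grows when p < 1/2.

```python
from math import comb, ceil

def ensemble_error(p, n):
    """Eq. (2.45): error of a majority-voting ensemble of n independent
    classifiers, each with individual error probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(ceil(n / 2), n + 1))

# With p = 0.3 < 1/2, the ensemble error shrinks as n grows
errors = [ensemble_error(0.3, n) for n in (1, 11, 51)]
```

For a single classifier the "ensemble" error is just p itself, while for 51 independent classifiers it becomes negligible, which is exactly the argument made in the text.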
To overcome the limitation of a single support vector machine on multi-classification problems in engineering applications, the ensemble support vector machines take multi-classification support vector machines composed of k(k − 1)/2 binary support vector machines as the basic classifiers, and use an ensemble learning algorithm to integrate multiple such multi-classification support vector machines to improve the generalization performance. The algorithm diagram of the ensemble support vector machines is shown in Fig. 2.28.
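As a rough sketch of this construction (not the authors' exact implementation), the snippet below builds one-vs-one multi-class SVMs with k(k − 1)/2 binary machines and combines several of them by majority voting on bootstrap resamples; scikit-learn and synthetic four-class data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.utils import resample

X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Basic learner: a one-vs-one multi-class SVM
# (k = 4 classes -> k(k - 1)/2 = 6 binary SVMs)
def make_base():
    return OneVsOneClassifier(SVC(C=10.0, gamma="scale"))

# Train several basic learners on bootstrap resamples, combine by majority vote
members = []
for seed in range(5):
    Xb, yb = resample(X, y, random_state=seed)
    members.append(make_base().fit(Xb, yb))

votes = np.stack([m.predict(X) for m in members])   # (n_members, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
accuracy = (majority == y).mean()
```

Each member here plays the role of one multi-classification SVM block in Fig. 2.28, and the vote aggregation plays the role of the combiner.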
The ultimate goal of ensemble learning is to improve the generalization perfor-
mance of learning algorithms. Because of the huge potential and application prospect
of ensemble learning, ensemble learning method has become one of the most impor-
tant research directions in the field of machine learning, and has been evaluated
by international scholar Dietterich as the first of the four research directions in the

Fig. 2.28 The algorithm diagram of ensemble support vector machines



field of machine learning [23]. However, how to explore effective new ensemble learning methods and apply them in engineering practice remains one of the key problems in the ensemble learning of support vector machines.

2.5.2 The Theory of Reinforcement Learning

The real world is full of massive data and information containing rich potential knowledge waiting to be explored; at the same time, the speed at which data and information are updated is astonishing, and information technology is required to overcome the curse of dimensionality while exhibiting excellent generalization ability. Reinforcement learning is an adaptive learning method with feedback as input: through interaction with the environment, uncertain rewards are obtained, and an optimal behavior strategy is finally learned. Owing to its online and adaptive learning characteristics, reinforcement learning is an effective tool for intelligent strategy optimization and is gradually becoming one of the key information technologies for building a “smart earth”. The structural framework of the standard reinforcement learning algorithm is shown in Fig. 2.29; it is mainly composed of a state perceptron, a learner and a behavior selector. The state perceptron completes the mapping from the external environment to the agent’s internal perception. The learner updates the agent’s strategy knowledge according to the observed value and reward value of the environment state. The behavior selector makes behavior choices based on the agent’s current policy knowledge and acts on the external environment. The basic principle of reinforcement learning is that if a certain behavior leads to a positive reward from the environment, the tendency toward this behavior will be strengthened in the future; otherwise, the tendency will weaken [24].
According to the basic principle of reinforcement learning, the goal of reinforcement learning is to learn a behavioral strategy so as to obtain the maximum reward from the environment. Therefore, its objective function is usually expressed in the following three forms:

V^d(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}, \quad 0 < \gamma \le 1    (2.46)

V^d(s_t) = \sum_{i=0}^{h} r_{t+i}    (2.47)

V^d(s_t) = \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}    (2.48)

where γ is the discount factor and r_t is the environmental reward received after the transition from environment state s_t to s_{t+1}. Equation (2.46) is the infinite-horizon discounted reward model; Eq. (2.47) is the finite-horizon reward model, which only considers the rewards of h steps in the future; and Eq. (2.48) is the average reward model. According to the objective function, the optimal behavior strategy can be determined as:

d^* = \arg\max_{d} V^d(s), \quad \forall s \in S    (2.49)

where d is a behavior strategy, V is the objective function, s is the environment state, and S is the set of environment states. A simple and common reinforcement learning algorithm is as follows (as shown in Fig. 2.30):
(1) According to the initial conditions, the training samples are randomly divided into independent training subsets sub_1, sub_2, ..., sub_N.
(2) Subset sub_1 is used to train the learner, yielding the learner’s current objective function V_1 and current optimal behavior strategy (behavior 1).
(3) Using the current objective function V_1 and current optimal behavior strategy (behavior 1), the learner is retrained together with training subset sub_2 to obtain the current objective function V_2 and current optimal behavior strategy (behavior 2); training subset sub_3 is then combined with V_2 and behavior 2 to retrain the learner, and this process is repeated up to training subset sub_N.
(4) The learner is retrained using the current objective function V_{N−1} and current optimal behavior strategy (behavior N − 1) together with training subset sub_N; the final learner obtained is the termination target.

Fig. 2.30 The algorithm diagram of a reinforcement support vector machine
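One possible reading of these steps for a support vector machine learner is sketched below, under our own assumption (the text leaves the learner abstract) that the "knowledge" carried between rounds is the current set of support vectors: train on sub_1, then retrain on the retained support vectors plus each following subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=2)

# Step 1: split the training samples into independent subsets sub_1 ... sub_N
subsets = np.array_split(np.arange(len(X)), 5)

# Steps 2-4: train on sub_1, then repeatedly retrain on the previously learned
# "knowledge" (here: the retained support vectors) plus the next subset
Xk, yk = X[subsets[0]], y[subsets[0]]
clf = SVC(kernel="rbf", gamma="scale").fit(Xk, yk)
for idx in subsets[1:]:
    sv = clf.support_                    # indices of current support vectors
    Xk = np.vstack([Xk[sv], X[idx]])     # carry knowledge forward
    yk = np.concatenate([yk[sv], y[idx]])
    clf = SVC(kernel="rbf", gamma="scale").fit(Xk, yk)

accuracy = clf.score(X, y)
```

Because only the support vectors are carried forward, each retraining set stays far smaller than the full accumulated data, which is the efficiency motivation behind the incremental scheme.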

2.5.3 Ensemble-Based Incremental Support Vector Machines

Combining the advantages and characteristics of ensemble learning and reinforcement learning, and aiming to improve the generalization performance of support vector machines, this section further proposes the Ensemble-based Incremental Support Vector Machines (EISVM) within the theoretical framework of ensemble learning and reinforcement learning. By fully mining the knowledge contained in the limited sample space, the goal of improving the generalization ability of support vector machines is pursued from the aspects of the machine learning theoretical system and algorithm construction.
The goal of ensemble reinforcement learning is to improve the generalization performance of support vector machines. In the process of ensemble reinforcement learning, a single support vector machine serving as a basic learner is considered a hypothesis h from input space X to output space Y. The parameters of a single support vector machine can be obtained with the ant colony optimization algorithm proposed in Sect. 2.3. For each iteration t = 1, 2, ..., T_k, dataset S_k (k = 1, 2, ..., n) is divided into a training subset TR_t and a testing subset TE_t according to the current distribution D_t (t = 1, 2, ..., T_k). Then, using training subset TR_t, a hypothesis h_t: X → Y is generated by a single support vector machine. The distribution D_t is obtained from the weights assigned to the samples according to the classification performance of the single support vector machine on each sample. In general, samples that are difficult to classify correctly are given higher weights, to increase the probability that they are selected for the next training subset. The distribution function D_1 of the initial iteration is initialized to 1/m (m is the number of samples in dataset S_k), so that each sample has the same probability of being selected for the first training subset. If there are other reasons or prior knowledge, the initial distribution function can be customized in other ways. The error of the single support vector machine h_t on dataset S_k (k = 1, 2, ..., n) is [20, 22]:

\varepsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)    (2.50)

where the error \varepsilon_t is the sum of the distribution weights of the misclassified samples. If \varepsilon_t > 1/2, the hypothesis h_t: X → Y generated by the current single support vector machine must be discarded; a new training subset TR_t and testing subset TE_t are then constructed, and the support vector machine is retrained. This means that a single support vector machine is only required to achieve 50% (or lower) error on the dataset. Since an error of one half in a binary classification problem corresponds to random guessing, this is the weakest possible condition for a binary classification problem. However, for a classification problem with N classes, the error probability of a random guess is (N − 1)/N, so it becomes harder to ensure an error of 50% or less as N increases. If \varepsilon_t < 1/2 is satisfied, the regularized error \beta_t is calculated as [20, 22]:

\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}    (2.51)

All hypotheses h_t: X → Y generated by the single support vector machines in the first t iterations are combined by weighted majority voting, where the voting weight equals the logarithm of the reciprocal of the regularized error \beta_t. Hypotheses that perform well on their own training and testing subsets are thus given more votes. The composite classification hypothesis H_t is obtained by combining the single classification hypotheses h_t: X → Y [20, 22]:

H_t = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log \frac{1}{\beta_t}    (2.52)

The classification performance of the composite hypothesis H_t depends on the hypothesis obtaining the largest vote among the t single classification hypotheses. The error of the composite classification hypothesis H_t is [20, 22]:

E_t = \sum_{i:\, H_t(x_i) \ne y_i} D_t(i) = \sum_{i=1}^{m} D_t(i)\,[|H_t(x_i) \ne y_i|]    (2.53)

where [| · |] equals 1 when its argument is true and 0 otherwise. If E_t > 1/2, the current hypothesis h_t is discarded, a new training subset and testing subset are reconstructed, and a new hypothesis h_t: X → Y is generated by a single support vector machine. Note that when a dataset S_{k+1} containing new information is input, the composite error E_t may exceed 1/2. In other cases, since all single hypotheses h_t: X → Y constituting the composite hypothesis H_t have been verified via Eq. (2.50) to have an error of at most 50% on dataset S_k, the condition E_t < 1/2 is almost always satisfied. If E_t < 1/2, the regularized composite error is calculated as [20, 22]:

B_t = \frac{E_t}{1 - E_t}    (2.54)

The weight \omega_{t+1}(i) is updated and the next distribution D_{t+1} is calculated according to the composite hypothesis H_t generated in the ensemble learning process. This distribution update rule is the key to ensemble reinforcement learning [20, 22]:

\omega_{t+1}(i) = \omega_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases} = \omega_t(i) \times B_t^{\,1 - [|H_t(x_i) \ne y_i|]}    (2.55)

D_{t+1}(i) = \frac{\omega_{t+1}(i)}{\sum_{i=1}^{m} \omega_{t+1}(i)}    (2.56)

According to this rule, if sample $x_i$ is correctly classified by the composite classification assumption $H_t$, its weight is multiplied by a factor $B_t$ less than 1; if it is wrongly classified, its weight remains unchanged. This distribution update rule reduces the probability that a correctly classified sample is selected for the next training subset $TR_{t+1}$ and increases the probability that a currently misclassified sample is selected. The ensemble-based incremental support vector machines thus focus on the samples that are repeatedly misclassified. When new types of samples are input into the sample set, the current composite classification assumption $H_t$ is particularly likely to misclassify them. Equation (2.55) therefore guarantees that misclassified samples will be selected into the next training subset, ensuring the feasibility of ensemble reinforcement learning.
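The weight update of Eqs. (2.55)–(2.56) can be sketched numerically. The toy weights, correctness mask, and value of $B_t$ below are illustrative only:

```python
import numpy as np

def update_distribution(weights, correct, B_t):
    """Multiply weights of correctly classified samples by B_t (< 1),
    leave misclassified weights unchanged (Eq. 2.55), then normalize
    to obtain the next sampling distribution D_{t+1} (Eq. 2.56)."""
    new_w = np.where(correct, weights * B_t, weights)
    return new_w, new_w / new_w.sum()

# Toy check: samples 0 and 2 classified correctly, sample 1 missed.
w = np.ones(3)
new_w, D = update_distribution(w, np.array([True, False, True]), B_t=0.25)
# The misclassified sample now carries the largest sampling probability.
```

With these numbers, $D_{t+1} = [1/6, 2/3, 1/6]$: the misclassified sample is four times more likely than either correct one to enter $TR_{t+1}$.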
After the $T_k$ classification assumptions of each data subset $S_k$ have been generated, the final classification assumption of the ensemble-based incremental support vector machines is output according to all composite assumptions [20, 22]:

$$H_{final}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \sum_{t: H_t(x) = y} \log\frac{1}{\beta_t} \quad (2.57)$$
It can be seen that, during ensemble reinforcement learning, the original knowledge is not lost when new samples are input, because all historical classification assumptions $h_t: X \to Y$ of the support vector machines are retained. The ensemble-based incremental support vector machines can therefore inherit previously learned knowledge while continuing to learn new knowledge from new samples, comprehensively improving the generalization performance of support vector machines. The algorithm structure of the ensemble-based incremental support vector machines is shown in Fig. 2.31, and the algorithm flow in Table 2.15.
78 2 Supervised SVM Based Intelligent Fault Diagnosis Methods

Fig. 2.31 The algorithm structure diagram of the ensemble-based incremental support vector
machines

Table 2.15 The algorithm flow of the ensemble-based incremental support vector machines

Input: Dataset $S_k = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, $k = 1, 2, \ldots, n$;
integer $T_k$, the number of support vector machines to be generated
Do for $k = 1, 2, \ldots, n$
  Initialize $\omega_1(i) = 1/m$, where $m$ is the total number of samples in sample set $S_k$
  Do for $t = 1, 2, \ldots, T_k$
  (1) Let $D_t = \omega_t(i) / \sum_{i=1}^{m} \omega_t(i)$; $D_t$ is a distribution
  (2) Select training subset $TR_t$ and testing subset $TE_t$ according to distribution $D_t$
  (3) Use a single support vector machine to generate an assumption $h_t: X \to Y$ from input space $X$ to output space $Y$, and calculate the error of $h_t$ on the dataset $S_t = TR_t + TE_t$: $\varepsilon_t = \sum_{i: h_t(x_i) \ne y_i} D_t(i)$. If $\varepsilon_t > 1/2$, go to step (2); otherwise calculate the regularized error $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$
  (4) Use the maximum-weight election method to obtain the composite assumption $H_t = \arg\max_{y \in Y} \sum_{t: h_t(x)=y} \log(1/\beta_t)$, and calculate the composite error $E_t = \sum_{i: H_t(x_i) \ne y_i} D_t(i) = \sum_{i=1}^{m} D_t(i)\,[|H_t(x_i) \ne y_i|]$
  (5) Let $B_t = E_t / (1 - E_t)$ (regularized composite error), and update the sample weights: $\omega_{t+1}(i) = \omega_t(i) \times B_t$ if $H_t(x_i) = y_i$, and $\omega_{t+1}(i) = \omega_t(i)$ otherwise
  End
End
Output the final classification assumption of the ensemble-based incremental support vector machines: $H_{final}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \sum_{t: H_t(x)=y} \log(1/\beta_t)$

2.5.4 The Comparison Experiment Based on Rolling Bearing Incipient Fault Diagnosis

Since the ensemble-based incremental support vector machines can improve the generalization performance of support vector machines, the rolling bearing incipient fault experiments of the internationally known Case Western Reserve University (CWRU) bearing data center [25] are used for method comparison to verify their effectiveness. In addition, the ensemble-based incremental support vector machines are further applied to the fault diagnosis of locomotive rolling bearings, covering compound faults and different damage degrees of the same fault type.

2.5.4.1 Introduction to Experimental System

The rolling bearing fault data provided by the Case Western Reserve University (CWRU) bearing data center in the United States are often used by scholars around the world to verify the effectiveness of proposed methods. Therefore, the experimental data of this center are used as a standard experiment for method verification and comparative analysis.
As shown in Fig. 2.32, the rolling bearing test bench contains a 1.5 kW motor
(left), a torque sensor (center), and a dynamometer (right). The experimental bearings
(including the drive end bearings and the fan end bearings) are used to support the
motor shaft. The parameters and other information of the experimental bearings are
shown in Table 2.16. The single-point incipient faults of the test bearings were seeded by electro-discharge machining (EDM). The fault diameters are 0.18 mm, 0.36 mm, 0.53 mm, and 0.71 mm, and the fault depth is 0.28 mm; the damage is slight and therefore belongs to the incipient-fault category. Table 2.17 lists the detailed fault parameters of the drive end and fan end bearings, covering different fault types under different speeds and fault diameters. Two acceleration sensors are installed on the motor at the drive end and the fan end, respectively, to measure the bearing vibration signals under the different fault states. The data acquisition system includes a high-frequency signal amplifier, and the sampling frequency is 12,000 Hz. Signal samples are collected under each fault state, the 19 time domain and frequency domain statistical features listed in Table 2.7 are extracted from each sample, and the feature sets are then input into the ensemble-based incremental support vector machines for training. The experimental results of the ensemble-based incremental support vector machines are compared with those of other methods.
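The full 19-feature set of Table 2.7 is not reproduced in this section; the sketch below computes a few representative time and frequency domain statistics of the same kind (an illustrative subset of our own choosing, with a synthetic sinusoid standing in for measured vibration data):

```python
import numpy as np

def basic_condition_features(x, fs):
    """Representative time/frequency domain statistics of the kind
    listed in Table 2.7 (illustrative subset, not the full 19)."""
    rms = np.sqrt(np.mean(x ** 2))                      # root mean square
    kurtosis = np.mean((x - x.mean()) ** 4) / np.var(x) ** 2
    crest_factor = np.max(np.abs(x)) / rms
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mean_freq = (freqs * spectrum).sum() / spectrum.sum()  # spectral centroid
    return rms, kurtosis, crest_factor, mean_freq

fs = 12_000                      # CWRU sampling frequency (Hz)
n = 1024                         # sample length used in the experiments
f0 = 8 * fs / n                  # 93.75 Hz: integer cycles, no leakage
x = np.sin(2 * np.pi * f0 * np.arange(n) / fs)
rms, kurt, crest, mean_freq = basic_condition_features(x, fs)
# For a pure sine: rms = 1/sqrt(2), kurtosis = 1.5, crest = sqrt(2),
# and the spectral centroid sits at f0.
```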

(a) The picture of the CWRU bearing test bench

(b) The structural diagram of the CWRU bearing test bench

Fig. 2.32 The CWRU bearing test bench in the United States

Table 2.16 The bearing experimental parameters (size unit: mm)

Parameter                     Drive end bearing                Fan end bearing
Bearing designation           6205-2RS JEM SKF,                6203-2RS JEM SKF,
                              deep groove ball bearing         deep groove ball bearing
Diameter of inner ring        25                               17
Diameter of outer ring        52                               40
Thickness                     15                               12
Diameter of rolling element   8                                7
Diameter of pitch circle      39                               29

2.5.4.2 Comparative Analysis of Experiment I

The same experimental analysis was carried out on CWRU bearing data according to
the experimental process and parameter settings in [26]. Three types of fault signals,
including rolling element fault, inner ring fault and outer ring fault (the loading
area is concentrated in the 12:00 direction, that is, the vertical upward direction), are
collected from the drive end bearings, and the fault size is 0.18 mm. The experimental

Table 2.17 The list of bearing faults

Bearings             Fault location    Fault diameter (mm)    Fault depth (mm)   Speed (r/min)          Motor load (kW)
No fault (normal)    –                 –                      –                  1797/1772/1750/1730    0/0.74/1.48/2.21
Drive end bearings   Outer ring        0.18/0.36/0.53         0.28               1797/1772/1750/1730    0/0.74/1.48/2.21
                     Inner ring        0.18/0.36/0.53/0.71    0.28               1797/1772/1750/1730    0/0.74/1.48/2.21
                     Rolling element   0.18/0.36/0.53/0.71    0.28               1797/1772/1750/1730    0/0.74/1.48/2.21
Fan end bearings     Outer ring        0.18/0.36/0.53         0.28               1797/1772/1750/1730    0/0.74/1.48/2.21
                     Inner ring        0.18/0.36/0.53         0.28               1797/1772/1750/1730    0/0.74/1.48/2.21
                     Rolling element   0.18/0.36/0.53         0.28               1797/1772/1750/1730    0/0.74/1.48/2.21

parameters are shown in Table 2.18. The sampling frequency is 12,000 Hz, and the time domain waveforms of the vibration signals are shown in Fig. 2.33. The frequency spectra of the bearing vibration signals under the three fault states, obtained through the fast Fourier transform, are shown in Fig. 2.34. The same data samples as in [26] are used: 50 signal samples are formed for each of the three fault states by intercepting segments of 1024 points, of which 70% are used as training samples and the remaining 30% are used as testing samples.
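The sample construction just described (1024-point segments, 70/30 split) can be sketched as follows; the random record is only a stand-in for a measured vibration signal:

```python
import numpy as np

def segment(signal, win=1024):
    """Cut one record into as many non-overlapping windows as fit."""
    n = len(signal) // win
    return signal[: n * win].reshape(n, win)

rng = np.random.default_rng(42)
record = rng.standard_normal(1024 * 50)   # one fault-state record
samples = segment(record)                 # 50 samples of 1024 points
n_train = int(0.7 * len(samples))         # 70% train, 30% test
train, test = samples[:n_train], samples[n_train:]
```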
Table 2.19 shows the intelligent fault diagnosis results of the ensemble-based incremental support vector machines and compares them with the fault diagnosis results of seven methods (discrete cosine transform, Daubechies wavelet, Symlets wavelet, Walsh transform, FFT, Walsh transform + rough set theory, FFT

Table 2.18 The description of experimental data including three types of fault states

Bearings             Fault location    Fault diameter (mm)   Motor speed (r/min)   Motor load (kW)   Label
Drive end bearings   Outer ring        0.18                  1750                  1.48              1
                     Inner ring        0.18                  1750                  1.48              2
                     Rolling element   0.18                  1750                  1.48              3

(a) 0.18mm outer ring fault

(b) 0.18mm inner ring fault

(c) 0.18mm rolling element fault

Fig. 2.33 The time domain vibration signals of rolling bearings in three types of fault states

(a) 0.18mm outer ring fault

(b) 0.18mm inner ring fault

(c) 0.18mm rolling element fault

Fig. 2.34 The frequency domain vibration signals of rolling bearings in three types of fault states

+ rough set theory), which are given in [26]. It can be seen from Table 2.19 that the intelligent fault diagnosis methods based on the ensemble-based incremental support vector machines and the general support vector machine both achieve the same excellent generalization performance (100% accuracy) on the three fault states (outer ring fault, inner ring fault, and rolling element fault) and can completely and effectively identify these three simple incipient bearing fault states. The comparison with the other seven methods in the table also proves that, under the same experimental environment and experimental data, the ensemble-based incremental support vector machines and the general support vector machine have better fault identification capability.

2.5.4.3 Comparative Analysis of Experiment II

In order to further verify the intelligent fault diagnosis method based on the ensemble-based incremental support vector machines proposed in this section, the experimental process and parameters of [27] (shown in Table 2.20) are followed. Three types of fault signals collected from the CWRU drive end bearings, namely rolling element fault, inner ring fault, and outer ring fault (the loading area is concentrated in the 12:00 direction, that is, the vertical upward direction), together with the signal data under the

Table 2.19 The comparison of fault diagnosis results of rolling bearings in three types of states

Bearing fault states: (1) outer ring fault; (2) inner ring fault; (3) rolling element fault

Method                                                 Accuracy (%)
Discrete cosine transform                              85 [26]
Daubechies wavelet                                     78 [26]
Symlets wavelet                                        74 [26]
Walsh transform                                        78 [26]
FFT                                                    84 [26]
Walsh transform + rough set theory                     80 [26]
FFT + rough set theory                                 86 [26]
Support vector machine                                 100
Ensemble-based incremental support vector machines     100

normal state of the bearing, are used for analysis. The time domain waveforms and frequency spectra of the vibration signals collected under the four bearing states are shown in Figs. 2.35 and 2.36, respectively.
The fault size of the three bearing faults is 0.36 mm. These are the same fault types as in experiment I, but the 0.36 mm fault size is more severe than the 0.18 mm faults of experiment I; in addition, the motor speed is slightly higher and the motor load somewhat lower than in experiment I. The vibration signals under the four bearing states are intercepted every 1024 points to form 50 samples per state, of which 70% are used as training samples and the remaining 30% as testing samples. The experimental data and parameters shown in Table 2.20 are consistent with [27], so as to ensure the fairness and reliability of the method validation and result comparison.
For the above four bearing states (normal, outer ring fault, inner ring fault, and rolling element fault), Reference [27] proposed a fault diagnosis method based on an improved fuzzy ARTMAP method and a modified distance evaluation technology: nine time domain statistical features (mean, root mean square, variance, skewness, kurtosis, peak index, margin index, waveform index, pulse index), seven frequency domain statistical features, and the first-order continuous wavelet grey moment feature are extracted, and the modified distance evaluation technology is used to
Table 2.20 The description of experimental data including four types of fault states

Bearings             Fault location          Fault diameter (mm)   Motor speed (r/min)   Motor load (kW)   Label
Drive end bearings   Normal                  0                     1772                  0.74              1
                     Outer ring fault        0.36                  1772                  0.74              2
                     Inner ring fault        0.36                  1772                  0.74              3
                     Rolling element fault   0.36                  1772                  0.74              4

(a) Normal

(b) 0.36mm outer ring fault

(c) 0.36mm inner ring fault

(d) 0.36mm rolling element fault

Fig. 2.35 The time domain vibration signals of rolling bearings in four types of fault states

extract the optimal features, after which the improved fuzzy ARTMAP method is used to identify the fault types. The experimental results of the method proposed in [27] are compared with three other similar methods: the first uses the improved fuzzy ARTMAP method for fault diagnosis without feature optimization; the second uses the modified distance evaluation technology to extract the optimal features and the fuzzy ARTMAP method for fault diagnosis; the third uses the fuzzy ARTMAP method without feature optimization.
This section extracts the time domain and frequency domain statistical characteristics of each sample signal as shown in Table 2.7, and then uses the proposed

(a) Normal

(b) 0.36mm outer ring fault

(c) 0.36mm inner ring fault

(d) 0.36mm rolling element fault

Fig. 2.36 The frequency domain vibration signals of rolling bearings in four types of fault states

ensemble-based incremental support vector machines for fault diagnosis, and compares the results with those of the above methods and the general support vector machine. It can be seen from Table 2.21 that, under the current experimental conditions, the ensemble-based incremental support vector machines proposed in this section achieve better results than the general support vector machine and the three fuzzy-ARTMAP-based methods, but slightly worse results than the fault diagnosis method based on the improved fuzzy ARTMAP method and the modified distance evaluation technology. One reason for the gap may be that the latter method uses the first-order continuous wavelet grey moment feature, which is more advanced than the

Table 2.21 The comparison of fault diagnosis results of rolling bearings in four types of states

Bearing fault states: (1) normal; (2) outer ring fault; (3) inner ring fault; (4) rolling element fault

Method                                                                     Accuracy (%)
Improved fuzzy ARTMAP method + modified distance evaluation technology     99.541 [27]
Improved fuzzy ARTMAP method                                               89.382 [27]
Fuzzy ARTMAP method + modified distance evaluation technology              91.185 [27]
Fuzzy ARTMAP method                                                        79.228 [27]
Support vector machine                                                     91.67
Ensemble-based incremental support vector machines                         98.33
time domain and frequency domain statistical features, and then selects the optimal
feature on this basis, thereby improving the accuracy of fault diagnosis.

2.5.4.4 Comparative Analysis of Experiment III

In order to further verify the generalization performance of the ensemble-based incremental support vector machines, experiment III includes multiple fault types as well as different fault degrees of the same fault type: normal, rolling element fault, outer ring fault, and four inner ring faults of different degrees (minor, 0.18 mm; medium, 0.36 mm; serious, 0.53 mm; severe, 0.71 mm). The experimental data are shown in Table 2.22. Because identifying different fault degrees of the same fault type is very difficult for many signal analysis methods based on extracting fault characteristic frequencies, this experiment is used to further verify the intelligent diagnosis ability of the ensemble-based incremental support vector machines for different fault degrees of the same fault type.

Table 2.22 The description of experimental data including seven types of fault states

Bearings             Fault location          Fault diameter (mm)   Motor speed (r/min)   Motor load (kW)   Label
Drive end bearings   Normal                  –                     1772                  0.74              1
                     Outer ring fault        0.36                  1772                  0.74              2
                     Inner ring fault        0.18                  1772                  0.74              3
                     Inner ring fault        0.36                  1772                  0.74              4
                     Inner ring fault        0.53                  1772                  0.74              5
                     Inner ring fault        0.71                  1772                  0.74              6
                     Rolling element fault   0.36                  1772                  0.74              7

The time domain waveforms of the bearing vibration signals under the seven states are shown in Fig. 2.37, and the corresponding frequency spectra are shown in Fig. 2.38. It is difficult to distinguish these seven states from the time domain signals and frequency spectra alone, so other methods are needed for further analysis. The fault diagnosis process based on the ensemble-based incremental support vector machines is consistent with [28]: the vibration signals under each fault state are intercepted every 1024 points to form 80 samples, of which 50% are used as training samples and the remainder as testing samples. The time domain and frequency domain statistical features of each sample signal are extracted as shown in Table 2.8, and then the proposed ensemble-based incremental support vector machines are used for fault diagnosis.
For the states described in Table 2.22 (normal, rolling element fault, outer ring fault, and four inner ring faults of different degrees: 0.18 mm minor, 0.36 mm medium, 0.53 mm serious, and 0.71 mm severe), Reference [28] extracts nine time domain statistical features (mean, root mean square, variance, skewness, kurtosis, peak index, margin index, waveform index, pulse index), eight frequency domain statistical features, and the first-order continuous wavelet grey moment feature, and uses a fuzzy ARTMAP network model based on feature-weight learning to identify bearing fault states containing multiple fault types and different fault degrees of the same fault type. Table 2.23 shows the experimental comparison between the fault diagnosis method based on the ensemble-based incremental support vector machines, the general support vector machine, and the three fuzzy ARTMAP methods proposed in [27, 28].
Among these methods, the ensemble-based incremental support vector machines achieve the highest diagnosis accuracy of 96.42%. The experimental results show that, when faced with the complicated problem of recognizing different fault degrees of the same fault type, the fault diagnosis ability of the general support vector machine is limited, whereas the ensemble-based incremental support vector machines improve the generalization ability of the support vector machine within the machine learning framework of ensemble theory and reinforcement theory. This enhances the ability to identify multiple incipient fault states of rolling bearings, including different fault degrees of the same fault type, and the experimental results are better than those of the three fuzzy-ARTMAP-based methods.

2.5.5 The Application in Electrical Locomotive Rolling Bearing Compound Fault Diagnosis

The experimental testing process and platform for electric locomotive rolling bearings are described in Sect. 2.3.4, and the experimental data are described in detail in Sect. 2.4.3. The vibration signal of each state was segmented into samples of 2048 sampling points, giving a total of 32 samples. The time-domain and

(a) Normal

(b) 0.36mm outer ring fault

(c) 0.18mm inner ring fault

(d) 0.36mm inner ring fault

(e) 0.53mm inner ring fault

(f) 0.71mm inner ring fault

(g) 0.36mm rolling element fault

Fig. 2.37 The time domain vibration signals of rolling bearings in seven types of fault states

(a) Normal

(b) 0.36mm outer ring fault

(c) 0.18mm inner ring fault

(d) 0.36mm inner ring fault

(e) 0.53mm inner ring fault

(f) 0.71mm inner ring fault

(g) 0.36mm rolling element fault

Fig. 2.38 The frequency domain vibration signals of rolling bearings in seven types of fault states

Table 2.23 The comparison of fault diagnosis results of rolling bearings in seven types of states

Bearing fault states (fault diameter, mm): (1) normal; (2) outer ring fault (0.36); (3) inner ring fault (0.18); (4) inner ring fault (0.36); (5) inner ring fault (0.53); (6) inner ring fault (0.71); (7) rolling element fault (0.36)

Method                                                                     Accuracy (%)
Improved fuzzy ARTMAP method                                               77.551 [27]
Improved fuzzy ARTMAP method + modified distance evaluation technology     84.898 [27]
Improved fuzzy ARTMAP method + feature weight learning                     87.302 [28]
Support vector machine                                                     89.29
Ensemble-based incremental support vector machines                         96.42

frequency-domain statistical characteristics of each sample were extracted according to Table 2.7. Sixteen samples were used to train the support vector machines, and the remaining 16 samples were used for testing. The fault types of the locomotive rolling bearing samples were identified using the ensemble-based incremental support vector machines. The experimental results are shown in Table 2.24, where SVM1 denotes the proposed support vector machine parameter optimization method based on the ant colony optimization algorithm, SVM2 denotes the fused feature selection and parameter optimization method of the support vector machine based on the ant colony optimization algorithm, and EISVM denotes the ensemble-based incremental support vector machines. The results obtained by optimizing the features and parameters of the ensemble-based incremental support vector machines with the ant colony optimization algorithm are denoted EISVM*.
The results show that the support vector machine parameter optimization method based on the ant colony optimization algorithm (SVM1) achieves 89.58% fault diagnosis accuracy on the locomotive bearings with the optimized parameters

Table 2.24 Comparison of experimental results of compound fault diagnosis of locomotive bearings based on support vector machines

Method    Accuracy of each fault type (%)                                            Average accuracy (%)
          N      O      S      I      R      OI     OR     IR     OIR
SVM1      100    100    87.5   100    100    75     75     75     93.75     89.58
SVM2      100    93.75  93.75  100    100    87.5   93.75  93.75  100       95.83
EISVM     100    100    93.75  100    100    87.5   93.75  93.75  100       96.53
EISVM*    100    100    100    100    100    100    93.75  100    100       99.31

(C = 57.60, σ = 64.26). The fused feature selection and parameter optimization method of the support vector machine based on the ant colony optimization algorithm (SVM2) improves the diagnostic ability of the support vector machine by simultaneously selecting the optimal features from F1 to F18 and the optimal parameters (C = 1.02, σ = 0.04), thus obtaining 95.83% accuracy. The ensemble-based incremental support vector machines (EISVM) proposed in this section improve the generalization performance of a single support vector machine at the level of algorithm construction, achieving 96.53% accuracy in the fault diagnosis of locomotive rolling bearings. Further using the ant colony optimization algorithm to simultaneously optimize the features and parameters of the ensemble-based incremental support vector machines (EISVM*) further improves their generalization ability, raising the average fault diagnosis accuracy to 99.31%.
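The book performs the joint feature and parameter optimization with an ant colony algorithm; reproducing ACO would exceed a short sketch, so the illustration below uses a plain random search over the same joint space (feature subset, C, σ) purely to show the idea. The dataset, search ranges, and scoring are our own illustrative choices, not the book's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in features: only feature 0 is informative.
X = rng.standard_normal((80, 10))
y = (X[:, 0] + 0.1 * rng.standard_normal(80) > 0).astype(int)

best = (-1.0, None)
for _ in range(30):
    mask = rng.random(10) < 0.5              # candidate feature subset
    if not mask.any():
        continue                             # skip empty subsets
    C = 10 ** rng.uniform(-1, 2)             # penalty parameter
    gamma = 10 ** rng.uniform(-3, 0)         # RBF width (1 / 2*sigma^2)
    score = cross_val_score(SVC(C=C, gamma=gamma), X[:, mask], y, cv=3).mean()
    if score > best[0]:
        best = (score, (mask, C, gamma))     # keep the best joint choice
best_score, (best_mask, best_C, best_gamma) = best
```

An ACO search would replace the uniform random draws with pheromone-guided sampling of the same (mask, C, σ) space, but the fitness evaluation by cross-validated SVM accuracy is the same.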
For the three common simple fault types (normal state, inner ring fault, and rolling element fault), the support vector machine parameter optimization method based on the ant colony optimization algorithm (SVM1) proposed in Sect. 2.3, the fused feature selection and parameter optimization method of the support vector machine based on the ant colony optimization algorithm (SVM2) proposed in Sect. 2.4, and the ensemble-based incremental support vector machines (EISVM) proposed in this section all achieve 100% identification, which shows that these three support vector machine methods perform equally well on simple fault types. For the other, more complicated fault states (such as serious outer ring fault, minor outer ring fault, compound fault of outer ring and inner ring, compound fault of outer ring and rolling element, and compound fault of outer ring, inner ring, and rolling element), the ensemble-based incremental support vector machines, by improving the generalization ability of a single support vector machine, can effectively further improve the accuracy of identifying compound faults and different damage degrees of the outer ring.
The results of rolling bearing fault diagnosis for electric locomotives show that the ensemble-based incremental support vector machines improve the generalization performance of support vector machines through both the machine learning theory system and the algorithm construction. The method was applied to the field of mechanical fault diagnosis, using the experimental data of the CWRU bearing data center in the United States as a standard experiment for validation. Finally, the method was applied to the fault diagnosis of locomotive rolling bearings involving various compound faults, and it can effectively identify multiple fault types, including compound faults and different fault degrees of the same fault type. The results show that:
The method of ensemble-based incremental support vector machines is based on the theories of ensemble learning and reinforcement learning. By fully mining the knowledge contained in the limited sample space, it improves the generalization performance of support vector machines from the perspectives of machine learning theory and algorithm construction.
The ensemble-based incremental support vector machines can effectively identify multiple types of incipient faults and different damage degrees of rolling bearings.

Through three bearing fault testing cases from the CWRU bearing data center in the United States, under the same experimental parameters and process conditions, comparison with other methods shows that applying the ensemble-based incremental support vector machines to incipient fault diagnosis of rolling bearings yields satisfactory results: multiple incipient fault types of rolling bearings and different damage degrees of the same fault type can be effectively identified.
The generalization performance of support vector machines depends strongly on their parameters and on the sample features. Therefore, the fused feature selection and parameter optimization method based on the ant colony optimization algorithm is applied to the ensemble-based incremental support vector machines to further improve their diagnostic ability for complicated fault types, including compound faults and different damage degrees of the same fault type. The research results show that the ensemble-based incremental support vector machines improve the generalization performance of a single support vector machine and can effectively identify various compound faults and different damage degrees of the same fault type in locomotive rolling bearings.

References

1. Vapnik, V.: The Nature of Statistic Learning (in Chinese). Tsinghua University Press, Beijing
(2000)
2. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
3. Mjolsness, E., DeCoste, D.: Machine learning for science: state of the art and future prospects.
Science 293(5537), 2051–2055 (2001)
4. Bian, Z., Zhang, X., et al.: Pattern Recognition (in Chinese), 2nd edn., pp. 296–301. Tsinghua
University Press, Beijing (2000)
5. Yang, Y., Yu, D.J., Cheng, J.S.: A fault diagnosis approach for roller bearing based on IMF
envelope spectrum and SVM. Measurement 40(9–10), 943–950 (2007)
6. Cheng, J.S., Yu, D.J., Yang, Y.: A fault diagnosis approach for gears based on IMF AR model
and SVM. EURASIP J. Adv. Signal Process. (2008)
7. Saravanan, N., Siddabattuni, V.N.S.K., Ramachandran, K.I.: A comparative study on classifi-
cation of features by SVM and PSVM extracted using Morlet wavelet for fault diagnosis of
spur bevel gear box. Expert Syst. Appl. 35(3), 1351–1366 (2008)
8. Poyhonen, S., Arkkio, A., Jover, P., et al.: Coupling pairwise support vector machines for fault
classification. Control Eng. Pract. 13(6), 759–769 (2005)
9. Chu, F.L., Yuan, S.F.: Fault diagnosis based on support vector machines with parameter opti-
misation by artificial immunisation algorithm. Mech. Syst. Signal Process. 21(3), 1318–1330
(2007)
10. Sun, C., Liu, L., Liu, C., et al.: Boosting-SVM based aero engine fault diagnosis (in Chinese).
J. Aerosp. Power 11(25), 2584–2588 (2010)
11. Zhu, Z., Liu, W.: Fault diagnosis of marine diesel engine based on support vector machine (in
Chinese). Ship Eng. 5(28), 31–33 (2006)
12. Chapelle, O., Vapnik, V., Bousquet, O., et al.: Choosing multiple parameters for support vector
machines. Mach. Learn. 46(1–3), 131–159 (2002)
13. Colorni, A., Dorigo, M., Maniezzo, V.: Distributed optimization by ant colonies. In: Proceed-
ings of the First European Conference on Artificial Life, vol. 142 (1991)
14. Samrout, M., Kouta, R., Yalaoui, F., et al.: Parameter’s setting of the ant colony algorithm
applied in preventive maintenance optimization. J. Intell. Manuf. Autom. Technol. 18, 663–677
(2007)

15. Duan, H.B., Wang, D.B., Yu, X.F.: Research on the optimum configuration strategy for the
adjustable parameters in ant colony algorithm. J. Commun. Comput. 2(9), 32–35 (2005)
16. Chen, C.-W.: Modeling, control, and stability analysis for time-delay TLP systems using the
fuzzy Lyapunov method. Neural Comput. Appl. 20(4), 527–534 (2011)
17. Adankon, M.M., Cheriet, M.: Optimizing resources in model selection for support vector
machine. Pattern Recogn. 40(3), 953–963 (2007)
18. Friedrichs, F., Igel, C.: Evolutionary tuning of multiple SVM parameters. Neurocomputing 64,
107–117 (2005)
19. Rényi, A.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561 (1961)
20. Polikar, R., Topalis, A., Parikh, D., et al.: An ensemble based data fusion approach for early
diagnosis of Alzheimer’s disease. Inf. Fusion 9(1), 83–95 (2008)
21. Sun, W.X., Chen, J., Li, J.Q.: Decision tree and PCA-based fault diagnosis of rotating
machinery. Mech. Syst. Signal Process. 21(3), 1300–1317 (2007)
22. Parikh, R., Polikar, R.: An ensemble-based incremental learning approach to data fusion. IEEE
Trans. Syst. Man Cybern. Part B (Cybern.) 32(2), 437–450 (2007)
23. Dietterich, T.G.: Machine learning research: four current directions. AI Mag. 18(4), 97–136
(1997)
24. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998); Wang,
T., Song, G., Liu, S., et al.: Review of bolted connection monitoring. Int. J. Distrib. Sens. Netw.
2013, 1–8 (2013)
25. Bearing Data Center Seeded Fault Test Data. The Case Western Reserve University Bearing
Data Center Website. http://csegroups.case.edu/bearingdatacenter/pages/welcome-case-wes
tern-reserve-university-bearing-data-center-website
26. Li, Z., He, Z.J., Zi, Y.Y., et al.: Rotating machinery fault diagnosis using signal-adapted lifting
scheme. Mech. Syst. Signal Process. 22(3), 542–556 (2008)
27. Xu, Z.B., Xuan, J.P., Shi, T.L., et al.: A novel fault diagnosis method of bearing based on
improved fuzzy ARTMAP and modified distance discriminant technique. Expert Syst. Appl.
36(9), 11801–11807 (2009)
28. Xu, Z.B., Xuan, J.P., Shi, T.L., et al.: Application of a modified fuzzy ARTMAP with feature-
weight learning for the fault diagnosis of bearing. Expert Syst. Appl. 36(6), 9961–9968 (2009)
Chapter 3
Semi-supervised Learning Based
Intelligent Fault Diagnosis Methods

3.1 Semi-supervised Learning

Machine learning requires large amounts of labeled training data as input to improve the generalization of supervised learning. However, labeled data are much harder to obtain than unlabeled data, especially in the field of fault diagnosis. Unsupervised learning, by contrast, learns automatically and does not require class labels for the training data; but without supervised information, the trained model is often not accurate enough, and the consistency and generalization of the learning results may fail to meet practical requirements.
Semi-supervised learning sits between supervised and unsupervised learning. Its key question is how to exploit the structure of unlabeled data and the automatic learning ability it affords, and how to design algorithms that combine the features of labeled and unlabeled data. The emphasis of semi-supervised learning is not on particular methods but on the mechanism by which supervised and unsupervised samples are learned collaboratively. Semi-supervised classification can thus be regarded as a classification algorithm that incorporates labeled data into a body of unlabeled data to perform classification and recognition tasks [1].
Semi-supervised learning focuses primarily on how to obtain a learning machine with state-of-the-art performance and generalization ability when part of the training data is deficient: class labels are missing, noisy data appear, or feature dimensions of the data are lost. The theory of semi-supervised learning also offers important guidance for understanding deeper issues in machine learning, such as the relationship between the data manifold and class labels, the proper handling of missing data, the effective use of labeled data, the relationship between supervised and unsupervised learning, and the design of active learning algorithms.

© National Defense Industry Press 2023

W. Li et al., Intelligent Fault Diagnosis and Health Assessment for Complex Electro-Mechanical Systems, https://doi.org/10.1007/978-981-99-3537-6_3
3.2 Fault Detection and Classification Based on Semi-supervised Kernel Principal Component Analysis

3.2.1 Kernel Principal Component Analysis

Principal component analysis (PCA) is a commonly used feature extraction method that obtains the principal component variables carrying the most variation information of the original data, thereby realizing feature extraction for complex system information. Kernel principal component analysis (KPCA) introduces the kernel method into PCA: it maps the input data to a high-dimensional feature space and extracts nonlinear features by performing linear PCA in that space.
PCA is a linear method based on Gaussian statistical assumptions: each principal component is a linear combination of the original variables. Linear PCA, however, cannot effectively extract the nonlinear features of mechanical fault signals, which in turn affects the accuracy of fault diagnosis; nonlinear PCA is therefore required. The main differences between linear and nonlinear PCA are as follows. First, nonlinear PCA introduces a nonlinear function to map the original variables into nonlinear principal components. Second, a linear principal component is a linear combination of the original variables that minimizes the sum of distances from the data points to the line it represents, whereas nonlinear PCA minimizes the sum of distances from the data points to the curve or surface it represents.
KPCA offers a new route to solving nonlinear problems via the kernel trick: nonlinear problems in a low-dimensional space are mapped to linear problems in a high-dimensional space, extending PCA to the nonlinear field. The method maps the input data matrix X into a high-dimensional feature space F through a pre-selected nonlinear mapping and obtains more separable input data. Linear PCA is then used to analyze the mapped data in the high-dimensional space and obtain the nonlinear principal components of the input data. The nonlinear map is implemented through inner product operations: only the kernel function corresponding to the inner product in the original space needs to be computed, without attending to the specific form of the nonlinear mapping.
Feature extraction for any test dataset Z can be achieved by computing the projections of the mapped data matrix ϕ(Z) onto the eigenvectors of the normalized correlation coefficient matrix.
The KPCA method can be summarized by the following steps:
(1) Select the training dataset {x_i}_{i=1}^{M} and the test dataset {z_i}_{i=1}^{N}.
(2) Compute the kernel matrix K by K_ij = (ϕ(x_i) · ϕ(x_j)), where the dimension of K is M × M.
(3) Normalize the kernel matrix K by K̃ = K − 1_M K − K 1_M + 1_M K 1_M, where (1_M)_ij = 1/M (i, j = 1, 2, ..., M).
(4) Compute the eigenvalues λ̃ and eigenvectors α̃ from the eigenequation λ̃ α̃ = K̃ α̃.
(5) Compute the normalized eigenvectors α̃^k from the condition λ̃_k (α̃^k · α̃^k) = 1.
(6) Compute the test kernel matrix K^test by K^test_ij = (ϕ(z_i) · ϕ(x_j)), where the dimension of K^test is N × M.
(7) Normalize the test kernel matrix by K̃^test = K^test − 1_N K − K^test 1_M + 1_N K 1_M, where 1_N is the N × M matrix with entries 1/M.
(8) Extract the kth nonlinear principal component F^k_test by F^k_test = K̃^test α̃^k.
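To make the steps concrete, here is a minimal NumPy sketch of the procedure. The Gaussian (RBF) kernel, its width `gamma`, and the function names are illustrative assumptions; the text does not prescribe a particular kernel.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the row samples of A and B (an assumed kernel choice)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kpca_fit_transform(X, Z, gamma=1.0, n_components=2):
    """Steps (1)-(8): fit KPCA on the training set X (M x d), project the test set Z (N x d)."""
    M = X.shape[0]
    K = rbf_kernel(X, X, gamma)                            # step (2)
    one_M = np.full((M, M), 1.0 / M)
    K_c = K - one_M @ K - K @ one_M + one_M @ K @ one_M    # step (3): kernel centering
    lam, alpha = np.linalg.eigh(K_c)                       # step (4): eigen-decomposition
    idx = np.argsort(lam)[::-1][:n_components]             # keep the largest eigenvalues
    lam, alpha = lam[idx], alpha[:, idx]
    alpha = alpha / np.sqrt(lam)                           # step (5): lambda_k * (a . a) = 1
    K_t = rbf_kernel(Z, X, gamma)                          # step (6)
    one_NM = np.full((Z.shape[0], M), 1.0 / M)
    K_tc = K_t - one_NM @ K - K_t @ one_M + one_NM @ K @ one_M  # step (7)
    return K_tc @ alpha                                    # step (8): nonlinear principal components
```

Projecting the training set itself through the test-set formulas yields zero-mean features, since the centered kernel matrix has zero column sums.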

3.2.2 Semi-supervised Kernel Principal Component Analysis

Semi-supervised learning emphasizes the fusion of labeled and unlabeled data to improve the performance of the learning machine. In KPCA, the kernel function is used to achieve nonlinear feature extraction, but the lack of prior information about the different pattern types during diagnosis undermines the reliability of fault detection and diagnosis. In this section we therefore construct the kernel function in a semi-supervised pattern for fault diagnosis.

3.2.2.1 Separability Index

(1) Intraclass distance and interclass distance for the separability criterion

KPCA is a feature extraction method, so quantitative indicators and criteria are required to evaluate the effectiveness of the extracted features for classification. Generally, the accuracy rate of the classifier is used to evaluate the effectiveness of classification; however, computing the accuracy rate requires a large amount of prior information and labeled data. It is therefore necessary to introduce criteria that evaluate the quality of the feature extraction method directly.
Feature samples located in different areas of the feature space correspond to different fault patterns; consequently, the samples of different classes are separable. If the interclass scatter is large and the intraclass scatter is small in the sample clustering process, the samples are well separable and the clustering effect of KPCA is excellent. It is easy to see that the distances between sample points reflect the separability of the sample classes.
The distance between classes is first studied under the assumption that there are two sample classes, ω_1 and ω_2. Any point in ω_1 has a distance to every point in ω_2, and the average obtained by summing these pairwise distances represents the distance between the two classes.
For clustering with multiple classes, let x_k^(i) and x_l^(j) be the D-dimensional feature vectors of class ω_i and class ω_j, and let δ(x_k^(i), x_l^(j)) be the distance between the two feature vectors. The average distance between all feature vectors is:

J_d(x) = (1/2) Σ_{i=1}^{c} Σ_{j=1}^{c} P_i P_j (1/(n_i n_j)) Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} δ(x_k^(i), x_l^(j))   (3.1)

where c is the number of classes, n_i and n_j are the numbers of samples in classes ω_i and ω_j, and P_i and P_j are the prior probabilities of the corresponding classes. When the prior probability is unknown, it can be estimated from the training sample data:
P̃_i = n_i / n   (3.2)
There are multiple distance measures for computing δ(x_k^(i), x_l^(j)) between two vectors in a multidimensional space. Here we mainly use the Euclidean distance to evaluate δ(x_k^(i), x_l^(j)):

δ(x_k^(i), x_l^(j)) = (x_k^(i) − x_l^(j))^T (x_k^(i) − x_l^(j))   (3.3)

Let m_i denote the mean vector of the ith class:

m_i = (1/n_i) Σ_{k=1}^{n_i} x_k^(i)   (3.4)

Let m denote the mean vector of all classes:

m = Σ_{i=1}^{c} P_i m_i   (3.5)

By substituting Eqs. (3.4) and (3.5) into Eq. (3.1), the result is:

J_d(x) = Σ_{i=1}^{c} P_i [(1/n_i) Σ_{k=1}^{n_i} (x_k^(i) − m_i)^T (x_k^(i) − m_i) + (m_i − m)^T (m_i − m)]   (3.6)

where (m_i − m)^T (m_i − m) denotes the squared distance between the ith class mean vector and the population mean vector. After prior-probability weighting, it represents the average squared distance between the mean vectors of all classes:

Σ_{i=1}^{c} P_i (m_i − m)^T (m_i − m) = (1/2) Σ_{i=1}^{c} Σ_{j=1}^{c} P_i P_j (m_i − m_j)^T (m_i − m_j)   (3.7)
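Equation (3.7) is an algebraic identity (given Σ P_i = 1 and m = Σ P_i m_i), which can be checked numerically; the class counts and distributions below are arbitrary synthetic data chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = [rng.normal(loc=i, size=(10 + 5 * i, 2)) for i in range(3)]  # 3 synthetic classes
n = sum(len(s) for s in samples)
P = [len(s) / n for s in samples]             # priors estimated as in Eq. (3.2)
means = [s.mean(axis=0) for s in samples]     # class means, Eq. (3.4)
m = sum(p * mu for p, mu in zip(P, means))    # overall mean, Eq. (3.5)

# left-hand side: prior-weighted squared distances of class means to the overall mean
lhs = sum(p * (mu - m) @ (mu - m) for p, mu in zip(P, means))
# right-hand side: half of the prior-weighted pairwise squared distances between class means
rhs = 0.5 * sum(pi * pj * (mi - mj) @ (mi - mj)
                for pi, mi in zip(P, means) for pj, mj in zip(P, means))
assert np.isclose(lhs, rhs)
```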

Accordingly, define the interclass scatter matrix S̃_b and the intraclass scatter matrix S̃_ω:

S̃_b = Σ_{i=1}^{c} P_i (m_i − m)(m_i − m)^T   (3.8)

S̃_ω = Σ_{i=1}^{c} P_i (1/n_i) Σ_{k=1}^{n_i} (x_k^(i) − m_i)(x_k^(i) − m_i)^T   (3.9)

so that J_d(x) can be written as:

J_d(x) = tr(S̃_ω + S̃_b)   (3.10)

The above derivation is based on a finite number of samples, where m_i and m denote the mean of the ith class and the mean of all classes, and S̃_b and S̃_ω denote the interclass scatter and the intraclass scatter. The corresponding expectation-based formulations are as follows:

μ_i = E_i[x]   (3.11)

μ = E[x]   (3.12)

S_b = Σ_{i=1}^{c} P_i (μ_i − μ)(μ_i − μ)^T   (3.13)

S_ω = Σ_{i=1}^{c} P_i E_i[(x − μ_i)(x − μ_i)^T]   (3.14)

The mean square distance of all classes can also be defined as:

J_d(x) = tr(S_ω + S_b)   (3.15)
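As a sanity check, the finite-sample scatter matrices of Eqs. (3.8)-(3.9) and the trace form can be evaluated on synthetic labeled data; the two classes below are arbitrary, and J_d is computed both directly from Eq. (3.6) and as the trace of the scatter matrices, as in Eq. (3.15).

```python
import numpy as np

rng = np.random.default_rng(2)
samples = [rng.normal(loc=3 * i, size=(8, 3)) for i in range(2)]  # two synthetic classes
n = sum(len(s) for s in samples)
P = [len(s) / n for s in samples]
means = [s.mean(axis=0) for s in samples]
m = sum(p * mu for p, mu in zip(P, means))

Sb = sum(p * np.outer(mu - m, mu - m) for p, mu in zip(P, means))        # Eq. (3.8)
Sw = sum(p * np.mean([np.outer(x - mu, x - mu) for x in s], axis=0)
         for p, mu, s in zip(P, means, samples))                          # Eq. (3.9)

# Jd computed directly from Eq. (3.6)
Jd = sum(p * (np.mean(((s - mu) ** 2).sum(axis=1)) + (mu - m) @ (mu - m))
         for p, mu, s in zip(P, means, samples))
assert np.isclose(Jd, np.trace(Sw + Sb))   # the trace form, as in Eq. (3.15)
```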

(2) Separability index

A distance metric criterion can be obtained from Eq. (3.15):

J_1(x) = tr(S_ω + S_b)   (3.16)

To improve the KPCA effect, the interclass scatter should be as large as possible and the intraclass scatter as small as possible. Therefore, the following criteria are proposed:

J_2 = tr(S_ω^{−1} S_b)   (3.17)

J_3 = ln(|S_b| / |S_ω|)   (3.18)

J_4 = ln(tr S_b / tr S_ω)   (3.19)

J_5 = |S_b + S_ω| / |S_ω|   (3.20)

The actual diagnosis working condition is mostly a few-shot case, for which the quantities in J_d(x) can be obtained directly by calculating the sample points. Therefore, the separability criterion J_bw(x) can be constructed by combining the above criteria J_1–J_5:

J_bw(x) = S_cb / S_cw = [Σ_{i=1}^{c} P_i (m_i − m)^T (m_i − m)] / [Σ_{i=1}^{c} P_i (1/n_i) Σ_{k=1}^{n_i} (x_k^(i) − m_i)^T (x_k^(i) − m_i)]   (3.21)

where S_cw is the intraclass scatter index, denoting the mean intraclass distance, and S_cb is the interclass scatter index, denoting the mean interclass distance.
The separability criterion J_bw(x) is normalized to obtain the separability evaluation index J_b(x):

J_b(x) = S_cb / (S_cb + S_cw)   (3.22)

J_b(x), which lies in the interval [0, 1], denotes the similarity of all samples. J_b(x) = 0 means that all samples belong to the same class, i.e., the interclass mean distance is zero; conversely, J_b(x) = 1 means that every sample belongs to a different class, i.e., the intraclass mean distance is zero.
A larger J_b means tighter intraclass aggregation, a larger interclass mean distance, and better separability of the clustered samples. J_b is important for feature extraction: it can quickly measure the effectiveness of feature extraction under few-shot conditions, effectively guide the selection of feature indices in pattern recognition and classification, and support a reasonable setting of the kernel function parameters.
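A direct implementation of Eqs. (3.21)-(3.22) is straightforward; this sketch assumes NumPy and a labeled feature matrix, and the function name is ours:

```python
import numpy as np

def separability_index(features, labels):
    """Jb = Scb / (Scb + Scw), per Eqs. (3.21)-(3.22); features: (n, D), labels: (n,)."""
    classes = np.unique(labels)
    n = len(labels)
    P, means, groups = [], [], []
    for c in classes:
        Xc = features[labels == c]
        P.append(len(Xc) / n)              # prior estimated as in Eq. (3.2)
        means.append(Xc.mean(axis=0))      # class mean, Eq. (3.4)
        groups.append(Xc)
    m = sum(p * mu for p, mu in zip(P, means))   # overall mean, Eq. (3.5)
    Scb = sum(p * (mu - m) @ (mu - m) for p, mu in zip(P, means))
    Scw = sum(p * np.mean(((Xc - mu) ** 2).sum(axis=1))
              for p, mu, Xc in zip(P, means, groups))
    return Scb / (Scb + Scw)
```

A J_b close to 1 indicates tight, well-separated classes; J_b = 0 indicates that all samples fall in a single class.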

3.2.2.2 Nearest Neighbor Function Rule Algorithms and Feature Classification

Nearest neighbor function rule algorithms [2], which are based on similarity measurement rules, can categorize samples according to the clustering distribution of the feature data. After KPCA processing, the original samples present different distributions of feature points on the feature surface. Nearest neighbor function rules can obtain clear detection and classification results by effectively classifying the clustered samples.

1. Unsupervised category separation method


In pattern classification, the learning machine can often only be trained on samples without class labels, either because prior class label information is lacking or because labeling samples is difficult under practical working conditions. This is the motivation for unsupervised learning. Unsupervised learning falls into two categories: direct methods based on the probability density function, which decompose a mixture of probability densities into a number of subsets, each corresponding to one class; and indirect clustering methods based on similarity measures between samples, which divide the sample set into subsets such that some criterion function describing the quality of the clustering is maximized. Distance is usually used as the similarity measure between samples.
Iterative dynamic clustering algorithms are commonly used in indirect clustering methods. They have the following three main features:
(1) Select some distance measurement as the similarity measurement of the sample.
(2) Determine a certain criterion function to evaluate the quality of clustering.
(3) Given a certain initial classification, an iterative algorithm is used to find the
best clustering result which takes the extreme value of the criterion function.
Common dynamic clustering algorithms include C-means clustering algorithms,
dynamic clustering algorithms based on similarity measures of samples and kernels,
and nearest neighbor function rule algorithms.
C-means clustering algorithms take the sum of squared errors (SSE) as the clustering criterion. They classify well only when the natural distribution of the classes is spherical or nearly spherical, that is, when the variances of the components within each class are close to equal. The C-means algorithm usually performs poorly on normal distributions with elliptical shapes, where the component variances are unequal [3]. As Fig. 3.1 shows, m1 and m2 are the cluster centers of class 1 and class 2 respectively, but because of this defect of C-means clustering, point A is classified into class 2.
Dynamic clustering algorithms based on similarity measures of samples and kernels can fix this shortcoming of C-means clustering. The algorithm defines a kernel K_j = K(y, V_j) to represent the class Γ_j, then judges whether a sample y belongs to class Γ_j by establishing a measurement function Δ(y, K_j) between sample points y and the kernel K_j. This enables the clustering results to fit a priori assumed data structures of different shapes. However, the algorithm has trouble clustering samples when the form of the defining kernel function cannot be determined or cannot be represented by a simple function. For data structures of several different shapes, as in Fig. 3.2, such algorithms often fail to select suitable kernel functions, and the clustering results obtained are still hardly satisfactory.
2. Nearest neighbor function rule algorithms

Fig. 3.1 Classification effect of C-means clustering algorithms with elliptical distribution

Fig. 3.2 Examples of several different shapes of data construction

To solve the clustering problem in the above cases, the nearest neighbor function rule algorithms can be used for classification; the specific steps are as follows:
(1) Calculate the distance matrix Δ such that its element Δi j represents the distance
between the sample yi and y j .

Δi j = Δ( yi , y j ) (3.23)

(2) The above distance matrix is used to construct the nearest neighbor matrix M, whose element M_ij is the nearest neighbor coefficient of sample y_j with respect to y_i, i.e., the rank of y_j in the distance-ordered neighbor list of y_i. Since the number of nearest neighbors of a sample point can only take the values 1, 2, ..., N − 1, each off-diagonal element of the matrix is a positive integer.
(3) Construct the nearest neighbor function matrix L, the elements of L are:

L i j = Mi j + M ji − 2 (3.24)

where L_ij, the value of the nearest neighbor function, represents the connection relation between y_j and y_i. The diagonal entries are set to L_ii = 2N, i = 1, 2, ..., N.
(4) Using the matrix L, connect each point with the points that have the smallest values of the nearest neighbor function; this forms the initial clustering condition.
(5) For each class i obtained in step 4, calculate the parameter γ_i and compare it with α_i max and α_k max. If γ_i is less than or equal to either α_i max or α_k max, classes i and k are combined and considered connected.
(6) Repeat step 5 until no γ_i meets the above condition, then stop.
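Steps (1)-(3) can be sketched as follows; the nearest neighbor coefficient M_ij is taken as the rank of y_j in the distance-ordered neighbor list of y_i, and NumPy is assumed:

```python
import numpy as np

def nearest_neighbor_function(Y):
    """Steps (1)-(3): distances (Eq. 3.23), coefficients M, and L = M + M^T - 2 (Eq. 3.24)."""
    N = len(Y)
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))   # step (1)
    M = np.zeros((N, N), dtype=int)
    for i in range(N):
        order = np.argsort(D[i])                  # closest first; order[0] is i itself
        for rank, j in enumerate(order[1:], start=1):
            M[i, j] = rank                        # y_j is the rank-th nearest neighbor of y_i
    L = M + M.T - 2                               # step (3)
    np.fill_diagonal(L, 2 * N)                    # L_ii = 2N
    return D, M, L
```

A small L_ij (zero when two points are each other's first neighbor) marks a strong connection and drives the initial clustering of step (4).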

3.2.2.3 Semi-supervised KPCA Detection Algorithms

1. Semi-supervised abnormal detection

In fault diagnosis, it is of wide concern whether minor faults or fault trends can be detected accurately and in time by pattern recognition. Equipment and systems in industrial environments operate mostly in normal conditions, so it is difficult to obtain prior information about particular faults; there may even be no relevant sample data to train the learning machine when the equipment exhibits minor faults or weak fault trends. Supervised learning therefore faces many practical difficulties in fault diagnosis. Unsupervised detection, which enables the learning machine to detect abnormal patterns from unlabeled samples, provides the basis for condition monitoring and degradation trend analysis in industrial environments.
Abnormal detection in unsupervised mode is essentially the process of pattern
classification. This method emphasizes the separation of unknown abnormal patterns
from normal patterns for fault detection and advance warning functions.
This section on unsupervised abnormal detection is based on the following two
basic assumptions:
(1) The number of normal data in the training set far exceeds the number of abnormal
data.
(2) Abnormal data is different in nature from normal data.
Therefore, abnormal data are different from normal data in terms of both “quality”
and “quantity”.
The basic idea of unsupervised abnormal detection is to use a dataset of unknown fault types as the training and test datasets, and to map the detected data to feature points in the feature space using the selected algorithm. The detection boundary is then determined according to the distribution characteristics of the feature points, and the points in sparse areas of the feature space are marked as abnormal data. Because the method maps data from the original space into the feature space, it is usually unable to determine the probability distribution of the points to be detected; therefore, the feature points in sparsely distributed (i.e., low density) areas of the feature space are identified as abnormal data.
The first step is to map the original space, composed of all the data elements of a known dataset, into the feature space. This step can be difficult because the high dimensionality of the feature space easily leads to a "dimensional catastrophe". The kernel function uses the data elements in the original space to directly calculate the inner product between two points in the feature space. The kernel method does not require knowing the specific form of the mapping function ϕ(x), and thus the dimension n of the input space does not affect the kernel matrix [4]. The use of kernel functions avoids the "dimensional catastrophe" and greatly reduces the computational effort. Therefore, the kernel method can perform the feature mapping task in unsupervised abnormal detection.
Another task of unsupervised abnormal detection is to determine the detection
boundaries in the feature space, which can be done by different unsupervised learning
algorithms such as the k-Means clustering algorithm and support vector machine
algorithm, etc. But every coin has two sides. On the one hand, unsupervised methods
take longer to learn because the training data are mostly high-dimensional. On the
other hand, unsupervised abnormal detection is often less effective than supervised
abnormal detection due to the lack of a priori knowledge to guide it.
2. Improved nearest neighbor function rule algorithms

The nearest neighbor function rule algorithms are widely used for unsupervised classification. However, they cannot be applied directly to pattern detection because of their own limitations.
In step 5 of the algorithm, the parameter γ_i is calculated for each class i obtained in step 4 (refer to Eq. 3.23) and compared with the maximum connection loss α_i max between two points in class ω_i and the maximum connection loss α_k max between two points in class ω_k, to judge whether a connection should be constructed.
The calculation of the connection loss is based on the nearest neighbor coefficients rather than on the distances between the actual feature points, which may lead to the situation in Fig. 3.3. As shown in the figure, ω_i is a cluster group with many samples, formed by several successive "connections" of initial clusters. Obviously, ω_i and ω_k are two different classes of samples. The two farthest points in class ω_i are points 1 and 2, so the maximum connection loss is α_i max = α_12 = 32. Similarly, the maximum connection loss between points 3 and 4 in class ω_k is α_k max = α_34 = 6. The value of the nearest neighbor function between ω_i and ω_k is γ_i = α_23 = 21. Because γ_i < α_i max, the algorithm merges ω_i and ω_k into one class, which is clearly wrong. The mistake arises from the difference in the number of samples between the clusters: when establishing the "connection", the nearest neighbor function rule algorithms consider mainly the nearest neighbor coefficients and ignore the actual distance between clusters.
Fig. 3.3 Incorrectly connected two different classes (α_i max = 32, α_k max = 6, γ_i = 21)

In the detection process, the nearest neighbor function rule algorithms are likely to cause a wrong "connection" and classify the abnormal clustering points into the normal class when the number of normal data far exceeds the number of abnormal data. In turn, this leads to detection errors.
After the initial clusters are formed in step 4 of the nearest neighbor function rule algorithms, each cluster group has its own cluster center, and the actual distance between clusters can be measured by the distance between cluster centers. Starting from this, the algorithms are improved by applying the concepts of nearest neighbor coefficient and nearest neighbor function to the cluster centers. The nearest neighbor functions of the cluster centers are analyzed to measure the similarity between the initial clusters and to judge which cluster group differs most from the majority. The identified clusters are discriminated as abnormal clusters, whose sample points are abnormal data. In this way abnormal detection is achieved.
The improved nearest neighbor function rule algorithms proceed as follows. Steps (1) to (4) are the same as in the nearest neighbor function rule algorithms above.
(5) Calculate the coordinates of each initial cluster center and the distance matrix Δc containing the cluster center information; its element Δcij denotes the distance between cluster centers ci and cj.

Δci j = Δ(ci , c j ) (3.25)

(6) The distance matrix Δc is adopted to construct the nearest neighbor matrix M c ,
and its elements Mci j denote the value of the nearest neighbor coefficient of the
clustering center c j to the clustering center ci .
(7) Construct the nearest neighbor function matrix L c , the elements of L c is:

L ci j = Mci j + Mcji − 2 (3.26)

(8) Calculate the sum of values of the nearest neighbor functions on the ith row of
the nearest neighbor function matrix L c .

T_i = Σ_{j=1}^{n} L_cij,   i = 1, ..., n   (3.27)

where n is the number of initial cluster groups. If

T_k = max(T_i),   i = 1, ..., n   (3.28)

then the sample points in the kth cluster group are determined to be suspected abnormal data.
(9) Assume that the sums of the nearest neighbor function values T_i follow a normal distribution, and that all initial clusters' T_i (except T_k) belong to the same normal distribution. Calculate the mean μ and variance σ²:

μ = (1/(n − 1)) Σ_{i=1, i≠k}^{n} T_i   (3.29)

σ² = (1/(n − 1)) Σ_{i=1, i≠k}^{n} (T_i − μ)²   (3.30)

If T_k > μ + aσ, the suspected abnormal cluster group is judged to be an abnormal cluster group, and the samples in it are abnormal data; here a is a constant coefficient. Conversely, if T_k ≤ μ + aσ, the suspected abnormal cluster group is judged to be a normal cluster group, and its samples are normal data.
The setting of the coefficient a trades off the detection rate against the false alarm rate. The detection rate is the ratio of detected abnormal data to the total number of abnormal data, while the false alarm rate is the ratio of normal data misjudged as abnormal to the total number of normal data. The detection rate reflects the accuracy of the detection model, and the false alarm rate reflects its stability. A high detection rate and a low false alarm rate are usually contradictory, so a practical detection model must strike a balance between accuracy and stability. A larger coefficient a lowers the false alarm rate but also lowers the detection rate, and a smaller a has the opposite effect. From probability theory, the samples of a normal distribution concentrate around the mean, with dispersion characterized by the standard deviation; when sampling from a normally distributed dataset, about 95% of the samples fall in the interval (μ − 2σ, μ + 2σ). We therefore initialize the coefficient a = 2 and adjust it according to the actual situation.
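Steps (5)-(9) could be sketched as below, given the centers of the initial cluster groups. The diagonal of L_c is set to zero here since the constant 2N offset would shift every row sum T_i equally; the function name and the default a = 2 are our choices:

```python
import numpy as np

def detect_abnormal_cluster(centers, a=2.0):
    """Steps (5)-(9): flag the initial cluster whose center is least similar to the rest."""
    n = len(centers)
    Dc = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))  # Eq. (3.25)
    Mc = np.zeros((n, n), dtype=int)
    for i in range(n):
        for rank, j in enumerate(np.argsort(Dc[i])[1:], start=1):
            Mc[i, j] = rank                        # nearest neighbor coefficients of the centers
    Lc = Mc + Mc.T - 2                             # Eq. (3.26)
    np.fill_diagonal(Lc, 0)                        # constant diagonal dropped (equal shift of all T_i)
    T = Lc.sum(axis=1)                             # Eq. (3.27)
    k = int(np.argmax(T))                          # Eq. (3.28): suspected abnormal cluster
    rest = np.delete(T, k)
    mu = rest.mean()                               # Eq. (3.29)
    sigma = np.sqrt(((rest - mu) ** 2).sum() / (n - 1))   # Eq. (3.30)
    return k if T[k] > mu + a * sigma else None    # abnormal cluster index, or None
```

With four tightly grouped centers and one distant outlier, the outlier's row sum T_k clearly exceeds the μ + 2σ threshold and its cluster is flagged as abnormal.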

After the nearest neighbor function matrix L_c is constructed in step 7, each row of L_c gives the nearest neighbor functions of one initial cluster group with respect to the other initial cluster groups. According to the similarity measurement rules, a large nearest neighbor function indicates that the cluster groups are far from each other, i.e., the initial clusters differ from each other. The nearest neighbor function can thus be used as an evaluation criterion to measure the similarity between the initial cluster groups, and the sum of the nearest neighbor function values in each row, T_i, represents the overall similarity of the cluster group corresponding to the ith row to the other cluster groups. From the perspective of abnormal detection, the cluster group with the lowest similarity is the class most different from the other initial clusters.
Step 8 assumes that abnormal detection is the process of detecting samples of some abnormal class among mostly normal samples, and that such abnormal samples self-cluster in the feature space after KPCA. The purpose of detection is to find abnormal states for prediction and alarming. The improved nearest neighbor function rule algorithms described above can equally handle the case where the detected samples contain multiple classes of data; however, the KPCA fault classification method is then needed to further identify the types of the various anomalous patterns.
From the above analysis, it is clear that the improved nearest neighbor function rule algorithms combine the effect of the nearest neighbor coefficients of the cluster points with the actual distances between them during categorization, and are therefore better suited to abnormal detection.
3. Semi-supervised KPCA detection algorithms
Kernel functions can be combined with different algorithms to form different kernel-based methods, and the two parts can be designed separately. Combining the fact that the output of one feature extraction process can serve as the input of another with the idea of unsupervised pattern anomaly detection, we propose a KPCA abnormal detection method based on the improved nearest neighbor function rule algorithms. The method uses the eigenvalues of the principal direction mapping derived from KPCA as the input of the improved nearest neighbor function rule algorithms, and achieves abnormal detection by using these algorithms to categorize and analyze the samples. This covers the abnormal detection process from feature mapping to the determination of the detection boundary.
Combining semi-supervised learning methods with unsupervised detection, the
limited amount of labeled sample information is incorporated into the testing process
to guide the final pattern recognition. The steps of semi-supervised KPCA detection
algorithms are as follows:
(1) Calculate the input characteristic index and set parameters of the kernel function.
(2) Train learning machine with the training data which contains unlabeled data in
the training set and part of labeled normal data.
(3) Test the learning machine with the test data, which contains the unlabeled data
in the test set and the other part of the labeled normal data.
108 3 Semi-supervised Learning Based Intelligent Fault Diagnosis Methods

Fig. 3.4 The flow chart of the semi-supervised KPCA detection method

(4) Calculate the intra-class scatter Scw based on the labeled data in the test set
and determine if it gets the minimum value. If a negative result is obtained,
the kernel function parameters are adjusted and the KPCA detection model is
reconstructed.
(5) Detect the feature distribution points generated by KPCA algorithms by
performing the improved nearest neighbor function rule algorithms.
As shown in Fig. 3.4, the flow chart of the semi-supervised KPCA detection
method is shown.
The algorithm feeds a limited amount of labeled data into the training process
in step 2 and the testing process in step 3. The purpose of involving labeled normal
class sample data in step 3 is to calculate the intra-class scatter Scw in step 4.
After the feature index is determined, the kernel function parameter settings have
a significant effect on the clustering effect. Higher inter-class scatter of clustering and
lower intra-class scatter indicate a better clustering effect. In the abnormal detection
process, the labeled data usually belong to the normal class, which occupies the
majority of the feature distribution. Since evaluating the clustering effect for a single
class of samples involves no inter-class scatter, the clustering effect can be evaluated
by the intra-class scatter alone.
The intra-class scatter Scw is calculated on a sample basis by referring to Eq. (3.21). A
certain kernel function parameter is set to obtain the distribution of sample features in
each principal component direction by KPCA. Step 4 analyzes the clustering effect
of the labeled normal data by calculating the intra-class scatter Scw for the labeled
normal data points. The algorithm reconstructs the KPCA by adjusting the kernel
3.2 Fault Detection and Classification Based on Semi-supervised Kernel … 109

function parameters according to the exhaustive search method until the smallest Scw
is obtained, which indicates that the sample has achieved the best clustering effect.
Meanwhile, the prerequisite for this judgment is that the normal data points do not
completely overlap (i.e., Scw ≠ 0).
Step 5 applies the improved nearest neighbor function rule algorithms to cate-
gorize and judge the feature distribution points generated by KPCA to determine
whether the test set contains abnormal data.
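The detection steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the book's implementation: KPCA is written out directly in NumPy, the intra-class scatter Scw is taken as the mean squared distance of the labeled normal projections to their centroid (a stand-in for Eq. (3.21)), and the kernel width grid and toy data are invented for the demonstration.

```python
import numpy as np

def rbf_kernel(X, Y, sigma2):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kpca_project(train, test, sigma2, n_components=2):
    """Fit KPCA on train and project test onto the leading principal directions."""
    n = len(train)
    K = rbf_kernel(train, train, sigma2)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one         # center in feature space
    vals, vecs = np.linalg.eigh(Kc)                    # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    Kt = rbf_kernel(test, train, sigma2)
    one_t = np.ones((len(test), n)) / n
    Ktc = Kt - one_t @ K - Kt @ one + one_t @ K @ one  # center the test rows too
    return Ktc @ alphas

def intra_class_scatter(points):
    """Stand-in for Scw of Eq. (3.21): mean squared distance to the centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum(axis=1).mean())

def detect(train, test, labeled_normal_idx, sigma2_grid):
    """Steps 2-4: keep the kernel width giving the smallest nonzero Scw on the
    labeled normal points, then return the projected test features for step 5."""
    best = None
    for s2 in sigma2_grid:
        z = kpca_project(train, test, s2)
        scw = intra_class_scatter(z[labeled_normal_idx])
        if scw > 0 and (best is None or scw < best[0]):
            best = (scw, s2, z)
    return best

rng = np.random.default_rng(0)
train = rng.normal(0.0, 0.1, size=(30, 6))             # training set (normal data)
test = np.vstack([rng.normal(0.0, 0.1, size=(16, 6)),  # normal test samples
                  rng.normal(3.0, 0.1, size=(2, 6))])  # two abnormal samples
scw, sigma2, z = detect(train, test, np.arange(6), sigma2_grid=[0.5, 1.0, 5.0])
```

Step 5 would then hand the projected features `z` to the improved nearest neighbor function rule algorithms to isolate any abnormal cluster.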

3.2.3 Semi-supervised KPCA Classification Algorithms

3.2.3.1 Supervised KPCA Classification Algorithms

1. Algorithm steps
The supervised KPCA classification algorithm is constructed by training on known
sample information. The flow of the method is shown in Fig. 3.5.
The steps of supervised KPCA classification algorithms are as follows:
(1) Set the initial feature index and kernel function parameters.
(2) Train the classifier with the training data which contains labeled data, and pre-
test the samples in the training set.

Fig. 3.5 The flow chart of the supervised KPCA classification method

(3) Calculate the separability evaluation index Jb based on the labeled data in the
training set and determine if it gets the maximum value. If a negative result
is obtained, the kernel function parameters and feature index combinations are
adjusted to reconstruct KPCA to generate clusters until the maximum Jb is
obtained.
(4) Train the classifier using the combination of feature index and kernel func-
tion parameters corresponding to the maximum Jb , and test the representative
samples of each class.
(5) Categorize samples using the nearest neighbor function rule algorithms
combined with labeled data.
2. Algorithm analysis
The separability evaluation index Jb is calculated from the labeled sample
information. Therefore, step 2 must pre-test the samples in the training set together
with the labeled data that serves as the basis for calculating Jb. The separability
evaluation index Jb can guide the selection
of the feature index and the determination of the kernel function parameters, such as
the setting of the kernel width value σ 2 in the Gaussian radial basis function (RBF)
kernel function.
After pre-testing the samples in the training set and calculating the separability
evaluation index Jb, a feature index combination that yields a larger Jb is
more suitable as the original
input of the classifier. A preliminary analysis of the device to be measured can be
performed before classification, and then its pattern type can be pre-evaluated and
calculated to generate the original set of feature indexes. In pattern classification,
the algorithm automatically selects feature indexes based on specific fault modes
combined with actual clustering effects.
In KPCA, the nonlinear transformation mapping from the input space to the feature
space is determined by the kernel function. The type and parameters of the kernel
function determine the nature of the feature space, which in turn has an impact on
the classification performed in the feature space. Related studies have shown that the
choice of Gaussian radial basis function (RBF) kernel function gives better results
when there is a lack of labeled information for the pattern recognition process [5].
Therefore, the KPCA method in this section uses a Gaussian RBF kernel function,
which requires determining the value of its parameter σ 2 .
However, the classification process often lacks accurate known pattern information
to guide the determination of the kernel function parameters. From the perspective
of the final classification effect, the parameter value corresponding to the maximum
Jb is therefore selected.
The input parameters of KPCA contain a combination of feature index and kernel
function parameter. The selection of the various input parameters for the classifier can be
known after the initial set of feature indexes and the range of kernel parameter values
are determined. In the initial parameter selection stage, the separability evaluation
index Jb can be used as a criterion to evaluate the clustering effect.
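The book's exact expression for Jb is given by its earlier equations; as an illustration only, a common trace-ratio form (between-class scatter over within-class scatter) behaves the same way: tighter, better-separated clusters give a larger value. The toy classes below are assumptions.

```python
import numpy as np

def separability_jb(features, labels):
    """Trace-ratio separability: between-class scatter over within-class scatter.
    This is a common stand-in; the book's exact Jb is defined by its own equation."""
    mu = features.mean(axis=0)
    sw = 0.0
    sb = 0.0
    for c in np.unique(labels):
        xc = features[labels == c]
        mc = xc.mean(axis=0)
        sw += ((xc - mc) ** 2).sum()            # within-class scatter
        sb += len(xc) * ((mc - mu) ** 2).sum()  # between-class scatter
    return sb / sw

rng = np.random.default_rng(1)
class_a = rng.normal(0.0, 1.0, size=(20, 2))
far_b = rng.normal(10.0, 1.0, size=(20, 2))    # well separated from class_a
near_b = rng.normal(1.0, 1.0, size=(20, 2))    # heavily overlapping class_a
labels = np.array([0] * 20 + [1] * 20)
jb_far = separability_jb(np.vstack([class_a, far_b]), labels)
jb_near = separability_jb(np.vstack([class_a, near_b]), labels)
```

A grid search over feature index combinations and kernel widths keeps whichever candidate maximizes this index, exactly as step 3 describes.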

In step 3, combining the above two factors, the feature index and kernel function
parameters used in the classification method can be determined by exhaustive search.
Initially, a certain combination of feature indexes and kernel parameter values is set
to construct KPCA, generate clusters, and calculate the separability evaluation
index Jb. The combination of feature indexes and kernel function parameters is
then adjusted, KPCA is reconstructed, and the training samples are pre-tested to
generate new feature clusters, whose Jb is compared with the previous value. After
several iterations, the combination of feature indexes and kernel parameter values
corresponding to the maximum Jb is retained as the best input for constructing the
test-specific KPCA.
In step 4, add a representative sample of each category from the training set to
the test set to guide the nearest neighbor function rules classification algorithms. The
nearest neighbor function rules classification algorithms are unsupervised clustering
algorithms used to classify test datasets. The samples can be divided into different
groups according to the principle of similarity measure, but the algorithm itself
cannot determine the type of the divided data. Therefore, information about samples
of labeled data is introduced to guide the nearest neighbor function rule algorithms
for categorization.
In step 5, the feature values projected in the direction of the first two principal
components with the largest cumulative contribution in the set of features generated
by the test samples are taken as the coordinate values of the x and y axes on the
classification plane. Therefore, the feature projection points representing different
samples are formed on the classification plane. The nonlinear principal components
obtained by KPCA represent the maximum direction of variation of the sample data.
The feature mapping map also reveals only a spatial distribution of the samples, and
the axes in the map do not correspond to a specific physical meaning [6].
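The projection onto the classification plane can be illustrated with linear PCA for brevity; in the book the same step is applied to the KPCA eigenvectors, but the mechanics (take the two directions with the largest contribution rate, use the projections as x and y coordinates) are identical. The toy feature matrix is an assumption.

```python
import numpy as np

def top2_projection(features):
    """Project samples onto the two principal directions with the largest
    contribution rate; the projections serve as (x, y) plane coordinates."""
    centered = features - features.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(vals)[::-1]              # descending eigenvalues
    contribution = vals[order] / vals.sum()     # per-direction contribution rate
    return centered @ vecs[:, order[:2]], contribution[:2]

# assumed 18 x 6 feature matrix with two dominant directions
rng = np.random.default_rng(2)
X = rng.normal(size=(18, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
plane_xy, contrib = top2_projection(X)
```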
The nearest neighbor function rule algorithms make it possible to effectively
classify and identify data with different shapes that appear on the feature distribution
map. Combine the labeled data information in the test sample set to identify and
analyze the data in each group after categorization, and put the other data in the
group of known samples into the corresponding labeled group. When labeled samples
of the same type are divided into different groups, the group containing the majority
of those labeled samples is considered to share their label. When labeled samples of
different types fall into the same group, the group is assigned the type whose known
samples are in the majority. When a group or several groups contain no labeled
sample data, those samples are considered to belong to a new pattern type.
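The labeling rules of this paragraph (majority vote within a group, new pattern type for groups with no labeled member) can be sketched as a small helper; the group assignments and labels below are invented for illustration.

```python
import numpy as np

def label_groups(group_ids, known_labels):
    """Assign each cluster the majority label among its labeled members;
    clusters with no labeled member are flagged as a new pattern type."""
    result = {}
    for g in np.unique(group_ids):
        members = np.where(group_ids == g)[0]
        labels = [known_labels[i] for i in members if known_labels[i] is not None]
        result[int(g)] = max(set(labels), key=labels.count) if labels else "new pattern"
    return result

# invented example: three clusters, a handful of labeled samples
groups = np.array([0, 0, 0, 1, 1, 2])
known = ["normal", None, "normal", "fault", None, None]
assignment = label_groups(groups, known)
```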
3. The setting of connection adjustment coefficients for the nearest neighbor
function rule algorithms
The nearest neighbor function rules algorithm is likely to cause false connections
when the number of samples in different cluster groups differs widely.
The process by which the nearest neighbor function rules classify the initial clusters
and establish connections merits further analysis. As shown in Fig. 3.6, after the

Fig. 3.6 Consequences of a wrong connection

initial connection, three initial clusters are formed on the distribution map: ωi , ωk
and ωm . It is obvious that ωk and ωm should belong to the same type of sample,
while ωi is another type of sample. The nearest neighbor function rules algorithm
analyzes each initial cluster to establish connections and group the clusters. Assume
that the algorithm first analyzes the class ωi and that there is a minimum connection
loss γi between class ωk and class ωi . The algorithm compares γi to the maximum
connection loss αi max between two points in class ωi and the maximum connection
loss αk max between two points in class ωk . Because the numbers of samples in ωk and ωi
differ widely, γi < αi max holds, causing a false connection between class ωk and
class ωi .
The new cluster ωi−k will continue to seek connections with other initial clusters
after the wrong connection. Meanwhile, the numbers of samples in ωi−k and ωm
differ even more widely, which easily leads to a further wrong connection between ωi−k
and ωm . The errors thus accumulate and may cause the classification to fail entirely.
Avoiding incorrect connections requires optimizing the algorithm’s guidelines for
determining connections. In step 5, the algorithm groups ωi and ωk into one class if
γi ≤ αi max or γi ≤ αk max . Therefore, the following adjustment factor is added to the
inequality:

γi ≤ ra × αi max ra ∈ (0, 1] (3.31)

γi ≤ ra × αk max ra ∈ (0, 1] (3.32)

The adjustment coefficient ra serves to optimize the connection rules in the above
inequality. The smaller the value of ra, the stricter the conditions for establishing the
“connection”, which can effectively avoid the wrong connection between different
classes. However, it also increases the risk that samples of the same type cannot be
grouped and categorized. In the classification, the value of the adjustment coefficient
ra should be set according to the specific clustering situation and data distribution
characteristics to improve the accuracy of categorization.
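Equations (3.31)-(3.32) amount to a simple connection test; the numeric values in the example below are invented to show how a stricter ra rejects a merge that ra = 1 would allow.

```python
def should_connect(gamma_i, alpha_i_max, alpha_k_max, ra=1.0):
    """Connection test of Eqs. (3.31)-(3.32): merge two initial clusters only if
    the minimum connection loss between them is within ra times either cluster's
    maximum internal connection loss. Smaller ra means stricter merging."""
    assert 0 < ra <= 1
    return gamma_i <= ra * alpha_i_max or gamma_i <= ra * alpha_k_max

# invented losses: a big cluster (alpha_k_max = 6) next to a small one (2)
loose = should_connect(gamma_i=5.0, alpha_i_max=2.0, alpha_k_max=6.0, ra=1.0)
strict = should_connect(gamma_i=5.0, alpha_i_max=2.0, alpha_k_max=6.0, ra=0.7)
```

With ra = 1 the large cluster's internal loss lets the merge through (`loose` is true); ra = 0.7 tightens both bounds and rejects the false connection (`strict` is false).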

3.2.3.2 Semi-supervised KPCA Classification Algorithms

1. Algorithm steps
The supervised KPCA requires complete labeled samples to train the classifier, and
only with a sample database of multiple pattern types can the reliability of the
classification be guaranteed. However, the limited amount of labeled data available
for classification greatly restricts the application of supervised KPCA. Facing the
limited labeled data and the large amount of unlabeled data in practice, a natural
question arises: can unlabeled data play as active a role in detection and classification
as labeled data? The core idea of semi-supervised learning, which exploits the data
composition and the self-learning ability of the samples to be classified, can answer
this question.
Combining the performance characteristics of supervised KPCA and the semi-
supervised learning with labeled and unlabeled data co-training mechanism, the
semi-supervised KPCA classification algorithm is proposed and the flow is shown
in Fig. 3.7.
The steps of semi-supervised KPCA classification algorithms are as follows:
(1) Train the classifier with the training data which contains labeled data.
(2) Pre-test the samples in the training set using the classifier generated in step 1.
(3) Calculate the separability evaluation index Jb while adjusting the feature
index combinations and kernel function parameters by exhaustive search
until the maximum is obtained. The corresponding combination of feature
indexes and kernel parameters is taken as the classifier input
parameters.
(4) Train the classifier using all training samples, and test the classifier using all
test samples.
(5) Categorize the samples using the nearest neighbor function rule algorithms,
and fine-tune the kernel function parameters according to the effectiveness of
the clustering.
(6) Test the classifier using all the test samples and the representative labeled training
samples of each class.
(7) Categorize samples using the nearest neighbor function rule algorithms
combined with labeled data.
2. Algorithm analysis
In step 1, the labeled data in the training set are used to train the classifier, the initial
kernel function parameter values are set, and a certain combination of feature indexes
from the initial feature index set is selected as the original sample input parameters.
This operation is essentially the pre-training of the classifier, as in supervised learning.
In step 2, the training samples containing both labeled and unlabeled data are
pre-tested, and the separability evaluation index Jb is calculated from the labeled data.
In step 3, after different input parameters have been enumerated by exhaustive
search, the combination of feature indexes and kernel parameter values corresponding
to the maximum value of the separability evaluation index is selected. As a result, the

Fig. 3.7 The flow chart of the semi-supervised KPCA classification method

algorithm determines the best combination of feature indexes in the pre-generated
set of feature indexes, while selecting the values of the kernel parameters.
In step 4, the classifier is trained using all training samples and tested using all test
samples, which are all unlabeled. The values obtained in step 3 therefore provide only
a limited degree of separability evaluation for step 4, and the kernel function
parameters determined in step 3 are not necessarily the best choice for the unlabeled
test samples. This may reduce the classification accuracy in step 4 or even lead to
incorrect classification. The nearest neighbor function
rule algorithms directly classify the test samples by analyzing the distribution of
sample features, which can be used as a classification performance evaluation index
to optimize the values of kernel function parameters in the classifier [7].

In step 5, the kernel parameter values are fine-tuned by analyzing the classification
effect of the nearest neighbor function rules. A small range of variation around the
kernel parameter values determined in the initial classification is defined, a series of
candidate values is generated at a set interval, and KPCA is performed with each
candidate to classify the samples in the test set. The number of groups produced by
the nearest neighbor function rule algorithms reflects the final classification effect:
a larger number of groups indicates a more fragmented classification whose accuracy
is not high enough. The principle of fine-tuning is therefore to make the number of
groups converge to a value as small as possible but greater than 1 after the
categorization of the nearest neighbor function rules. The fine-tuned kernel function
parameters are the secondarily optimized values.
In step 6, the fine-tuned parameters are used to construct KPCA, and the labeled
representative samples from the training set are added to the test set to form a new
test sample set. The classifier is trained with all training samples and classifies the
new test samples. Since the nearest neighbor function rule algorithm is an unsupervised
clustering algorithm, labeled data are required to guide pattern recognition. Samples
of known classes are therefore introduced into the test set to aid in labeling the sample
categories, embodying the semi-supervised co-training idea of labeled and unlabeled data.
In step 7, the connection adjustment coefficient ra is set. The new test data are then
classified by the nearest neighbor function rule algorithms, each group of data after
categorization is analyzed in combination with the labeled data in the test set, and
finally the pattern of the samples is determined.

3.2.4 Application of Semi-supervised KPCA Method in Transmission Fault Detection and Classification

3.2.4.1 Experiment and Characteristic Analysis of Typical Transmission Failure

1. The structure of the experimental system

The structure of the experimental system is shown in Fig. 3.8, and the transmission
experiment table and console are shown in Fig. 3.9a.

(1) The components of the test bench


Traction motor: maximum output power of 75 kW, the maximum speed of
4500 r/min.
Loading motor: maximum output power of 45 kW, the maximum speed of
3000 r/min, the maximum output torque of 150 N m.
Experimental console: control the speed of the traction motor and the torque
of the loading motor. Collect and display the speed and torque of the input and
output ends of the transmission. Calculate and display the transmission input
and output power, etc.

Fig. 3.8 The structure of the experimental system

Fig. 3.9 The experimental transmission system

Input and output speed and torque sensors: collect input and output speed and
torque values respectively.
Accompanying test transmission: changes the speed and torque to realize
transmission coordination between the loading motor and the tested transmission.
Tested transmission: Dongfeng SG135-2 transmission with three shafts and five
gears.

(2) Experimental transmission route

The console controls the speed of the traction motor, which serves as the power input.
The power passes through the tested transmission, the speed and torque sensors at the
output end, and the accompanying test transmission to the loading motor. The loading
motor outputs a reverse torque, which is applied to the tested transmission through
the torque converter of the accompanying test transmission to provide different loads
for transmission operation. The electrical energy generated by the loading motor
is fed back to the traction motor through the reverser to close the power circuit.
(3) Experimental transmission
The transmission sketch and sensor arrangement of SG135-2 transmission for the
experiment are shown in Fig. 3.9b. The transmission has three shaft systems including
input, intermediate, and output shafts, and five gear meshing pairs. The transmission
ratio of each gear is shown in Table 3.1.

2. The specific details of the test system and signal acquisition

The specific details of the test system and signal acquisition are as follows:
(1) Test and analysis system
Sensor: piezoelectric three-way acceleration sensor, which collects acceleration
and velocity signals.
Charge amplifier: filtering and amplifying the collected vibration signal. The
amplification can be set to 1, 3.16, 10, 31.6, 100, 316, and 1000.
Multi-functional interface box: 16 channels can be collected at the same time,
with filtering and amplifying functions. Amplification can be set to 1, 3.16, 10,
31.6, and 100.
Acquisition card: 12-bit A/D signal conversion.
Signal acquisition and analysis system: DASC signal acquisition and analysis
system with signal acquisition and online and offline time domain waveform
analysis, spectrum analysis, spectrum correction, refinement spectrum analysis,
demodulation analysis, and other functions. As shown in Fig. 3.10a.
(2) Test method

In this experiment, as shown in Figs. 3.9 and 3.10b, four three-way vibration
acceleration sensors are arranged near the input shaft bearing housing, at the two ends
of the intermediate shaft, and near the output shaft bearing housing of the tested
transmission. Vibration signals are collected simultaneously in the horizontal (X),
vertical (Y), and axial (Z) directions.

Table 3.1 Gear ratio for each gear

Gear:               1st    2nd    3rd    4th    5th
Transmission ratio: 5.29   2.99   1.71   1      0.77

Fig. 3.10 Signal acquisition

The vibration signal collected by each sensor is amplified by the B&K 2653 charge
amplifier and fed into the DAS multi-functional interface box, then input to a portable
computer after A/D conversion. The DASC signal acquisition and analysis system
records the signals and analyzes them in the joint time-frequency domain. Meanwhile,
the speed and torque sensors placed at both ends of the tested transmission acquire
the speed and load information at the transmission input and output.
(3) Data acquisition
For the fault detection experiment, the 5th gear of the transmission was used as the
experimental object. The transmission was designed to run under three conditions:
normal, slight pitting of tooth surface, and severe pitting of tooth surface, and the
faulty gears are shown in Fig. 3.11a, b.
For the classification experiment, the output shaft cylindrical rolling bearing of the
Dongfeng SG135-2 transmission was used as the experimental object. The measurement
point is placed on the output shaft bearing housing (i.e., the position of measurement
point 1 in Fig. 3.9). The experiment was designed to run the transmission under
three conditions: normal, spalling of the inner ring, and spalling of the outer ring.
Figure 3.11c, d show the faulty bearing components.
The transmission is set to 3rd gear, with the tooth-number ratio of the normally
meshing gears at the input end being 38/26 and that of the meshing gear pair in
3rd gear being 35/30.
The transmission operating conditions for the pitting detection experiment are as follows.
Rotational speed: 1200 r/min (input shaft), 820 r/min (intermediate shaft), 1568 r/min
(output shaft).
Output torque: 100.2 N m; output power: 16.5 kW.
Rotational frequency: 20 Hz (input shaft), 13.7 Hz (intermediate shaft), 26 Hz
(output shaft).
5th gear meshing frequency: 574 Hz; constant-mesh gear meshing frequency:
520 Hz.
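The rotational frequencies quoted above are simply the shaft speeds divided by 60; a quick check:

```python
# rotational frequency (Hz) = shaft speed (r/min) / 60
speeds_rpm = {"input": 1200, "intermediate": 820, "output": 1568}
freqs_hz = {shaft: rpm / 60 for shaft, rpm in speeds_rpm.items()}
# input: 20.0 Hz; intermediate: ~13.7 Hz; output: ~26.1 Hz (quoted above as 26 Hz)
```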

Fig. 3.11 Faulty bearing components

As shown in Fig. 3.12, the difference between the time domain waveforms of the
normal and slight pitting signals is very small, making it impossible to determine
whether a fault exists. The waveform of the severe pitting signal, by contrast, contains
many shock components and its amplitude increases significantly.

3. The work conditions of bearing

(1) Bearing classification experimental transmission operating conditions


Speed: 2400 r/min (input shaft), 1642 r/min (intermediate shaft), 1370 r/min
(output shaft).
Output torque: 105.5 N m.
Output power: 15 kW.
(2) Sampling parameter setting
Acquisition signal: vibration acceleration, vibration speed.
Test direction: horizontal radial, vertical radial, and axial.
Sampling frequency: 40,000 Hz (acceleration), 5000 Hz (velocity).
Anti-mixing filter: 20,000 Hz (acceleration), 3000 Hz (velocity).
Sampling length: 1024 × 90 points.
(3) Transmission characteristic frequencies
Rotational frequency: 40 Hz (input shaft), 27.4 Hz (intermediate shaft), 22.8 Hz
(output shaft).
3rd gear meshing frequency: 798 Hz.

Fig. 3.12 The time domain waveform for each working condition

Normally meshing gear meshing frequency: 1040 Hz.
Output shaft rolling bearing parameters: model NUP311EN; see Table 3.2.
Output shaft rolling bearing characteristic frequencies: as shown in Table 3.2.
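Table 3.2's characteristic frequencies follow from the standard rolling-bearing formulas for a rotating inner ring; the sketch below reproduces the inner-ring, outer-ring, and rolling-element passing frequencies from the table's geometry and the output-shaft rotation frequency of 22.8 Hz.

```python
import numpy as np

def bearing_frequencies(D, d0, m, alpha_deg, fr):
    """Characteristic frequencies of a rolling bearing with rotating inner ring:
    D pitch diameter, d0 rolling element diameter, m element count, fr shaft Hz."""
    g = (d0 / D) * np.cos(np.radians(alpha_deg))
    fi = m / 2 * fr * (1 + g)              # inner-ring passing frequency
    fo = m / 2 * fr * (1 - g)              # outer-ring passing frequency
    fg = D / (2 * d0) * fr * (1 - g ** 2)  # rolling-element passing frequency
    return fi, fo, fg

# NUP311EN geometry from Table 3.2, output-shaft rotation frequency 22.8 Hz
fi, fo, fg = bearing_frequencies(D=85, d0=18, m=13, alpha_deg=0, fr=22.8)
# fi ≈ 179.6 Hz, fo ≈ 116.8 Hz, fg ≈ 51.4 Hz, matching the table
```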
Gear surface pitting is a minor fault. Relative to the normal signal, the vibration
signal extracted under the minor pitting condition usually shows a change in vibration
energy in the time domain, but the impact phenomenon is not obvious. In the
frequency domain, the energy in the gear meshing frequency band increases to varying
degrees without significant modulation. It is difficult to distinguish such signals from
normal ones by signal processing methods alone, which makes it more difficult to detect

Table 3.2 Output shaft rolling bearing parameters and characteristic frequency

Pitch diameter D (mm): 85
Rolling element diameter d0 (mm): 18
Rolling element number m: 13
Contact angle α (°): 0
Inner ring passing frequency fi (Hz): 179.6
Outer ring passing frequency fo (Hz): 116.8
Rolling element passing frequency fg (Hz): 51.4
Cage passing frequency fb (Hz): 10.9

similar minor gear faults. A semi-supervised KPCA method is therefore applied, in
conjunction with the gear surface pitting experiments, to detect mild pitting faults
in gears. The semi-supervised KPCA detection model is evaluated in terms of both
correctness and stability, with the detection rate and false alarm rate of the detection
results as evaluation indexes. The samples with normal and minor gear surface pitting
were used as combination A; their detection rate was analyzed to evaluate the ability
of the semi-supervised KPCA method to detect minor faults and to test the correctness
of the model. The all-normal samples were used as combination B; their false alarm
rate was analyzed, which is equivalent to testing the stability of the semi-supervised
KPCA detection model under working conditions with no abnormal data.
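The two evaluation indexes can be computed as below; the label convention (1 = fault, 0 = normal) and the toy outcome, matching the 18-sample test set of combination A, are assumptions.

```python
def detection_metrics(y_true, y_pred):
    """Detection rate: detected faults over actual faults.
    False alarm rate: normal samples flagged faulty over actual normal samples.
    Label convention (an assumption here): 1 = fault, 0 = normal."""
    fault_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    normal_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    detection_rate = sum(fault_preds) / len(fault_preds) if fault_preds else float("nan")
    false_alarm_rate = sum(normal_preds) / len(normal_preds) if normal_preds else 0.0
    return detection_rate, false_alarm_rate

# combination A style outcome: 16 normal and 2 fault test samples, all correct
y_true = [0] * 16 + [1] * 2
y_pred = [0] * 16 + [1] * 2
dr, far = detection_metrics(y_true, y_pred)   # dr = 1.0, far = 0.0
```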
The vibration acceleration signals were collected separately from the experimental
transmission under two operating conditions: normal and minor gear surface pitting.
Five time domain statistical characteristics (mean, mean square, variance, skewness,
and root mean square amplitude) and one frequency domain characteristic (the
amplitude at the rotational frequency of the shaft carrying the 5th gear) are computed
to describe the gear running status, each set of extracted features constituting one
sample. For the normal working condition, a total of 48 samples were obtained from
the data collected in the x and y directions, of which 30 are used for training and
the other 18 for testing. To simulate the actual inspection situation, 12 normal class
samples are set as known samples in the training set. For the minor gear surface
pitting condition, 4 samples were collected in the x direction, of which 2 are used
for training and the other 2 for testing.
As a result, the training set contains 30 samples and the test set contains 18
samples in combination A. A 30 × 6-dimensional feature data matrix is generated
in the training set and an 18 × 6-dimensional feature data matrix is generated in the
test set. Similarly, the training set contains 32 samples and the test set contains 20
samples in combination B.
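The six-feature extraction described above can be sketched as follows. The signal length, sampling rate, and test tone are assumptions, and "root mean square amplitude" is taken here as the RMS value.

```python
import numpy as np

def feature_vector(signal, fs, shaft_freq):
    """Five time-domain statistics plus the spectral amplitude at the shaft
    rotation frequency (the six-feature set described in the text)."""
    mean = signal.mean()
    mean_square = (signal ** 2).mean()
    variance = signal.var()
    skewness = ((signal - mean) ** 3).mean() / signal.std() ** 3
    rms = np.sqrt(mean_square)                     # root mean square amplitude
    amps = np.abs(np.fft.rfft(signal)) * 2 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    rot_amp = amps[np.argmin(np.abs(freqs - shaft_freq))]
    return np.array([mean, mean_square, variance, skewness, rms, rot_amp])

# assumed test signal: a 20 Hz shaft component in noise, sampled at 40 kHz
fs, n = 40000, 4096
t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 20 * t) + 0.1 * np.random.default_rng(3).normal(size=n)
fv = feature_vector(signal, fs, shaft_freq=20.0)
```

Stacking one such vector per record yields feature matrices of the kind described in the text.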

3.2.4.2 The Detection Result of Semi-supervised KPCA Algorithms

The kernel function used in KPCA is the Gaussian radial basis function. The
detection method applies the improved nearest neighbor function rule algorithm to
classify the samples, with the standard deviation coefficient set to a = 2 in the
discriminant inequality for cluster collection. The following is the semi-supervised
KPCA detection analysis of the experimental data, showing the distribution of the
sample features projected in the direction of the first two principal components.
(1) Experimental data detection under normal and minor gear surface pitting
conditions
The samples with normal and minor gear surface pitting were used as combination A.
The samples in combination A are detected with the semi-supervised KPCA detection

Fig. 3.13 Combined A data semi-supervised KPCA detection effect

algorithms. All the figures show the distribution of the feature samples projected in
the direction of the first two principal components. The kernel function parameter is set to σ 2 = 4.9.
Figure 3.13a shows the semi-supervised KPCA feature distribution of the 26 sample
points, with the horizontal coordinate representing the 1st principal direction and the
vertical coordinate representing the 2nd principal direction.
The “*” samples represent the six labeled normal data in the set of test samples added
to the algorithm, which are used to calculate the intra-class scatter Scw and optimize
the detector performance. The “Δ” samples represent the fault data detected by the
algorithm, and the “◯” samples represent the other data in the test set. Figure 3.13a
shows a densely distributed cluster with known normal samples distributed among
them. A good clustering effect is achieved because the detection algorithm optimizes
the kernel function parameters by calculating the intra-class scatter evaluation index.
Notice that there are two sample points marked with “Δ” far from the cluster distri-
bution, and the results of the improved nearest neighbor function criterion algorithm
for clustering points show that those two data points are abnormal fault samples with
numbers 25 and 26.
To check the correctness of the results, the different types of test samples are
first marked by different icons. The “*” samples represent the normal data and
the “Δ” samples represent the minor surface pitting data as shown in Fig. 3.13b.
Figure 3.13b shows a situation that exactly matches the results detected by the algo-
rithm in Fig. 3.13a, where the normal sample points are clustered into one class and
the 2 faulty sample points are distributed out of the cluster, whose numbers are 25 and
26. This indicates that for the experimental data containing mild pitting samples on
the tooth surface in combination A, the detection rate of the semi-supervised KPCA
method is 100% and the false alarm rate is 0%, and the algorithm reflects a high
detection performance.
(2) Experimental data detection under normal conditions
The samples with all normal data were used as combination B. The distribution of the
feature sample is shown in Fig. 3.14. A decentralized cluster is shown in Fig. 3.14a
3.2 Fault Detection and Classification Based on Semi-supervised Kernel … 123

Fig. 3.14 Semi-supervised KPCA detection results for combination B data

with labeled normal class sample points distributed among them. The algorithm
detection results show that there are no abnormal samples. The kernel function parameter is set to σ² = 5.2 after optimization by the algorithm.
As shown in Fig. 3.14b, all sample points belong to the normal class data, which
verifies the correctness of the detection results in Fig. 3.14a. This indicates that the
detection rate of the semi-supervised KPCA method is 100% and the false alarm rate
is 0% for the experimental data containing normal samples in combination B. The
above results demonstrate the effectiveness and high performance of the algorithm.

3.2.4.3 Application Analysis of the Semi-supervised KPCA Algorithm

The semi-supervised kernel principal component analysis method is applied to classify transmission sample data under the normal, bearing inner ring spalling, and bearing outer ring spalling working conditions.
1. Feature index extraction
Vibration acceleration signals were collected from the experimental transmission under the normal, bearing inner ring spalling, and bearing outer ring spalling operating conditions, and features were extracted from each condition separately.
(1) Time domain statistical indicators: mean, mean square, kurtosis, variance,
skewness, peak value, root mean square amplitude.
(2) Dimensionless feature indicators: waveform indicator, pulse indicator, peak
indicator, margin indicator.
(3) Frequency domain characteristics: the frequency of the highest spectral peak, the amplitude at the bearing inner ring passing frequency in the refined spectrum, and the amplitude at the bearing outer ring passing frequency in the refined spectrum.
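The time domain and dimensionless indicators listed above can be sketched in a few lines of NumPy. The function below is an illustrative implementation, not the book's code: the function and key names are assumptions, and the root mean square amplitude is taken here as the common "root amplitude" definition.

```python
# Illustrative sketch of the time-domain and dimensionless feature indicators.
import numpy as np

def time_domain_features(x):
    """Return a dict of statistical indicators for one vibration sample."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    mean_square = np.mean(x ** 2)
    rms = np.sqrt(mean_square)                       # root mean square value
    variance = x.var()
    skewness = np.mean((x - mean) ** 3) / x.std() ** 3
    kurtosis = np.mean((x - mean) ** 4) / variance ** 2
    peak = np.max(np.abs(x))
    mean_abs = np.mean(np.abs(x))
    # "root amplitude" convention assumed for the RMS amplitude indicator
    root_amplitude = np.mean(np.sqrt(np.abs(x))) ** 2
    return {
        "mean": mean, "mean_square": mean_square, "variance": variance,
        "skewness": skewness, "kurtosis": kurtosis, "peak": peak, "rms": rms,
        "waveform": rms / mean_abs,        # waveform (shape) indicator
        "pulse": peak / mean_abs,          # pulse (impulse) indicator
        "peak_indicator": peak / rms,      # peak (crest) indicator
        "margin": peak / root_amplitude,   # margin (clearance) indicator
    }
```

Each indicator is computed per sample segment; stacking the returned values for every segment yields the raw feature matrix used below.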

The above feature index constitutes the feature set for the experimental analysis,
which is used as raw information for further feature selection, feature extraction, and
pattern classification.
For the time domain sampling sequences collected in each direction, a total of 44
sets of sample data were obtained for each state and two directions (i.e., x direction
and y direction), from which 24 sets were selected as training samples and another
20 sets were used as test samples for classification. From each sample, 14 feature indicators are extracted, forming a 72 × 14 training data matrix and a 60 × 14 test data matrix over the 3 sample types.
To simulate the situation of insufficient information of known categories in the
actual classification, three types of data are in the training sample set. The normal
samples and the bearing inner ring spalling fault samples are taken as known type
samples, and the bearing outer ring spalling fault samples are taken as unknown
samples. The three types of samples are identified and classified in the test set.
2. Classify the experiment data with supervised KPCA algorithms
The supervised KPCA uses labeled normal and bearing inner ring spalling data from
the training samples to train the classifier to classify and identify samples in the
test set. The RBF kernel function is used for KPCA, with the connection adjustment coefficient set to ra = 0.8. The supervised KPCA method automatically selects the
following features in the feature set: variance, peak value, root mean square ampli-
tude, frequency value corresponding to the highest peak of the spectrum, bearing
outer ring passing frequency amplitude, and bearing inner ring passing frequency
amplitude.
The nonlinear principal components in KPCA, like the principal components in PCA, have an associated contribution rate, which measures how much of the sample variance is explained by the projection onto that principal direction. In this experiment, the cumulative contribution of the first 2 principal components generated by KPCA is 94.98%, indicating that the first two principal components carry enough sample variance information to be used for transmission fault classification and identification.
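As a small illustration of the contribution-rate concept, the rate of the leading principal components can be computed from the eigenvalues of the (kernel) covariance matrix. The eigenvalues below are hypothetical, chosen only so that the first two components explain about 95% of the variance; they are not the experiment's actual values.

```python
# Sketch: cumulative contribution rate of the leading principal components.
import numpy as np

def cumulative_contribution(eigenvalues, n_components):
    # sort eigenvalues in descending order, then take the leading share
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return lam[:n_components].sum() / lam.sum()

lam = [9.1, 4.2, 0.4, 0.3]                 # hypothetical KPCA eigenvalues
rate = cumulative_contribution(lam, 2)     # first two components: ~0.95
```

A rate close to 1 means the retained projection directions preserve almost all of the sample variance.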
The clustering effect of supervised KPCA in the experiment, i.e., the distribution of the feature samples projected onto the first 2 principal directions, is shown in Fig. 3.15.
Figure 3.15 shows the supervised KPCA clustering effect; each panel contains 80 labeled test sample points. The “⛛” and “✩” labels in Fig. 3.15a represent
the 10 normal and 10 bearing inner ring spalling data extracted from the training
set and added to the test sample set to guide the final categorization of the nearest
neighbor function rules algorithm. The “◯” labels represent the 60 test data points. The labeled normal and bearing inner ring spalling data added to the test samples are used to calculate the separability evaluation index Jb = 0.87868, and the optimized RBF kernel parameter is σ² = 10. As seen in the figure, the data points form three cluster groups, with the “⛛” samples and the “✩” samples located in two of the cluster groups.

Fig. 3.15 Supervised KPCA clustering effect

To check the correctness of the clustering, the following settings are made: the “⛛” labels represent the normal samples, the “✩” labels represent the bearing inner ring spalling samples, and the “Δ” labels represent the bearing outer ring spalling samples, as shown in Fig. 3.15b. The three cluster groups represent the three sample types, and the clustering matches exactly what is seen in Fig. 3.15a. The separability evaluation index Jb = 0.85378 indicates that the actual clustering achieves good separability.
3. Classify the experiment data with semi-supervised KPCA algorithms
The semi-supervised KPCA uses the RBF kernel function, with the connection adjustment coefficient set to ra = 0.8 and the fine-tuning range set to cvar = 1. The semi-supervised
KPCA method automatically selects the following features in the feature set: vari-
ance, peak value, root mean square amplitude, frequency value corresponding to the
highest peak of the spectrum, bearing outer ring passing frequency amplitude and
bearing inner ring passing frequency amplitude.
Figure 3.16 shows the semi-supervised KPCA clustering effect; each panel contains 80 labeled test sample points. The “⛛” and “✩” samples in Fig. 3.16a
represent the 10 normal and 10 bearing inner ring spalling data extracted from the
training set and added to the test sample set to guide the final categorization of the
nearest neighbor function rule algorithm. The two types of labeled data in the test set are used to calculate the separability evaluation index Jb = 0.94289, and after two optimizations of the semi-supervised KPCA algorithm the kernel parameter is σ² = 10. As seen in the figure, the data points form three cluster groups whose centers are triangularly distributed, with a small intra-class scatter and a large inter-class scatter for each cluster.
To check the correctness of the clustering, the following settings are made: the “⛛” labels represent the normal samples, the “✩” labels represent the bearing inner ring spalling samples, and the “Δ” labels represent the bearing outer ring spalling samples, as shown in Fig. 3.16b. The

Fig. 3.16 Semi-supervised KPCA clustering effect

three cluster groups represent three types of samples, and the clustering matches
exactly what is seen in Fig. 3.16a. The separability evaluation index Jb = 0.91823
indicates that the actual clustering achieves good separability.
Comparing Figs. 3.15 and 3.16, the semi-supervised KPCA method achieves a better clustering effect than the supervised KPCA method on the bearing class data.
4. Final classification results of KPCA under the two models
Before classification, the test samples of each category are numbered: the normal samples are numbered 1–30, the bearing inner ring spalling fault samples 31–60, and the bearing outer ring spalling fault samples 61–80.
KPCA forms a series of feature projection points on the classification plane spanned by the first 2 principal directions, and the nearest neighbor function rules algorithm classifies the sample points according to a similarity metric. The following are the results after categorization by the nearest neighbor function criterion algorithm in both modes.
(1) Supervised KPCA final classification results
Normal samples: 24, 1, 29, 23, 21, 3, 2, 30, 16, 7, 20, 18, 8, 19, 11, 27, 14, 5,
26, 22, 12, 4, 28, 17, 13, 9, 6, 25, 15, 10.
Bearing inner ring spalling failure samples: 58, 38, 31, 56, 39, 37, 59, 34, 54,
46, 44, 35, 33, 55, 47, 36, 50, 49, 60, 53, 51, 57, 48, 52, 42, 32, 45, 43, 41, 40,
79, 65, 64, 72, 71, 63, 80, 76, 69, 75, 70, 62, 78, 74, 68, 66, 73, 61, 77, 67.
The above results indicate that the supervised KPCA algorithm correctly classified
the normal samples while misclassifying the bearing outer ring spalling fault samples
into the bearing inner ring spalling fault type. The categorization misclassification
rate was calculated to accurately evaluate the classification correctness of KPCA.
The misclassification rate is the ratio of the sum of the number of misclassified
samples and the number of samples not correctly classified to the total number of

tested samples. The calculation shows that the supervised KPCA method has a 25%
misclassification rate in the bearing class classification experiment.
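The misclassification rate defined above can be expressed as a one-line helper (the function name is illustrative): with all 20 bearing outer ring spalling samples assigned to the wrong class and none left unclassified, the rate for the 80 test samples is 25%.

```python
# Misclassification rate = (misclassified + not correctly classified) / total.
def misclassification_rate(n_misclassified, n_unclassified, n_total):
    return (n_misclassified + n_unclassified) / n_total

rate = misclassification_rate(20, 0, 80)   # supervised KPCA case: 0.25
```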
(2) Semi-supervised KPCA final classification results
Normal samples: 24, 19, 15, 11, 3, 1, 23, 2, 29, 21, 25, 10, 30, 16, 7, 27, 20, 18,
14, 8, 26, 22, 5, 4, 28, 17, 13, 12, 9, 6.
Bearing inner ring spalling failure samples: 58, 38, 31, 56, 39, 37, 59, 34, 54,
46, 44, 35, 33, 55, 47, 36, 50, 49, 60, 53, 57, 51, 48, 52, 42, 32, 45, 43, 41, 40.
Unknown type 1 samples: 74, 68, 64, 62, 80, 76, 69, 79, 75, 70, 78, 66, 72, 71,
65, 63, 77, 67.
Unknown type 2 samples: 73, 61.
It is obvious that the classification results are consistent with the actual sample
categories, and the semi-supervised KPCA method can separate normal samples from
bearing inner ring spalling fault samples, and the unknown type 1 samples are the
bearing outer ring spalling fault samples. Since the classifier does not have a priori
information about the bearing outer ring spalling fault, this type of data is classified
as unknown type 1. The unknown type 2 samples are the remaining misclassified data, which also belong to the bearing outer ring spalling fault. As seen in Fig. 3.16b,
the two data points located near the bearing outer ring spalling cluster are the data
points corresponding to the unknown type 2 samples. The calculation shows that the
misclassification rate of the semi-supervised KPCA method is 2.5%, which shows a
good classification performance.
5. Comparison of KPCA classification results under two models
Table 3.3 shows the comparison of the classification of supervised KPCA methods
and semi-supervised KPCA methods.
As seen from the table, for the bearing class fault classification experiments, the
semi-supervised KPCA has more desirable classification results than the supervised
KPCA in the case of insufficient a priori sample information.

Table 3.3 Comparison of KPCA classification in two models

                                                    Supervised KPCA   Semi-supervised KPCA
Clustering effect in the figure                     Normal            Good
Separability evaluation index Jb
(detection experiment)                              0.87868           0.94289
Separability evaluation index Jb
(control experiment)                                0.85378           0.91823
Nearest neighbor function rules algorithm
misclassification rate (%)                          25                2.5
Kernel function parameter σ²                        10                10

3.3 Outlier Detection Based on Semi-supervised Fuzzy


Kernel Clustering

3.3.1 Correlation of Outlier Detection and Early Fault

The problem of early fault detection in mechanical equipment fault diagnosis has many similarities to the problem of outlier data in data mining. Early fault detection separates fault signals from normal signals, while outlier data mining identifies abnormal data in a dataset. Therefore, outlier data mining methods are also applicable to early fault detection. At the same time, applying artificial intelligence methods to machinery and equipment fault diagnosis provides a new research path for the field.

3.3.1.1 Outlier Detection

The outlier detection problem was first widely studied in the field of statistics in the
1980s [8]. With the development of research, many methods for outlier detection
emerged which can be broadly classified into statistical-based methods, distance-
based methods, density-based methods, clustering-based methods, and deviation-
based methods.
(1) Statistic-based method
Statistic-based methods perform statistical inconsistency tests; most of them were developed from inconsistency tests for data sets with known distributions. These methods assume that the data follow a particular distribution, such as a normal or gamma distribution, fit the dataset with that distribution, and identify data that deviate from the fitted model as outliers [9]. They therefore assume that the underlying distribution and its parameters are known.
In many situations we do not know whether a particular attribute follows a standard distribution, and finding a distribution that fits the attribute requires extensive testing at great cost. At the same time, most distribution models can only be applied directly to the original feature space of the data. As a result, statistic-based methods struggle to detect outliers in high-dimensional data.
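A minimal statistic-based detector of the kind described above, under the normality assumption, fits the mean and standard deviation of the data and flags points beyond three standard deviations. Function names, threshold, and data are illustrative, not from the book.

```python
# Sketch: statistic-based outlier detection under a normality assumption.
import numpy as np

def three_sigma_outliers(x):
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    # points farther than 3 standard deviations from the mean are outliers
    return np.abs(x - mu) > 3.0 * sigma

data = np.concatenate([np.zeros(99), [50.0]])   # one gross outlier
mask = three_sigma_outliers(data)
```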
(2) Distance-based method
Knorr et al. [9] first proposed a distance-based method and summarized it systematically. An outlier is defined as follows: if a data point in the data set P lies at a distance greater than r from at least a fraction β of the other data points, then the point is a distance-based outlier. Outliers are thus defined by a single global criterion determined by the parameters r and β. This definition contains and extends the idea of distribution-based

approaches and overcomes their main drawback: it remains effective in finding outliers when the data set does not satisfy any standard distribution.
The method does not require a priori knowledge of the data distribution, and its definition covers the outliers detected by statistical methods under the normal distribution, the Poisson distribution, and other distributions. However, both statistic-based and distance-based methods measure the inconsistency of the data from a global perspective, which is not effective when the dataset contains multiple types of distributions or a mixture of subsets with different densities.
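The DB(r, β) definition quoted above translates almost directly into code. The sketch below (illustrative names and parameter values, assuming Euclidean distance) marks a point as an outlier when at least a fraction β of the other points lie farther than r from it.

```python
# Sketch of distance-based DB(r, beta) outlier detection.
import numpy as np

def db_outliers(X, r, beta):
    X = np.asarray(X, dtype=float)
    n = len(X)
    # pairwise Euclidean distances
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    far = (dist > r).sum(axis=1)          # self-distance is 0, never "far"
    # outlier: a fraction >= beta of the OTHER points lies beyond radius r
    return far / (n - 1) >= beta

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
mask = db_outliers(X, r=1.0, beta=0.9)
```

The single global pair (r, β) is exactly what makes the method fail on data with mixed-density subsets, as noted above.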
(3) Density-based method
Breunig et al. [10] first proposed a density-based method, which assigns each data object a local outlier factor (LOF) based on the local density of the object’s nearest neighbors. The density of an object’s neighborhood can be described by the radius of the nearest neighbors containing a fixed number of objects, or by the number of objects contained within a specified radius. The LOF of an object depends on a single parameter, MinPts, the number of nearest neighbors used to define the object’s local neighborhood. A data point with a high LOF value is considered an outlier.
The density-based method detects outliers by comparing the object’s density and
the average density of its nearest neighbors. The density-based method overcomes the
weaknesses of the distance-based method in detecting data sets mixed with different
density subsets and obtains higher detection accuracy.
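A compact NumPy sketch of the LOF idea described above (here `k` plays the role of MinPts; names are illustrative and this is not the original paper's exact pseudocode): for each point the k-distance, reachability distances, and local reachability density (lrd) are computed, and the LOF score is the ratio of the neighbors' average lrd to the point's own lrd, so points in uniform regions score near 1 and isolated points score much higher.

```python
# Sketch of the local outlier factor (LOF) computation.
import numpy as np

def lof(X, k):
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    kdist = d[np.arange(n), nn[:, -1]]     # k-distance of each point
    lrd = np.empty(n)
    for p in range(n):
        # reachability distance reach(p, o) = max(k-distance(o), d(p, o))
        reach = np.maximum(kdist[nn[p]], d[p, nn[p]])
        lrd[p] = 1.0 / reach.mean()        # local reachability density
    # LOF: average lrd of the neighbours relative to the point's own lrd
    return np.array([lrd[nn[p]].mean() / lrd[p] for p in range(n)])
```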
(4) Clustering-based method
Many clustering methods, such as DBSCAN [11], can perform outlier detection. However, their main goal is to produce meaningful clusters, and outlier detection is only completed incidentally; methods that are not specifically optimized for outlier detection have difficulty producing satisfactory detection results. Even so, in many application scenarios clustering-based methods can detect more meaningful outliers than statistic-based and distance-based methods, owing to their ability to extract local characteristics of the data.
(5) Deviation-based method
The deviation-based approach considers objects that deviate from the common feature description of the data as outliers. Deviation-based outlier detection methods are commonly divided into the sequential exception technique and the OLAP data cube technique.

3.3.1.2 Applicability of Early Fault Outlier Detection Methods

For statistic-based methods, normal data approximately follow a Gaussian distribution, but the distribution of fault data is unknown, and semi-supervised learning methods cannot fit the fault data distribution the way supervised learning methods can. Moreover, the feature

space mapped by the kernel function is high-dimensional, which makes it unsuitable for this class of methods.
For distance-based methods, the choice of the parameters r and β strongly affects the detection results. A supervised method can select the best parameter values by learning from labeled samples, but an unsupervised method has difficulty obtaining appropriate values. Meanwhile, in early fault detection, the difference in data structure between normal data and fault data can lead to poor detection results.
For density-based methods, a semi-supervised learning method cannot obtain a priori knowledge of the outlier threshold or of the number of outliers in the dataset, which limits their applicability. Although outlier detection is only an incidental task for clustering methods, their unsupervised learning character makes it possible to enhance their outlier detection capability through algorithmic adjustments. Clustering-based methods are therefore among the most promising analysis methods for early fault detection problems [12].

3.3.2 Semi-supervised Fuzzy Kernel Clustering

3.3.2.1 Semi-supervised Learning

Currently, semi-supervised clustering methods are broadly classified into three categories [13]:
(1) Constraints-based method
The constraints-based method is to guide the clustering process by using labeled data
to finally obtain an appropriate segmentation result.
(2) Distance function-based method
The distance function-based method is to guide the clustering process by a certain
distance function obtained by learning from the labeled data.
(3) Integrating constraints and distance function learning method
Bilenko et al. integrated the above two ideas under one framework based on the
C-means algorithm [14]. Sugato et al. proposed a unified probabilistic model for
semi-supervised clustering [15]. In this section, we study the semi-supervised fuzzy
kernel clustering method based on constraints and distances, which is a variant of the fuzzy C-means algorithm.

3.3.2.2 Semi-supervised Fuzzy Kernel Clustering Method

The semi-supervised fuzzy kernel clustering method uses a small number of known
labeled samples as a guide for the clustering process and implements part of the

supervised behavior of the unsupervised clustering method. The performance of the proposed method is significantly better than that of the simple unsupervised fuzzy kernel clustering method [16].
The flowchart of the semi-supervised fuzzy kernel clustering method is shown in
Fig. 3.17.
The advantage of the semi-supervised fuzzy kernel clustering algorithm is that it can use samples with partially known labels as initial clustering centers, overcoming the sensitivity of the fuzzy C-means algorithm to the selection of initial clustering centers. The fuzzy c-partition of the known labeled samples does not change during the iterative process, acting as a constraint that steers the clustering toward the known classes. However, the method requires several labeled samples for each cluster in advance, which is detrimental to practical applications. The semi-supervised fuzzy kernel clustering method implements partially

Fig. 3.17 The process of the semi-supervised fuzzy kernel clustering method

supervised learning within an unsupervised method, which makes its performance superior to purely unsupervised clustering methods.

3.3.3 Semi-supervised Hypersphere-Based Fuzzy Kernel


Clustering Method

The above semi-supervised fuzzy kernel clustering method requires a sample of known labels for each cluster to ensure its effectiveness.
In early fault detection, many samples of early faults are difficult or impossible
to obtain in advance. Therefore, the application of semi-supervised fuzzy kernel
clustering methods is severely affected by the lack of fault samples for some classes.
To find the fault samples, it is necessary first to determine which samples are normal and to use them as the basis for identifying abnormal samples. This matches the strength of clustering-based outlier detection methods, which separate abnormal data from normal data during the clustering process. A semi-supervised hypersphere-based fuzzy kernel clustering method is therefore proposed to solve this practical engineering problem. The method achieves semi-supervised fuzzy kernel clustering with only a small number of known normal samples and without any known fault samples.
The semi-supervised hypersphere-based fuzzy kernel clustering method follows basically the same idea as the semi-supervised fuzzy kernel clustering method described
above. The method determines the initial clustering centers by known labeled
samples, and the iterative process updates only the fuzzy membership of unknown
labeled samples. The difference is that since there is no labeled fault sample, another
way is needed to find the initial clustering centers for fault clustering. The semi-
supervised hypersphere-based fuzzy kernel clustering method is divided into two
steps as follows:
(1) Find the center of normal clusters from labeled normal samples and identify the
majority of normal samples from unknown labeled samples.
(2) Treat the samples that cannot be judged as normal as potential fault samples, and use them to calculate the initial clustering centers for fault clustering. This solves the problem of being unable to determine the fault clustering centers due to the lack of labeled fault samples.

3.3.3.1 Outlier Detection Based on Minimum Closed Hyperspheres

(1) Minimum closed hypersphere method

The minimal closed hypersphere algorithm belongs to supervised learning and is used to identify abnormal data that do not appear to be generated from the training distribution [17]. It uses the training set to learn the distribution of normal data and then filters future test data with the resulting pattern function.

Fig. 3.18 A hypersphere containing X with a minimum radius

Assume a given training set X = (x_1, ..., x_n). Let its mapping in the associated Euclidean feature space F be Φ(x), with the associated kernel K satisfying K(x, y) = Φ(x)ᵀΦ(y). We seek the center v of the smallest hypersphere containing X, such that the radius r of the sphere is minimized. This optimization problem is expressed as:

v* = arg min_v max_{1≤i≤n} ||Φ(x_i) − v||   (3.33)

Solving the optimization problem:

min_{v,r} r²   (3.34)

s.t. ||Φ(x_i) − v||² = (Φ(x_i) − v)ᵀ(Φ(x_i) − v) ≤ r²,  i = 1, ..., n   (3.35)

The hypersphere (v, r) is a hypersphere containing X with the minimum radius r, as shown in Fig. 3.18.
The hypersphere algorithm has a problem: whenever some training data are not good enough, the obtained radius becomes too large. Ideally, the minimum hypersphere should contain all the training data except some extreme training data.
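Solving Eq. (3.33) exactly requires a quadratic program. As a rough, runnable sketch of the same idea, one can place the center at the feature-space centroid and take the largest distance to it as the radius, with all distances evaluated through the kernel trick so Φ(x) is never formed explicitly. This centroid simplification, the RBF kernel, and all names here are assumptions for illustration, not the book's implementation.

```python
# Rough kernel-trick sketch: centre = feature-space centroid,
# radius = largest distance to it (a simplification of Eqs. 3.33-3.35).
import numpy as np

def rbf_kernel(X, Y, sigma2=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def centroid_radius(X, sigma2=1.0):
    K = rbf_kernel(X, X, sigma2)
    # ||Phi(x_i) - c||^2 = K_ii - 2 * mean_j K_ij + mean_jl K_jl
    d2 = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()
    d2 = np.maximum(d2, 0.0)                 # guard tiny negative round-off
    return np.sqrt(d2), np.sqrt(d2.max())    # per-point distances, radius
```

A test point whose feature-space distance to the centroid exceeds the radius would then be flagged as abnormal.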
(2) Soft minimum closed hypersphere method
The soft minimum hypersphere method is an improved minimum hypersphere method. It weighs two types of losses equally: the loss incurred by excluding a small portion of the data and the loss due to reducing the radius. To implement this strategy, slack variables ξ_i = ξ_i(v, r, x_i) are introduced, defined as follows:

ξ_i = ( ||v − Φ(x_i)||² − r² )₊   (3.36)

Fig. 3.19 Soft minimum closed hypersphere containing most of the data

For points falling inside the hypersphere, ξ_i is zero; for points outside, it measures how far the squared distance to the center exceeds r². Let ξ be the vector consisting of the elements ξ_i, i = 1, ..., n. The parameter C is a trade-off between the two objectives of minimizing the radius and controlling the slack variables. The following is the mathematical description of the soft minimal hypersphere method:

min_{v,r} r² + C||ξ||₁   (3.37)

s.t. ||Φ(x_i) − v||² = (Φ(x_i) − v)ᵀ(Φ(x_i) − v) ≤ r² + ξ_i,
     ξ_i ≥ 0,  i = 1, ..., n   (3.38)

The solution to this problem can also be obtained through the dual Lagrangian function, and the result is illustrated in Fig. 3.19.
The soft minimum hypersphere method requires setting the trade-off parameter, and different parameter values yield different optimization results. Reasonable parameter values are therefore difficult to determine and often embed some human prior knowledge.
(3) Data pre-process
To seek radius minimization while avoiding a manually chosen value of the parameter C, a soft hypersphere method based on a pre-optimized training set is proposed. The method optimizes the training set by eliminating the training data that are not good enough, so that the obtained radius is no larger than actually needed. The resulting hypersphere contains all the training data except a small portion of extreme points, avoiding the problem of assigning the parameter C by hand.
Assume that the training samples are independently and identically distributed.
In the feature space, the estimated center of mass E[Φ(x)] of the training sample set

is calculated and the distance between each known sample and its center of mass is
calculated by Euclidean distance:

d_i = ||Φ(x_i) − E[Φ(x)]||,  i = 1, ..., n   (3.39)

Calculate the mean of the distance E(d) between each training sample and the
center of mass, the deviation of each training sample from the mean σi = di − E(d),
and the standard deviation σx :
σ_x = sqrt( [ (d_1 − E(d))² + (d_2 − E(d))² + ··· + (d_l − E(d))² ] / l )   (3.40)

According to the normal distribution law in probability statistics:


about 68.26% of samples will fall in the interval (E(d) − σx , E(d) + σx ), about
95.44% of samples will fall in the interval (E(d) − 2σx , E(d) + 2σx ), and about
99.73% of samples will fall in the interval (E(d) − 3σ_x, E(d) + 3σ_x). Weighing the actual analysis needs, it is sufficient to ensure that about 95% of the training samples are contained within the hypersphere, so the training samples whose distance deviation exceeds two standard deviations (i.e., σ_i ≥ 2σ_x) are excluded.
The advantages of this method are that it can avoid the excessive radius of the
hypersphere due to some of the training data not being good enough, it does not
require the determination of the parameter C of the soft minimum hypersphere
algorithm, and it can set different confidence probabilities according to the actual
needs.
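The pre-optimization rule of Eqs. (3.39)–(3.40) can be sketched as follows, again using the kernel trick for the feature-space centroid distances. The RBF kernel and all names are illustrative assumptions, not the book's code.

```python
# Sketch of the 2-sigma pre-optimisation (Eqs. 3.39-3.40): drop training
# samples whose feature-space distance to the centroid deviates from the
# mean distance by two standard deviations or more.
import numpy as np

def rbf_kernel(X, Y, sigma2=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def prune_training_set(X, sigma2=1.0):
    K = rbf_kernel(X, X, sigma2)
    # d_i = ||Phi(x_i) - E[Phi(x)]||, via the kernel trick (Eq. 3.39)
    d = np.sqrt(np.maximum(np.diag(K) - 2.0 * K.mean(axis=1) + K.mean(), 0.0))
    dev = d - d.mean()                       # sigma_i = d_i - E(d)
    sigma_x = np.sqrt((dev ** 2).mean())     # standard deviation (Eq. 3.40)
    keep = dev < 2.0 * sigma_x               # exclude sigma_i >= 2 sigma_x
    return X[keep], keep
```

Changing the factor 2 to 1 or 3 selects the other confidence intervals mentioned above.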

3.3.3.2 Semi-supervised Fuzzy Kernel Clustering Method Based


on Minimum Closed Hyperspheres

After the analysis of the soft minimum hypersphere algorithm, let the number of samples determined to be normal be n_l, the number of potential fault samples not included in the hypersphere be n_u, and the total number of samples be n = n_l + n_u. The boundary normal samples are extracted from the set of potential fault samples by semi-supervised fuzzy kernel clustering, and the possible fault samples are clustered together. The hypersphere-based semi-supervised fuzzy kernel clustering method is as follows. Let c be the number of clusters, v_i (i = 1, 2, ..., c) the center of the i-th cluster, and u_ik (i = 1, 2, ..., c; k = 1, 2, ..., n) the membership of the k-th sample in the i-th cluster. The following restrictions should be satisfied:
0 ≤ u_ik ≤ 1,  0 ≤ Σ_{k=1}^{n} u_ik ≤ n
U = { U^l = {u^l_ik} } ∪ { U^u = {u^u_ik} }   (3.41)

where U^l collects the memberships of the labeled normal data and U^u those of the unlabeled (potential fault) data.

Since n_l samples have already been determined to be normal by the hypersphere algorithm, the values u^l_ik can be fixed in advance. Each cluster center is expressed as a linear combination of the mapped samples:

v_i = Σ_{k=1}^{n} β_ik Φ(x_k),  i = 1, 2, ..., c   (3.42)

Then the objective function of the fuzzy kernel clustering algorithm in the feature
space is:
J_m(U, v) = J_m(U, β_1, β_2, ..., β_c)
          = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m || Φ(x_k) − Σ_{l=1}^{n} β_il Φ(x_l) ||²   (3.43)

where β_i = (β_i1, β_i2, ..., β_in)ᵀ, i = 1, 2, ..., c, and m > 1 is a constant. The squared norm || Φ(x_k) − Σ_{l=1}^{n} β_il Φ(x_l) ||² is computed as:

|| Φ(x_k) − Σ_{l=1}^{n} β_il Φ(x_l) ||² = Φ(x_k)ᵀΦ(x_k) − 2 Σ_{l=1}^{n} β_il Φ(x_k)ᵀΦ(x_l)
                                        + Σ_{l=1}^{n} Σ_{j=1}^{n} β_il β_ij Φ(x_l)ᵀΦ(x_j)   (3.44)

Substituting K(x, y) into Eqs. (3.43) and (3.44), we get:

J_m(U, v) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m ( K_kk − 2 β_iᵀ K_k + β_iᵀ K β_i )   (3.45)

s.t. Σ_{i=1}^{c} u_ik = 1,  k = 1, 2, ..., n   (3.46)

where K_ij = K(x_i, x_j), i, j = 1, 2, ..., n; K_k = (K_k1, K_k2, ..., K_kn)ᵀ, k = 1, 2, ..., n; and K = (K_1, K_2, ..., K_n).
Equation (3.45) is optimized under the constraint of Eq. (3.46) to obtain:

u^u_ik = ( 1/(1 − K(x^u_k, v_i)) )^{1/(m−1)} / Σ_{j=1}^{c} ( 1/(1 − K(x^u_k, v_j)) )^{1/(m−1)},
         i = 1, 2, ..., c,  k = 1, 2, ..., n_u   (3.47)

β_i = ( Σ_{k=1}^{n_l} (u^l_ik)^m K⁻¹ K_k + Σ_{k=1}^{n_u} (u^u_ik)^m K⁻¹ K_k ) / ( Σ_{k=1}^{n_l} (u^l_ik)^m + Σ_{k=1}^{n_u} (u^u_ik)^m ),
      i = 1, 2, ..., c   (3.48)

Most of the normal samples among the unlabeled samples can be found by the soft minimum hypersphere method. To reduce the misclassification rate, the boundary normal samples are extracted from the samples not included in the hypersphere, and the initial clustering center of the fault clusters is calculated from the above sample information. The alternating iterative algorithm for hypersphere-based semi-supervised fuzzy kernel clustering is as follows:
(1) Use a modified soft hypersphere algorithm to find mostly normal and potentially
faulty samples based on those known to be normal.
(2) Calculate the number of clusters and initial cluster centers based on the obtained
normal samples c and potential fault samples.
(3) Initialize each coefficient vector β_i, and calculate the kernel matrix K and its inverse matrix K^{-1} for the unlabeled dataset.
(4) Repeat the following two updates until the membership value of each sample stabilizes:
(a) Update the memberships u^u_{ik} of the samples not included in the hypersphere using the current cluster centers, according to Eq. (3.47);
(b) Update the fault cluster centers v_i from the memberships u^u_{ik}, according to Eqs. (3.42) and (3.48).

The overall procedure is illustrated in Fig. 3.20.
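The alternating iteration of Eqs. (3.47) and (3.48) can be sketched as follows. This is a minimal illustration, not the book's code: it assumes a Gaussian kernel K(x, y) = exp(−||x − y||²/σ²) (so that K(x, x) = 1), fixes the labeled normal samples to cluster 0, and uses the fact that K⁻¹K_k is the k-th unit vector, which reduces Eq. (3.48) to normalized m-th powers of the memberships.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # K(x, y) = exp(-||x - y||^2 / sigma^2); note K(x, x) = 1
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def semi_supervised_kfcm(X, labeled_mask, c=2, m=2.0, sigma=2.0, n_iter=60):
    """Alternating iteration of Eqs. (3.47)-(3.48).

    labeled_mask marks samples known to be normal; their memberships are
    fixed to cluster 0 (the u^l part of Eq. (3.41)) and never updated.
    """
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    U = np.full((c, n), 1.0 / c)          # memberships, one row per cluster
    U[:, labeled_mask] = 0.0
    U[0, labeled_mask] = 1.0
    for _ in range(n_iter):
        # Eq. (3.48): because K^{-1} K_k is the k-th unit vector, beta_i
        # reduces to the normalized m-th powers of the memberships.
        Um = U ** m
        beta = Um / Um.sum(axis=1, keepdims=True)
        Kv = beta @ K                     # K(x_k, v_i) = beta_i^T K_k
        # Eq. (3.47): 1 - K(x_k, v_i) plays the role of a kernel distance
        w = (1.0 / np.clip(1.0 - Kv, 1e-12, None)) ** (1.0 / (m - 1.0))
        U_new = w / w.sum(axis=0, keepdims=True)
        U[:, ~labeled_mask] = U_new[:, ~labeled_mask]
    return U
```

With two well-separated groups and the labeled samples anchoring the normal cluster, the unlabeled fault samples accumulate membership in the second cluster, which is the purification behavior described above.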

3.3.4 Transmission Early Fault Detection

See Fig. 3.21.

3.3.4.1 Transmission Bearing Outer Ring Spalling Fault Detection

1. Data acquisition

The cylindrical roller bearing on the output shaft of a Dongfeng SG135-2 transmission was used as the experimental object. The experiment ran the transmission in two states, normal and outer-ring spalling, where the spalling fault was simulated by machining a small pit into the outer-ring raceway. Figure 3.22 shows the faulty bearing components. The transmission was set to 2nd gear; the constant-mesh gear pair at the input has 38/26 teeth, and the 2nd-gear pair has 41/20 teeth.

(1) Transmission operating conditions
Speed: 2400 r/min (input shaft), 1642 r/min (intermediate shaft), 801 r/min (output shaft).
Output torque: 313.7 N·m.
Output power: 25.52 kW.
(2) Sampling parameter settings
Acquired signals: vibration acceleration, vibration velocity.
Test direction: horizontal radial (x), vertical radial (y), and axial (z).

Fig. 3.20 The process of semi-supervised fuzzy kernel clustering based on hyperspheres

Sampling frequency: 40,000 Hz (acceleration), 5000 Hz (velocity).
Anti-aliasing filter: 20,000 Hz (acceleration), 3000 Hz (velocity).
(3) Transmission characteristic frequencies
Shaft rotation frequencies: 40 Hz (input shaft), 27.4 Hz (intermediate shaft), 13.4 Hz (output shaft).
Second-gear meshing frequency: 547.4 Hz.
Constant-mesh gear meshing frequency: 1040 Hz.
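These characteristic frequencies follow directly from the input speed and the tooth counts given above; a quick numerical check (which gear of each pair sits on which shaft is inferred here from the quoted frequencies, not stated explicitly in the text):

```python
# Characteristic frequencies of the SG135-2 transmission in 2nd gear.
# Tooth counts from the test setup: constant-mesh pair 38/26
# (26-tooth gear assumed on the input shaft), 2nd-gear pair 41/20
# (20-tooth gear assumed on the intermediate shaft).

f_input = 2400 / 60.0                  # input shaft rotation: 40 Hz
f_mesh_constant = f_input * 26         # constant-mesh frequency: 1040 Hz
f_intermediate = f_mesh_constant / 38  # intermediate shaft: ~27.4 Hz
f_mesh_2nd = f_intermediate * 20       # 2nd-gear mesh frequency: ~547.4 Hz
f_output = f_mesh_2nd / 41             # output shaft: ~13.4 Hz
```

The computed values reproduce the shaft and mesh frequencies quoted in the text to within rounding.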

Fig. 3.21 Technical route for outlier condition detection of transmissions

Fig. 3.22 Bearing outer ring spalling

Output shaft rolling bearing: model NUP311EN; its parameters and characteristic frequencies are listed in Table 3.4.

2. Signal characteristics analysis of normal operating conditions and bearing outer ring spalling

The vibration acceleration signals were collected in the X-direction at two measurement points under the two conditions, normal and bearing outer ring spalling. The
Table 3.4 Output shaft rolling bearing parameters and characteristic frequencies
Pitch diameter D (mm): 85
Rolling element diameter d0 (mm): 18
Number of rolling elements m: 13
Contact angle α (°): 0
Inner ring passing frequency f_i (Hz): 179.6
Outer ring passing frequency f_o (Hz): 116.8
Rolling element passing frequency f_g (Hz): 51.4
Cage passing frequency f_b (Hz): 10.9
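Table 3.4's passing frequencies follow from the classical rolling-bearing defect-frequency formulas. In the sketch below the shaft rotation frequency (≈22.8 Hz) is inferred so that the computed inner-ring, outer-ring, and rolling-element values reproduce the tabulated ones; it is not stated explicitly in the text, and the tabulated cage value is left unverified here.

```python
from math import cos, radians

def bearing_frequencies(fr, D, d, m, alpha_deg=0.0):
    """Classical rolling-bearing defect frequencies (Hz).

    fr: shaft rotation frequency; D: pitch diameter; d: element diameter;
    m: number of rolling elements; alpha_deg: contact angle.
    """
    r = d / D * cos(radians(alpha_deg))
    bpfi = m / 2 * fr * (1 + r)             # inner-ring passing frequency
    bpfo = m / 2 * fr * (1 - r)             # outer-ring passing frequency
    bsf = D / (2 * d) * fr * (1 - r ** 2)   # rolling-element (spin) frequency
    ftf = fr / 2 * (1 - r)                  # cage frequency
    return bpfi, bpfo, bsf, ftf

# NUP311EN parameters from Table 3.4; the ~22.8 Hz shaft frequency is an
# assumption inferred from the tabulated values.
bpfi, bpfo, bsf, ftf = bearing_frequencies(22.8, D=85, d=18, m=13)
```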

Fig. 3.23 Time domain waveform of normal signal at two measurement points

Fig. 3.24 Time domain waveform of bearing outer ring spalling signal at two measurement points

sampling frequency was set to 3200 Hz with a 4000 Hz low-pass filtering limit; the resulting time domain waveforms are shown in Figs. 3.23 and 3.24.
Demodulation analysis shows that, even in the normal signal, the rolling elements periodically rub against the bearing seat. Consequently, in the demodulation results at the first-order meshing frequency and in the resonance band, the rolling element passing frequency appears as the main spectral line. Owing to the periodic impacts between the gears, the rotation frequencies of the input and intermediate shafts also appear in the demodulation spectrum. Demodulation analysis alone is still not enough to reveal the fault characteristics of the bearing outer ring spalling, because the spalling fault is relatively weak: the modulation energy of the bearing signal is small compared with the transmission energy, so the modulation phenomenon is not obvious. Moreover, some modulation is present even under normal operating conditions.
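Demodulation (envelope) analysis of this kind is commonly implemented with the Hilbert transform; the following is a minimal FFT-based sketch, generic and not the book's exact processing chain:

```python
import numpy as np

def envelope_spectrum(x, fs):
    """Envelope (demodulation) spectrum via an FFT-based Hilbert transform.

    Band energy modulated by a periodic defect shows up as spectral lines
    at the fault characteristic frequency in the envelope spectrum.
    """
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)                       # analytic-signal filter
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)
    env = np.abs(analytic)                # instantaneous envelope
    env = env - env.mean()                # remove DC before the spectrum
    spec = np.abs(np.fft.rfft(env)) / n
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    return freqs, spec
```

For an amplitude-modulated test signal, the envelope spectrum peaks at the modulation frequency rather than at the carrier.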

3.3.4.2 Semi-supervised Fuzzy Kernel Clustering Analysis Based on Hyperspheres

Number of samples: 210 normal samples and 20 bearing outer ring spalling fault samples; sample data length 4 × 1024.
Sample raw feature vector: mean square value, mean value, margin index, outer-ring passing frequency amplitude, and the sum of the amplitudes in the modulation frequency band.
Known-label samples: 26 normal samples.
Unknown-label samples: the remaining 184 normal samples and 20 faulty samples, 204 in total.

Fig. 3.25 Analysis results of bearing outer ring spalling

The standard deviation of each sample's Euclidean distance is calculated in the feature space. One sample that falls outside the 95% confidence bound is eliminated, and the remaining 25 normal samples are used as the known-label set. With the parameter m = 2 and the number of clusters c = 2, the semi-supervised fuzzy kernel clustering method based on hyperspheres is applied. Figure 3.25 shows the analysis results as σ varies from 0.3 to 12.
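The purification of the labeled set (dropping samples outside a 95% confidence bound on the distance distribution) can be sketched as follows. The exact rule is not spelled out in the text; this version assumes distances to the sample mean in the input space rather than the kernel feature space, with 1.96 standard deviations as the approximate 95% bound:

```python
import numpy as np

def screen_labeled_samples(X, conf_k=1.96):
    """Drop labeled samples whose distance to the sample mean exceeds
    mean + conf_k * std of all distances (approx. 95% confidence)."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    keep = d <= d.mean() + conf_k * d.std()
    return X[keep], keep
```

Applied to the 26 labeled normal samples, such a test removes the single distant sample and retains the 25 used above.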
As shown in Fig. 3.25, the number of samples in the detected potential fault set and in the fault clusters obtained by fuzzy kernel clustering decreases as σ increases. When σ changes from 4.2 to 6.0, the fuzzy kernel clustering method separates some normal samples from the set of potential fault samples, purifying the data and improving the fault detection accuracy. When σ changes from 0.3 to 5.2, the fault clusters contain normal samples, and the number of such normal samples decreases as σ increases. When σ changes from 4.2 to 11.2, the number of detected fault samples varies smoothly; in the interval from 5.4 to 11.2 in particular, there are no normal samples in the fault clusters, and the fault detection accuracy is 100%. When σ is greater than 11.2, the number of fault samples detected in the fault cluster decreases with increasing σ until no fault samples are detected, and the number of faulty samples misclassified as normal increases accordingly.
When σ changes from 4.2 to 11.2, normal and fault samples are well separated in the high-dimensional space, and the number of detected fault samples does not decrease sharply as σ increases. This interval is therefore selected as the optional range of the parameter σ for the semi-supervised fuzzy kernel clustering method; its average accuracy reaches 95.5%.
With the same parameter settings, the clustering results of the fuzzy kernel
clustering method without the guidance of known labeled samples are shown in

Table 3.5 The clustering results of the fuzzy kernel clustering method
σ 4 5 6 7 8 9 10 11
Normal data 118 117 116 112 115 117 111 110
Fault data 111 112 113 117 114 112 118 119
Accuracy rate (%) 18 17.9 17.7 17.1 17.5 17.9 17 16.8

Table 3.6 Clustering results of the semi-supervised fuzzy kernel clustering method
σ 4 5 6 7 8 9 10 11
Normal data 206 207 208 210 218 218 218 218
Fault data 29 21 20 20 20 20 20 20
Accuracy rate (%) 69 95.2 100 100 100 100 100 100

Table 3.5, and the corresponding clustering results of the semi-supervised fuzzy
kernel clustering method based on hypersphere are shown in Table 3.6.
Since the bearing outer ring spalling fault is relatively weak, the difference between the faulty and normal samples is small. It is difficult for fuzzy kernel clustering to cluster the fault samples effectively; it simply partitions the sample set into two roughly equal clusters.

3.3.4.3 Transmission Gear Pitting Fault Detection

The semi-supervised fuzzy kernel clustering algorithm based on hyperspheres was further validated using the gear pitting data from Sect. 3.2.
Number of samples: 220 normal samples and 22 gear pitting fault samples; sample data length 4 × 1024.
Sample raw feature vector: mean value, variance, skewness, peak value, and the sum of the amplitudes in the modulation frequency band.
Known-label samples: 30 normal samples.
Unknown-label samples: the remaining 190 normal samples and 22 faulty samples, 212 in total.
The standard deviation of each sample's Euclidean distance is calculated in the feature space. Two samples that fall outside the 95% confidence bound are eliminated, and the remaining 28 normal samples are used as the known-label set. With the parameter m = 2 and the number of clusters c = 2, the semi-supervised fuzzy kernel clustering method based on hyperspheres is applied. Figure 3.26 shows the analysis results as σ varies from 0.3 to 10.
Fig. 3.26 Analysis results of gear pitting and spalling

As shown in Fig. 3.26, the number of samples in the detected potential fault set and in the fault clusters obtained by fuzzy kernel clustering decreases as σ increases. When σ changes from 0.7 to 6.0, the fault clusters contain all of the faulty samples. When σ changes from 1.5 to 5.5, the fuzzy kernel clustering method separates some normal samples from the set of potential fault samples, purifying the data and improving the fault detection accuracy. When σ changes from 0.3 to 3.7, the fault clusters contain normal samples, and their number decreases with increasing σ. In the interval from 3.7 to 6.0 in particular, there are no normal samples in the fault clusters, and the fault detection accuracy is 100%. When σ is greater than 6, the number of fault samples detected in the fault cluster decreases with increasing σ until no fault samples remain, and the number of faulty samples misclassified as normal increases accordingly.
When σ changes from 2 to 6.7, normal and fault samples are well separated in the high-dimensional space, and the number of detected fault samples does not decrease sharply as σ increases. This interval is therefore selected as the optional range of the parameter σ for the semi-supervised fuzzy kernel clustering method; its average accuracy reaches 88.4%.
With the same parameter settings, the clustering results of the fuzzy kernel clustering method without the guidance of known labeled samples are shown in Table 3.7, and the corresponding clustering results of the semi-supervised fuzzy kernel clustering method are shown in Table 3.8.
Since the fault of slight pitting is relatively weak, the difference between the faulty
and normal samples is small. It is difficult for fuzzy kernel clustering to cluster the

Table 3.7 The clustering results of the fuzzy kernel clustering method
σ 2 2.5 3 3.5 4 4.5 5 5.5
Normal data 122 122 124 124 125 125 124 128
Fault data 118 118 116 116 115 115 116 112
Accuracy rate (%) 18.6 18.6 19 19 19.1 19.1 19 20

Table 3.8 Clustering results of the semi-supervised fuzzy kernel clustering method
σ 2 2.5 3 3.5 4 4.5 5 5.5
Normal data 206 207 208 210 218 218 218 218
Fault data 34 33 32 30 22 22 22 22
Accuracy rate (%) 64.7 66.7 68.8 73.3 100 100 100 100

fault samples effectively; it simply partitions the sample set into two roughly equal clusters.

3.3.4.4 Transmission Gear Spalling Fault Detection

1. Data acquisition
The Dongfeng SG135-2 transmission was again used as the experimental object. The experiment ran the transmission in three states: normal, minor gear spalling, and severe spalling with tooth surface deformation, where the spalling was simulated by machining pits into the gear tooth surfaces. The transmission was set to 5th gear; the constant-mesh gear pair at the input has 38/26 teeth, and the 5th-gear pair has 22/42 teeth. Figure 3.27 shows the faulty gear.
(1) Transmission operating conditions
Speed: 1200 r/min (input shaft), 821 r/min (intermediate shaft), 1567 r/min (output shaft).
Output torque: 6.2 N·m.
Output power: 1.01 kW.

Fig. 3.27 Fault gear

(2) Sampling parameter settings
Acquired signals: vibration acceleration, vibration velocity.
Test directions: horizontal radial (x), vertical radial (y), and axial (z).
Sampling frequency: 40,000 Hz (acceleration), 5000 Hz (velocity).
Anti-aliasing filter: 20,000 Hz (acceleration), 3000 Hz (velocity).
(3) Transmission characteristic frequencies
Shaft rotation frequencies: 20 Hz (input shaft), 13.7 Hz (intermediate shaft), 26 Hz (output shaft).
5th-gear meshing frequency: 575.4 Hz; constant-mesh gear meshing frequency: 520 Hz.

2. Normal signal and slight gear spalling


The vibration acceleration signals were collected at the direction of measurement
point 1. The sampling frequency is set to 1000 Hz and the low-pass filtering upper
limit is 1250 Hz, and the generated time domain waveforms are shown in Figs. 3.28,
3.29 and 3.30. It is obvious that the waveforms are roughly the same and the difference
is not obvious.
Compared with the normal signal, severe spalling + tooth surface deformation
fault signal waveform appears more chaotic impact components. The signals all have
obvious waveform characteristics, which are typical of amplitude-modulated signals.
At the same time, the waveform amplitudes are large, which indicates a significant
increase in vibration energy.
In this section, we only analysis the minor spalling fault detection process.

Fig. 3.28 Time domain waveform of normal signal at measurement point 1

Fig. 3.29 Time domain waveform of minor gear spalling signal at measurement point 1

Fig. 3.30 Time domain waveform of severe spalling and tooth surface deformation signal at measurement point 1

Let the analysis frequency be 667 Hz and the low-pass filtering upper limit be 833 Hz. Twenty segments of data (1024 points per segment) were averaged with a Hanning window to obtain the panoramic spectra shown in Figs. 3.31 and 3.32.
In both conditions, the highest spectral line is at the output shaft rotation frequency, and its amplitude for the minor spalling signal is larger than for the normal signal. The normal signal shows the third- and fourth-order harmonics of the rotation frequency, while the minor spalling signal shows the second-, third- and fourth-order harmonics, with amplitudes greater than those of the normal signal. Spectral lines at half the 5th-gear mesh frequency and half the constant-mesh frequency (288 and 260 Hz) are observed in both operating conditions. The minor spalling signal also has a spectral line at the 5th-gear meshing frequency of 575.4 Hz with a small modulated sideband, which does not appear significantly in the normal signal. The spectral characteristics of minor spalling faults are not as obvious as those of serious faults; it should therefore be suspected that an early fault has begun to appear, and further analysis is required.

Fig. 3.31 Panoramic spectrum of normal signal

Fig. 3.32 Panoramic spectrum of minor gear spalling signal
The refined spectra, centered at 574 Hz with a refinement factor of 10, are shown in Figs. 3.33 and 3.34.

Fig. 3.33 Refinement spectrum of the normal signal at 574 Hz

Fig. 3.34 Refinement spectrum of slight gear spalling signal at 574 Hz

The refined spectra show that the normal signal also has a small-amplitude modulated sideband around the 5th-gear meshing frequency; its amplitude is simply too small to be reflected in the panoramic spectrum. A component at half the rotation frequency appears because of a slight misalignment of the shaft introduced during installation. The amplitudes around the meshing frequency are larger for the minor spalling signal than for the normal signal: the first-order sideband of the normal signal is comparable in size to the second- and third-order sidebands of the minor spalling signal, though their absolute values are still small.
Apart from the installation-induced half-rotation-frequency component, both the normal and the minor fault signals show demodulated first- and second-order, and even third-order, spectral lines of the rotation frequency. The amplitudes of the demodulated spectral lines for minor spalling are roughly twice those under normal operating conditions, but their absolute values remain small. Spectrum analysis therefore indicates to some extent that the minor spalling signal has deviated from the normal signal, but only to a small degree. Compared with the spectral characteristics of the severe spalling + tooth surface deformation signal, the fault characteristics are not obvious, and it is difficult to determine from frequency analysis alone whether a fault exists.
3. Semi-supervised fuzzy kernel clustering analysis based on hyperspheres
Number of samples: 210 normal samples and 20 minor gear spalling fault samples; sample data length 4 × 1024.
Sample raw feature vector: mean value, variance, skewness, peak value, and the sum of the amplitudes in the modulation frequency band.
Known-label samples: 26 normal samples.
Unknown-label samples: the remaining 184 normal samples and 20 faulty samples, 204 in total.
The standard deviation of each sample's Euclidean distance is calculated in the feature space. One sample that falls outside the 95% confidence bound is eliminated, and the remaining 25 normal samples are used as the known-label set. With the parameter m = 2 and the number of clusters c = 2, the semi-supervised fuzzy kernel clustering method based on hyperspheres is applied. Figure 3.35 shows the analysis results as σ varies from 0.3 to 18.
Fig. 3.35 Analysis results of slight gear spalling

Figure 3.35 shows that the trend of the curve is basically consistent with that of the simulation data. In the interval of σ from 1.8 to 16.5, the number of detected fault samples changes relatively smoothly, indicating that this interval is a suitable range for parameter selection; within this stable interval, the accuracy of the detected fault clusters reaches 100%. This also confirms that when the separation between clusters is large, the stable interval is correspondingly long.
With the same parameter settings and without the guidance of known-label samples, the clustering results of multiple runs of the fuzzy kernel clustering method are listed in

Table 3.9, and the corresponding clustering results of the hypersphere-based semi-supervised fuzzy kernel clustering method are listed in Table 3.10. The results show that the fuzzy kernel clustering method produces a correct fault clustering only at certain values of σ, and these values follow no regular pattern; at those values the fault samples happen to be well separated from the normal samples. More importantly, when the fuzzy kernel clustering algorithm randomly selects the initial cluster centers, it sometimes obtains good initial centers, so that fault clusters can be distinguished from normal clusters; when the initial centers are chosen poorly, fuzzy kernel clustering still fails to cluster the fault samples effectively and simply divides the sample set into two roughly equal clusters. This indicates that the performance of the fuzzy kernel clustering method is strongly affected by the initialization of the cluster centers.
Although the spectral characteristics of slight spalling are not obvious, the machining of the gear spalling fault was somewhat too deep owing to limited experimental experience, so the fault severity was slightly too high. As a result, the further purification capability of the semi-supervised fuzzy kernel clustering method for potential fault samples was not demonstrated here.

Table 3.9 Analysis results of the fuzzy kernel clustering method
σ 2 4 6 8 10 12 14 16
Normal data 113 209 114 116 119 123 140 146
Fault data 116 20 115 113 110 106 89 83
Accuracy rate (%) 17.2 100 17.4 17.7 18.2 18.9 22.5 24.1

Table 3.10 Clustering results of the semi-supervised fuzzy kernel clustering method
σ 2 4 6 8 10 12 14 16
Normal data 209 209 209 209 209 209 209 209
Fault data 20 20 20 20 20 20 20 20
Accuracy rate (%) 100 100 100 100 100 100 100 100

3.4 Outlier Detection Based on Semi-supervised Fuzzy Kernel Clustering

The Self-Organizing Map (SOM) neural network is an unsupervised training network that can achieve automatic classification of input patterns after training [18]. The
traditional way to visualize SOM training results is the U matrix. However, the traditional SOM method for fault diagnosis visualizes poorly, because the mesh of the output layer is often distorted. Although the GNSOM and DPSOM visualization methods solve this problem well, large datasets often lead to long training times. If the dimensionality of the data is reduced before classification, the training time can be greatly shortened, the accuracy rate improved, and storage space saved. Therefore, before applying neural networks to diagnose faults, the dimensionality of the training data should be reduced as much as possible to simplify the neural network structure.
Linear discriminant analysis (LDA) is an effective dimensionality reduction
method that can transform the original data space into a low-dimensional feature
space to produce an efficient pattern representation. LDA seeks to find a direction
that gives the best separation of the projected data in the least mean square sense.
The semi-supervised SOM studied in this section is implemented by modifying
the algorithm based on the unsupervised SOM and combining it with feature selection
to achieve an improved learning rate and learning effect.
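LDA's projection can be illustrated with a small NumPy sketch (a generic Fisher-discriminant implementation, not tied to any particular toolbox): the projection directions are the leading eigenvectors of S_w⁻¹S_b, where S_w and S_b are the within-class and between-class scatter matrices.

```python
import numpy as np

def lda_fit(X, y, n_components=2):
    """Fisher discriminant directions: maximize between-class scatter
    over within-class scatter via the eigenvectors of Sw^{-1} Sb."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                    # within-class scatter
    Sb = np.zeros((d, d))                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * diff @ diff.T
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]     # leading directions first
    return evecs.real[:, order[:n_components]]
```

Projecting the data onto these directions (X @ W) yields the low-dimensional representation that is fed to the SOM in the LDA-SOM variant discussed below.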

3.4.1 Semi-supervised SOM Fault Diagnosis

3.4.1.1 Introduction to SOM

The SOM neural network [18] is a self-organizing feature mapping scheme with good classification performance, proposed by Kohonen; it is sometimes also called the Kohonen feature mapping network.
The basic idea is that the neurons of the competition layer compete for the opportunity to respond to the input pattern. Eventually only one neuron wins the competition, and the neurons connected to the winning neuron are adjusted in a direction more favorable to its winning. The winning neuron represents the classification of the input pattern.

Fig. 3.36 SOM network structure

Fig. 3.37 Interaction patterns of neurons

The SOM network structure consists of an input layer and a competition layer (output layer), as shown in Fig. 3.36. The number of neurons in the input layer is n, and the number of neurons in the competitive layer is m (m < n). The SOM network is fully connected: each input node is connected to every output node.
SOM networks can map input patterns of arbitrary dimension into one-dimensional or two-dimensional graphs at the output layer while keeping their topology unchanged. SOM networks also have a probability-preserving ability: through repeated learning of the input patterns, the weight vector space can be made to converge to the probability distribution of the input patterns. The overall idea is that, with the winning neuron at the center, excitatory lateral feedback is applied to nearby neurons and inhibitory lateral feedback to distant ones, so that near neighbors stimulate each other and far neighbors inhibit each other. Near neighbors are neurons within a radius of about 50 ~ 500 μm of the neuron emitting the signal; far neighbors are within a radius of about 200 μm ~ 2 mm. Neurons more distant than the far neighbors show weak excitation, as shown in Fig. 3.37.

3.4.1.2 U Matrix Visualization Method

Many methods are available to visualize high-dimensional data by dimensionality-reduction mapping. Linear methods such as principal component analysis (PCA) are computationally cheap and efficient and work well for linear structures. Nonlinear mapping methods, such as Sammon's method [19] and curvilinear component analysis (CCA), can handle nonlinear structures in the data but are computationally intensive. For the training results of SOM networks, the most widely applied visualization method is the U-matrix method, whose core is the calculation of the U matrix.
The general calculation of the U matrix refers to the network structure in Fig. 3.36. The neurons of the competitive layer are arranged as a matrix in a two-dimensional plane. For each neuron, the Euclidean distances between its weight vector and those of all neighboring neurons are calculated, and the average or maximum of these distances (or some function of them) is taken as the "U-value" of the neuron. The U matrix is then formed from the U-values of all competitive layer neurons. When the U matrix is used to visualize the training results of the SOM network, the U-matrix values serve as a third coordinate for the neurons, combined with the topology of the competitive layer: the competitive layer is plotted in three dimensions, and the clustering results are revealed by the distribution of peaks and troughs, or by rendering the third coordinate in grayscale. The U matrix can thus also be viewed as a grayscale-map visualization.
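The U-value computation described above can be sketched as follows (a minimal version assuming a rectangular grid, 4-connected neighbors, and the mean neighbor distance as the U-value):

```python
import numpy as np

def u_matrix(weights):
    """U matrix of an (H, W, dim) grid of SOM weight vectors: each cell is
    the mean Euclidean distance to its 4-connected neighbors."""
    H, W, _ = weights.shape
    U = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < H and 0 <= cc < W:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            U[r, c] = np.mean(dists)
    return U
```

High U-values mark "ridges" between clusters of similar weight vectors; rendering U in grayscale gives the visualization described above.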

3.4.1.3 Semi-supervised SOM

The traditional SOM neural network can also be trained in a supervised manner: the algorithm becomes a supervised SOM if all input patterns are labeled.
The semi-supervised learning method uses labeled and unlabeled samples together. In the semi-supervised SOM, therefore, the labels of some samples in the input pattern A^k are removed, and both labeled and unlabeled samples are fed into the supervised SOM network for training. The steps of the algorithm are as follows.
Let the input pattern of the network be A^k = (a_1^k, a_2^k, …, a_N^k), k = 1, 2, …, p, and the competitive layer neuron vector be B_j = (b_1, b_2, …, b_M), where A^{kl} denotes the labeled data, A^{ko} the unlabeled data, l + o = p, M is the number of competitive layer neurons, and the network connection weights are {W_{ij}}, i = 1, 2, …, N, j = 1, 2, …, M.
(1) Set the connection weights W_{ij} of the network to random values in the interval [0, 1]. Determine the initial value η(0) of the learning rate η(t) (0 < η(0) < 1), the initial value N_g(0) of the neighborhood function N_g(t), and the total number of learning steps T, after which training terminates.
(2) Select an input pattern A^k and normalize it, then normalize the network weight vectors:

$$
A^k = A^k / \| A^k \|, \quad k = 1, 2, \ldots, p \tag{3.49}
$$

$$
W_j = W_j / \| W_j \|, \quad j = 1, 2, \ldots, M \tag{3.50}
$$

(3) Input the selected pattern A^k into the network and calculate the distance between each connection weight vector W_j = (w_{j1}, w_{j2}, …, w_{jN}) and the input pattern A^k:

$$
d_j = \left[ \sum_{i=1}^{N} \big( a_i^k - w_{ij} \big)^2 \right]^{1/2}, \quad j = 1, 2, \ldots, M, \; k = 1, 2, \ldots, p \tag{3.51}
$$

(4) Compare the distances to determine the best matching unit g:

$$
d_g = \min_j d_j, \quad j = 1, 2, \ldots, M \tag{3.52}
$$

(5) Adjust the connection weights of all competitive layer neurons within the neighborhood of the winning neuron, correcting the weights according to:

$$
w_{ij}(t+1) =
\begin{cases}
w_{ij}(t) + \eta(t) N_{gj}(t)\big(a_i^k - w_{ij}(t)\big), & a_i^k \in A^{ko} \\
w_{ij}(t) + \eta(t) N_{gj}(t)\big(a_i^k - w_{ij}(t)\big), & a_i^k \in A^{kl} \text{ and } g \text{ is correctly classified} \\
w_{ij}(t) - \eta(t) N_{gj}(t)\big(a_i^k - w_{ij}(t)\big), & a_i^k \in A^{kl} \text{ and } g \text{ is incorrectly classified}
\end{cases} \tag{3.53}
$$

where j ∈ N_g(t), i = 1, 2, …, N, and 0 < η(t) < 1 is the learning rate at time t.
(6) Input the next learning pattern into the network and return to step (3) until all patterns have been learned.
(7) Update the learning rate η(t) and the neighborhood function N_{gj}(t) according to η(t) = η_0 (1 − t/T), where η_0 is the initial learning rate, t the current learning step, and T the total number of learning steps.
(8) Let t = t + 1 and return to step (2); end the loop when t = T.
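The steps above can be sketched in NumPy. This is a minimal illustration, not the book's implementation: the grid size, the Gaussian neighborhood, the shrinking radius, and the per-neuron class vote used to decide whether the winner "correctly classifies" a labeled sample (as required by Eq. (3.53)) are all assumptions of this sketch.

```python
import numpy as np

def train_semisupervised_som(X, labels, grid=(5, 5), T=100, eta0=0.5, seed=0):
    """Semi-supervised SOM sketch following steps (1)-(8).

    labels[k] is the class of sample k, or -1 for unlabeled samples.
    Each output neuron accumulates class votes so that Eq. (3.53) can
    decide whether the winner classified a labeled sample correctly.
    """
    rng = np.random.default_rng(seed)
    M = grid[0] * grid[1]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)     # Eq. (3.49)
    W = rng.random((M, X.shape[1]))                       # step (1)
    coords = np.array([(r, c) for r in range(grid[0]) for c in range(grid[1])])
    votes = np.zeros((M, int(labels.max()) + 1))
    for t in range(T):
        eta = eta0 * (1 - t / T)                          # step (7)
        radius = max(1.0, (max(grid) / 2) * (1 - t / T))  # shrinking neighborhood
        W /= np.linalg.norm(W, axis=1, keepdims=True)     # Eq. (3.50)
        for k in rng.permutation(len(Xn)):
            d = np.linalg.norm(W - Xn[k], axis=1)         # Eq. (3.51)
            g = int(np.argmin(d))                         # Eq. (3.52)
            h = np.exp(-((coords - coords[g]) ** 2).sum(1) / (2 * radius ** 2))
            sign = 1.0
            if labels[k] >= 0:                            # Eq. (3.53)
                votes[g, labels[k]] += 1
                if votes[g].argmax() != labels[k]:        # winner disagrees
                    sign = -1.0                           # push weights away
            W += sign * eta * h[:, None] * (Xn[k] - W)
    return W, votes
```

After training, labeled samples of different classes win in different regions of the map, while unlabeled samples are placed by ordinary competitive learning.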
Figure 3.38 shows the classification results of the Iris dataset using semi-supervised SOM and semi-supervised LDA-SOM. Thirty of the 50 samples in each class are selected as labeled samples and the remaining 20 as unlabeled samples; all 50 samples per class are then input into the network simultaneously for training. The training results are shown in Fig. 3.38, where (a) is the U matrix of the semi-supervised SOM, (b) shows the labels of the semi-supervised SOM classification results, (c) is the U matrix of the semi-supervised LDA-SOM, and (d) shows the labels of the semi-supervised LDA-SOM classification results.

Fig. 3.38 Classification results of semi-supervised SOM on the Iris dataset

In the figure, "Ο", "Δ" and "⛛" represent labeled samples of Setosa, Versicolor, and Virginica, while "•", "*" and "×" represent the corresponding unlabeled samples. The classification results show that the semi-supervised SOM still separates the three classes relatively well, despite 40% of the input samples being unlabeled. It can also be seen that the semi-supervised SOM based on LDA feature selection gives a more intuitive classification than the plain semi-supervised SOM.

3.4.2 Semi-supervised GNSOM Fault Diagnosis

3.4.2.1 Introduction to GNSOM

To present the structure of data more naturally, Rubio and Gimenez et al. proposed a new SOM-based data visualization algorithm, GNSOM (Grouping Neuron SOM) [20]. Its algorithm is similar to that of the original SOM. In the original SOM network, the connections to the input layer are represented by weight vectors, which are updated during training, while the neurons of the output layer are fixed on a regular grid. In GNSOM, by contrast, the weight vectors are held fixed, and it is the positions of the output-layer neurons that are updated. It should be added that the result obtained by this method only suggests the spatial topology of the input data: the resulting map represents just one spatial distribution of the input vectors, and its coordinates have no specific physical meaning.
The specific GNSOM algorithm is detailed in Ref. [20].

3.4.2.2 Semi-supervised GNSOM

Since GNSOM is obtained by adding one position-adjustment step to the original
SOM, the semi-supervised GNSOM algorithm can easily be obtained by referring to
the semi-supervised SOM. Let pij(x, y) denote the position of competitive-layer
neuron Bj on the output layer, M the distance between neighboring neurons on the
output (competitive) layer, Mx the number of neurons in the x-direction, and My the
number of neurons in the y-direction. The specific implementation steps are as follows.
(1) Initialize the neuron positions; the rest is the same as semi-supervised SOM step 1.
(2) The same as semi-supervised SOM step 2.
(3) The same as semi-supervised SOM step 3.
(4) The same as semi-supervised SOM step 4.
(5) Correct the connection weights of the SOM according to Eq. (3.53), and additionally
adjust the positions of the competitive-layer neurons according to the following
Eq. (3.54):

             ⎧ pij(t) + η(t)Ngj(t)(aik − wij(t)),   when aik ∈ Ako
pij(t + 1) = ⎨ pij(t) + η(t)Ngj(t)(aik − wij(t)),   when aik ∈ Akl and g is correctly classified
             ⎩ pij(t) − η(t)Ngj(t)(aik − wij(t)),   when aik ∈ Akl and g is incorrectly classified
                                                                                   (3.54)

(6) The same as semi-supervised SOM step 6.


Fig. 3.39 Classification results of semi-supervised GNSOM on the Iris dataset

(7) The same as semi-supervised SOM step 7.
(8) The same as semi-supervised SOM step 8.
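A minimal sketch of the three branches of the position-update rule in Eq. (3.54) (the function name and scalar arguments are illustrative assumptions; in practice p, w and a would be vectors):

```python
def gnsom_position_update(p, w, a, eta, N, labeled, correct):
    """One semi-supervised GNSOM position update, following Eq. (3.54).

    p, w    : position p_ij(t) and weight w_ij(t) of the neuron
    a       : input sample a_i^k
    eta, N  : learning rate eta(t) and neighborhood value N_gj(t)
    labeled : True if a is a labeled sample (a in A_l^k)
    correct : True if the winner g classifies the labeled sample correctly
    """
    step = eta * N * (a - w)
    if labeled and not correct:
        return p - step  # misclassified labeled sample: push away
    return p + step      # unlabeled, or correctly classified: pull closer
```

The sign flip on misclassified labeled samples is what injects the label information into the otherwise unsupervised update.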
Figure 3.39 shows the classification results of the Iris dataset using semi-
supervised GNSOM and semi-supervised LDA-GNSOM. Thirty of the 50 sample
data from each class are selected as labeled samples and the remaining 20 as unla-
beled samples, and then these 50 samples are simultaneously input into the network
for training. The training results are shown in Fig. 3.39, where Fig. 3.39a shows the
labels of the semi-supervised GNSOM classification results, and Fig. 3.39b shows
the semi-supervised LDA-GNSOM classification results.
In the figure, “Ο”, “Δ” and “⛛” represent labeled samples of the Setosa, Versicolor, and Virginica classes, and “•”, “*” and “×” represent the corresponding unlabeled samples. As can be seen from the classification results, the semi-supervised GNSOM still separates the three classes relatively well, despite 40% of the input samples carrying no labels. It can also be found that the semi-supervised LDA-GNSOM based on LDA feature selection gives more intuitive classification results than the semi-supervised GNSOM.
However, the coordinate range of the x-axis in Fig. 3.39a is only [3.1, 3.45] and that in Fig. 3.39b is only [4.1, 4.24]; this is because the GNSOM is based on the Himberg contraction model and therefore suffers from the over-contraction problem.

3.4.3 Semi-supervised DPSOM Fault Diagnosis

3.4.3.1 Introduction to DPSOM

Shao and Huang proposed a new SOM-based data visualization algorithm, DPSOM
(Distance-Preserving SOM) [21], which can adaptively adjust the positions of
neurons according to the corresponding distance information, thus achieving an
intuitive presentation of the distances between data. In addition, the algorithm
automatically avoids the problem of excessive neuron contraction, which greatly
improves the controllability of the algorithm and the quality of data visualization.
In the original SOM algorithm, the neurons as data representatives are fixed on
a low-dimensional conventional grid, and the topological ordering of neurons on
this grid can be eventually achieved by using the neighborhood learning approach.
However, in the DPSOM algorithm, the positions of the neurons on the low-
dimensional grid are no longer fixed and can be adjusted according to the
corresponding distances in the feature space and the low-dimensional space.
Compared with the original SOM algorithm, the DPSOM algorithm has only one
additional position-adjustment operation, and since this step does not change the
learning process of the original SOM, DPSOM retains the same good robustness as
the original SOM, something the dynamic SOM algorithms [22] cannot match. In
addition, this extra operation requires little additional computation and, because it
no longer uses the Himberg contraction model, it avoids the problem of excessive
neuron contraction without needing additional control parameters.
In a similar way to the GNSOM method, the results obtained by this method
only suggest the spatial topology of the input data, so the resulting map represents
only one spatial distribution of the input vectors and its coordinates do not
correspond to a specific physical meaning.
The specific algorithm of DPSOM can be understood in detail in Ref. [21].

3.4.3.2 Semi-supervised DPSOM

The semi-supervised DPSOM differs from the semi-supervised GNSOM only in the
different position adjustment rules for the competing layer neurons. Referring to the
semi-supervised GNSOM, the semi-supervised DPSOM can be easily obtained, and
its position adjustment rules are as follows:
            ⎧ pk(t) + η(t)(1 − δvk/dvk)(pv(t) − pk(t)),   when aik ∈ Ako
pk(t + 1) = ⎨ pk(t) + η(t)(1 − δvk/dvk)(pv(t) − pk(t)),   when aik ∈ Akl and g is correctly classified
            ⎩ pk(t) − η(t)(1 − δvk/dvk)(pv(t) − pk(t)),   when aik ∈ Akl and g is incorrectly classified
                                                                                   (3.55)

The semi-supervised DPSOM algorithm is obtained by simply replacing Eq. (3.54) in the semi-supervised GNSOM algorithm with Eq. (3.55).
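Likewise, Eq. (3.55) can be sketched as follows (illustrative names and scalar arguments; the positions would be 2-D vectors in practice):

```python
def dpsom_position_update(p_k, p_v, delta_vk, d_vk, eta, labeled, correct):
    """One semi-supervised DPSOM position update, following Eq. (3.55).

    p_k, p_v : positions of neuron k and of the winning neuron v
    delta_vk : target distance between v and k (from the feature space)
    d_vk     : current distance between p_v and p_k on the output grid
    """
    step = eta * (1.0 - delta_vk / d_vk) * (p_v - p_k)
    if labeled and not correct:
        return p_k - step
    return p_k + step
```

Because the step is scaled by (1 − δvk/dvk), the motion stops once the grid distance matches the target distance, which is how DPSOM avoids over-contraction without extra control parameters.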
Figure 3.40 shows the classification results of the Iris dataset using semi-supervised DPSOM and semi-supervised LDA-DPSOM. Thirty of the 50 sample data from each class are selected as labeled samples and the remaining 20 as unlabeled samples, and then these 50 samples are simultaneously input into the network for training. The training results are shown in Fig. 3.40, where Fig. 3.40a shows the semi-supervised DPSOM classification results, and Fig. 3.40b shows the semi-supervised LDA-DPSOM classification results.

Fig. 3.40 Classification results of semi-supervised DPSOM on the Iris dataset
In the figure, “Ο”, “Δ” and “⛛” represent labeled samples of the Setosa, Versicolor, and Virginica classes, and “•”, “*” and “×” represent the corresponding unlabeled samples. As can be seen from the classification results, the semi-supervised DPSOM still separates the three classes relatively well, despite 40% of the input samples carrying no labels. It can also be found that the semi-supervised LDA-DPSOM based on LDA feature selection gives more intuitive classification results than the semi-supervised DPSOM.
It can also be seen from the figure that the coordinate range of the x-axis in Fig. 3.40a is [0.5, 5.5] and that in Fig. 3.40b is [1.5, 5.5]; this is because the DPSOM is no longer based on the Himberg contraction model and thus avoids the over-contraction problem of the GNSOM.

3.4.4 Example Analysis

3.4.4.1 Gear Fault

The test data were obtained from a gear failure test in the Laborelec laboratory.
The data were collected from a 41-tooth helical gear in three operating modes,
namely normal, slight spalling, and slight wear of the gear surface, as shown in
Fig. 3.41. In this experiment, the gear parameters and test conditions are: modulus
5 mm, helix angle, center distance 200 mm, transmission ratio 37/41, input speed
670 r/min, and sampling frequency 10,240 Hz. The time domain and
frequency domain waveforms are shown in Fig. 3.42, where Fig. 3.42a–c are the
time domain diagrams of normal, slight spalling, and slight wear states, respectively,
and Fig. 3.42d–f are the frequency domain diagrams of normal, slight spalling and
slight wear states, respectively. From Fig. 3.42b, it can be seen that the amplitude
of the spalling acceleration signal increases compared with that of the normal state
in Fig. 3.42a, but there is no obvious difference between the two, and the two states
are difficult to distinguish in the time domain. Fig. 3.42c shows that when the gear
wears uniformly, the signal exhibits a clear impact band and a further increase in
amplitude, indicating a significant increase in vibration energy. Overall, the three
states are difficult to distinguish using conventional signal processing methods, so
intelligent diagnosis methods are needed.
Eleven commonly used statistical feature parameters are selected to form the
original set of gear state features for describing gear failure modes, which are: mean
square value, kurtosis, mean value, variance, skewness, peak value, root mean square
amplitude, peak indicator, waveform indicator, pulse indicator, and margin index.
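Under one common set of definitions (the book's exact formulas for these indicators may differ slightly, e.g. biased versus unbiased variance), the eleven features can be computed as:

```python
import numpy as np

def statistical_features(x):
    """Eleven statistical features of a signal segment, under one common
    set of definitions (the book's exact formulas may differ)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    var = x.var()
    ms = np.mean(x ** 2)                       # mean square value
    rms = np.sqrt(ms)
    peak = np.max(np.abs(x))
    sra = np.mean(np.sqrt(np.abs(x))) ** 2     # square-root amplitude
    abs_mean = np.mean(np.abs(x))
    return {
        "mean_square": ms,
        "kurtosis": np.mean((x - mean) ** 4) / var ** 2,
        "mean": mean,
        "variance": var,
        "skewness": np.mean((x - mean) ** 3) / var ** 1.5,
        "peak": peak,
        "root_mean_square_amplitude": sra,
        "peak_indicator": peak / rms,          # crest factor
        "waveform_indicator": rms / abs_mean,  # shape factor
        "pulse_indicator": peak / abs_mean,    # impulse factor
        "margin_index": peak / sra,            # clearance factor
    }
```

Each vibration segment is then represented by this 11-dimensional feature vector before feature selection and classification.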
In this experiment, each group of faults contains 54 samples (34 samples with
labels and 20 samples without labels), for a total of 162 samples. The specific semi-
supervised SOM algorithm schematic block diagram is shown in Fig. 3.43.
In the figure, “Ο”, “Δ” and “⛛” represent labeled data of the normal, spalling, and abrasion states, and “•”, “*” and “×” represent the corresponding unlabeled data. The classification results of the semi-supervised GNSOM and semi-supervised LDA-GNSOM are shown in Fig. 3.44. From the figure, it is obvious that the semi-supervised LDA-GNSOM method based on feature selection is significantly better than the semi-supervised GNSOM; however, as mentioned before, the GNSOM exhibits the over-contraction problem.
Fig. 3.41 Experimental pictures

Fig. 3.42 Time domain and frequency domain signals of the faulty gear

Fig. 3.43 Semi-supervised SOM schematic

Figure 3.45 shows the classification results using the semi-supervised LDA-DPSOM proposed in this section, where Fig. 3.45a, b show the classification results of semi-supervised DPSOM and semi-supervised LDA-DPSOM, respectively. From this figure, it is obvious that the semi-supervised LDA-DPSOM method based on feature selection is significantly better than the semi-supervised DPSOM.
Table 3.11 shows the average correctness, average elapsed time, average quan-
tization error, and average topology error obtained from the simulations using
semi-supervised GNSOM and semi-supervised DPSOM.
Fig. 3.44 Classification results of semi-supervised GNSOM on the Laborelec dataset

Fig. 3.45 Classification results of semi-supervised DPSOM on the Laborelec dataset

Table 3.11 Experimental results of semi-supervised SOM on the Laborelec dataset

                             | Semi-supervised | Semi-supervised | Semi-supervised | Semi-supervised
                             | GNSOM           | LDA-GNSOM       | DPSOM           | LDA-DPSOM
Average correctness (%)      | 72.1            | 86.0            | 73.9            | 86.3
Average elapsed time (s)     | 1.3806          | 0.6430          | 1.0345          | 0.6439
Average quantization error   | 0.263           | 0.041           | 0.263           | 0.041
Average topology error       | 0.046           | 0.048           | 0.043           | 0.038

From Table 3.11, it can be seen that the method with the highest correct rate is the semi-supervised LDA-DPSOM with 86.3%; the least time-consuming method is the semi-supervised LDA-GNSOM with 0.6430 s; the smallest average quantization error is shared by the semi-supervised LDA-GNSOM and semi-supervised LDA-DPSOM, both with 0.041; and the smallest average topology error belongs to the semi-supervised LDA-DPSOM, with 0.038. It can also be seen from the table that the self-organizing map with LDA feature selection not only speeds up the operation (by 53.43% for semi-supervised LDA-GNSOM and 37.76% for semi-supervised LDA-DPSOM) but also improves the correct rate (by 13.9 percentage points for semi-supervised LDA-GNSOM and 12.4 for semi-supervised LDA-DPSOM) and reduces the average quantization error (by 84.41% for both semi-supervised LDA-GNSOM and semi-supervised LDA-DPSOM). In addition, the average topological error varies with the learning method but basically does not change much.

3.4.4.2 Bearing Fault

The test takes the NU205M bearing from a bearing manufacturer as the research
object. The bearing failure simulation experiment is carried out on a rotating
machinery fault simulation test bench, with the bearing operated in three states:
normal, inner-ring pitting, and inner-ring cracking. The parameters of the rolling
bearing and its characteristic frequencies are shown in Table 3.12. Three
measurement points are arranged, with the sensors mounted on the bearing seat.
The measurement point arrangement, coordinate system, fault simulation test
bench, and test system are shown in Fig. 3.46.
Figure 3.47 shows the time–frequency diagram of bearing failure. Figure 3.47a–c
are the time domain diagrams of normal, inner ring line cutting 0.2 mm, and inner
ring pitting 4 mm state respectively, and Fig. 3.47d–f are the frequency domain
diagrams of normal, inner ring line cutting 0.2 mm, and inner ring pitting 4 mm
state respectively. The impact components are hardly visible in Fig. 3.47a, c, and the
time domain signals of both are extremely similar, while the impact components are
obvious in Fig. 3.47b.
In Fig. 3.47d–f, the peaks all appear near 530.5 Hz, the natural frequency of the
outer ring, and the modulation frequency is 7.4 Hz (530.5 Hz − 523.1 Hz = 7.4 Hz),
extremely close to the calculated cage passing frequency (7.6 Hz). That is,
modulation appears with the outer-ring natural frequency as the carrier and the
cage passing frequency as the modulating frequency, although the modulation
phenomenon is not obvious.

Table 3.12 Rolling bearing parameters and characteristic frequencies

Pitch diameter D (mm)                          38
Rolling element diameter d0 (mm)               6.5
Number of rolling elements m                   12
Contact angle α (°)                            18.33
Inner ring passing frequency fi (Hz)           128.82
Outer ring passing frequency fo (Hz)           98.78
Rolling element passing frequency fg (Hz)      52.02
Cage passing frequency fb (Hz)                 7.60
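The frequencies in Table 3.12 follow the standard bearing kinematic relations. Since the shaft speed of this bearing test is not stated here, the sketch below takes the rotation frequency fr as an assumed input purely for illustration, so its outputs need not reproduce the table's entries exactly:

```python
import math

def bearing_frequencies(D, d, m, alpha_deg, fr):
    """Standard rolling-bearing characteristic frequencies.

    D  : pitch diameter (mm)        d  : rolling element diameter (mm)
    m  : number of rolling elements
    alpha_deg : contact angle (degrees)
    fr : shaft rotation frequency (Hz) -- an assumed input here
    """
    r = (d / D) * math.cos(math.radians(alpha_deg))
    ftf = fr / 2.0 * (1.0 - r)                 # cage passing frequency
    bpfo = m * ftf                             # outer-ring passing frequency
    bpfi = m * fr / 2.0 * (1.0 + r)            # inner-ring passing frequency
    bsf = D / (2.0 * d) * fr * (1.0 - r ** 2)  # rolling-element (spin) frequency
    return ftf, bpfo, bpfi, bsf
```

A useful sanity check on any such table is that the inner- and outer-ring passing frequencies sum to m·fr.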

Fig. 3.46 Bearing fault simulation test

Fig. 3.47 Time domain and frequency domain figure

Figure 3.48 shows the classification results using the semi-supervised LDA-
GNSOM proposed in this section, where Fig. 3.48a, b show the classification results
of semi-supervised GNSOM and semi-supervised LDA-GNSOM, respectively.
In the figure, “Ο”, “Δ” and “⛛” represent labeled data of the normal, inner-ring cracking, and inner-ring pitting states, and “•”, “*” and “×” represent the corresponding unlabeled data.

Fig. 3.48 Results of the semi-supervised GNSOM classification of bearings

As can be seen in Fig. 3.48a, although the semi-supervised GNSOM improves the visualization results slightly, the normal samples are very close to the inner-ring pitting samples and the two are almost mixed. From Fig. 3.48b, it is obvious that the semi-supervised LDA-GNSOM method based on LDA feature selection is significantly better than the semi-supervised GNSOM, with the output-layer neurons divided into three clusters.
Figure 3.49 shows the classification results of the bearing data using the semi-supervised LDA-DPSOM proposed in this section. It can be seen from Fig. 3.49a that although the neurons in the output layer have been adjusted according to the corresponding distances, the classification effect is not obvious, whereas the semi-supervised LDA-DPSOM algorithm with LDA feature selection separates the three classes well: the intra-class distances are greatly reduced and the distinction between classes is more obvious.

Fig. 3.49 Results of the semi-supervised DPSOM classification of bearings



Table 3.13 Experimental results of semi-supervised SOM on the bearing dataset

                             | Semi-supervised | Semi-supervised | Semi-supervised | Semi-supervised
                             | GNSOM           | LDA-GNSOM       | DPSOM           | LDA-DPSOM
Average correctness (%)      | 97.2            | 99.6            | 97.5            | 97.8
Average elapsed time (s)     | 5.2646          | 2.1268          | 4.9069          | 1.9136
Average quantization error   | 0.274           | 0.038           | 0.274           | 0.041
Average topology error       | 0.042           | 0.185           | 0.051           | 0.161

Table 3.13 shows the average correct rate, average time consumed, average quanti-
zation error, and average topological error obtained using semi-supervised GNSOM
and semi-supervised DPSOM simulations.
From Table 3.13, it is obvious that the method with the highest correct rate is
the semi-supervised LDA-GNSOM with 99.6%; the least time-consuming method is
the semi-supervised LDA-DPSOM with 1.9136 s; the method with the lowest average
quantization error is the semi-supervised LDA-GNSOM with 0.038; and the method
with the lowest average topology error is the semi-supervised GNSOM with 0.042.
The table also shows that the self-organizing map with LDA feature selection not
only speeds up the operation (by 59.59% for semi-supervised LDA-GNSOM and
61.01% for semi-supervised LDA-DPSOM) and improves the correct rate (by 2.4
percentage points for semi-supervised LDA-GNSOM and 0.3 for semi-supervised
LDA-DPSOM), but also reduces the average quantization error (by 86.13% for the
semi-supervised LDA-GNSOM and 85.04% for the semi-supervised LDA-DPSOM).
However, the average topological error has increased.

3.5 Relevance Vector Machine Diagnosis Method

3.5.1 Introduction to RVM

The relevance vector machine (RVM) is based on Bayesian estimation theory and the support vector machine. To understand it, the basics of Bayesian estimation and the classification principle of the support vector machine are introduced first, followed by the relevance vector machine itself.

3.5.1.1 Bayesian Estimates

Bayes' theorem originates from An Essay Towards Solving a Problem in the
Doctrine of Chances (1763), published by the British scholar Thomas Bayes in the
proceedings of the Royal Society; the essay presents a method of probabilistic
inference for the parameter of a binomial distribution. Bayes assumed that the
parameter was uniformly distributed on the unit interval, and his method of
inferring binomial parameters came to be known as Bayes' theorem; it has since
been extended to applications far beyond the binomial distribution, to arbitrary
statistical distributions.

3.5.1.2 Introduction to Relevance Vector Machines

The relevance vector machine (RVM) [23] is a sparse model with the same functional
form as the SVM; it was proposed by Tipping in 2000 and is trained under Bayesian
theory. It uses Bayesian inference, has good generalization ability, yields a sparser
solution, requires no manual tuning of hyperparameters, and is easy to implement.
When applied to classification problems, it can also give a probability measure for
class membership. Therefore, applying RVM to fault detection and classification is
a meaningful attempt.

The specific algorithms for RVM are detailed in Ref. [23].

3.5.2 RVM Classifier Construction Method

3.5.2.1 Feature Selection Method

Given a set of measurements, dimensionality can be reduced in essentially two
different ways. The first approach is to identify variables that contribute little to
the classification: in discrimination problems, one can ignore variables that have
little effect on class separability. To do this, all that needs to be done is to select
a subset of features from the measurements (the number of features to retain must
also be determined); this approach is called feature selection in the measurement
space, or simply feature selection (see Fig. 3.50a). The second approach is to find a
transformation from the measurement space to a feature space of lower dimension.
This approach is called feature selection in the transformed space, or feature
extraction (see Fig. 3.50b). The transformation can be a linear or nonlinear
combination of the original variables, and it can be supervised or unsupervised.
Under supervised conditions, the task of feature extraction is to find a
transformation that maximizes the separability of the given categories. This section
first discusses feature selection methods.

(1) Separability evaluation index



Fig. 3.50 Dimension compression: a) feature selection; b) feature extraction

The feature samples correspond to their fault modes and are located in different
regions of the feature space, so in principle samples of different classes can be
separated. If the between-class dispersion of the samples is large and the
within-class dispersion is small, the separability of the samples is good. It can be
seen that the “distance” between the samples reflects their separability.
The denominator of Eq. (3.21) is the within-class divergence index Scw from
Sect. 3.2.2.1; it represents the average within-class distance. The numerator is the
between-class divergence index Scb, which represents the average distance between
classes.

To compare the within-class divergence with the between-class divergence, Scw
is subtracted from the numerator of Eq. (3.21), giving:

Jbw(x)* = (Scb − Scw)/Scw                                  (3.56)

To normalize this index, the separability evaluation index is designed on the
basis of Eq. (3.56) as follows:

Jb(x) = (Scb − Scw) / √(Scw² + (Scb − Scw)²)               (3.57)

On the one hand, this index can be used for feature selection. First, the Jb value
of each feature is calculated, comparing its between-class and within-class
divergence; features with Jb ≤ 0 are eliminated first, and the remaining features are
then used for feature-combination calculation and feature selection. From the
expression of Eq. (3.57), the closer Jb is to 1, the better the selected feature vector.
On the other hand, the index can quickly measure the validity of feature extraction
on small samples, and can thus guide feature selection in pattern recognition and
classification, which is of great significance for feature extraction.
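Eq. (3.57) translates directly into code (the function and argument names are illustrative):

```python
import math

def separability_index(S_cb, S_cw):
    """Normalized separability index J_b of Eq. (3.57).

    S_cb : between-class divergence (average between-class distance)
    S_cw : within-class divergence (average within-class distance)
    Lies in (-1, 1); values close to 1 indicate good separability.
    """
    diff = S_cb - S_cw
    return diff / math.sqrt(S_cw ** 2 + diff ** 2)
```

For example, when the between-class divergence is double the within-class divergence, Jb = 1/√2 ≈ 0.71, and Jb approaches 1 as the between-class divergence dominates.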

3.5.2.2 Feature Extraction

1. Kernel direct discriminant analysis

Kernel Direct Discriminant Analysis (KDDA) was proposed by Lu in 2003. Its idea
is to apply the kernel method to direct linear discriminant analysis (D-LDA) [24],
performing linear discriminant analysis in the feature space so that the data become
linearly separable there. The KDDA method has been used successfully in face
recognition [25]; this section applies it to gear fault feature extraction.

The basic idea of kernel direct discriminant analysis is to map the original space
into a higher-dimensional feature space F by a nonlinear function, and then apply
improved Fisher linear discriminant analysis in that space. At the same time, the
kernel function is introduced so that the inner product of any two elements of F can
be replaced by the kernel function value of the corresponding input-space elements;
the explicit mapping and the space F therefore never need to be constructed. The
“small sample size problem” is effectively solved by using an improved Fisher
discriminant. The detailed algorithm and derivation of KDDA can be found in
Refs. [24, 25].
2. Feature extraction based on KDDA
Feature extraction for an arbitrary test sample z can be realized by calculating its
projection onto the normalized discriminant eigenvectors.
Based on the above analysis, the steps of feature extraction using KDDA can be
summarized as follows:
(1) For the given training samples {zi}, i = 1, …, L, calculate the L × L kernel
function matrix.
(2) Normalize the kernel matrix and calculate ΦbT Φb.
(3) Solve the eigenvalue equation (Φb ΦbT)(Φb ei) = λi (Φb ei) for its eigenvalues
and eigenvectors.
(4) Form UT SW T H U and find the matrix that diagonalizes it.
(5) Calculate Θ, completing the training of the KDDA feature extraction
algorithm.
(6) For an input sample z, calculate the kernel vector γ(φ(z)).
(7) The optimal discriminant feature vector of z is obtained from y = Θ · γ(φ(z)).
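Steps (1) and (2) can be sketched as follows (assuming an RBF kernel and the usual feature-space centering; the remaining eigen-decomposition steps are omitted and can be found in Ref. [24]):

```python
import numpy as np

def rbf_kernel_matrix(Z, gamma=1.0):
    """Step (1): the L x L RBF kernel matrix for training samples Z (L x n)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-gamma * d2)

def center_kernel(K):
    """Step (2): centre the kernel matrix in feature space,
    K_c = K - 1K - K1 + 1K1, where 1 is the (1/L)-filled L x L matrix."""
    L = K.shape[0]
    one = np.full((L, L), 1.0 / L)
    return K - one @ K - K @ one + one @ K @ one
```

The centered kernel matrix plays the role of the explicitly mapped, mean-removed data, without ever forming the mapping φ or the space F.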

3. Advantages of kernel direct discriminant analysis

KDDA is a feature extraction method that combines direct linear discriminant anal-
ysis (D-LDA) and kernel functions. The advantages of KDDA are summarized as
follows:

(1) The kernel function is introduced so that nonlinear or complex problems in
the input space become linear once mapped to the high-dimensional space,
allowing linear discriminant analysis in the feature space. It is not hard to see
that, when a linear kernel is used, D-LDA is a special case of KDDA.
(2) KDDA effectively solves the small-sample-size problem in the high-dimensional
feature space through direct discriminant analysis, making full use of all
information in the optimal discriminant vectors, including the information both
inside and outside the null space of SW T H.
(3) If only kernel discriminant analysis is used, without the direct discriminant
method, the pseudo-inverse matrix K' of the kernel matrix K must be calculated
when discarding SW T H; this matrix is often ill-conditioned for particular
choices of kernel and kernel parameters, in which case there is no solution.
KDDA avoids this problem by introducing the direct method.

4. Comparison between kernel direct discriminant analysis and kernel principal component analysis

The basic idea of KDDA is the same as that of kernel principal component analysis
(KPCA): using the kernel trick, the sample points in the input space are mapped into
a high-dimensional (or even infinite-dimensional) feature space through a nonlinear
mapping, and the eigenvector coefficients are then solved in that space. The
difference is that the projection directions of the KPCA transform maximize the
total scatter of all samples and retain the components with the largest sample
variance, without considering the between-class scatter; that is, KPCA does not
take full advantage of the differences between categories. When KDDA is applied
to feature extraction, the between-class information is fully considered, so that in
the reduced-dimension feature space the distance between different categories is
maximized and the distance within the same category is minimized. In other words,
after KDDA feature extraction, samples of the same category are clustered together
and samples of different categories are separated as much as possible. Therefore,
KDDA is better suited than KPCA for feature extraction.
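The difference can be illustrated in the linear case with synthetic two-class data: the maximum-variance (PCA-style) direction ignores labels, while the Fisher discriminant direction maximizes between-class relative to within-class scatter (a simplified stand-in for the KPCA-versus-KDDA comparison; the data, seed, and names below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes separated along x, with a large shared variance along y.
A = rng.normal([0.0, 0.0], [0.3, 3.0], size=(200, 2))
B = rng.normal([2.0, 0.0], [0.3, 3.0], size=(200, 2))
X = np.vstack([A, B])

# Unsupervised, PCA-style direction: top right-singular vector of the
# centred data, i.e. the direction of maximum total variance.
Xc = X - X.mean(axis=0)
w_pca = np.linalg.svd(Xc, full_matrices=False)[2][0]

# Supervised, Fisher-style direction: Sw^{-1}(m1 - m2), maximizing
# between-class scatter relative to within-class scatter.
Sw = np.cov(A.T) + np.cov(B.T)
w_lda = np.linalg.solve(Sw, A.mean(axis=0) - B.mean(axis=0))
w_lda /= np.linalg.norm(w_lda)

def fisher_ratio(w):
    """Class separability of the 1-D projection onto direction w."""
    pa, pb = A @ w, B @ w
    return (pa.mean() - pb.mean()) ** 2 / (pa.var() + pb.var())
```

Here the variance-maximizing direction aligns with the uninformative y-axis, while the discriminant direction separates the two classes; KDDA obtains the same kind of discriminant direction, but in a kernel-induced feature space.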

3.5.2.3 Single-Value Classification Problem

Single-valued (one-class) classification differs from traditional binary classification.
Instead of assigning a given sample to Category I or Category II, the goal of
single-valued classification is to decide whether or not the sample belongs to a
single given category. There is only one category in the single-valued classification
problem, so only samples of that one category are needed to construct the
single-valued classifier.
(1) Relevance vector machine single-value classification

Fig. 3.51 Single-valued classification schematic

The RVM single-valued classification model is built on the binary classification
algorithm. Unlike binary classification, only one class of samples is used for
training, so all class labels are marked as tn = 0 or tn = 1. The single-valued RVM
classifier is trained according to the above algorithm, and the predicted value of
each training sample point is calculated. Taking the predicted values as inputs to
the logistic function, the probability distribution interval of the predicted values is
obtained. The classifier then estimates the predicted value and probability of each
test point: if the probability falls within the training probability interval, the point
is a normal sample; otherwise, it is an abnormal sample. Data samples from normal
operation are easy to obtain, whereas fault samples are difficult to acquire. The
single-valued classification method uses only the normal-operation data samples to
establish a single-valued fault classifier and identify the running state of the
machine. Figure 3.51 is a sketch of single-valued classification.
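The interval procedure described above can be sketched as follows (schematic only: the scores would come from a trained RVM predictor, which is not implemented here; all names are illustrative):

```python
import numpy as np

def sigmoid(y):
    """Logistic function mapping a predicted value to a probability."""
    return 1.0 / (1.0 + np.exp(-y))

def train_interval(train_scores):
    """Probability interval of the normal class, from the predicted
    values of the (all-normal) training samples."""
    p = sigmoid(np.asarray(train_scores, dtype=float))
    return float(p.min()), float(p.max())

def is_normal(test_score, interval):
    """A test point is judged normal if its probability falls inside the
    training interval; otherwise it is flagged as abnormal."""
    lo, hi = interval
    return lo <= float(sigmoid(test_score)) <= hi
```

Points whose probability falls either below or above the training interval are both flagged, since either deviation indicates behavior unseen in normal operation.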
(2) Support vector data description
The support vector data description (SVDD) algorithm is a single-valued
classification method based on the support vector algorithm. It was proposed by
Tax and Duin and developed from Vapnik's support vector machine. Unlike the
optimal hyperplane of the support vector machine, the goal of SVDD is to find a
minimum-volume sphere containing the target sample data [26], such that all (or
most) of the target samples are contained within the sphere. The basic principles
and algorithms of SVDD are not detailed here; they can be found in Ref. [27].
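As a toy illustration of the data-description idea (drastically simplified: the real SVDD chooses the centre and radius by solving a quadratic program with slack variables and, optionally, kernels [27]; here the centre is simply the sample mean):

```python
import numpy as np

def fit_sphere(X):
    """Toy data description: a sphere centred at the sample mean whose
    radius covers all training points (not the SVDD optimum)."""
    c = X.mean(axis=0)
    R = np.max(np.linalg.norm(X - c, axis=1))
    return c, R

def in_sphere(x, c, R):
    """Accept a test point as 'target' if it lies inside the sphere."""
    return np.linalg.norm(x - c) <= R
```

Test points falling outside the sphere are rejected as outliers, which is exactly the decision rule the optimized SVDD sphere implements.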

3.5.2.4 Multi-value Classification Problem

Mechanical fault pattern recognition is a diagnosis method that takes raw measured
data as input and takes the corresponding action according to its category. The three
basic operations of a pattern recognition system are preprocessing, feature
extraction, and classification. The function of the classifier in the system is to
assign a class label to the object under test according to the feature vector produced
by the feature extractor. The ease of classification depends on two factors: the first
is the fluctuation of feature values between individuals of the same category; the
second is the difference between the feature values of samples belonging to
different categories.

A pattern recognition system involves the following steps: data acquisition,
feature selection, model selection, training, and evaluation, as shown in Fig. 3.52.
At present, what pattern recognition technology can do is essentially classification;
there is still a considerable distance from true understanding. As a pattern
classification method, the issues involved are:
(1) feature extraction;
(2) learning decision rules by training on model samples;
(3) using the decision rules to classify samples.
The key to the pattern recognition system is the design of the feature selection
and feature discrimination modules. Selecting features with high descriptive
accuracy is undoubtedly of great significance to building the system: less storage
is required while more physical meaning is expressed. A reasonably designed
feature discriminator gives the system high stability and accuracy.

Fig. 3.52 Block diagram of the pattern recognition system

The relevance vector machine (RVM) performs well on two-class problems and
can be extended to multi-class problems.
RVM was originally proposed for two-class classification problems, but in terms of
its application in pattern recognition, only binary classifiers obviously cannot meet
the application needs, especially in the field of fault diagnosis, after quantifying
the fault symptom and the fault cause, the corresponding problem is definitely not
only the two-class classification. Therefore, it is an important factor to restrict the
successful application of RVM in the engineering field whether the binary RVM
classifier can be extended to the multi-valued RVM classifier effectively.
Because RVM is developed from SVM, research on its multi-class classification
algorithms can draw on SVM. According to existing SVM theory, there are two
main ways to construct SVM multi-class classifiers, called the complete
multi-class support vector machine and the combined multi-class support vector
machine; the combined approach is further divided into the "one-to-many" and
"one-to-one" algorithms. RVM multi-class classification can adopt the same
construction methods.
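The combined "one-to-one" construction can be sketched generically: train one binary classifier per pair of classes and let them vote on each test sample. The midpoint-threshold binary learner below is a deliberately simple stand-in for a trained RVM or SVM, and the 1-D sample values are illustrative:

```python
from itertools import combinations

def train_one_vs_one(samples, labels, train_binary):
    """Train one binary classifier per pair of classes (the "one-to-one" scheme)."""
    classes = sorted(set(labels))
    models = {}
    for a, b in combinations(classes, 2):
        xs = [x for x, y in zip(samples, labels) if y in (a, b)]
        ys = [y for y in labels if y in (a, b)]
        models[(a, b)] = train_binary(xs, ys, a, b)
    return classes, models

def predict_one_vs_one(classes, models, x):
    """Each pairwise classifier casts one vote; the majority class wins."""
    votes = {c: 0 for c in classes}
    for clf in models.values():
        votes[clf(x)] += 1
    return max(votes, key=votes.get)

def train_binary(xs, ys, a, b):
    """Stand-in binary learner: a midpoint threshold on a scalar feature."""
    mean = lambda c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
    ma, mb = mean(a), mean(b)
    lo, hi = (a, b) if ma < mb else (b, a)
    t = (ma + mb) / 2
    return lambda x: lo if x < t else hi

samples = [0.1, 0.2, 1.0, 1.1, 2.0, 2.1]
labels = ["normal", "normal", "inner", "inner", "outer", "outer"]
classes, models = train_one_vs_one(samples, labels, train_binary)
print(predict_one_vs_one(classes, models, 1.05))  # → inner
```

Note that every pairwise classifier votes, including those trained on classes irrelevant to the test sample — exactly the drawback of the "one-to-one" scheme discussed below.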
(1) Complete multi-class relevance vector machine
In Ref. [28], a sparse Bayesian classification algorithm based on Automatic
Relevance Determination is proposed to solve multi-class problems. The
traditional "one-to-many" and "one-to-one" methods are not only complicated in
the training phase but also cause overlap and large classification loss. The sparse
Bayesian method is a direct multi-class predictor that overcomes these
shortcomings by introducing a regular correlation function.
Sparse Bayesian learning includes the Occam pruning rule, which can trim a
complex model to make it smooth and simple. Ref. [28] proposes a sparse-learning
multi-class classifier within the Bayesian framework: the algorithm sets up a
multinomial distribution for each class, and the multi-class output of the model is
the softmax of the kernel basis functions (the regular correlation functions). Model
parameters are estimated by Automatic Relevance Determination, which ensures
the sparsity and smoothness of the model.
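The softmax output stage of such a direct multi-class model can be sketched as follows. The basis points and weight values are illustrative, and the ARD estimation of the weights is omitted:

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to one."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gauss_kernel(x, z, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def class_scores(x, basis_points, weights):
    """Per-class linear combination of kernel basis functions (one weight row per class)."""
    phi = [gauss_kernel(x, z) for z in basis_points]
    return [sum(w * p for w, p in zip(row, phi)) for row in weights]

# Hypothetical two-basis-point, three-class model
basis = [(0.0, 0.0), (1.0, 1.0)]
weights = [[2.0, -1.0],   # class 0 favours the first basis point
           [-1.0, 2.0],   # class 1 favours the second
           [0.0, 0.0]]    # class 2 is neutral
probs = softmax(class_scores((0.1, 0.0), basis, weights))
print(probs.index(max(probs)))  # the most probable class
```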
(2) Combined multi-class relevance vector machine
In fact, in current machine learning practice, the usual approach is to break a
large-scale learning problem into a series of small problems, each solving a binary
classification task. This is done not only because it reduces the computational
complexity of the whole problem and improves the generalization performance of
the global classifier, but also because some machine learning algorithms degrade
rapidly as the problem size increases, while others cannot directly solve
multi-class classification problems.
In algorithm design, a binary classification algorithm is therefore often designed
first and then extended toward multi-class classification or regression estimation.
Some algorithms directly extend the existing problem; in this case, the original

problem is usually converted into several two-class classification problems, and a
reconstruction algorithm is then designed to combine all the independent results.
At present, the application of pattern classification to multi-classification is based
on the idea of decomposition and reconstruction, which can be divided into two
kinds: "one-to-many" and "one-to-one" classification algorithms.
The "one-to-one" algorithm not only beats the "one-to-many" algorithm in
training time but also achieves higher classification accuracy. One disadvantage,
however, is that each classifier is trained on samples from only two specific
classes, while in the test phase every sample is fed to every classifier. A sample
may therefore receive votes from classifiers irrelevant to its true class, which has a
certain negative impact on the results, and how to penalize such votes becomes a
new problem.

3.5.3 Application of RVM in Fault Detection and Classification

In this section, vibration signals of a transmission in the normal state and with
gear and bearing faults are collected experimentally, and the effective information
is extracted by feature selection. RVM is then applied to the detection and
classification of typical gear and bearing faults in the gearbox, and its performance
is analyzed and compared with that of SVM.

3.5.3.1 Transmission Typical Fault Experimental Device and Test Method

The experimental apparatus and test method are the same as in Sect. 3.2, the structure
of the experiment system, the transmission test bed and control board, the trans-
mission diagram, and the arrangement of measuring points are all consistent with
Sect. 3.2.4.1.

3.5.3.2 Application of Gearbox Fault Detection Based on RVM

A good detection model must be both accurate and stable, so the detection rate can
be used to measure the effectiveness of detection. On the one hand, the detection
model is expected to have a high detection rate; that is, the detector should
accurately detect minor faults that differ little from the normal state. Because the
normal mode and a minor-fault mode are similar, they are difficult to distinguish
in an ordinary linear space, so minor-fault detection is especially hard to perform.
On the other hand, in most detection tasks the prior information on fault modes is
insufficient and the known samples are almost all normal data. This leads to less
than ideal detector training, which in turn affects detection accuracy.
Gear surface pitting is a kind of slight fault: the vibration signal extracted in the
slight-pitting state is difficult to distinguish from the normal signal by signal
processing methods alone. In this section, the RVM method is used to detect slight
gear pitting faults and is compared with the SVDD method.
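Detection from (almost) only normal training data can be sketched in miniature: enclose the normal-condition features in a sphere and flag test points that fall outside it — a stripped-down version of the minimum-volume-sphere idea behind SVDD, without kernels or slack handling. The data and coverage level here are illustrative:

```python
import math

def fit_normal_sphere(train_features, coverage=0.95):
    """Fit a sphere around normal-condition features: the centroid plus a radius
    chosen so that `coverage` of the training points fall inside."""
    n, dim = len(train_features), len(train_features[0])
    center = tuple(sum(f[i] for f in train_features) / n for i in range(dim))
    dists = sorted(math.dist(f, center) for f in train_features)
    radius = dists[min(n - 1, int(coverage * n))]
    return center, radius

def detect(center, radius, x):
    """Return True if x falls outside the normal sphere, i.e. a suspected fault."""
    return math.dist(x, center) > radius

normal = [(0.0 + 0.01 * i, 0.0) for i in range(20)]   # tightly clustered normal data
center, radius = fit_normal_sphere(normal)
print(detect(center, radius, (2.0, 2.0)))  # far from the normal cluster → True
```

Real SVDD optimizes the sphere in a kernel-induced feature space, so the boundary in the original space can be non-spherical; the threshold-on-distance structure, however, is the same.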
(1) Experiment and signal acquisition
To identify early gear faults, the test is divided into two parts. First, the vibration
signal is collected from the normal transmission; then a faulty gear with slight
pitting replaces the 5th-gear meshing gear on the output shaft of the transmission,
and the vibration signal is collected again. The vibration acceleration signals were
collected in the horizontal direction at measuring point 1. The input speed was set
to 1000 r/min with a torque of 145 N·m, and the output speed was 1300 r/min with
a torque of 100 N·m. The sampling frequency was set to 40,000 Hz, the upper
limit of the low-pass filter was 2000 Hz, and the sampling length was 1024 × 90
points; six groups of data were recorded. The resulting time-domain waveforms,
shown in Fig. 3.53, have no obvious amplitude difference and are so similar as to
be almost indistinguishable, making it difficult to judge the gear fault from them.
(2) Feature selection index
Many eigenvalues reflect gear faults, and their sensitivity to faults differs. Features
are selected according to how closely each eigenvalue relates to the fault and how
strongly it is disturbed by noise during testing and analysis. Eleven time-domain
eigenvalues are selected for calculation: mean square value (1), kurtosis (2), mean
value (3), variance (4), skewness (5), peak value (6), root mean square amplitude
(7), waveform index (8), peak value index (9), pulse index (10), and margin index
(11). When selecting training samples, a cluster-analysis pre-processing step can
be applied to remove outliers from the sample set so that the subsequent
classification achieves good results. To verify the quality of the selected features
and sample points, separability evaluation indexes are used.
The experimental data in each file is divided into 240 sections, and the above
11-dimensional characteristic index values are calculated for each section to
generate 240 samples.

Fig. 3.53 Time domain waveform

The separability evaluation index value of each dimension eigenvalue
is calculated, as shown in Table 3.14. From the table, the evaluation index values
of kurtosis (2), skewness (5), waveform index (8), peak index (9), pulse index
(10), and margin index (11) are all less than zero; that is, their between-class
divergence is smaller than their within-class divergence, so samples composed of
these eigenvalues have poor separability. These eigenvalues are therefore
eliminated, and the remaining ones, mean square value (1), mean value (3),
variance (4), peak value (6), and root mean square amplitude (7), are combined.
When a pattern recognition method is used for state classification, a suitable
number of features is 2 to 3: with fewer features the error rate is high, while too
many features make the discriminant function complex, the computation heavy,
and the real-time performance poor. Moreover, the error rate does not decrease
monotonically as the number of features increases: beyond 3 features the
calculation becomes complicated and the real-time performance degrades, yet the
error rate is not obviously improved. Therefore, RVM uses only the separability
evaluation index to select features, and only two-dimensional eigenvalue
combinations are selected to evaluate the validity of classification.
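The selection rule rests on the between-class divergence Scb and within-class divergence Scw of each candidate feature; the exact normalization that yields Jb is defined earlier in the book. The sketch below computes the two divergences for a scalar feature and keeps only features whose between-class divergence exceeds the within-class one, mirroring the Jb > 0 rule applied here (the sample values are illustrative):

```python
def scatter_per_feature(samples, labels):
    """Between-class (Scb) and within-class (Scw) divergence of one scalar feature."""
    classes = sorted(set(labels))
    n = len(samples)
    grand = sum(samples) / n
    scb = scw = 0.0
    for c in classes:
        xs = [x for x, y in zip(samples, labels) if y == c]
        mc = sum(xs) / len(xs)
        scb += len(xs) / n * (mc - grand) ** 2      # class means spread apart
        scw += sum((x - mc) ** 2 for x in xs) / n   # scatter inside each class
    return scb, scw

# Feature "good" separates the classes; feature "bad" is pure noise
labels = ["normal"] * 4 + ["fault"] * 4
good = [0.1, 0.2, 0.1, 0.2, 1.1, 1.2, 1.1, 1.2]
bad = [0.5, 1.0, 0.4, 1.1, 0.5, 1.0, 0.4, 1.1]
scb_g, scw_g = scatter_per_feature(good, labels)
scb_b, scw_b = scatter_per_feature(bad, labels)
print(scb_g > scw_g, scb_b < scw_b)  # → True True
```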
(3) Analysis of RVM detection results

During classification, 140 normal samples were selected as training samples, and
100 normal samples together with 240 pitting samples were used for testing. To
verify the validity of the evaluation indicator Jb, the Gauss kernel function was
chosen as the mapping function and the classification effects at different Jb values
were compared. Figure 3.54 shows the classification effect at Jb = 0.968, and
Fig. 3.55 shows the classification effect at Jb = 0.838.
According to the classification graphs, both RVM and SVDD separate the fault
samples from the normal samples well when Jb is large; when Jb is small, the fault
and normal samples overlap more and the classification effect is poor.

Table 3.14 Comparison of separability evaluation indexes of each eigenvalue

Eigenvalue code   Interclass dispersion Scb   Within-class dispersion Scw   Evaluation indicator Jb
1                 1.081                       0.323                         0.920
2                 0.001                       0.924                         − 0.707
3                 0.778                       0.147                         0.974
4                 0.975                       0.355                         0.868
5                 0.000                       1.120                         − 0.707
6                 1.062                       0.448                         0.808
7                 1.176                       0.257                         0.963
8                 0.005                       0.865                         − 0.705
9                 0.000                       0.947                         − 0.707
10                0.000                       0.926                         − 0.707
11                0.000                       0.914                         − 0.707

Fig. 3.54 Jb = 0.968 classification renderings

Fig. 3.55 Jb = 0.838 classification renderings

Table 3.15 compares the classification accuracy of the two methods for several
eigenvalue combinations. It shows that when Jb is greater than 0.9, the
classification accuracy of both methods exceeds 90%, which verifies the
effectiveness of the feature selection.
Because RVM uses Bayes' theorem to predict the class probability of each point, it
can quantitatively evaluate whether a detection result belongs to the class. Most of
its training points are contained within the classification boundary, and the
training error is essentially zero. SVDD, by contrast, considers only the structure
of the data space and only accepts or rejects each point, which leaves many
support vectors out of bounds and gives a larger training error. In addition, the
number of relevance vectors in RVM is far smaller than the number of support
vectors in SVDD, so the solution is sparser and the model structure simpler.

Table 3.15 Comparison of classification results and separability evaluation indexes after feature combination

Combination of features            1 and 4   3 and 6   3 and 7   6 and 7   1 and 6   4 and 6
Gauss kernel width                 0.2       0.3       0.4       0.4       0.3       0.2
RVM classification accuracy (%)    99.41     97.94     99.41     92.64     91.47     88.24
SVDD classification accuracy (%)   99.12     90.29     97.47     90.65     88.82     84.71
Number of relevance vectors        5         5         2         3         3         4
Number of support vectors          18        37        27        32        35        50
Evaluation indicator Jb            0.947     0.902     0.968     0.909     0.872     0.838

3.5.3.3 Application of Gearbox Fault Classification Based on RVM

1. Experimental analysis on the classification of typical bearing faults

Because the vibration energy of a bearing is smaller than that of the gears and
shafts, and its fault characteristics are not obvious, bearing faults are difficult to
identify and diagnose; classifying different types of bearing faults is harder still.
In this section, the relevance vector machine (RVM) multi-classification method
is used to classify sample data from the transmission under normal, bearing inner
ring spalling, and bearing outer ring spalling conditions.
(1) Experiment and signal acquisition
The cylindrical rolling bearing on the output shaft of a Dongfeng SG135-2
transmission is taken as the experimental object. The measuring point is located
on the bearing block of the output shaft, i.e., the position of measuring point 1 in
Fig. 3.9. The experiment is designed to run the transmission in three states:
normal, bearing inner ring spalling, and bearing outer ring spalling.
In all three states, the transmission must run under the same operating conditions
so that the collected experimental data are comparable. The transmission is set to
3rd gear; the input constant gear ratio is 38/26 and the 3rd gear ratio is 35/30. The
input and output shaft speeds are 2400 r/min and 1370 r/min respectively, the
output torque is 105.5 N·m, and the power is 15 kW. The sampling frequency is
40,000 Hz, the anti-aliasing filter cut-off is 20,000 Hz, and the sampling length is
1024 × 90 points; vibration acceleration signals in the horizontal, vertical, and
axial directions are collected simultaneously. The parameters of the output shaft
rolling bearing are shown in Table 3.16.
(2) Feature selection and extraction

The vibration acceleration signals of the experimental transmission under normal,
bearing inner ring spalling, and bearing outer ring spalling conditions were
selected. When the inner or outer ring of a rolling bearing suffers fatigue spalling,
an obvious group of modulation peaks appears near the natural frequency of the
outer ring in the middle- and high-frequency region of the spectrum: the natural
frequency of the outer ring acts as the carrier frequency and the bearing pass
frequency acts as the modulation frequency, the natural frequency modulation
phenomenon [29].

Table 3.16 Output shaft rolling bearing parameters and characteristic frequencies

Nodal diameter D (mm)                      85
Roller diameter d0 (mm)                    18
Number of rolling bodies m                 13
Contact angle α                            0
Inner ring pass frequency f i (Hz)         179.6
Outer ring pass frequency f o (Hz)         116.8
Rolling body pass frequency f g (Hz)       51.4
Cage pass frequency f b (Hz)               10.9

Therefore, when choosing frequency-domain indexes, the frequency value at
which the spectrum energy is concentrated is considered; at the same time, the
amplitudes corresponding to the pass frequencies of the bearing inner and outer
rings are also sensitive fault characteristics. The following characteristic indicator
values are therefore calculated for each segment of data:
(a) time-domain statistical characteristic indexes: mean value (1), mean square
value (2), kurtosis (3), variance (4), skewness (5), peak value (6), root-mean-
square amplitude (7).
(b) dimensionless characteristic indexes: waveform index (8), pulse index (9), peak
index (10), margin index (11).
(c) frequency-domain characteristic indexes: the frequency value corresponding
to the highest peak of the spectrum (12), the amplitude corresponding to the
bearing inner ring pass frequency in the refined (zoom) spectrum (13), and
the amplitude corresponding to the bearing outer ring pass frequency in the
refined (zoom) spectrum (14).
The above features constitute the feature set of the experimental analysis, which
is used as “Original information” for further feature selection, extraction, and pattern
classification.
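The characteristic frequencies in Table 3.16 follow from the standard rolling-bearing kinematic formulas. A quick cross-check reproduces the inner ring, outer ring, and rolling body frequencies to within rounding; the 1370 r/min output-shaft speed is taken from the stated test conditions, and the cage-frequency convention in the table may differ from the one sketched here:

```python
import math

def bearing_frequencies(D, d, n, alpha_deg, shaft_rpm):
    """Standard defect frequencies from pitch diameter D, element diameter d,
    element count n, contact angle, and shaft speed."""
    fr = shaft_rpm / 60.0                       # shaft rotation frequency (Hz)
    r = d / D * math.cos(math.radians(alpha_deg))
    bpfi = n / 2 * fr * (1 + r)                 # inner ring pass frequency
    bpfo = n / 2 * fr * (1 - r)                 # outer ring pass frequency
    bsf = D / (2 * d) * fr * (1 - r * r)        # rolling body (spin) frequency
    ftf = fr / 2 * (1 - r)                      # cage (fundamental train) frequency
    return bpfi, bpfo, bsf, ftf

bpfi, bpfo, bsf, ftf = bearing_frequencies(D=85, d=18, n=13, alpha_deg=0,
                                           shaft_rpm=1370)
print(round(bpfi, 1), round(bpfo, 1), round(bsf, 1))  # close to 179.6, 116.8, 51.4 Hz
```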
For the time-domain sampling sequence collected in each direction, one segment
was truncated every 2048 points, giving 45 segments for experimental analysis.
By cross-sampling, 90 samples were obtained in the x direction for each state, i.e.,
270 samples in total for the three states. Fourteen feature indexes are extracted
from each sample, the separability evaluation index of each feature is calculated,
the features with smaller indexes are eliminated, and the remaining features are
processed by kernel direct discriminant analysis (KDDA) for feature extraction.
Table 3.17 lists the separability evaluation index of each characteristic value.
As can be seen from Table 3.17, the separability evaluation indexes of the mean
square value (1), variance (4), peak value (6), root mean square amplitude (7),
frequency value corresponding to the highest peak of the spectrum (12), amplitude
at the bearing outer ring pass frequency (13), and amplitude at the bearing inner
ring pass frequency (14) are all greater than zero and close to 1, showing great
differences among the samples of the three states (normal, bearing inner ring
spalling, and bearing outer ring spalling). In feature classification, these indicators
should be used as priority candidates. For the other characteristic indexes, the
between-class divergence is smaller than the within-class divergence and the
separability evaluation index is less than zero, which indicates that the feature
points of the three kinds of samples are mixed together and the different kinds of
samples are hard to distinguish; these indexes are not suitable for feature
extraction.

Table 3.17 Bearing class eigenvalue separability evaluation index values

Eigenvalue code   Divergence between classes Scb   Divergence within classes Scw   Separability evaluation index Jb
1                 2.168                            0.247                           0.990
2                 0.015                            0.991                           − 0.702
3                 0.246                            0.914                           − 0.590
4                 2.168                            0.274                           0.990
5                 0.585                            0.801                           − 0.261
6                 2.103                            0.295                           0.987
7                 2.465                            0.174                           0.997
8                 0.104                            0.962                           − 0.666
9                 0.038                            0.984                           − 0.693
10                0.028                            0.987                           − 0.697
11                0.025                            0.988                           − 0.698
12                2.989                            0                               1
13                2.989                            0                               1
14                2.989                            0                               1
A 7-dimensional eigenvalue matrix is composed of the mean square value (1),
variance (4), peak value (6), root mean square amplitude (7), frequency
corresponding to the highest peak of the spectrum (12), frequency amplitude at the
bearing outer ring pass frequency (13), and frequency amplitude at the bearing
inner ring pass frequency (14), and the KDDA method is used for feature
extraction. The effect of feature extraction is shown in Fig. 3.56.
The Gauss kernel function is used with kernel parameter σ = 5. Figure 3.56a
shows the KDDA feature extraction effect and Fig. 3.56b the KPCA feature
extraction effect. It can be seen from the figure that, after feature selection with
the separability evaluation index, both methods extract three separable classes.
The separability evaluation index can thus select features before extraction and
evaluate the extracted feature vectors afterwards. Recalculating the index after
feature extraction, the Jb of the feature vectors extracted by both methods equals
1. The ratio of between-class to within-class divergence of the feature vectors
extracted by KDDA is Scb/Scw = 301.882/1.899 = 158.97, while that for KPCA is
Scb/Scw = 0.870/0.014 = 62.14, so the effect of KDDA feature extraction is
slightly better than that of KPCA.

Fig. 3.56 The effect of feature extraction when σ = 5
(3) Classification of bearing experimental data with the RVM multi-classification
method

The feature vectors extracted by KDDA are used for RVM multi-classification.
From each of the three classes, 50 samples are randomly selected as training
samples (150 in total), and the remaining 120 serve as test samples. The normal,
bearing inner ring spalling, and bearing outer ring spalling samples are labeled 1,
2, and 3, respectively. The Gauss radial basis function is chosen as the kernel,
with kernel parameters selected empirically. The three-class results are shown in
Fig. 3.57, where '.' denotes a normal sample, '*' a bearing inner ring spalling
sample, 'Δ' a bearing outer ring spalling sample, and 'O' a relevance vector or
support vector point.

Fig. 3.57 Effect of the three-class classification of bearings

From Table 3.18 and Fig. 3.57, after feature selection and extraction both RVM
and SVM achieve good classification results, with 100% classification accuracy.
The training time of RVM is longer than that of SVM, but its test time is shorter,
and RVM uses far fewer vectors than SVM, so its solution is sparser.

Table 3.18 Bearing class three classification results comparison

Classification algorithm (one-on-one)       RVM      SVM
Gauss kernel parameter                      1.5      5
Training accuracy (%)                       100      100
Test accuracy (%)                           100      100
Training time (s)                           4.2014   1.0540
Test time (s)                               0.0029   0.0126
Number of relevance (or support) vectors    5        14

2. Experimental analysis of typical gear fault classification

Typical gear faults include pitting, abrasion, spalling, and tooth breakage. In this
section, four fault conditions, moderate gear spalling, severe gear spalling with
tooth surface deformation, tooth breakage, and severe pitting, are selected for
classification experiments, and a multi-fault classifier is established.

(1) Experiment and signal acquisition

The 5th-speed gear of the Dongfeng SG135-2 transmission is taken as the
experimental object. The vibration acceleration signal is collected at measuring
point 3 at the input of the transmission, as shown in Fig. 3.12. The experiment is
designed to run the transmission in five states: normal, gear pitting, moderate gear
spalling, severe gear spalling with tooth surface deformation, and broken gear
teeth. During the experiment, the transmission is operated under the same working
condition to ensure the validity of the classification. The transmission is set to 5th
gear, with an input shaft speed of 600 r/min, an output shaft speed of 784 r/min,
an output torque of 75.5 N·m, and an output power of 6.5 kW. The sampling
frequency is set to 40,000 Hz, the sampling length is 1024 × 90 points, and the
horizontal radial (i.e., X direction) signal is collected.
(2) Feature selection and extraction
The vibration acceleration signals of the experimental transmission were extracted
under the five operating conditions: normal, severe pitting, moderate spalling,
severe spalling with tooth surface deformation, and broken gear teeth. When
concentrated tooth-profile faults such as surface pitting and fatigue spalling occur
during gear operation, the gear meshing frequency and its harmonics act as the
carrier frequency while the rotating frequency of the gear shaft and its multiples
act as the modulation frequency, the so-called modulation phenomenon; its
severity is determined by the degree of gear damage. In the time domain, this
appears as variations of the effective (RMS) value and kurtosis index, statistical
indexes of the vibration energy. When a fault occurs, the above modulation
phenomenon becomes more obvious; at the same time it excites natural frequency
modulation of the gear, and the energy at the rotating frequency of the gear shaft
also increases obviously. Based on the above analysis, the following characteristic
index values were calculated for each data segment:
(a) time-domain statistical characteristic indexes: mean value (1), mean square
value (2), kurtosis (3), variance (4), skewness (5), peak value (6), root-mean-
square amplitude (7).
(b) dimensionless characteristic indexes: waveform index (8), pulse index (9), peak
index (10), margin index (11).
(c) frequency domain characteristic index: the sum of the main frequency and all
side frequency amplitude in the modulation frequency band (12), the corrected
amplitude corresponding to the meshing frequency of experimental gear (13),
and the corrected amplitude corresponding to the rotating frequency of the shaft
where the experimental gear is located (14).
The time-domain parameters can effectively represent the time-domain charac-
teristics and the envelope energy of vibration, and the frequency-domain parameters
reflect the modulation characteristics and the distribution of vibration energy to some
extent.
The bearing fault classification experiment above is a three-class experiment. To
further verify the classification performance of RVM, four-class and five-class
experiments are carried out. Four of the five transmission operating states are
selected as a combination, and the four state data sets in each combination are
taken as the samples of Class 1, Class 2, Class 3, and Class 4 for the four-class
experiments. Two such combinations are established: (a) normal, moderate gear
spalling, severe gear spalling with tooth surface deformation, and broken gear
teeth; (b) moderate spalling, severe spalling with tooth surface deformation, tooth
breakage, and severe pitting. A combination of all five states, (c) normal,
moderate spalling, severe spalling with surface deformation, tooth breakage, and
severe pitting, is used for the five-class experiment, with the five state data sets
taken as five classes of samples. In each combination the different failure states
are matched with one another, which fully simulates the coexistence of multiple
modes during transmission operation.
For the time-domain sampling sequence collected in each direction, one segment
was truncated every 2048 points, giving 45 segments for experimental analysis.
After cross-sampling, 90 samples were obtained in the x direction for each state,
i.e., 450 samples for the five states, and 14 feature indexes were extracted from
each sample.

(3) Using the RVM multi-classification method to classify gear experimental data

As with the feature selection and extraction for the bearings, the separability
evaluation indexes of the 14 feature values are calculated first, the features with
larger Jb are selected for KDDA feature extraction, and the extracted feature
vectors are used for RVM classification and compared with SVM classification.
Combination A: normal, moderate spalling, severe spalling with surface
deformation, broken teeth.
The values of the 14 separability indexes are shown in Table 3.19, and Table 3.20
compares the classification results; the KDDA feature extraction effect can be
seen in Fig. 3.58. The four-class effect maps of RVM and SVM after KDDA
feature extraction are shown in Fig. 3.59, where '.' denotes a normal sample, '+'
moderate gear spalling, '*' severe gear spalling with tooth surface deformation,
'◇' a broken gear tooth, and 'O' a relevance vector or support vector point.
Combination B: moderate spalling, severe spalling with surface deformation,
tooth breakage, severe pitting, as shown in Figs. 3.60 and 3.61 and Tables 3.21,
3.22 and 3.23.
Combination C: normal, moderate spalling, severe spalling with tooth surface
deformation, tooth breakage, severe pitting.
3. Experimental results analysis of gear multi-classification
According to the classification results of the three combinations, both RVM and
SVM achieve good classification based on feature selection and extraction. RVM

Table 3.19 Gear class combination A separability evaluation index values

Eigenvalue code   Divergence between classes Scb   Divergence within classes Scw   Separability evaluation index Jb
1                 2.697                            0.323                           0.991
2                 1.626                            0.591                           0.869
3                 3.967                            0.005                           1
4                 2.677                            0.328                           0.990
5                 0.490                            0.875                           − 0.402
6                 3.034                            0.239                           0.996
7                 3.280                            0.177                           0.998
8                 2.781                            0.302                           0.993
9                 2.626                            0.341                           0.989
10                2.518                            0.368                           0.986
11                2.531                            0.364                           0.986
12                2.694                            0.324                           0.991
13                2.292                            0.424                           0.975
14                1.277                            0.678                           0.662

Table 3.20 Comparison of four classification results of gear class combination A

Classification algorithm (one-on-one)       RVM      SVM
Gauss kernel parameter                      2        10
Training accuracy (%)                       100      100
Test accuracy (%)                           100      100
Training time (s)                           8.0572   2.3703
Test time (s)                               0.0050   0.0072
Number of relevance (or support) vectors    7        17

Fig. 3.58 KDDA feature extraction of gear class combination A

Fig. 3.59 Four-class classification effect of gear class combination A



Fig. 3.60 KDDA feature extraction of gear class combination B

Fig. 3.61 Four-class classification effect of gear class combination B

classification needs fewer relevance vectors than SVM needs support vectors, and
its test time is shorter, but the training stage of RVM is more complex and its
training time is longer.
As shown in Figs. 3.58, 3.60, and 3.62, after KDDA feature extraction, the normal
samples in combination A and the moderate spalling and severe pitting samples in
combination B lie close together, almost as if they belonged to the same category.
The normal, moderate spalling, and severe pitting samples in combination C are
also relatively close to one another, with some overlapping samples. After RVM
classification, the close and partially overlapping samples were separated, and the
classification accuracy of combinations A and B reached 100%. Combination C
cannot be completely separated because some sample points overlap, but it still

Table 3.21 Gear class combination B separability evaluation index values

Eigenvalue code   Divergence between classes Scb   Divergence within classes Scw   Separability evaluation index Jb
1                 2.518                            0.368                           0.986
2                 1.585                            0.601                           0.853
3                 3.979                            0.003                           1
4                 2.496                            0.373                           0.985
5                 0.530                            0.865                           − 0.361
6                 3.126                            0.216                           0.996
7                 3.280                            0.177                           0.997
8                 2.736                            0.313                           0.992
9                 2.398                            0.398                           0.981
10                2.391                            0.399                           0.980
11                2.437                            0.388                           0.983
12                2.844                            0.286                           0.994
13                2.863                            0.281                           0.994
14                1.524                            0.616                           0.827

Table 3.22 Comparison of four classification results of gear class combination B

Classification algorithm (one-on-one)       RVM      SVM
Gauss kernel parameter                      1.4      20
Training accuracy (%)                       100      100
Test accuracy (%)                           100      100
Training time (s)                           9.0026   2.2073
Test time (s)                               0.0048   0.0070
Number of relevance (or support) vectors    12       17

can achieve high classification accuracy. From Table 3.24, the training and test
accuracy of RVM are both higher than those of SVM (Fig. 3.63).

3.5.3.4 Analysis of Factors Affecting Classifier Performance

(1) The influence of the kernel function on the classification performance of the
classifier

Three RVM-based fault detection classifiers were constructed using the
polynomial kernel function, the Gauss radial basis function, and the multilayer
perceptron kernel function, respectively, and the classification accuracy of each
feature combination is
3.5 Relevance Vector Machine Diagnosis Method 187

Table 3.23 Gear type combination C separability evaluation index values

Feature code    Between-class divergence Scb    Within-class divergence Scw    Separability index Jb
1               3.579                           0.282                          0.996
2               2.099                           0.578                          0.935
3               4.967                           0.004                          1
4               3.555                           0.287                          0.996
5               0.654                           0.867                          −0.238
6               3.955                           0.207                          0.998
7               4.273                           0.143                          0.999
8               3.526                           0.293                          0.996
9               3.253                           0.347                          0.993
10              3.172                           0.363                          0.992
11              3.202                           0.357                          0.992
12              3.588                           0.280                          0.996
13              3.347                           0.328                          0.994
14              1.874                           0.623                          0.895

Fig. 3.62 KDDA feature extraction of gear class combination 3

Table 3.24 Comparison of five-class (one-on-one) classification results for gear class combination 3

Classification algorithm     RVM        SVM
Kernel parameter             0.7        20
Training accuracy (%)        96         94.4
Test accuracy (%)            97         95
Training time (s)            15.6247    6.5049

Fig. 3.63 Gear class combination 3 five classification effect diagram

compared. The classification accuracies of RVM and SVDD are shown in Table 3.25,
which compares the classification results of the different kernel functions when the
values of Jb are large. From Table 3.25, the classification performance of the polynomial
kernel and the multilayer perceptron kernel is similar, and their classification accuracies
are basically the same. The Gaussian kernel classifies best, which may be related to the
choice of kernel parameters and is one of the reasons why Gaussian kernels are used
more often in applications.
In multi-class classification, because of the complexity of the data, the trained
classifier is unstable when the polynomial kernel or multilayer perceptron kernel is
used, whereas the Gaussian kernel is relatively stable and achieves better classification
results.
Once the training set is given, the kernel function and kernel parameters of RVM
must be selected when searching for a decision function with RVM. In most cases, the
kernel function is chosen based on experience and existing selection guidelines; the
kernel functions in the simulations and experiments above were chosen in this way.

Table 3.25 Comparison of classification accuracies (%) of different kernel functions

                                 RVM accuracy (%)                              SVDD accuracy (%)
Kernel function                  one & four   three & six   three & seven      one & four   three & six   three & seven
Gaussian kernel                  99.41        97.94         99.41              99.12        90.29         97.47
Polynomial kernel                99.12        94.71         98.53              92.35        82.94         91.18
Multilayer perceptron kernel     99.12        96.18         98.53              94.41        87.65         95.88

(2) The influence of kernel parameters on classifier performance

The following study examines the relationship between classification accuracy and
the kernel parameter of the Gaussian radial basis function RVM fault classifier, using
the bearing-class data set from Sect. 3.5.3.3. Figure 3.64 shows how the classification
accuracy of the Gaussian radial basis function fault classifier varies with the kernel
width σ; panels (a) and (b) show the training accuracy and the test accuracy,
respectively, as functions of the Gaussian kernel width. Figure 3.64 shows that the
kernel parameter has a significant impact on the classification performance of the
classifier, so an appropriate kernel parameter must be selected in use.
(3) The effect of relevance vectors on the classification performance of a classifier
Figure 3.65 shows the number of relevance vectors as a function of the Gaussian
kernel width σ.
Combining Figs. 3.64 and 3.65, it can be seen that the smaller σ is, the more
relevance vectors are used for classification, the more complex the computation, and
the longer the training time. The total number of training samples is 150. When the
value of σ is close to zero, the number of relevance vectors is close to 60, accounting
Fig. 3.64 Relationship between classification accuracy and Gaussian kernel width

Fig. 3.65 Relationship between σ and the number of relevance vectors

for about one-third of the total samples. Although the classification accuracy is 100%,
the prediction cost is too high and the classifier overfits. The larger the value of σ, the
smaller the number of relevance vectors and the lower the classification accuracy, and
the classifier falls into under-fitting. As can be seen from the graph, when σ > 3 there
is only one relevance vector, and the classifier is severely under-fitted. Therefore,
appropriate kernel parameters must be selected to ensure high classification accuracy
with a small number of relevance vectors.
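The trade-off described above can be illustrated numerically. RVM is not available in scikit-learn, so the sketch below deliberately substitutes a kernel SVM as an analogous sparse kernel machine (all names and data here are illustrative, not from the original experiments): shrinking the Gaussian width σ, i.e., raising gamma = 1/(2σ²), inflates the number of support vectors and pushes the model toward overfitting, mirroring the behavior of the relevance vectors in Figs. 3.64 and 3.65.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two noisy classes, 150 training samples as in the text's example
X = np.vstack([rng.randn(75, 2) - 1.0, rng.randn(75, 2) + 1.0])
y = np.array([0] * 75 + [1] * 75)

for sigma in (0.05, 0.5, 3.0):
    gamma = 1.0 / (2.0 * sigma ** 2)   # RBF kernel exp(-||x - z||^2 / (2 sigma^2))
    clf = SVC(kernel="rbf", C=100, gamma=gamma).fit(X, y)
    print(f"sigma={sigma:4.2f}  support vectors={clf.n_support_.sum()}")
```

Running the sweep shows the support-vector count shrinking as σ grows, the same qualitative pattern as Fig. 3.65.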

References

1. Zhu, X., Goldberg, A.B.: Introduction to Semi-supervised Learning. Synthesis Lectures on


Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers (2009)
2. Bian, Z.Q., Zhang, X., et al.: Pattern Recognition (in Chinese). Tsinghua University Press,
Beijing (2000)
3. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
4. Wang, H.Z., Yu, J.S.: Study on the kernel-based methods and its model selection (in Chinese).
J. Jiangnan Univ. 5(4), 500–504 (2006)
5. Smola, A.J.: Learning with Kernels. Technical University of Berlin, Berlin (1998)
6. Liao, G.L., Shi, T.L., Li, W.H.: Design and analysis of multiple joint robotic arm powered by
ultrasonic motors (in Chinese). Vibr. Meas. Diag. 03, 182–185 (2005)
7. Zhong, Q.L., Cai, Z.X.: Semi-supervised learning algorithm based on SVM and by gradual
approach (in Chinese). Comput. Eng. Appl. 25, 19–22 (2006)
8. Knorr, E.M., et al.: Algorithms for mining distance-based outliers in large datasets. In:
Proceedings of Very Large Data Bases Conference, pp. 392–403 (1998)
9. Knorr, E.M., Ng, R.T.: Distance-based outliers: algorithms and applications. In: Proceedings
of Very Large Data Bases Conference, vol. 8, pp. 237–253 (2000)
10. Breunig, M.M., Kriegel, H.P., et al.: LOF: identifying density-based local outliers. In: Proceed-
ings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas,
15–18 May 2000, pp. 93–104
11. Ester, M., Kriegel, H.P., Sander, J., et al.: A density-based algorithm for discovering clusters
in large spatial databases with noise. In: Proceeding of the 2nd International Conference on
Knowledge Discovery and Data Mining, Portland, pp. 226–231. AAAI Press (1996)
12. He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recogn. Lett. 24(9–
10), 1642–1650 (2003)
13. Zhang, T., Oles, F.J.: A probability analysis on the value of unlabeled data for classifica-
tion problems. In: Proceedings of the 17th International Conference on Machine Learning
(ICML’00), San Francisco, 29 June–2 July 2000, pp. 1191–1198
14. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-
supervise clustering. In: Proceedings of the 21st International Conference on Machine Learning,
Banff, 4–8 July 2004, pp. 81–88
15. Basu, S., Bilekno, M., Monoey, R.J.: A probabilistic framework for semi-supervised clustering.
In: Proceeding of the Tenth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, Seattle, 22–25 Aug 2004, pp. 59–68
16. Zhang, D.Q., Tan, K.R., Chen, S.C.: Semi-supervised kernel-based fuzzy c-means. In: Proceed-
ings of the International Conferences on Neural Information Processing, Calcutta, 22–25 Nov
2004, pp. 1229–1234
17. David, M.J.T., Piotr, J., Elzbieta, P., et al.: Outlier detection using ball descriptions with
adjustable metric. In: Proceeding of Joint IAPR International Workshops on Statistical Tech-
niques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR),
Hong Kong, 17–19 Aug 2006, pp. 587–595

18. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern.
43(1), 59–69 (1982)
19. Backer, S.D., Naud, A., Scheunders, P.: Non-linear dimensionality reduction techniques for
unsupervised feature extraction. Pattern Recogn. Lett. 19(1), 711–720 (1998)
20. Rubio, M., Gimnez, V.: New methods for self-organising map visual analysis. Neural Comput.
Appl. 12(3–4), 142–152 (2003)
21. Shao, C., Huang, H.K.: A new data visualization algorithm based on SOM (in Chinese). J.
Comput. Res. Dev. 43(3), 429–435 (2006)
22. Alhoniemi, E., Himberg, J., Parviainen, J., et al.: SOM Toolbox (Version 2.0) (1999). Available
at http://www.cis.hut.fi/projects/somtoolbox/download/. Accessed 20 Oct 2008
23. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn.
Res. 1, 211–244 (2001)
24. Yu, H., Yang, J.: A direct LDA algorithm for high-dimensional data-with application to face
recognition. Pattern Recogn. 34(10), 2067–2070 (2001)
25. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face recognition using kernel direct
discriminant analysis algorithms. IEEE Trans. Neural Netw. 14(1), 117–126 (2003)
26. David, M.J.T., Robert, P.W.D.: Support vector data description. Mach. Learn. 54, 45–66 (2004)
27. Kressel, U.: Pairwise Classification and Support Vector Machines, pp. 255–268. MIT Press,
Cambridge, MA (1999)
28. Kanaujia, A., Metaxas, D.: Learning Multi-category Classification in Bayesian Framework,
pp. 255–264. CBIM, Rutgers University (2006)
29. Li, X.: Semi-supervised Fault Classification Method Based on Kernel Function Principal
Component Analysis (in Chinese). South China University of Technology, Guangzhou (2007)
Chapter 4
Manifold Learning Based Intelligent Fault Diagnosis and Prognosis

4.1 Manifold Learning

A manifold is a generalization of Euclidean space on which every point has a
neighborhood homeomorphic to an open set of Euclidean space, so that it can be
described by a local coordinate system. Intuitively, a manifold can be viewed as the
result of gluing together pieces of Euclidean space; Euclidean space itself is a special
case, i.e., a trivial manifold [1]. Continuously differentiable manifolds are usually
studied in differential geometry, while in practical problems their properties are
obtained by discrete approximation of continuity. For a given high-dimensional data
set, the data variables can be represented by a small number of variables, which
geometrically corresponds to data points scattered on or near a low-dimensional
smooth manifold. The core of manifold learning is to learn and discover the
low-dimensional smooth manifold embedded in a high-dimensional space from a
limited number of discretely observed samples, so as to effectively reveal the
intrinsic geometric structure of the data.
Manifold learning has received a great deal of attention in machine learning,
pattern recognition, and data mining since 2000. In particular, three articles published
in the same issue of Science in December 2000 investigated manifold learning from
the perspectives of neuroscience and computer science and explored the relationship
between neural systems and the low-dimensional cognitive concepts embedded in
high-dimensional data space [1–3], making manifold learning a hot spot in machine
learning and data mining research. The application of manifold learning to mechanical
state identification focuses on the following three aspects: noise removal and weak
impulse signal extraction, state identification, and state trend analysis.
(1) Noise removal and weak impulse signal extraction. In practical engineering, the
collected vibration signals are inevitably disturbed by various noises because
of the complexity of the mechanical system and the variability of the working
environment. Effective noise reduction techniques will help to improve the diag-
nosis accuracy and reduce the failure occurrence by detecting the incipient fault
© National Defense Industry Press 2023 193
W. Li et al., Intelligent Fault Diagnosis and Health Assessment for Complex
Electro-Mechanical Systems, https://doi.org/10.1007/978-981-99-3537-6_4

of the machinery in time. As a hotspot of machine learning and pattern recognition,
manifold learning has been successfully applied to noise reduction and weak impulse
signal extraction. Both the traditional noise reduction methods and the current
manifold learning methods operate on the time-domain vibration signal. The advantage
of such time-domain noise reduction is that the fault generation mechanism can be
studied thoroughly; however, the length of the signal often leads to low noise reduction
efficiency and large storage requirements, especially in mechanical fault diagnosis.
To ensure sufficient frequency-domain resolution, a large amount of time-domain data
is needed, generally tens of thousands of points. In such cases, the time-domain noise
reduction method is greatly limited in computational efficiency, which is not conducive
to online monitoring.
(2) State identification. The various feature indicators used to describe the health
state of mechanical systems are redundant, and traditional single indicators cannot
completely describe the operating state of complex equipment. Therefore, fusing
multidimensional feature indicators by eliminating the redundant components
between indicators and extracting effective features that describe the equipment
health status has become the key to manifold learning-based diagnostic methods.
(3) State trend analysis. In addition to describing the state trend of the equipment
with feature indicators, building a state prediction model based on manifold
learning can better identify the time of failure and predict the remaining life of
the equipment.

4.2 Spectral Clustering Manifold Based Fault Feature


Selection

4.2.1 Spectral Clustering

4.2.1.1 Spectral Graph Theory

The mathematical essence of graph theory is the combination of combinatorics and
set theory. A graph G consists of two sets: a non-empty set of nodes V and a finite set
of edges E. A weight W(e) is assigned to each edge e of G, and G together with the
weights on its edges is called a weighted graph [4]. The main approach of spectral
graph theory is to establish and represent the topology of a graph through graph
matrices (the adjacency matrix, Laplacian matrix, signless Laplacian matrix, etc.),
and in particular to study the connections between the invariants of a graph and the
spectral invariants represented by these matrices.
Many methods for studying the Laplacian eigenvalues of graphs are borrowed from
the study of the eigenvalues of the adjacency matrix or of quantities relating the two
spectra. Because the vertex degrees enter the definition of the Laplacian matrix, the
Laplacian eigenvalues better reflect the graph-theoretic properties of a graph, so their
study has received more and more attention; the spectral clustering algorithm is
likewise based on Laplacian eigenvalue decomposition.

4.2.1.2 Spectral Clustering Feature Extraction

The idea of a spectral clustering algorithm is derived from the spectral graph partition
theory [5]. It is assumed that each data sample is regarded as a vertex V in the graph,
and the edge E between the vertices is assigned a weight value W according to the
similarity between the samples so an undirected weighted graph G = (V , E) based
on the similarity of the samples is obtained. In the graph G, the clustering problem can
be transformed into a graph partitioning problem, and the optimal division criterion
is to maximize the similarity within the two subgraphs and minimize the similarity
between the subgraphs [6].
According to the algorithm of spectral clustering, the feature extraction algorithm
based on the spectral clustering is as follows:

Algorithm 1 Feature extraction based on spectral clustering

Input: The graph similarity matrix W constructed from the original data.
Output: Low-dimensional dataset Y = {y1, y2, …, yN} after dimensionality reduction.
Steps:
(1) Calculate the Euclidean distance matrix S of the original data, where the
Euclidean distance of any two points serves as their similarity:

s(i, j) = ||xi − xj||^2    (4.1)

(2) Build a complete undirected graph G = (V, E) from the data matrix, where the
nodes V correspond to the data points and the edges E connect any two nodes. The
weight matrix W of the edges represents the similarity between the data and is
calculated as follows (σ is the control parameter):

w(i, j) = exp(−s(i, j)/(2σ^2))    (4.2)

(3) Calculate the degree matrix D, whose diagonal elements are the sums of the
corresponding columns of the graph weights:

D(i, i) = Σj w(i, j)    (4.3)
(4) Construct the Laplacian matrix from the degree matrix D and the weight matrix W:

L = D − W    (4.4)

(5) Diagonalize the matrix L and calculate its eigenvalues and eigenvectors:

L = UΛU^T    (4.5)

where the column vectors of U are the eigenvectors of L, and Λ is the diagonal
matrix of the eigenvalues λ1, λ2, …, λn.
(6) Select the eigenvectors corresponding to the first r largest non-negative
eigenvalues to form the transformation matrix, where r is chosen so that the share p
of the first r largest non-negative eigenvalues in the sum of all non-negative
eigenvalues,

p = (Σ_{i=1}^{r} λi) / (Σ_{j=1}^{n} λj)    (4.6)

lies in the range 85–100%.
(7) The coordinates in the low-dimensional space are represented as:

Y = Ur Λr^{1/2}    (4.7)

where Ur is the n × r matrix composed of the first r eigenvector columns, and Λr
is the r × r diagonal matrix of the corresponding eigenvalues.
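As an illustration, the seven steps above can be sketched in a few lines of NumPy. This follows the procedure exactly as stated, including selecting the largest non-negative eigenvalues; the function and parameter names are ours, not the authors'.

```python
import numpy as np

def spectral_feature_extraction(X, sigma=1.0, p=0.9):
    """Illustrative NumPy sketch of Algorithm 1 (names are assumptions)."""
    # (1) Euclidean distance matrix, Eq. (4.1)
    diff = X[:, None, :] - X[None, :, :]
    S = np.sum(diff ** 2, axis=2)
    # (2) Gaussian edge weights, Eq. (4.2)
    W = np.exp(-S / (2.0 * sigma ** 2))
    # (3)-(4) degree matrix and Laplacian, Eqs. (4.3)-(4.4)
    D = np.diag(W.sum(axis=0))
    L = D - W
    # (5) eigen-decomposition, Eq. (4.5)
    lam, U = np.linalg.eigh(L)
    order = np.argsort(lam)[::-1]          # largest eigenvalues first, as in the text
    lam, U = np.clip(lam[order], 0.0, None), U[:, order]
    # (6) smallest r whose eigenvalue share reaches p, Eq. (4.6)
    ratio = np.cumsum(lam) / lam.sum()
    r = int(np.searchsorted(ratio, p)) + 1
    # (7) low-dimensional coordinates, Eq. (4.7)
    return U[:, :r] * np.sqrt(lam[:r])
```

Calling the function on an (N, d) data matrix returns the N low-dimensional coordinates y1, …, yN as rows.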

4.2.2 Spectral Clustering Based Feature Selection

4.2.2.1 Incipient Fault Feature Selection

In mechanical fault diagnosis, it is crucial to select features that effectively reflect
fault information. A common method is to use time- and frequency-domain feature
indicators as the original input data of pattern recognition methods. Common
transmission fault features have a certain regularity, and grasping this regularity is
very important for analyzing and extracting the corresponding fault features. When
the bearings and gears in a transmission run normally, the vibration signal is
generally smooth, and the signal frequency components include the rotation
frequency of each bearing, the meshing frequency of the gears, etc. When a fault
occurs, the frequency components or amplitudes of the vibration signal change.
Constructing indicators that reflect these changes is an effective way to perform fault
classification. In addition, the various existing feature indicators usually contain
some redundant and useless information, especially in the time domain, so selecting
effective indicators is important for reducing the input dimensionality and improving
the correct classification rate.
1. Feature indicators in the time domain
The commonly used 11 time-domain waveform features can be divided into two
groups: dimensional and dimensionless.

Dimensional indicators: Mean Square Value xa (1), Kurtosis xq (2), Mean x (3),
Variance σ^2 (4), Skewness xs (5), Peak xp (6), Root Mean Square xr (7).
Dimensionless indicators: Shape Indicator K (8), Crest Indicator C (9), Impulse
Indicator I (10), Clearance Indicator L (11).
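For reference, the 11 indicators can be computed from a signal window as sketched below. Formula conventions for the dimensionless indicators vary slightly between references, so treat the exact definitions (and the function name) as assumptions rather than the authors' implementation.

```python
import numpy as np

def time_domain_indicators(x):
    """Sketch of the 11 time-domain indicators; conventions vary by reference."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    rms = np.sqrt(np.mean(x ** 2))                 # Root Mean Square x_r
    peak = abs_x.max()                             # Peak x_p
    mean_abs = abs_x.mean()
    return {
        "mean_square": np.mean(x ** 2),            # Mean Square Value x_a
        "kurtosis": np.mean((x - x.mean()) ** 4) / np.var(x) ** 2,
        "mean": x.mean(),
        "variance": np.var(x, ddof=1),
        "skewness": np.mean((x - x.mean()) ** 3) / np.std(x) ** 3,
        "peak": peak,
        "rms": rms,
        "shape": rms / mean_abs,                   # Shape Indicator K
        "crest": peak / rms,                       # Crest Indicator C
        "impulse": peak / mean_abs,                # Impulse Indicator I
        "clearance": peak / (abs_x ** 0.5).mean() ** 2,  # Clearance Indicator L
    }
```

For a pure sine wave the crest indicator evaluates to √2 and the shape indicator to π/(2√2), which is a quick sanity check for any implementation of these formulas.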
When dimensional statistical feature values are used for amplitude analysis, the
results obtained are related not only to the state of the electromechanical equipment
but also to the operating parameters of the machine (such as speed, load, etc.). The
measured dimensional feature values of transmissions of different types and sizes
are not comparable, and sometimes even those of transmissions of the same type
and size cannot be compared directly. Therefore, when conducting equipment fault
diagnosis, it is necessary to ensure the consistency of operating parameters and
measurements.
The dimensionless parameters are related only to the condition of the machine and
are largely independent of its operating parameters. A dimensionless indicator is not
affected by the absolute level of the vibration signal and is independent of the
sensitivity of the vibration sensor, the amplifier, and the amplification of the entire
test system, so the system does not need to be calibrated, and no measurement error
arises even if the sensitivity of the sensor or amplifier changes. However, the
indicators differ in their sensitivity to fault signals. When the fault appears as
surge-type vibration, the peak indicator, crest indicator, and clearance indicator are
more sensitive to it than the root mean square value; as the degree of failure
increases significantly, these indicators decrease, which indicates that the three
indicators are more sensitive to incipient faults.
2. Frequency energy factor
The vibration of a gearbox generally consists of the following frequency
components [7]:
(1) The rotational frequency of each bearing and its higher harmonics.
(2) The gear mesh frequency (GMF) and its higher harmonics.
(3) Side-band frequencies generated by modulation of the GMF, with the GMF and
its higher harmonics as the carrier frequencies and the rotational frequency of
the bearing carrying the gear and its higher harmonics as the modulation
frequencies.
(4) Side-band frequencies generated by resonance modulation of the gear, with the
natural frequency of the gear as the carrier frequency and the rotational
frequency of the bearing carrying the gear and its higher harmonics as the
modulation frequencies.
(5) Side-band frequencies generated by resonance modulation of the gearbox
housing, with the natural frequency of the housing as the carrier frequency and
the rotational frequency of the bearing carrying the gear and its higher
harmonics as the modulation frequencies.
(6) Modulation sidebands with the natural frequency as the carrier frequency and
the rolling-bearing pass frequency as the modulation frequency.
(7) Hidden components.

(8) Cross-modulation components.


As shown in Fig. 4.1, the modulation form exhibited by a gear fault mainly depends
on the excitation energy, and different fault degrees show different modulation forms.
For a slight fault, such as slight shaft bending or small-area tooth-surface pitting with
a small number of pits, the meshing frequency is usually frequency-modulated. When
the fault is more serious and the excitation energy larger, the natural frequency of the
gear itself is excited, generating resonance modulation with the gear natural frequency
as the carrier frequency; when the excitation energy is very large and the fault very
serious, the natural frequency of the gearbox housing is excited, generating housing
natural frequency modulation.
For incipient gear faults, an energy factor indicator is constructed, which reflects
the difference in modulation energy of incipient faults. The specific approach is as
follows:
(1) Perform a fast Fourier transform of the original signal to obtain the FFT
spectrum.
(2) Calculate the gear GMF and its multiples n·fz, n = 1, 2, 3, ….
(3) Calculate the spectral line number km of the frequency fm (m = 1, 2, 3, …)
nearest to the mesh frequency and each of its multiples.
(4) Define, with Ai denoting the spectral amplitude of line i,

Δm = (Σ_{i=k_{m−1}+1}^{k_{m+1}−1} Ai) / (Σ_j Aj),  m = 2, 3, 4, …;

when m = 1:

Δ1 = (Σ_{i=1}^{k_2−1} Ai) / (Σ_j Aj)    (4.8)
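Reading steps (1)-(4) literally, a hypothetical implementation of the energy factor might look as follows; the band boundaries follow Eq. (4.8), with Ai taken as FFT amplitudes, and the function and parameter names are illustrative assumptions.

```python
import numpy as np

def energy_factors(signal, fs, gmf, n_bands=4):
    """Sketch of the energy-factor indicator of Eq. (4.8); names are illustrative."""
    A = np.abs(np.fft.rfft(signal))                # (1) amplitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # (2)-(3) spectral line numbers k_m nearest to the m-th mesh-frequency harmonic
    k = [int(np.argmin(np.abs(freqs - m * gmf))) for m in range(1, n_bands + 2)]
    total = A.sum()
    deltas = []
    for m in range(1, n_bands + 1):
        lo = 0 if m == 1 else k[m - 2] + 1         # k_{m-1} + 1  (k[m-2] holds k_{m-1})
        hi = k[m] - 1                              # k_{m+1} - 1  (k[m]   holds k_{m+1})
        deltas.append(A[lo:hi + 1].sum() / total)  # (4) band share of total amplitude
    return deltas
```

Each Δm is the share of spectral amplitude concentrated around the m-th mesh-frequency harmonic, so a signal dominated by the fundamental GMF yields a large Δ1.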

3. Case study

In this section, gear failure experimental data from the Laborelec laboratory were
used, covering a tooth-face fatigue pitting fault, a slight tooth-face spalling fault,
and a severe tooth-face spalling fault. 37 sets of each fault type were selected; the
gear parameters and experimental conditions are described in Sect. 3.4.4.

Fig. 4.1 Fault modulation of gear
(1) The original gear-state feature set S is composed of the 11 time-domain
statistical feature parameters above, which describe the gear fault type.
(2) The calculated gear meshing frequency is 304 Hz, and Δ1, Δ2, Δ3, and Δ4 are
calculated according to Eq. (4.8). The original gear-state feature set S is
composed of these four feature parameters, which describe the gear fault type.
In both cases (1) and (2), 91 labeled samples and 20 unlabeled samples are
randomly selected from the 111 samples. A support vector machine (SVM) is used
to train the classifier and predict the types of the unlabeled samples, with C = 100
and σ = 0.85; the results are shown in Fig. 4.2.

Fig. 4.2 Classification results of different feature indicators



From Fig. 4.2a, samples 7 and 88 are misclassified and the accuracy is 90%. In
Fig. 4.2b, the accuracy rises to 100%, which demonstrates the effectiveness of the
proposed energy factor indicator.
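The classification step can be reproduced with any SVM library. The sketch below uses scikit-learn on synthetic stand-in data, since the original samples are not available; the mapping gamma = 1/(2σ²) between the text's kernel width σ and scikit-learn's gamma, as well as all names and the synthetic class layout, are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# stand-in for the 111 gear samples: 3 fault classes x 37 sets, 4 energy-factor features
X = np.vstack([rng.randn(37, 4) + shift for shift in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 37)

# 91 labeled / 20 unlabeled split, as in the text
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:91], idx[91:]

sigma = 0.85                                   # kernel width from the text
clf = SVC(C=100, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X[train_idx], y[train_idx])
accuracy = clf.score(X[test_idx], y[test_idx])
```

On well-separated synthetic classes the held-out accuracy is high, analogous to the 100% result reported for the energy-factor features.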

4.2.2.2 PCA-Based Feature Selection

In pattern recognition, some input feature indicators cannot reflect the machine's
operating state; they not only add redundant information for detection and
classification but also lengthen the computation time, so feature selection is
necessary.
The normal and slightly spalled signals from the Laborelec laboratory's gear fault
experimental data were selected to validate the method, and PCA was adopted to
select the features with a higher impact on the prediction results. The programming
environment was Matlab 7.6.0 on a Pentium dual-core CPU (2.50 GHz) with 2 GB
of memory. Table 4.1 provides the contribution rates and accumulated contribution
rates of the 11 features. The accumulated contribution rate of the first three principal
components reaches 90.52%. Only the first three principal components were
analyzed, because an accumulated contribution rate of 85% can already represent
most of the information contained in the original variables.
Figure 4.3 shows the contribution of each feature indicator to the first three
principal components, and Fig. 4.4 provides the extraction rates of the first three
principal components on each feature indicator. The feature indicators (Mean Square
Value xa (1), Kurtosis xq (2), Mean x (3), Variance σ^2 (4), Peak xp (5), Root Mean
Square xr (6), Shape Indicator K (7), Crest Indicator C (8), Impulse Indicator I (9),

Table 4.1 Eigenvalues and contribution rates of the principal components

Principal component    Eigenvalue λ    Contribution rate C (%)    Accumulated contribution AC (%)
1                      6.0382          54.89                      54.89
2                      2.9032          26.39                      81.28
3                      1.0161          9.24                       90.52
4                      0.5679          5.16                       95.68
5                      0.4251          3.87                       99.55
6                      0.0418          0.38                       99.93
7                      0.0056          0.05                       99.98
8                      0.0009          0.02                       100
9                      0.0004          0                          100
10                     0.0003          0                          100
11                     4.2285e−05      0                          100

Clearance Indicator L (10)) with extraction rates greater than 80% were selected for
classification learning, and the sample set became 74 × 9-D.
44 labeled samples and 30 unlabeled samples were randomly selected from the
74 × 9-D sample set. Similarly, an SVM was used to train the classifier and predict
the types of the unlabeled samples, with C = 100 and σ = 0.5. The classification
results are shown in Fig. 4.5.
From Fig. 4.5a, samples 27, 39, 66, and 73 are misclassified and the accuracy is
86.66%. After running the PCA algorithm, with only 9-dimensional input features,
the accuracy rises to 90% and only samples 40, 50, and 57 are misclassified. In terms
of training and prediction time, the 11-dimensional input takes 0.029924 s, while the
9-dimensional input takes only 0.025856 s.
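The contribution rates behind Table 4.1 follow from a standard PCA eigenvalue computation on standardized features, which might be sketched as below (a generic computation with names of our choosing, not the authors' Matlab code):

```python
import numpy as np

def pca_contribution(X):
    """Contribution and accumulated contribution rates, as in Table 4.1 (sketch)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)         # standardize each feature
    lam = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # eigenvalues, descending
    contrib = lam / lam.sum()
    return contrib, np.cumsum(contrib)
```

Selecting the first r components where the accumulated contribution exceeds 85% then reproduces the r = 3 choice made in the text.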

Fig. 4.3 Contribution of the first three principal components on features

Fig. 4.4 Extraction rates of the first three principal components on the feature indicators

Fig. 4.5 Prediction results of different dimensional features

4.2.2.3 Feature Extraction Based on Density-Adjustable Spectral Clustering

1. Similarity metric

From Fig. 4.6, it is common for distance-based methods to classify samples a and b
into one category while samples a and c fall into different categories, because the
distance between a and b is smaller than that between a and c. Some similarity
metric must be designed so that the distance between a and c becomes smaller than
that between a and b, yielding the correct classification result. Therefore, the
density-based clustering assumption [8] is defined as enlarging the length of those paths that cross

Fig. 4.6 Space distribution based on the density clustering assumption

the low-density region while shortening the length that does not cross the low-density
region.
By calculating the dissimilarity between each pair of nodes based on a
density-sensitive distance, the original data are transformed into a pairwise
dissimilarity space, and the corresponding dissimilarity matrix is obtained.
Dimensionality reduction is then achieved by computing the eigenvalues of the
original data in the lower-dimensional space. Such a density-adjustable distance is
defined in Ref. [9].

Definition 1

l(x, y) = ρ^dist(x,y) − 1    (4.9)

where dist(x, y) denotes the Euclidean distance between x and y, and ρ > 1 is the
density-adjustable factor. Such a distance satisfies the clustering hypothesis and can
be used to describe the global consistency of the clusters, as demonstrated in
Ref. [9]. The adjustment factor ρ can be tuned to lengthen or shorten the distance
between two points.
As shown in Fig. 4.7, the straight-line distance between points a and c is L. The
paths from a to c along the data distribution are l1, l2, …, lm, and the segments of the
ith path li are li1, li2, …, lin. Obviously, li1 + li2 + … + lin ≥ L. However, after
introducing the adjustment factor ρ, there exists a ρ such that
ρ^li1 + ρ^li2 + … + ρ^lin − n < ρ^L − 1. In Fig. 4.7, the distance between a and c is 8
and ab + bc = 9 > ac; after introducing ρ and setting ρ = 2, the corresponding
distances become the values in the boxes of the figure, i.e., ab + bc = 70 < ac = 255.
Therefore, the edge ac is assigned a weight of 70 when the graph is built.
The similarity metric is defined as:

s0(i, j) = 1 / (dsp(l(i, j)) + 1)    (4.10)

Fig. 4.7 Shortest path when ρ = 2

where dsp(l(i, j)) is the shortest-path distance between i and j after adjustment by
the density factor.
The density-based clustering hypothesis can be realized by finding the
density-adjustable shortest-path distance between any two points in the graph and
assigning the corresponding weights. The shortest paths can be calculated by
Dijkstra's algorithm, the Floyd–Warshall algorithm, etc. All results in this case are
obtained with Johnson's algorithm in Matlab 7.6.0, which combines the
Bellman–Ford algorithm, reweighting (reassignment of weights), and Dijkstra's
algorithm, and is well suited to computing the shortest paths.
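The density-sensitive similarity can also be prototyped with SciPy, whose `shortest_path` routine with `method="J"` implements Johnson's algorithm; the function below is an illustrative sketch with names of our choosing, not the authors' Matlab code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def density_sensitive_similarity(X, rho=2.0):
    """Sketch of Eqs. (4.9)-(4.10); function name and API are assumptions."""
    dist = squareform(pdist(X))                 # pairwise Euclidean distances
    length = rho ** dist - 1.0                  # density-adjustable edge length, Eq. (4.9)
    dsp = shortest_path(length, method="J")     # Johnson's shortest-path algorithm
    return 1.0 / (dsp + 1.0)                    # similarity metric, Eq. (4.10)
```

On three points forming the Fig. 4.7 triangle (sides ab = 3, bc = 6, ac = 8) with ρ = 2, the density-adjusted path a → b → c has length 7 + 63 = 70, shorter than the direct edge 2⁸ − 1 = 255, reproducing the worked example.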
2. Feature extraction based on density-adjustable spectral clustering
After the density adjustment factor is introduced, a feature extraction algorithm
based on density-adjustable spectral clustering is proposed, which shortens the
distance within the same category and increases the distance between different
categories after feature extraction.

Algorithm 2 Feature extraction algorithm based on density-adjustable spectral clustering

Input: Graph similarity matrix W of the original data; the density-adjustable factor ρ.
Output: Low-dimensional dataset Y = {y1, y2, …, yN} after dimensionality reduction.
Steps:
(1) Repeat (1) in Algorithm 1.
(2) Calculate the density-adjusted distance matrix S0 using ρ.
(3) Construct a complete undirected graph G = (V, E) based on the original data
matrix, where V is the set of nodes and E is the set of edges connecting any two
nodes. The weights of the edges represent the similarity between the data points and
are calculated as (σ is the control parameter):

w(i, j) = exp(−s0(i, j)/(2σ²)) (4.11)

(4) Repeat steps (3) ~ (7) as in Algorithm 1.
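Algorithm 2 delegates its spectral steps to Algorithm 1, which is not reproduced in this excerpt; the generic sketch below shows the usual normalized-Laplacian embedding (Ng–Jordan–Weiss style) such a step performs. It is an illustration under that assumption, not the authors' exact Algorithm 1:

```python
import numpy as np

def spectral_features(W, k):
    """Embed samples into the k smallest eigenvectors of the normalized
    Laplacian L_sym = I - D^{-1/2} W D^{-1/2} of similarity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    Y = vecs[:, :k]                        # k smallest eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # row-normalize
    return Y

# Two well-separated blobs: rows within a cluster map to (nearly) one point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / (2 * 0.5 ** 2))          # Gaussian similarity, sigma = 0.5
Y = spectral_features(W, 2)
```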


4.2 Spectral Clustering Manifold Based Fault Feature Selection

3. Case study

The two-circles dataset in Ref. [10], the three-spirals and toy datasets in Ref. [6],
and the well-known Fisher iris dataset from the UCI repository are selected as
artificial datasets. The principal component, spectral clustering, and density-adjustable
spectral clustering feature extraction methods are applied to each of them.
From Fig. 4.8, it can be seen that feature extraction based on spectral clustering
outperforms the principal component method: it effectively distinguishes the categories
of the two-circles, three-spirals, and toy datasets with manifold structures, whereas
the principal component method is almost ineffective for such structures. After
adding the density-adjustable factor, the feature extraction method based on density-
adjustable spectral clustering increases the distance between different categories and
shortens the distance within the same category. For the Fisher iris dataset, the first
three dimensions also show that feature extraction based on spectral clustering and
density-adjustable spectral clustering is superior to the principal component method.
However, the feature distributions of the three feature extraction methods on the iris
dataset are similar, because the first three dimensions after feature extraction do not
completely reflect the feature discrepancy; the dimension of the data after
dimensionality reduction is still greater than 3.
The spectral clustering feature extraction method introduces a Gaussian kernel,
so it works well on manifold structures. Besides the parameter ρ, the parameter σ
is also introduced into the density-adjustable spectral clustering algorithm. Taking
the three-spirals dataset as an example, Fig. 4.9a provides the spectral clustering
features for σ = 0.1, σ = 0.2, and σ = 0.3. It can be seen that the feature extraction
results are sensitive to σ, and the feature discrepancy at σ = 0.1 is significantly
better than at σ = 0.2 and σ = 0.3. It is verified that the method extracts better
features when σ lies in the interval [0.04, 0.12]. Figure 4.9b shows the feature
distribution based on the density-adjustable spectral clustering method with ρ = 20.
At σ = 0.2 and σ = 0.3 the feature distribution discrepancies are inferior to σ = 0.1,
but the three categories can still be correctly distinguished. It is verified that the
method extracts better features when σ lies in the interval [0.09, 0.3]. Compared to
the spectral clustering feature extraction method, the usable parameter range of this
method is larger. The feature discrepancy also changes obviously when ρ is changed,
as shown in Fig. 4.9c, and the results for ρ = 5 and ρ = 15 are better than for ρ = 10.

Fig. 4.8 Feature distribution by different feature extraction methods



Fig. 4.8 (continued)

Fig. 4.9 Feature distribution on different parameters: (a) spectral clustering feature
distribution for different σ; (b) density-adjustable spectral clustering feature
distribution for different σ (ρ = 20); (c) density-adjustable spectral clustering
feature distribution for different ρ (σ = 0.5)

4.2.3 DSTSVM Based Feature Extraction

4.2.3.1 Density-Adjustable Spectral Clustering-Based DSTSVM

1. DSTSVM

The spectral clustering algorithm, based on spectral graph theory, can obtain the
global optimal solution. By introducing the density-adjustable factor and a similarity
measure based on the minimum path, the density-adjustable spectral clustering method
shortens the distance within the same category and expands the distance between
different categories, which adequately reflects the data structure. Semi-supervised
SVM [11] incorporates information from unlabeled samples and has few-shot and
nonlinear characteristics. Based on the above theories, a method combining density-
adjustable spectral clustering and semi-supervised SVM (DSTSVM) is proposed. In this
method, density-adjustable spectral clustering is used to extract features that serve
as input to the semi-supervised SVM. The kernel function is a Gaussian kernel, and the
classification results are obtained after training by the gradient descent method.

Algorithm 3 Density-adjustable spectral clustering and semi-supervised SVM (DSTSVM)

Input:
(1) Data: m × n dimensional raw data, which includes both labeled and unlabeled
samples.
(2) Parameters: density-adjustable factor ρ, penalty parameter C, and Gaussian kernel
width σ.
Steps:
(1) Extract features using Algorithm 2, and derive the kernel function of the semi-
supervised SVM.
(2) Train the semi-supervised SVM on y1, y2, …, ym using the gradient descent method
proposed by Olivier Chapelle [12].
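Chapelle's gradient-based TSVM trainer is not available in common Python libraries, so the data flow of Algorithm 3 can only be approximated with off-the-shelf parts. The sketch below substitutes a self-training SVM for the transductive SVM purely to illustrate how labeled and unlabeled samples are trained together (unlabeled samples marked with −1); it is a stand-in, not the authors' DSTSVM:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
labeled = rng.choice(len(y), size=50, replace=False)

y_semi = np.full(len(y), -1)          # -1 marks unlabeled samples
y_semi[labeled] = y[labeled]

# Self-training SVM as a stand-in for the TSVM co-training step.
model = SelfTrainingClassifier(SVC(C=100, gamma='scale', probability=True))
model.fit(X, y_semi)                  # labeled + unlabeled trained together

unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
acc = (model.predict(X[unlabeled]) == y[unlabeled]).mean()
print(f"accuracy on the 100 unlabeled samples: {acc:.3f}")
```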

2. Case study
To verify the effectiveness of DSTSVM, a simulation experiment was conducted on the
Fisher iris dataset. Fifty labeled samples were randomly selected from the 150 samples,
and the sampling was repeated 10 times. The two experimental setups were as follows.
(1) The 50 labeled samples selected each time are fed into the SVM to train the
model, which then predicts the labels of the remaining 100 unlabeled samples.
The output is the average accuracy over the 10 predictions.
(2) The 50 labeled samples and the unlabeled samples selected each time are fed
together into the semi-supervised transductive SVM (TSVM), the cluster-kernel
semi-supervised support vector machine based on spectral clustering (CKSVM),
and DSTSVM for co-training, which then predict the labels of the 100 unlabeled
samples. The output is the average accuracy over the 10 predictions.
In the above experiments, C = 100, and the Gaussian kernel width and the ρ in
DSTSVM were taken as the best parameters. The specific results are listed in Table 4.2.
From Table 4.2, it can be seen that TSVM has the lowest accuracy, CKSVM performs
better than TSVM, and DSTSVM achieves the best classification results, similar to
the supervised SVM method.
3. Parameter optimization

Table 4.2 Accuracy of different methods on the iris dataset

Method    Parameter         Accuracy (%)
SVM       σ = 1.05          94.4
TSVM      σ = 1.05          81.5
CKSVM     σ = 1.2           93.3
DSTSVM    ρ = 2, σ = 1.3    94.6

(1) Impact of parameters

The role of C is to adjust the range of the confidence interval. The kernel parameter
σ implicitly changes the mapping function, which in turn changes the subspace
distribution complexity of the sample data, i.e., the maximum VC dimension of the
linear classification interface. The parameter ρ changes the distribution of the data
after feature extraction. To some extent, all these parameters affect the
classification results.
For the iris dataset, with σ = 0.7 and ρ = 2, the error rate of the corresponding 10
predictions for different C is shown in Fig. 4.10.
From Fig. 4.10, we can see that the error is high when C is very small and decreases
sharply as C increases. The error rate converges once C exceeds a certain value, such
as C = 10 in Fig. 4.10; beyond this point, the performance of DSTSVM is not affected
by C. However, when C exceeds about 3000, the error rate increases again.
Setting C = 100 and ρ = 2, the error rate of the corresponding 10 predictions for
different σ is shown in Fig. 4.11. It can be seen that the error first decreases and
then increases as σ increases, and good values are achieved in the interval [0.4, 1.9].
Setting C = 100 and σ = 0.7, the error rate of the corresponding 10 predictions for
different ρ is shown in Fig. 4.12. It can be seen that superior results are achieved
for ρ below 35, and the error rate is lowest when ρ = 5.6. As ρ increases further,
the error rate also increases.

(2) Parameter optimization

Fig. 4.10 Error rate of classification on different C

Fig. 4.11 Error rate of classification on different σ

Fig. 4.12 Error rate of classification on different ρ

Based on the above analysis, the performance of the classifier remains stable as
long as C exceeds a certain threshold. The classification results are relatively
sensitive to σ and ρ, especially σ, while ρ acts as a fine-tuning of the
classification results and can improve the accuracy at certain values. Therefore,
the optimal performance (lowest error rate) of DSTSVM is achieved by selecting the
best combination of these parameters.
The classification results of different combinations are shown in Fig. 4.13, where
σ ∈ [0.4, 1.9] with a step of 0.05 and ρ ∈ [2, 35] with a step of 1. The maximum
average classification accuracy over 10 runs is 94.6%, obtained when (ρ, σ) is taken
as (2, 1.3). From Fig. 4.13, the average accuracy does not fluctuate much in the
middle region, i.e., σ ∈ [0.4, 1.9] and ρ ∈ [2, 33]. It can be concluded that the
performance of the DSTSVM method remains stable as long as the parameters lie in a
reasonable range, which serves as a guideline for parameter selection in later work.
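The grid search over (ρ, σ) described above can be sketched as follows. Since DSTSVM is not publicly packaged, a plain SVC stands in for it and `extract_features` is a placeholder hook for Algorithm 2; both are assumptions of this sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def extract_features(X, rho):
    """Placeholder for Algorithm 2 (density-adjustable spectral features);
    returns the raw features so the search skeleton stays runnable."""
    return X

best = (None, -1.0)
for rho in [2, 5, 10, 20]:
    F = extract_features(X, rho)
    for sigma in np.arange(0.4, 1.95, 0.05):        # sigma grid, step 0.05
        gamma = 1.0 / (2.0 * sigma ** 2)            # Gaussian kernel width
        score = cross_val_score(SVC(C=100, gamma=gamma), F, y, cv=5).mean()
        if score > best[1]:
            best = ((rho, sigma), score)

print("best (rho, sigma):", best[0], "accuracy: %.3f" % best[1])
```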

4.2.3.2 Fault Diagnosis Model and Case Study

The essence of fault diagnosis is pattern recognition. A basic pattern recognition
system consists of four main parts: data acquisition, pre-processing, feature
extraction and selection, and classification decision. The fault features of mechanical

Fig. 4.13 The effect of parameter combination (ρ, σ) on classification accuracy

systems, especially incipient gear faults, are often submerged in noise [7]. It is
difficult to extract useful features using traditional signal processing methods, and
it is challenging to collect extensive labeled samples in practical application
scenarios. Therefore, it is necessary to construct a DSTSVM-based fault diagnosis
model.
As shown in Fig. 4.14, the fault diagnosis model first simulates various typical
abnormal or fault types of the equipment and obtains its vibration and speed signals
through various sensors. After the feature indexes are calculated and the data are
normalized, a fault knowledge library is created. When an actual fault occurs in
mechanical equipment, the extracted fault features and the established fault types are
used to co-train the model, and the output is the fault type of the mechanical
equipment. As a result, the final fault type of the device can be obtained by a
comprehensive decision, and the related interventions can be implemented. In addition,
new data or fault types obtained from the diagnosis can be added to the fault
knowledge library, so that the fault types for training the DSTSVM model increase and
the model becomes more consistent with the actual situation. The prediction or
classification results obtained by this model become increasingly accurate, and such
a cyclic process of training, identification, and retraining can also be used for
online condition monitoring of mechanical equipment.
To verify the effectiveness of the DSTSVM-based fault diagnosis model, a gear fault
experiment was conducted in the Laborelec laboratory, as detailed in Sect. 4.2.2;
features were selected by the PCA method.
The 9 feature indicators (mean square value, kurtosis, mean, variance, peak, root
mean square, crest indicator, impulse indicator, and clearance indicator) were used
for classification learning. From the 74 × 9-D samples, 40 labeled samples and 34
unlabeled samples were randomly selected, and the sampling was repeated 10 times.
After extracting features from all samples using the density-adjustable spectral clustering

Fig. 4.14 DSTSVM-based fault diagnosis model

method, the labeled and unlabeled samples were input together into a transductive
SVM for co-training and predicting the labels of the 34 unlabeled samples. The
penalty factor was set to 100.
To evaluate and validate the performance of the proposed method, fivefold cross-
validation (5-CV) was used. The original data were divided into 5 groups; each subset
was considered in turn as the validation set while the remaining 4 groups were used
as training sets, yielding 5 learning models. Over 10 repetitions of 5-CV, the
average classification accuracy of the 5 learning models was used as the
classification result.
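The 10-times-repeated fivefold CV protocol can be sketched with scikit-learn; the iris dataset stands in here for the 74-sample gear set, which is not reproduced in this excerpt:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# 10 repetitions of fivefold CV -> 50 fold scores, averaged at the end.
X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(C=100, gamma='scale'), X, y, cv=cv)
print(f"{scores.size} folds, mean accuracy {scores.mean():.4f}")
```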
Table 4.3 reports the comparison between the proposed method and the other methods.
Similar results are achieved by CKSVM and the supervised SVM method. TSVM has the
lowest accuracy among the three semi-supervised methods, comparable to inputting
features directly into the model without feature extraction, which demonstrates the
importance of feature extraction in incipient fault detection. Among these methods,
DSTSVM has the highest accuracy, which verifies its effectiveness for incipient
fault detection.

Table 4.3 Accuracy of different methods

Method    Parameter         Accuracy (%)
SVM       σ = 0.50          88.82
TSVM      σ = 0.5           88.23
CKSVM     σ = 1.15          90.47
DSTSVM    ρ = 2, σ = 1.6    92.94

4.2.4 Machinery Incipient Fault Diagnosis

4.2.4.1 Incipient Fault Detection and Classification for Transmission Gear

1. Data acquisition
The experimental system structure and transmission testing platform are detailed in
Sect. 3.2. The normal gear state and the various fault types used in this section
were derived from the fifth gear with a transmission ratio of 0.77 at a sampling
frequency of 40,000 Hz. The vibration acceleration signal in the X-direction was
taken at position 3 of the transmission input, and the transmission was operated
under three conditions, i.e., normal, gear incipient pitting, and gear incipient
spalling. Table 4.4 lists the 27 modes covering the different faults, torques, and
speeds. Table 4.5 lists the characteristic frequencies of the gearbox at different
speeds.
2. Incipient fault detection based on density-adjustable spectral clustering and
semi-supervised SVM

(1) Incipient fault detection under the same working condition


The three fault types cannot be completely distinguished in the time and frequency
domains under the operating condition of 800 r/min at the drive end and 75 N m at
the output end. To verify the effectiveness of the DSTSVM method, gear incipient
pitting and incipient spalling faults under this operating condition were detected
separately. The vibration acceleration signals of the normal state and the two fault
states (i.e., mode 5, mode 14, and mode 23 in Table 4.4) were collected, with
1024 × 90 sampling points per state. The gear original feature set S consisted of 11
statistical features, i.e., mean square value, kurtosis, mean, variance, skewness,
peak, root mean square, shape indicator, crest indicator, impulse indicator, and
clearance indicator. In total, there were 270 11-D samples across the three states,
with 90 samples per state, where each sample consisted of 1024 points.
A sample dataset was constructed for gear fault diagnosis. The data were grouped into
two subsets: one containing samples of the normal state and the incipient pitting
fault state, the other containing samples of the normal state and the incipient
spalling fault state. Each subset contained 180 11-D samples.
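The 11 statistical features of set S can be computed per 1024-point sample as follows; the book's exact normalizations are not given in this excerpt, so common textbook definitions are assumed here:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def statistical_features(x):
    """The 11 time-domain indicators of feature set S (assumed textbook
    definitions; normalizations vary across the literature)."""
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    abs_mean = np.mean(np.abs(x))
    sra = np.mean(np.sqrt(np.abs(x))) ** 2     # square-root amplitude
    return {
        'mean_square': np.mean(x ** 2),
        'kurtosis': kurtosis(x),
        'mean': np.mean(x),
        'variance': np.var(x),
        'skewness': skew(x),
        'peak': peak,
        'rms': rms,
        'shape_indicator': rms / abs_mean,
        'crest_indicator': peak / rms,
        'impulse_indicator': peak / abs_mean,
        'clearance_indicator': peak / sra,
    }

# One 1024-point sample, as in the gear experiments (synthetic data here).
x = np.random.default_rng(0).normal(size=1024)
feats = statistical_features(x)
```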
For gear incipient pitting fault detection, the PCA algorithm was used to select
important features from the 11-D original features. Figure 4.15a provides the
contribution rate and accumulated contribution of the first three principal
components; the accumulated contribution of the first three principal components
reaches more than 90%, so they can represent the 11-D original features. The
extraction rates of the 11-D original features by the first three principal
components are shown in Fig. 4.15b. Nine feature indicators (mean square value,
kurtosis, variance, skewness, peak, root mean square, crest indicator, impulse
indicator, and clearance indicator) are extracted by the first three principal
components with extraction rates above 80%, so they are selected

Table 4.4 Different modes of gears

Mode   Fault type               Rotational speed at the drive end (r/min)   Torque at the output end (N m)
1      Normal                   600                                         50
2      Normal                   600                                         75
3      Normal                   600                                         100
4      Normal                   800                                         50
5      Normal                   800                                         75
6      Normal                   800                                         100
7      Normal                   1000                                        50
8      Normal                   1000                                        75
9      Normal                   1000                                        100
10     Gear incipient pitting   600                                         50
11     Gear incipient pitting   600                                         75
12     Gear incipient pitting   600                                         100
13     Gear incipient pitting   800                                         50
14     Gear incipient pitting   800                                         75
15     Gear incipient pitting   800                                         100
16     Gear incipient pitting   1000                                        50
17     Gear incipient pitting   1000                                        75
18     Gear incipient pitting   1000                                        100
19     Gear incipient spalling  600                                         50
20     Gear incipient spalling  600                                         75
21     Gear incipient spalling  600                                         100
22     Gear incipient spalling  800                                         50
23     Gear incipient spalling  800                                         75
24     Gear incipient spalling  800                                         100
25     Gear incipient spalling  1000                                        50
26     Gear incipient spalling  1000                                        75
27     Gear incipient spalling  1000                                        100

Table 4.5 The characteristic frequencies of the gear

Rotational speed at the drive end (rpm)       600        800        1000
Rotation frequency of input shaft (Hz)        10         13.33333   16.66667
Rotation frequency of middle shaft (Hz)       6.842105   9.122807   11.403509
Rotation frequency of output shaft (Hz)       13.0622    17.41627   21.77033
Mesh frequency of fifth-shifting gear (Hz)    287.3684   383.1579   478.9474

Fig. 4.15 PCA results for gear incipient pitting

as important features, and the sample set becomes 180 × 9-D. From the 180 × 9-D
samples, 30 labeled samples and 150 unlabeled samples were randomly selected, and
the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier
for co-training, predicting the labels of the 150 unlabeled samples after 5-CV.
The same experimental procedure was repeated for gear incipient spalling detection.
Figure 4.16a provides the corresponding contribution rate and accumulated
contribution. As can be seen from Fig. 4.16b, the extraction rate of all features by
the first three principal components exceeds 80% except for the shape indicator, so
the sample set becomes 180 × 10-D. From the 180 × 10-D samples, 30 labeled samples
and 150 unlabeled samples were randomly selected, and the sampling was repeated 10
times. All samples were fed into the DSTSVM classifier for co-training, predicting
the labels of the 150 unlabeled samples after 5-CV.
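The contribution rate and accumulated contribution used in Figs. 4.15 and 4.16 correspond to PCA's explained variance ratio and its cumulative sum; a sketch on stand-in data (random, since the gear feature matrix is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Contribution rate = explained variance ratio; accumulated contribution
# is its cumulative sum, the 90 % criterion used for component selection.
rng = np.random.default_rng(1)
X = rng.normal(size=(180, 11)) @ rng.normal(size=(11, 11))  # stand-in for S
pca = PCA().fit(X)
contrib = pca.explained_variance_ratio_
accumulated = np.cumsum(contrib)
k = int(np.searchsorted(accumulated, 0.90)) + 1   # components to reach 90 %
print(f"first {k} PCs reach {accumulated[k - 1]:.1%}")
```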
Table 4.6 provides the detection results of the different methods, where C = 100
for all classifiers.
From Table 4.6, the detection accuracy of all methods for the incipient spalling
fault is higher than for the incipient pitting fault, which is consistent with the
above analysis. Thanks to the SVM, both groups of experiments obtained good detection
results with a small number of labeled samples. By introducing unlabeled samples for
co-training, the accuracy of the semi-supervised methods (TSVM, CKSVM, and DSTSVM)
is higher than that of the supervised SVM, which demonstrates that the effective use
of unlabeled sample information can improve the detection accuracy. Furthermore, the
DSTSVM method shows the highest accuracy in both experiments, which indicates that
the sample distribution structure is more discriminative after the density-adjustable
similarity computation, further improving the detection accuracy.

Fig. 4.16 PCA results for gear incipient spalling

Table 4.6 Detection results for 11 statistical features in the time domain

          Gear incipient pitting detection      Gear incipient spalling detection
Method    Parameter          Accuracy (%)       Optimal parameter    Accuracy (%)
SVM       σ = 0.50           84.20              σ = 0.50             97.13
TSVM      σ = 0.50           86.66              σ = 0.50             100
CKSVM     σ = 1.55           87.77              σ = 1.5              100
DSTSVM    ρ = 2, σ = 1.10    88.27              ρ = 2, σ = 1.85      100

To improve the detection accuracy of the incipient pitting fault, the frequency-
domain energy factor proposed in Sect. 4.2 was used as the model input. The energy
factor was calculated at the fifth-gear mesh frequency and its frequency multipliers
(f1 = 383 Hz, f2 = 383 × 2 Hz, f3 = 383 × 3 Hz, f4 = 383 × 4 Hz). Table 4.7 shows
the detection results based on the energy factor features; the parameters of all
methods are the same as in Table 4.6.
As can be seen from Table 4.7, the detection accuracy of the incipient pitting
fault is substantially improved when the input feature is the energy factor in the
frequency domain, and the accuracy of all methods is 100%. For incipient spalling
faults, the accuracy is similar to using time domain features. The main reason is

Table 4.7 Detection results based on energy factor features

Method    Accuracy of incipient pitting fault (%)    Accuracy of incipient spalling fault (%)
SVM       100                                        98.96
TSVM      100                                        99.72
CKSVM     100                                        99.60
DSTSVM    100                                        100

that the modulation in the incipient pitting FFT spectrum is mainly based on the
mesh frequency and frequency multiplier, while the normally engaged gear mesh
frequency appears in the incipient spalling FFT spectrum, which demonstrates the
effectiveness of the proposed feature indicator in the incipient pitting fault diagnosis.
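The energy factor itself is defined in Sect. 4.2, which is outside this excerpt; one plausible band-energy reading of it can be sketched as follows (the ±10 Hz bandwidth and the synthetic two-tone signal are assumptions of this sketch, not the book's formula):

```python
import numpy as np

def energy_factor(x, fs, freqs, bw=10.0):
    """Share of spectral energy within +/- bw Hz of each target frequency,
    an illustrative band-energy reading of the frequency-domain indicator."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    total = spec.sum()
    return [spec[(f >= fc - bw) & (f <= fc + bw)].sum() / total for fc in freqs]

fs = 40_000                              # transmission sampling rate (Hz)
mesh = 383.0                             # fifth-gear mesh frequency, 800 r/min
harmonics = [mesh * k for k in (1, 2, 3, 4)]
t = np.arange(fs) / fs                   # 1 s synthetic two-tone signal
x = np.sin(2 * np.pi * mesh * t) + 0.3 * np.sin(2 * np.pi * 2 * mesh * t)
ef = energy_factor(x, fs, harmonics)     # energy concentrates at f1 and f2
```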
(2) Incipient fault detection under different working conditions
For fault detection under different working conditions, vibration acceleration
signals (corresponding to all modes in Table 4.4) of the normal, incipient pitting,
and incipient spalling types were collected, with 3072 × 30 sampling points per
working condition. The gear original feature set S consists of the 11 statistical
features; there are 270 11-D samples for each fault type, with 30 samples per
working condition, where each sample consists of 3072 points. The data were grouped
into two subsets: one containing samples of the normal state and the incipient
pitting fault state, the other containing samples of the normal state and the
incipient spalling fault state. Each subset contained 540 11-D samples.
For gear incipient pitting fault detection, the PCA algorithm was used to select
important features from the 11-D original features. Features 1, 2, 4, 5, 6, 7, 9, 10,
and 11 were selected, so the sample set becomes 540 × 9-D. The same procedure was
used for gear incipient spalling fault detection, where features 1, 2, 3, 5, 6, 7, 9,
10, and 11 were selected. From the 540 × 9-D samples, 50 labeled samples and 450
unlabeled samples were randomly selected, and the sampling was repeated 10 times.
All samples were fed into the DSTSVM classifier for co-training, predicting the
labels of the 450 unlabeled samples after 5-CV.
Table 4.8 provides the detection results of the different methods, where C = 100
for all classifiers.
From Table 4.8, the gear incipient spalling detection accuracy of all methods
exceeds 90% even with very few labeled samples. However, the detection accuracy of
all methods is lower for gear incipient pitting faults due to the complex working
conditions. To improve the detection accuracy of incipient pitting faults, multi-
sensor data fusion is adopted for model training.
In this section, four sensors were arranged, and the x-direction signal of each
sensor was taken as the original signal. The 11 statistical features of each sensor
were extracted separately and composed into 4-dimensional feature vectors (one per
indicator). 270 × 4-D samples of the normal state and 240 × 4-D samples of the
incipient pitting fault were selected, since the data of the 4th sensor at 1000 rpm
and 75 N m torque were missing for

Table 4.8 Gear pitting and spalling detection under different working conditions

          Gear incipient pitting detection      Gear incipient spalling detection
Method    Parameter          Accuracy (%)       Optimal parameter    Accuracy (%)
SVM       σ = 0.55           72.18              σ = 0.50             91.06
TSVM      σ = 0.55           74.02              σ = 0.50             92.46
CKSVM     σ = 1.55           76.03              σ = 1.55             93.06
DSTSVM    ρ = 2, σ = 1.22    75.24              ρ = 2, σ = 1.40      94.12

the incipient pitting fault. From the 510 samples of each feature indicator, 50
labeled samples and 460 unlabeled samples were randomly selected, and the sampling
was repeated 10 times. All samples were fed into the DSTSVM classifier for
co-training, and the detection accuracy is listed in Table 4.9.
Figure 4.17 visually shows the detection accuracy of the different methods for each
feature indicator. The skewness feature gives the lowest accuracy, while better
results are obtained with the mean square value, mean, variance, and root mean
square features. Compared to the detection results using a single sensor, the DSTSVM
and CKSVM methods improve by 20% and the SVM method by 15%, which proves the
effectiveness and reliability of the multi-sensor data fusion method.
In addition, the DSTSVM and CKSVM methods are superior to the other two methods when
the mean square value, mean, variance, and root mean square features are used for
fault detection, which demonstrates the effectiveness of spectral clustering methods
with fewer feature dimensions and provides a reference for gear incipient feature
selection.
To further explore the influence of the feature indicators on the detection results,
the mean square value, mean, variance, and root mean square features were used as
input indicators for incipient pitting fault detection under different working
conditions. The samples from sensor 3 were input to the DSTSVM model, and the
detection accuracy on 490 unlabeled samples was 91.8% given 50 labeled samples. This
result is 14% higher than that of the methods using the 9 feature indicators
selected by PCA, which indicates that feature selection by PCA only eliminates
redundancy and does not necessarily select effective and discriminative feature
indicators, so feature selection needs further study.
3. Gear incipient fault detection based on density-adjustable spectral clustering
and semi-supervised SVM

Three types of fault signals were collected for gear incipient fault detection under
the working condition of 800 r/min at the drive end and 75 N m torque at the output
end. 30 labeled samples were selected to detect whether 150 samples belonged to the
incipient pitting fault or the incipient spalling fault. Similarly, 30 labeled
samples were selected to detect whether 240 samples belonged to the normal state,
the incipient pitting fault, or the incipient spalling fault. Each set of experiments
was randomly sampled 10 times and the results averaged. The penalty factor was
C = 100 for all methods, and the detection results for different feature sets are
listed in Table 4.10. The 8-D time-domain feature indicators were features 1, 2, 4,
6, 7, 9, 10, and 11 selected by PCA, and the 4-D frequency-domain energy factor
indicators were calculated from the mesh frequency and its 2nd to 4th multipliers.
From Table 4.10, when the time-domain feature indicators are used, the accuracy of
distinguishing only the incipient pitting fault from the incipient spalling fault is
higher than that of distinguishing the normal state, incipient pitting fault, and
incipient spalling fault, mainly because the distinction between the normal state and
the incipient pitting fault is insufficient. In both sets of experiments, the semi-
supervised methods outperformed the supervised method, and the best method was
DSTSVM.
Table 4.9 Multi-sensor data fusion for gear pitting detection under different working conditions

                      SVM                 TSVM                CKSVM               DSTSVM
Feature               σ      Acc (%)      σ      Acc (%)      σ      Acc (%)      (ρ, σ)     Acc (%)
Mean square value     1.0    84.86        1.0    64.78        0.50   88.84        3, 0.70    91.80
Kurtosis              0.50   62.36        0.50   61.78        0.50   61.43        2, 0.70    63.04
Mean                  1.50   80.78        1.50   81.06        0.50   89.58        2, 0.75    92.08
Variance              1.45   85.13        1.45   67.58        0.55   89.28        2, 0.75    91.30
Skewness              0.80   56.13        0.80   53.10        0.50   56.04        2, 0.75    56.73
Peak                  1.50   74.08        1.50   72.41        0.50   72.60        3, 0.75    74.36
Root mean square      1.45   90.84        1.45   65.76        0.55   89.86        3, 0.55    92.80
Shape indicator       0.55   64.02        0.55   63.58        0.50   61.60        2, 0.75    64.63
Crest indicator       0.50   64.34        0.50   67.10        0.50   63.0         2, 0.50    64.28
Impulse indicator     0.50   65.47        0.50   67.19        0.50   61.10        2, 0.50    66.20
Clearance indicator   0.50   65.06        0.50   66.76        0.50   62.90        2, 0.50    65.21

Fig. 4.17 Accuracy of different features

Table 4.10 Detection accuracy for different feature indicators

                                    Incipient pitting fault vs.      Normal vs. incipient pitting
                                    incipient spalling fault         fault vs. incipient spalling fault
Feature indicator      Method       Parameter          Accuracy (%)  Parameter          Accuracy (%)
Time domain (8-D)      SVM          σ = 0.50           99.4          σ = 0.55           86.66
                       TSVM         σ = 0.50           100           σ = 0.55           87.29
                       CKSVM        σ = 0.95           100           σ = 0.75           88.16
                       DSTSVM       ρ = 2, σ = 0.60    100           ρ = 4, σ = 0.75    88.87
Energy factor (4-D)    SVM          σ = 0.50           100           σ = 0.55           99.48
                       TSVM         σ = 0.50           100           σ = 0.55           98.54
                       CKSVM        σ = 0.80           100           σ = 0.75           98.87
                       DSTSVM       ρ = 2, σ = 0.60    100           ρ = 5, σ = 0.75    99.5

When the frequency-domain energy factor is used to detect incipient faults, the
results of both sets of experiments are substantially improved, which again proves
the effectiveness of the frequency-domain energy factor feature indicator.
To analyze the effect of sample size on the detection accuracy, the different
methods based on the 8-D features were run for incipient fault detection, with the
results shown in Fig. 4.18. The accuracy of the SVM method increases as the sample
size increases. When the number of labeled samples is 30–60, the SVM has the lowest
accuracy, while the other methods do better owing to the participation of unlabeled
samples. When the sample size exceeds 60, the SVM method has higher accuracy than
TSVM and CKSVM, while the DSTSVM method always achieves the best detection results
thanks to the introduction of the density-adjustable factor. In addition, the DSTSVM
method

Fig. 4.18 Accuracy of different sample sizes

achieves superior accuracy in most cases, and its performance tends to stabilize
once the number of labeled samples exceeds 80.

4.2.4.2 Bearing Incipient Fault Detection

(1) Data description

The data set of CWRU was collected from the test platform at Case Western Reserve
University [13], as detailed in Sect. 2.5.4. Inner race single point faults, outer race
single point faults, and rolling element point faults were introduced to the SKF 6205-
2RS deep groove ball bearings using electro-discharge machining. The diameter of
faults was 0.1778 mm and the depth was 0.2794 mm. The specific parameters of the
bearing are listed in Table 4.11.
Signals of the normal state, inner race fault, outer race fault, and rolling element
fault were collected at a speed of 1797 r/min under zero motor load, with a sampling
frequency of 12,000 Hz. The calculated fault eigenfrequencies of the rolling bearing are shown
in Table 4.12.
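The eigenfrequencies in Table 4.12 follow from the standard bearing kinematic formulas applied to the geometry in Table 4.11. A minimal Python sketch (the function name is illustrative; a zero contact angle is assumed, and the table's rolling element entry corresponds to twice the ball spin frequency, i.e., one impact on each race per ball revolution):

```python
import math

def bearing_fault_freqs(fr, n, d, D, phi=0.0):
    """Characteristic fault frequencies of a rolling bearing.
    fr: shaft rotation frequency (Hz); n: number of rolling elements;
    d: ball diameter; D: pitch diameter (same units); phi: contact angle (rad)."""
    r = d / D * math.cos(phi)
    bpfi = n / 2 * fr * (1 + r)             # inner race defect frequency
    bpfo = n / 2 * fr * (1 - r)             # outer race defect frequency
    bsf = D / (2 * d) * fr * (1 - r ** 2)   # ball spin frequency
    ftf = fr / 2 * (1 - r)                  # cage (fundamental train) frequency
    # 2 * bsf: one impact on each race per ball revolution
    return bpfi, bpfo, 2 * bsf, ftf

# 6205-2RS geometry from Table 4.11 at 1797 r/min (fr = 29.95 Hz)
print(bearing_fault_freqs(29.95, 9, 7.94, 39.11))
```

With the Table 4.11 geometry, the computed values agree with Table 4.12 to within about 0.3 Hz; the small residual discrepancy suggests the published table used a slightly different pitch diameter.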

(2) Bearing incipient fault detection based on density-adjustable spectral clustering and semi-supervised SVM

Table 4.11 Geometry parameters of 6205-2RS bearing

Inside diameter (mm)   Outside diameter (mm)   Thickness (mm)   Number of rollers   Ball diameter (mm)   Pitch diameter (mm)
25                     52                      15               9                   7.94                 39.11
4.2 Spectral Clustering Manifold Based Fault Feature Selection 223

Table 4.12 Fault eigenfrequency of rolling bearings

Rotation frequency (Hz)   Inner race (Hz)   Outer race (Hz)   Rolling element (Hz)   Cage train (Hz)
29.95                     162.1852          107.3648          141.1693               11.9285

The vibration acceleration signals of the normal state and the rolling element point fault
were collected under the working condition of 1797 r/min and zero motor load.
One sample was taken every 1024 points from 1024 × 100 sampling points, giving
100 samples per state. The 11 statistical features plus the amplitude at the rolling
element fault frequency constitute 200 12-D samples; after the shape indicator and
the amplitude feature were removed by the PCA algorithm, the sample dimension
became 10-D.
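The PCA reduction step can be sketched in a few lines of NumPy via the singular value decomposition (illustrative stand-in data, not the book's feature set; `pca_reduce` is a hypothetical helper):

```python
import numpy as np

def pca_reduce(X, d):
    """Project samples (rows of X) onto their first d principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                          # scores in the d-D subspace

# stand-in for 200 12-D feature samples
X = np.random.default_rng(0).normal(size=(200, 12))
Y = pca_reduce(X, 10)
print(Y.shape)  # (200, 10)
```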
In each trial, 15, 25, 35, 45, or 55 labeled samples were randomly selected
from the 200 10-D samples, and each trial was repeated 10 times. All samples were
fed into the DSTSVM classifier for co-training, and the labels of the unlabeled
samples were predicted after 5-fold cross-validation (5-CV). Table 4.13 provides the
incipient fault detection results; the penalty factor is C = 100.
It can be seen from Table 4.13 that the accuracy of the SVM method increases with
the sample size, while the other three methods reach, or nearly reach, 100% accuracy
even with only a small number of labeled samples.
(3) Bearing incipient fault detection based on density-adjustable spectral clustering
and semi-supervised SVM
The vibration acceleration signals of the normal state, inner race point fault, outer race
point fault, and rolling element point fault were collected. One sample was taken
every 1024 points from 1024 × 100 sampling points, giving 100 samples for each
type of signal. The 11 statistical features, 3 frequency domain features, and the
amplitude features at the outer race, inner race, and rolling element fault frequencies
constitute 400 samples. After the mean feature and the peak feature were removed
by the PCA algorithm, the dataset became 400 × 12-D.

Table 4.13 Rolling element incipient fault detection results of rolling bearing
Labeled samples number n   SVM: σ, Accuracy (%)   TSVM: σ, Accuracy (%)   CKSVM: σ, Accuracy (%)   DSTSVM: (ρ, σ), Accuracy (%)
15 0.50 88.10 0.5 99.78 1.4 100 (2, 1.25) 100
25 0.50 95.42 0.5 100 1.5 100 (2, 1.5) 100
35 0.50 96.12 0.5 100 1.5 100 (2, 1.35) 100
45 0.50 97.22 0.5 100 1.5 100 (2, 1.35) 100
55 0.50 97.65 0.5 100 1.5 100 (2, 1.35) 100

Table 4.14 Rolling element incipient fault detection results of different methods
Labeled samples number n   SVM: σ, Accuracy (%)   TSVM: σ, Accuracy (%)   CKSVM: σ, Accuracy (%)   DSTSVM: (ρ, σ), Accuracy (%)
20 0.50 94.05 0.50 94.63 1.45 100 (2, 1.75) 100
40 0.55 98.11 0.50 96.80 1.3 100 (2, 1.75) 100
60 0.50 98.64 0.50 97.90 1.4 100 (2, 1.70) 100
80 0.50 98.70 0.50 98.67 1.35 100 (2, 1.45) 100
100 0.50 98.96 0.50 98.73 1.4 100 (2, 0.5) 100

In each trial, 5, 10, 15, 20, or 25 labeled samples were randomly selected
from each type of sample, and each trial was repeated 10 times. All samples were
fed into the DSTSVM classifier for co-training, and the labels of the unlabeled
samples were predicted after 5-CV. Table 4.14 provides the incipient fault detection
results; the penalty factor is C = 100.
It can be seen from Table 4.14 that the accuracy of the SVM method increases with
the sample size and is higher than that of TSVM. The CKSVM and DSTSVM methods
achieve 100% accuracy with as few as 20 labeled samples, and the accuracy remains
stable, which demonstrates the effectiveness of the spectral clustering-based
classifiers.

4.3 LLE Based Fault Recognition

4.3.1 Local Linear Embedding

Local linear embedding (LLE), a nonlinear dimensionality reduction algorithm that
preserves the original topology through locally linear reconstruction based on the
Euclidean distance, was proposed by Roweis and Saul and published in Science. The
core idea of the LLE method is to reconstruct a weight vector between each sample
and its neighborhood samples in the low-dimensional space and to keep the weights
in each neighborhood consistent with those of the original space, i.e., to minimize
the reconstruction error under a locally linear embedding mapping [14]. The weights
reconstructed by the LLE algorithm capture the intrinsic geometric properties of the
local space, such as invariance to translation, rotation, and scaling.
The nearest neighborhood of each sample is first determined in LLE, and then the
reconstructed weights are obtained by solving the constrained least squares problem.
In this process, the LLE transforms the constrained least squares problem to a possibly
singular linear system of equations and guarantees the non-singularity of the coeffi-
cient matrix of the linear system of equations by introducing a small regularization
factor γ . Furthermore, a sparse matrix can be constructed using these reconstructed

weights. As a result, LLE can obtain the global low-dimensional embedding by


calculating the smallest eigenvectors of the sparse matrix.
The main advantages of the LLE algorithm are: (1) Only the number of nearest
neighbors and the embedding dimension need to be determined, which is simple for
parameter selection; (2) The process of finding the optimal low-dimensional feature
mapping can be transformed into the problem of solving the eigenvalues for a sparse
matrix, which overcomes the local optimum problem; (3) The low-dimensional
feature space retains the local geometric properties of the high-dimensional space;
(4) The low-dimensional feature space has a globally orthogonal coordinate
system; (5) The LLE algorithm can learn low-dimensional manifolds of arbitrary
dimension and has an analytic, globally optimal solution obtained without iteration; (6) The
LLE algorithm reduces to the eigenvalue computation of a sparse matrix, so
its computational complexity is relatively small and it is easy to implement. The LLE algo-
rithm preserves the intrinsic connection between samples using the weight coeffi-
cients of samples in the neighborhood, which can be effectively applied in nonlinear
dimensionality reduction [15].
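The three steps described above (neighbor search, constrained least-squares weights with the regularization factor γ, and the bottom eigenvectors of the sparse matrix) can be sketched in NumPy as follows; this is an illustrative, unoptimized implementation, not the authors' code:

```python
import numpy as np

def lle(X, k=10, d=2, reg=1e-3):
    """Basic locally linear embedding (after Roweis & Saul).
    X: (N, D) samples as rows; k: neighbors; d: target dimension."""
    N = X.shape[0]
    # 1) k nearest neighbors by Euclidean distance (index 0 is the point itself)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]
    # 2) reconstruction weights: constrained least squares, with a small
    #    regularization factor keeping the local Gram matrix non-singular
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]              # centered neighbors
        G = Z @ Z.T
        G = G + reg * np.trace(G) * np.eye(k)   # regularize
        w = np.linalg.solve(G, np.ones(k))
        W[i, neighbors[i]] = w / w.sum()        # weights sum to 1
    # 3) global embedding: bottom eigenvectors of M = (I - W)^T (I - W)
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                     # skip the constant eigenvector

# toy data: noisy points near a 1-D circle embedded in 3-D
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 2 * np.pi, 200))
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * rng.standard_normal(200)])
Y = lle(X, k=8, d=2)
print(Y.shape)  # (200, 2)
```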

4.3.2 Classification Based on LLE

In pattern recognition, the accuracy of status classification depends largely on feature


selection. The running of a mechanical system is a complex stochastic process, which
is difficult to describe by a deterministic time function. The purpose of feature anal-
ysis is to extract useful features from the original signal, which can reflect the running
status of mechanical equipment. Although many features are available, they differ
in how well they reflect the regularity, sensitivity, clustering, and separability of the
states, and the state information they carry is partly redundant. Therefore, it is
necessary to select effective features with good regularity and sensitivity as the initial
vector and, after eliminating redundant information, to construct a lower-dimensional
feature set for status classification [16].
The core idea of the LLE algorithm [17] is to determine the interrelationships
between samples in a local coordinate system for learning the intrinsic manifold
structure of the samples. As an unsupervised learning algorithm, LLE is an effective
method for compressing high-dimensional data. In binary classification, the positive
and negative samples lie on their own specific manifolds; the difference between a
test sample's reconstruction distances to the two manifolds can therefore be used to
determine its class.
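The binary rule above can be sketched as follows: a test sample is reconstructed from its k nearest neighbors within each class, and assigned to the class with the smaller reconstruction error (an illustrative sketch; the regularization value and k are assumptions, not values from the book):

```python
import numpy as np

def recon_error(x, class_samples, k=5, reg=1e-3):
    """Error of reconstructing x from its k nearest neighbors within one
    class, using LLE-style sum-to-one weights (regularized least squares)."""
    dist = np.linalg.norm(class_samples - x, axis=1)
    nbrs = class_samples[np.argsort(dist)[:k]]
    Z = nbrs - x
    G = Z @ Z.T + reg * np.trace(Z @ Z.T) * np.eye(k)
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()
    return np.linalg.norm(x - w @ nbrs)

def classify(x, pos, neg, k=5):
    """Assign x to the manifold that reconstructs it with the smaller error
    (1 = positive class, 0 = negative class)."""
    return int(recon_error(x, pos, k) < recon_error(x, neg, k))

# toy check with two well-separated clusters
rng = np.random.default_rng(1)
pos = rng.normal(0.0, 0.1, size=(30, 2))
neg = rng.normal(5.0, 0.1, size=(30, 2))
print(classify(np.array([0.0, 0.0]), pos, neg))  # 1
```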

4.3.3 Dimension Reduction Performance Comparison Between LLE and Other Manifold Methods

To verify the superiority of the LLE algorithm, the comparative experiments with
the Isomap method and Laplacian Eigenmap (LE) method were conducted on the
classical Twin Peaks dataset, where N = 800, D = 3.
Figure 4.19 provides the dimensionality reduction results of three manifold
learning methods. It can be seen that the global Isomap method maintains the data
topology better, but it takes much longer, almost 50–90 times the time of the local
algorithms. The LE method takes the shortest time, but the topology of the data is severely
damaged. The LLE method can effectively cluster the data after dimensionality reduction
and takes less time; although most of the internal topology is destroyed, it is
more effective in dimensionality reduction than the other two methods. The running
times of the three methods for different k are listed in Table 4.15.
When k is within a certain range, it has little influence on the running time.
Overall, the local methods have a clear advantage when considering factors
such as the representation ability of the selected features, the computation time, and
the simplicity of implementation.

Fig. 4.19 Dimensionality reduction results of three manifold learning methods

Table 4.15 Running time of the three methods on different k


Method k=6 k=8 k = 10 k = 12 k = 14
LLE 2.2561 0.29539 0.3938 0.51363 0.76083
LE 0.51615 0.19607 0.21568 0.23435 0.26801
Isomap 17.6196 17.3278 17.3004 17.2798 17.3269

4.3.4 LLE Based Fault Diagnosis

When applying local linear embedding methods to fault diagnosis, two problems
need to be addressed: extracting representative fault features, and diagnosing the fault
type of new data based on a known fault knowledge library [18]. LLE is a classical
local manifold learning method. Although many experiments have proven that LLE
is an effective visualization method, the algorithm still has some shortcomings when
applied to the field of pattern recognition [19]: (1) How to improve the generalization
of the model on unknown samples; (2) The lack of label information. Therefore,
combining supervised linear discriminant analysis and LLE algorithms is a good
solution for fault diagnosis.

4.3.4.1 Diagnosis Algorithm

Local linear discriminant embedding (LLDE), a supervised LLE method, was
proposed by Li et al. of the Chinese Academy of Sciences and successfully
applied to face recognition [20]. In this section, an improved local linear discriminant
embedding classification (LLDEC) method is proposed, based on the principle
of LLDE and the introduction of an evaluation criterion for classification. The steps
of the LLDEC method are as follows.
(1) LLDEC algorithm

Step 1: Determine the neighborhood of X_i by using the KNN or ε-ball method.
Step 2: Minimize the linear reconstruction error between X_i and its k nearest
neighbors by calculating the reconstruction weights of X_i in
ε_i(W) = arg min || X_i − Σ_{j=1}^{k} W_ij X_j ||².
Step 3: Obtain the weight matrix W = [W_ij]_{N×N} by repeating Step 2 for every X_i.
Step 4: Construct the matrix M = (I − W)^T (I − W).
Step 5: Construct the matrix X M X^T.
Step 6: Calculate the inter-class scatter matrix S_b, the intra-class scatter matrix S_w,
and the weighted difference S_b − μS_w.
Step 7: Obtain the d-dimensional embedding Y = V^T X by solving the generalized
eigenvalue problem of (X M X^T − (S_b − μS_w), X X^T) for the d smallest
eigenvalues and the corresponding eigenvector matrix V.
Step 8: Identify the class of a test sample by comparing ε_Y(W)^+, the reconstruction
error between the sample and the positive manifold, with ε_Y(W)^−, that between the
sample and the negative manifold, and select a suitable classifier for the embedding
results.
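Step 6 uses the standard LDA scatter matrices; a minimal sketch (with stand-in inputs `X` and `y`, not the book's data) is:

```python
import numpy as np

def scatter_matrices(X, y):
    """Inter-class scatter S_b and intra-class scatter S_w for LDA.
    X: (N, D) samples as rows; y: (N,) integer class labels."""
    mu = X.mean(axis=0)                         # global mean
    D = X.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                    # class mean
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
        Sw += (Xc - mc).T @ (Xc - mc)
    return Sb, Sw
```

A convenient sanity check is the classical decomposition S_b + S_w = S_t, the total scatter of the centered data; the weighted difference S_b − μS_w in Step 6 then trades class separation against class compactness.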

(2) The purpose of LLDEC

In terms of visualization, the goal of dimensionality reduction is to map the high-


dimensional space of samples to a two-dimensional or three-dimensional space while
preserving the intrinsic structure of samples as much as possible. However, the goal

of classification is to map the samples to a feature space where each class sample
can be clearly separated. LLE is an effective visualization method for mapping high-
dimensional data to two-dimensional space, but it has inferior classification ability.
The goal of LLDEC is to improve the classification ability of the original LLE by
making full use of the class information. It is known that the reconstruction weights
are kept constant regardless of translation, rotation, or scale change, which is defined
as:
ϕ(Y) = Σ_i || Y_i − Σ_j W_ij Y_j ||² = Σ_i || (Y_i − T_i) − Σ_j W_ij (Y_j − T_i) ||²   (4.12)

where T_i is the transformation vector of class i. To improve the effectiveness
of the LLDEC algorithm, it is combined with the K-nearest neighbor algorithm to
perform classification.
(3) The algorithm based on the LLDEC and KNN classifier
(3) The algorithm based on the LLDEC and KNN classifier
The K-nearest neighbor (KNN) algorithm identifies the class of a new sample from
the classes of its k nearest neighbors.
The selection of the number of nearest neighbors k depends on the number of samples
and the degree of dispersion in each class, and different k values are selected for different
applications. If there are few samples around the test sample s_i, the range
covered by the k neighbors will be large, and vice versa. The nearest neighbor
algorithm is therefore vulnerable to noisy data, especially isolated points in the sample
space; the underlying reason is that, in the basic KNN method, all k nearest neighbors
have the same effect on the test sample. Generally speaking, different
neighbors should have different effects on the test sample: the closer a neighbor
is, the greater its effect should be.
LLDEC is a supervised algorithm with label information. The effectiveness of
the method can be verified by combining the dimensionality reduction and the KNN
algorithms.
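The observation that closer neighbors should carry more weight leads naturally to a distance-weighted KNN; a minimal sketch (inverse-distance weighting is one common choice, not necessarily the book's):

```python
import numpy as np

def weighted_knn(x, X_train, y_train, k=5, eps=1e-12):
    """Classify x by an inverse-distance-weighted vote of its k nearest
    neighbors, so that closer neighbors contribute more to the decision."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    votes = {}
    for i in idx:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)

# two close class-0 points outweigh three more distant class-1 points
X_train = np.array([[0.0], [0.1], [2.0], [2.1], [2.2]])
y_train = [0, 0, 1, 1, 1]
print(weighted_knn(np.array([0.5]), X_train, y_train, k=5))  # 0
```

Note that a plain majority vote over all five neighbors would return class 1 here; the weighting reverses the decision in favor of the two much closer class-0 samples.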

4.3.4.2 Case Study

To verify the effectiveness of the proposed LLDEC method, we ran a set of
comparative experiments on the gear dataset collected by the Laborelec lab.
Each sample was composed of skewness, crest indicator, peak, root mean square,
variance, and shape indicator. Three types of signals (normal, pitting fault, and spalling
fault) were collected, so the dataset contained 162 6-D samples, 54 per type.
Figure 4.20 provides the classification results of the two methods; it shows that the
classification ability of the proposed LLDEC is greatly improved compared to the
original method. In each set of experiments, 34

samples were randomly selected to train a KNN model and to predict the fault classes
of 20 test samples. Figure 4.21 provides the classification results of both LLE and
LLDEC with KNN.
Table 4.16 shows the classification results of the integrated LLE and LDA models
on the lab dataset. With k = 8, the classification accuracy of LLE is only 40%,
while the accuracy of the proposed LLDEC method rises to 96.7%. Different from
the Iris dataset, the dimensionality of the lab dataset increases from 4-D to 6-D, and
the accuracy of LLDEC increases accordingly, which illustrates the effectiveness
of the proposed LLDEC method in dimensionality reduction for high-dimensional
nonlinear data.

Fig. 4.20 Classification result of LLE method and LLDEC method on lab dataset

Fig. 4.21 Classification result of LLE + KNN method and LLDEC + KNN method on lab dataset

Table 4.16 Accuracy of four algorithms on lab dataset


Algorithm Training samples Test samples Parameter Accuracy (%)
LLE 102 60 8 40
ULLELDA 102 60 (8, 2) 82
LLDA 102 60 2 81.76
LLDEC 102 60 12 96.7

4.3.4.3 Fault Diagnosis for Automotive Transmission

1. Transmission gearing fault diagnosis

The gear fault data was collected from the Dongfeng SG135-2 automobile transmis-
sion produced by a gear factory. The forward fifth gear was used, and the sensor
was mounted on the output shaft bearing. The transmission was working in 3 modes:
normal, incipient pitting fault, and incipient spalling fault. Please refer to Sect. 3.3.4
for the transmission and the fault gears.
To ensure the fairness of the experiment, it is necessary to ensure that the transmis-
sions operate under the same working conditions. Table 4.17 lists the transmission
operating conditions and signal sampling parameters.

(1) LLE-based gear fault diagnosis

A 15-D dataset of 270 samples consisted of 6 features in the time domain, 4 in the
frequency domain, and 5 in the time–frequency domain; the original LLE algorithm
was adopted for dimensionality reduction and fault diagnosis.
It can be seen from Fig. 4.22 that, compared to the above case study, the selected
features are effective, and the fault modes can be distinguished even with the basic
LLE method. Combining it with KNN, the final diagnosis results are shown in
Fig. 4.23, in which 60 training samples and 30 test samples were randomly selected
for each fault mode.
There is no visually significant differentiation among the fault modes; therefore,
Table 4.18 provides the accuracy in terms of quantitative metrics.

Table 4.17 Working condition parameters of transmission

Parameter                            Value        Parameter                                Value
Input speed                          1000 rpm     Output speed                             1300 rpm
Input torque                         69.736 N m   Output torque                            50.703 N m
Rotation frequency of input shaft    16.67 Hz     Acceleration sampling frequency          40,000 Hz
Rotation frequency of middle shaft   11.40 Hz     Mesh frequency of fourth-shifting gear   433 Hz
Rotation frequency of output shaft   21.67 Hz     Mesh frequency of fifth-shifting gear    478 Hz

Fig. 4.22 Two-dimensional visualization of raw LLE gear data

Fig. 4.23 Visualization of classification result by using LLE + KNN

Even though the highest accuracy was achieved in distinguishing spalling faults
from pitting faults, it was still only 73%.
(2) LLDEC-based gear fault diagnosis
Similarly, a 15-D dataset of 270 samples consisted of 6 features in time domain,
4 features in the frequency domain, and 5 features in the time–frequency domain,

Table 4.18 Classification of gear data using LLE + KNN

Algorithm: LLE
Fault modes                 Normal versus pitting fault   Normal versus spalling fault   Pitting versus spalling fault   Three faults
Training and test samples   (120, 60)                     (120, 60)                      (120, 60)                       (180, 90)
Time (s)                    0.6145                        0.614                          0.6147                          0.6237
Accuracy (%)                56                            60                             73                              62.2

and the LLDEC algorithm was adopted for dimensionality reduction and fault diag-
nosis. Figure 4.24 shows the visualization of gear data dimension reduction by using
LLDEC.
By combining it with KNN, the final diagnosis results are shown in Fig. 4.25, in which
60 training samples and 30 test samples were randomly selected for each fault mode.

Fig. 4.24 Visualization of gear data dimension reduction by using LLDEC

Fig. 4.25 Visualization of classification result by using LLDEC + KNN



From Fig. 4.25, three fault modes can be completely distinguished by using the
LLDEC method, which proves the effectiveness of the method. Table 4.19 provides
accuracy in terms of quantitative metrics. Compared with the original LLE algorithm,
the accuracy of the LLDEC method has been greatly improved.

2. Transmission bearing fault diagnosis

The dataset was collected from the bearing fault platform of CWRU. The experiment
simulated three fault modes: inner race single point faults, outer race single point
faults, and rolling element point faults, as detailed in Sect. 2.5.4.
(1) LLE-based bearing fault diagnosis
A 15-D dataset of 270 samples consisted of 6 features in time domain, 4 features in the
frequency domain, and 5 features in the time–frequency domain, and the original LLE
algorithm was adopted for dimensionality reduction and fault diagnosis. Figures 4.26
and 4.27 provide the fault diagnosis results.
From Figs. 4.26 and 4.27, it is difficult to observe a significant difference among
the fault modes; Table 4.19 provides accuracy in terms of quantitative metrics, and
the normal and rolling element faults are also very difficult to be separated.

Table 4.19 Classification of bearing data using LLE + KNN

Algorithm: LLE
Fault modes                 Normal versus inner race fault   Normal versus outer race fault   Normal versus rolling element fault   Four faults
Training and test samples   (120, 60)                        (120, 60)                        (120, 60)                             (240, 120)
Time (s)                    2.8783                           2.8792                           2.8791                                3.7517
Accuracy (%)                80                               100                              41.67                                 85

Fig. 4.26 Two-dimensional visualization of result using LLE



Fig. 4.27 Visualization of classification result by using LLE + KNN

(2) LLDEC-based bearing fault diagnosis


A 15-D dataset of 270 samples consisted of 6 features in the time domain, 4 in the
frequency domain, and 5 in the time–frequency domain, and the LLDEC algorithm
was adopted for dimensionality reduction and fault diagnosis. Figures 4.28
and 4.29 provide the fault diagnosis results.
Compared with the original LLE algorithm, the LLDEC method can clearly
distinguish the different fault modes. Table 4.20 provides the accuracy in terms of
quantitative metrics; the accuracy of the LLDEC method is greatly improved.
Nevertheless, as Fig. 4.29 and Table 4.20 show, the accuracy of the LLDEC method
is still not good enough, because it is difficult to distinguish the rolling element fault
from the normal mode of the bearing. Although the improved LLDEC method is
better than the basic LLE method, its effectiveness still needs further improvement.

Fig. 4.28 Visualization of dimension reduction result using LLDEC



Fig. 4.29 Visualization of classification result using LLDEC + KNN

Table 4.20 Classification of bearing data using LLDEC + KNN

Algorithm: LLDEC
Fault modes                 Normal versus inner race fault   Normal versus outer race fault   Normal versus rolling element fault   Four faults
Training and test samples   (120, 60)                        (120, 60)                        (120, 60)                             (240, 120)
Time (s)                    0.3930                           0.393                            0.3930                                0.4185
Accuracy (%)                99                               97                               72.67                                 91.3

4.3.5 VKLLE Based Bearing Health State Recognition

The LLE algorithm uses the Euclidean distance to measure the similarity of samples,
which does not fully represent their intrinsic structure. It is also very sensitive to the
number of nearest neighbors: even similar neighbor numbers can produce obviously
different dimensionality reduction results. The choice of the number of nearest
neighbors is therefore important; too large a number destroys the local properties,
while too small a number does not guarantee the global properties.

4.3.5.1 Variable K-Nearest Neighbor Locally Linear Embedding

Although research progress has been made on the nearest neighbor parameter setting
of the LLE algorithm, further improvement is still needed, because the number of
nearest neighbors is fixed for all samples, which leads to low computational efficiency
and high complexity. Owing to the uneven distribution and different locations of
samples, each sample has its own most suitable number of nearest neighbors. Following
this principle, a variable K-nearest neighbor locally linear embedding (VKLLE)

method is proposed to determine, for each sample, the optimal number of nearest
neighbors based on the residual, a property that evaluates how well the embedded
data retains distance information. The smaller the residual, the more information
about the original data structure is retained after dimensionality reduction;
conversely, the larger the residual, the more information is lost and the worse the
dimensionality reduction. Experimental results show that applying the VKLLE
method to the analysis of bearing state signals can effectively improve the algorithm's
stability while maintaining the dimensionality reduction effect.
1. VKLLE algorithm

(1) Calculate the nearest neighbors of each sample: Given a maximum number of
nearest neighbors k_max (since the essence of the LLE algorithm is to maintain
local linearity, k_max cannot be too large, otherwise the local linearity is destroyed
and the dimensionality reduction effect worsens), calculate the Euclidean distance
between each sample and the remaining samples, and construct a nearest neighbor
graph by finding its k (k < k_max) nearest samples.
(2) Obtain weights: For each k, the reconstruction error can be described as
ε(W) = Σ_i || x_i − Σ_{j∈J_i} w_ij x_j ||²   (4.13)

By minimizing ε(W), the weights w_i = {w_i1, w_i2, …, w_ik} can be obtained, with
Σ_{j=1}^{k} w_ij = 1.
(3) Obtain the low-dimensional embedding: the reconstruction error can be described as

ε(y_i) = min_{y_i} || y_i − Σ_{j=1}^{k} w_ij y_j ||²   (4.14)

By minimizing ε(y_i), ε(Y_k) can be expressed as:

ε(Y_k) = min tr(Y_k w_i^T w_i Y_k^T)   (4.15)

where Y_k = {y_i − y_i1, y_i − y_i2, …, y_i − y_ik}, y_ij (j = 1, 2, 3, …, k) denotes the
jth nearest neighbor, tr(Y_k Y_k^T) = c, and tr(·) is the trace of the matrix.
According to the Lagrangian function:
L = Y_k w_i^T w_i Y_k^T − λ (Y_k Y_k^T − c)   (4.16)

the partial derivatives can be calculated as:



∂L/∂Y_k = w_i^T w_i Y_k^T − λ Y_k^T = 0   (4.17)

Y_k is the eigenvector corresponding to the non-zero minimum eigenvalue of w_i^T w_i,
and |Y_k| is the distance to the first k nearest points of y_i. Therefore, Y_k can be
determined by the weights w_i of the high-dimensional sample x_i.
Calculate the residual 1 − ρ²_{X_k Y_k} for each k, where X_k is the distance matrix
between x_i in the high-dimensional space and its first k nearest neighbors, and ρ
denotes the linear correlation coefficient. The smaller the residual, the more
information about the original data structure is retained after dimensionality
reduction; conversely, a large residual cannot adequately preserve the information
of the original samples.
(4) Determine the optimal number of nearest neighbors k_i^opt for each sample and
the optimal weight matrix:

k_i^opt = arg min_k (1 − ρ²_{X_k Y_k})   (4.18)
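The selection rule of Eq. (4.18) can be sketched directly: for each candidate k, correlate the high- and low-dimensional neighbor distances and keep the k with the smallest residual 1 − ρ² (toy distance vectors below, not the MNIST experiment):

```python
import numpy as np

def residual(xk, yk):
    """Residual 1 - rho^2 between the distances to the k nearest neighbors
    measured in the original space (xk) and in the embedding (yk)."""
    rho = np.corrcoef(xk, yk)[0, 1]
    return 1.0 - rho ** 2

def optimal_k(dist_pairs):
    """dist_pairs: {k: (xk, yk)}; return the k with the smallest residual,
    as in Eq. (4.18)."""
    return min(dist_pairs, key=lambda k: residual(*dist_pairs[k]))

# toy check: k = 4 preserves the distance ordering perfectly, k = 3 does not
pairs = {
    3: (np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 3.0])),
    4: (np.array([1.0, 2.0, 3.0, 4.0]), np.array([2.0, 4.0, 6.0, 8.0])),
}
print(optimal_k(pairs))  # 4
```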

2. Assessment metrics

The classification performance decrease rate index R [21] and the separability
evaluation index J_b from Sect. 3.2.2 are used as assessment metrics for the
classification results after dimensionality reduction. A smaller R means better
classification in the low-dimensional space and more effectively retained label
information, and vice versa. A larger J_b means shorter intra-class distances and
longer inter-class distances, which represents high separability of the sample
clusters and good dimensionality reduction.
The classification performance decrease rate index is defined as

R = (N_x − N_y) / N_x   (4.19)

where N_x and N_y are the numbers of samples classified correctly in the high-dimensional
and low-dimensional spaces, respectively. The K-nearest neighbor classifier is used
to classify the samples, i.e., the class of a given sample is taken to be the class that
occurs most often among the labels of its nearest neighbor samples.
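Equation (4.19) together with the KNN check can be sketched as follows (a leave-one-out variant is used here for simplicity; the helper names are illustrative):

```python
import numpy as np

def knn_correct(X, y, k=3):
    """Count samples whose class matches the majority label of their
    k nearest neighbors (leave-one-out, Euclidean distance)."""
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the sample itself
        idx = np.argsort(d)[:k]
        labels, counts = np.unique(y[idx], return_counts=True)
        correct += int(labels[np.argmax(counts)] == y[i])
    return correct

def decrease_rate(n_high, n_low):
    """R = (N_x - N_y) / N_x, Eq. (4.19)."""
    return (n_high - n_low) / n_high

# two well-separated clusters: every sample is classified correctly
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_correct(X, y, k=3), decrease_rate(100, 90))  # 40 0.1
```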
3. Case study
The Mixed National Institute of Standards and Technology (MNIST) handwritten
digits dataset was used to verify the effectiveness of the VKLLE method.
Figure 4.30 shows examples from the MNIST dataset; 500 samples (digits 0–4)
with 784 dimensions were selected from the MNIST training set as simulation
data.
Table 4.21 provides the classification performance decrease rate R for the MNIST
dataset at different numbers of nearest neighbors (K = 1, 2, 5, 9) and low-dimensional

Fig. 4.30 MNIST dataset

space (2-D and 5-D). Table 4.22 provides the separability of the samples after
dimensionality reduction, where d is the dimension of the low-dimensional space.
The columns in Tables 4.21 and 4.22 give R and J_b of the two algorithms for
different numbers of nearest neighbors, k = 10, 15, 20, 25, 30 or k_max = 10, 15, 20,
25, 30. A smaller R indicates that more information is retained in the low-dimensional
space; a larger J_b indicates higher separability of the samples after
dimensionality reduction.
When the dimension of the low-dimensional space is larger, the R-value is smaller,
which indicates that more label information is retained in that space, making it more
suitable for classification. For the same dimension, the R of the LLE method differs
considerably for different k, with a maximum difference of 20.17% (between k = 10
and k = 30 when d = 2 and K = 9). In contrast, the VKLLE method obtains similar
R for different k_max, with a maximum difference of only 3.54% (between k_max = 10
and k_max = 30 when d = 2 and K = 5), which verifies the stability of the proposed
VKLLE method.
Compared with the LLE method, the VKLLE method achieves a smaller R-value.
Only when k/k_max = 15, d = 5, and K = 2 or K = 5 is the R-value of the VKLLE
method higher than that of the LLE algorithm, and the difference between the two
values is small: 0.41% at K = 2 and 0.21% at K

Table 4.21 R of LLE and VKLLE methods on MNIST dataset


d   K   R (%) of LLE method with k =             R (%) of VKLLE method with k_max =
        10      15      20      25      30       10      15      20      25      30
2 1 0 0 0 0 0 0 0 0 0 0
2 3.73 5.18 8.28 16.36 19.25 2.28 1.66 2.48 2.48 2.28
5 6.46 8.33 12.50 21.25 24.79 3.13 3.75 3.75 5.21 4.58
9 5.94 8.07 9.77 22.51 26.11 1.70 2.12 2.76 4.67 4.03
5 1 0 0 0 0 0 0 0 0 0 0
2 1.66 1.04 1.86 2.69 6.00 1.04 1.45 1.04 2.07 1.86
5 1.04 2.08 4.58 5.00 6.04 0 2.29 3.33 2.71 3.54
9 0 0.64 2.55 4.67 5.94 − 1.27 − 0.21 1.49 1.49 1.27

Table 4.22 J b of LLE and VKLLE methods on MNIST dataset


d   J_b of LLE method with k =                   J_b of VKLLE method with k_max =
    10      15      20      25      30           10      15      20      25      30
2 0.550 0.595 0.551 0.463 0.512 0.645 0.547 0.567 0.573 0.573
5 0.580 0.572 0.531 0.473 0.494 0.618 0.587 0.581 0.580 0.569

= 5. For all other data, the VKLLE method obtained a lower R, and the maximum
difference is − 22.08% (k/k max = 30, d = 2, K = 9).
It can be seen from Table 4.22 that the J_b of the VKLLE method is above 0.54 in
all cases and, except when d = 2 and k/k_max = 15, larger than that of the LLE
method (the maximum difference reaches 0.11), which demonstrates that the samples
reduced by the VKLLE method have better clustering performance.
The LLE algorithm is a local method, and too large a number of nearest
neighbors fails to preserve the local structure of the samples. Nevertheless, Tables
4.21 and 4.22 show that the VKLLE algorithm still attains a small R even when a
large k_max is used, which illustrates that the VKLLE method achieves better
dimensionality reduction than the LLE method.
4. Complexity analysis
Assume that the number of samples is N, the dimension of the original space is D, the number of nearest neighbors of each sample in the LLE algorithm is k, the average number of nearest neighbors of each sample in the VKLLE algorithm is k m, the maximum number of nearest neighbors is k max, and the dimension of the low-dimensional space after dimensionality reduction is d. The complexities of the VKLLE algorithm and the LLE algorithm are calculated as follows.
(1) The LLE algorithm complexity
240 4 Manifold Learning Based Intelligent Fault Diagnosis and Prognosis

To obtain the optimal number of nearest neighbors and better classification results, a popular procedure is to repeat the LLE algorithm to derive the optimal k; the complexity is calculated as follows.

Step 1: Calculate the nearest neighbors: O(k max × D × N^2).
Step 2: Calculate the weights: O(k max × D × N × k^3).
Step 3: Calculate the low-dimensional embedding: O(k max × d × N^2).

(2) The VKLLE algorithm complexity


Step 1: Calculate the nearest neighbors: O(k max × D × N^2).
Step 2: Calculate the weights and the latest nearest neighbor: O(N t × D × N × k m^3), where N t = k max + k m.
Step 3: Calculate the low-dimensional embedding: O(d × N^2).

To sum up, the algorithm complexity depends on the dimension D of the original space and the sample number N. When D > N, the complexity is dominated by Steps 1 and 2, and the computational time of the VKLLE algorithm is longer than that of the LLE algorithm; when D < N, the complexity is dominated by Steps 1 and 3, and the computational time of the VKLLE algorithm is much lower than that of the LLE algorithm.
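These per-step costs can be tallied directly. The following sketch is my own illustration of the expressions above (the function names and the tallying itself are not from the book); it reproduces the k max-fold gap in the embedding step that favors VKLLE when D < N:

```python
def lle_steps(N, D, d, k, k_max):
    """Dominant operation counts when the LLE run is repeated up to k_max."""
    return {
        "neighbors": k_max * D * N**2,      # Step 1: nearest-neighbor search
        "weights":   k_max * D * N * k**3,  # Step 2: weight computation
        "embedding": k_max * d * N**2,      # Step 3: repeated for each k
    }

def vklle_steps(N, D, d, k_m, k_max):
    """Dominant operation counts for a single VKLLE run (N_t = k_max + k_m)."""
    N_t = k_max + k_m
    return {
        "neighbors": k_max * D * N**2,       # Step 1 is shared with LLE
        "weights":   N_t * D * N * k_m**3,   # Step 2: weights + latest neighbor
        "embedding": d * N**2,               # Step 3 runs only once
    }

# In the D < N regime (e.g. D = 784, N = 3000), LLE's embedding step is
# k_max times costlier, which is where VKLLE saves time.
lle = lle_steps(3000, 784, 2, 15, 30)
vklle = vklle_steps(3000, 784, 2, 15, 30)
ratio = lle["embedding"] / vklle["embedding"]   # equals k_max = 30
```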
The MNIST dataset (the total number of samples in the training set is 59,370, and
the dimension is 784) is used to calculate the complexity of two algorithms, where
d = 2, D = 784, and k max = 30. It is obvious from Table 4.23 that the computational time of the VKLLE algorithm is significantly reduced compared to the LLE algorithm from around N = 1000 onward, and the difference between the two algorithms becomes more obvious as the number of samples increases.

5. VKLLE-based bearing health state recognition

The rotating machinery fault simulation experimental platform introduced in Sect. 3.4.4 is used to generate three fault states: normal, inner race fault with a 1 mm slice, and outer race fault with a 1 mm slice. Figure 4.31 shows the bearing inner race fault and outer race fault with a 1 mm slice. By installing a PCB acceleration sensor on the bearing housing, the vibration signals under different statuses are obtained using the BBM data

Table 4.23 The complexity of LLE and VKLLE algorithms

Sample (N)    LLE (s)        VKLLE (s)
500           27.909703      75.689061
1000          159.886005     154.756701
1500          422.532133     238.290806
2000          829.198174     292.623492
2500          1454.128425    390.467419
3000          2363.035180    481.147113

Note Processor: Intel(R) Core(TM) i5 CPU M450 @ 2.4 GHz, Memory: 2 GB

acquisition front-end device. The specific bearing parameters and characteristic frequencies are listed in Table 3.12. All vibration signals are collected at a speed of 1100 r/min, a sampling frequency of 12,000 Hz, and a duration of 1.5 s. 40 sets of vibration signals are collected for each fault status, and the 20-D features listed in Table 4.24 are extracted to constitute the original fault feature set. In total, 120 samples of 20 dimensions are obtained, which are reduced to 3 dimensions after dimensionality reduction.
Figure 4.32 provides the time-domain waveform and demodulation spectrum of the bearing under each status; it can be seen that the waveform discrepancy in the time domain is obvious under different statuses. In addition to the different vibration

Fig. 4.31 Simulated fault of rolling bearing

Table 4.24 Original feature indicators

Feature indicator in the time domain    Feature indicator in the frequency domain
Mean (p1)                               Amplitude at the rotational frequency (p11)
Square root amplitude (p2)              Amplitude at the inner race passing frequency (p12)
Standard deviation (p3)                 Amplitude at the outer race passing frequency (p13)
Peak (p4)                               Amplitude at the rolling element passing frequency (p14)
Skewness (p5)                           Amplitude at the cage train passing frequency (p15)
Kurtosis (p6)                           Amplitude at 2 times the rotational frequency (p16)
Shape indicator (p7)                    Amplitude at 2 times the inner race passing frequency (p17)
Crest indicator (p8)                    Amplitude at 2 times the rolling element passing frequency (p18)
Impulse indicator (p9)                  Amplitude at 2 times the outer race passing frequency (p19)
Clearance indicator (p10)               Amplitude at 2 times the cage train passing frequency (p20)

amplitudes, the bearing exhibits different shock signatures under different statuses, such as those of the inner race fault with a 1 mm slice and the outer race fault with a 1 mm slice. From the corresponding demodulation spectra, the frequencies demodulated from the inner race fault with a 1 mm slice and the outer race fault with a 1 mm slice contain not only the same rotational frequency component (18.25 Hz) but also the respective passing frequencies (inner race: 127.5 and 382.9 Hz; outer race: 98.51 and 198.8 Hz), while the modulation of the bearing in the normal state is not obvious. Although the fault categories can be distinguished in the time and frequency domains, feature extraction is still necessary to diagnose fault types accurately and intelligently.
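As an illustration of this feature-extraction step, the following sketch computes a few of the Table 4.24 indicators for a synthetic signal. The function names, the 18.25 Hz test tone, and the noise level are my own illustrative choices, not from the book:

```python
import numpy as np

def time_features(x):
    """A few time-domain indicators in the spirit of p1, p3, p6, p7 of Table 4.24."""
    mean = np.mean(x)
    std = np.std(x)
    rms = np.sqrt(np.mean(x**2))
    kurtosis = np.mean((x - mean)**4) / np.var(x)**2
    shape = rms / np.mean(np.abs(x))        # shape indicator
    return np.array([mean, std, kurtosis, shape])

def amplitude_at(x, fs, f_target):
    """Spectral amplitude at the bin closest to a characteristic frequency
    (the p11-p20 style indicators)."""
    spec = np.abs(np.fft.rfft(x)) / len(x)      # single-sided, scaled by N
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return spec[np.argmin(np.abs(freqs - f_target))]

# Illustrative signal: 18.25 Hz rotation component plus noise, fs = 12 kHz.
rng = np.random.default_rng(0)
fs = 12000
t = np.arange(0.0, 1.5, 1.0 / fs)
x = np.sin(2 * np.pi * 18.25 * t) + 0.1 * rng.standard_normal(t.size)
features = np.concatenate([time_features(x), [amplitude_at(x, fs, 18.25)]])
```

In practice one such feature vector is computed per recorded signal set, then stacked into the original fault feature matrix.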
Figure 4.33 shows the distribution of the samples in different feature indicator spaces, where "+" denotes the normal state, "*" denotes the outer race fault with a 1 mm slice, and "◯" denotes the inner race fault with a 1 mm slice. It should be noted that all indicators are normalized (mean of 0 and variance of 1). As Fig. 4.33 shows, different feature indicators vary in their fault-diagnosis capability; for example, the standard deviation indicator is superior to the kurtosis, which in turn is superior to the mean.
The classification performance decrease rates of the LLE and VKLLE methods (d = 3; K = 1, 2, 5, 9; k/k max = 6, 8, 10, 12, 14, 16) are listed in Table 4.25. It can be seen

Fig. 4.32 Waveform and demodulation spectrum of bearing under each status

that the classification performance decrease rate of the VKLLE method equals that of the LLE method except for K = 9, k/k max = 6, where the decrease rate of the VKLLE method (R = 0) is lower than that of the LLE method (R = 1.667).
Figure 4.34 provides the cluster results of the LLE and VKLLE methods for different k values, and the corresponding J b values are listed in Table 4.25. It can be seen that at each k/k max, the separability evaluation index J b obtained by dimensionality reduction using the VKLLE method is larger than that of the LLE method, with a maximum difference of 0.184 (k/k max = 8). Furthermore, J b (VKLLE) is always above 0.8, with a maximum value of 0.952 (k/k max = 6), which indicates that after dimensionality reduction using the VKLLE method the inter-class distance between different classes is larger and the intra-class distance among similar samples is smaller. Compared with the LLE method, it can be seen from Fig. 4.34 that the VKLLE method obtains obvious clustering results, i.e., samples of the same class are clustered together, while

Fig. 4.33 Sample distribution under different feature indicator spaces (“+” denotes the normal
state, “*” denotes the outer race fault with 1 mm slice, and “◯” denotes the inner race fault with
1 mm slice)

samples of different classes are separated, with good stability. The LLE method is sensitive to the value of k, and its classification results vary greatly for different k values: although the difference between k = 6 and k = 8 is small, the clustering effect for k = 6 is clearly better than that for k = 8. The above experiments show that the proposed VKLLE method achieves better clustering results than the LLE method.

Table 4.25 J b values of LLE and VKLLE methods

k/k max    LLE (J b)    VKLLE (J b)
6          0.883        0.952
8          0.716        0.900
10         0.801        0.912
12         0.729        0.868
14         0.765        0.856
16         0.730        0.825

Although the clustering of different bearing fault statuses is improved by the VKLLE method, J b is lower than 0.9 when k max > 12, because the vibration signals in the test dataset are inevitably disturbed by noise, speed fluctuations, etc. Figure 4.35 provides the cluster results of the VKLLE method after noise reduction in feature space, and the corresponding J b values are listed in Table 4.26.
From Fig. 4.35 and Table 4.26, the cluster results of the inner race fault with 1 mm slice ("◯") are significantly improved after noise reduction in feature space. The J b values under different k max are improved correspondingly, with a maximum difference close to 0.1 (k max = 16), which demonstrates the effectiveness of the noise reduction. Moreover, the fluctuation of J b within the range of k max values is reduced from 0.127 to 0.033, which illustrates that the corresponding stability is also improved. It is also noted that the J b values are all greater than 0.9 and higher than before noise reduction, which proves that the combination of noise reduction in feature space and the VKLLE method can further improve the clustering results.

4.3.5.2 NPE-Based Tendency Analysis of Bearing

The LLE method is a nonlinear algorithm that can effectively process nonlinear data matrices, with the advantages of fast computation and few parameters. However, LLE is a non-incremental learning algorithm and cannot effectively adapt to new samples: the low-dimensional representation of new samples must be calculated by retraining the LLE algorithm, which requires large storage space and is not suitable for real-time monitoring. To address this problem, the neighborhood preserving embedding (NPE) algorithm, which generalizes to new samples, was proposed by Cai et al. [22] based on the theory of the LLE method. The idea of NPE is to assume that there exists a mapping transformation matrix from the high-dimensional space to the low-dimensional space, and that this matrix can be obtained from the training samples. As a result, the low-dimensional representation of test data can be obtained by multiplying by this matrix, which is a linear approximation of the LLE method.
Self-organizing map (SOM) neural network [23], an unsupervised clustering algorithm proposed by Kohonen from Finland, has been widely used in condition monitoring of mechanical components. The principle is that, when a certain condition type is input, a neuron in the output layer receives the maximum stimulation and becomes the "winning" neuron, and its neighboring neurons also receive larger stimulation due to the lateral effect. In this case, the SOM network is trained, and the connection weight vectors of the "winning" neuron and its neighbors are modified toward the direction of the condition type. When the condition type
Fig. 4.34 Cluster results of the LLE method and VKLLE method on different k values ("+" denotes the normal state, "*" denotes the outer race fault with 1 mm slice, and "◯" denotes the inner race fault with 1 mm slice). Panels: (a) k = 6, J b = 0.883; (b) k max = 6, J b = 0.952; (c) k = 8, J b = 0.716; (d) k max = 8, J b = 0.900; (e) k = 10, J b = 0.801; (f) k max = 10, J b = 0.912; (g) k = 12, J b = 0.729; (h) k max = 12, J b = 0.868; (i) k = 14, J b = 0.765; (j) k max = 14, J b = 0.856; (k) k = 16, J b = 0.730; (l) k max = 16, J b = 0.825

is changed, the original winning neuron in the two-dimensional space is transferred to other neurons, so the connection weights of the SOM network can be adjusted using a large number of training samples via self-organization. As a result, the feature map of the SOM output layer reflects the distribution of the samples. Performance assessment of the bearing is carried out based on the NPE and SOM methods: the NPE method maps the samples from the high-dimensional space to the low-dimensional space, and a SOM model is trained using samples under the normal condition. Test data are then fed into the trained SOM model, and the condition can be determined by calculating the
Fig. 4.35 Cluster results of VKLLE methods after noise reduction in feature space on different k values ("+" denotes the normal state, "*" denotes the outer race fault with 1 mm slice, and "◯" denotes the inner race fault with 1 mm slice). Panels: (a) k max = 6, J b = 0.956; (b) k max = 8, J b = 0.953; (c) k max = 10, J b = 0.955; (d) k max = 12, J b = 0.949; (e) k max = 14, J b = 0.952; (f) k max = 16, J b = 0.923

deviation value between the test sample and the normal samples using the minimum
quantization error (MQE). Finally, the MQE is normalized to obtain the confidence
value.
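The MQE and confidence computation can be sketched as follows. The small stand-in codebook and the exponential normalization of the MQE into a confidence value are assumptions of mine (the book only states that the MQE is normalized):

```python
import numpy as np

def mqe(sample, codebook):
    """Minimum quantization error: distance from the sample to the closest
    SOM weight vector (the best-matching unit)."""
    return np.min(np.linalg.norm(codebook - sample, axis=1))

def confidence(sample, codebook, scale=1.0):
    """Map the MQE to (0, 1]; this exponential normalization is an assumption,
    chosen so that a normal-like sample (small MQE) gives a value near 1."""
    return np.exp(-mqe(sample, codebook) / scale)

# Stand-in codebook: 4x4 map of weight vectors, as if trained on 3-D NPE
# features of normal-condition samples clustered near the origin.
rng = np.random.default_rng(0)
codebook = rng.normal(0.0, 0.1, size=(16, 3))
normal_sample = np.zeros(3)
faulty_sample = np.ones(3) * 2.0   # far from the normal region
```

A sample far from every learned prototype yields a large MQE and hence a low confidence value, which is how the degradation trend is tracked.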
The CWRU dataset was used to validate the proposed method based on NPE and
SOM. The specific vibration signals are collected by a vibration acceleration sensor
mounted in the upper housing of the induction motor output shaft, and the sampling
frequency is 48 kHz, including:

Table 4.26 J b values of VKLLE methods before and after noise reduction in feature space

k max    Before noise reduction (J b)    After noise reduction (J b)
6        0.952                           0.956
8        0.900                           0.953
10       0.912                           0.955
12       0.868                           0.949
14       0.856                           0.952
16       0.825                           0.923

(1) Normal state;
(2) Rolling element fault (fault widths: 0.007 in., 0.014 in., 0.021 in.; depth: 0.011 in.; represented by B007, B014, B021, respectively).

The vibration signals of the four fault types are collected at three different loads and speeds: 1 hp-1772 r/min, 2 hp-1750 r/min, and 3 hp-1730 r/min. 20 sets of time-domain signals are collected at each load and speed according to 4–25, and the number of samples for each fault type is 60. Therefore, a 20 × 240 data matrix is constructed in the high-dimensional space.
The intrinsic dimension of the samples in the high-dimensional space is determined by the residual error curve proposed in the literature [24]; the dimension at the "inflection point" of the residual curve is considered to be the intrinsic dimension of the samples. Figure 4.36 provides the residual error curve over the sample dimensions. It can be seen that the "inflection point" of the residual curve appears at 3-D, so the intrinsic dimension of the samples is considered to be 3-D. 50% of the samples are selected to calculate the mapping transformation matrix using the NPE method, and the remaining samples are mapped to the 3-D space by the constructed transformation matrix. Then, the SOM method is used to calculate the corresponding MQE value, which is used to determine the fault types.
Figure 4.37a provides the sample curves in the low-dimensional space obtained from the original space without noise reduction using the NPE method, where NPE1, NPE2, and NPE3 denote the first, second, and third dimensional features of the samples in the low-dimensional space. From Fig. 4.37a, the feature curves of the samples fluctuate obviously and cannot effectively identify fault types. In the NPE1 feature space, the samples of the normal state and the B007 fault type are similar, so it is difficult to distinguish them, which shows that the sample curve is not effective in this case. The sample curves after noise reduction are shown in Fig. 4.37b. It can be seen that the fluctuations of the sample curves within the same fault type become significantly smaller after noise reduction, which illustrates that the samples are similar within the same fault type and differ more between different fault types.
The MQE curves calculated using different methods are shown in Fig. 4.38. Method 1: calculate the MQE value by inputting samples from the original space into the SOM model, as shown in Fig. 4.38a. Method 2: 50% of the samples from the original space are selected to calculate the mapping transformation matrix by using
Fig. 4.36 Residual error curve of sample at different dimensions

Fig. 4.37 Sample curves in low-dimensional space before and after noise reduction in the original
feature space

the NPE method, the rest of the samples are mapped to the low-dimensional subspace, and the corresponding MQE values are calculated by the SOM method, as shown in Fig. 4.38b. Method 3: the samples are noise-reduced in the original feature space using the NPE method and then input to the SOM model to calculate the MQE values, as shown in Fig. 4.38c.
From Fig. 4.38, different calculation methods give different results. Method 1, shown in Fig. 4.38a, cannot effectively diagnose the B014 and B021 faults; the MQE value of B021 under 1 hp lies between the values of B007 and B014, which is not consistent with the changing trend of the fault, and the MQE values of the same fault

Fig. 4.38 MQE curves of three calculation methods



have obvious fluctuations. In Method 2, shown in Fig. 4.38b, the discrepancy in MQE values between the normal state and B007 is obvious. Under 2 hp and 3 hp loads, most samples of B021 have higher MQE values than the other fault types, indicating that the fault is deteriorating further. However, some samples still overlap with the B014 fault, and the MQE value of B021 under 1 hp again lies between the values of B007 and B014, so it cannot be distinguished effectively. In Method 3, shown in Fig. 4.38c, the samples fluctuate very little within the same fault type, the discrepancy in MQE values between the normal state and B007 further increases, and the distinction between different states is more obvious. Under 2 hp and 3 hp loads, most samples of B021 have higher MQE values than the other fault types, indicating that the fault degree is the deepest. Moreover, the MQE value of B021 under 1 hp no longer lies between the values of B007 and B014 but is similar to B014 and even partially higher, which indicates that the fault has started to deteriorate and is consistent with the actual degradation trend of bearings.

4.4 Fault Classification Based on Distance Preserving Projection

The idea of distance preserving projection is to first calculate the Euclidean distance between every pair of data points in the original database to obtain a distance matrix, from which a minimum spanning tree is constructed. The minimum spanning tree is then projected from small to large, from left to right, while precisely retaining the distance from each data point to its nearest neighbor and some of its nearest neighbors in the low-dimensional space, so as to achieve dimensionality reduction.

4.4.1 Locality Preserving Projections

(1) Calculate near neighbors: calculate the Euclidean distance between each x_i and the rest of the data to find the k points nearest to it, then build a near-neighbor graph. The distance equation is

d(x_i, x_j) = ||x_i − x_j||    (4.20)
(2) Select the weight values:

W_ij = e^{−||x_i − x_j||^2 / t}, if i and j are near neighbors (t is the heat-kernel parameter); W_ij = 0, otherwise    (4.21)

(3) Calculate the eigenvectors:

X L X^T a = λ X D X^T a    (4.22)

where D is a diagonal matrix with D_ii = Σ_j W_ji, and L = D − W.
(4) Calculate the eigenvalues and eigenvectors of Eq. (4.22). The vectors a_0, a_1, …, a_d are the eigenvectors corresponding to the eigenvalues λ_0 < λ_1 < … < λ_d. The low-dimensional sample is therefore y_i = A^T x_i, where A = (a_0, a_1, …, a_d) is the transformation matrix.

By calculating the transformation matrix from the high-dimensional space to the low-dimensional space on the training samples, the corresponding low-dimensional space for test samples can be obtained directly, which effectively improves the calculation speed, solves the problem that LE fails to deal with test samples effectively, and improves the generalization of the algorithm.
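Steps (1)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration under my own simplifications (graph symmetrization and a small ridge term for numerical stability are practical additions, not part of the derivation above):

```python
import numpy as np

def lpp(X, k=5, t=1.0, d=2):
    """Locality Preserving Projections following steps (1)-(4) above.
    X is an (m, N) matrix with one sample per column."""
    N = X.shape[1]
    # Step (1): pairwise Euclidean distances and the k-nearest-neighbor graph.
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(dist[i])[1:k + 1]
        # Step (2): heat-kernel weights, Eq. (4.21).
        W[i, nbrs] = np.exp(-dist[i, nbrs] ** 2 / t)
    W = np.maximum(W, W.T)                  # symmetrize the graph
    D = np.diag(W.sum(axis=1))              # D_ii = sum_j W_ji
    L = D - W
    # Steps (3)-(4): generalized eigenproblem X L X^T a = lambda X D X^T a,
    # solved via (X D X^T)^{-1} (X L X^T); keep the smallest d eigenpairs.
    XDX = X @ D @ X.T + 1e-9 * np.eye(X.shape[0])   # ridge for stability
    XLX = X @ L @ X.T
    vals, vecs = np.linalg.eig(np.linalg.solve(XDX, XLX))
    A = vecs[:, np.argsort(vals.real)[:d]].real     # transformation matrix
    return A, A.T @ X                               # Y = A^T X

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 40))
A, Y = lpp(X_train)
```

The generalization the text describes is then just a matrix product: a new test sample x maps to y = A.T @ x with no retraining.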
Using manifold learning to recognize mechanical status is a useful means of diagnosis. At present, in the field of mechanical fault diagnosis, the sample space processed by manifold learning is usually the feature space obtained after feature extraction from time-domain signals. The number of samples in the feature space is far smaller than the amount of data in the time-domain signal. If noise reduction performance is ensured, direct noise reduction of time-domain signals can be replaced by noise reduction in the feature space, which effectively reduces computational complexity, accelerates calculation, and reduces storage space.
The transformation of time-domain signal noise into the feature sample space is analyzed below.
(1) Transfer condition of noise when time-domain features are extracted from time-domain vibration signals.

Assume X(i), (i = 1, 2, …, N) is the measured time-domain vibration signal, expressed as:

X(i) = Y(i) + ΔY(i)    (4.23)

where Y(i) is the ideal time-domain vibration signal without noise and ΔY(i) is the corresponding noise.
A time-domain transformation is performed on X(i) to obtain time-domain feature indicators such as the peak-to-peak value, average, variance, mean square amplitude, kurtosis, impulse indicator, etc. These feature indicators can all be expressed by the following equation, on which the study of the noise transformation process is based.
M = {a Σ [X(i)]^n1}^m1 / {b Σ [X(i)]^n2}^m2    (4.24)

Substituting Eq. (4.23) into Eq. (4.24) gives

M = {a Σ [Y(i) + ΔY(i)]^n1}^m1 / {b Σ [Y(i) + ΔY(i)]^n2}^m2    (4.25)

where a, b, n1, n2, m1, m2 are coefficients, and different coefficient combinations represent different time-domain features; for example, for the mean square amplitude: a = 1/N, b = 1/N, n1 = 2, m1 = 1/2, m2 = 0. Equation (4.25) can be written as

M = [{a Σ [Y(i)]^n1}^m1 + {a Σ [ΔY(i)]^n1}^m1 + c1] / [{b Σ [Y(i)]^n2}^m2 + {b Σ [ΔY(i)]^n2}^m2 + c2]
  = {a Σ [Y(i)]^n1}^m1 / {b Σ [Y(i)]^n2}^m2 + c    (4.26)

where c1, c2, and c are remainder terms containing different levels of noise. Equation (4.26) shows that time-domain features extracted from measured time-domain vibration signals are the combination of an ideal feature part and a noise part.
For example, the transformation of noise from the time domain into the kurtosis is:

x_q = (1/N) Σ_{i=1}^{N} [X(i)]^4 / (x_a)^2    (4.27)

where x_a = (1/N) Σ_{i=1}^{N} [X(i)]^2 is the mean square value. Substituting Eq. (4.23) into Eq. (4.27) gives

x_q = (1/N) Σ_{i=1}^{N} [Y(i) + ΔY(i)]^4 / {(1/N) Σ_{i=1}^{N} [Y(i) + ΔY(i)]^2}^2
    = (1/N) Σ_{i=1}^{N} {[Y(i)]^2 + [ΔY(i)]^2 + 2Y(i)ΔY(i)}^2 / {(1/N) Σ_{i=1}^{N} [Y(i)]^2 + (1/N) Σ_{i=1}^{N} [ΔY(i)]^2 + (2/N) Σ_{i=1}^{N} Y(i)ΔY(i)}^2
    = (1/N) Σ_{i=1}^{N} [Y(i)]^4 / {(1/N) Σ_{i=1}^{N} [Y(i)]^2}^2 + Δ
    = x_q' + Δ    (4.28)

where x_q' is the kurtosis under the ideal no-noise condition and Δ is the noise term.


As can be seen from the above reasoning, when measured time-domain vibration
signal that contains additive noise is time-domain transformed to obtain time-domain
feature indicators, the influence of additive noise will be transformed into the feature
indicators.
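This can be checked numerically for the kurtosis case of Eq. (4.27): for a clean sinusoid the indicator takes its ideal value, and additive noise shifts it by a small Δ. The test tone and noise level below are arbitrary illustrative choices of mine:

```python
import numpy as np

def kurtosis(x):
    """Eq. (4.27): x_q = (1/N) * sum X(i)^4 / (mean-square value)^2."""
    xa = np.mean(x ** 2)                 # mean square value
    return np.mean(x ** 4) / xa ** 2

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 4096, endpoint=False)
Y = np.sin(2 * np.pi * 50 * t)           # ideal signal Y(i); kurtosis = 1.5
dY = 0.05 * rng.standard_normal(t.size)  # additive noise, Delta Y(i)
delta = kurtosis(Y + dY) - kurtosis(Y)   # the Delta term of Eq. (4.28)
```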
(2) Transformation condition of noise when the time-domain vibration signal undergoes frequency-domain analysis.
Frequency-domain features are extracted by performing the Fourier transform on measured time-domain vibration signals to convert them into frequency-domain signals, and then analyzing and processing the frequency-domain signals to extract the corresponding features. Therefore, studying the
transformation condition of noise when signals are transformed from the time-domain space to the frequency-domain space shows how noise is superimposed on frequency-domain feature indicators.
The transformation between the time domain and the frequency domain is:

x(k) = Σ_{i=1}^{N} X(i) e^{−j(2π/N)ki},  k = 1, 2, …, N    (4.29)

Substituting Eq. (4.23) into Eq. (4.29) gives

x(k) = Σ_{i=1}^{N} [Y(i) + ΔY(i)] e^{−j(2π/N)ki}
     = Σ_{i=1}^{N} Y(i) e^{−j(2π/N)ki} + Σ_{i=1}^{N} ΔY(i) e^{−j(2π/N)ki},  k = 1, 2, …, N    (4.30)

Let y(k) = Σ_{i=1}^{N} Y(i) e^{−j(2π/N)ki} be the Fourier transform of the time-domain signal under the ideal, noise-free condition and Δy(k) = Σ_{i=1}^{N} ΔY(i) e^{−j(2π/N)ki} be the Fourier transform of the noise part. Equation (4.30) can then be simplified to:

x(k) = y(k) + Δy(k),  k = 1, 2, …, N    (4.31)

As the above derivation shows, additive noise is carried from the time-domain space into the frequency-domain space by the Fourier transform. Therefore, feature indicators extracted in the frequency-domain space are inevitably affected by noise.
As the transformation process shows, additive noise remains additive throughout feature extraction in both the time domain and the frequency domain. Therefore, the effect of denoising the extracted features is equivalent to that of directly denoising the original time-domain signals.
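The linearity behind Eq. (4.31) is easy to verify numerically; the 5-cycle test tone and noise level in this sketch are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = np.sin(2 * np.pi * 5 * np.arange(256) / 256)   # ideal signal y-part
dY = 0.1 * rng.standard_normal(256)                # additive noise part
# Eq. (4.31): x(k) = y(k) + Delta y(k) -- the DFT is linear, so the
# spectrum of the noisy signal is the clean spectrum plus the noise spectrum.
lhs = np.fft.fft(Y + dY)
rhs = np.fft.fft(Y) + np.fft.fft(dY)
assert np.allclose(lhs, rhs)
```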
Using manifold learning such as LPP to identify mechanical status is an effective means of diagnosis, but the presence of noise seriously affects the accuracy of identification. As noted above, if noise reduction performance is ensured, denoising the feature space can replace directly denoising the time-domain signals, with much lower computational cost and storage requirements.

4.4.2 NFDPP

Figure 4.39 is a simplified illustration of how LPP and NFDPP choose sample structure information to build the weight graph. As the figure shows, NFDPP is an improved version of LPP that considers the structure information of both the nearest neighbors and the farthest neighbors of each sample, avoiding the shortcoming of LPP that it considers only near-neighbor structure information and ignores non-neighbor structure information.
The essence of NFDPP is to maintain the farthest-structure properties of the sample while retaining local structure characteristics, so that the sample space preserves more of the original spatial structure information after dimensionality reduction.
The specific calculation process is as follows:
Given the m-dimensional sample matrix X = [x1, x2, x3, …, xN] ⊆ R^m, where N is the number of samples, find a transformation matrix A that maps the N samples to the low-dimensional subspace Y = [y1, y2, y3, …, yN] ⊆ R^d (d << m), where the kth column vector of Y corresponds to the kth column vector of X.
(1) Calculate the nearest and farthest neighbors: calculate the Euclidean distance from each point x_i to the rest of the points, find its k n nearest samples and k l farthest samples, and construct the nearest-neighbor graph and the farthest-neighbor graph. The distance equation is

d(x_i, x_j) = ||x_i − x_j||    (4.32)

(2) Select the weight values for the nearest-neighbor graph:

W_ij = 1, if x_i and x_j are near neighbors; W_ij = 0, otherwise    (4.33)

(3) Set the weight values for the farthest-neighbor graph.

Fig. 4.39 Concept of the NFDPP algorithm and comparison with LPP approach

S_ij = −1, if x_i and x_j are the farthest from each other; S_ij = 0, otherwise    (4.34)

(4) Whereas LPP considers only the weight W_ij of each sample's neighborhood graph and ignores the farthest-sample information S_ij, here both W_ij and S_ij are considered:

X L X^T A = λ X D X^T A    (4.35)

where D is a diagonal matrix with D_ii = Σ_j W_ji − Σ_j S_ji, and L = D − W + S.
(5) Calculate the eigenvalues and eigenvectors of Eq. (4.35). The vectors a_1, …, a_d are the eigenvectors corresponding to the eigenvalues λ_1 < λ_2 < … < λ_d. Then y_i = A^T x_i is the low-dimensional sample, where A = (a_1, a_2, …, a_d) is the mapping transformation matrix.

Proof
(1) For the nearest-neighbor graph, the local information of the samples should be retained after dimensionality reduction, so

ε(Y) = min[(1/2) Σ_ij (y_i − y_j)^2 W_ij] = min[Y(D_1 − W)Y^T]    (4.36)

Let L_1 = D_1 − W; then

ε(Y) = min(Y L_1 Y^T)    (4.37)

(2) For the farthest-neighbor graph, the farthest-distance positions of the samples should be maintained, so

φ(Y) = max[(1/2) Σ_ij (y_i − y_j)^2 S_ij] = max[Y(D_2 − S)Y^T]    (4.38)

Let L_2 = D_2 − S; then

φ(Y) = max(Y L_2 Y^T)    (4.39)

Combining Eqs. (4.37) and (4.39) gives

Ω(Y) = min_{Y D Y^T = 1} (Y L Y^T)    (4.40)

where D = D_1 − D_2 and L = L_1 − L_2 = D − W + S.
Assume there is a mapping matrix A such that Y = A^T X. Applying the Lagrange multiplier method gives

X L X^T A = λ X D X^T A    (4.41)

Take the eigenvectors A = (a_1, …, a_d) corresponding to the smallest d eigenvalues as the mapping transformation matrix. Then

X → Y = A^T X    (4.42)
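Steps (1)-(5) can be sketched directly in NumPy. This is a minimal illustration under my own simplifications (graph symmetrization and a small ridge term are practical additions for numerical stability, not part of the derivation above):

```python
import numpy as np

def nfdpp(X, k_n=5, k_l=3, d=2):
    """NFDPP following steps (1)-(5) above; X is (m, N), samples as columns."""
    N = X.shape[1]
    # Step (1): pairwise distances, nearest- and farthest-neighbor graphs.
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((N, N))
    S = np.zeros((N, N))
    for i in range(N):
        order = np.argsort(dist[i])
        W[i, order[1:k_n + 1]] = 1.0     # Eq. (4.33): k_n nearest neighbors
        S[i, order[-k_l:]] = -1.0        # Eq. (4.34): k_l farthest samples
    W = np.maximum(W, W.T)
    S = np.minimum(S, S.T)
    # Step (4): D_ii = sum_j W_ji - sum_j S_ji, and L = D - W + S.
    D = np.diag(W.sum(axis=1) - S.sum(axis=1))
    L = D - W + S
    # Step (5): generalized eigenproblem, keep the smallest d eigenpairs.
    XDX = X @ D @ X.T + 1e-9 * np.eye(X.shape[0])   # ridge for stability
    vals, vecs = np.linalg.eig(np.linalg.solve(XDX, X @ L @ X.T))
    A = vecs[:, np.argsort(vals.real)[:d]].real
    return A, A.T @ X                               # Eq. (4.42)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 30))
A, Y = nfdpp(X)
```

Setting k_l = 0 (so S is all zeros) recovers a binary-weight LPP, which makes the relationship between the two methods in Fig. 4.39 concrete.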

4.4.3 Experiment Analysis for Engine Misfire

The engine cylinder head vibration signal contains plenty of information that effectively reflects changes in rotational speed, cylinder pressure, and piston shock. Using cylinder head vibration signals for engine fault diagnosis and status monitoring has the advantages of easy signal acquisition and wide applicability. A misfire directly changes the engine's working cycle, which shifts the frequency of the engine's excitation forces to another period. For example, a single-cylinder misfire destroys the balance of engine operation and, together with the excitation forces generated by the remaining cylinders, forms an additional cycle, so that the spectrum of the vibration signal measured from the cylinder head changes correspondingly. Similarly, when a misfire occurs in two adjacent cylinders or in two cylinders apart, the spectrum of the vibration signal also changes. Therefore, studying the changes in the vibration signal, extracting corresponding features, and combining them with manifold learning helps classify and identify engine misfire statuses.
A four-stroke inline four-cylinder gasoline engine from a Jetta was taken as the research object. According to the types of engine misfire fault, four engine statuses were simulated: first-cylinder misfire, first-and-second-cylinder misfire, first-and-fourth-cylinder misfire, and normal. For each status, a PCB vibration acceleration sensor, an MKII acquisition front-end, and PAK analysis software were used with a sampling frequency of 12,800 Hz to measure the acceleration vibration signal on the engine cylinder surface at rotational speeds of 800, 1200, and 2000 rpm. Figure 4.40 shows the time-domain plot of each status. As can be seen from the figure, for the same status the vibration amplitude of the time-domain signals increases as the rotational speed increases. At the same rotational speed, the impact amplitude of the misfire faults is significantly larger than that in the normal status, but it is difficult to distinguish the different misfire statuses from the time-domain vibration. As can be seen from the corresponding spectrogram in Fig. 4.41, in each status the amplitude reaches its peak at twice the rotating frequency of the engine (2f, where f is the rotating frequency). For the misfire faults, components at 0.5f, f, and 1.5f appear, whereas in the normal status the amplitudes at these frequencies are not obvious. For example, components at 0.5f, f, and 1.5f appear
4.4 Fault Classification Based on Distance Preserving Projection 259

when the first cylinder misfires and when the first and second cylinders misfire, and a component at f appears when the first and fourth cylinders misfire. This is because when a cylinder misfires, the balance of engine operation is destroyed, and the gas forces generated by the remaining cylinders produce excitation with a different variation period.
To extract the status information of the engine and achieve an intelligent diagnosis, 15 sets of time-domain data are collected at each rotational speed, and the 25 feature indicators listed in Table 4.27 are extracted for each set of data, giving 45 samples for each engine status and a 25 × 180 data matrix in total, where 180 is the number of samples and 25 is the feature dimension. Since the extracted feature space contains both dimensional and dimensionless indicators, the extracted features are standardized to eliminate the influence of dimension, that is, to zero mean and unit variance.
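The standardization step can be sketched as follows; the matrix below is synthetic stand-in data with the same 25 × 180 shape, not the measured engine features.

```python
import numpy as np

def standardize_features(F):
    """Z-score each feature (row) of a (features x samples) matrix
    so every indicator has zero mean and unit variance."""
    mu = F.mean(axis=1, keepdims=True)
    sigma = F.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0                 # guard against constant features
    return (F - mu) / sigma

rng = np.random.default_rng(1)
F = 3.0 + 5.0 * rng.standard_normal((25, 180))   # synthetic feature matrix
Z = standardize_features(F)
```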
25% and 50% of the total number of samples are randomly selected for training, and the rest are used for testing. The classification rate and NMI (Normalized Mutual Information) results are shown in Fig. 4.42; all curve values are averages over 20 random trials.
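This evaluation protocol (random split, 1NN classification rate, NMI, averaged over repeated trials) can be sketched as below. The NMI formula (arithmetic-mean normalization) and the synthetic four-class data are illustrative assumptions; the book does not state which NMI normalization it uses.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def nmi(a, b):
    """Normalized mutual information of two label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    ca, cb = np.unique(a), np.unique(b)
    P = np.array([[np.mean((a == i) & (b == j)) for j in cb] for i in ca])
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / np.outer(pa, pb)[nz]))
    return mi / (0.5 * (entropy(pa) + entropy(pb)))

def evaluate(Y, labels, train_frac=0.25, repeats=20, seed=0):
    """Average 1NN classification rate and NMI over random splits."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    accs, nmis = [], []
    for _ in range(repeats):
        idx = rng.permutation(len(labels))
        k = int(train_frac * len(labels))
        tr, te = idx[:k], idx[k:]
        # 1NN: the nearest training sample decides the predicted class
        d = np.linalg.norm(Y[te][:, None, :] - Y[tr][None, :, :], axis=2)
        pred = labels[tr][d.argmin(axis=1)]
        accs.append(np.mean(pred == labels[te]))
        nmis.append(nmi(labels[te], pred))
    return float(np.mean(accs)), float(np.mean(nmis))

# synthetic stand-in: 180 samples, 4 well-separated classes in 3-D
rng = np.random.default_rng(2)
labels = np.repeat([0, 1, 2, 3], 45)
Y = rng.standard_normal((180, 3)) + 4.0 * labels[:, None]
acc, nmi_val = evaluate(Y, labels)
```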
As can be seen from Fig. 4.42, the classification rate of the four methods increases
as dimensionality increases, and the classification rate and NMI of NFDPP are greater

Fig. 4.40 Time domain signals at four statuses of the engine


260 4 Manifold Learning Based Intelligent Fault Diagnosis and Prognosis

Fig. 4.41 Frequency domain signals at four statuses of the engine

Table 4.27 Original feature indicators of the engine

Time-domain feature indicators:
Average p1; Root square amplitude p2; Standard deviation p3; Peak p4; Skewness p5; Kurtosis p6; Shape indicator p7; Crest factor p8; Impulse factor p9; Margin factor p10

Frequency-domain feature indicators:
Amplitude of engine order 2 p11; Amplitude of engine order 0.5 p12; Amplitude of engine order 1 p13; Amplitude of engine order 1.5 p14; Amplitude ratio of order 0.5 to order 2 p15; Amplitude ratio of order 1 to order 2 p16; Amplitude ratio of order 1.5 to order 2 p17

Time–frequency-domain feature indicators:
Wavelet packet energies p18–p25
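The ten time-domain indicators p1–p10 can be computed as sketched below; the formulas follow common condition-monitoring definitions and may differ in detail from those the book uses.

```python
import numpy as np

def time_domain_features(x):
    """Time-domain indicators p1-p10 of Table 4.27 (common definitions)."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    rms = np.sqrt(np.mean(x ** 2))
    p1 = x.mean()                              # average
    p2 = np.sqrt(absx).mean() ** 2             # root square amplitude
    p3 = x.std(ddof=1)                         # standard deviation
    p4 = absx.max()                            # peak
    p5 = np.mean((x - p1) ** 3) / p3 ** 3      # skewness
    p6 = np.mean((x - p1) ** 4) / p3 ** 4      # kurtosis
    p7 = rms / absx.mean()                     # shape indicator
    p8 = p4 / rms                              # crest factor
    p9 = p4 / absx.mean()                      # impulse factor
    p10 = p4 / p2                              # margin factor
    return np.array([p1, p2, p3, p4, p5, p6, p7, p8, p9, p10])

# one second of a pure 50 Hz tone sampled at the book's 12,800 Hz
t = np.arange(12800) / 12800.0
feats = time_domain_features(np.sin(2 * np.pi * 50 * t))
```

For a pure sine the crest factor is √2, which gives a quick sanity check on the implementation.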

than the other three methods.

Fig. 4.42 Classification rate and NMI of different methods at different dimensionalities

Comparing the classification rate curves, it can be seen that the classification rate of NFDPP is greater than 90% for both percentages of training samples and all dimensionalities, and when the dimensionality is larger than 3, its classification rate is close to 100%. When the dimensionality is two and the training samples are 25% of the total samples, the classification rate of NFDPP is more than 10% higher than that of the second-best method, PCA; when the dimensionality is two and the training samples are 50% of the total samples, the classification rate of NFDPP is more than 4% higher than that of the second-best method, LPP. This indicates that NFDPP dimensionality reduction yields a better classification rate for the samples.
Comparing NMI curves, it can be seen that the NMI value of NFDPP is less
than 0.6 (in this case: 0.42 and 0.54) when the training samples are 50% of the total
samples and dimensionality is between 2 and 3, and the NMI value in other cases is
greater than 0.6. The NMI value intervals of the remaining methods are PCA [0.15,
0.18], LPP [0.25, 0.28], NPE [0.41, 0.46] when the training samples are 25% of the
total samples, and PCA [0.13, 0.17], LPP [0.26, 0.30], NPE [0.36, 0.39] when the
training samples are 50% of the total samples. It can be seen that the NMI values of PCA, LPP, and NPE are significantly smaller than that of NFDPP, indicating that the sample space obtained by NFDPP is more conducive to sample clustering. The NMI value of PCA is the smallest among the four methods because PCA is a linear dimensionality reduction method that cannot effectively deal with nonlinear samples, and the vibration of the engine is a high-dimensional nonlinear dynamics problem for which PCA is poorly suited.

4.4.4 Local and Global Spectral Regression Method

LPP, NPE, and similar methods assume that a mapping transformation matrix exists between the high-dimensional and low-dimensional spaces. The transformation matrix is computed from the training samples, and a new data sample is mapped into the low-dimensional space by multiplying it with the transformation matrix, which enables incremental learning. In practice, however, such a transformation matrix does not necessarily meet the requirements; for example, when the feature dimension of the sample data matrix is greater than the number of samples, an ill-conditioned matrix occurs, which affects the stability of the method. The traditional remedy is to first perform singular value decomposition on the high-dimensional data to eliminate the effect of the ill-conditioned matrix, but singular value decomposition increases computation time and storage space and complicates the computation process.
To address this problem, Ref. [25] proposed the theoretical idea of Spectral Regression (SR), which makes the dimensionality reduction more robust and improves computational efficiency. Spectral regression theory is an improvement of manifold learning methods, but the current spectral regression method still considers only the local structure information of the sample data and ignores the global information. Therefore, this section proposes a spectral regression method that preserves both the local and the global information of the sample data structure, namely Local and Global Spectral Regression (LGSR). Based on the theoretical idea of SR, LGSR is a further improvement of NFDPP. The effectiveness of the improved method is demonstrated by analyzing the feature extraction and clustering effects of LGSR on engine and transmission statuses.
The essence of spectral regression (SR) is an improvement of LPP, but its principle can be applied to manifold learning methods in general, and it has been applied successfully in the field of image recognition. It has few mechanical applications, however (Xia et al. [26] applied SR directly to bearing fault classification), and current spectral regression does not remove LPP's limitation of considering only the near neighbors while ignoring the global information. Therefore, this section proposes LGSR, which considers both the sample neighborhood and the global information, based on SR and combined with NFDPP.
Given an m-dimensional sample matrix X = [x1, x2, x3, …, xN] ⊆ R^m, where N is the number of samples, find a transformation matrix B that maps the N samples to a low-dimensional subspace Z = [z1, z2, z3, …, zN] ⊆ R^d (d ≪ m), where the kth column vector of Z corresponds to the kth column vector of X.
From the above analysis, it can be seen that the low-dimensional subspace sample Y and the mapping matrix A can be obtained by solving the eigenvalue problem of Eq. (4.41), where the mapping matrix A = [a1, a2, …, ad] is composed of the eigenvectors

of Eq. (4.41) arranged according to the eigenvalues λ1 ≤ λ2 ≤ ··· ≤ λd. New test samples can be mapped from the high-dimensional space to the low-dimensional space through A.
The above linear mapping process obtains stable results in some cases, but the computation is based on the assumption that Y = A^T X exists, which is not necessarily true in practice; that is, a stable and effective dimensionality reduction cannot always be obtained. The traditional way to obtain stable results is to perform singular value decomposition on the matrix X to eliminate the effect of the ill-conditioned matrix, but singular value decomposition increases computation time and storage space. To solve this problem, Local and Global Spectral Regression, a method that considers both the local and the global structure of the samples, is proposed using the principle of SR combined with NFDPP.
(1) Spectral regression analysis
The low-dimensional space Y corresponding to the training sample X is obtained by calculating the eigenvectors of Eq. (4.40). The transformation matrix A satisfying Y = A^T X is no longer assumed to exist; instead, a transformation matrix B is calculated by regularized least squares, as in Eq. (4.43):

b_k = arg min_B [ Σ_{i=1}^{N} (B^T x_i − y_i)^2 + α‖B‖^2 ]   (4.43)

where α ≥ 0 is the controlling parameter.


To calculate the mapping matrix, Eq. (4.43) is turned into the linear system of Eq. (4.44):

(X X^T + α I) b_k = X y_k   (4.44)

where I is the m × m identity matrix.


Let B = [b1, b2, …, bd] be the m × d mapping matrix; the high-dimensional sample space X can then be mapped to the low-dimensional subspace through

X → Z = B^T X   (4.45)
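A sketch of the regularized solve of Eqs. (4.44)–(4.45): assuming an m × N training matrix X and an embedding Y obtained from Eq. (4.40), every column of the mapping matrix B comes from one ridge-regression system. The 100 × 68 shapes mirror Case 1 below, where the feature dimension exceeds the number of training samples; α = 0.1 is an arbitrary illustrative value.

```python
import numpy as np

def spectral_regression(X, Y, alpha=0.1):
    """Solve (X X^T + alpha I) b_k = X y_k (Eq. 4.44) for all columns
    of the embedding at once; no SVD of X is required.
    X: (m x N) training samples, Y: (N x d) embedding coordinates."""
    m = X.shape[0]
    G = X @ X.T + alpha * np.eye(m)        # regularized m x m Gram matrix
    return np.linalg.solve(G, X @ Y)       # m x d mapping matrix B

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 68))         # feature dimension > sample count
Y = rng.standard_normal((68, 3))           # stand-in embedding
B = spectral_regression(X, Y)
Z = B.T @ X                                # Eq. (4.45): X -> Z = B^T X
```

Because α > 0 makes X X^T + αI full rank even when m > N, the system stays well-posed exactly in the ill-conditioned regime the text discusses.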

4.4.5 Application of Method Based on Distance Preserving Projections and Its Spectral Regression in Fault Classification

4.4.5.1 Case 1: Fault Experiment on Camry Sport Engine

The test object was a four-stroke inline four-cylinder gasoline engine on a Camry. Vibration signals were measured on the engine when the second cylinder misfired and when the clearance of the second-cylinder spark plug became smaller, and the changes of the vibration signal in the normal and the two fault statuses were compared. The test system and test-point layout are shown in Fig. 4.43. Four PCB vibration acceleration sensors, an MKII acquisition front-end, and PAK analysis software were used to measure the vibration acceleration signals of the engine at idle speed, 2000, and 3000 rpm in each status, with a sampling frequency of 12,800 Hz and a sampling time of 2 s. Figure 4.44 shows the time-domain vibration waveforms of the No. 2 acceleration sensor on the cylinder head for the three engine statuses and corresponding rotational speeds. It can be seen from the figure that, in the same status, the vibration amplitude increases as the rotational speed increases and the distinction is obvious. Between different statuses, at idle speed, the vibration acceleration differences among the normal status, the second-cylinder misfire, and the reduced second-cylinder spark-plug clearance are obvious, because the faults change the working cycle of the engine and the periodic vibration changes correspondingly. As the rotational speed increased, the difference between the three statuses became smaller, making it difficult to distinguish the fault types in the time domain. Thirty sets of vibration signals were collected for each status; 25 feature indicators were extracted per sensor according to Table 4.27, giving a data matrix with a feature dimension of 100 (25 indicators for each of the 4 sensors) and a sample number of 270 (100 × 270).
To verify the effectiveness of the proposed method, 25% (a 100 × 68 matrix, in which case the feature dimension is greater than the number of samples) and 50% (a 100 × 135 matrix) of the samples were randomly selected from the total for training, and the remaining samples were used for testing. The recognition accuracy and clustering effect after dimensionality reduction were calculated by 1NN (nearest-neighbor classifier) and NMI, and the recognition effects of PCA, NPE, LPP, SR, and the proposed NFDPP and LGSR were compared. Figure 4.45 shows the classification rate and NMI indicator curves of the six methods when the number of training samples was 25% and 50% of the total samples. All curves are averaged over 20 random results.
Fig. 4.43 Test equipment and testing points arranged

Fig. 4.44 Time domain vibration signal of the engine at three statuses and different rotational speeds

It can be seen from Fig. 4.45 that, except for PCA, whose classification rate was 99.8% when the training samples were 25% of the total samples and the space was two-dimensional, the classification rate of the other methods reached 100%, and each method achieved good classification results.
It can be seen from comparing the normalized mutual information curves that when the training samples were 25% of the total samples, the NMI values of LGSR and NFDPP were between [0.58, 0.71] and [0.52, 0.63], respectively, while those of NPE, LPP, and SR were all about 0.2, and that of PCA was below 0.1. As the number of training samples increased to 50% of the total, the NMI values of LGSR and NFDPP rose correspondingly to [0.60, 0.78] and [0.54, 0.66], the NMI values of NPE were between [0.22, 0.28], and the remaining methods were not sensitive to the change in the number of training samples. Comparing the NMI curves at the different numbers of training samples shows that the NMI values of LGSR and NFDPP were greater than those of the remaining four methods because they preserve more of the sample structure information, indicating that a better clustering effect can be obtained with LGSR and NFDPP: after dimensionality reduction, samples of the same type cluster better and samples of different types separate better.

Fig. 4.45 Classification rate and NMI of the methods

4.4.5.2 Case 2: Fault Experiment Analysis of Transmission

The fifth gear of a Dongfeng SG135-2 transmission was taken as the experimental object (the transmission has three shafts and five gears; its structure is shown in Fig. 3.25), and the vibration acceleration signal was collected at measuring point No. 3 at its input. Four statuses of the transmission were simulated: normal, moderate spalling of the fifth gear, severe spalling of the gear plus deformation of the tooth surface, and tooth fracture. Figure 4.46 shows the faulty gears.

Fig. 4.46 Three types of faulty gears

The gear ratio in fifth gear was 22/42. The transmission input and output shaft speeds were 600, 410, and 784 rpm; the loading torque was 75.5 N·m; the sampling frequency was 40,000 Hz; the sampling length was 1024 × 90 points; and the fifth-gear meshing frequency was 287.5 Hz.
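The quoted meshing frequency can be checked with the usual relation f_mesh = (shaft speed in rpm / 60) × tooth count. Assigning the 22-tooth gear to the 784 rpm shaft and the 42-tooth gear to the 410 rpm shaft is an inference from the numbers, not something the text states explicitly.

```python
# Gear meshing frequency: f_mesh = (rpm / 60) * number of teeth.
# Assumed shaft/gear pairing (inferred, not stated in the text):
f_pinion = 784 / 60 * 22    # 22-tooth gear at 784 rpm -> about 287.5 Hz
f_wheel = 410 / 60 * 42     # 42-tooth gear at 410 rpm -> about 287.0 Hz
# Both sides of one mesh must agree, matching the quoted 287.5 Hz.
```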
Figure 4.47 shows the time-domain waveforms of the four gear statuses. It can be seen from the figure that the vibration amplitude was small in the normal status and larger when a gear fault occurred; the impact caused by tooth fracture was obvious, but there was no obvious impact in the spalling and severe spalling + tooth deformation statuses. To better distinguish the different fault types and achieve an intelligent diagnosis, manifold learning methods were used to extract the most effective features. The acceleration vibration signal of each status was divided into 44 groups, 176 groups in total, and the feature indicators of Table 4.28 were extracted for each group of acceleration signals, constituting a 13 × 176 sample feature space, where 13 is the feature dimension and 176 the number of samples.
25% and 50% of the total number of samples were randomly selected for training, and the remaining samples were used for testing. 1NN and NMI were used to calculate the classification rate and the normalized mutual information, respectively, to evaluate the dimensionality reduction effect of the different methods, as shown in Fig. 4.48; all curves in the figure are averages of 20 random results.
It can be seen from the classification rate curves in Fig. 4.48 that, except for NPE when the training samples were 25% of the total samples and the low-dimensional dimension was 2, the classification rate of each method was greater than 90%, indicating that all methods achieved a good classification effect. At the same time, the classification rate of NFDPP and LGSR was above 96%, significantly higher than that of the other four methods (when the low-dimensional dimension was 2, the classification rates of NFDPP, LGSR, SR, and PCA were close). When the training samples were 25% of the total samples, the classification rate of SR at dimensions 2–3 was greater than that of LPP, with a difference of up to 5.19%, indicating the superiority of SR. When the training samples were 50% of the total samples, the classification rate of SR was smaller than

Fig. 4.47 Time domain waveform of the gear at four statuses

Table 4.28 Original feature indicators of gear status

Time-domain feature indicators:
Average p1; Root square amplitude p2; Standard deviation p3; Peak p4; Skewness p5; Kurtosis p6; Shape indicator p7; Crest factor p8; Impulse factor p9; Margin factor p10

Frequency-domain feature indicators:
Rotational frequency amplitude of the shaft where the gear is located p11; Gear meshing frequency amplitude p12; Sum of the amplitudes of the main frequency and the side frequencies in the modulation band p13

that of LPP except when the dimension was 2. PCA achieved better results only when
the low-dimensional dimension was 2 and had the lowest classification rate at other
dimensions.
Comparing the normalized mutual information curves shows that the NMI value of LGSR was 0.89 when the training samples were 50% of the total samples and the low-dimensional dimension was 2, and the NMI values of NFDPP and LGSR were higher than 0.94 in all other cases. The NMI values of the other four methods were less than 0.93; when the training samples were 50% of the total samples and the low-dimensional dimension was 2, their NMI values were less than 0.87, the NMI value of PCA was greater than that of NPE, and the NMI value of SR was greater than that of LPP. When the training samples were 25% of the total samples, the NMI value of SR was greater than that of LPP, and it was smaller than that of LPP when the training samples were 50% of the total samples, indicating that SR and LPP are greatly affected by the number of training samples. The NMI value of PCA was the lowest when the dimension was greater than 2, indicating a poor clustering effect.

Fig. 4.48 Classification rate and NMI of different methods
Comparing the classification rate and NMI curves of each method shows that the two indicators of the proposed NFDPP and LGSR are better than those of the other methods under both numbers of training samples and all low-dimensional dimensions, indicating that the two proposed dimensionality reduction methods preserve the structure information of the original data more effectively and improve the classification rate and clustering effect. The difference between the dimensionality reduction effects of LGSR and NFDPP is obvious only when the training samples are 50% of the total samples and the low-dimensional dimension is 2; in the other cases the differences in classification rate and NMI are small and the dimensionality reduction effects of the two are similar. However, NFDPP uses singular value decomposition to improve stability, which increases computation time and storage space, whereas LGSR uses the idea of spectral regression to improve stability, with high computation efficiency and small storage requirements, and is therefore better suited to practical applications.

References

1. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500), 2319–2323 (2000)
2. Seung, H.S., Lee, D.D.: The manifold ways of perception. Science 290(5500), 2268–2269
(2000)
3. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding.
Science 290(5500), 2323–2326 (2000)
4. Liu, Y.W.: Applied Graph Theory. National University of Defense Technology Press, Changsha (2008)
5. Fiedler, M.: Algebraic connectivity of graphs. Czechoslov. Math. J. 23(2), 298–305 (1973)
6. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 22(8), 888–905 (2000)
7. Ding, K., Li, W.H., Zhu, X.Y.: Practical Technology for Gear and Gearbox Fault Diagnosis.
Mechanical Industry Press, Beijing (2005)
8. Zhou, D.Y., Bousquet, O., Lal, T.N., et al.: Learning with local and global consistency. In:
Proceeding of Advances in Neural Information Processing Systems, Cambridge, pp. 321–328
(2004)
9. Wang, L., Bo, L.F., Jiao, L.C.: Density-sensitive spectral clustering. Acta Electron. Sin. 35(8),
1577–1581 (2007)
10. Ng, A., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Proceeding
of Advances in Neural Information Processing Systems, pp. 849–856 (2002)
11. Chen, Y.S., Wang, G.P., Dong, S.H.: A progressive transductive inference algorithm based on support vector machine. J. Softw. 14(3), 451–460 (2003)
12. Chapelle, O., Zien, A.: Semi-supervised classification by low density separation. In: Proceed-
ings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 57–64
(2005)
13. Case Western Reserve University Bearing Data [DS/OL]. The Case Western Reserve University
Bearing Data Center Website: https://csegroups.case.edu/bearingdatacenter/pages/apparatus-
procedures
14. Wang, J.: Research on the Theory and Methods of Manifold Learning. University of Zhejiang
(2006)
15. Zhang, T., Hong, W.X., Jing, J., et al.: Representation problems in pattern recognition. J.
Yanshan Univ. 23(5), 382–388 (2008)
16. Duin, R.P.W., Pekalska, E.: The science of pattern recognition: achievements and perspec-
tives. In: Challenges for Computational Intelligence, pp. 221–259. Springer, Berlin, Heidelberg
(2007)
17. Gunn, S.R.: Support vector machines for classification and regression. ISIS Techn. Rep. 14(1),
5–16 (1998)
18. Bengio, Y., Paiement, J., Vincent, P., et al.: Out-of-sample extensions for LLE, Isomap,
MDS, eigenmaps, and spectral clustering. In: Proceeding of Advances in Neural Information
Processing Systems, Cambridge, pp. 177–184 (2004)
19. Kouropteva, O., Okun, O., Pietikainen, M.: Supervised locally linear embedding algorithm for
pattern recognition. In: Proceeding of Iberian Conference on Pattern Recognition and Image
Analysis, vol. 2652, pp. 386–394 (2003)
20. Li, B., Zheng, C.H., Huang, D.S.: Locally linear discriminant embedding: an efficient method
for face recognition. Pattern Recogn. 41(12), 3813–3821 (2008)
References 271

21. Lecun, Y., Cortes, C.: MNIST handwritten digit database [DS/OL] (2010). Available at: http://yann.lecun.com/exdb/mnist
22. He, X., Cai, D., Yan, S., et al.: Neighborhood preserving embedding. In: Proceedings in
International Conference on Computer Vision, Beijing, 17–21 Oct 2005
23. Kohonen, T.: Self-organizing Maps, 3rd extended edn. Springer, Berlin (2001)
24. Levina, E., Bickel, P.J.: Maximum likelihood estimation of intrinsic dimension. In: Proceeding
of Advances in Neural Information Processing Systems, Cambridge, pp. 777–784 (2004)
25. Cai, D.: Spectral Regression: A Regression Framework for Efficient Regularized Subspace
Learning. University of Illinois at Urbana-Champaign, Urbana-Champaign (2009)
26. Xia, Z., Xia, S., Wan, L., et al.: Spectral regression based fault feature extraction for bearing
accelerometer sensor signals. Sensors 12(10), 13694–13719 (2012)
Chapter 5
Deep Learning Based Machinery Fault Diagnosis

5.1 Deep Learning

Deep Learning (DL) is one of the most popular technologies in the current machine learning field, and MIT Technology Review ranked it among the Top 10 Breakthrough Technologies of 2013 [1]. Essentially, DL is a deep neural network with multiple hidden layers, and the main difference between DL and the traditional multilayer perceptron network (MLPN) is the learning algorithm. The concept of "deep learning" was first introduced by Professor Hinton of the University of Toronto, a leader in the field of machine learning, in an article published in Science in 2006 [2], which started the wave of DL research. That paper pointed out two main features of DL. First, neural networks with multiple hidden layers (DBNs, deep belief networks) have excellent feature learning ability, and the learned features give a more essential picture of the data, thus facilitating classification. Second, deep neural networks are difficult to train, and this challenge can be overcome by training with a layer-wise pre-training strategy; in the paper, layer-by-layer initialization is implemented by unsupervised learning of the restricted Boltzmann machine (RBM).
Bengio et al. of the University of Montreal proposed a DL algorithm AutoEncoder
(AE) and applied it in the field of speech recognition and natural language recognition
[3]. Lecun et al. of New York University proposed Convolutional Neural Network
(CNN) for house number and traffic sign recognition [4]. Ng et al. of Stanford University and Google used DBNs and sparse coding for real-time display of Google Maps
street view and car autopilot [5]. Jordan [6] and Elman [7] proposed a recurrent feedback neural network framework, namely the Recurrent Neural Network (RNN). NIPS and ICML, important conferences in neural computing and machine learning, have hosted DL topics since 2011, and Google, Facebook, Microsoft, Baidu, and other companies have set up dedicated DL R&D departments to industrialize the technology [8]. In March 2016, AlphaGo, which adopted deep learning technology, defeated Lee Sedol, a ninth-dan Go professional from South Korea, which further
© National Defense Industry Press 2023 273


W. Li et al., Intelligent Fault Diagnosis and Health Assessment for Complex
Electro-Mechanical Systems, https://doi.org/10.1007/978-981-99-3537-6_5
274 5 Deep Learning Based Machinery Fault Diagnosis

promoted the popularity of deep learning, reinforcement learning, neural network,


and so on.
From the DL perspective, most existing intelligent fault diagnosis methods can be considered shallow structured algorithms containing only a single layer of nonlinear transformation, including HMM, SVM, logistic regression, and other traditional neural networks. Shallow models usually contain only a single simple structure that transforms the original input signal to a specific feature space. The resulting limitation is that such models have a limited ability to represent complex functions with a limited number of computational units, and their generalization performance for complex classification problems is constrained. Therefore, IFD methods based on these shallow structured algorithms can be applied effectively to specific mechanical components, but give the impression of being "tailor-made".
The core of DL is to construct a machine learning model with multiple hidden
layers which can learn effective features from a large amount of training data and
achieve excellent classification or prediction results. Compared with traditional shallow learning models, DL-based models perform better in extracting high-level feature representations from low-level data. Therefore, Deep Belief Network, AutoEncoder, Convolutional Neural Network, Recurrent Neural Network, and other deep models can perform better in processing multi-dimensional data for industrial health monitoring of mechanical systems. Quoting
Prof. Ng of Stanford University: "Deep learning happens to have the property that if you feed it more data it gets better and better" [9]. As the saying goes, "stones from other hills may serve to polish the jade of this one": with these deep models, IFD can obtain low-dimensional nonlinear representation algorithms and health-state classification algorithms for multi-source data. As early as 2013, scholars applied DL in multi-sensor fault diagnosis. Tamilselvan et al. [10] proposed a novel health assessment method
using a deep belief network for multi-sensor systems and applied it for fault classifi-
cation of aero-engines and power transformers successfully. Tran et al. [11] utilized
deep belief networks to fuse different kinds of signals such as vibration, pressure, and
current for fault classification and identification in reciprocating compressor valves.
At present, the common deep learning networks include Convolutional Neural
Network (CNN), Deep Belief Network (DBN), Stacked Auto-Encoder Network
(SAE), and Recurrent Neural Network (RNN).

5.2 DBN Based Machinery Fault Diagnosis

Deep belief network (DBN) has the advantage of forming high-level abstract representations by combining lower-level features and learning distributed feature representations; the high-level representations partly characterize the low-level data features and help the network maintain the original data information. Owing to these characteristics, the DBN can classify and identify fault states directly from the original data. This method avoids the manual feature
5.2 DBN Based Machinery Fault Diagnosis 275

extraction and optimization process and enhances the intelligence of mechanical


fault diagnosis.

5.2.1 Deep Belief Network

5.2.1.1 Deep Belief Network

Boltzmann Machine (BM) is a stochastic spin-glass model with an external field, i.e.,
a Sherrington–Kirkpatrick model proposed by Hinton and Sejnowski in 1986 [12].
The energy model can be intuitively understood as follows: a small ball with a rough
surface and an irregular shape is randomly placed into a bowl with the same rough
surface. Due to the gravitational potential energy, the ball is usually most likely to
stop at the bottom of the bowl, but it is also possible to stop at other locations at
the bottom of the bowl. The energy model defines the final stopping position of the
ball as a state, each state corresponds to an energy, and the energy can be defined
by the energy function. Therefore, the probability that the ball is in a certain state
can be defined in terms of the energy that the ball has in the current state. For a
system, the energy of the system corresponds to the probability of the state of the
system. The more orderly the system is, the smaller the energy of the system and the
more concentrated the probability distribution of the system; if the system is more
disorderly, the larger the energy of the system and the more uniform the probability
distribution of the system tends to be.
Each Boltzmann machine consists of two different layers, defined as a visible
layer (v) and a hidden layer (h). Each layer consists of several stochastic neurons,
which are connected to each other by weights (w), and each neuron has only two
output states, inactive and active, represented by binary 0 and 1. The neuron states
take values according to probabilistic rules. As shown in Fig. 5.1a, BM is
a fully connected neural network of
statistical rules. As shown in Fig. 5.1a, BM is a fully connected neural network of
random neurons, which has a powerful ability to unsupervised learning to specific
rules from complex data, but it has a long training time and large computational cost.
Smolensky et al. [13] proposed Restricted Boltzmann Machine (RBM) by combining
the properties of Markov model, which eliminates the connection between the same
layers in the BM so that the state of each layer is only related to the state of the
previous layer. The structure of the RBM is shown in Fig. 5.1b, where the neurons
between the visual layers are independent of each other and the neurons between the
hidden layers are also independent of each other, but the neurons in the visual and
hidden layers can be connected to each other by the weights (w).
DBN is a multilayer perceptron neural network consisting of a series of stacked
RBMs, so a deep belief network can also be interpreted as a Bayesian probabilistic
generative model composed of multiple layers of random hidden variables; the
detailed algorithm derivation is given in [2]. The DBN structure used here consists
of three stacked RBMs, as shown in Fig. 5.2. The input data are learned by
the RBMs in the lower layer, and their outputs serve in turn as inputs to the RBMs
in the higher layer; passed on layer by layer in this way, the higher layers form
feature representations that are more abstract and representative than those of the
lower layers.

276 5 Deep Learning Based Machinery Fault Diagnosis

Fig. 5.1 The architecture of BM and RBM

Fig. 5.2 The architecture of DBN

It
is this layer-by-layer greedy learning idea of DBN [14] that allows DBN to classify
and identify common mechanical faults of transmissions directly from raw data. The
DBN learning process consists of two parts: forward stacked-RBM learning from the
low layers to the high layers, and backward fine-tuning from the high layers to the
low layers. The forward stacked-RBM learning is unsupervised and involves no labeled
data, whereas the backward fine-tuning is supervised and uses labeled data.
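The forward stacked-RBM learning described above is usually trained with contrastive divergence. As a rough, hypothetical sketch (not the book's code; a CD-1 mini-batch update assuming binary units), one RBM update might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, w, a, b, eps=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM.
    v0: batch of visible vectors, shape (batch, n_visible)."""
    ph0 = sigmoid(v0 @ w + b)                         # P(h|v) for the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(h0 @ w.T + a)                       # reconstruction P(v|h)
    ph1 = sigmoid(pv1 @ w + b)                        # hidden probs of recon
    n = v0.shape[0]
    # <v h>_data - <v h>_reconstruction drives the updates
    w = w + eps * (v0.T @ ph0 - pv1.T @ ph1) / n
    a = a + eps * (v0 - pv1).mean(axis=0)
    b = b + eps * (ph0 - ph1).mean(axis=0)
    return w, a, b
```

Stacking then means training one layer this way and feeding its hidden probabilities as "data" to the next RBM.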

5.2.1.2 Parameter Settings of DBN

(1) Initialization

There are three parameters to initialize in the DBN: the RBM weight matrix w, the
RBM offset vector of the visible (input) units a, and the RBM offset vector of the
hidden units b. All three parameters can be initialized randomly.

The offset vectors a and b can also be initialized to zero: since the parameters of
both the visible and hidden layers are updated iteratively, the specific initial
values have little effect on the results.
(2) Learning rate
There are two learning-rate parameters in DBN, namely the learning rate ε of the
forward stacked-RBM learning and the learning rate α of the backward fine-tuning. A
larger learning rate makes the algorithm converge faster but may make it unstable;
a smaller learning rate slows convergence and increases computation time. In a
follow-up study, Hinton proposed the parameter Momentum, which combines the
direction of the previous parameter update with the direction of the current
gradient. In other words, the direction of the current parameter modification is
not determined solely by the gradient of the current sample likelihood function,
which allows more accurate convergence to a local optimum. The Momentum can be
defined as:

Momentum = ρθ + ε ∂ln L/∂θ (5.1)

where ρ is the learning rate of momentum; θ is the parameter set θ = {w, a, b}; L
is the maximum likelihood function. Traditionally, ρ takes values from 0.5 to 0.9,
and ε and α are set to 0.1.
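A minimal sketch of the momentum rule of Eq. 5.1, under the assumption that the previous update direction is carried in a separate velocity variable (the names here are illustrative, not from the book):

```python
import numpy as np

def momentum_step(theta, velocity, grad, rho=0.5, eps=0.1):
    """One momentum update in the spirit of Eq. 5.1: the new step blends
    the previous update direction (velocity) with the current gradient of
    the log-likelihood (grad), then moves the parameters theta."""
    velocity = rho * velocity + eps * grad
    return theta + velocity, velocity
```

Repeated calls smooth the update trajectory, which is the effect the text describes: the step direction is no longer dictated by the current gradient alone.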
(3) Network layers
The "depth" in deep learning refers to models having more layers. In general, a
network with more layers is more likely to uncover the essential features of the
data in finer detail. However, existing research offers no definitive conclusion on
how many layers a deep learning network needs, and it is not guaranteed that a
deeper architecture performs better. With more layers, the structure becomes more
complex and the computation more time-consuming, and accumulated error may reduce
training efficiency. In this section, we choose a deep learning network composed of
three RBMs for the experiments. As shown in Table 5.2, the DBN structure consists
of a visible layer that acts as the input layer, three hidden layers, and an output
layer connected to the classification model.
(4) Node number
In the DBN structure, the node number of the input layer equals the dimension of
the input data, and the node number of the output layer equals the number of
categories. However, there is no conclusive research on how to set the number of
nodes in the hidden layers, and the choice is largely subjective. Deep learning
networks developed from the Back Propagation (BP) neural network; BP research is
comparatively mature, and the number of nodes can be chosen from empirical
formulae [15], such as:
S = √(m + n) + a (5.2)

S = √(mn) + k/2 (5.3)

S = √(mn + n) (5.4)

S = log2 m (5.5)

S ≤ √(n(m + 3)) + 1 (5.6)

where m is the number of input nodes, n is the number of output nodes, and a and k
are constants taken from [0, 10].
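A sketch evaluating several of the empirical formulae above for candidate hidden-layer sizes (the constants a and k are chosen by the user from [0, 10]; function and key names are illustrative):

```python
import math

def hidden_node_candidates(m, n, a=0.0, k=0.0):
    """Candidate hidden-layer sizes S from the empirical formulae,
    given m input nodes and n output nodes."""
    return {
        "eq_5_2": math.sqrt(m + n) + a,            # S = sqrt(m+n) + a
        "eq_5_3": math.sqrt(m * n) + k / 2,        # S = sqrt(mn) + k/2
        "eq_5_5": math.log2(m),                    # S = log2(m)
        "eq_5_6_upper": math.sqrt(n * (m + 3)) + 1 # upper bound on S
    }

# This chapter's bearing case: m = 14 input features, n = 7 fault classes
cands = hidden_node_candidates(14, 7, a=10, k=10)
```

With m = 14 and n = 7 these formulae already disagree widely, which is the point the text makes about their limited guidance.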
Combined with the DBN theory described above, the main steps of DL-based
mechanical fault classification are as follows.
Step 1: Acquire the vibration data, and set up training data and testing data.
Step 2: Input the training data into the first RBM, and train all RBMs in the DBN
layer by layer.
Step 3: Fine-tune the parameters backward, from the high layers to the low layers,
on the basis of Step 2, using the labeled training data and a Softmax classifier,
thereby completing the training of the DBN model.
Step 4: Input the testing data into the model trained in Step 3 and calculate the
fault classification accuracy.

5.2.1.3 Bearing Fault Classification

To explore the practical application of DBN in fault diagnosis, the following
experiment is set up to simulate common bearing faults in a transmission. The
structure of the experimental system is shown in Fig. 5.3: the faulty bearing is
installed at position 2, and an acceleration sensor is mounted directly above the
bearing seat at position 2 to collect the vibration acceleration signal.
Before the experiment, grooves with a depth of 0.5 mm and widths of 0.5 mm, 1 mm
and 2 mm were machined by wire cutting into the outer and inner races of the
bearing to simulate mild, moderate and severe faults, respectively. During the
experiment, the intermediate shaft speed is 1400 r/min, the sampling frequency is
24 kHz, and the sampling time is 20 s; the experimental working conditions are
listed in Table 5.1.
Figure 5.4 shows the time-domain signals of the seven bearing states. The
time-domain signals of the different states show certain discrepancies: the
amplitude range of the normal state lies within [−10, 10], that of the outer-race
mild fault within [−25, 25], and that of the outer-race severe fault within
[−50, 50]. The signals

Fig. 5.3 The bearing test platform and bearing faults

Table 5.1 Description of experimental conditions


Group Size of fault/mm Fault degree Bearing conditions
1 0 None Normal
2 Width 0.5 mm, depth 0.5 mm Mild fault Inner race mild fault
3 Width 1 mm, depth 0.5 mm Moderate fault Inner race moderate fault
4 Width 2 mm, depth 0.5 mm Severe fault Inner race severe fault
5 Width 0.5 mm, depth 0.5 mm Mild fault Outer race mild fault
6 Width 1 mm, depth 0.5 mm Moderate fault Outer race moderate fault
7 Width 2 mm, depth 0.5 mm Severe fault Outer race severe fault

Table 5.2 The combination of node number in the DBN hidden layer
Group First hidden layer Second hidden layer Third hidden layer Total node number
1 5 5 5 15
2 5 10 15 30
3 10 15 20 45
4 10 10 10 30
5 10 10 15 35
6 10 10 20 40
7 10 5 10 25
8 10 20 10 40
9 10 15 10 35
10 10 15 15 40
11 5 15 10 30
12 15 15 15 45
13 20 20 20 60
14 20 15 10 45
15 20 15 5 40
16 20 10 5 35

with larger discrepancies are relatively easy to classify and identify, but the
amplitude ranges of the inner-race mild, inner-race severe, inner-race moderate and
outer-race moderate faults all lie within [−25, 25]; their discrepancies are small
and the fault types are difficult to distinguish. From the viewpoint of impulsive
behavior: the signals of the normal state and the inner-race mild fault fluctuate
only slightly and can be considered free of shocks; the inner-race moderate,
outer-race mild and outer-race moderate faults all show a certain shock phenomenon;
and the inner-race severe and outer-race severe faults show more prominent shocks.
Although some bearing faults can be distinguished from the time-domain plots, it is
still difficult to classify all seven bearing states accurately.
A set of common statistical indicators of the vibration signals in the seven states
is extracted, including maximum value, minimum value, peak-to-peak value, mean
value, mean square value, variance, root mean square value, average amplitude,
square-mean-root amplitude, kurtosis, peak value, waveform indicator, peak
indicator, pulse indicator and margin indicator. Fourteen such statistical
indicators are used as the input features of the DBN, so the number of input nodes
of the DBN is 14. The number of output nodes is 7, because seven bearing states
must be classified and identified in this case.

Fig. 5.4 Time-domain signals of different bearing states

The results of the empirical Eqs. 5.2–5.6 vary greatly, because the numbers of
input and output nodes differ little and the constants a and k, with value range
[0, 10], strongly affect the results. For example, with Eq. 5.3 the first hidden
layer takes values in roughly [10, 15] (√(14 × 7) + k/2), and with Eq. 5.4 roughly
[5, 15] (√(14 + 7) + a). Considering the empirical formulae and
the difference between DBN and shallow BP learning networks, this section sets up
several deep learning networks with different hidden-layer combinations to discuss
the influence of different combinations on the classification results. For
convenience of expression, m × n × z denotes a combination in which the first
hidden layer has m nodes, the second hidden layer has n nodes, and the third hidden
layer has z nodes.
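For illustration, the time-domain indicators named earlier could be computed along these lines; the definitions used here are common ones and may differ in detail from the book's:

```python
import numpy as np

def time_domain_features(x):
    """Common time-domain statistical indicators of a vibration signal,
    suitable as DBN input features. Definitions are conventional ones."""
    x = np.asarray(x, float)
    abs_x = np.abs(x)
    rms = np.sqrt(np.mean(x ** 2))
    smr = np.mean(np.sqrt(abs_x)) ** 2   # square-mean-root amplitude
    peak = abs_x.max()
    return {
        "max": x.max(),
        "min": x.min(),
        "peak_to_peak": x.max() - x.min(),
        "mean": x.mean(),
        "mean_square": np.mean(x ** 2),
        "variance": x.var(),
        "rms": rms,
        "mean_abs": abs_x.mean(),                              # average amplitude
        "smr": smr,
        "kurtosis": np.mean((x - x.mean()) ** 4) / x.var() ** 2,
        "peak": peak,
        "shape": rms / abs_x.mean(),                           # waveform indicator
        "crest": peak / rms,                                   # peak indicator
        "impulse": peak / abs_x.mean(),                        # pulse indicator
        "margin": peak / smr,                                  # margin indicator
    }
```

Each training sample is then the vector of these indicator values computed on one signal segment.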
For each bearing state, 80 training samples and 80 testing samples are randomly
selected, so in total 560 training samples and 560 testing samples are obtained for
the seven bearing states in Table 5.1. The hidden-layer combinations of the DBN are
listed in Table 5.2. The number of iterations is set to 200, and each result is
averaged over 20 repetitions.
Figure 5.5 shows the relationship between bearing-state classification accuracy and
the number of iterations when the 14 common features are input to the DBN with
hidden-layer combinations 5 × 10 × 15, 10 × 15 × 20, 10 × 10 × 10, 20 × 15 × 10 and
20 × 10 × 5. As the figure shows, the classification accuracy is below 20% before
iteration begins and rises sharply as the iterations increase. When the number of
iterations exceeds 50, the accuracy increases more slowly, and by 200 iterations
the curves have gradually leveled off, with little further change. Thus, more
iterations are beneficial to the accuracy of bearing-state classification.
Table 5.3 shows the classification results of the seven bearing states after 200
iterations of DBN. “Accuracy” refers to the highest accuracy in 200 iterations, and

Fig. 5.5 The relationship between the classification accuracy of bearing faults and
the number of iterations

the higher the value, the stronger the recognition performance. "Iterations" refers
to the number of iterations at which the highest accuracy is reached; a higher
value means more iterations are needed to obtain high classification accuracy, at
an increased computational cost.
As shown in Table 5.3, accuracies of 13 combinations with different hidden layers
exceeded 80%, and the results of combinations of 5 × 5 × 5, 20 × 10 × 5 and 20 ×
15 × 5 are relatively low. The accuracies of 9 sets exceeded 90%, and the accuracies
of two sets were close to 90% (89.55% for set 3 and 89.71% for set 6). It can be
concluded that DBN can be applied in complex classification and identification tasks
of bearing states. Furthermore, the iteration counts corresponding to the highest
accuracy are similar across the 16 groups, all close to the preset value of 200.
Therefore, the highest accuracy in 200 iterations can be used as the main index for
evaluating different combinations in this case. Comparing the highest accuracies of
different combinations in Table 5.3, it can be concluded as follows:
(1) In the "constant value" combinations, the combination closest to the number of
input nodes is best. "Constant value" means the three hidden layers have the same
number of nodes. There are four such combinations in this case, namely 5 × 5 × 5,
10 × 10 × 10, 15 × 15 × 15 and 20 × 20 × 20, with classification results of 77.01,
91.17, 92.43 and 90.16%, respectively. The DBN has 14 input nodes and 7 output
nodes, and the combination 15 × 15 × 15 gives the highest accuracy. This is because
the deep learning network mines the distributed feature representation of the data,
and the distributed features unfold more easily when the number of nodes per layer
is close to the number of input nodes.

Table 5.3 The results of bearing condition classification

Group First hidden layer Second hidden layer Third hidden layer Total node number Iterations Accuracy (%)
1 5 5 5 15 199 77.01
2 5 10 15 30 199 87.09
3 10 15 20 45 200 89.55
4 10 10 10 30 197 91.17
5 10 10 15 35 200 90.11
6 10 10 20 40 191 89.71
7 10 5 10 25 197 82.48
8 10 20 10 40 200 93.89
9 10 15 10 35 200 90.91
10 10 15 15 40 200 93.92
11 5 15 10 30 199 90.53
12 15 15 15 45 199 92.43
13 20 20 20 60 200 90.16
14 20 15 10 45 195 92.08
15 20 15 5 40 200 79.89
16 20 10 5 35 175 73.97
(2) In the "rising value" combinations, combinations with more total nodes are
better. "Rising value" means the nodes of higher hidden layers are not fewer than
those of lower hidden layers (excluding the "constant value" case). For example,
there are four "rising value" combinations in this case, 5 × 10 × 15, 5 × 15 × 20,
10 × 10 × 15 and 10 × 15 × 20, whose highest accuracies are 87.09%, 89.55%, 90.11%
and 92.21%, respectively. Comparing these four combinations, the result gradually
improves as the total number of nodes increases, because a network with more nodes
can mine the distributed features of the data in more detail and has a stronger
ability to interpret the data.
(3) In the "declining value" combinations, combinations with more total nodes are
also better. In contrast to the "rising value" combinations, the nodes of the
higher hidden layers are not more numerous than those of the lower hidden layers
(excluding the "constant value" case). In this case there are 20 × 15 × 10,
20 × 15 × 5 and 20 × 10 × 5, with corresponding accuracies of 92.08, 79.89 and
73.97%. The combination with more nodes has higher accuracy, consistent with the
"rising value" results, again confirming that a network with more nodes mines the
distributed features of the data in more detail and explains the data better.
(4) In the "concave-convex" combinations, combinations with more total nodes are
better. A "concave-convex" combination means the node number of the middle hidden
layer is lower (or higher) than those of the other two layers. In this case there
are four such combinations, 10 × 5 × 10, 5 × 15 × 10, 10 × 15 × 10 and
10 × 20 × 10, with corresponding accuracies of 82.48%, 90.53%, 90.91% and 93.89%.
The accuracy of the combinations with more nodes is clearly higher, so more nodes
in the DBN are beneficial for data mining.
The experimental analysis shows that complex bearing states can be efficiently
classified and identified using DBN. By setting different hidden-layer combinations
and comparing their classification accuracies, we can draw the preliminary
conclusion that combinations whose hidden-layer node numbers are close to the
number of input nodes, and whose total node numbers are higher, are more favorable.
Although strict theoretical support is lacking and validation over a large body of
statistical data has not been performed, this preliminary conclusion still offers
useful guidance for constructing the DBN structure.

5.2.2 DBN Based Vibration Signal Diagnosis

To further verify that deep learning methods can omit the manual feature extraction
and optimization process in fault classification and recognition, it is necessary
to study the output obtained when the original vibration signal is processed
directly by the deep learning network.

5.2.2.1 Reconstructed DBN

A DBN is stacked from multiple layers of RBMs: the input data form the visible
layer of the lowest RBM, the learned output forms its hidden layer, which in turn
serves as the visible layer of the next RBM, and so on layer by layer. The output
value of the last RBM layer of the network is the probability value of the hidden
layer, P(h|v), which is not easy to quantify and observe. Furthermore, the node
number of the hidden layer is related to the degree of data segmentation, so it is
not meaningful to discuss P(h|v) directly.
The hidden and visible layers of an RBM are connected through the sigmoid function;
in fact, the forward stacked-RBM learning in DBN can be regarded as a sigmoid
encoding process. If the input signal can be reconstructed by decoding, and the
difference between the original and reconstructed signals is small, then the
original signal can be recovered from the high-level features of the DBN with low
distortion, i.e., the high-level features characterize the original data to a
certain extent. This indicates that the DBN preserves the details of the original
data well, and it also explains why deep learning can learn relevant features
autonomously without a manual feature extraction process.
To explore the output state of the original signal after the DBN, the output of the
highest DBN layer is fed directly into a decoding network to reconstruct the
original input signal, without passing through the Softmax classifier; the
reconstruction network is built on the DBN. To distinguish it from the original
DBN, this network is called the reconstructed deep belief network. The forward
stacked-RBM learning is described uniformly and simply in encoding form. Let the
input be x; the encoding function can be denoted as

h = 1 / (1 + exp(−wx − b)) (5.7)

The decoding function can be denoted as,

x̂ = 1 / (1 + exp(−wᵀh − c)) (5.8)

where w is the weight matrix, and b and c are the bias vectors.
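Equations 5.7 and 5.8 amount to a sigmoid encoder-decoder pair that shares the weight matrix w; a minimal sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, w, b):
    """Eq. 5.7: hidden activation h = sigmoid(w x + b)."""
    return sigmoid(w @ x + b)

def decode(h, w, c):
    """Eq. 5.8: reconstruction x_hat = sigmoid(w^T h + c),
    reusing the transposed encoder weights."""
    return sigmoid(w.T @ h + c)
```

Chaining `decode(encode(x, w, b), w, c)` gives the reconstruction whose deviation from x is the distortion discussed next.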



There are certain errors between the reconstructed signal and the original signal,
because the original signal is compressed, rotated, filtered and re-formed as it
passes through the encoder and decoder. This error can be defined as the
distortion, expressed as a relative mean square error. Let x be the original
signal, y the reconstructed signal, and m the length of the signal; the distortion
can be denoted as:

S = √( Σ^m_i=1 (x_i − y_i)² / Σ^m_i=1 x_i² ) (5.9)

The more consistent the reconstructed signal is with the original signal, the lower
the distortion, and the better the DBN performs at autonomously learning high-level
features directly from the original data.
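A direct transcription of the distortion measure of Eq. 5.9 (a sketch):

```python
import numpy as np

def distortion(x, y):
    """Relative mean-square distortion of Eq. 5.9: 0 for a perfect
    reconstruction, growing as the reconstruction y diverges from x."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2) / np.sum(x ** 2)))
```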
The encoding and decoding process of the network involves no labeled data and can
be considered unsupervised. Corresponding to the backward fine-tuning in DBN,
labeled data can be added to the deep belief reconstruction network to further
fine-tune the model. Suppose the deep network consists of nl RBMs, x represents the
initial sample, n the sample length, u^nl(x) the encoded output (output of the last
RBM layer), and y^nl(x) the decoded output. The decoded output y^nl(x) is a
reconstruction of the initial sample. The error between the original signal and the
reconstructed signal can be denoted as the mean squared error

J(x) = (1/m) Σ^n_j=1 (y_j^nl(x) − x_j)² (5.10)

According to Eq. 5.7, the decoded output in Eq. 5.10 can be expressed compactly as
y^nl(x) = f^nl_w,b(x). Gradient descent is used to minimize the error:

∇J(θ) = ∂J(x)/∂θ = (2/m) Σ^n_j=1 (y_j^nl(x) − x_j) ∂f^nl_w,b,j(x)/∂θ (5.11)

where θ = {w, b, c}, 1 ≤ l ≤ nl. The parameter updates can be denoted as

w̃l = wl − α ∂J(x)/∂wl (5.12)

b̃l = bl − α ∂J(x)/∂bl (5.13)

c̃l = cl − α ∂J(x)/∂cl (5.14)

5.2.2.2 The Reconstruction Process of Vibration Signals

The original vibration signal is a continuous function of time, and a certain
length of data is required to characterize it. For example, periodic signals
require at least one cycle of data to obtain the peak value, mean value, etc. of
the signal. However, one cycle of data is usually relatively long: a vibration
signal sampled at 24 kHz on a shaft rotating at 1000 r/min has 1440 data points per
revolution (60/1000 × 24,000). When the original vibration signal is used as input,
the number of data points per sample equals the sample dimension. When the
dimension is too large, it not only increases the computational effort but may also
hinder the analysis through interference between dimensions. If the algorithm can
effectively reduce the dimensionality of the data while analyzing it, the
subsequent analysis benefits. When the number of hidden nodes in the top RBM layer
of the DBN is smaller than the sample dimension, the dimension of the DBN output is
smaller than the sample dimension. The deep belief network therefore completes
dimensionality reduction while deeply mining the features of the original data.
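The per-revolution sample count used in the example follows directly from the shaft period and the sampling rate:

```python
def points_per_revolution(rpm, fs_hz):
    """Samples recorded during one shaft revolution:
    (60 / rpm) seconds per revolution times fs_hz samples per second."""
    return 60.0 / rpm * fs_hz

# The example from the text: 1000 r/min sampled at 24 kHz
assert points_per_revolution(1000, 24_000) == 1440.0
```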
In summary, the steps of vibration signal reconstruction based on the reconstructed
DBN are as follows.
(1) Acquire vibration data and select the number of sample data points according to
the actual demand.
(2) Input the original signal into the visible layer of the first RBM (RBM1) and
learn the hidden layer and the inter-layer connection weights.
(3) Use the hidden layer of the first RBM (RBM1) as the visible layer of the second
RBM (RBM2), and learn the hidden layer and inter-layer connection weights.
(4) Repeat step (3) until all RBMs are learned. Steps (2) to (4) may be called the
encoding process, and the hidden-layer output of the highest layer may be called
the encoding layer.
(5) Decode the encoding-layer data in reverse, layer by layer, to obtain the
reconstructed signal corresponding to the original signal.
(6) If a fine-tuning step is added after step (5), the fine-tuned reconstructed
signal is obtained.
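Steps (2)-(5) above can be sketched as a stacked sigmoid encode-decode pass; the weights below are random placeholders standing in for trained RBM weights, and the layer sizes follow the 512-1000-800-400-d structure described next:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(x, weights, biases_h, biases_v):
    """Encode through the stacked layers (steps 2-4), then decode
    layer by layer in reverse order (step 5)."""
    h = x
    for w, b in zip(weights, biases_h):
        h = sigmoid(h @ w + b)        # encoding pass, low to high layers
    for w, c in zip(reversed(weights), reversed(biases_v)):
        h = sigmoid(h @ w.T + c)      # decoding pass, high to low layers
    return h

# Hypothetical layer sizes: 512 -> 1000 -> 800 -> 400 -> 16
sizes = [512, 1000, 800, 400, 16]
weights = [0.01 * rng.standard_normal((a, b)) for a, b in zip(sizes, sizes[1:])]
biases_h = [np.zeros(b) for b in sizes[1:]]
biases_v = [np.zeros(a) for a in sizes[:-1]]
x = rng.random(512)
x_hat = reconstruct(x, weights, biases_h, biases_v)
```

With trained weights, `x_hat` is the reconstructed signal whose distortion relative to x is measured by Eq. 5.9.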
Assume that 512 sample data points are selected according to the actual demand,
i.e., a 512-point segment of the original signal is taken as the input. To fully
exploit the original data information while retaining the dimensionality-reduction
effect of the network, the first hidden layer is set to about twice the input size,
i.e., 1000 nodes; the second layer is reduced to 800 nodes; the third layer is
further reduced to 400 nodes; and the fourth layer is the target layer with d
nodes. The reconstruction procedure based on the reconstructed DBN is illustrated
in Fig. 5.6.

Fig. 5.6 The procedure of vibration signal reconstruction

The original signal (512 dimensions) is used as the input of the first visible
layer and forms the first RBM (RBM1) with the first hidden layer (1000 dimensions);
the first hidden layer (1000 dimensions) serves as the second visible layer and
forms the second RBM (RBM2) with the second hidden layer (800 dimensions); the
second hidden layer (800 dimensions) serves as the third visible layer and forms
the third RBM (RBM3) with the third hidden layer (400 dimensions). The encoding
layer is the target layer d after dimensionality reduction, which is structurally
analogous to the fault-category layer in the fault classification model. Since the
target layer and the third hidden layer (400 dimensions) are also coupled according
to the RBM rules, they can be regarded as the fourth RBM (Top RBM4). The figure
shows the connection weights between the layers and the fine-tuning parameters. As
can be seen, the encoding-decoding process with fine-tuning differs from the
process without fine-tuning only in the additional fine-tuning parameters; the
other structures and steps are the same.

5.2.2.3 Reconstruction Analysis of Vibration Simulation Signal

(1) Reconstruction analysis of initial sinusoidal simulation signal

Usually, a complex vibration signal can be regarded as a superposition of several
sinusoidal signals. To facilitate observation of the effect of encoding and
decoding by the reconstructed DBN, and considering the positive and negative parts
of vibration signals, the following signals are simulated from a standard sinusoid;
the details are given in Table 5.4. The sampling frequency is 10,240 Hz and the
sampling time is 8 s, so 81,920 data points are collected in total. The
reconstructed DBN of Fig. 5.6 is used, with the initial data length set to 512
points. Because the choice of hidden-layer node numbers in the multilayer RBM
structure is relatively complex, the hidden-layer structure 1000 × 800 × 400 is
kept fixed for a simple discussion of the reconstruction effect, and the number of
iterations is set to 20. Three target lengths, d = 16, 32 and 48, are used to
discuss the effect of the target-layer length on the reconstructed signal. After
encoding and decoding by the reconstructed DBN, the reconstruction results for the
initial sinusoidal signals are compared in Table 5.5.
Several observations can be drawn from Table 5.5:
(1) Negative signals are entirely unsuitable for the network computation. The
negative parts of both the fine-tuned and non-fine-tuned reconstructed signals lie
near the zero level; after reconstruction by the deep network, the negative part of
the simple sine signal is near zero and only a very narrow positive part remains
positive, with values much smaller than in the original signal, i.e., the
distortion is too large. This is because the sigmoid function, whose value range is
[0, 1], is used throughout the encoding and decoding. When the input of the sigmoid
function is negative, the output is a minimal value approaching zero; and when a
value approaching zero is the input, the output is still a minimal value
approaching zero. Therefore, even after multilayer encoding and decoding, the final
output for a negative value remains a minimal value close to zero. When the network
computes the error function, it actually integrates the deviations of all data
points, and the error values of different data points affect each other; for
instance, the deviations before and after encoding and decoding of the positive part

Table 5.4 The initialized sinusoidal simulated signal


Signal The simulated signal formula
Negative signal x(t)1 x(t)1 = −0.5| sin(2 × 16π t+1)|
Positive signal x(t)2 x(t)2 = 0.5| sin(2 × 16π t+1)|
Simple sinusoidal signal x(t)3 x(t)3 = 0.5 sin(2 × 16π t+1)
Scaled and translated signal x(t)4 x(t)4 = 0.4 sin(2 × 16π t+1) + 0.5
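The four simulated signals of Table 5.4 can be generated as follows (a sketch; variable names are illustrative):

```python
import numpy as np

fs, duration = 10240, 8.0
t = np.arange(int(fs * duration)) / fs      # 81,920 sample instants
base = np.sin(2 * 16 * np.pi * t + 1)       # sin(2 x 16 pi t + 1)

signals = {
    "negative":          -0.5 * np.abs(base),   # x(t)1
    "positive":           0.5 * np.abs(base),   # x(t)2
    "simple_sinusoidal":  0.5 * base,           # x(t)3
    "scaled_translated":  0.4 * base + 0.5,     # x(t)4, lies in [0.1, 0.9]
}
```

The scaled-translated signal is the only one confined to the sigmoid's output range [0, 1], which matters for the reconstruction results discussed around Table 5.5.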

Table 5.5 The comparison of the initialized sinusoidal simulated signals and their
reconstructed signals
[Waveform plots of x(t)/mm versus t/s for the negative, positive, simple sinusoidal
and scaled-translated signals, each reconstructed with 16-, 32- and 48-dimensional
target layers]

and of the negative part both exist. In the simple sinusoidal signal, where both
positive and negative values exist, a slight change in the negative part causes the
positive part to skew toward the negative part, which is why only a very narrow
positive part of the reconstructed signal remains positive. In the simple
sinusoidal signal with a 16-dimensional target layer, the unstable abrupt changes
at some junctions of positive and negative values in the two reconstructed signals
are also caused by the interaction of positive and negative errors.

(2) The smaller the dimensionality of the target layer, the greater the distortion.
Comparing the reconstructions at 16, 32 and 48 dimensions, both the non-fine-tuned
and fine-tuned reconstructed signals deviate further from the original signal as
the number of output dimensions decreases, i.e., the distortion grows. For the
positive signal, when the output is 16-dimensional, both reconstructed curves
deviate far from the original and the amplitude and periodicity features are not
obvious; at 32 dimensions, the amplitude and periodicity become apparent and the
reconstructed curve follows the trend of the original, although the peaks of the
two curves are still far apart; at 48 dimensions, the peaks of the reconstructed
signal move further toward those of the original. This shows that more target
dimensions are beneficial to vibration signal reconstruction: more target
dimensions mean more nodes, a stronger capacity to carry and interpret the signal,
and lower distortion after reconstruction. To a certain extent this also shows a
relationship between node number and state-recognition performance in DBN: when the
node number is too small, the network performs poorly.
(3) The impact of fine-tuning is prominent. From the reconstructions of the positive signal,
    the simple sinusoidal signal and the scaled and translated signal, it can be seen that
    the fine-tuned reconstructed signal is closer to the original signal than the unadjusted
    one, indicating that fine-tuning the parameters based on the error between the
    reconstructed and original signals benefits the network's learning. This explains, to
    some extent, why the DBN needs the additional backward fine-tuning step.
Table 5.6 shows the relative mean squared deviation values of the 4 sets of simulated
signals. Comparing the unadjusted and fine-tuned reconstructed signals across the three
target layers, it is easy to see that the fine-tuned error is smaller than the unadjusted
error for every signal except the negative one, which again indicates that fine-tuning helps
the optimization of the network. The unadjusted and fine-tuned reconstruction errors of the
negative signal are identical for all three target dimensions, which again indicates that the
deep belief reconstruction network fails in the negative range.
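The exact formula behind the relative mean squared deviation reported in Tables 5.6, 5.9 and 5.13 is not stated in this excerpt; since the tables give values in mm, one plausible reading is a root-mean-square deviation in the signal's own units, sketched below (the offset reconstruction is a toy stand-in, not the DBN output):

```python
import numpy as np

def rms_deviation(original, reconstructed):
    """Root-mean-square deviation between a signal and its reconstruction,
    reported in the signal's own units (here mm). One plausible reading of
    the 'relative mean square error' columns; the book's exact formula is
    not given in this excerpt."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.sqrt(np.mean((original - reconstructed) ** 2)))

t = np.linspace(0.0, 0.1, 512, endpoint=False)
x = 0.5 * np.abs(np.sin(2 * 16 * np.pi * t + 1))   # the positive signal x(t)2
x_hat = x + 0.01                                   # toy reconstruction offset by 0.01 mm
print(rms_deviation(x, x_hat))                     # → 0.01
```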

(2) Reconstruction analysis of sinusoidal simulation signals

From the above analysis, it can be seen that positive input values are one of the
applicability conditions of the deep network. In fact, although a positive-valued input keeps
positive values after encoding and decoding, the range of the sigmoid function is [0, 1], so
the part of the input exceeding 1 is not suitable for this network. To discuss the effect of
input-signal amplitude on the network, the following signals are simulated based on the
translated and scaled sinusoidal signal. The details are shown in Table 5.7.

Table 5.6 The relative mean square error of the initialized sinusoidal simulated signals

                               16-dimensional (mm)    32-dimensional (mm)    48-dimensional (mm)
                               Normal   Fine-tuned    Normal   Fine-tuned    Normal   Fine-tuned
Negative signal                0.632    0.632         0.632    0.632         0.632    0.632
Positive signal                0.121    0.110         0.048    0.010         0.078    0.019
Simple sinusoidal signal       0.630    0.620         0.625    0.607         0.612    0.577
Scaled and translated signal   0.172    0.132         0.084    0.016         0.020    0.002

Table 5.7 The sinusoidal simulated signals with different amplitudes

Signal                                The simulated signal formula
Signal x(t)1 with amplitude 0.4 mm    x(t)1 = 0.4 × (0.5 sin(2 × 16πt + 1) + 0.5)
Signal x(t)2 with amplitude 0.8 mm    x(t)2 = 0.8 × (0.5 sin(2 × 16πt + 1) + 0.5)
Signal x(t)3 with amplitude 1.0 mm    x(t)3 = 1.0 × (0.5 sin(2 × 16πt + 1) + 0.5)
Signal x(t)4 with amplitude 1.2 mm    x(t)4 = 1.2 × (0.5 sin(2 × 16πt + 1) + 0.5)
Signal x(t)5 with amplitude 1.6 mm    x(t)5 = 1.6 × (0.5 sin(2 × 16πt + 1) + 0.5)

The initial sample length is set to 512 data points, and the three target lengths are
d = {16, 32, 48}. After encoding and decoding with the same reconstructed DBN as
above, the reconstruction results of sinusoidal signals with different amplitudes are
shown in Table 5.8.
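The Table 5.7 inputs can be generated directly. In this sketch the sampling rate is an assumption chosen so that 512 points span 0.1 s (the time axis shown in the reconstruction plots); note how the amplitudes above 1 exceed the [0, 1] output range of the sigmoid, which is what produces the clipping discussed next:

```python
import numpy as np

fs = 5120                      # assumed sampling rate: 512 points in 0.1 s
t = np.arange(512) / fs
amplitudes = [0.4, 0.8, 1.0, 1.2, 1.6]

# Five scaled copies of the translated sinusoid 0.5*sin(2*16*pi*t + 1) + 0.5
signals = {a: a * (0.5 * np.sin(2 * 16 * np.pi * t + 1) + 0.5) for a in amplitudes}

for a, x in signals.items():
    print(a, round(float(x.max()), 3))   # peaks above 1.0 cannot be represented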
As can be seen from Table 5.8, when the original signal amplitude is greater than 1, the
reconstructed signal flattens to a constant value of 1 near the peaks, and the larger the
original amplitude, the more obvious this phenomenon is. When the original amplitude is less
than or equal to 1, the reconstructed signal follows the same trend as the original.
Table 5.9 shows the relative mean squared deviation values of the simulated signals with
different amplitudes.
As can be seen from Table 5.9, the error of the fine-tuned reconstructed signal is smaller
than that of the unadjusted one. For the three signals with amplitude greater than or equal
to 1 mm, the reconstruction error increases significantly with the amplitude. For the signals
with amplitude less than or equal to 1 mm, however, the reconstruction error shows no clear
pattern and remains relatively small: for example, for the signal with an amplitude of
0.8 mm and a 48-dimensional target layer, the error is only 0.008 mm after fine-tuning and
0.022 mm without it. It follows that signals with amplitude greater than 1 are likewise not
suitable for this deep network.

Table 5.8 The comparison of the sinusoidal simulated signals with different amplitudes and
their reconstructed signals

[Plots: each row shows the signal of one amplitude (0.4, 0.8, 1.0, 1.2 and 1.6 mm) together
with its reconstructions for 16-, 32- and 48-dimensional target layers; axes are t/s versus
x(t)/mm.]

Table 5.9 The relative mean square error of the sinusoidal simulated signals with different
amplitudes

                              16-dimensional (mm)    32-dimensional (mm)    48-dimensional (mm)
                              Normal   Fine-tuned    Normal   Fine-tuned    Normal   Fine-tuned
Signal of amplitude 0.4 mm    0.088    0.007         0.152    0.113         0.085    0.023
Signal of amplitude 0.8 mm    0.206    0.160         0.068    0.016         0.022    0.008
Signal of amplitude 1.0 mm    0.324    0.245         0.074    0.029         0.030    0.010
Signal of amplitude 1.2 mm    0.437    0.381         0.198    0.057         0.056    0.029
Signal of amplitude 1.6 mm    0.539    0.387         0.630    0.617         0.622    0.450

(3) Normalized simulation signal reconstruction analysis

From the above analysis, it can be seen that an input range of [0, 1] is necessary for the
deep network, which is determined by the sigmoid function used in the RBM. Whatever range the
visible-layer values belong to, once the activation function transfers them to the hidden
layer, every hidden node takes a value in [0, 1]. However, the original vibration signal
always contains negative values and amplitudes greater than 1. Therefore, before applying the
deep belief network directly, the original data must be scaled so that all values are
restricted to the range [0, 1], which is usually called normalization.
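This property can be illustrated with a minimal sketch: random, untrained RBM-style weights are enough to show that, whatever range the visible values take, every sigmoid activation lands strictly inside (0, 1) (weights, sizes and the omission of biases here are illustrative assumptions, not the book's trained network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
v = 3.0 * rng.standard_normal(512)          # visible vector, values well outside [0, 1]
W = 0.1 * rng.standard_normal((512, 64))    # random weight matrix (biases omitted)

h = sigmoid(v @ W)                          # hidden-layer activations
v_rec = sigmoid(h @ W.T)                    # decoded reconstruction

print(float(h.min()), float(h.max()))       # both strictly inside (0, 1)
print(float(v_rec.min()), float(v_rec.max()))
```

Because the decoder output is also squashed into (0, 1), a visible signal that is negative or exceeds 1 can never be reproduced, which is exactly the failure seen for the negative and large-amplitude signals above.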
Normalization is a common preprocessing step in machine learning and can significantly
improve learning algorithms. Xiao et al. [16] used feature normalization to preprocess
14 types of data; the normalization significantly improved classification accuracy while
reducing the classifier's computation time. Liu [17] investigated the limitations of the
sigmoid activation function in the BP network algorithm, conducted a systematic normalization
study on the input layer, and proposed a new joint normalization method for mechanical fault
diagnosis; practice shows that this normalization improves the convergence speed and
diagnostic accuracy of the network. Liu et al. [18] concluded that complex high-dimensional
data have large numerical differences, which slow down or even prevent the convergence of
machine learning, and compared the effects and applicability of six commonly used
normalization methods on simulated and real data. Assuming the discrete vibration-signal
sequence is xi (i = 1, 2, 3, …, m), several commonly used normalization methods are listed in
Table 5.10.

Table 5.10 The common normalization methods (sums run over i = 1, …, m)

Method   The normalization formula                                        Range of value
1        x̃i = xi / |xmax|                                                 [−1, 1]
2        x̃i = (xi − x̄) / |xmax|                                           [−1, 1]
3        x̃i = xi / (xmax − xmin)                                          [−1, 1]
4        x̃i = (2xi − xmax − xmin) / (xmax − xmin)                         [−1, 1]
5        x̃i = m·xi / Σ xi                                                 [−1, 1]
6        x̃i = m(xi − x̄) / Σ (xi − x̄),  x̄ = (1/m) Σ xi                     [−1, 1]
7        x̃i = (xi − xmin) / (xmax − xmin)                                 [0, 1]
8        x̃i = (xi − x̄) / σ,  σ = √[(1/(m−1)) Σ (xi − x̄)²]                 [−1, 1]
9        x̃i = (2xi + xmax − xmin) / (3xmax − xmin)                        [0, 1]

From the above analysis, it can be seen that if the signal is positive, all of the above
methods achieve [0, 1] after normalization. However, the vibration signal alternates between
positive and negative values, and only linear normalization and translation-scaling
normalization can map such signals to the range [0, 1], so only these two meet the
requirements. In essence, translation-scaling normalization first translates all data to
positive values and then scales the positive-valued data to the [0, 1] range; the derivation
takes two steps.
First, all data are translated to positive values,

    x'i = xi + (xmax − xmin)/2                                            (5.15)
Second, the data are scaled to between 0 and 1,

    x̃i = x'i / x'max = (xi + (xmax − xmin)/2) / (xmax + (xmax − xmin)/2)
        = (2xi + xmax − xmin) / (3xmax − xmin)                            (5.16)
From the derivation of Eqs. 5.15 and 5.16, it can be seen that when the positive amplitude
(maximum positive value) and the negative amplitude (maximum absolute negative value) of the
vibration signal are not equal, translation-scaling normalization does not guarantee a value
range of [0, 1]. If the negative amplitude exceeds the positive amplitude, then
x'i = xi + (xmax − xmin)/2 ≤ 0 can occur, i.e., 0 ≤ x̃i ≤ 1 is not satisfied. The linear
normalization formula only requires that the denominator be nonzero, i.e., that the maximum
of the input signal not equal its minimum; since a vibration signal by definition describes
changing position, a constant-value state

is not possible, i.e., the xmax = xmin does not occur. To facilitate understanding and
simple operation, linear normalization is used in this section.
To explore the effect of normalization on the reconstruction of vibration signals, two
relatively complex signals are simulated. The accumulated variable-amplitude,
variable-frequency vibration signal: x(t) = Σ Ai sin(2π fi t), i = 1, …, n; the multiplied
variable-amplitude, multi-frequency vibration signal: x(t) = A1 sin(2π f1 t) sin(2π f2 t);
where Ai = {2, 4, 6, 8} and fi = {32, 64, 96, 128}. In addition, versions of these two
signals with signal-to-noise ratios of 10 dB and 20 dB are generated; together with the two
original signals, a total of six simulated signals are obtained, as shown in Table 5.11.
The six sets of simulations were preprocessed with linear normalization and then input to a
reconstruction DBN with a hidden-layer structure of 1000 × 800 × 400 and target layers of 16,
32 and 48 dimensions. The reconstruction results are shown in Table 5.12. Because linear
normalization does not change the structure and essential properties of the signal, the
signal after this simple preprocessing can still be called the original signal; the original
signal in the table refers to the normalized original signal.
From Table 5.12, it can be seen that both the noise-free and the noisy signals clearly
reconstruct the original signal after network encoding and decoding. A reconstruction with a
higher-dimensional target layer is closer to the original signal than one with a
lower-dimensional target layer, and the oscillation in the 16-dimensional reconstruction is
not obvious: when high-dimensional data is reduced to low dimensions, some of its information
is attenuated, and when the target-layer dimension is too low this attenuation becomes too
large, causing the reconstructed signal to deviate from
Table 5.11 The normalized simulation signals (sums run over i = 1, …, n)

Signal                                       The simulated signal formula
The accumulated signal (SNR = 0 dB) x(t)1    x(t)1 = Σ Ai sin(2π fi t + 1)
The accumulated signal (SNR = 10 dB) x(t)2   x(t)2 = Σ Ai sin(2π fi t + 1) + 10 dB noise
The accumulated signal (SNR = 20 dB) x(t)3   x(t)3 = Σ Ai sin(2π fi t + 1) + 20 dB noise
The multiplied signal (SNR = 0 dB) x(t)4     x(t)4 = A1 sin(2π f1 t + 1) sin(2π f2 t + 1)
The multiplied signal (SNR = 10 dB) x(t)5    x(t)5 = A1 sin(2π f1 t + 1) sin(2π f2 t + 1) + 10 dB noise
The multiplied signal (SNR = 20 dB) x(t)6    x(t)6 = A1 sin(2π f1 t + 1) sin(2π f2 t + 1) + 20 dB noise
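The noisy variants can be generated by reading "+ 10 dB" as additive white Gaussian noise at a 10 dB signal-to-noise ratio; the sampling rate below is an assumption (512 points in 0.1 s), and the measured SNR of the result can be checked against the target:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(x, snr_db):
    """Add white Gaussian noise at the requested signal-to-noise ratio in dB."""
    noise_power = np.mean(x ** 2) / 10 ** (snr_db / 10)
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

fs = 5120                        # assumed sampling rate: 512 points in 0.1 s
t = np.arange(512) / fs
A = [2, 4, 6, 8]
f = [32, 64, 96, 128]

accumulated = sum(a * np.sin(2 * np.pi * fi * t + 1) for a, fi in zip(A, f))
multiplied = A[0] * np.sin(2 * np.pi * f[0] * t + 1) * np.sin(2 * np.pi * f[1] * t + 1)

acc_10db = add_noise(accumulated, 10)
noise = acc_10db - accumulated
measured_snr = 10 * np.log10(np.mean(accumulated ** 2) / np.mean(noise ** 2))
print(round(float(measured_snr), 1))   # close to the requested 10 dB
```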

Table 5.12 The comparison of the normalized simulation signals and their reconstructed
signals

[Plots: each row shows one of the six normalized signals (the accumulated signal at
SNR = 0, 10 and 20 dB and the multiplied signal at SNR = 0, 10 and 20 dB) together with its
reconstructions for 16-, 32- and 48-dimensional target layers; axes are t/s versus x(t)/mm.]

the original signal. This phenomenon also shows that when a DBN is used for dimensionality
reduction, the target dimension has an applicable range and should not be too small.
Secondly, the larger the signal-to-noise ratio, the smaller the noise and the less it
interferes with the original signal, so the reconstruction after dimensionality reduction,
encoding and decoding is closer to the original. On the other hand, all four noisy signals
can be reconstructed by the DBN, and the reconstructed signal is smoother than the original:
the noise is weakened to some extent. In going from high-dimensional to low-dimensional data,
part of the high-dimensional information cannot be fully expressed and is filtered out by the
dimensionality-reduction and reconstruction process. Noise is an irregular random signal that
is difficult to express in the reduced representation, so part of it is filtered out when the
original signal is reconstructed. It follows that dimensionality reduction based on the
reconstruction DBN also performs a degree of noise reduction. Comparing the reconstructions
at signal-to-noise ratios of 20 dB and 10 dB shows that as the noise increases (i.e., as the
signal-to-noise ratio decreases), the distortion between the reconstructed and original
signals grows and the amplitude of the reconstructed signal is compressed, because excessive
random noise interferes with the encoding and decoding of the normal information.
Nevertheless, although larger noise interferes with the expression of normal information,
characteristics of the signal such as its periodicity do not change. Table 5.13 shows the
relative mean squared deviation values of the normalized simulated signals.
From Table 5.13, the same phenomenon can be observed: the error of the fine-tuned
reconstructed signal is smaller than that of the unadjusted one, and, except for the
noise-free accumulated signal, the error at a high target dimension is smaller than at a low
one. However, the noise strength shows no obvious regularity with respect to the relative
mean squared deviation, which may be caused by the randomness of the noise.

5.2.3 DBN Based Fault Classification

As defined for deep learning, a deep learning network can discover distributed feature
representations of data by combining low-level features into more abstract high-level
representations. In other words, a deep learning network can learn directly from low-level
data and extract high-level features layer by layer. If the low-level signal is the original
signal, or the original data after simple preprocessing (preprocessing that does not change
the original data's structure and essential properties, such as normalization; for
convenience, preprocessed data in this section is still called the original data, and the
original data hereinafter refers to the original data after normalization), deep

Table 5.13 The relative mean square error of the normalized simulation signals

                                        16-dimensional (mm)    32-dimensional (mm)    48-dimensional (mm)
                                        Normal   Fine-tuned    Normal   Fine-tuned    Normal   Fine-tuned
The accumulated signal (SNR = 0 dB)     0.035    0.007         0.205    0.126         0.011    4.3e−5
The accumulated signal (SNR = 10 dB)    0.153    0.104         0.052    0.005         0.011    0.001
The accumulated signal (SNR = 20 dB)    0.141    0.081         0.045    0.020         0.011    5.3e−5
The multiplied signal (SNR = 0 dB)      0.217    0.163         0.040    0.008         0.014    4.2e−5
The multiplied signal (SNR = 10 dB)     0.092    0.074         0.038    0.013         0.026    0.009
The multiplied signal (SNR = 20 dB)     0.135    0.114         0.029    0.010         0.009    0.002

learning does not require a manual feature extraction and selection process and can learn
discriminative features on its own. This characteristic reduces the complexity and
uncertainty introduced by the traditional feature extraction and selection process, greatly
improves the operability of machine learning, and enhances its intelligence. A comparison of
the traditional analysis process and the deep learning process in the field of intelligent
fault diagnosis and recognition is shown in Fig. 5.7.

5.2.3.1 Raw Data Based Bearing Fault Classification

To facilitate comparison, the raw bearing-fault data of Sect. 5.2.1 is used again. The
bearing dataset contains three health states: normal, outer-ring fault and inner-ring fault,
and each fault corresponds to three degrees of severity, so there are seven sets of
health-state data in total. During the experiment the shaft speed is 1400 r/min, the sampling
frequency is 24 kHz and the sampling time is 20 s, so there are 480,000 valid data points in
total. The specific experimental conditions are shown in Table 5.1. The sensor collects about
60/1400 × 24,000 ≈ 1028 ≈ 2^10 = 1024 data points in one revolution. In order to explore the
DBN's ability to classify the bearing fault states using the raw data directly,

Fig. 5.7 The procedure of traditional analysis and deep learning analysis

every 1024 data points of the collected raw data are taken as one sample, and 160 samples are
obtained for each bearing state. Of these, 80 samples are randomly selected as training
samples and the remaining 80 as testing samples, so 560 training samples and 560 testing
samples covering the seven bearing states of Table 5.1 are finally obtained. The total number
of iterations is set to 100, and the results are averaged over 20 repetitions.
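The sampling arithmetic above can be checked with a few lines (values taken from the experiment description; note that up to 468 non-overlapping 1024-point samples fit in each recording, of which the text uses 160):

```python
# Back-of-envelope check of the sampling arithmetic in the text.
rpm = 1400                 # shaft speed, r/min
fs = 24_000                # sampling frequency, Hz
duration = 20              # seconds recorded per health state

points_per_rev = fs * 60 / rpm          # ≈ 1028.6, close to 2**10 = 1024
total_points = fs * duration            # 480,000 valid data points per state
max_samples = total_points // 1024      # non-overlapping 1024-point samples

print(int(points_per_rev), total_points, max_samples)   # 1028 480000 468
```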
(1) Hidden Layer Combination Analysis
First, the hidden layers can be set by referring to the empirical formula of the BP neural
network. From Eq. 5.3 it can be obtained that the first hidden-layer node number is
S1 = √(1024 × 7) + 2k = 85 + 2k, and S1 = 90 is taken; the second hidden-layer node number is
S2 = √(90 × 7) + k2 = 25 + k2, and S2 = 25 is taken; the third hidden-layer node number is
S3 = √(25 × 7) + k2 = 13 + k2, and S3 = 15 is taken. As a result, a DBN with 3 RBMs can be
built with the hidden-layer combination 90 × 25 × 15. Similarly, referring to Eq. 5.2, the
"Constant value" combination of hidden layers is set to 35 × 10 × 10. Referring to Eq. 5.4,
the "Increased value" combination of hidden layers is set to 90 × 25 × 15. Referring to
Eq. 5.5, the "Descend value" combination of hidden layers is set to 10 × 10 × 10.
Referring to Eq. 5.6, the “Fluctuation value” combination of hidden layers is set to 85
× 20 × 10. The combination of hidden layers obtained by referring to the empirical
formula of the BP neural network is called the “BP empirical formula” combination.
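Eq. 5.3 itself lies outside this excerpt; the rule sketched below, S = √(n_in · n_out) plus a small adjustment constant, is inferred from the values computed above and reproduces them with the constant set to zero:

```python
import math

def hidden_nodes(n_in, n_out):
    """Inferred BP empirical rule: S = round(sqrt(n_in * n_out)), before the
    small adjustment constant (2k or k2 in the text) is added."""
    return round(math.sqrt(n_in * n_out))

s1 = hidden_nodes(1024, 7)   # 85, rounded up to 90 in the text
s2 = hidden_nodes(90, 7)     # 25
s3 = hidden_nodes(25, 7)     # 13, set to 15 in the text
print(s1, s2, s3)            # 85 25 13
```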
The number of hidden layer nodes obtained from the empirical formula is relatively
small, and the maximum number of nodes in several combinations is only 90, which
is much less than the input node number of 1024. A deep learning network can mine
the distributed feature representation of data, but the distributed features of the data
are not easily expanded when the number of nodes of the network is small and the

data features are also not easily mined. The different combinations of hidden layers
are set, as shown in Table 5.14.
(1) BP empirical formula. The results of the different hidden-layer combinations based on
    the BP empirical formula are shown in Fig. 5.8. As the figure shows, the accuracy
    increases significantly with the number of iterations. Among the four combinations,
    networks with fewer total nodes generally perform worse. The 90 × 25 × 15 combination
    obtains the best result, but it is below 80%; the 85 × 20 × 10 combination is second
    best at below 60%; the 35 × 10 × 10 combination is third at below 40%; and the
    10 × 10 × 10 combination is worst at below 20%. In Sect. 3.2.1, a model using common
    statistical features as input achieved an accuracy of over 90%, far higher than the BP
    empirical formula combinations. This comparison shows that hidden-layer combinations
    obtained from the BP empirical formula are not desirable.

Table 5.14 The different combinations of hidden layers

                       First layer   Second layer   Third layer   Total node number
BP empirical formula   10            10             10            30
                       35            10             10            55
                       85            20             10            115
                       90            25             15            130
Constant value         512           512            512           1536
                       1024          1024           1024          3072
                       1536          1536           1536          4608
                       2048          2048           2048          6144
Increased value        512           1024           1536          3072
                       512           1024           2048          3584
                       1024          1536           2048          4608
Descend value          1536          1024           512           3072
                       2048          1024           512           3584
                       2048          1536           1024          4608
Fluctuation value      512           1024           512           2048
                       1024          1536           1024          3584
                       1536          2048           1536          5120
                       2048          1536           2048          5632
                       1536          1024           1536          4096
                       1024          512            1024          2560

Fig. 5.8 The fault classification accuracy of the BP empirical formula combination

(2) Constant value combination. The fault classification results for the "constant value"
    hidden-layer combinations are shown in Fig. 5.9. The highest classification accuracy of
    all four combinations exceeds 85%, clearly much better than the BP empirical formula
    combinations. Among the four, the best is the 1024 × 1024 × 1024 combination, whose
    highest accuracy is above the others and which also needs fewer iterations. The
    next-best combinations, in order, are 1536 × 1536 × 1536, 512 × 512 × 512 and
    2048 × 2048 × 2048. The input data is 1024-dimensional and the output is 7-dimensional
    (the number of fault-state categories); comparing the four constant-value combinations
    shows that hidden layers whose size is close to the input dimension work better. When
    the number of nodes is close to the input dimension, the distributed features of the
    data unfold more easily. When the number of nodes is much smaller than the input
    dimension, some details of the initial data may not be expressed, lowering the
    recognition performance; when it is much larger, too many nodes analyzing the same
    information may interfere with one another.

(3) Increased value combination. The fault classification results for the increased-value
    hidden-layer combinations are shown in Fig. 5.10. The highest classification accuracy of
    all three combinations exceeds 90%, a very good recognition performance. In order of
    performance, the combinations are 1024 × 1536 × 2048, 512 × 1024 × 2048 and
    512 × 1024 × 1536. It is clear that the greater the total number of hidden-layer nodes,
    the better the classification performance of

Fig. 5.9 The fault classification accuracy of the constant value combination

the network. This is because, within a certain range, as the number of nodes increases, the
network can more easily mine the distributed features of the data from its details and better
interpret the data.

(4) Descend value combination. The fault classification results for the descend-value
    hidden-layer combinations are shown in Fig. 5.11; the highest classification accuracy of
    all three combinations is around 90%. The hidden-layer node order of the descend-value
    combinations is the reverse of that of the increased-value combinations, and comparison
    shows that the classification performance of the increased-value combinations
Fig. 5.10 The fault classification accuracy of the increased value combination

Fig. 5.11 The fault classification accuracy of the descend value combination

is better than that of the descend-value combinations. Secondly, comparing the highest
classification accuracies of the three combinations leads to the same conclusion as for the
increased-value combinations: the greater the total number of hidden-layer nodes, the better
the classification performance.

(5) Medium-convex combination. The classification results for the medium-convex hidden-layer
    combinations are shown in Fig. 5.12; all three combinations achieve a high accuracy of
    over 95%. The 512 × 1024 × 512 and 1024 × 1536 × 1024 combinations are significantly
    better than the 1536 × 2048 × 1536 combination. Although 512 × 1024 × 512 outperforms
    1024 × 1536 × 1024 in the early iterations, the difference between them is small later
    on, and their highest classification accuracy exceeds 98%. The 1536 × 2048 × 1536
    combination deviates far from the 1024-dimensional input, its accuracy is relatively
    low, and it needs more iterations. Thus, among the medium-convex combinations, those
    with node counts close to the input data dimension perform better.

(6) Medium-concave combination. The classification results for the medium-concave
    hidden-layer combinations are shown in Fig. 5.13. The accuracy of the 1536 × 1024 × 1536
    combination exceeds 98%, while the highest accuracy of the other two groups is 87.5%.
    The node counts of the 1536 × 1024 × 1536 combination are closer to the 1024-dimensional
    input, while the other two groups deviate relatively far. The medium-concave
    combinations thus lead to the same conclusion: combinations with node counts close to
    the input data dimension perform better.

Fig. 5.12 The fault classification accuracy of the medium-convex combination

Fig. 5.13 The fault classification accuracy of the medium-concavity combination

From the above analysis, it can be seen that the hidden-layer structure of the DBN has a
great influence on the results, and a good combination is an important condition for a good
classification effect. The increased-value hidden-layer combinations perform relatively
better when the node counts are close to the input data dimension and the total number of
nodes is relatively high. Secondly, there exists

more than 90% correct classification rate, which proves the feasibility of more accu-
rate classification and recognition of bearing multi-fault states by using DBN directly
through raw data.
(2) Sample Length Analysis
When the original data is used directly as the input of the deep belief network, besides the
network structure itself, the sample length is another important factor affecting the
classification performance. To make full use of the limited data resources, the sample data
is obtained by the "continuous non-repetitive" interception method shown in Fig. 5.14.
Starting from the beginning of the vibration signal, n data points are intercepted as
sample 1; starting from the (n + 1)th data point, the next n data points are intercepted as
sample 2; and so on, with sample N starting from the ((N − 1)n + 1)th data point. Finally an
N × n sample database is obtained, from which the corresponding training and testing sample
databases can be randomly selected.
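The interception procedure of Fig. 5.14 amounts to a simple reshape; a minimal sketch (the stand-in signal is illustrative, sized to match one health state's 480,000-point recording):

```python
import numpy as np

def segment_signal(x, n):
    """'Continuous non-repetitive' interception (Fig. 5.14): cut the 1-D signal
    into consecutive, non-overlapping samples of length n."""
    x = np.asarray(x)
    N = len(x) // n                 # number of complete samples that fit
    return x[: N * n].reshape(N, n)

x = np.arange(480_000)              # stand-in for one health state's recording
samples = segment_signal(x, 1024)
print(samples.shape)                # (468, 1024)
```

In 0-based indexing, sample k starts at data point k·n, matching the text's "((N − 1)n + 1)th data point" in 1-based counting.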
From Fig. 5.14, assume that the sensor collects a total of Xn valid data points. When the
"continuous non-repetitive" interception method is used to obtain the sample

Fig. 5.14 The method of continuous sampling without duplicate


306 5 Deep Learning Based Machinery Fault Diagnosis

data, there is a relationship: the number of samples N × sample length n ≤ X n. Theo-


retically, the sample length can be taken in the range of [1, X n], which obviously
does not meet the practical application. In the above case study, the number of data
points collected in a week is directly taken as the sample length, and the better clas-
sification recognition effect indicates that the sample length has a certain reference
value. As can be seen from Fig. 5.14, when the sample length n is less than one
week’s data points, the database containing multiple samples also contains the data
of the whole cycle after the “continuous non-repetitive” interception.
In order to study the effect of the sample length within one revolution on the DBN
algorithm, the data are divided into 16 groups (1024/64 = 16) of different sample
lengths with a step size of 64 data points. To ensure that each sample length has the
same number of training and testing samples, 80 training samples and 80 testing
samples are randomly selected for the calculation. By combining the 7 types of data
in Table 5.1, a total of 560 training samples and 560 test samples are obtained. In
this study, the longest input sample length is 1024 data points and the shortest is
64 data points; the number of hidden layer nodes is set to 500, close to the average
sample length, and a "constant value" hidden-layer combination is constructed, i.e.,
a DBN with a 500 × 500 × 500 hidden-layer structure. The number of iterations is set
to 100, and each result is the average of 20 runs.
Table 5.15 shows the fault identification results for the 16 groups of data after 100
iterations of DBN calculation. From Table 5.15, it can be seen that the classification
accuracy of all 16 groups is high: except for the sample lengths of 64, 128 and 192
data points, the remaining 13 groups exceed 90%, which again indicates the feasibility
of accurately classifying and recognizing bearing fault states directly from the raw
data using DBN, and further shows that DBN can form more abstract high-level
representations by combining low-level features. From group 8 to group 16 the
classification accuracy improves slightly, but the difference is very small; once the
sample length grows beyond a certain point, increasing it further is less effective in
improving the classification accuracy. On the other hand, this shows that when the
sample length is within the range of half a revolution to one revolution, a good
recognition effect can be obtained.
Figure 5.15 shows the relationship between the correct classification rate of bearing
faults and the number of iterations for sample lengths of 64, 256, 512 and 1024 data
points. From Fig. 5.15, before any iterations the classification accuracy is below
40% in every case; as iterations begin, the correct classification rate rises sharply;
beyond 20 iterations it rises slowly, with occasional slight fluctuations. Although
Table 5.15 shows that some of the iteration counts corresponding to the highest
classification rates are close to 100, Fig. 5.15 shows that, while more iterations do
benefit the fault classification effect, beyond a certain number (e.g., more than 20)
the classification rate improves little while the computational cost increases greatly.

Table 5.15 The fault conditions classification of data with different sample lengths
Group Length Accuracy (%) Iterations Group Length Accuracy (%) Iterations
1 64 66.7 92 9 576 98.3 35
2 128 81.5 85 10 640 98.9 37
3 192 89.6 82 11 704 99.0 33
4 256 94.2 72 12 768 99.0 42
5 320 95.6 57 13 832 99.2 31
6 384 96.1 53 14 896 99.3 41
7 448 97.2 48 15 960 99.2 23
8 512 97.6 52 16 1024 99.4 29

Therefore, in practice, fault state recognition should not one-sidedly pursue
classification accuracy; computational economy must also be considered comprehensively.
(3) Time Cost Control Analysis
The time complexity of an algorithm is a function O(S, t) that quantitatively
describes the time the algorithm consumes; it depends on the number of executions S
of the basic operation and the time t required to execute the basic operation once.
Assuming the basic operation executes once per unit of time, the numbers of
computations in different structures can be added directly, and the cumulative number
of computations is, to some extent, equivalent to the time complexity of the
algorithm. The higher the cumulative count, the greater the time complexity, the
longer the algorithm takes, and the less economical it is.

Fig. 5.15 The relationship between the classification accuracy of bearing faults and the number of
iterations

It is generally difficult to calculate rigorously the time complexity of a complete
algorithm whose basic operations are executed a complicated number of times. However,
the time an algorithm consumes is determined mainly by the basic operations executed
the most times, so usually only the most frequently executed basic operations are of
interest. Commonly used methods for calculating time complexity include the summation
method, the hypothesis method, the iteration method, the direct calculation method,
the roundabout calculation method, and the recursive calculation method. Time
complexity calculation usually follows the principles below.
(1) All addition constants are replaced by the constant 1 and only non-constant
terms are retained.
(2) Only the highest order term is retained in the same level of function calculation
(if there are multi-parameter expressions, multiple parameters are retained).
(3) Remove terms of very small order.
(4) Delete the constant in front of the highest term.
According to principle (4), deleting the constant in front of the highest-order term
yields the same time complexity for counts that differ greatly in the number of
executions. For example, if the numbers of calculations are n³ and 8n³, the time
complexity of both is O(n³), which is inconvenient for the subsequent work. Therefore,
we focus on the highest number of executions of the basic operations and follow only
principles (1)–(3) in the time complexity calculation; without affecting understanding,
the time complexity referred to below is equivalent to this highest number of
executions.
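The simplification rules can be illustrated with a toy helper (purely illustrative, not from the text), which shows why n³ and 8n³ collapse to the same order:

```python
def dominant_order(terms):
    """Apply rules (1)-(4): drop constant terms and lower-order terms,
    strip the leading coefficient, and report the order of growth.
    `terms` maps exponent -> coefficient, e.g. {3: 8, 2: 5, 0: 7}."""
    nonconst = {e: c for e, c in terms.items() if e > 0}
    if not nonconst:
        return "O(1)"          # rule (1): only constants remain
    return f"O(n^{max(nonconst)})"  # rules (2)-(4): keep highest order

print(dominant_order({3: 8, 2: 5, 0: 7}))  # 8n^3 + 5n^2 + 7 -> O(n^3)
print(dominant_order({3: 1}))              # n^3             -> O(n^3)
```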
Forward stacked RBM learning and backward fine-tuning learning are the keys of the
DBN algorithm, and their time complexity determines the elapsed time of the whole DBN
algorithm. Assume a DBN consists of 3 stacked RBMs with hidden layers of m1, m2 and
m3 nodes, respectively, with input sample size N, sample length n, and iteration
number K. Ignoring constant terms, the steps of the CD-based RBM algorithm show that,
for a single sample, generating P(h(1) = 1 | v(1)) takes nm1 computations, sampling
h(1) takes m1² computations, generating P(v(2) = 1 | h(1)) takes m1·n computations,
sampling v(2) takes n² computations, generating P(h(2) = 1 | v(2)) takes nm1
computations, sampling h(2) takes m1² computations, and the parameter update takes
m1·n + n + m1 computations. The total number of computations of N samples in the
first RBM after K iterations is:

F1 = KN(3nm1 + 2m1² + n²) (5.17)

Similarly, the total calculation times of the second and third RBM layers are F2 and
F3, respectively:

F2 = KN(3m1·m2 + 2m2² + m1²) (5.18)

F3 = KN(3m2·m3 + 2m3² + m2²) (5.19)

Therefore, the calculation number of forward stacked RBM learning is

F = F1 + F2 + F3 (5.20)

Similarly, the number of calculations for backward fine-tuning learning is,

R = KN(nm1 + m1·m2 + m2·m3) (5.21)

The total count of forward and backward learning is,

S=F+R (5.22)

S = KN(n² + 2m1² + 2m2² + m3² + 4nm1 + 4m1·m2 + 4m2·m3) (5.23)
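The per-sample operations counted above can be sketched as a CD-1 update of a Bernoulli RBM in NumPy. This is a minimal illustrative sketch, not the authors' implementation; the layer sizes, learning rate and random seeds are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=None):
    """One CD-1 update of a Bernoulli RBM with n visible and m hidden
    units. The matrix products are the nm-type operation counts in
    Eq. 5.17, and the random comparisons realise the sampling steps."""
    rng = rng or np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + c)                  # P(h(1) = 1 | v(1))
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample h(1)
    pv1 = sigmoid(h0 @ W.T + b)                # P(v(2) = 1 | h(1))
    v1 = (rng.random(pv1.shape) < pv1) * 1.0   # sample v(2)
    ph1 = sigmoid(v1 @ W + c)                  # P(h(2) = 1 | v(2))
    m = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / m    # parameter update
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(1)
n_vis, n_hid = 8, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)
v = (rng.random((16, n_vis)) < 0.5) * 1.0
W, b, c = cd1_step(v, W, b, c, rng=rng)
```

Stacking three such RBMs and iterating K times over N samples yields the operation counts summed in Eqs. 5.17–5.23.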

The partial derivative of a function with respect to a parameter is equivalent to the
unit change of the function value in the direction of that parameter, and can be
interpreted as the sensitivity of the function to the parameter. The larger the
absolute value of the partial derivative, the more sensitive the function is to the
parameter and the greater the impact of the parameter on the function value, so the
function value can be controlled by adjusting that parameter. Taking the partial
derivatives of the total number of calculations S with respect to each parameter (all
of which are positive), the parameters can be ordered by the size of their partial
derivatives; adjusting first the parameters with larger partial derivatives controls
the total calculation time more quickly.
As can be seen from Eq. 5.23, six parameters in total enter the total count S. Sorting
the parameters strictly by the size of their partial derivatives and then adjusting
the larger ones first would be inconvenient and difficult to calculate. Considering
that the numbers of hidden layer nodes have a certain symmetry and are relatively
stable, the DBN structure can be assumed fixed, i.e., the hidden layer sizes m1, m2,
m3 are taken as definite values. In general, the sample size is related to the number
of data points collected and is also relatively stable, so N is likewise assumed
fixed. We therefore focus on the influence of the sample length of the raw data and
the number of iterations on the DBN, i.e., the total number of calculations S is
regarded as a function of only two variables, the number of iterations K and the
sample length n. Compare the magnitudes of the partial derivatives of S with respect
to K and n. When ∂S/∂K > ∂S/∂n, the number of iterations K is more sensitive than the
sample length n, and adjusting K controls the computational cost of the DBN more
effectively. When ∂S/∂K < ∂S/∂n, adjusting the sample length is more effective in
controlling the computational cost.
In the bearing fault classification experiments, if the number of iterations K is more
sensitive than the sample length n, then, under the premise of achieving a given
target classification rate, the minimum number of iterations K is chosen first and
then the minimum sample length n, which controls the computational cost better and
improves the computational speed.
Combined with DBN theory, the process of iteration number and sample length
selection is shown in Fig. 5.16, and the main steps are as follows.

(1) Collect vibration data, and set up the training and test data sets.
(2) Set the target classification accuracy Qm, the maximum number of iterations Km,
and the sample length growth step t.
(3) Assign the initial value K0 to the number of iterations K and the initial value
n0 to the sample length n to form the RBM structure.
(4) Input the training data into the RBM1 visible layer v and learn its hidden layer
h. Feed the hidden layer h of RBM1 into the RBM2 visible layer v and learn its
hidden layer h. Feed the hidden layer h of RBM2 into the RBM3 visible layer v and
learn its hidden layer h.
(5) Input the hidden layer h of RBM3 into the Softmax classification model and
calculate the classification error J.
(6) Fine-tune the parameters from the highest layer to the lowest layer to obtain the
complete trained DBN model.
(7) Input the test data into the resulting model and calculate the correct
classification rate of faults. Steps (4)–(7) constitute one complete DBN
calculation.
(8) Judge the condition Q ≥ Qm or K ≥ Km. If it is not satisfied, judge whether
n ≥ nm (the maximum sample length), increase K or the sample length n accordingly,
and repeat steps (4)–(7). If the condition is satisfied, terminate the calculation
and output the classification accuracy Q, the number of iterations K, and the
sample length n.

Fig. 5.16 The procedure of parameter selection in DBN
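The steps above can be sketched as a selection loop. Here `evaluate(K, n)` is a hypothetical stand-in for one full DBN train/test cycle (steps (4)–(7)), and growing n before K is one plausible reading of step (8), reflecting that K is the more expensive parameter in this case:

```python
def select_parameters(evaluate, Qm=0.98, Km=100, K0=5, n0=64, t=8, nm=1024):
    """Sketch of the loop in steps (1)-(8). `evaluate(K, n)` returns the
    classification accuracy Q of one complete DBN calculation; any
    callable can be plugged in."""
    K, n = K0, n0
    while True:
        Q = evaluate(K, n)            # steps (4)-(7)
        if Q >= Qm or K >= Km:        # step (8): termination condition
            return Q, K, n
        if n + t <= nm:
            n += t                    # grow the cheaper sample length first
        else:
            K, n = K + 1, n0          # only then grow the iteration count

# Toy accuracy model, purely illustrative.
Q, K, n = select_parameters(lambda K, n: 0.5 + 0.0005 * n + 0.01 * K)
print(K, n)  # 5 864
```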
Substituting N = 560 and m1 = m2 = m3 = 500 into Eq. 5.23 gives the total number of
calculations:

S = 560K(n² + 2000n + 3,250,000) (5.24)

∂S/∂K = 560(n² + 2000n + 3,250,000) (5.25)

∂S/∂n = 560K(2n + 2000) (5.26)

From Eq. 5.24, the number of calculations of the DBN increases with both the iteration
number K and the sample length n, so the calculation cost of the DBN can be controlled
by adjusting K and n. From Fig. 5.15 and Table 5.15 we know that 0 < K ≤ 100 and
0 < n ≤ 1024; substituting into Eqs. 5.25 and 5.26 gives
∂S/∂K ∈ [1.82 × 10^9, 3.55 × 10^9] and ∂S/∂n ∈ [1.12 × 10^6, 2.27 × 10^8], that is,
∂S/∂K ≫ ∂S/∂n. This means that in this case, when the number of samples and the DBN
structure are fixed, K is far more sensitive than n, and adjusting the value of K has
a more obvious impact on controlling the economy of the DBN. Therefore, under the
premise of achieving a given classification rate, the smallest number of iterations is
chosen first and then the smallest sample length, which controls the computational
cost more quickly and improves the economy. The operation flow is shown in Fig. 5.16.
In this case, the target classification accuracy is set to 98%, the initial sample
length to 64 points, the step length to 8 points, and the maximum number of iterations
to 100. On a computer with an Intel(R) Xeon(R) CPU E3-1230 processor (3.3 GHz) and
8 GB of memory, the best sample length of 664 data points is obtained after about 13 h
of calculation, with 5 iterations, at which point the fault identification time is
16 s. If instead the minimum sample length is preferred and the minimum number of
iterations is selected afterwards, the best result is obtained only after about 34 h
of calculation in the same computing environment, about 2.6 times the calculation time
of the above method; the resulting optimal sample length is 424 data points with 46
iterations, and the fault identification time is about 192 s, roughly 12 times that of
the above method. This shows that preferentially adjusting the parameters with larger
time-complexity partial derivatives controls the DBN calculation cost better. In
actual operation, the sample length and number of iterations selected in this way can
be used as reference values for bearing fault classification and recognition.
(4) Traditional Methods Comparison
To further verify the performance of the algorithm, the bearing data were processed in
the following six ways with reference to Fig. 5.7 and then input into the Softmax
model for classification and identification:
M1: Input the raw data directly.
M2: Select the common vibration signal feature kurtosis as the input.
M3: Extract multivariate features by the method proposed in [11] and use all 14 sets
of features as the input.
M4: Select the dimensionless statistical features from M3 as the input, namely the
waveform, peak, pulse and margin indicators.
M5: Draw the time–frequency map after wavelet transform, and use the time–frequency
image features (300 × 400 pixels) directly as the input.
M6: Compress the time–frequency map of M5 to 10 × 10 pixels by Two-Dimensional
Principal Component Analysis (TD-PCA) [19], and then input the compressed image
features to the network.
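The statistical features used in M2 and M4 can be computed as follows. The indicator definitions here follow common vibration-monitoring conventions and may differ in detail from those of [11]:

```python
import numpy as np

def vibration_features(x):
    """Kurtosis (M2) and the four dimensionless indicators of M4."""
    x = np.asarray(x, dtype=float)
    abs_mean = np.mean(np.abs(x))
    rms = np.sqrt(np.mean(x**2))
    peak = np.max(np.abs(x))
    root_ampl = np.mean(np.sqrt(np.abs(x)))**2
    sigma = np.std(x)
    return {
        "kurtosis": np.mean((x - x.mean())**4) / sigma**4,
        "waveform": rms / abs_mean,    # waveform (shape) indicator
        "peak":     peak / rms,        # peak (crest) indicator
        "pulse":    peak / abs_mean,   # pulse (impulse) indicator
        "margin":   peak / root_ampl,  # margin (clearance) indicator
    }

# Sanity check on a pure sine, whose kurtosis is 1.5.
feats = vibration_features(np.sin(np.linspace(0, 20 * np.pi, 1024)))
```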
The fault classification results obtained by the above six methods are compared
with DBN, which is shown in Fig. 5.17.

Fig. 5.17 The fault classification accuracy of 6 ways



As shown in Fig. 5.17, M1 obtains the worst classification accuracy, less than 20%.
This indicates that the Softmax classification model by itself is unable to classify
and identify faults directly from the raw data. When only kurtosis is used as the
input (M2), the classification accuracy is still poor, less than 30%, indicating that
the model has a limited ability to identify faults from a single univariate feature.
When the 14 sets of multivariate features are used as input (M3), the classification
accuracy improves significantly. When only the dimensionless indicator features are
used as input (M4), the recognition performance is worse than M3. This is because
different variables portray the vibration signal from different aspects and carry
different biases of fault information, so the number and types of multivariate
features directly affect the fault recognition performance. When image features are
used as input (M5), the classification accuracy decreases as the sample length
increases, and the highest recognition accuracy is below 80%. After TD-PCA compression
(M6), the classification performance on data with longer sample lengths improves, but
that on data with shorter sample lengths decreases instead. It can be seen that in the
time–frequency diagram the valuable fault information is seriously disturbed by other
information, and the classification performance of time–frequency features is
unstable. In contrast, the classification accuracy of DBN exceeds 90% from group 4
onwards (sample length of 256 data points) and gradually improves as the sample length
increases. In summary, multivariate features require selecting the number and type of
features, and time–frequency graph features require additional processing; both depend
on human factors and weaken the intelligence of machine learning.
(5) Experimental Study
Taking gear and bearing faults as an example, the classification and recognition of
transmission faults by DBN based on original data are discussed.
(1) Experimental setting. The same transmission fault test platform and transmission
as in Sect. 5.2.1 are used for the experiments. The arrangement of measuring
points is shown in Fig. 5.18.
The bearing and gear faults are machined by a wire-cut machine, and several fault
types are set: a mild broken tooth, a moderate broken tooth, a single broken tooth,
and a bearing inner ring fault with 0.2 mm width. The physical faults are shown in
Fig. 5.19.
The faulty bearing (item number: NUP311EN) is mounted at the connection of the output
shaft and the box body (embedded in the box body); its parameters are shown in
Table 5.16.
Combining the above faults yields the eight groups of fault-combination experiments
listed in Table 5.17. For example, in the seventh group the simulated fault state is a
moderate broken tooth on the fifth gear together with a bearing inner ring fault of
0.2 mm width.
In order to reduce the attenuation of the fault signal transmission and obtain the
real information of the signal as much as possible, the measurement point location,
i.e., the acceleration sensor installation location, should be as close as possible to

Fig. 5.18 The arrangement of the monitoring point

a) Mild broken tooth b) Moderate broken tooth

c) Single broken tooth


d) Inner ring fault with 0.2mm width

Fig. 5.19 The gear and bearing faults



Table 5.16 The parameters of the fault bearing
Outside diameter dO: 120 mm
Bore diameter dI: 55 mm
Pitch diameter dp: 85 mm
Number of balls z: 13
Ball diameter dB: 18 mm
Contact angle α: 0°

Table 5.17 The details of fault conditions
Health condition   Fifth gear               Inner ring
1                  Normal                   Normal
2                  Mild broken tooth        Normal
3                  Moderate broken tooth    Normal
4                  Single broken tooth      Normal
5                  Normal                   Fault with 0.2 mm width
6                  Mild broken tooth        Fault with 0.2 mm width
7                  Moderate broken tooth    Fault with 0.2 mm width
8                  Single broken tooth      Fault with 0.2 mm width
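Assuming, as the surrounding text suggests, that the eight conditions pair every fifth-gear state with every inner-ring state (gear state varying fastest within each inner-ring state), they can be enumerated as:

```python
from itertools import product

gear_states = ["normal", "mild broken tooth",
               "moderate broken tooth", "single broken tooth"]
inner_ring_states = ["normal", "fault with 0.2 mm width"]

# Condition numbering: gear state varies fastest within each inner-ring state.
conditions = {i + 1: (gear, ring)
              for i, (ring, gear) in enumerate(product(inner_ring_states,
                                                       gear_states))}
print(conditions[7])  # ('moderate broken tooth', 'fault with 0.2 mm width')
```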

the fault location. A total of five measurement point locations were selected for this
experiment, as shown in Fig. 5.18; the corresponding physical locations are shown in
Fig. 5.20.
Point 1: located on the bearing seat of the transmission output shaft (a non-faulty
bearing; the faulty bearing is located at the connection between the output shaft and
the case, embedded in the case), close to the faulty bearing.
Point 2: near the output shaft on the left side of the box. Considering that there is
a non-rigid oil seal connection between the bearing seat of the output shaft and the
transmission body, bearing and gear fault information may be attenuated too much on
its way to point 1, so point 2 serves as a supplement to point 1.
Point 3: near the output end of the intermediate shaft on the box; except in direct
gear and neutral, the power input to the transmission must be transmitted through the
intermediate gears.
Point 4: spatially close to the fifth gear.
Point 5: on the box close to the input shaft. The fifth gear is not directly connected
to the box, so the fifth-gear signal cannot be transmitted directly to point 4; from
the perspective of the power transmission path, when the transmission works in fifth
gear the input shaft is closer to the fifth gear, so point 5 is chosen as a supplement
to point 4.

Fig. 5.20 The physical arrangement of the monitoring point
In the experiment, the transmission was put into fifth gear, the input shaft speed
was 1000 rpm, the load was 50 Nm, the sampling frequency was 24 kHz, and the
continuous sampling time was 60 s. Calculated following reference [20]:
Fifth-gear rotation frequency: fr = n × (26/38) × (42/22) = (1000/60) × (26/38) ×
(42/22) = 21.77 Hz.
Fifth-gear meshing frequency: fz = fr·z = 21.77 × 22 = 478.95 Hz.
Inner ring passing frequency: BPFI = (z·fr/2)(1 + (dB/dp) cos α) = 163.1 Hz.
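The shaft and meshing frequencies can be checked numerically; `bpfi` below is the textbook inner-race formula, included for reference with the bearing parameters as inputs:

```python
import math

rpm = 1000
f_r = rpm / 60 * (26 / 38) * (42 / 22)  # fifth-gear output shaft frequency
f_z = f_r * 22                          # meshing frequency, z = 22 teeth

def bpfi(z, f_r, d_ball, d_pitch, alpha_deg=0.0):
    """Textbook inner-race (ball pass frequency, inner) formula."""
    return z * f_r / 2 * (1 + d_ball / d_pitch
                          * math.cos(math.radians(alpha_deg)))

print(round(f_r, 2), round(f_z, 2))  # 21.77 478.95
```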

(2) Analysis of the time–frequency domain. Table 5.18 shows the time domain signals
of measurement point 2 (near the output shaft) and measurement point 5 (near the
input shaft) in the eight states. As can be seen, the time domain signals of the
two measurement points are strongly similar. At the same measurement point, when
the inner ring is normal, the amplitudes of the fifth gear in the normal, mild
broken tooth and moderate broken tooth states differ little, while the amplitude
of the single broken tooth is relatively large; the impact phenomenon becomes
somewhat more obvious as the gear fault deepens, but it is still difficult to
judge. When the inner ring is faulty, the vibration amplitude and impact phenomena
do not simply grow more obvious as the gear fault deepens: the mild broken tooth
and single broken tooth states show larger amplitudes and more obvious impacts,
whereas the moderate broken tooth and normal gear states differ little in amplitude
and impacts. For the same gear state at the same measurement point, the vibration
amplitude with a bearing fault is not always larger than with a normal bearing;
for example, in the fifth-gear moderate broken tooth state, the signal amplitude
with the faulty bearing is smaller than with the normal bearing, and the impact
phenomenon is also not obvious. This shows that when multiple complex states exist
at the same time, the difficulty of fault diagnosis increases.
From the time domain analysis it can be seen that the signals of the two measurement
points are strongly similar, so the analysis can focus on only one of them.
Figure 5.21 shows the frequency domain diagrams of the low-frequency band of
measurement point 2 in the eight fault states; in all eight states, the fifth-gear
output shaft rotation frequency of 21.77 Hz is very obvious.
Table 5.18 The time domain vibration signals of monitoring point 2 and monitoring
point 5 (rows: normal gear condition, mild broken tooth, moderate broken tooth,
single broken tooth; columns: normal inner ring condition and inner ring fault with
0.2 mm width, for each monitoring point)

When the bearings are normal, as the gear fault deepens, the harmonics of the rotation
frequency become more obvious; however, except in the single broken tooth state, the
fifth-gear meshing frequency is not obvious. In the bearing fault cases, the harmonics
of the rotation frequency are obvious in the mild broken tooth and single broken tooth
states, and the inner ring passing frequency is also obvious; in the normal gear and
moderate broken tooth states, the inner ring passing frequency is not obvious. It can
be seen that although there is some variability among the low-frequency spectra of the
different states, the differences are small, and it remains difficult to identify
bearing and gear faults at the same time.
(3) Single fault classification. A single fault in this section refers to the fault of
a single object, including the fault type and the fault degree. Referring to the
fault types in Table 5.17, two single-fault combinations can be discussed: bearing
faults under the normal gear condition and gear faults under the normal bearing
condition.

The input shaft speed is 1000 rpm, so the sensor collects 60 × 24,000/1000 = 1440 data
points per revolution of the input shaft, and one revolution of sampling points, i.e.,
1440 data points, is selected as the input length of the DBN network. With the number
of hidden layer nodes close to the input data dimension and an ascending combination,
a 3-layer-RBM deep belief network with hidden layers of 1500 × 2000 × 2500 is
constructed uniformly. 250 training samples and 250 test samples are randomly selected
for each fault state. Initially, the number of iterations is set to 100; the
normalized raw data of the five measurement points are input into the network one by
one, and the average over 20 repetitions is taken as the result.
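The sample-length arithmetic can be checked quickly (values from the experiment description):

```python
fs, rpm = 24_000, 1000                # sampling frequency (Hz), shaft speed
points_per_rev = fs * 60 // rpm       # 24,000 * 60 / 1,000 = 1,440 points
total_points = fs * 60                # 60 s of continuous sampling
max_samples = total_points // points_per_rev  # non-overlapping samples
print(points_per_rev, max_samples)    # 1440 1000
```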
(a) Bearing fault under the normal gear condition. The bearing faults under the normal
gear condition comprise two health states, namely the normal state (state 1) and
the inner ring fault with 0.2 mm width (state 5); together the two states give 500
training samples and 500 test samples. The DBN model is trained with the 500
training samples, the 500 test samples are then input into the trained DBN model,
and the classification and recognition results of the test samples are shown in
Table 5.19.
The classification accuracy in the table refers to the highest correct classification
rate, calculated with the logical indicator function
ρ = Σ_i 1{y_i = k} / Σ_i 1{y_i = k ∨ y_i ≠ k}, i.e., the fraction of test samples
whose category y_i predicted by the DBN model equals the actual category k. The
average accuracy refers to the weighted average of the classification accuracies of
all categories, and the number of iterations refers to the number of iterations
required for the highest average correct classification rate. From Table 5.19,
iterations required for the highest average correct classification rate. From Table 5.19,
it can be seen that the average highest correct classification rate of all five measure-
ment points is very high, and the highest correct classification rate of measurement
points 2, 3 and 5 all reach 100%. The five sets of high correct classification rates
indicate that DBN has the ability to extract information directly from the original
data, and also indicate that all five measurement points can classify and identify
bearing faults under normal gear conditions accurately.

Fig. 5.21 Frequency domain vibration signal of monitoring point 2

Table 5.19 The classification results of bearing fault under the condition of normal gear
Fault conditions       Point 1   Point 2   Point 3   Point 4   Point 5
Fault condition 1 (%)  84.4      100.0     100.0     98.0      100.0
Fault condition 5 (%)  92.4      100.0     100.0     99.2      100.0
Average (%)            88.4      100.0     100.0     98.6      100.0
Iterations             100       37        6         37        25

The classification
tify bearing faults under normal gear conditions more accurately. The classification
rate of measurement point 1 is relatively low, and the lowest fault state 1 is prob-
ably due to the existence of a non-rigid oil seal connection in measurement point
1, and the bearing fault signal attenuation is larger. The correct classification rate
of measurement point 4 is also slightly low, probably because measurement point
4 is far from the rotating shaft (three shafts in the transmission). Secondly, except
for measurement point 1, the number of iterations required for other measurement
points is less than 50, and measurement point 3 only needs 6 iterations to reach 100%
correct rate, which shows that the data of measurement point 3 is relatively better.
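The per-class and average rates reported in the tables can be computed as follows (a minimal sketch with toy labels, not the experimental data):

```python
import numpy as np

def classification_rates(y_true, y_pred, classes):
    """Per-class correct classification rate (the indicator-function
    ratio rho) and the average over all test samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = {k: float(np.mean(y_pred[y_true == k] == k)) for k in classes}
    average = float(np.mean(y_pred == y_true))
    return per_class, average

# Tiny illustrative example: four test samples, two classes.
acc, avg = classification_rates([1, 1, 5, 5], [1, 1, 5, 1], classes=[1, 5])
print(acc, avg)  # {1: 1.0, 5: 0.5} 0.75
```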
The average classification recognition rate of bearing faults under the normal gear
condition versus the number of iterations is shown in Fig. 5.22. The classification
rate of measurement point 1 keeps growing and has not fully converged after 100
iterations, so it can be inferred that it may continue to improve with more
iterations. After 20 iterations, the other measurement points are basically stable.
The data of measurement points 2, 3 and 5 are better in terms of both the correct
classification rate and the time required.

(b) Gear fault under the normal bearing condition. In this case, the gear faults in
the normal bearing condition contain four types: fifth gear normal (state 1),
fifth gear mild broken tooth (state 2), fifth gear moderate broken tooth (state 3)
and fifth gear single broken tooth (state 4), giving 1000 training samples and
1000 test samples. After the DBN calculation, the classification and recognition
results of the test samples are shown in Table 5.20. It can be seen that the
average highest classification rate of all measurement points except point 1
exceeds 98%, and that of measurement point 3 reaches 100%, which again shows that
DBN can mine information directly from the raw data, and also shows that
measurement points 2, 3, 4 and 5 all achieve accurate fault classification and
identification. Secondly, across the measurement points, the classification rates
of fault states 1 (normal gear) and 2 (mild broken tooth) are relatively low. For
example, at measurement point 1 the classification rate of fault state 1 is only
64.8% and that of fault state 2 only 47.6%, while those of fault states 3 and 4
exceed 90%, indicating that the normal gear and mild broken tooth states are
harder to classify and identify. The classification rate of measurement point 1
has a large gap from the other four groups, possibly because it is too far from
the faulty gear and the gear fault information is severely attenuated, or because
the non-rigid oil seal at the connection between the bearing housing and the
transmission severely attenuates the bearing and gear fault information.
As shown in Table 5.20, all measurement points are close to or have reached 100
iterations, except measurement point 3, which requires far fewer. The average
classification accuracy versus the number of iterations for gear faults under the
normal bearing condition is plotted in Fig. 5.23. As can be seen from the figure,
except for measurement point 1, the classification accuracy of the other four groups
exceeds 90% at 40 iterations, and that of measurement points 2, 3, 4 and 5 exceeds
95% at 20 iterations, which indicates that these four measurement point locations
classify and recognize gear faults under the normal bearing condition well.
The analysis of the classification of the two single-fault combinations shows
that DBN has the ability to mine information directly from the raw data, and
all five measurement points can classify and recognize the faults fairly accurately.
However, there is some discrepancy in how well different measurement points classify
different fault states. Relatively speaking, measurement point

Table 5.20 The classification results of gear fault under the condition of normal bearing

Fault conditions        Monitoring  Monitoring  Monitoring  Monitoring  Monitoring
                        point 1     point 2     point 3     point 4     point 5
Fault condition 1 (%)   64.8        98.0        100         98.0        98.8
Fault condition 2 (%)   47.6        98.8        100         98.4        96.4
Fault condition 3 (%)   93.2        99.6        100         98.8        100.0
Fault condition 4 (%)   96.0        100         100         100         99.2
Average (%)             75.4        99.1        100         98.8        98.6
Iterations              100         89          15          98          100

Fig. 5.23 The average classification accuracy of gear faults under normal bearing condition
versus the number of iterations

1 and measurement point 4, which are far away from the position of the faulty gear,
have poor classification and recognition effects, while measurement points 2, 3 and
5 have better effects, indicating that the choice of measurement point location has
some influence on the diagnosis results when using vibration signals to diagnose
complex faults.
(4) Complex fault classification. The above analysis shows that DBN can classify
and identify single object faults more accurately from the raw data. Integrating
the 8 sets of data containing two object faults in Table 5.17 forms a more
representative, more difficult and more comprehensive database of complex
fault data. For each fault state, 250 training samples and 250 test samples are
randomly selected, i.e., a total of 2000 training samples and 2000 test samples
are obtained from the 8 sets of data. Similar to the single-fault classification
and identification method, one revolution of sampling points, i.e., 1440 data
points, is selected as the input of the DBN, and a deep belief network with
three stacked RBMs and hidden layers of 1500 × 2000 × 2500 nodes is
constructed, with the number of iterations initially set to 100. The normalized
raw data of the five measurement points are input into the network one by one,
and the average of 20 repeated runs is taken as the result.
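The sample interception and normalization described above can be sketched as follows. This is a minimal illustration only: the function name and sample counts are hypothetical, and simple per-sample min–max scaling is assumed as the normalization.

```python
import numpy as np

def make_samples(signal, sample_len=1440, n_samples=1000):
    # "Continuous non-repetitive" interception: contiguous, non-overlapping cuts
    n = min(n_samples, len(signal) // sample_len)
    samples = signal[:n * sample_len].reshape(n, sample_len)
    lo = samples.min(axis=1, keepdims=True)   # min-max normalize each sample
    hi = samples.max(axis=1, keepdims=True)
    return (samples - lo) / (hi - lo + 1e-12)

# 60 s at 24 kHz -> 1,440,000 points -> 1000 samples of one revolution each
raw = np.random.randn(24_000 * 60)
X = make_samples(raw)
```

Each row of `X` is then one 1440-point input vector for the DBN visible layer.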
(a) Result analysis. The classification recognition results of the test samples are
shown in Table 5.21. Compared with single faults, the classification accuracy for
complex faults is relatively low, because the number of target categories
increases, the difference among the data to be classified decreases, and the diffi-
culty of classification increases. The classification results of the five
measurement points also show an obvious difference: measurement points
3 and 5 achieve relatively good results, with average classification accuracy
exceeding 95%, whereas the average classification accuracy of the other three
measurement points is below 90%, and only 62.5% for measurement point 1.
This again shows that the selection of the measurement point location has a
great influence on the fault diagnosis effect. In addition, the classification
performance for different fault states differs considerably across measurement
points: the accuracy for fault state 1 (fifth gear in normal condition), fault state
2 (fifth gear with mild broken tooth), fault state 5 (inner ring fault with 0.2 mm
width and fifth gear in normal condition) and fault state 6 (inner ring fault with
0.2 mm width and fifth gear with mild broken tooth) is relatively low, because
when the fault intensity is weak, the signal discrepancy among these four states
is smaller and classification is more difficult.
As shown in Table 5.21, the number of iterations required is high for all four
measurement points except measurement point 5. The average classification
accuracy of complex faults versus the number of iterations is shown in Fig. 5.24. It
can be seen that measurement points 2, 3 and 5 are already close to their highest
classification accuracy at 20 iterations, while the accuracy of measurement points
1 and 4 is still gradually improving.
In summary, the high classification accuracy of measurement points 3 and 5
shows that DBN can accurately classify and identify complex faults of the
transmission directly from the raw data. However, the results of the other three
measurement points are not satisfactory, which is consistent with the conclusion of
the single-fault classification and again shows that the choice of measurement point
location has a great impact on the fault diagnosis effect.

Table 5.21 The classification recognition results of different faults

Fault        Monitoring  Monitoring  Monitoring  Monitoring  Monitoring
conditions   point 1     point 2     point 3     point 4     point 5
Fault 1 (%)  56.8        64.0        98.3        57.6        98.4
Fault 2 (%)  30.0        71.6        99.8        98.8        99.6
Fault 3 (%)  85.6        100.0       100         92.0        100
Fault 4 (%)  96.8        91.6        100         87.6        100
Fault 5 (%)  69.2        100         99.7        75.2        100
Fault 6 (%)  24.4        76.4        99.6        72.0        93.6
Fault 7 (%)  92.0        100.0       100         90.0        100
Fault 8 (%)  44.8        84.0        100         79.6        94.8
Average (%)  62.5        86.0        99.7        81.6        98.3
Iterations   100         89          67          100         35

Fig. 5.24 The average classification accuracy of complex faults versus the number of iterations

(b) Time cost control analysis. From the analysis in Sect. 5.2.3, it can be seen that
when the sample data are obtained by the "continuous non-repetitive" interception
method, the sample length does not need to cover a complete cycle; and from
Fig. 5.24, it can be seen that complex fault classification does not require
the full 100 iterations. Assuming that the number of samples N = 2000 and
the numbers of DBN hidden layer nodes m1 = 1500, m2 = 2000 and m3 =
2500 are determined, the total number of calculations can be obtained by
substituting into Eq. 5.23:
 
S = 2000K(n² + 6000n + 50,750,000)  (5.27)

∂S/∂K = 2000(n² + 6000n + 50,750,000)  (5.28)

∂S/∂n = 2000K(2n + 6000)  (5.29)

To further control the time cost within the range of one revolution of sample length
and 100 iterations, substituting the parameters 0 < K ≤ 100 and 0 < n ≤ 1440 into
Eqs. 5.28 and 5.29 gives ∂S/∂K ∈ [1.02 × 10¹¹, 1.23 × 10¹¹] and ∂S/∂n ∈
[1.2 × 10⁷, 1.78 × 10⁹], i.e., ∂S/∂K ≫ ∂S/∂n, which indicates that S is more sensitive
to K than to n, so adjusting the value of K is more effective for controlling the time cost of DBN.
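These sensitivity bounds can be reproduced numerically; the following is a quick self-contained check of Eqs. 5.28 and 5.29 at the extremes of the parameter ranges.

```python
# Partial derivatives of the computation count S (Eqs. 5.28-5.29)
def dS_dK(n):
    return 2000 * (n**2 + 6000 * n + 50_750_000)

def dS_dn(K, n):
    return 2000 * K * (2 * n + 6000)

# Extremes over 0 < K <= 100, 0 < n <= 1440
print(dS_dK(0), dS_dK(1440))          # ~1.02e11 ... ~1.23e11
print(dS_dn(1, 0), dS_dn(100, 1440))  # 1.2e7 ... ~1.78e9
```

Across the whole admissible range, the gradient with respect to K is two or more orders of magnitude larger than the gradient with respect to n, matching the conclusion above.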
As shown in Table 5.21, only measurement points 3 and 5 reach an average
classification accuracy over 90%, and measurement points 1, 2 and 4 may not be
practical. In the practical application of intelligent diagnosis of complex faults,
therefore, only the data of measurement points 3 and 5 need to be considered. Taking
the complex fault data of measurement point 5, whose results are slightly worse, as a
representative, the operation flow in Fig. 5.16 is run with a target classification
accuracy of 95%, an initial sample length of 100 points, a step size of 10 points, and
a maximum of 100 iterations. On a computer with an Intel(R) Xeon(R) E3-1230 CPU
(3.3 GHz) and 8 GB of memory, the optimal sample length of 980 data points with 16
iterations is obtained after about 37 h. If instead the minimum sample length is
preferred first and the minimum number of iterations is selected afterwards, the
optimal result, a sample length of 620 data points with 68 iterations, is obtained
after about 83 h of calculation in the same computing environment, i.e., about 2.24
times the computation time of the former method.
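The adjustment flow of Fig. 5.16 can be sketched as a simple search loop. The function names are illustrative, and `evaluate` is only a mock stand-in: a real run would train and test the DBN for each candidate sample length and iteration count.

```python
def tune(evaluate, target=0.95, n0=100, n_step=10, n_max=1440, k_max=100):
    # Grow the sample length n in steps; for each n, find the smallest
    # iteration count K whose test accuracy meets the target.
    for n in range(n0, n_max + 1, n_step):
        for K in range(1, k_max + 1):
            if evaluate(n, K) >= target:
                return n, K
    return None

# Mock accuracy surface for illustration only
mock = lambda n, K: min(100, n // 20 + K) / 100
best = tune(mock)   # -> (100, 90) for this mock surface
```

Swapping the two loops gives the other adjustment priority discussed below; the search order determines which of the two parameters is minimized first.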
With the sample length and the number of iterations set according to the results of
the above two adjustment methods, and all other settings unchanged, the
classification results of measurement points 3 and 5 are shown in Table 5.22.
In Table 5.22, "Time" includes the total time of sample training and sample
testing; "Adjusting way 1" refers to adjusting the number of iterations first;
"Adjusting way 2" refers to adjusting the sample length first; "Accuracy" refers
to the highest average classification accuracy over the eight fault states. From
Table 5.22, although the classification accuracy of adjusting way 2 is higher than
Table 5.22 The classification results after the parameters adjustment

Monitoring          Unadjusted           Adjusting way 1      Adjusting way 2
point               Accuracy  Time       Accuracy  Time       Accuracy  Time
                    (%)       (min)      (%)       (min)      (%)       (min)
Monitoring point 3  99.7      51.3       95.4      19.1       96.4      38.4
Monitoring point 5  98.3      51.3       95.0      19.1       95.0      38.4

that of adjusting way 1, its time cost is 2.01 times that of way 1, so adjusting
way 1 is more effective than way 2.
(c) Comparative analysis of DBN input. To further explore the classification
performance with different input forms, the raw data are processed in the
following ways and normalized before being input into the DBN visible layer:
(a) raw data with one revolution of 1440 data points as the sample length;
(b) the 14 groups of common statistical feature variables selected in (3) of
Sect. 5.2.1; (c) 20 groups of multivariate features selected following [21];
(d) the time–frequency map drawn after wavelet transform, with the
time–frequency image pixels scaled equally to 60 × 80 as input; (e) the
time–frequency map of way (d) compressed by two-directional principal
component analysis (TD-PCA) to 10 × 10 pixels, with the compressed image
features as input. From the previous analysis, it can be seen that the
classification effect of DBN is strongly correlated with the hidden layer
structure, and since the input dimensions of the above five ways differ greatly,
the same hidden layer structure cannot be used. For the different input ways,
this section adopts a "constant" hidden layer structure with the number of nodes
equal to the number of input dimensions: way (a) uses 1440 × 1440 × 1440;
way (b) uses 14 × 14 × 14; way (c) uses 20 × 20 × 20; way (d) uses
4800 × 4800 × 4800; way (e) uses 100 × 100 × 100. For each fault state,
250 training samples and 250 test samples are randomly selected; integrating
the 8 fault states, a total of 2000 training samples and 2000 test samples are
obtained. The number of iterations is set to 100, each experiment is repeated
20 times, and the average is taken as the final result; the classification
results of the test samples are shown in Table 5.23.
In Table 5.23, "Time" refers to the average total time for each measurement point,
including the training and testing time for all samples. Since the data structure of
the five measurement points is identical, the time required for each measurement
point in the DBN calculation is the same, so this value is equivalent to the total
time of a single measurement point. As shown in Table 5.23, the shortest time is
achieved by the 14 groups of multivariate inputs, because the input dimension is
small, the number of nodes in the DBN hidden layers is small, and the corresponding
number of calculations is therefore also small. The most time-consuming input is the
scaled time–frequency pixel input, because converting the raw data into
time–frequency maps takes some time, and the 4800-dimensional input leads to a
large number of hidden layer nodes and calculations. After TD-PCA compression, the
time–frequency map is converted into a 100-dimensional input to the DBN, the
computation time is greatly reduced and the classification accuracy is greatly
improved. From the perspective of classification accuracy, although some
measurement points with the raw data input are slightly worse than other methods
(for example, the recognition accuracy of measurement point 2 is lower than that of
the TD-PCA time–frequency pixel input), the overall

Table 5.23 The classification results of different input ways

Input ways of DBN        Monitoring   Monitoring   Monitoring   Monitoring   Monitoring   Time
                         point 1 (%)  point 2 (%)  point 3 (%)  point 4 (%)  point 5 (%)  (min)
Raw data                 62.3         82.6         99.5         80.6         98.7         32.4
Multivariate variables   61.5         73.6         82.1         75.1         79.4         0.7
of 14 groups
Multivariate variables   63.4         81.5         85.15        82.6         87.2         0.9
of 20 groups
Scaled time–frequency    56.5         62.1         62.3         61.1         61.3         397.2
map pixels
Time–frequency map       66.5         83.1         85.3         79.1         81.3         68.1
pixels of TD-PCA

raw data input method performs better than the other methods. The lowest accuracy
is obtained by the scaled time–frequency pixel method, because the fault information
in the time–frequency map accounts for only a small part (e.g., the information near
the characteristic frequency). The classification accuracy of the selected 20 groups
of variables is higher than that of the 14 groups, indicating that the selection of
suitable features is also a key factor affecting the results, which again confirms
the advantage of diagnosing fault states directly from the raw data, avoiding the
manual feature extraction and selection process.
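The TD-PCA image compression used in input way (e) can be sketched as a two-directional 2D-PCA: the image is projected onto the leading eigenvectors of a row-side and a column-side scatter matrix. The function name and the assumption of 10 retained eigenvectors per side are illustrative.

```python
import numpy as np

def td_pca_compress(images, d_row=10, d_col=10):
    # images: (N, H, W) stack of scaled time-frequency maps
    A = images - images.mean(axis=0)                  # center over the sample set
    Gc = np.einsum('nhw,nhv->wv', A, A) / len(A)      # column-side scatter (W x W)
    Gr = np.einsum('nhw,nvw->hv', A, A) / len(A)      # row-side scatter (H x H)
    X = np.linalg.eigh(Gc)[1][:, -d_col:]             # leading column eigenvectors
    Z = np.linalg.eigh(Gr)[1][:, -d_row:]             # leading row eigenvectors
    return np.einsum('hr,nhw,wc->nrc', Z, images, X)  # projected (N, d_row, d_col)

tf_maps = np.random.rand(200, 60, 80)   # stand-in for 200 maps of 60 x 80 pixels
feats = td_pca_compress(tf_maps)        # compressed to (200, 10, 10)
```

Each 60 × 80 map is thus reduced to a 10 × 10 feature matrix, i.e., the 100-dimensional DBN input discussed above.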

5.3 CNN Based Fault Classification

The time–frequency image obtained from time–frequency analysis represents the
joint distribution information of the time domain and frequency domain, and intu-
itively reflects how each frequency component of the signal varies with
time. The time–frequency map contains rich information about the device status,
and intelligent diagnosis of gearbox faults can be realized by analyzing the time–
frequency image. Convolutional neural network (CNN), as one of the popular algo-
rithms in the field of deep learning, has good classification performance for the image.
In this section, CNN is applied to the analysis of time–frequency images of vibration
signals and is utilized for the classification and recognition of faults.

5.3.1 Convolutional Neural Network

CNN is a multilayer artificial neural network whose structure is more specialized
than that of traditional neural networks. It adopts a weight-sharing structure,
which greatly reduces the complexity of the network model and the amount of
computation. Images can be input to the network directly, with no need to extract
and select features manually, which gives convolutional neural networks an obvious
advantage in recognizing two-dimensional inputs.
A convolutional neural network consists of an input layer, alternately connected
convolutional and downsampling layers, fully connected layers and an output layer.
The original images are input into the model. The convolutional layer performs
feature extraction, with each convolutional kernel acting as a feature matrix. The
role of the downsampling layer is to reduce the feature dimensionality and the
computational complexity. Both the convolutional layer and the downsampling layer
consist of multiple two-dimensional planes, and each plane represents an output
feature map of that layer. The fully connected layer is located at the end of the
CNN and computes the output of the whole network. The structural parameters of the
CNN can be chosen according to the actual situation; the structure used in this
section is shown in Fig. 5.25.
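The two layer operations just described can be illustrated with a minimal forward pass. This is only a sketch, assuming "valid" convolution and average pooling; the kernel values are arbitrary.

```python
import numpy as np

def conv2d(x, k):
    # "Valid" 2-D convolution: one shared kernel k slides over the whole map
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def avg_pool(x, s=2):
    # s x s average pooling: the downsampling layer's dimensionality reduction
    H, W = x.shape[0] // s * s, x.shape[1] // s * s
    return x[:H, :W].reshape(H // s, s, W // s, s).mean(axis=(1, 3))

img = np.random.rand(32, 32)                         # one scaled time-frequency map
fmap = avg_pool(conv2d(img, np.ones((5, 5)) / 25))   # 32x32 -> 28x28 -> 14x14
```

The weight sharing is visible in `conv2d`: the same 25 kernel values are reused at every position, instead of one weight per input pixel as in a fully connected layer.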
Instead of requiring an exact mathematical expression between the output and the
input, a CNN is trained with known samples to obtain the mapping between them.
The CNN training process consists of two stages: forward propagation and backward
propagation. Forward propagation inputs samples into the network to obtain the
network output. Backward propagation first calculates the error between the network
output and the target output, and then propagates this error backward to obtain the
error of each layer. Next, the network parameters are updated by stochastic gradient
descent (SGD) until the network converges or reaches

Fig. 5.25 The specific structure of the CNN: input layer; convolutional layers C1
and C3; downsampling layers S2 and S4; fully-connected layer V5; output layer

the max iteration, and the trained CNN is obtained. The specific algorithm is shown
in [22].

5.3.1.1 Flowchart of CNN Based Classification

The CNN based classification consists of two processes: training and testing. The
training samples are used to train the network, and then the testing samples are fed
into the trained network to test the classification performance of the model. The
training process requires forward propagation and backward propagation, while the
testing process only requires forward propagation.
When using CNN for classification tasks, after selecting the training and testing
samples, the specific classification steps are as follows.
(1) Randomly initialize the model parameters such as weight matrix and bias.
(2) Select a sample (X, Y ) from the training samples and input X into the network,
then obtain the corresponding output vector O. In practice, a batch input of
samples is often used, i.e., a certain number of samples are input each time. In
this case, the output is calculated as a matrix, and each column corresponds to
the actual output vector of one of the samples.
(3) Calculate the error between the actual output vector O and the target output
vector Y.
(4) Back propagate the error obtained in the previous step layer by layer, and then
use the SGD to obtain the gradient of the error cost function on the parameters,
and then update the weight parameters.
(5) Input the remaining training samples into the network in turn to complete steps
(2) to (4) until all the training samples are input to complete one iteration.
(6) Perform multiple iterations to improve the accuracy of the network. Stop
iterating when the specified recognition rate is achieved or the iteration
termination condition is reached.
(7) Input testing samples into the trained neural network and use the classifier to
compute the correct classification accuracy.
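Steps (1)–(7) can be sketched as a training loop. For brevity, a single softmax layer stands in for the full CNN here, so the backward pass reduces to one gradient; all names and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train(X, Y, lr=0.1, epochs=20, batch=50):
    W = rng.normal(0.0, 0.01, (X.shape[1], Y.shape[1]))  # (1) random init
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):                   # (5)-(6) iterate over all samples
        for i in range(0, len(X), batch):     # (2) batch input, forward pass
            xb, yb = X[i:i + batch], Y[i:i + batch]
            O = softmax(xb @ W + b)           # actual output matrix
            G = (O - yb) / len(xb)            # (3) error vs. target output
            W -= lr * xb.T @ G                # (4) SGD parameter update
            b -= lr * G.sum(axis=0)
    return W, b

# Toy data: two well-separated classes, one-hot targets
labels = np.repeat([0, 1], 100)
X = np.vstack([rng.normal(-2, 1, (100, 8)), rng.normal(2, 1, (100, 8))])
Y = np.eye(2)[labels]
W, b = train(X, Y)
acc = (softmax(X @ W + b).argmax(axis=1) == labels).mean()   # (7) test accuracy
```

In the full CNN, step (4) additionally back-propagates `G` through the pooling and convolution layers; the batch-wise forward/error/update cycle is otherwise identical.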

5.3.1.2 CNN Based MNIST Digit Classification

To verify the recognition performance of CNN on MNIST digit classification,
samples from the MNIST dataset are input into the CNN. The structure of the CNN in
this experiment is shown in Fig. 5.25, and the specific parameters are set as follows:
the kernel size of the first convolutional layer is 5 × 5 with 6 channels, and that
of the second convolutional layer is 5 × 5 with 12 channels. The size of the
downsampling area is set to 2 × 2, the batch size to 50, the learning rate to 1, and
the number of iterations to 20. The experiment is repeated five times, and the
average of the five classification results is taken as the final classification
accuracy; the relationship between the classification accuracy and the number of
iterations is shown in Fig. 5.26.
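Under these parameters, the feature-map sizes through the network can be checked with a short calculation, assuming "valid" convolutions (as the 28 × 28 MNIST input implies); the function name is illustrative.

```python
def feature_sizes(size=28, kernels=(5, 5), pool=2):
    # A "valid" k x k convolution shrinks the map by k-1; pooling divides by pool
    sizes = [size]
    for k in kernels:
        size = (size - k + 1) // pool
        sizes.append(size)
    return sizes

print(feature_sizes())   # [28, 12, 4]: 28 -> 24 -> 12 -> 8 -> 4
# Flattened vector entering the fully-connected layer: 4 * 4 * 12 = 192 values
```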

Fig. 5.26 The classification result of handwritten digit

As shown in Fig. 5.26, even when the number of iterations is set to 1, the
classification accuracy is already considerable, reaching 88.19%. This is because
the digit images to be recognized in this experiment are not complicated and the
number of training samples reaches 60,000, so the classification accuracy exceeds
88% in only one iteration. With the increase in the number of iterations, the
classification accuracy of handwritten digits also increases gradually, exceeding
95% after 5 iterations, which requires a computation time of about 450 s. Before the
number of iterations reaches 10, the classification accuracy increases relatively
quickly, because the model parameters of the CNN are continuously adjusted in each
iteration to better fit the sample data, and the ability of the network to extract
sample features improves as the number of adjustments increases. After the number of
iterations reaches 10, the classification accuracy still increases with the number
of iterations, but the increase becomes smaller, indicating that once the network
has been adjusted to a certain level, further adjustment has little effect on the
classification accuracy. When the number of iterations reaches 15, the
classification accuracy basically stops increasing and stabilizes, indicating that
under the current network structure the CNN can extract the sample features well
after 15 iterations.

5.3.2 CNN Based Fault Diagnosis Method

To analyze the role of CNN in fault diagnosis, an experiment on the fault
diagnosis of an automobile transmission is performed in the following.

5.3.2.1 Experiment Settings

The test platform and transmission are the same as those in Sect. 5.2.3.
To simulate different types of compound faults, faults of different degrees are
set on the inner race of a bearing and on the teeth of the fifth gear, as shown in
Fig. 5.27. The bearing, numbered NPU311EN, is located at the end of the output
shaft. The bearing fault is set on the inner race and includes three conditions:
a fault of 0.2 mm width with 1 mm depth, a fault of 2 mm width with 1 mm depth,
and the normal condition, as shown in Fig. 5.27a. Furthermore, the driven wheel of
the fifth gear, with 22 teeth, is used to generate gear faults of different degrees,
covering four conditions: the mild broken tooth, the moderate broken tooth, the
single broken tooth and the normal condition. The health conditions of the gear are
shown in Fig. 5.27b.
The 10 fault states obtained by combining the gear and bearing faults are shown in
Table 5.24. The speed of the input shaft is set to 1000 r/min, the sampling
frequency is 24 kHz, and the sampling time is set to 60 s. The vibration signal at
the same position is collected under each working condition.

5.3.2.2 The Procedure of Fault Diagnosis

The specific procedure for diagnosing the gearbox fault with CNN is shown in
Fig. 5.28. It can be summarized as follows. First, the vibration signal is
collected, and each vibration signal is divided into several samples of a certain
length. Second, multiple time–frequency maps are obtained by applying a
time–frequency transform to these samples, and the pixels of the time–frequency
maps are scaled so that they can be passed into the CNN. Third, training samples
and test samples are randomly divided from the time–frequency maps corresponding to
each signal, and the training samples are used to train the CNN model. Finally, the
classification result is obtained by passing the test samples into the trained
network.
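The front end of this procedure can be sketched as follows. A plain windowed FFT stands in for the S-transform used later in the text, and the window/hop parameters and function names are illustrative assumptions.

```python
import numpy as np

def tf_map(x, win=256, hop=64):
    # Windowed-FFT magnitude map (frequency x time); a stand-in for the S-transform
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

def resize(img, out=(32, 32)):
    # Nearest-neighbour scaling of the map pixels to the CNN input size
    r = np.arange(out[0]) * img.shape[0] // out[0]
    c = np.arange(out[1]) * img.shape[1] // out[1]
    return img[np.ix_(r, c)]

# One sample: a 1440-point signal segment -> one 32 x 32 time-frequency input
segment = np.random.randn(1440)
sample = resize(tf_map(segment))
```

Repeating this over all segments of all recorded signals, and splitting the resulting maps into training and test sets, yields the inputs for the CNN training and testing stages described above.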

5.3.2.3 The Analysis of Signals’ Time Domain and Frequency Domain

The 10 health conditions are set in the actual experiment, which includes 9 fault
conditions and normal conditions. In the stage of time domain analysis, only four

Fig. 5.27 The bearing and gear faults

Table 5.24 The fault conditions


Groups Fault conditions
1 Normal gear condition
2 Mild broken tooth
3 Moderate broken tooth
4 Single broken tooth
5 Inner ring fault with 0.2 mm width and Normal gear condition
6 Inner ring fault with 0.2 mm width and Mild broken tooth
7 Inner ring fault with 0.2 mm width and Moderate broken tooth
8 Inner ring fault with 0.2 mm width and Single broken tooth
9 Inner ring fault with 2 mm width and Moderate broken tooth
10 Inner ring fault with 2 mm width and Single broken tooth

Fig. 5.28 The procedure of transmission fault diagnosis

working conditions are selected for a simple time domain analysis: the normal
condition; the fifth gear single broken tooth; the bearing inner ring fault of
0.2 mm width with the fifth gear single broken tooth; and the bearing inner ring
fault of 2 mm width with the fifth gear single broken tooth. A 0.15 s segment of
the time domain vibration signal of each condition is shown in Fig. 5.29.
Comparing the time domain graphs of the normal condition and the fifth gear
single broken tooth, it can be seen that in the normal condition the vibration
signal has no obvious impacts and the amplitude is small, whereas the presence of a
gear fault not only increases the vibration amplitude but also causes a certain
degree of impact. Under the bearing inner ring fault of 0.2 mm width with the fifth
gear single broken tooth, the impact phenomenon is more obvious and the amplitude
larger; when the fault width of the inner ring increases to 2 mm, the vibration
amplitude becomes larger still and the impacts more severe.
From the above time domain analysis, the only information that can be observed is
the magnitude of the vibration amplitude and whether impacts are present. For fault
diagnosis, this information is far from enough. The time–frequency diagram can
reflect the relationship between all the frequency components

Fig. 5.29 The time-domain signals of four fault conditions

of the signal with time and contains a large amount of information, which is more
suitable for fault diagnosis.
Considering the rich information contained in the time–frequency graph, and the
fact that the input of a CNN must be two-dimensional, the time-domain signal is
subjected to a time–frequency transformation after the vibration signal of the
transmission is collected under each working condition. The S-transform is selected
because the time–frequency representation obtained with the short-time Fourier
transform or the continuous wavelet transform is affected by the choice of window
function or wavelet basis function.
According to the transmission parameters, the rotational frequency fr and the
tooth-mesh frequency fz of the fifth gear are calculated as follows:

fr = n × (26/38) × (42/22) = (1000/60) × (26/38) × (42/22) ≈ 21.77 Hz  (5.30)

fz = fr × z = 21.77 × 22 ≈ 478.95 Hz  (5.31)

The time-domain signal of each fault is transformed by the S-transform, and the
time–frequency graphs obtained are shown in Fig. 5.30.
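The two frequency values can be checked with a direct calculation from the gear ratios given above:

```python
# Eqs. 5.30-5.31: fifth-gear rotational and tooth-mesh frequencies
n_in = 1000 / 60                      # input shaft frequency at 1000 r/min, Hz
fr = n_in * (26 / 38) * (42 / 22)     # rotational frequency of the fifth gear
fz = fr * 22                          # tooth-mesh frequency (z = 22 teeth)
print(round(fr, 2), round(fz, 2))     # 21.77 478.95
```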

Fig. 5.30 The time–frequency graphs of each signal



From the time–frequency graphs of the 10 fault conditions in Fig. 5.30, it can be
seen that all of them contain a frequency component of about 480 Hz, which is close
to the calculated meshing frequency of the fifth gear (478.95 Hz). This component is
therefore the first-order meshing frequency of the fifth gear, caused by the
conventional meshing vibration of the gear. In addition, there are also energy
concentration bands around 1500 Hz and 2500 Hz in the time–frequency graphs, which
include the natural frequencies of the transmission box, shaft or gear excited by
the vibration in the corresponding frequency bands, as well as the higher-order
harmonics of the fifth gear meshing frequency.
From the comparison of Fig. 5.30a, c, e, g, it can be seen that in the case of a
gear fault, the heavier the degree of the gear fault, the higher the vibration
amplitude. When the inner ring fault of the bearing is added, a relatively obvious
high-frequency band appears, because the medium- and high-frequency natural
vibrations of the bearing system are excited by the impacts between the fault
surface and the rollers as the bearing rotates. Furthermore, the vibration amplitude
becomes larger when the fault degree of the inner ring increases from 0.2 mm to
2 mm in width.
From the above time–frequency graphs, it can be seen that the maps obtained for the
various faults differ according to the fault class and the degree of fault. However,
it is difficult to distinguish the various classes of faults by experience,
especially for these 10 fault states. What is needed is a method that can not only
find the common characteristics of the time–frequency graphs of the same fault
class but also distinguish different classes of faults. Considering the excellent
performance of CNN in image recognition tasks such as face recognition and
handwritten digit recognition, CNN is selected for fault image classification.

5.3.2.4 The Time–Frequency Image Classification

The structural parameters of the CNN have a great impact on the classification
results when CNN is applied to identify the time–frequency graphs of automotive
transmission compound faults. Therefore, appropriate parameters should be selected
to reflect the classification performance of CNN for these faults.
The time–frequency matrix passed into the CNN has a size of 32 × 32 pixels,
obtained by scaling the time–frequency graph. Furthermore, 1000 time–frequency
graph samples are constructed for each signal; half of them are selected to train
the CNN, and the remaining samples are used to test the classification performance
of the model.
The parameters that have a great influence on the classification results include
the number of iterations, the batch size, and the size and number of the
convolutional kernels. In the following, the influence of these parameters on the
classification results is analyzed to select the best network parameters.

(1) The number of iterations


Training iterations progressively approach the best fit: with too few iterations the
model underfits, while beyond a certain point further iterations no longer reduce the
fitting error and only add to the time cost. It is therefore necessary to choose the
smallest number of iterations that reaches the required accuracy at a relatively low
time cost.
To simplify the discussion, all parameters other than the number of iterations are
fixed while its influence on the classification results is examined. The batch size
is set to 5, and the convolutional kernel size of both convolutional layers is 5 × 5.
The number of convolutional kernels in the second layer is twice that of the first
layer, and the number in the first layer is set to 4, 5, and 6, respectively. Since
the network loses more information when the pooling area of the downsampling layer
is too large, and the input image size in this experiment is 32 × 32, the pooling
area is chosen as 2 × 2 with average pooling. The experiment is repeated 5 times,
and the average of the 5 classification results is taken as the final classification
accuracy. The relationship between the classification accuracy and the number of
iterations is shown in Fig. 5.31, where a label such as “4–8” denotes 4 kernels in
the first convolutional layer and 8 kernels in the second; this notation applies
throughout.
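The 2 × 2 average pooling used here can be sketched in a few lines of pure Python (`avg_pool_2x2` is a hypothetical helper name, shown only to illustrate the operation):

```python
def avg_pool_2x2(fmap):
    """2 x 2 average pooling with stride 2; fmap dimensions must be even."""
    rows, cols = len(fmap), len(fmap[0])
    return [[(fmap[i][j] + fmap[i][j + 1] +
              fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
             for j in range(0, cols, 2)]
            for i in range(0, rows, 2)]

# A 4 x 4 feature map pools down to a 2 x 2 map of block averages
fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = avg_pool_2x2(fmap)  # [[3.5, 5.5], [11.5, 13.5]]
```

Each output value averages a non-overlapping 2 × 2 block, halving each spatial dimension.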
Figure 5.31 shows that the classification accuracy gradually increases with the
number of iterations, exceeding 96% once the number of iterations reaches 5.

Fig. 5.31 The relationship between iterations and accuracy

338 5 Deep Learning Based Machinery Fault Diagnosis

The accuracy exceeds 98.5% when the number of iterations is higher than 10, and it
stabilizes as the iterations increase further. Therefore, given the number of samples
and the graph size used in this section, 10 iterations achieve high classification
accuracy while keeping the time cost low.
(2) The batch size
One training iteration (epoch) of the CNN proceeds as follows. First, a group of
samples of size equal to the batch size is randomly selected for training. Second,
the weights are adjusted and the next group of samples is randomly selected. This
repeats until all training samples have passed through the CNN, after which the
model begins the next iteration. A larger batch size leads to faster convergence,
but the weights are adjusted fewer times and the classification accuracy is reduced.
In contrast, a smaller batch size can improve the classification accuracy but leads
to a higher time cost. Therefore, when selecting the batch size, classification
accuracy and computation time should both be considered, reducing the time cost
while maintaining sufficient classification accuracy.
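The epoch described above can be sketched generically as follows (a minimal illustration; `run_epoch` and the `update_weights` callback are hypothetical stand-ins for the actual gradient step):

```python
import random

def run_epoch(num_samples, batch_size, update_weights, seed=0):
    """One training iteration (epoch): shuffle the sample indices, then
    take them batch_size at a time, adjusting the weights after each batch.
    Returns the number of weight updates performed."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)
    updates = 0
    for start in range(0, num_samples, batch_size):
        batch = idx[start:start + batch_size]
        update_weights(batch)  # placeholder for the gradient step
        updates += 1
    return updates

# A larger batch size means fewer weight adjustments per epoch:
assert run_epoch(5000, 5, lambda b: None) == 1000
assert run_epoch(5000, 50, lambda b: None) == 100
```

This makes the accuracy/time trade-off concrete: with batch size 5 the weights are adjusted 1000 times per epoch, but with batch size 50 only 100 times.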
The number of convolutional kernels in the first layer is set to 1 to 6, the number
in the second layer is twice that of the first layer, the kernel size of both
convolutional layers is 5 × 5, and the number of iterations is set to 10 according
to the conclusion of the previous test. Meanwhile, the batch size must divide the
number of training samples evenly, so it is chosen from 2, 4, 5, 8, 10, 20, 25, 40,
and 50. The experiment is repeated 5 times, and the average of the 5 classification
results is taken as the final classification accuracy. The relationship between the
classification accuracy and the batch size is shown in Fig. 5.32.
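The divisibility rule can be checked directly. Assuming 10 classes with 500 training samples each (half of the 1000 samples per class), the candidate batch sizes up to 50 that divide the 5000 training samples evenly are exactly those listed above:

```python
# Candidate batch sizes must divide the number of training samples evenly.
# Assumption: 10 classes x 500 training samples each = 5000 training samples.
num_train = 10 * 500
candidates = [b for b in range(2, 51) if num_train % b == 0]
print(candidates)  # [2, 4, 5, 8, 10, 20, 25, 40, 50]
```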

Fig. 5.32 The relationship between batchsize and accuracy



From Fig. 5.32, it can be seen that when the batch size is less than 10, changing it
has little influence on the classification accuracy. When the batch size exceeds 10,
however, the classification accuracy decreases significantly as the batch size
increases. Therefore, for a given number of samples, the batch size must divide the
number of training samples evenly, and a smaller batch size raises the fault
recognition accuracy, which benefits fault-category judgment; in this test, for
example, the classification accuracy is highest when the batch size is 4, 5, or 8.
Finally, taking the time cost into account, a batch size of 5 or 8 is selected.
(3) The number of convolutional kernels
Each convolutional kernel corresponds to a specific extracted feature in the
convolution operation. Too few kernels cannot fully extract the features, so the
accuracy suffers, while too many kernels introduce more parameters and raise the
time cost. The best number of convolutional kernels should therefore be selected
according to the complexity of the images and of the classification task.
The number of iterations is set to 10. According to the analysis in the previous
subsection, accuracy is higher with a smaller batch size but the time cost is also
higher, so the batch size is set to 5. The kernel size of both layers is 5 × 5. The
numbers of kernels in the first and second layers could in principle be chosen
arbitrarily, but the space of combinations is too large. Therefore, only cases where
the number of kernels in the second layer is a multiple of that in the first are
considered: ratios of 1 to 4 times, plus ratios of 1.5, 2.5, and 3.5 times when the
number in the first layer is even. The experiment is repeated 5 times, and the
average of the 5 classification results is taken as the final classification
accuracy. The relationship between the classification accuracy and the ratio of the
kernel numbers in the two layers is shown in Fig. 5.33.
From Fig. 5.33, it can be seen that the fault classification accuracy is far lower
than for any other combination when the first layer has only 1 convolutional kernel.
For all other combinations the accuracy is above 95%, and the differences are small
when the ratio of the two kernel numbers is between 2 and 4. Therefore, the first
layer should have more than one convolutional kernel, and the second layer should
have 2–3 times as many kernels as the first; this gives good diagnosis performance
at a lower time cost than combinations with higher multiples.
(4) The size of the convolutional kernels
A larger convolutional kernel enlarges the feature space the network can represent
and strengthens its learning ability; however, more parameters must be trained and
the computation is more complex, which easily leads to overfitting and greatly
increases the training time.

Fig. 5.33 The relationship between accuracy and the ratio of the number of kernels
in the two convolutional layers
The number of iterations is set to 10, the batch size to 5, the number of
convolutional kernels in the second convolutional layer to twice that in the first,
and the number of kernels in the first layer to 1 to 6, respectively. A combination
of convolutional kernel sizes is denoted (k1, k2), where k1 × k1 and k2 × k2 are the
kernel sizes of the first and second layers, respectively. A pooling operation of
size scale × scale follows each convolution operation. However, the feature map
output by a convolutional module is not always an integer multiple of scale, in
which case the edge parts of the feature map cannot be pooled. The usual remedy is
either to remove these edges or to zero-pad the map up to a multiple of scale. To
simplify the computation, the feature map output by each convolutional layer is
constrained to be divisible by scale, so that no edge part remains. The downsampling
size used in this section is 2 × 2, which requires (32 − k1 + 1) and
((32 − k1 + 1)/2 − k2 + 1) to be even. Therefore, the size combinations are set as
(3, 4), (3, 6), (3, 8), (5, 3), (5, 5), (5, 7), (7, 4), (7, 6), (7, 8), as shown on
the abscissa of Fig. 5.34. The experiment is repeated 5 times, and the average of
the 5 classification results is taken as the final classification accuracy. The
relationship between the classification accuracy and the two-layer kernel sizes is
shown in Fig. 5.34.
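The evenness constraint can be enumerated programmatically (a sketch; the kernel-size range 3–8 and the helper name `valid_kernel_pairs` are assumptions covering the sizes considered here):

```python
def valid_kernel_pairs(input_size=32, scale=2, k_range=range(3, 9)):
    """Enumerate kernel-size pairs (k1, k2) for which both feature maps
    divide evenly by the pooling scale (so no edge part is left over)."""
    pairs = []
    for k1 in k_range:
        c1 = input_size - k1 + 1          # first conv output size
        if c1 % scale:
            continue                      # first pooling would leave an edge
        for k2 in k_range:
            c2 = c1 // scale - k2 + 1     # second conv output size
            if c2 % scale == 0:
                pairs.append((k1, k2))
    return pairs

print(valid_kernel_pairs())
# [(3, 4), (3, 6), (3, 8), (5, 3), (5, 5), (5, 7), (7, 4), (7, 6), (7, 8)]
```

The enumeration reproduces exactly the nine combinations tested above.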
Fig. 5.34 The relationship between accuracy and the size of the convolutional kernel

Setting the kernel size aside for the moment: the classification accuracy increases
rapidly as the number of first-layer kernels increases from 1 to 2, and changes
little as it increases from 2 to 6, confirming the conclusion of the previous
subsection that performance is better when the first layer has more than one kernel.
From Fig. 5.34, it can be seen that the classification accuracy with a first-layer
kernel of 3 × 3 is generally lower than with 5 × 5 or 7 × 7. There is little
difference between first-layer kernels of 5 × 5 and 7 × 7, and results are better
when the second-layer kernel is small. Since larger kernels introduce more
parameters, and considering the fault diagnosis results, the first-layer kernel size
is selected as 5 × 5 and the second-layer kernel size as 3 × 3.
(5) Parameter validation
To verify the reliability of the selected parameters, the CNN is configured as
follows: the first convolutional layer has 6 convolutional kernels of size 5 × 5,
and the second convolutional layer has 12 convolutional kernels of size 3 × 3; the
pooling area is 2 × 2 with average pooling; the batch size is set to 5; and the
number of iterations is 10.
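The feature-map sizes implied by this configuration can be traced with simple arithmetic (a sketch assuming 'valid' convolutions and stride-2 pooling; `feature_map_sizes` is a hypothetical helper):

```python
def feature_map_sizes(input_size=32, kernels=(5, 3), pool=2):
    """Trace the feature map side length through alternating
    'valid' convolution and pooling layers."""
    sizes = [input_size]
    s = input_size
    for k in kernels:
        s = s - k + 1    # valid convolution shrinks the map by k - 1
        sizes.append(s)
        s = s // pool    # 2 x 2 average pooling halves each dimension
        sizes.append(s)
    return sizes

print(feature_map_sizes())  # [32, 28, 14, 12, 6]
# The second layer's 12 kernels of 6 x 6 maps give 12 * 6 * 6 = 432 features
# feeding the classifier.
```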
There are 10 fault types to be identified. The samples are the time–frequency images
obtained by the S-transform, with an input image size of 32 × 32. There are 1000
samples in each class; 50% of them are randomly selected as the training set and the
rest form the test set. The samples are randomly drawn 10 times, and the training
accuracy and test accuracy for each draw are calculated, as shown in Fig. 5.35.

Fig. 5.35 The accuracy of the training set and test set

From Fig. 5.35, it can be seen that with the selected CNN structural parameters,
there is little difference between the test accuracy and the training accuracy when
classifying the time–frequency images of the 10 signal types. Moreover, except for
one run at about 98%, the accuracies are all above 99%, indicating the effectiveness
of the parameter selection.
Table 5.25 lists the training accuracy and test accuracy for each fault class over
the 10 trials. As the table shows, training and test accuracy differ little for each
class. Among the 10 classes, the lowest accuracies occur for the fifth gear moderate
broken tooth and single broken tooth states, indicating that, compared with the
other classes, the time–frequency image features of these two classes extracted by
the convolutional neural network fail to cover all samples.

5.3.2.5 The Validation of CNN Diagnosis Performance

To verify the diagnosis performance of the CNN on time–frequency graphs, two other
deep learning algorithms, the deep belief network (DBN) [2] and the stacked
auto-encoder (SAE) [23], are applied to the classification of the time–frequency
graphs of the fault signals. The computer used in this experiment has a CPU
i7-4790K @ 4.00 GHz, 16.00 GB RAM, an Nvidia GeForce GTX960 GPU, and the Win7
operating system.
According to the parameter analysis in the previous section, the CNN parameters are
selected as follows: the first convolutional layer has 6 convolutional kernels of
size 5 × 5, the second convolutional layer has 12 convolutional kernels of size
3 × 3, the batch size is set to 5, and the number of iterations is set to 10.
Table 5.25 The accuracy of each fault condition in the training set and test set

Fault condition | Training accuracy (%) | Test accuracy (%)
Fifth gear normal condition | 100 | 100
Fifth gear mild broken tooth | 100 | 99.88
Fifth gear moderate broken tooth | 97.82 | 97.62
Fifth gear single broken tooth | 98.08 | 98.1
Inner ring fault with 0.2 mm width and fifth gear normal condition | 100 | 100
Inner ring fault with 0.2 mm width and fifth gear mild broken tooth | 100 | 100
Inner ring fault with 0.2 mm width and fifth gear moderate broken tooth | 99.38 | 99.44
Inner ring fault with 0.2 mm width and fifth gear single broken tooth | 99.94 | 99.88
Inner ring fault with 2 mm width and fifth gear moderate broken tooth | 99.62 | 99.12
Inner ring fault with 2 mm width and fifth gear single broken tooth | 99.62 | 99.92

The parameters of the deep belief network are selected as follows: the DBN consists
of an input layer, three hidden layers, and an output layer. The number of input
nodes equals the image dimension, 1024; the number of output nodes equals the number
of categories, 10; and the three hidden layers have 1000, 800, and 500 nodes,
respectively. In the pre-training stage, unlabeled data are used to train each
restricted Boltzmann machine layer by layer, and the pre-trained weights are
obtained after 100 iterations. In the fine-tuning stage, the pre-trained network is
combined with a softmax classifier to form the classification model, and labeled
data are used to fine-tune the whole network through backpropagation; the final
network parameters are obtained after 200 iterations.
The parameters of the stacked auto-encoder are chosen as follows. The network is
stacked from one input layer, three hidden layers, and one output layer. The number
of input nodes equals the image dimension, 1024, and the number of output nodes
equals the number of classes, 10. Considering the numbers of input and output nodes
and the training time, a hidden-layer combination with good classification
performance is selected: 400, 200, and 50 nodes, respectively. Training goes through
two stages, pre-training and backward fine-tuning, with 100 iterations for
pre-training and 200 iterations for fine-tuning.
There are 1000 samples for each fault class, from which 50% are randomly selected as
training samples and the rest are used as testing samples to calculate the
classification accuracy. The samples are randomly drawn 20 times, and the average
accuracy over five runs is calculated for each group of samples. The classification
results of the three algorithms are shown in Fig. 5.36.

Fig. 5.36 The diagnostic performance of the three classes of algorithms
From Fig. 5.36, it can be seen that across the 20 tests the classification accuracy
of all three algorithms exceeds 95%: the DBN is relatively low at about 96%, while
the CNN and the stacked auto-encoder are relatively high, remaining above 99%. Since
the 20 sample selections are random, and the figure shows no large fluctuations in
the accuracy of any of the three algorithms, the three deep learning methods all
exhibit good robustness for diagnosing compound faults in the automotive
transmission.
The average accuracy over the 20 tests is taken as the final classification
accuracy, and the time each method requires to complete one task under its current
structural parameters is recorded. The results are shown in Table 5.26.
Table 5.26 The accuracy of the three classes of algorithms

Algorithms | Accuracy (%) | Time (s)
CNN | 99.37 | 169.91
SAE | 99.13 | 2196.25
DBN | 96.27 | 3265.82

Table 5.26 shows that under the selected parameters, the average classification
accuracy of all three algorithms exceeds 96%. The DBN averages 96.27%, lower than
the other two algorithms, while the CNN and SAE both exceed 99%, with the CNN
achieving the highest average accuracy of 99.37%. Comparing running times, the DBN
is the slowest at 3265.82 s, while the CNN is the fastest at 169.91 s, only about
1/19 of the DBN's running time and 1/13 of the SAE's.
In terms of both average classification accuracy and stability, all three algorithms
diagnose the time–frequency images well. Considering classification accuracy and
computing time together, the CNN has the highest fault diagnosis accuracy and a
running time far below the other two algorithms. The CNN therefore shows the best
performance in time–frequency image classification.

5.3.2.6 The Performance Analysis of Different Time–Frequency Methods Combined
with CNN

Since the input of the CNN must be two-dimensional, time–frequency transformation is
used to convert the vibration signal into time–frequency graphs. The characteristics
of the time–frequency analysis method influence the final recognition results when
the CNN is used to recognize the graphs. To explore the performance of different
time–frequency methods combined with the CNN, three commonly used methods are
selected: the short-time Fourier transform, the continuous wavelet transform, and
the S-transform. The Hamming window is used as the window function in the short-time
Fourier transform, and the Morlet wavelet is used as the wavelet basis in the
continuous wavelet transform.
(1) The comparative analysis of time–frequency graphs of the three methods
The experimental setup, fault setup, and signal collection in this part are the same
as before, giving signals in 10 health conditions. To facilitate the analysis, only
the time–frequency graphs under four conditions are examined: normal; fifth gear
single broken tooth; bearing inner ring fault with 0.2 mm width combined with fifth
gear single broken tooth; and bearing inner ring fault with 2 mm width combined with
fifth gear single broken tooth. The time–frequency graphs for these four conditions
are shown in Figs. 5.37, 5.38, 5.39, and 5.40, respectively.
From Figs. 5.37, 5.38, 5.39, and 5.40, it can be seen that under the same fault
state, the time–frequency graphs obtained by the three time–frequency methods are
similar. The first-order meshing frequency of the fifth gear, caused by the
conventional meshing vibration of the gear, is 478.94 Hz. This frequency component
is present whether or not a fault exists, so it appears in every time–frequency
graph. The energy-concentrated frequency bands in the figures lie mainly around 1500
and 2500 Hz; in these bands, the vibration signal excites the natural frequencies of
the transmission box, shafts, or gears, and the bands also contain higher-order
harmonics of the fifth gear meshing frequency. All three time–frequency transform
methods show that these energy-concentrated frequency bands are always present.

Fig. 5.37 The time–frequency graphs in normal condition: a short-time Fourier
transform, b continuous wavelet transform, c S-transform

Fig. 5.38 The time–frequency graphs in the fifth gear single broken tooth condition:
a short-time Fourier transform, b continuous wavelet transform, c S-transform

Fig. 5.39 The time–frequency graphs in the condition of inner ring fault with 0.2 mm
width and fifth gear single broken tooth: a short-time Fourier transform,
b continuous wavelet transform, c S-transform

Fig. 5.40 The time–frequency graphs in the condition of the inner ring fault with
2 mm width and fifth gear single broken tooth: a short-time Fourier transform,
b continuous wavelet transform, c S-transform
In the normal state, the vibration amplitude is small; it increases when the fifth
gear has a single broken tooth. The inner ring fault further increases the vibration
amplitude and simultaneously introduces a high-frequency band; this band is
especially obvious in the time–frequency graph when the inner ring fault width is
2 mm.
Among the three fault states, the fault degree is largest and the characteristics
most obvious when the 2 mm inner ring fault is combined with the fifth gear single
broken tooth, so this condition is taken as an example to analyze the
characteristics of the three time–frequency methods. The corresponding
time–frequency graphs are shown in Fig. 5.40, where a, b, and c are obtained by the
short-time Fourier transform, the continuous wavelet transform, and the S-transform,
respectively. The time width of the high-frequency components in Fig. 5.40a is wider
than in Fig. 5.40b, c, indicating insufficient time resolution. This is because the
short-time Fourier transform uses a window function of fixed length: once the window
is chosen, the time and frequency resolutions are fixed, and the two are inversely
related, so improving the time resolution degrades the frequency resolution. This
fixed-resolution shortcoming of the short-time Fourier transform is visible in the
figure. In contrast, the continuous wavelet transform and the S-transform avoid this
problem. From Fig. 5.40b, c, it can be seen that both give good time resolution for
the high-frequency components; however, comparing the frequency bands around 1500
and 2500 Hz in the two figures shows that the bands of the wavelet transform have
more concentrated energy and higher frequency resolution. Finally, the Morlet
wavelet is the basis used in this experiment, but many wavelet bases are available
in practice, each with its own characteristics, so choosing an appropriate basis is
a challenge when performing the continuous wavelet transform.
(2) The fault diagnosis results under the three time–frequency methods
According to the above analysis of the CNN parameters, the network parameters are
selected as follows: the first convolutional layer has 6 convolutional kernels of
size 5 × 5, and the second convolutional layer has 12 convolutional kernels of size
3 × 3; the pooling area of the downsampling layer is 2 × 2 with average pooling; the
batch size is set to 5; and the number of iterations is 10. The experiment is
repeated 5 times, and the average of the 5 classification results is taken as the
final classification accuracy. The relationship between the classification accuracy
and the number of iterations is shown in Fig. 5.41.
Fig. 5.41 The relationship between the accuracy and the iterations under different
time–frequency methods

From Fig. 5.41, it can be seen that the accuracy under all three time–frequency
methods increases with the number of iterations, exceeding 99% once the number of
iterations reaches 5. The short-time Fourier transform has the lowest classification
accuracy of the three methods and needs 5 iterations to reach stable results,
whereas the S-transform and the continuous wavelet transform stabilize after 4 and 2
iterations, respectively. Therefore, combining the continuous wavelet transform with
the CNN gives the best fault diagnosis performance: convergence is fast, and the
diagnosis result remains good even with few iterations.
To compare the robustness of the three time–frequency methods, samples are randomly
selected 20 times, and the average classification accuracy over five runs is
calculated for each group of samples. The CNN structural parameters are the same as
in the previous experiment, and the number of iterations is 5. Five iterations are
used because the analysis in Fig. 5.41 shows that all three time–frequency methods
reach about 100% classification accuracy at 5 iterations, so the stability of the
algorithms can be compared under this setting. The classification results under the
three methods are shown in Fig. 5.42.
Fig. 5.42 The robustness of the three time–frequency methods

From Fig. 5.42, it can be seen that among the three time–frequency transform
methods, the continuous wavelet transform performs best: across the 20 random sample
selections, only one classification accuracy is slightly lower, at 99.14%, while the
others are all above 99.9%. The short-time Fourier transform performs worst;
although its accuracy stays above 94% each time, the results are unstable and the
overall accuracy is below that of the other two methods. The accuracy under the
S-transform is above 99% in all but one run, which is slightly lower at 97.78%. The
classification results under the three time–frequency transform methods therefore
rank as: wavelet transform > S-transform > short-time Fourier transform.
Accordingly, the wavelet transform with the Morlet wavelet basis is selected to
combine with the CNN, giving better and relatively stable fault diagnosis.

5.3.3 Transmission Fault Diagnosis Under Variable Speed

In most cases, the speed the engine transmits to the gearbox is not constant during
vehicle operation: the gearbox input shaft speed changes with time, and the
vibration is more complicated than at a steady input speed. Owing to its structure,
the CNN is invariant to a certain degree of translation, scaling, and distortion.
Therefore, in this section, time–frequency analysis combined with the convolutional
neural network is used to diagnose gearbox faults under variable speed.

5.3.3.1 Experimental Setup

The experiment is again carried out on the three-axis five-speed transmission, and
the experimental equipment is consistent with Sect. 5.2.3. The gear fault is set on
the driven wheel of the fifth gear, which is cut to different degrees to simulate
three fault states: mild broken tooth, moderate broken tooth, and single broken
tooth. The bearing fault is located on the inner ring of the rolling bearing of the
output shaft, with a fault width of 0.2 mm. Combining the gear and bearing fault
states with the normal state gives a total of eight signal states to be identified,
as shown in Table 5.27.

Table 5.27 Fault conditions

Group | Fault condition
1 | Fifth gear normal condition
2 | Fifth gear mild broken tooth
3 | Fifth gear moderate broken tooth
4 | Fifth gear single broken tooth
5 | Inner ring fault with 0.2 mm width and fifth gear normal condition
6 | Inner ring fault with 0.2 mm width and fifth gear mild broken tooth
7 | Inner ring fault with 0.2 mm width and fifth gear moderate broken tooth
8 | Inner ring fault with 0.2 mm width and fifth gear single broken tooth

5.3.3.2 Fault Diagnosis Under the Speed-Up Condition

Variable speed means that the input shaft speed changes with time, covering three
cases: speeding up, slowing down, and alternately speeding up and down. Since the
speed-up case is similar to the slow-down case, only the speed-up case is analyzed
here. Speed-up refers to the gearbox input shaft speed increasing with time.
(1) Speed-up signal analysis
1. Simple Analysis in Time Domain and Frequency Domain
The sampling frequency is set to 12 kHz, and the vibration signals of each fault
state and normal state are collected. In order to show the overall change trend of the
signal with time, the time-domain signal of 60 s is selected for each type of fault
state, as shown in Fig. 5.43.
Under the speed-up condition, the vibration amplitude of the signal increases as the
input shaft speed rises. The fault degree also affects the vibration amplitude: the
greater the fault degree, the more intense the vibration and the larger the
amplitude. Because the input shaft speed in the experiment only follows an overall
increasing trend, the eight states cannot be guaranteed to have the same speed at
the same instant, so at a fixed time in the time-domain plots of the eight states,
the vibration amplitude of a more severe fault may be lower than that of a milder
one. The time-domain signal reflects only the vibration amplitude and whether
impacts are present; the fault state cannot be determined from the time-domain
signal alone.
The frequency components of the vibration signals and their corresponding vibration
amplitudes can be analyzed through the frequency spectrum. The vibration signal of
the fifth gear single broken tooth condition is selected and Fourier transformed;
the spectrum is shown in Fig. 5.44. The frequency components in Fig. 5.44 are so
numerous that the fault status cannot be determined simply by observing the
spectrum.

Fig. 5.43 The time domain vibration signal of each health condition

Fig. 5.44 The frequency spectrum of the fifth gear single broken tooth condition

When the rotating speed changes, the rotating frequency of each shaft changes and so
does the meshing frequency of the gears; each frequency component therefore occupies
an interval rather than a single line. The meshing frequencies of the constant-mesh
gear pair and the fifth output gear, f_m1 and f_m2, are calculated as follows:

f m1 = z 1 × f n1 (5.32)

f m2 = z 4 × f n3 (5.33)

where, z 1 is the constant meshing gear drive teeth, z 1 = 26, f n1 is the input shaft
frequency, z 4 is five gear driven teeth, z 4 = 22, f n3 is the output shaft frequency.
When the input-shaft speed slowly rises from 0 to 1500 rpm, the input-shaft rotating frequency f_n1 gradually increases from 0 to 25 Hz. The maximum meshing frequencies of the constant-mesh gear pair and the fifth-gear pair, f_m1max and f_m2max, are calculated as follows:

f_m1max = z_1 × f_n1max = 26 × 25 = 650 (Hz)   (5.34)

f_m2max = z_4 × f_n3max = 22 × f_n1max × (26/38) × (42/22) = 718.42 (Hz)   (5.35)

The meshing frequency of the constant-mesh gear pair therefore sweeps over 0 ∼ 650 Hz, and that of the fifth-gear pair over 0 ∼ 718.42 Hz. If both gear pairs also exhibit a second-order vibration response, these ranges extend to 0 ∼ 1300 Hz and 0 ∼ 1436.84 Hz respectively. Because the meshing frequency varies with the rotating speed, each frequency component occupies a band rather than a line, which easily masks the fault-related frequency components. It is therefore difficult to perform fault diagnosis from the frequency spectrum under variable-speed conditions.
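The frequency ranges in Eqs. (5.32)–(5.35) follow directly from the tooth counts and the speed sweep. As a minimal sketch, the calculation can be reproduced as follows; the tooth counts (26, 38, 42, 22) and the 0–1500 rpm sweep are taken from the text, while the helper function itself is only illustrative:

```python
def mesh_freq_ranges(f_n1_max=25.0, z1=26, z2=38, z3=42, z4=22):
    """Return the maximum first-order meshing frequencies (Hz) of both gear pairs."""
    f_m1_max = z1 * f_n1_max                     # constant-mesh pair, Eq. (5.34)
    f_n3_max = f_n1_max * (z1 / z2) * (z3 / z4)  # output-shaft rotating frequency
    f_m2_max = z4 * f_n3_max                     # fifth-gear pair, Eq. (5.35)
    return f_m1_max, f_m2_max

f_m1_max, f_m2_max = mesh_freq_ranges()
print(f_m1_max, round(f_m2_max, 2))  # 650.0 718.42
```

With a second-order response, both values simply double, giving the 1300 Hz and 1436.84 Hz upper bounds quoted above.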
Fig. 5.45 The time–frequency graph of the signal from the 20 to 40 s in the condition of gear single
broken tooth
2. Time–frequency Analysis
When combined with the convolutional neural network, the time–frequency transform methods rank as follows: continuous wavelet transform > S-transform > short-time Fourier transform. Therefore, the continuous wavelet transform is adopted as the time–frequency transform, with the Morlet wavelet as the wavelet basis.
To observe the time–frequency variation trend of the whole signal under the variable-speed condition, the time-domain signal of the broken-tooth condition from 20 to 40 s is intercepted and the continuous wavelet transform is applied; the resulting time–frequency diagram is shown in Fig. 5.45.
As can be seen in Fig. 5.45, there is a large-amplitude band near 400 Hz, which is the gear meshing frequency component. As the rotation speed increases, the meshing frequency rises slowly, appearing in the time–frequency graph as a band that follows a line with a small positive slope. The energy-concentrated bands around 1500 and 2500 Hz are the natural frequency components of the gearbox system excited by the impact component; their frequency values are not affected by the rotational speed, but their amplitudes increase with it. In addition, the number of impacts grows with the rotation speed and the number of meshing events per unit time, so the impact components become increasingly dense in the time–frequency diagram.
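The Morlet CWT step described above can be sketched with a minimal hand-rolled transform; all parameters here (segment length, frequency grid, the 400 Hz test tone standing in for a mesh component) are assumptions for illustration, and a production pipeline would use a wavelet library such as PyWavelets instead:

```python
import numpy as np

def morlet_cwt(x, fs, freqs, w0=6.0):
    """Minimal Morlet CWT: |coefficients| with shape (len(freqs), len(x))."""
    tfr = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        scale = w0 * fs / (2 * np.pi * f)              # scale whose centre frequency is f
        M = min(int(10 * scale), len(x) - 1) | 1       # odd kernel length, ~10 scales wide
        t = (np.arange(M) - M // 2) / scale
        wavelet = np.exp(1j * w0 * t - t**2 / 2) / np.sqrt(scale)
        tfr[i] = np.abs(np.convolve(x, np.conj(wavelet), mode="same"))
    return tfr

fs = 12_000                                            # sampling rate from the text
t = np.arange(0, 0.25, 1 / fs)                         # a short segment, speed ~constant
x = np.sin(2 * np.pi * 400 * t)                        # toy stand-in for the ~400 Hz mesh band
freqs = np.linspace(100, 1000, 32)
tfr = morlet_cwt(x, fs, freqs)
peak_f = freqs[tfr.mean(axis=1).argmax()]              # ridge sits near 400 Hz
```

The row with the largest mean amplitude marks the ridge of the time–frequency map, which for a real signal traces the slowly rising meshing-frequency band discussed above.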
To illustrate the differences among samples of the same class, two signals of the same length at different times are selected for time–frequency transformation. Two segments of 0.5 s in length, at the 10th and 60th second of the fifth-gear broken-tooth signal, are taken for the continuous wavelet transform; the resulting time–frequency diagrams are shown in Fig. 5.46.
a) Continuous wavelet transform of the signal at the 10th second b) Continuous wavelet transform of the signal at the 60th second
Fig. 5.46 The time–frequency graphs of the 0.5 s length signal of the fifth gear single broken tooth
condition
As can be seen in Fig. 5.46, the amplitude of the time–frequency diagram at 60 s is significantly larger than that at 10 s, owing to the increasing speed of the input shaft. Moreover, the impact frequency rises with the rotation speed, because more teeth enter the mesh per unit time.
When time–frequency images of different fault signals are recognized, each sample is the amplitude matrix of the time–frequency image of a short, fixed-length signal segment. Under constant speed, samples of the same class are essentially identical if noise and random factors are ignored. Under variable speed, the gear meshing frequency, the bearing fault characteristic frequencies, the impact frequency and the amplitudes of the frequency components all change with time, but the samples remain similar overall. Given the CNN's invariance to image translation, scaling and distortion, the convolutional neural network is used for time–frequency image recognition under variable speed.
(2) Time–frequency image analysis
From time zero to time t_1, the rotational speed of the input shaft increases from 0 to n_1. Taking a certain moment t_0 within this period as the boundary, the time–frequency graphs of the signals in the period 0–t_0 are taken as the training sample set (500 training samples in total), and those in the period t_0–t_1 as the test sample set (500 test samples in total). The time–frequency graphs obtained by the time–frequency transform are resized to 32 × 32 so that they can be used as the input of the convolutional neural network. During image recognition there are eight classes of time–frequency graphs corresponding to the different states, each with 1000 samples, of which training and test samples account for 50% each.
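The resizing and chronological 50/50 split above can be sketched as follows; the toy amplitude matrices and the nearest-neighbour resize are assumptions (a real pipeline would resize with an image library), but the 32 × 32 target size and the earlier-trains/later-tests split match the text:

```python
import numpy as np

def resize_tf_image(img, out_shape=(32, 32)):
    """Nearest-neighbour resize of a time-frequency amplitude matrix."""
    rows = (np.arange(out_shape[0]) * img.shape[0] / out_shape[0]).astype(int)
    cols = (np.arange(out_shape[1]) * img.shape[1] / out_shape[1]).astype(int)
    return img[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
# toy stand-ins: 8 classes x 2 (freq x time) amplitude matrices per class
images = rng.random((16, 64, 600))
samples = np.stack([resize_tf_image(im) for im in images])
labels = np.repeat(np.arange(8), 2)          # image i belongs to class i // 2
# chronological split: the earlier segment of each class trains, the later one tests
train_x, test_x = samples[::2], samples[1::2]
```

Splitting by time rather than at random matters here: the test set then contains speeds the network has never seen, which is exactly the generalization being measured.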
Three deep learning algorithms, namely the convolutional neural network, the deep belief network and the stacked autoencoder, were used to identify the time–frequency images under the speed-up condition.
Based on the parameter analysis of the convolutional neural network, its parameters were selected as follows: six convolution kernels of size 5 × 5 in the first convolution layer and 12 kernels of size 3 × 3 in the second; each down-sampling layer uses a 2 × 2 pooling region with average pooling; the batch size is 5 and the number of iterations is 10.
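Under those stated parameters (6 kernels of 5 × 5, then 12 of 3 × 3, 2 × 2 average pooling, eight output classes), the network can be sketched in PyTorch as follows; the activation choice and the fully connected head are assumptions not specified in the text:

```python
import torch
import torch.nn as nn

class GearboxCNN(nn.Module):
    """Sketch of the CNN described above for 1 x 32 x 32 time-frequency images."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),  nn.ReLU(), nn.AvgPool2d(2),  # 6 x 14 x 14
            nn.Conv2d(6, 12, kernel_size=3), nn.ReLU(), nn.AvgPool2d(2),  # 12 x 6 x 6
        )
        self.classifier = nn.Linear(12 * 6 * 6, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = GearboxCNN()
logits = model(torch.randn(5, 1, 32, 32))   # batch size 5, as in the text
```

The feature maps shrink from 32 × 32 to 28 × 28 after the first convolution, to 14 × 14 after pooling, to 12 × 12 and finally 6 × 6, giving a 432-dimensional vector before the classification layer.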
The parameters of the deep belief network are: 1024 input-layer nodes, 8 output-layer nodes, and hidden layers of 1000, 800 and 500 nodes. In the pre-training phase, each Restricted Boltzmann Machine is trained with unlabeled data for 100 iterations; in the fine-tuning phase, the pre-trained network is combined with a softmax classifier to form a classification model, and the network is fine-tuned by backpropagation with labeled data for 50 iterations.
The parameters of the stacked autoencoder are as follows: 1024 input-layer nodes, 8 output-layer nodes, and three hidden layers of 400, 200 and 50 nodes respectively. The pre-training phase runs for 100 iterations. Owing to the computation time of this structure, the classification accuracy is evaluated every 20 fine-tuning iterations, up to 200 iterations.
Each experiment was repeated 5 times, and the average of the 5 classification results was taken as the final classification accuracy. The relationship between the classification accuracy of the three deep learning algorithms and the number of iterations is shown in Fig. 5.47.
Figure 5.47a shows the training and test classification accuracy of the convolutional neural network for recognizing time–frequency images under the speed-up condition. When the number of iterations increased from 1 to 2, the classification accuracy of both the training and test samples increased significantly. After four iterations the training accuracy exceeded 99%, and after six iterations it reached 99.9%, indicating that the convolutional neural network fits the training samples very well and that further iterations add little. For the test samples, the accuracy stayed above 90% after four iterations but did not keep rising with the number of iterations, instead fluctuating slightly; it reached a maximum of 98% after 7 iterations.
The deep belief network underwent both pre-training and fine-tuning, and
Fig. 5.47b shows the curve of training and test accuracy with the number of fine-
tuning iterations. As can be seen from the graph, the training accuracy increased from
15.01% to 97.7% and the test accuracy increased from 13.98% to 86.78% when the
number of fine-tuning iterations increased from 1 to 2. The classification accuracy
of training and testing continues to increase slightly as the number of fine-tuning
iterations increases. When the number of iterations is 12, the training accuracy is
nearly 100%, and the test accuracy is 93.9%. After 27 iterations, the test accuracy
remained above 96% and reached the maximum of 96.71% after 40 iterations.
a) Convolutional Neural Network b) Deep Belief Networks
c) Stacked autoencoder
Fig. 5.47 The accuracy of different algorithms under the speed-up working condition
The stacked autoencoder is also pre-trained and fine-tuned. Figure 5.47c shows the change of training and test accuracy with the number of fine-tuning iterations. Because the whole training set is input at once and the network fits all training samples, the training accuracy is already 100% after pre-training. A value of 0 on the horizontal axis represents the classification accuracy without fine-tuning, which is 84.72% after 100 iterations of pre-training. After 60 fine-tuning iterations, the test accuracy stays above 98%, reaching its maximum of 99.43% at 120 iterations.
Overall, once the number of training or fine-tuning iterations is large enough, the training accuracy of all three deep learning algorithms essentially reaches 100%, indicating that all three can fully fit the training samples. The highest test accuracy of each algorithm is above 96%, which shows that all three deep learning algorithms can classify and identify the gearbox fault time–frequency diagrams under the speed-up condition.
Comparing the three algorithms, the stacked autoencoder has the highest test accuracy (99.43%), followed by the convolutional neural network (98%), while the deep belief network reaches 96.71%. As far as the
Table 5.28 The highest training and test accuracy in the four algorithms
Algorithm The highest training accuracy (%) The highest test accuracy (%)
CNN 100 98
DBN 100 96.71
SAE 100 99.43
SVM 99.975 46.025
stability of the algorithm is concerned, both the deep belief network and the stacked
autoencoder are stable, while the convolutional neural network fluctuates slightly.
Considering the computation time required to reach the maximum test accuracy, a single run of the convolutional neural network takes 88.9 s, about one-fifth of the deep belief network's time and one-eleventh of the stacked autoencoder's. From the above analysis, it can be concluded that the convolutional neural network can effectively identify time–frequency images of gearbox faults under the speed-up condition. Although its test accuracy fluctuates slightly and is less stable than that of the deep belief network and the stacked autoencoder, its time cost is much lower than that of the other two algorithms.
The support vector machine, a shallow machine learning algorithm, is also applied to time–frequency image recognition. Its parameters are set as follows: the input data format is first adjusted with the tools in LIBSVM [23]; the radial basis function is used as the kernel; the penalty coefficient C and the kernel parameter g are selected by cross-validation and grid search; and the whole training sample set is then trained with the best parameters to obtain the support vector machine model.
Table 5.28 shows the highest training and test accuracy of the support vector machine and the three deep learning algorithms.
As can be seen from Table 5.28, the training process of the optimized support vector machine is successful, with a maximum training accuracy of 99.975% on the time–frequency images. However, its highest test accuracy is only 46.025%, far below that of the three deep learning algorithms. This shows not only that the deep learning algorithms outperform the shallow support vector machine, but also that the support vector machine is unsuitable for time–frequency image recognition under speed-up conditions.
5.3.3.3 Fault Diagnosis Under the Condition of Speed-Up and Speed-Down
The vibration under the speed-up and speed-down condition is similar to that under the speed-up condition: the higher the rotating speed, the larger the vibration amplitude, the higher the impact frequency, and the higher the gear meshing frequency.
Fig. 5.48 The rotational speed curve of the input shaft
Since each time–frequency sample is obtained by transforming a short signal segment over which the rotation speed can be regarded as constant, and the fault settings are the same as under the speed-up condition, the time-domain signals and time–frequency diagrams under the speed-up and speed-down condition are not analyzed again.
The speed regulation process of the input shaft is as follows: the speed first rises to a high value n_1 at time t_1, then gradually decreases to n_2 at time t_2, rises again to n_3 at time t_3, and finally decreases to n_4 at time t_4. Figure 5.48 shows how the input-shaft speed varies with time. The moment t_0 in the figure is the dividing line between the training and test samples: the time–frequency graphs of the signals in the period t_1–t_0 form the training sample set (500 training samples in total), and those in the period t_0–t_4 form the test sample set (500 test samples in total).
After collecting the vibration signals of the various fault states and the normal state, a large number of time–frequency maps are obtained with the time–frequency transform and resized to 32 × 32 so that they can serve as input samples of the convolutional neural network. The time–frequency transform is still the continuous wavelet transform with the Morlet wavelet basis. In the image recognition process there are eight classes of time–frequency graphs corresponding to the different states, each with 1000 samples, of which training and test samples account for 50% each.
The convolutional neural network, deep belief network and stacked autoencoder
are still used to identify the time–frequency images under speed-up and speed-down
conditions, and the parameters of each network are consistent with those under
speed-up conditions.
The experiments with each algorithm were repeated 5 times, and the average of the 5 classification results was taken as the final classification accuracy. The relationship between the training and test classification accuracy of the different algorithms and the number of iterations is shown in Fig. 5.49.
a) Convolutional Neural Network b) Deep Belief Network
c) Stacked autoencoder
Fig. 5.49 The accuracy of different algorithms under the up and down speed working condition
Figure 5.49a shows the training and test classification accuracy of the CNN for recognizing time–frequency images under speed-up and speed-down conditions. When the number of iterations is no more than 5, the accuracy on both the training and test samples increases with the number of iterations. After 8 iterations, the training accuracy stays above 99.9%. For the test samples, after 5 iterations the accuracy no longer increases with the number of iterations but fluctuates noticeably; it remains near 90% and reaches a maximum of 96% at eight iterations.
Figure 5.49b shows the change of the training and test accuracy of the DBN with the number of fine-tuning iterations. As can be seen from the graph, the training accuracy increased from 12.84 to 97.61% and the test accuracy from 10.79 to 62.8% when the number of iterations increased from 1 to 2. As the number of fine-tuning iterations increases further, the classification accuracy of training and testing continues to rise. At 17 iterations the training accuracy is nearly 100% and the test accuracy is 89.93%. The test accuracy then remains stable and reaches its maximum of 93.05% after 49 iterations.
Figure 5.49c shows the training and test accuracy of stacked autoencoder as a
function of the number of fine-tuning iterations. A value of 0 on the horizontal axis
represents the classification accuracy without fine-tuning, which is 93.48% after 100
Table 5.29 The highest training and test accuracy in the four algorithms
Algorithm The highest training accuracy (%) The highest test accuracy (%)
CNN 100 96
DBN 100 93.05
SAE 100 97.55
SVM 97.75 71.65
iterations of pre-training. After 120 fine-tuning iterations, the test accuracy was maintained above 96% and reached its maximum of 97.55% at 160 iterations.
The highest classification accuracy of the three kinds of deep learning algorithms
is above 93%, which shows that the three kinds of deep learning algorithms can
recognize the gearbox fault time–frequency graph under the condition of speed-up
and speed-down. Considering the maximum test accuracy of the three algorithms, the stacked autoencoder is the highest at 97.55%, followed by the convolutional neural network at 96%, while the deep belief network reaches 93.05%. As far as the stability of the algorithms is concerned, both the DBN and the stacked autoencoder are stable, while the CNN fluctuates slightly.
The time–frequency images are also identified by the support vector machine. Table 5.29 shows the highest training and test accuracy of the support vector machine and the three deep learning algorithms.
As can be seen from Table 5.29, the optimized support vector machine achieves a highest test accuracy of only 71.65% on the time–frequency images, lower than that of the three deep learning algorithms. This shows that the deep learning algorithms outperform the shallow support vector machine in identifying time–frequency images under speed-up and speed-down conditions.
5.4 Deep Learning Based Equipment Degradation State Assessment
Fatigue damage produced by mechanical and thermal stress often occurs in large machines and prevents them from working normally. Condition-monitoring-based PHM of power equipment has received wide attention in recent years; it consists of several parts: condition monitoring and data acquisition, feature extraction and selection, fault diagnosis and health assessment, system maintenance policy, etc. Based on multi-sensor monitoring, various fault-correlated features are extracted from multiple types of monitoring signals (vibration, temperature, oil pressure, acoustic emission, electrical signals, etc.) for fault detection and residual life prediction, which can reduce the cost of system maintenance and effectively avoid shutdown accidents.
Fig. 5.50 The architecture of an AE
When facing the uncertainty of equipment degradation, a shallow network can hardly extract deep degradation features in a highly abstract way and obtains only general shallow features. Assessing equipment performance degradation involves adjusting and adapting the parameters of the evaluation model to different equipment, while also requiring the model to perform deep feature extraction and selection on the collected signals. Fortunately, DL-based methods can accomplish this assessment.
To address two challenging problems in degradation state assessment: extracting
multi-dimensional features and modeling the correlation of the degraded time series
signal, a novel DAE-LSTM based equipment degradation state assessment method is
introduced in this section. Furthermore, a case study of the milling tool degradation
state assessment is shown at the end.
5.4.1 Stacked Autoencoder
5.4.1.1 AutoEncoder
An AutoEncoder (AE) [24] is a particular type of neural network whose architecture is shown in Fig. 5.50. The network learns the mapping g(σ(X)) ≈ X so as to minimize the reconstruction error, where σ(X) denotes the encoder and g(Y) the decoder. Specifically, the encoder performs the nonlinear transformation from X to Y, and the decoder reconstructs the output Z from Y.
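The encoder–decoder pair of Fig. 5.50 can be sketched in PyTorch as follows; the layer widths and the sigmoid encoder activation are assumptions, since the text does not fix them:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Sketch of an AutoEncoder: encoder sigma(X) -> Y, decoder g(Y) -> Z ~ X."""
    def __init__(self, n_in=24, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

ae = AE()
x = torch.randn(16, 24)
recon = ae(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction error to be minimized
```

Training simply minimizes this reconstruction loss with any gradient-based optimizer, which is what "minimizing the error of reconstruction" amounts to in practice.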
5.4.1.2 Network Architecture
Similar to the DBN, the SAE [14] is a stack of multiple AEs, as shown in Fig. 5.51. The output of each AE, trained in an unsupervised way, is used as the input of the next AE, and features are extracted from low level to high level by multilayer learning.
Fig. 5.51 Stacked AutoEncoder
The procedure of multilayer learning can be summarized as follows. First, after the first AE (AE1) is trained, feature 1 output by its hidden layer is used as the input of the second AE (AE2). Second, feature 2 is obtained by training AE2 in an unsupervised way. Third, all the AEs are trained by repeating the above steps, yielding an SAE network with multiple hidden layers. Finally, a classification layer is appended after the final feature layer of the SAE, and the whole network is trained in a supervised way, producing a deep network capable of both feature extraction and data classification.
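The greedy layer-wise procedure above can be sketched as follows; the training loop, layer sizes and optimizer settings are assumptions chosen only to make the example runnable:

```python
import torch
import torch.nn as nn

def train_ae(data, n_hidden, epochs=50, lr=1e-2):
    """Train one AE unsupervised; return its encoder and the extracted features."""
    n_in = data.shape[1]
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
    dec = nn.Linear(n_hidden, n_in)
    opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)  # reconstruction error
        loss.backward()
        opt.step()
    return enc, enc(data).detach()

x = torch.randn(64, 24)
enc1, feat1 = train_ae(x, 16)      # AE1: raw input  -> feature 1
enc2, feat2 = train_ae(feat1, 8)   # AE2: feature 1 -> feature 2
classifier = nn.Linear(8, 8)       # classification layer appended for fine-tuning
```

Supervised fine-tuning would then backpropagate a classification loss through `classifier`, `enc2` and `enc1` jointly, which is the final step described above.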
5.4.2 Recurrent Neural Network
The Recurrent Neural Network (RNN) architecture was independently proposed by Jordan [6] and Elman [7]. It is characterized by allowing the output of some nodes to affect the subsequent input to the same nodes, which can be denoted by:

Y_t = Fun(X_t, Y_{t−1})   (5.36)
Fig. 5.52 A simplified architecture of a three-layered Recurrent Neural Network
A simplified architecture of a three-layered RNN is shown in Fig. 5.52. The forward propagation of an RNN is very similar to that of a traditional neural network, except that it introduces historical data. However, the traditional BP algorithm cannot be used for model training, because RNNs propagate computation through time. Therefore, Back Propagation Through Time (BPTT) [25] was developed from the traditional BP algorithm. Training an RNN with BPTT can be summarized as follows: first, the output of each neuron is computed by forward propagation; second, the error term of each neuron is computed by BPTT; finally, the weights are updated with the gradient descent method.
The weight matrix of an RNN is shared across time steps; hence the chain rule produces repeated multiplications when computing derivatives. These repeated multiplications can make the back-propagated errors vanish or explode when the number of time steps is large, so the RNN cannot be trained. Long Short-Term Memory (LSTM) [26] was therefore proposed to solve this problem, in which the nodes of the traditional RNN are reconstructed as shown in Fig. 5.53.
An increased number of recurrent steps may drive the activation functions into their saturated-gradient region. To reduce this risk, the input gate, output gate and forget gate are introduced in the LSTM to control the flow of information through the network; these gates also benefit parameter optimization when the network contains more layers. The functions and procedures of these gates are summarized as follows. First, the input gate decides what
Fig. 5.53 The architecture of LSTM
information can be added to the network: (1) the candidate vector g_s is obtained by passing the current input x_t and hidden state h_{t−1} through the tanh function; (2) the vector i_s is obtained by passing the same input and hidden state through the sigmoid function, which maps values to the range 0 to 1, and i_s decides what information of g_s is fed into the next calculation. Second, the forget gate decides what information should be kept and what should be ignored: the vector f_s is obtained by passing the current input x_t and hidden state h_{t−1} through the sigmoid function, and f_s decides what information of the previous cell state S_{t−1} should be preserved. Finally, the output gate decides what information of the new cell state S_t is passed to the next layer: the vector O_s is obtained by passing the current input x_t and hidden state h_{t−1} through the sigmoid function.
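One LSTM step with these three gates can be sketched in NumPy as follows; stacking the four gate transforms into single W, U, b parameters and the tiny dimensions are assumptions made for compactness:

```python
import numpy as np

def lstm_step(x_t, h_prev, s_prev, W, U, b):
    """One LSTM step: input (i), forget (f), output (o) gates and candidate (g)."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    z = W @ x_t + U @ h_prev + b
    i_s, f_s, o_s, g_pre = np.split(z, 4)
    i_s, f_s, o_s = sigmoid(i_s), sigmoid(f_s), sigmoid(o_s)
    g_s = np.tanh(g_pre)               # candidate vector from tanh
    s_t = f_s * s_prev + i_s * g_s     # forget gate keeps old state, input gate adds new
    h_t = o_s * np.tanh(s_t)           # output gate filters the new cell state
    return h_t, s_t

n_in, n_hid = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * n_hid, n_in))
U = rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, s = np.zeros(n_hid), np.zeros(n_hid)
h, s = lstm_step(rng.standard_normal(n_in), h, s, W, U, b)
```

Because the cell state s_t is carried forward additively rather than through repeated matrix multiplication, the gradients along it do not vanish or explode as readily, which is how the LSTM addresses the RNN training problem described earlier.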
5.4.3 DAE-LSTM Based Tool Degradation State Assessment
As noted above, a shallow network can hardly extract deep degradation features in a highly abstract way when facing the uncertainty of the equipment degradation state, and obtains only general shallow features. As a typical deep learning problem, the assessment of equipment performance degradation involves adjusting and adapting the model parameters to different equipment, while requiring deep feature extraction and selection from the collected signals. However, feature selection and dimension reduction become more difficult because of the significant information duplication in the equipment performance monitoring data. The Deep Auto-Encoder (DAE) [27] helps to solve this problem: it contains multiple hidden layers that learn from the training data in an unsupervised way and achieve a good reconstruction effect. Meanwhile, to obtain a quantitative judgment of the equipment degradation state, the cross-correlation of the time-series data should be incorporated into the degradation assessment model when comprehensively judging the equipment performance.
To solve these two problems, namely the extraction and reduction of multi-dimensional features and the time-series correlation modeling of degradation signals, a DAE-LSTM based equipment degradation assessment method was proposed. Its procedure can be summarized as follows. First, the feature extractor is obtained by unsupervised self-learning dimension reduction and supervised reverse fine-tuning. Second, the optimized feature sequence is used as the input of the LSTM. Finally, the LSTM captures the cross-correlation of the degradation process information, and the complete information of the degradation process data is used to quantitatively evaluate the equipment degradation state.
The procedure of the DAE-LSTM based degradation assessment method is shown in Fig. 5.54. First, statistical features are extracted from the multi-sensor monitoring signals, and the degradation feature dataset in the training set is used as the input of the DAE. Second, the DAE extracts, in an unsupervised self-learning way, a low-dimensional degradation signal highly correlated with the fault from the high-dimensional feature signal. To ensure the maximum correlation between the dimension-reduced coding and the fault features, the weight parameters of the DAE are adjusted by fine-tuning. Finally, the fine-tuned DAE codings are arranged in chronological order and used as the input of the LSTM. Two further problems arise when constructing the DAE and LSTM networks: (1) the numbers of nodes cannot be selected when the number of layers is uncertain, and (2) the network parameters become excessive when there are too many layers. Therefore, the middle hidden layers are stacked so that their node counts reduce to two parameters, namely the number of middle hidden layers and the number of nodes per middle hidden layer. These network structure parameters can be determined by Particle Swarm Optimization (PSO) [28].
To verify the effectiveness of the proposed DAE-LSTM based equipment degradation assessment method on industrial data, milling tool wear data were analyzed. The experimental data come from the NASA Ames Research Center and include 16 sets of monitoring data on tool wear degradation [29]. Each set contains signals of different modalities acquired during tool wear, such as vibration signals, acoustic emission signals and current signals, collected at a sampling frequency of 250 Hz. The working conditions and the final wear of the milling tool in each group are shown in Table 5.30.
The raw time-domain signals obtained by each monitoring sensor during the first
sampling of CASE1 are shown in Fig. 5.55, which are the current signal of the AC
spindle motor, the current signal of the DC spindle motor, the vibration signal of
the working table, the vibration signal of the machine spindle, the acoustic emission
signal of the working table, the acoustic emission signal of the machine spindle,
respectively.
From the time domain data, it can be seen that the milling cutter has an entry phase,
a stable cutting phase and an exit phase when performing the cutting process. The
signals of the stable cutting phase are selected for analysis. Four time-domain features, namely the effective (RMS) value, absolute mean value, variance and peak value, are extracted from each of the six sensor monitoring channels. The training set formed from these features is used as the training samples of the DAE, which performs feature extraction and dimension reduction. To determine the network structure parameters, the reconstruction error after dimension-reduction coding is used as the fitness for the parameter updates of the particle swarm algorithm. The regression output of the dimension-reduced features is shown in Fig. 5.56, where the label of the regression model is the wear of the milling cutter.
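The four time-domain features over the six channels can be sketched as follows; the toy signal segment is an assumption, but the feature definitions and the 6 × 4 = 24-dimensional result match the text:

```python
import numpy as np

def time_domain_features(segment):
    """Effective (RMS) value, absolute mean, variance and peak of one channel."""
    return np.array([
        np.sqrt(np.mean(segment**2)),   # effective (RMS) value
        np.mean(np.abs(segment)),       # absolute mean value
        np.var(segment),                # variance
        np.max(np.abs(segment)),        # peak value
    ])

rng = np.random.default_rng(0)
cut = rng.standard_normal((6, 250))     # toy stand-in: 6 sensors, 1 s at 250 Hz
features = np.concatenate([time_domain_features(ch) for ch in cut])  # 24-dim vector
```

One such 24-dimensional vector per cutting pass, stacked over all passes, forms the high-dimensional feature sequence that the DAE compresses before it is fed to the LSTM.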
When performing the feature extraction and dimension reduction of the CASE1 data, the data of the other 15 cases are used as the training set. The extracted features of the CASE1 data are passed into the regression model, and the output is shown in Fig. 5.56a. Likewise, when performing the feature extraction and dimension reduction of the CASE2 data, the data of the other 15 cases are used as the training set; the extracted features of the CASE2 data are passed into the regression model, and the output is shown in Fig. 5.56b.

Fig. 5.54 A multi-dimensional features and DAE-LSTM based method for equipment performance degradation assessment

Table 5.30 The working conditions and wear status of the milling tool data

CASE                1          2          3          4          5      6      7      8
Number of samples   17         14         16         7          6      1      8      6
Wear status         0.44       0.55       0.55       0.49       0.74   0      0.46   0.62
Cutting depth       1.5        0.75       0.75       1.5        1.5    1.5    0.75   0.75
Cutting speed       0.5        0.5        0.25       0.25       0.5    0.25   0.25   0.5
Materials           Cast iron  Cast iron  Cast iron  Cast iron  Steel  Steel  Steel  Steel

CASE                9          10         11         12         13     14     15     16
Number of samples   9          10         23         15         15     10     7      6
Wear status         0.81       0.7        0.76       0.65       1.53   1.14   0.7    0.62
Cutting depth       1.5        1.5        0.75       0.75       0.75   0.75   1.5    1.5
Cutting speed       0.5        0.25       0.25       0.5        0.25   0.5    0.25   0.5
Materials           Cast iron  Cast iron  Cast iron  Cast iron  Steel  Steel  Steel  Steel

From the regression output, it can be seen that
the information retained by the dimension-reduction coding is highly correlated with the degree of wear. After training and fine-tuning with labeled data, the dimension-reduction coding follows the changing trend of the wear, with only a deviation in amplitude, which indicates that the DAE is effective for feature extraction and dimension reduction of multi-dimensional sensor feature sets.
Therefore, the low-level network of the regression model, which is the DAE, can be
retained as the feature extractor of the newly monitored degradation data.
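The DAE here is a deep network whose structure is tuned by particle swarm optimization; as a greatly simplified stand-in that only illustrates the encode/decode loop and the reconstruction error used as the training (and fitness) signal, consider this minimal single-layer linear autoencoder in NumPy (all names and sizes are illustrative assumptions):

```python
import numpy as np

def train_linear_autoencoder(X, n_code, lr=0.01, epochs=500, seed=0):
    """Minimal linear autoencoder: X -> code -> reconstruction, trained by
    batch gradient descent on the mean squared reconstruction error.
    Returns the encoder matrix W and the per-epoch error history."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.1 * rng.standard_normal((d, n_code))   # encoder weights
    V = 0.1 * rng.standard_normal((n_code, d))   # decoder weights
    errors = []
    for _ in range(epochs):
        H = X @ W                     # low-dimensional code
        E = H @ V - X                 # reconstruction residual
        errors.append(float(np.mean(E ** 2)))
        gV = (H.T @ E) / n            # gradient w.r.t. decoder
        gW = (X.T @ (E @ V.T)) / n    # gradient w.r.t. encoder
        V -= lr * gV
        W -= lr * gW
    return W, errors
```

After training, `X @ W` plays the role of the dimension-reduction coding; in the chapter it is the trained low-level (encoder) part that gets reused, while the PSO search would use the final reconstruction error as its fitness.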
When the DAE feature extractor has been constructed, the dimension-reduction codes, grouped according to the chosen time step, are used as the input of the LSTM, and the degree of wear is used as the degradation label.
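Concretely, setting the time step amounts to grouping consecutive dimension-reduction codes into overlapping windows, with the wear value at the end of each window as its label; a small sketch (names are ours) might be:

```python
import numpy as np

def make_lstm_sequences(codes, wear, time_steps):
    """Turn a (n_samples, code_dim) sequence of codes into LSTM inputs of
    shape (n_windows, time_steps, code_dim); each window is labeled with
    the wear value at its last position."""
    X, y = [], []
    for i in range(len(codes) - time_steps + 1):
        X.append(codes[i:i + time_steps])
        y.append(wear[i + time_steps - 1])
    return np.array(X), np.array(y)
```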
For CASE1, the DAE is trained on the data of all cases except CASE1 and is then used to perform feature extraction and dimension reduction on the CASE1 data; the data of the other cases are likewise used for the labeled training of the LSTM. The model prediction output for the CASE1 data is shown in Fig. 5.57a; in the same way, the model prediction output for the CASE2 data is shown in Fig. 5.57b.
The experimental results show that a deep network highly fitted to the training data can be obtained and used as a feature extractor. After its regression layer is removed, the low-level network still extracts effective features from similar test data, which indicates that the shallow layers of a deep network learn general features, on which the deeper layers further build. In the end, a deep learning model suitable for equipment degradation assessment can be obtained by passing the low-level feature coding into other deep network models for post-processing.

Fig. 5.55 The time domain signals of the monitoring sensors for the milling tool conditions

Fig. 5.56 The output of the regression model

Fig. 5.57 The diagnostic results of DAE-LSTM

Chapter 6
Phase Space Reconstruction Based on Machinery System Degradation Tracking and Fault Prognostics

6.1 Phase Space Reconstruction

In recent years, chaotic time series analysis has been widely used in many fields, such
as mathematics, physics, meteorology, information science, economy and biology;
the study of chaotic time series analysis has become one of the frontier topics of
nonlinear science. Chaos theory links determinism and randomness, two tradition-
ally completely independent and contradictory concepts, to each other. It holds that
there are deterministic laws behind phenomena that are regarded as random irregu-
larity. Theoretically, chaotic dynamical systems theory can establish a deterministic
mathematical model to describe stochastic and irregular systems, thus providing a
deterministic theoretical framework to explain complex phenomena in the real world
[1]. In 1980, Packard et al. [2] first proposed the reconstruction of the phase space
of a nonlinear time series to study its nonlinear dynamics characteristics, pioneering
the use of one-dimensional time series to study the chaotic phenomenon of complex
dynamic systems. Next, Takens [3] proposed to reconstruct the phase space of the
nonlinear time series using the delayed coordinate method and proved mathemati-
cally that the reconstructed phase space can preserve the dynamic characteristics of
the original system, which is called Takens embedding theorem [3]. The proposition
of Takens theorem makes it possible for researchers to relate the theoretical abstract
object, such as a chaotic dynamical system, to the measured time series in practical
engineering from a strictly mathematical point of view. In this way, researchers do not need to model complex systems directly: by studying the measured time series alone, the properties of the chaotic system can be investigated equally well, since the reconstruction preserves its intrinsic dynamical properties and mathematical significance. At
present, the commonly used nonlinear characteristic parameters to reflect the char-
acteristics of chaotic time series, such as correlation dimension, Lyapunov exponent
and Kolmogorov entropy, are extracted from the reconstructed phase space of the
measured time series. Therefore, phase space reconstruction is the key to processing
nonlinear time series. In this chapter, Takens embedding theorem and phase space
reconstruction method based on delayed coordinates are introduced briefly, then the

© National Defense Industry Press 2023 371


W. Li et al., Intelligent Fault Diagnosis and Health Assessment for Complex
Electro-Mechanical Systems, https://doi.org/10.1007/978-981-99-3537-6_6
372 6 Phase Space Reconstruction Based on Machinery System Degradation …

selection methods of two parameters, delay time and embedding dimension in phase
space reconstruction, are introduced in detail, which lays a foundation for nonlinear
feature extraction, degradation tracking and fault prediction of mechanical rotating
parts.

6.1.1 Takens Embedding Theorem

According to chaos theory, the evolution of any component of a system is determined by its interaction with the other components, so the development of any component contains information about the other related components. In other words, a univariate time series measured from one variable of the system is the result of the interaction of many other relevant physical factors and contains the information of all the other variables involved in the motion. The univariate sequence must therefore be extended to a high-dimensional space in some way to fully display this information, which is exactly nonlinear time series phase space reconstruction. Reconstructing the phase space from a chaotic time series output by a nonlinear system, and then investigating the characteristics of the whole chaotic system from it, is now common practice. The most widely used phase space reconstruction method is the delayed coordinate method proposed by Packard, Takens and others, whose mathematical foundation is the Takens embedding theorem.
Before introducing Takens embedding theorem, some related mathematical concepts are introduced qualitatively:
(1) Manifold: an abstract space that is locally Euclidean; it generalizes the concepts of curves and surfaces in Euclidean space;
(2) Differential homeomorphism (diffeomorphism): for two manifolds M1 and M2, a map f: M1 → M2 is called a differential homeomorphism between M1 and M2 if both f and its inverse f^(-1): M2 → M1 are differentiable;
(3) Isometric isomorphism and embedding: two metric spaces (M1, ρ1) and (M2, ρ2) are isometrically isomorphic if there exists a distance-preserving bijection f: M1 → M2; if (M1, ρ1) is isometrically isomorphic to a subspace of (M3, ρ3), then (M1, ρ1) is said to be embeddable in (M3, ρ3).
Based on the above concepts, the Takens embedding theorem can be further introduced as follows [3]: let M be a d-dimensional manifold, let ϕ: M → M be a smooth diffeomorphism, and let y be a smooth function on M. Define the map Φ(ϕ,y): M → R^(2d+1) by

Φ(ϕ,y)(x) = (y(x), y(ϕ(x)), y(ϕ²(x)), …, y(ϕ^(2d)(x)))

Then Φ(ϕ,y) is an embedding from M into R^(2d+1). Here y(x) is the observed value of the system state x; the space containing Φ(ϕ,y)(M) is called the embedding space, and its dimension, 2d + 1, is called the embedding dimension.
Takens theorem gives the mathematical guarantee of data embedding. The original
space and the embedded space that satisfy the theorem are isometrically isomorphic, so the embedded space can retain the basic dynamic information of the original space. The phase space reconstruction method given by Takens theorem is the delayed coordinate method, which reconstructs a one-dimensional time series into multi-dimensional phase space vectors through time delays [4]. For an observed time series {x(1), x(2), …, x(N)} of length N, the corresponding reconstructed phase space is obtained by selecting an appropriate delay time τ and embedding dimension m according to formula (6.1):

X(1) = {x(1), x(1 + τ), …, x(1 + (m − 1)τ)}
…
X(i) = {x(i), x(i + τ), …, x(i + (m − 1)τ)}                                        (6.1)
…
X(N − (m − 1)τ) = {x(N − (m − 1)τ), x(N − (m − 2)τ), …, x(N)}

i = 1, 2, …, N − (m − 1)τ

X(1), X(2), …, X(N − (m − 1)τ) are the vectors of the reconstructed phase space. It can be seen from the above formula that the selection of the delay time and the embedding dimension has a great influence on the structure of the reconstructed phase space, but Takens theorem does not give a specific method for selecting these two parameters. The following two sections briefly introduce the selection of the delay time and the embedding dimension.
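Formula (6.1) translates directly into code; the following sketch of the delayed coordinate method (function name ours) builds the matrix whose rows are the reconstructed phase space vectors:

```python
import numpy as np

def phase_space_reconstruct(x, tau, m):
    """Delayed-coordinate embedding per formula (6.1): row i is
    X(i) = [x(i), x(i + tau), ..., x(i + (m - 1) tau)]."""
    x = np.asarray(x)
    n_vec = len(x) - (m - 1) * tau
    if n_vec <= 0:
        raise ValueError("series too short for the chosen tau and m")
    return np.array([x[i:i + (m - 1) * tau + 1:tau] for i in range(n_vec)])
```

For a series of length N this yields N − (m − 1)τ vectors, matching the index range in formula (6.1).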

6.1.2 Determination of Delay Time

According to Takens theory, the selection of delay time is arbitrary when recon-
structing phase space for an infinite time series that is not disturbed by noise.
However, in practice, the observation series cannot be infinitely long, and any time
series is inevitably disturbed by noise. If the delay time τ is too small, adjacent components of the phase space vector X(i) = {x(i), x(i + τ), …, x(i + (m − 1)τ)}, such as x(i) and x(i + τ), are numerically too close, so the differences between phase space vectors are too small, the information redundancy is too large, and the reconstructed phase space contains little information about the original system; geometrically, the phase space trajectory is compressed toward the principal diagonal of the phase space. If τ is too large, the correlation between the components of the phase space vector is easily lost, and the phase space trajectory may exhibit a folding phenomenon. Therefore, it is essential to choose an appropriate delay time τ so that the reconstructed phase space preserves the dynamic characteristics of the original system to the maximum extent.
There are several commonly used methods to calculate delay time: autocorrelation
method, average displacement method and mutual information quantity method. The
autocorrelation method is a relatively mature method for determining the optimal
delay time by examining the linear independence between sequences, but the delay
time parameter obtained by the autocorrelation method cannot be generalized to the
reconstruction of high-dimensional phase space.

Another method to calculate the delay time is the average displacement method.
The method needs to be based on the selection of an embedding dimension, and
the delay time obtained is the optimal value under the condition of the selection of
the embedding dimension. However, the optimal embedding dimension cannot be
determined in advance in reality, which brings an error in determining the optimal
delay time. In addition, the average displacement method needs to calculate the
distance between all vectors in the phase space, and the amount of calculation is
relatively large.
In addition to the above two methods, the mutual information method is commonly used to calculate the delay time. Although it requires more computation than the autocorrelation method, the mutual information method captures the nonlinear characteristics of the time series, so its results are better than those of the autocorrelation method; moreover, there is no need to determine the embedding dimension in advance. Therefore, the mutual information method is adopted in this chapter to obtain the delay time parameter for phase space reconstruction.
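One concrete realization of this criterion (a histogram estimator of the mutual information together with the usual first-local-minimum rule; the bin count and names are our assumptions) is sketched below:

```python
import numpy as np

def mutual_information(x, tau, bins=16):
    """Histogram estimate (in nats) of I(x(t); x(t + tau))."""
    pxy, _, _ = np.histogram2d(x[:-tau], x[tau:], bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x(t)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of x(t + tau)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def delay_by_first_minimum(x, max_tau=50, bins=16):
    """Delay time at the first local minimum of the MI curve."""
    mi = [mutual_information(x, t, bins) for t in range(1, max_tau + 1)]
    for k in range(1, len(mi) - 1):
        if mi[k] < mi[k - 1] and mi[k] <= mi[k + 1]:
            return k + 1  # delays are 1-based
    return int(np.argmin(mi)) + 1
```

For a sinusoid the first minimum sits near a quarter of the period, which matches the classical rule of thumb.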

6.1.3 Determination of Embedding Dimensions

For the embedding dimension of the reconstructed phase space, Takens theorem only gives a sufficient condition, m ≥ 2d + 1, and this condition applies only to the ideal case of an infinitely long, noise-free time series. In practice, the embedding dimension should be greater than this minimum value of m. Theoretically, as long as m is large enough, the dynamical characteristics of the original chaotic system can be described and its internal laws of motion revealed. However, an overly large embedding dimension greatly increases the computation of the geometric invariants of the chaotic system (such as the correlation dimension and the Lyapunov exponent), and when the system noise is large, the impact of noise and rounding error on the reconstruction also grows. Common methods for determining the embedding dimension include the trial algorithm, the singular value decomposition method, and the false nearest neighbor method.
Based on a given delay time, the trial algorithm repeatedly computes some geometric invariants of the system (such as the correlation dimension or the Lyapunov exponent) while increasing the embedding dimension, until these invariants stop changing at a certain value of the embedding dimension, which is then taken as the optimal embedding dimension for phase space reconstruction. However, the trial algorithm must reconstruct the phase space many times and compute the geometric invariants of each reconstructed space, so the amount of computation is large and the computing time increases accordingly.
The singular value decomposition method, also known as principal component
analysis, was first introduced by Broomhead in 1986 to determine the embedding
dimension of chaotic time series [5].

The singular value decomposition method (principal component analysis) is essentially linear, and applying a linear method to the parameter selection of phase space reconstruction for nonlinear systems is theoretically controversial.
The basic idea of the false nearest neighbor method is to examine, as the embedding dimension changes, which neighboring points in the phase space are real neighbors and which are false ones. When no false neighbors remain, the geometric structure of the phase space is considered to be completely unfolded [6]. Compared with the trial algorithm, the false nearest neighbor method requires less computation and less data, and has better anti-noise ability. In this chapter, the false nearest neighbor method is used to determine the embedding dimension of the reconstructed phase space.
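A compact version of the false nearest neighbor test is sketched below; it uses only the distance-ratio criterion with an illustrative threshold (the published method also includes an absolute-size criterion, omitted here), and all names are our assumptions:

```python
import numpy as np

def false_neighbor_fraction(x, tau, m, rtol=15.0):
    """Fraction of nearest neighbours in dimension m that become false
    when the (m + 1)-th delayed coordinate is appended."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m * tau  # vectors usable in both dimensions m and m + 1
    emb = np.array([x[i:i + (m - 1) * tau + 1:tau] for i in range(n)])
    false = 0
    for i in range(n):
        d = np.linalg.norm(emb - emb[i], axis=1)
        d[i] = np.inf                      # exclude self-match
        j = int(np.argmin(d))              # nearest neighbour in dim m
        extra = abs(x[i + m * tau] - x[j + m * tau])
        if d[j] > 0 and extra / d[j] > rtol:
            false += 1
    return false / n

def embedding_dimension(x, tau, m_max=10, threshold=0.01):
    """Smallest m whose false-neighbour fraction falls below threshold."""
    for m in range(1, m_max + 1):
        if false_neighbor_fraction(x, tau, m) < threshold:
            return m
    return m_max
```

For a sinusoid, whose attractor is a closed curve, the false-neighbor fraction collapses once the curve is unfolded in two dimensions.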

6.2 Recurrence Quantification Analysis Based on Machinery Fault Recognition

Among the various nonlinear time series analysis methods, the recurrence quantification analysis method [7] requires only a small amount of data and has strong anti-noise ability, and it has become a new research hotspot in nonlinear time series analysis. At present, recurrence quantification analysis (RQA),
as a nonlinear feature extraction method, has been widely used in various fields. For
example, Zbilut applies the RQA method to weak signal detection in noisy environ-
ments [8]. This section describes how to apply multiple feature parameters extracted
by the RQA method to the identification of bearing fault severity.

6.2.1 Phase Space Reconstruction Based RQA

Like many nonlinear feature extraction methods, the recurrence quantification analysis method is based on phase space reconstruction. Using the mutual information method and the false nearest neighbor method introduced in Sect. 6.1, the delay time τ and the embedding dimension m are selected, and the phase space reconstruction of the observed time series {x(1), x(2), …, x(N)} of length N is carried out by formula (6.1), giving a series of space vectors {X(1), X(2), …, X(N − (m − 1)τ)}, each of which is a point in the reconstructed phase space. These vectors
are then used to construct recurrence matrices:
R_ij = Θ(ε − ||X(i) − X(j)||) = 1 if ||X(i) − X(j)|| ≤ ε, and 0 otherwise,   i, j ∈ [1, N_m]        (6.2)

where:
Θ(·) — unit step function;
ε — recurrence threshold;
N_m = N − (m − 1)τ — number of vectors in the reconstructed phase space.
Suppose a recurrence threshold ε has been determined, and every pair of vectors X(i) and X(j) in the space is substituted into formula (6.2). If R_ij equals 1, a point is drawn at coordinates (i, j); if R_ij equals 0, no point is drawn there. When all the vectors in the reconstructed phase space have been processed by Eq. (6.2), a two-dimensional graph, called a recurrence graph, is obtained. The linear structures and the point density in the recurrence graph reflect the dynamic characteristics of the original time series. For example, the points in the recurrence graph
of the Gaussian white noise are evenly distributed, and the recurrence graph of
the periodic signal is composed of some lines parallel to the diagonal [9]. However,
a recurrence graph is only a graphical and qualitative description of the dynamic
characteristics of time series, and its rich information needs to be described by some
quantitative features.
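In vectorized form, Eq. (6.2) becomes a thresholded pairwise-distance matrix (a sketch with names of our choosing; the Euclidean norm is assumed):

```python
import numpy as np

def recurrence_matrix(vectors, eps):
    """R[i, j] = 1 when ||X(i) - X(j)|| <= eps, else 0 (Eq. 6.2)."""
    dist = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    return (dist <= eps).astype(int)
```

Plotting the positions of the ones of `R` gives the recurrence graph described above.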
Based on the recurrence graph, Marwan proposed the recurrence quantitative
analysis method, which can extract effective characteristic parameters, such as the
recurrence rate (RR), determinism (DET), laminarity (LAM) and recurrence entropy
(ENTR), to quantitatively describe the dynamic characteristics of the original time
series [10]. The mathematical definition and basic meaning of each of these four
parameters are described below. For all N_m vectors in the reconstructed phase space, after constructing the recurrence graph according to formula (6.2), the recurrence rate RR is defined as:

RR = (1 / N_m²) · Σ_{i=1}^{N_m} Σ_{j=1}^{N_m} R_ij        (6.3)

Let p(l) and p(v) represent the length distributions of the lines in the 45-degree and vertical directions of the recurrence graph, respectively:

p(l) = N_l / Σ_{α=l_min}^{l_max} N_α        (6.4)

p(v) = N_v / Σ_{α=v_min}^{v_max} N_α        (6.5)

where:
N_l — the number of lines of length l in the 45-degree direction;
N_v — the number of lines of length v in the vertical direction;
N_α — the number of lines of length α in the 45-degree or vertical direction;
l_min, l_max — the minimum length (usually 2) and the maximum length of the 45-degree lines;
v_min, v_max — the minimum length (usually 2) and the maximum length of the vertical lines.

Thus, the determinism (DET), laminarity (LAM) and recurrence entropy (ENTR) can be defined as:

DET = Σ_{l=l_min}^{l_max} l · p(l) / Σ_{i=1}^{N_m} Σ_{j=1}^{N_m} R_ij        (6.6)

LAM = Σ_{v=v_min}^{v_max} v · p(v) / Σ_{v=1}^{v_max} v · p(v)        (6.7)

ENTR = − Σ_{l=l_min}^{N_m} p(l) ln p(l)        (6.8)

In a physical sense, the recurrence rate RR describes the density of recurrence points in a recurrence graph and reflects the probability of the occurrence of a particular state. Determinism (DET) describes the ratio of the number of recurrence points forming diagonal structures to the number of all recurrence points, reflecting the predictability of the system. Laminarity (LAM) describes the proportion of recurrence points forming vertical line structures in the recurrence graph, reflecting the intermittency and laminar phases of the system. Recurrence entropy (ENTR) is the Shannon entropy of the frequency distribution of the diagonal line lengths; it describes the complexity of the deterministic structure in the dynamical system, reflecting the amount of dynamic information or the degree of randomness of the system [8]. In a word, these RQA parameters reflect the dynamic features of the system and can be used as feature parameters for fault diagnosis of mechanical rotating parts.
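The four measures follow mechanically from the recurrence matrix; the sketch below implements Eqs. (6.3), (6.6), (6.7) and (6.8) by counting run lengths along the diagonals and columns (excluding the main diagonal from the diagonal count is a common convention that we assume here):

```python
import numpy as np

def _line_lengths(seq):
    """Lengths of consecutive runs of ones in a 1-D 0/1 sequence."""
    lengths, run = [], 0
    for v in seq:
        if v:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def rqa_measures(R, l_min=2, v_min=2):
    """Recurrence rate, determinism, laminarity and recurrence entropy
    (Eqs. 6.3, 6.6, 6.7 and 6.8) from a recurrence matrix R."""
    n = R.shape[0]
    total = R.sum()
    rr = total / n ** 2                               # Eq. 6.3
    diag_lens = []                                    # 45-degree lines
    for k in range(-(n - 1), n):
        if k != 0:                                    # skip identity line
            diag_lens += _line_lengths(np.diagonal(R, k))
    vert_lens = []                                    # vertical lines
    for j in range(n):
        vert_lens += _line_lengths(R[:, j])
    long_d = [l for l in diag_lens if l >= l_min]
    det = sum(long_d) / total if total else 0.0       # Eq. 6.6
    lam = (sum(v for v in vert_lens if v >= v_min)
           / sum(vert_lens)) if vert_lens else 0.0    # Eq. 6.7
    if long_d:                                        # Eq. 6.8
        _, counts = np.unique(long_d, return_counts=True)
        p = counts / counts.sum()
        entr = float(-(p * np.log(p)).sum())
    else:
        entr = 0.0
    return rr, det, lam, entr
```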
Next, a simulation experiment demonstrates the effectiveness of the recurrence quantification analysis method. A white Gaussian noise signal with a standard deviation of 1, a sine signal with a frequency of 2π, and the x component of a Lorenz system with added noise are selected as simulation signals. For each signal, 1500 data points are collected, the phase space is reconstructed with delay time τ = 14 and embedding dimension m = 5, and the recurrence graph is drawn, as shown in Fig. 6.1a-c, where the horizontal and vertical axes represent the indices of the vectors in the reconstructed phase space. As can be seen from the graphs, the white Gaussian noise signal is a random signal: its recurrence points are evenly distributed, with no obvious 45-degree line or vertical line structure. The sine signal is periodic: its recurrence graph has an obvious 45-degree line structure, and the whole graph shows a banded distribution. The Lorenz system is a more complex chaotic system, and its recurrence graph is accordingly more complex than those of the white Gaussian noise and sine signals: short 45-degree lines, short vertical lines and partially banded blank regions appear in the figure. Then, the recurrence rate, determinism, laminarity and recurrence entropy of the three signals are calculated by formulas (6.3)-(6.8), as shown in Table 6.1. As can be seen from the table, for the simple white Gaussian noise signal, whose recurrence graph shows no obvious regularity, the values of the four parameters are all very small. As the system structure becomes more complex, the four parameters of the sinusoidal signal and the Lorenz system signal exceed those of the white noise signal; and for the most complex signal, that of the Lorenz system, every parameter except the recurrence rate is larger than for the sinusoidal signal. This indicates that the Lorenz system has a higher complexity and a larger amount of information.

Fig. 6.1 Recurrence figure of three different signals

Table 6.1 Recurrence quantitative analysis parameters

Signal                        Recurrence rate (RR)  Determinism (DET)  Laminarity (LAM)  Recurrence entropy (ENTR)
White Gaussian noise signal   0.0018                0.0032             0.0038            0.0083
Sine signal                   0.1180                0.9691             0.0100            0.5185
Lorenz system signal          0.0817                0.9803             0.9860            1.1111

6.2.2 RQA Based on Multi-parameters Fault Recognition

The four characteristic parameters extracted by the RQA algorithm, namely the recurrence rate (RR), determinism (DET), laminarity (LAM) and recurrence entropy (ENTR), can be used as quantitative features to identify and evaluate the fault severity
severity of rolling bearings. Based on this, a multi-parameter bearing fault severity
identification algorithm can be formed, and its flow chart is shown in Fig. 6.2.
The steps of the algorithm are as follows. First, the vibration signals of rolling bearings with different fault degrees are acquired by sensors, and the acquired time series are standardized. Then the delay time and the embedding dimension are selected by the mutual information method and the false nearest neighbor method, and the phase space is reconstructed. Next, the recurrence matrix is calculated and the recurrence plot is drawn according to Eq. (6.2). Finally, the four characteristic parameters are calculated according to Eqs. (6.3)-(6.8), and the change rules of these characteristic parameters, extracted from the bearing vibration signals under different fault degrees, are compared.

Fig. 6.2 Flow chart of RQA-based multi-parameter bearing fault severity identification algorithm
The validity of these characteristic parameters is verified by measured vibration
signals of rolling bearings with different failure degrees. Fault data for this experiment
were obtained from the Case Western Reserve University Electrical Engineering
Laboratory in Ohio, USA [11]. The experimental setup, as described in Chap. 2,
consists of a drive motor, a torque sensor, a power meter and a control device. The
tested bearing is the motor’s output shaft support bearing, the model is 6205-2RS
JEM SKF. Single-point faults of different sizes are introduced to the tested bearings
using EDM technology. The fault diameters include 0.18, 0.36 and 0.53 mm. The
vibration signal is sampled by the accelerometer at a frequency of 12 kHz.
Firstly, under the condition of 1750 rpm and 2 HP load, the data of bearing with
inner ring fault is processed and analyzed. Figure 6.3 shows a section of the vibration
signal and its corresponding spectrum of the healthy bearing and the fault bearing
with the fault diameter of 0.18 mm, 0.36 mm and 0.53 mm, respectively. Although the
difference can be seen from the time domain signal and its corresponding spectrum,
it is difficult to distinguish the difference in the fault severity directly from the
graph. The signals are then analyzed using a recurrence quantitative analysis method.
Firstly, the delay time and embedding dimension are adaptively selected by mutual
the information method and false neighborhood method, and the phase space of
380 6 Phase Space Reconstruction Based on Machinery System Degradation …

Fig. 6.3 Bearing vibration signal and its spectrum (RPM: 1750 rpm, load: 2HP)

bearing data with different failure degrees is reconstructed respectively, and the
recurrence threshold ε = 0.4 is selected to construct the recurrence graphs, as shown in Fig. 6.4, where the horizontal and vertical axes represent the indices of the vectors in the reconstructed phase space. As can be seen from the figure, the density of the recurrence points and the structure of the horizontal and vertical lines in
the graph change as the degree of bearing failure increases. Then, to describe the recurrence graph quantitatively, the recurrence rate (RR), determinism (DET), laminarity (LAM) and recurrence entropy (ENTR), the characteristic parameters of recurrence quantification analysis, are calculated respectively; the results are shown in Table 6.2 and Fig. 6.5. As can be seen from the figure, for RR, LAM and ENTR, the index value of a faulty bearing is larger than that of the healthy bearing, and the index value increases with the fault degree. However, the DET index of the healthy bearing is greater than that of the mildly faulty bearing, which shows that for inner ring faults the predictability of the system does not necessarily decrease monotonically with the degree of failure. Therefore, RR, LAM and ENTR can be used to identify signals of different fault severity in this test, while DET is not an effective indicator of fault severity.
Fig. 6.4 Recurrence diagrams of bearing vibration signals with different failure degrees (1750 rpm, load: 2 HP): a healthy bearing; b minor fault (0.18 mm); c medium failure (0.36 mm); d serious failure (0.53 mm)

In order to verify the above results, the data processing and analysis of ball bearing
failure under the condition of 1750 rpm and 2 HP load are shown in Table 6.3 and
Fig. 6.6. At the same time, under the condition of 1797 rpm and 0 HP load, the
data processing and analysis of the bearing with outer ring fault are presented in
Table 6.4 and Fig. 6.7. As can be seen in Fig. 6.6 and Table 6.3, RR, DET, LAM
and ENTR all increase with the severity of ball bearing failure; therefore, these four
parameters are effective in diagnosing the severity of ball bearing faults. As can be
seen from Fig. 6.7 and Table 6.4, in the bearing outer ring failure test, the RR value
of the bearing signal at moderate failure was greater than the RR value of the bearing

Table 6.2 Bearing inner race fault RQA parameters (RPM: 1750 rpm, load: 2 HP)
Fault size RR DET ENTR LAM
A: health bearing 0.0009 0.0779 0.3325 0.0217
B: 0.18 mm 0.0023 0.0557 1.5263 0.0571
C: 0.36 mm 0.0078 0.0604 1.553 0.1056
D: 0.53 mm 0.0406 0.2071 2.3661 0.3819

Fig. 6.5 Bearing inner race fault RQA parameters (RPM: 1750 rpm, load: 2 HP)

signal at severe failure. In addition, DET, LAM and ENTR can accurately diagnose
the severity of bearing failure.
From the three experimental results, it can be seen that RR and DET cannot accurately
evaluate the severity of bearing failure under certain fault conditions; only LAM and
ENTR increase monotonically with the severity of bearing failure. This can be explained
by the fact that as the bearing fault crack or spalling size increases, the rolling body

Table 6.3 Bearing ball fault RQA parameters (rpm: 1750 rpm, load: 2 HP)
Fault size RR DET ENTR LAM
A: health bearing 0.0008 0.0784 0.3387 0.0212
B: 0.18 mm 0.005 0.7374 1.3885 0.4482
C: 0.36 mm 0.0198 0.758 1.5168 0.6116
D: 0.53 mm 0.0409 0.9447 2.4918 0.8506

Fig. 6.6 Bearing ball fault RQA parameters (rpm: 1750 rpm, load: 2 HP)

Table 6.4 Bearing outer ring fault RQA parameters (RPM: 1797 rpm, load: 0 HP)
Fault size RR DET ENTR LAM
A: health bearing 0.0004 0.0937 0.3206 0.0454
B: 0.18 mm 0.0196 0.2032 2.0945 0.2222
C: 0.36 mm 0.0399 0.2169 2.5204 0.3719
D: 0.53 mm 0.0338 0.3298 2.5384 0.4522

passes over the fault point and generates enhanced vibration shock signals; these
fault-related vibration components add more frequency content to the vibration signal
of a faulty bearing, which increases the complexity of the system and leads to the
increase of entropy and laminarity. Therefore, in this experiment, LAM and ENTR are
two effective quantitative characteristic indexes, which can identify the vibration
signals of rolling bearings under different kinds of faults and different fault
severities.
Through the experimental verification of the multi-parameter bearing fault severity
identification algorithm based on recurrence quantification analysis, laminarity
(LAM) and recurrence entropy (ENTR) have been proven effective in identifying the
vibration signals of bearings with different fault severities, which is the basis
for the study of bearing life degradation. In addition, ENTR itself is defined in
the form of Shannon entropy, which is directly related to the complexity of the

Fig. 6.7 Bearing outer ring failure RQA parameters (RPM: 1797 rpm, load: 0 HP)

system, literature [9] further shows that the increase in the complexity of the dynam-
ical system leads to the change of the distribution of the 45-degree line length of
the recurrence graph, thus increasing the value of the recurrence entropy. Therefore,
considering the more definite physical meaning of the recurrence entropy parameter,
the recurrence entropy feature can be further selected as a non-linear feature param-
eter to perform degradation tracking research on the whole life cycle of mechanical
rotating parts, and the specific method will be detailed in the next section.

6.3 Kalman Filter Based Machinery Degradation Tracking

6.3.1 Standard Deviation Based RQA Threshold Selection

In the previous section, recursive entropy is used to identify the bearing at different
failure stages, and the influence of recursive threshold ε is not considered when calcu-
lating the recursive entropy value. However, from formula 6.2, it can be seen that
the recursive threshold has an important effect on the recursive matrix and the
calculation of the recursive graph and recursive entropy. If ε is too large relative
to ||X(i) − X(j)||, almost all Rij become equal to 1 and the recursive graph will be
full of recursive points; if ε is too small relative to ||X(i) − X(j)||, almost all
Rij become equal to 0 and the recursive graph will be blank, with almost no recursive
points. Both of these situations can
have adverse effects on the outcome of a recursive quantitative analysis, so choosing
the appropriate recursive threshold is important for the recursive quantitative
analysis method itself. Furthermore, the previous fault severity identification only
analyzed the bearing signals at several discrete time points, so the selection of the
recursive threshold had little influence on the results; the study of mechanical
component degradation tracking, in contrast, requires the real-time tracking of
vibration signals and the accurate identification of faults over the whole life cycle
of components. Therefore, the stability and fault sensitivity of the extracted
feature parameters are required
to be higher, and these need to be achieved by a reasonable selection of the
recursive threshold; studying its selection is therefore of great significance. Some
existing studies have discussed the selection of the recursive threshold: for
example, the maximum phase space scale [12] and the standard deviation of the noise
contained in the signal [13] have been used to select recursive thresholds.
Although these methods are very effective in some
RQA experiments, the computation of maximum phase space scale and recursive
point density is too large to meet the real-time requirement of degradation tracking
research. In addition, it is difficult to determine the magnitude of the noise in the
signal in practical applications, so it is difficult to determine the threshold accurately
by using the noise standard deviation. Therefore, a new recursive threshold selection
method based on the standard deviation of the observation sequence is proposed to
improve the traditional RQA algorithm.
Theoretically, the standard deviation of a time series reflects its degree of
fluctuation: the greater the fluctuation of the time series, the greater its standard
deviation. Because every vector in the reconstructed phase space is constructed from
elements of the observation sequence, ||X(i) − X(j)|| in Formula 6.2 is also related
to the fluctuation of the original observation sequence, in other words, to its
standard deviation. The greater the standard deviation of the time series, the greater
||X(i) − X(j)|| is, and the larger the recursive threshold ε should be chosen.
Therefore, it can be assumed that there is a linear relationship between the standard
deviation of the time series and the recursive threshold ε. In a real degradation
tracking experiment on rotating components, where vibration signal observation
sequences are collected at specific time intervals, the recursive threshold selection
method is as follows: for the time series of the vibration signal acquired first, the
standard deviation σ 1 is calculated, and the recursive threshold ε1 is determined as
10% of the maximum phase space scale [12]; the scaling coefficient k = ε1 /σ 1 is
thus fixed. Next, the recursive threshold εi of the i-th subsequent sampling sequence
is obtained from formula 6.9:

εi = kσi (6.9)

where:
εi : the recursive threshold of the i-th observation sequence;
σ i : the standard deviation of the i-th observation sequence.

After the recursive threshold of each observation sequence is obtained, the
corresponding recursive entropy feature can be computed through formula 6.8. From
the steps of the above threshold selection method, it can be seen that the method
establishes a connection between the standard deviation of the observation sequence
and the recursive threshold, and adaptively selects the threshold according to the
conditions of each observation sequence, which improves the stability of the
recurrence entropy feature obtained by the RQA method. In addition, the method only
needs to calculate the maximum phase space scale parameter once, to obtain the first
recursive threshold, so the computation is small and the method is simple and
convenient to use; it can meet the real-time requirement of degradation tracking
research.

6.3.2 Selection of Degradation Tracking Threshold

Another important issue to be studied is the setting of the health threshold for the
degradation tracking of rotating parts by using the recursive entropy feature of the
observation sequence. During operation, the performance of rotating parts degrades
with time, gradually passing from the healthy state to the fault degradation state.
The so-called health threshold is the threshold parameter that can distinguish the
healthy state from the fault degradation state of the rotating parts, i.e. the
parameter that can identify the initial failure time of the parts.
The Chebyshev’s inequality of probability theory is used to select the degradation
tracking health threshold. Chebyshev’s inequality is defined as:

P{|X − μh | ≥ εh } ≤ σh2 /εh2 or P{|X − μh | < εh } > 1 − σh2 /εh2 (6.10)

where:
X: the recursive entropy sequence of mechanical components in the same state;
μh : the mean of sequence X;
σ h : standard deviation of sequence X;
εh : a selected real number.
Chebyshev’s inequality shows that, whatever the probability distribution of an
observation sequence in the same state, most of its values are close to the sequence
mean; that is, the probability that a value lies in the interval [μh − εh , μh + εh ]
is greater than 1 − σh2 /εh2 . The theory can be illustrated by the following
hypothesis test:

H0 : |X − μh | < εh
(6.11)
H1 : |X − μh | ≥ εh

For a recursive entropy value X 0 of a mechanical component in the healthy state,
if |X 0 − μh | ≥ εh , the H 0 hypothesis is rejected and X 0 is mistakenly judged to
be the recursive entropy value of a faulty component. Statistically, this misdiagnosis
is called making a type I error, and according to formula 6.10 its probability is
α = σh2 /εh2 . On the
other hand, for a recursive entropy value X 1 of a component in any fault state, if
|X 1 − μh | < εh , the H 0 hypothesis is accepted and X 1 is incorrectly judged to
be the recursive entropy value of a healthy mechanical component. This misdiagnosis
is called making a type II error, with probability β. If α is too large, the recursive
entropy of a healthy part is easily mistaken for that of a faulty part; if β is too
large, the recursive entropy of a faulty part is easily mistaken for that of a healthy
part. According to hypothesis testing
theory, after the sampling sequence is determined, the probabilities of these two
kinds of errors constrain each other: the smaller α is, the greater β becomes, and
conversely, the larger α, the smaller β. In
practical application, the probability of making the type I error is usually controlled
first, and then the probability of making the type II error is minimized. Here,
Chebyshev’s inequality is used to determine the health threshold that distinguishes
the healthy state of a rotating component from the fault degradation state, first
controlling the probability of the type I error, i.e. that a healthy-state recursive
entropy value is misjudged as the fault state. Selecting εh = 5σh gives α = σh2 /εh2 =
4%, and the degradation tracking health threshold is set to μh + 5σ h . This means
that the recursive entropy of a healthy mechanical component falls within the
[μh − 5σ h , μh + 5σ h ] range with a confidence probability of at least 96%, while
the probability that it exceeds μh + 5σ h is less than 4%.
In other words, once the recursive entropy of a mechanical component exceeds the
health threshold μh + 5σ h , it is considered to be in the initial failure state, with a
96% confidence probability. In the actual research, the mechanical rotating parts are
generally in the healthy state at the beginning of the life cycle experiment, so the
healthy recursive entropy value series can be formed by the recursive entropy value
of the health data at this time, the mean and standard deviation were calculated, and
the degradation tracking health threshold was constructed.
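A minimal sketch of this threshold construction (the healthy recursive entropy sequence is assumed given; c = 5 reproduces the εh = 5σh choice above, with a type-I error bound of 1/c² = 4%):

```python
import numpy as np

def health_threshold(healthy_entropy, c=5.0):
    """Chebyshev-based degradation tracking health threshold mu_h + c*sigma_h,
    computed from recursive entropy values of the known-healthy period."""
    mu = float(np.mean(healthy_entropy))
    sigma = float(np.std(healthy_entropy))
    return mu + c * sigma
```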

6.4 Improved RQA Based Degradation Tracking

Based on the improved RQA algorithm with standard-deviation-based threshold selection
and the Chebyshev’s inequality health threshold selection method described above,
the whole life cycle of mechanical rotating parts can be tracked. The flow chart of
the degradation tracking algorithm is shown in Fig. 6.8.
The steps of the algorithm are summarized as follows: first, for the observation
sequence at time t = 1, the recursive threshold and the entropy value RP1 are
calculated using the maximum phase space scale parameter, and the standard deviation
and the threshold ratio parameter k are determined. For each subsequent observation

Fig. 6.8 Flow chart of the degradation tracking algorithm

sequence, the recursive threshold and the entropy value RPt are calculated. According
to Chebyshev’s inequality, the degradation health threshold is obtained from the
recursive entropy sequence composed of the first N recursive entropy values. Each
subsequent recursive entropy value RPt is then compared with the health threshold:
if it is less than the health threshold, the component is still in a healthy state,
and the above steps are repeated to continue tracking; once the recursive entropy
value is greater than the health threshold, the initial fault is considered to have
occurred.
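The tracking procedure can be sketched as a simple loop, assuming the recursive entropy values are already available as a sequence (in practice each value would come from the improved RQA step):

```python
import statistics

def track_degradation(entropy_stream, n_healthy, c=5.0):
    """Online sketch of the degradation tracking loop: build the health
    threshold from the first n_healthy recursive entropy values, then flag
    the first later value exceeding mu_h + c*sigma_h as the initial fault
    index (or None if the component stays healthy)."""
    healthy = entropy_stream[:n_healthy]
    mu = statistics.fmean(healthy)
    sigma = statistics.pstdev(healthy)
    threshold = mu + c * sigma
    for t in range(n_healthy, len(entropy_stream)):
        if entropy_stream[t] > threshold:
            return t, threshold          # initial fault detected at index t
    return None, threshold               # still healthy
```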

6.5 Kalman Filter Based Incipient Fault Prognostics

The above degradation tracking process detects the initial fault occurrence time only
when the extracted recursive entropy feature exceeds the health threshold; it does
not predict the initial fault occurrence time in advance. The Kalman filter [14]
is an optimal recursive prediction algorithm that can predict the future state of a
system from noisy measurements of a dynamic system. This section uses the Kalman
filter algorithm to predict, during degradation tracking, the time of the initial
failure of a rotating mechanical component in advance.
A dynamical system can be described by the following dynamic equation:

Xk+1 = AXk + wk
(6.12)
yk = CXk + vk

where:
X k : the state of the system at time k;
yk : the observed value of the system at time k;
A, C: A is the state transition matrix and C is the measurement matrix;
wk : process noise, wk ~ N (0, Q), with Q its covariance matrix;
vk : measurement noise, vk ~ N (0, R), with R its covariance matrix.
Assuming the current time is k, the steps for using the Kalman filter algorithm to
predict the state of the system at time k + 1 are shown in formulas 6.13–6.17.
State prediction:

Xk+1|k = AXk|k (6.13)

Covariance prediction:

Pk+1|k = APk|k A' + Q (6.14)

Kalman gain matrix calculation:

Kk+1 = Pk+1|k C' (CPk+1|k C' + R)−1 (6.15)



Optimal state prediction:

Xk+1|k+1 = Xk+1|k + Kk+1 (yk − CXk+1|k ) (6.16)

Covariance update:

Pk+1|k+1 = Pk+1|k − Kk+1 CPk+1|k (6.17)

where:
Xk|k : the optimal state estimate of the system at time k;
Xk+1|k : the predicted (a priori) state of the system at time k + 1;
Xk+1|k+1 : the optimal state estimate of the system at time k + 1;
Pk|k : the covariance matrix of the system at time k;
Pk+1|k : the predicted covariance matrix at time k + 1;
Pk+1|k+1 : the covariance matrix of the system at time k + 1.
After one step of prediction, take the k + 1 moment as the current moment, repeat
the steps of formula 6.13–6.17, and continue to predict the next optimal estimation
state of the system. As can be seen from the above prediction steps, the Kalman
filter is a fast recursive algorithm: at each iteration, it predicts the state of the
system at the next moment using only the current state, the measurement, and the
covariance matrix. In addition, the Kalman filter algorithm makes full use of the system
information including measurement values, measurement errors, system noise and
so on to predict the future optimal estimation state of the system.
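One iteration of formulas 6.13–6.17 can be written compactly; this generic sketch follows the matrix names of Eq. 6.12 and uses the incoming measurement in the update step:

```python
import numpy as np

def kalman_step(x, P, y_next, A, C, Q, R):
    """One predict/update cycle of the Kalman filter (Eqs. 6.13-6.17)."""
    x_pred = A @ x                                   # Eq. 6.13: state prediction
    P_pred = A @ P @ A.T + Q                         # Eq. 6.14: covariance prediction
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)              # Eq. 6.15: Kalman gain
    x_new = x_pred + K @ (y_next - C @ x_pred)       # Eq. 6.16: optimal state estimate
    P_new = P_pred - K @ C @ P_pred                  # Eq. 6.17: covariance update
    return x_new, P_new
```

Iterating this function, each time treating the new optimal estimate as the current state, reproduces the recursion described above.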
However, to make predictions with the Kalman filter, a system dynamic equation
of the form 6.12 must be obtained and its parameters determined, including
A, C, wk and vk . This dynamic equation needs to describe the state of the
system well and be simple to construct, so as to meet the real-time requirement of
online prediction. The autoregressive (AR) model has a simple structure and is
convenient to construct, and it has been proven theoretically that a higher-order
AR model can achieve precision similar to that of the autoregressive moving average
(ARMA) model. Therefore, the autoregressive model satisfies the requirements of
online prediction, and it is used here to construct the dynamic equation of the
system for the Kalman filter algorithm:

Xt = a1 Xt−1 + a2 Xt−2 + · · · + ap Xt−p + εt (6.18)

where:
aj : the model parameters;
p: the order of the model;
εt : the model error.

The autoregressive model expresses the current state of the system X t as a weighted
sum of the p previous states X t−1 , …, X t−p plus the model error.

Among them, the model order p can be obtained by using AIC (Akaike information
criterion) [15], and the model parameter aj can be obtained by using the Burg
algorithm [16]. In practical application, AR models of different order are built for
given time series, and corresponding AIC values are calculated for each model. The
AR model with the minimum AIC value is the most suitable model for the time series.
Based on AR Model 6.18, the parameters of dynamic Eq. 6.12 can be determined as
follows:
Xk = [xk , xk−1 , . . . , xk−p+1 ]T (p × 1),
A = the p × p companion matrix whose first row is [a1 , a2 , . . . , ap ] and whose
sub-diagonal entries are 1 (all other entries 0),
C = [1, 0, . . . , 0] (1 × p),
wk = [εk , 0, . . . , 0]T (p × 1). (6.19)
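The mapping from AR coefficients to the companion-form matrices of Eq. 6.19 can be sketched as:

```python
import numpy as np

def ar_state_space(a):
    """Companion-form state-space matrices for an AR(p) model (Eq. 6.19):
    the first row of A holds the AR coefficients, the sub-diagonal shifts
    the state history down, and C picks out the current value x_k."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a
    A[1:, :-1] = np.eye(p - 1)
    C = np.zeros((1, p))
    C[0, 0] = 1.0
    return A, C
```

Multiplying the state vector by A advances the AR recursion one step while shifting the previous values down, and C extracts the current value x_k as the measurement.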

The process noise wk can be determined by the model error εt of the AR model.
In many Kalman filter studies, the measurement noise vk is determined directly by
the measurement accuracy of the sensors, but that approach is not appropriate here:
since the state of the system is the recursive entropy feature produced by the RQA
algorithm, the object of AR modeling is not the original signal measured by the
sensor but the recursive entropy feature itself. Therefore, the measurement noise
vk should be determined from the recursive entropy calculation
process. The average error method is used to determine the measurement noise. The
steps are as follows: the observation sequence x1 (t) acquired at a given time is
divided into n equal short sequences {x1 (t 1 ), …, x1 (t n )}, and the recursive
entropy value of each short sequence, RP1 , …, RPn , is calculated; the average of
these recursive entropy values is taken as the recursive entropy feature of the time
series x1 (t), and their standard deviation s1 is taken as the measurement error of
the recursive entropy feature. Next, if N recursive entropy feature points are used
for AR modeling, the measurement noise vk can be determined from the standard
deviations s1 to sN of these recursive entropy features:

vk = (s1 + s2 + · · · + sN )/N (6.20)
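The average error method amounts to two small computations (a sketch; the segment-wise entropy values are assumed to come from the RQA step):

```python
import numpy as np

def entropy_feature_and_error(segment_entropies):
    """For one data file: the mean of the segment-wise recursive entropy
    values is the file's entropy feature; their standard deviation s_i is
    its measurement error."""
    e = np.asarray(segment_entropies, float)
    return e.mean(), e.std()

def measurement_noise(s):
    """Average error method (Eq. 6.20): v_k is the mean of s_1..s_N."""
    return float(np.mean(s))
```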

Based on the above proposed Kalman filter algorithm and combined with the
previous method of mechanical rotating parts degradation tracking, the whole life
cycle of mechanical rotating parts can be tracked simultaneously and the time of the
initial failure of the component is predicted. The flow chart of the specific prediction
algorithm is shown in Fig. 6.9.
The specific algorithm steps are as follows:
(1) first, the recursive entropy values and the degradation health threshold are
calculated according to the degradation tracking algorithm proposed in the
previous section;

Fig. 6.9 Flow chart of the Kalman filter based initial fault prediction algorithm

(2) at time t, the AR model is constructed from the n recursive entropy values in
the time interval {t−n + 1, t−n + 2, …, t}, and the Kalman filter method is
used to predict the m recursive entropy values in the forward time interval
{t + 1, t + 2, …, t + m};
(3) if the predicted recursive entropy value at time t + l (l ≤ m) is larger than the
degradation health threshold, the initial failure of the component is predicted
to occur at time t + l; otherwise, the recursive entropy value of the next time
is calculated, the n recursive entropy values in the time interval {t−n + 2,
t−n + 3, …, t, t + 1} are selected, and step (2) is repeated until a predicted
value exceeds the health threshold.
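Steps (1)–(3) can be sketched as a sliding-window forecast loop. This simplified sketch fits the AR model by ordinary least squares and iterates it directly, instead of the AIC/Burg fitting and Kalman filtering used in the text, so it only illustrates the control flow:

```python
import numpy as np

def predict_fault_time(entropy, thr, n=60, m=6, p=4):
    """Slide an n-point window over the recursive entropy sequence, fit an
    AR(p) model by least squares, forecast m steps ahead, and return the
    first predicted index exceeding the health threshold thr (or None)."""
    x = np.asarray(entropy, float)
    for t in range(n, len(x) + 1):
        w = x[t - n:t]
        # Least-squares AR(p) fit on the window: w[i] ~ sum_j a_j * w[i-j]
        Y = w[p:]
        Phi = np.column_stack([w[p - j - 1:len(w) - j - 1] for j in range(p)])
        a, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
        hist = list(w[-p:])
        for l in range(1, m + 1):
            nxt = float(np.dot(a, hist[::-1][:p]))   # iterate the AR recursion
            hist.append(nxt)
            if nxt > thr:
                return t - 1 + l                      # predicted fault index
    return None
```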
To verify the effectiveness of the degradation tracking algorithm based on the
improved recursive quantitative analysis method and the Kalman filter algorithm
for the initial fault prediction of mechanical rotating parts, these algorithms are
applied to the degradation experimental data of actual bearings. The experimental
data were provided by NSF I/UCR Center [17]. The bearing test system is shown
in Fig. 6.10a and the detailed system structure is shown in Fig. 6.10b. The shaft
rotates at 2000 rpm, and a total radial load of 6000 lbs is applied to the bearings
through a spring system. Four double-row deep groove ball bearings of type ZA2115,
each with 16 rollers per row, a pitch diameter of 2.815 inches, a roller diameter of
0.331 inches, and a contact angle of 15.17 degrees, are installed on the test shaft of
the test system. All bearings are lubricated by an oil circulation system

that regulates oil temperature and flow, and wear particles are collected from the
oil by an electromagnet mounted in the return oil pipeline as an indicator of bearing
failure. When the amount of particles attached to the electromagnet exceeds a certain
threshold, the test bearing is considered to be damaged. At this time, the electronic
switch is switched on and the experiment stops. The PCB 353B33 high-sensitivity
accelerometer is mounted on each of the test housings, and four thermocouples are
mounted on the outer rings of the bearings to measure the operating temperature of
the bearings. The vibration data of the bearing are collected by DAQCarde-6062E
of NI company, and the vibration signal is processed by LABVIEW software. The
sampling frequency is 20 kHz.
In the experiment, the bearing vibration signal is collected every 10 min, and
20,480 data points are collected each time. This experiment started at 14:39 on
October 29, 2003 and ended at 23:39 on November 25, 2003, with 2000 data files
collected. At the end of the experiment, an inner ring failure occurred in bearing 3,
as shown in Fig. 6.11. Therefore, the 2000 data files of bearing 3 are used for
analysis and processing.

Fig. 6.10 Bearing test-to-failure experiment system [18]

Fig. 6.11 Real defect bearing diagram [48]

First, the first data file is used to determine the scale factor k = 8 according to
formula 6.9, and then all data files are processed by the improved RQA method.
Take the second data file as an example. First, the optimal delay time τ = 2 and the
optimal embedding dimension m = 5 are obtained by using the mutual information
method and the false nearest neighbor method, and then the phase space of the second
data file is reconstructed according to these two parameters. After that, the recursive
threshold of the second data file is found to be 0.88 (ε = kσ = 8 × 0.11) using the
standard-deviation-based threshold selection method. The waveform and the recurrence
plot of the first 1024 data points of the second data file are shown in Fig. 6.12. The
recursive entropy feature is further calculated from the recursive graph. Note that
each data file is divided into 20 equal data segments, and the recursive entropy value
of each segment is calculated separately. The average of these recursive entropy
values serves as the recursive entropy characteristic value of the data file, and the
standard deviation of these recursive entropy values is labeled s1 … sN and kept for
the construction of the equation of state. Since there are 2000 data files
in this experiment, a total of 2000 recursive entropy eigenvalues were calculated and
used to draw the degradation tracking curve of the bearing, as shown in Fig. 6.13.
Generally speaking, the bearing is in a healthy state during the first quarter of the
degradation test. Therefore, the recursive entropy eigenvalues of the first 500 data
files were used to calculate the degradation tracking health threshold of the bearing.
Based on Chebyshev’s inequality, the health threshold is set to 1.6046, and the health
threshold
line is also drawn in Fig. 6.13. As can be seen from the diagram, the recursive entropy
feature of the bearing exceeds the health threshold for the first time at point 1833,
so point 1833 is considered the initial failure time of the bearing. The above is
a degradation tracking curve, which is an improvement of the traditional recursive
quantitative analysis method by using the recursive threshold selection method based
on standard deviation. For comparison, this experiment uses the traditional recursive
quantitative analysis method to determine the recursive threshold using the maximum
phase space scale, and calculates the recursive entropy for the degradation tracking
experiment of the bearing, as shown in Fig. 6.15. As can be seen from the graph,
the degradation curve fluctuates greatly and the initial failure occurs at point 1984,
111 points later than the initial failure time obtained using the improved recursive
quantitative analysis method. In addition, considering the sensitivity of the kurtosis
parameter to the initial failure of mechanical components and the wide application of
the RMS parameter in degradation tracking [17], these two characteristic parameters
were also calculated here for each data file; the corresponding bearing degradation
tracking curves were plotted, and the health thresholds were determined as 4.0649
and 0.5079 using the same method, as shown in Figs. 6.14 and 6.16. As can be seen
in Fig. 6.14, the kurtosis feature exceeds the set health threshold at point 800, which
means that point 800 is mistaken for the time of the initial failure of the bearing
in the online monitoring; in addition, the degradation curve in the real degradation
stage of the bearing (around 1800 points) has a very large fluctuation, which is very
adverse to the failure prediction. Similarly, as can be seen from Fig. 6.16, although
the regression curve derived from the root-mean-square feature is much more stable
than the kurtosis feature, the root-mean-square feature exceeds the health threshold

early at points 1271 and 1764, which also leads to a miscalculation of the time of the
initial failure. The results of these comparative experiments show that, compared
with the curves based on kurtosis and root mean square, the degradation curve based
on recursive entropy describes the process of bearing degradation more clearly,
gives a more accurate fault threshold, and reduces the possibility of falsely timing
the initial fault. In addition, the improved recursive quantitative analysis method
with standard-deviation-based threshold selection extracts more accurate and stable
recursive entropy features, which improves the effectiveness of the traditional
recursive quantitative analysis method for bearing degradation tracking.
Then, based on the recursive entropy features extracted above and the degradation
tracking results, a Kalman filter prediction method can be used to predict the initial
failure time of the bearing in advance. The autoregressive (AR) model was
constructed with 60 recursive entropy values (n = 60). Using the AR model as the
equation of state, the prediction was conducted 6 time units ahead (m = 6, i.e.
60 min or 1 h) until a predicted recursive entropy value exceeded the health
threshold. In particular,
the measurement noise parameter vk in the dynamic equation is determined by the
average error method given in formula 6.20. The standard deviation of the measure-
ment noise in this experiment is 0.1. The final prediction is shown in Fig. 6.17. When
60 recurrent entropy feature points (1769–1828) were used to build the AR model
(the model order was selected as 15th order by AIC criterion), the 5th prediction
point exceeded the health threshold for the first time, that is, point 1833 is predicted
as the time of the initial failure of the bearing. This prediction is the same as the actual
degradation tracking result. As a comparison, the AR model and the ARMA model,

Fig. 6.12 Time-domain waveform and the recurrence plot for bearing 3


Fig. 6.13 PR entropy degradation curve based on improved recursive quantitative analysis method
of bearing 3

Fig. 6.14 Kurtosis degradation curve for bearing 3

two commonly used time series prediction models, were also used to predict the time
of the initial failure of the bearing, and the results are shown in Figs. 6.18 and 6.19.
As can be seen from Fig. 6.19, the AR model prediction results cannot track the real
recursive entropy change trend well, and when the AR model is constructed by using
the 60 recursive entropy features from 1770 to 1829 points, the 6th prediction point
exceeds the health threshold for the first time, so point 1835 is predicted as the time
of initial failure. There is a 2-point (20 min) prediction delay error in the prediction
result. As can be seen from Fig. 6.18, when the ARMA model is constructed using

Fig. 6.15 PR entropy degradation tracking curve based on traditional recursive quantitative analysis
method of bearing 3

Fig. 6.16 RMS degradation curve for bearing 3



the 60 recurrence entropy features from points 1769 to 1828, the 6th prediction point
exceeds the health threshold for the first time; thus point 1834 is predicted as the
time of initial failure. Although the ARMA model improves the prediction accuracy,
the result still has a 1-point (10 min) prediction delay error. The error can be
explained by the fact that the AR and ARMA models contain no feedback process,
so their predictions depend entirely on the development trend of the adjacent data.
The Kalman filter algorithm can make full use of all available information in the
system, and every prediction step includes error feedback, which improves the
prediction accuracy. Besides time series prediction models, the neural network is
also a common prediction method, so a backpropagation (BP) neural network was
also used to predict the initial fault time of the bearing. The same data segment
(points 1769 to 1828) was used as the training data, and the prediction results are
shown in Fig. 6.20. As can be seen from the figure, the prediction curve can follow
the trend of the actual recurrence entropy values, but point 1832 is predicted as the
time of initial failure, a 1-point (10 min) prediction error. This error can be
explained by the fact that a neural network needs a large amount of training data to
ensure prediction accuracy, whereas in this online prediction experiment the amount
of available training data was relatively limited.

Fig. 6.17 Prediction results of the bearing 3 failure using Kalman filter

Fig. 6.18 Prediction results of the bearing 3 failure using ARMA model

Fig. 6.19 Prediction results of the bearing 3 failure using AR model

6.6 Particle Filter Based Machinery Fault Prognostics

The fault prediction of mechanical components can be divided into initial fault time
point prediction and remaining useful life prediction. Prediction of the initial failure
time point means continuously tracking the working state of a mechanical component
while it is healthy and predicting, from its degradation, the time point at which a
slight initial failure occurs; a component that is not yet seriously damaged can
continue to work for a long time. Prediction of the remaining useful life means
following the development trend of the fault after the initial failure of the component
and predicting its remaining useful life, that is, the time it takes for the component to
degrade from its current state until it is completely damaged and can no longer work.

Fig. 6.20 Prediction results of the bearing 3 failure using BP neural network

In the last section, the initial fault time was predicted in advance, but no method was
given for predicting the remaining useful life. The characteristic parameters are
stable while machine parts are in a healthy state and change noticeably only very
close to the initial fault time point. After the initial fault, however, a mechanical
component generally goes through a long degradation state before a serious failure
occurs that prevents it from continuing to work. Therefore, prediction of the
remaining useful life is a long-term prediction process, which places higher demands
on the stability and accuracy of the prediction model.
Data-driven prediction models based on Bayesian estimation provide a rigorous
mathematical framework for long-term prediction of dynamical systems [19]. Based
on Bayesian estimation theory, the current state of the system can be estimated from
the state at the previous time, and the estimate can then be updated with the current
measurement, yielding the optimal state estimate. Through such a recursive
estimation process, multi-step prediction can be carried out. The Kalman filter
algorithm used in the previous section is a linear realization of Bayesian estimation,
which solves the optimal posterior state estimation problem for linear Gaussian
systems. For nonlinear estimation problems, the extended Kalman filter is widely
used [20], but it retains only the first-order term of the Taylor series expansion of the
nonlinear function and ignores the higher-order terms,

as a result, the extended Kalman filter is often effective only for systems with weak
nonlinearity and cannot handle signals with strongly nonlinear characteristics. The
particle filter algorithm solves the state prediction problem for nonlinear systems
well. It is based on Monte Carlo integration and recursive Bayesian estimation. Its
basic idea is as follows: first, a set of random samples, called particles, is generated
in the state space according to the empirical distribution of the system; the particles
and their weights are then continuously updated from the measured data, and the
updated particles approximate the posterior probability density distribution of the
system state [21]. At present, research applying the particle filter to remaining life
prediction mainly includes gearbox remaining useful life prediction [22], bearing
useful life prediction [23], etc. However, the particle filter still suffers from problems
such as particle degeneracy and loss of particle diversity, which need further study.
To address these problems, this section proposes an adaptive importance density
function selection algorithm and a neural-network-based particle smoothing
algorithm, and on this basis an enhanced particle filter algorithm for predicting the
residual service life of mechanical rotating parts.

6.7 Particle Filter

For any dynamical system, the dynamic equations can be expressed as:

x_k = f(x_{k-1}) + ω_{k-1}
z_k = h(x_k) + v_k                                        (6.21)

where:
x_k   the state of the system at time k;
z_k   the observed value of the system at time k;
ω_{k-1}   the process noise of the system at time k-1;
v_k   the measurement noise of the system at time k;
f(•)   the process function;
h(•)   the measurement function.
Given the dynamic equations (6.21), Bayesian estimation can be used to infer the
optimal posterior state distribution p(x_{0:k}|z_{1:k}). The reasoning process of
Bayesian estimation can be divided into two steps, prediction and update, as shown
in Eqs. 6.22 and 6.23 respectively:

p(x_{0:k}|z_{1:k-1}) = ∫ p(x_k|x_{k-1}) p(x_{0:k-1}|z_{1:k-1}) dx_{0:k-1}        (6.22)

p(x_{0:k}|z_{1:k}) = p(x_{0:k}|z_{1:k-1}) p(z_k|x_k) / p(z_k|z_{1:k-1})
                   = p(z_k|x_k) p(x_k|x_{k-1}) p(x_{0:k-1}|z_{1:k-1}) / p(z_k|z_{1:k-1})        (6.23)
                   ∝ p(z_k|x_k) p(x_k|x_{k-1}) p(x_{0:k-1}|z_{1:k-1})

where:
p(x_{0:k}|z_{1:k-1})   the predicted probability density distribution of the system at time k;
p(x_{0:k}|z_{1:k})   the posterior probability density distribution of the system at time k;
p(x_k|x_{k-1})   the state transition probability model;
p(z_k|x_k)   the likelihood function;
p(z_k|z_{1:k-1})   the normalization factor, p(z_k|z_{1:k-1}) = ∫ p(x_{0:k}|z_{1:k-1}) p(z_k|x_k) dx_k;
∝   the proportionality symbol.
The prediction step uses the system model to calculate the predicted probability
density function at time k from all measurements at times 1:k-1; the update step
then uses the latest measurement to revise it into the posterior probability density
function at time k. Equations 6.22 and 6.23 are the basis of Bayesian estimation.
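As a concrete illustration, the predict/update recursion of Eqs. 6.22 and 6.23 can be run numerically on a discretized one-dimensional state space. The random-walk transition kernel, the noise levels, and the measurement sequence below are illustrative assumptions, not values from the text:

```python
import numpy as np

# Discretized 1-D state grid for a numerical Bayesian filter sketch
xs = np.linspace(-5.0, 5.0, 201)
dx = xs[1] - xs[0]

def gauss(u, s):
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Transition kernel p(x_k | x_{k-1}): random walk with noise std 0.5
trans = gauss(xs[:, None] - xs[None, :], 0.5)

def predict(prior):
    # Eq. 6.22: integrate the transition kernel against the previous posterior
    return trans @ prior * dx

def update(pred, z, meas_std=0.8):
    # Eq. 6.23: multiply by the likelihood p(z_k | x_k) and renormalize
    post = pred * gauss(z - xs, meas_std)
    return post / (post.sum() * dx)

belief = gauss(xs, 1.0)                  # initial belief p(x_0)
for z in [0.4, 0.7, 1.1]:                # a short illustrative measurement sequence
    belief = update(predict(belief), z)

print("posterior mean:", (xs * belief).sum() * dx)
```

Each pass through the loop performs one full Bayesian recursion step; the posterior mean drifts from the prior mean toward the measurements.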
Generally speaking, for the nonlinear and non-Gaussian systems encountered in
reality, the optimal solutions of Eqs. 6.22 and 6.23 are difficult to give in complete
analytical form. The particle filter method solves the integral operations of Bayesian
estimation by the Monte Carlo method and thus gives an approximate optimal
solution of the Bayesian estimate [21]. First, for a set of random samples (particles)
{x_k^i, i = 1, 2, ..., N} of the system state at time k, the corresponding weights are
{w_k^i, i = 1, 2, ..., N}. The posterior probability density function of the system
state at time k can then be approximated by these particles [24]:


p(x_k|z_{1:k}) ≈ Σ_{i=1}^{N} w_k^i δ(x_k − x_k^i)        (6.24)

where:
N   the number of particles;
δ(•)   the Dirac delta function (impulse function).
These weights can be determined by importance sampling theory [21]. In this theory,
an importance density function q(x_{0:k}|z_{1:k}) is introduced, and the importance
weight is determined as follows:
w_k^i ∝ p(x_{0:k}^i|z_{1:k}) / q(x_{0:k}^i|z_{1:k})        (6.25)

Let q(x_{0:k}|z_{1:k}) be further decomposed as:

q(x_{0:k}|z_{1:k}) = q(x_k|x_{0:k-1}, z_{1:k}) q(x_{0:k-1}|z_{1:k-1})        (6.26)

Then, substituting Eqs. 6.23 and 6.26 into Eq. 6.25 yields:

w_k^i ∝ p(x_{0:k}^i|z_{1:k}) / q(x_{0:k}^i|z_{1:k})
      = p(z_k|x_k^i) p(x_k^i|x_{k-1}^i) p(x_{0:k-1}^i|z_{1:k-1}) / [q(x_k^i|x_{0:k-1}^i, z_{1:k}) q(x_{0:k-1}^i|z_{1:k-1})]
      = w_{k-1}^i · p(z_k|x_k^i) p(x_k^i|x_{k-1}^i) / q(x_k^i|x_{0:k-1}^i, z_{1:k})        (6.27)
      = w_{k-1}^i · p(z_k|x_k^i) p(x_k^i|x_{k-1}^i) / q(x_k^i|x_{k-1}^i, z_k)

From Eq. 6.27, it can be seen that determining the importance density function is the
key step in calculating the importance weights. Gordon et al. proposed using the
prior state transition probability density function as the importance density function
to calculate the weights [21], that is, letting q(x_k^i|x_{k-1}^i, z_k) =
p(x_k^i|x_{k-1}^i). In this case, Eq. 6.27 simplifies to:

w_k^i ∝ w_{k-1}^i p(z_k|x_k^i)        (6.28)

That is, the weight at the current moment can be obtained from the weight at the
previous moment and the likelihood function. Next, the weights are normalized by
Eq. 6.29, and the likelihood function is obtained from the measurement equation in
Eq. 6.21, so that the weight expression can be further given by Eq. 6.30:


w_k^i = w_k^i / Σ_{j=1}^{N} w_k^j        (6.29)

w_k^i ≈ w_{k-1}^i p(z_k|x_k^i) ≈ w_{k-1}^i p_{v_k}(z_k − h(x_k^i))        (6.30)

where p_{v_k}(•) is the probability density function of the measurement noise v_k.
Finally, the optimal estimate of the system state x_k at the current time is given by:


x̂_k ≈ Σ_{i=1}^{N} w_k^i x_k^i        (6.31)
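The bootstrap filter described by Eqs. 6.28–6.31 can be sketched in a few lines. The scalar model f(x) = 0.9x, the noise levels, and the use of systematic resampling are illustrative assumptions, not choices made in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500                        # number of particles
proc_std, meas_std = 0.3, 0.5  # illustrative noise levels (assumptions)

def f(x):                      # hypothetical process function for this sketch
    return 0.9 * x

def likelihood(z, x):          # p_v(z_k - h(x_k)) with h(x) = x, Gaussian v_k
    return np.exp(-0.5 * ((z - x) / meas_std) ** 2)

def systematic_resample(particles, w):
    positions = (rng.random() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(w), positions)
    return particles[np.minimum(idx, N - 1)]

# Simulate a short trajectory of the model in Eq. 6.21 and filter it
x_true, zs = 1.0, []
for _ in range(30):
    x_true = f(x_true) + rng.normal(0, proc_std)
    zs.append(x_true + rng.normal(0, meas_std))

particles = rng.normal(0, 1, N)            # prior particle swarm
for z in zs:
    particles = f(particles) + rng.normal(0, proc_std, N)  # propagate (prior as q)
    w = likelihood(z, particles)           # Eq. 6.28: equal weights after resampling
    w /= w.sum()                           # Eq. 6.29: normalize
    x_hat = np.sum(w * particles)          # Eq. 6.31: state estimate
    particles = systematic_resample(particles, w)

print("final estimate vs truth:", x_hat, x_true)
```

Because the prior transition density is used as the importance density, the weight of each particle reduces to its likelihood under the new measurement, exactly as in Eq. 6.28.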

The biggest problem of the traditional particle filter algorithm is particle degeneracy.
After several iterations, the weight becomes concentrated on one or a few particles
while the weights of the other particles are almost zero; as a result, much of the
computational effort is wasted updating particles that contribute little, and the
resulting particle set no longer reflects the real posterior probability density function.
To counter this, the resampling algorithm introduced in the traditional particle filter
can remove particles with small weights, and the weight distribution can be
optimized by selecting an appropriate importance density function. Both methods
are effective means of mitigating particle degeneracy. In the following, the
shortcomings of the existing importance density function selection and resampling
algorithms are discussed in detail, and an improved algorithm is proposed.
The importance density function in the particle filter algorithm is closely related to
the calculation of the particle weights and the particle update process. To simplify
the calculation, the traditional particle filter uses the prior state transition probability
density function as the importance density function, and this selection is also adopted
in much current research using particle filters [24]. However, this method takes the
prior probability density of the system state as an approximation of the posterior
probability density and does not take the current measurement into account, leading
to a large deviation between the particle distribution resampled from the importance
density function (the prior probability density function) and that resampled from the
real posterior probability density function. This bias is particularly pronounced when
the likelihood function lies in the tail of the prior state transition probability density
function. Many scholars have studied the selection of the importance density
function. For example, Yoon uses a Gaussian mixture model in conjunction with an
unscented sequential Monte Carlo probability hypothesis density filter to improve
the traditional importance density function selection [25], and Li proposed a new
method to calculate the importance weights by combining wavelets and the grey
model [26]. All of the above studies introduce other algorithms to assist the selection
of the importance density function and achieve good experimental results, but they
also increase the complexity of the algorithm and the amount of computation.
Therefore, different from the above methods, this section presents an adaptive
importance density function selection method based on the update of the particle
distribution itself.
In essence, the particle filter algorithm fits the posterior probability density
distribution of the system state by continuously updating the particle swarm, and the
importance density function closely influences the particle update process. Therefore,
the updated particle swarm distribution can itself be taken as the importance density
function at the current moment, guiding the weight calculation and particle update
at the next moment. With this method, in each prediction step the importance density
function is updated by the particle distribution from the previous prediction step. In
this way, the importance density function not only retains the prior information of
the system state at the last moment but also approaches the posterior probability
density distribution of the system state through real-time updating. Combined with
the traditional particle filter algorithm, the process of the adaptive importance
density function selection method is introduced in detail below (shown in Fig. 6.21):

Fig. 6.21 Flow chart of adaptive importance density function selection algorithm
(1) Particle initialization: at k = 0, the particle swarm {x_0^i, i = 1, 2, ..., N} is
generated from the prior distribution of the system, and the variance σ_0 of the
particle swarm at this moment is calculated, ready for the iterative operation at
the next moment;
(2) Particle update: at k > 0, the process noise in Eq. 6.21 is determined by
ω_{k-1} = ε + Δ_{k-1}, where Δ_{k-1} ~ N(0, σ_{k-1}), ε is the error of the
equation of state, and N denotes the normal distribution. The particle swarm
{x_k^i, i = 1, 2, ..., N} is then obtained by propagating the previous particles
through Eq. 6.21, and the mean and variance of the updated particle swarm are
calculated. Since many distributions can be approximated by a normal
distribution, the normal distribution N(x̄_k, σ_k) is used to approximate the
probability density function of the particle swarm.
(3) Particle weight calculation: using the probability density function of the current
particle swarm to determine the importance density function
q(x_k^i|x_{k-1}^i, z_k) = N(x̄_k, σ_k), the weight of each particle can be
calculated by Eq. 6.32:

ŵ_k^i = w_{k-1}^i · p(z_k|x_k^i) p(x_k^i|x_{k-1}^i) / q(x_k^i|x_{k-1}^i, z_k) = (1/N) · p_{v_k}(z_k − h(x_k^i)) p(x_k^i|x_{k-1}^i) / N(x̄_k, σ_k)        (6.32)

where w_{k-1}^i is taken as 1/N, because the particles at the last moment all
have the same weight after the resampling algorithm;
(4) Particle resampling: after the weights are normalized by Eq. 6.29, the current
particle swarm {x_k^i} and its weights {ŵ_k^i} are resampled using the
resampling algorithm of the traditional particle filter to obtain a new particle
swarm in which every particle has the same weight 1/N. The state of the system
can then be estimated from the new particle swarm by Eq. 6.31. Then set
k = k + 1 and return to step (2) to carry out the next particle update and weight
calculation, until the number of predicted steps reaches the threshold k_end.
From the above steps, it can be seen that the importance density function is adaptively
adjusted by the particle update in each iteration cycle, and the adjusted importance
density function in turn affects the subsequent resampling operation and the next
particle update. Therefore, the adaptive importance density function selection
method not only considers the prior information of the system state at the last
moment during particle update and resampling, but is also not limited to that prior
information: the importance density function is adjusted so that the particle
distribution approaches the posterior probability distribution of the system state,
which reduces the possibility of particle degeneracy. In addition, because this method
only needs the mean and variance of the particle swarm and introduces no other
algorithms, its computational cost is low and it meets the real-time requirement of
online prediction.
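One iteration of steps (1)–(3) can be sketched for a scalar model. The model, the noise levels, the measurement value, and the approximation of the transition density p(x_k|x_{k-1}) around the previous swarm mean are illustrative simplifications, not part of the original method:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
eps_std, meas_std = 0.05, 0.2   # illustrative noise levels (assumptions)

def normal_pdf(u, mu, sigma):
    return np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Step 1: prior particle swarm with equal weights 1/N
particles = rng.normal(1.0, 0.1, N)
prev_mean, sigma_prev = particles.mean(), particles.std()

# Step 2: propagate with omega_{k-1} = eps + Delta_{k-1}, Delta ~ N(0, sigma_{k-1})
particles = particles + rng.normal(0, eps_std, N) + rng.normal(0, sigma_prev, N)
mean_k, sigma_k = particles.mean(), particles.std()

# Step 3: take q(x_k | x_{k-1}, z_k) = N(mean_k, sigma_k) and weight by Eq. 6.32.
# The transition density is approximated around the previous swarm mean here,
# which is a simplification made only for this sketch.
z = 1.05                                    # illustrative current measurement
lik = normal_pdf(z, particles, meas_std)    # p_v(z_k - h(x_k)) with h(x) = x
trans = normal_pdf(particles, prev_mean, np.hypot(eps_std, sigma_prev))
w = (1.0 / N) * lik * trans / normal_pdf(particles, mean_k, sigma_k)
w /= w.sum()                                # normalization (Eq. 6.29)

print("weighted state estimate:", np.sum(w * particles))
```

Dividing by the fitted normal density of the updated swarm is what distinguishes this weight from the bootstrap weight of Eq. 6.28.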
The resampling algorithm is another effective method for alleviating particle
degeneracy. Its basic idea is to copy particles with large weights and discard those
with small weights, so that the number of small-weight particles in the new particle
swarm is reduced after resampling, which restrains particle degeneracy. However,
after several resampling passes, particles with high weights may be duplicated many
times, producing many identical particles in the new population; particle diversity is
then gradually lost, making it difficult for the iterated particle swarm to represent
the posterior probability density distribution of the system state [27]. This
phenomenon is called particle depletion. To eliminate particle depletion, several
improved resampling algorithms have been proposed, such as residual resampling
[28] and distributed resampling [29]. These algorithms generally focus on improving
the resampling algorithm itself, but seldom involve adjusting singularities among
the particles before the resampling step, i.e., particles with abnormally large weights,
which are the root cause of particle depletion. To adjust these singularities, a model
of the relationship between each particle and its corresponding weight must be
established. Since this relationship is generally nonlinear and non-Gaussian, the
model is difficult to give in analytic form. The neural network has good nonlinear
tracking ability and information learning ability; among neural networks, the
backpropagation (BP) neural network has a simple structure and requires little prior
information about the modeled samples. Therefore, a BP neural network is used as
a particle smoothing algorithm to improve the traditional resampling algorithm. This
method does not change the steps of the original resampling algorithm; it only uses
the BP neural network to smooth the particle weights before resampling in order to
eliminate the singularities.
A BP neural network consists of an input layer, hidden layer(s), and an output layer.
During training, the gradient descent algorithm adjusts the network weights to
reduce the total error until the error between the nonlinear input and output reaches
a minimum. It has been proved theoretically that any continuous function can be
fitted by a BP neural network with one hidden layer [30]. Therefore, a three-layer
BP neural network is used to construct the relationship model between the particles
and their weights and, on that basis, to smooth the particle weights. Figure 6.22
shows a schematic diagram of the particle smoothing algorithm based on the BP
neural network. The steps of the algorithm are as follows:

(1) First, the new particle swarm {x_k^i, i = 1, 2, ..., N} and its corresponding
weights {ŵ_k^i, i = 1, 2, ..., N} are obtained after the particle update and weight
calculation at time k;
(2) then, taking the particle values and their corresponding weights as the training
inputs and training outputs of the neural network respectively, a BP neural
network is trained by the gradient descent algorithm and recorded as M_BP;
(3) next, each particle value x_k^i is fed into the trained network M_BP as a test
input, and the network output is calculated as the smoothed particle weight
ŵ_{s,k}^i = M_BP(x_k^i);
(4) finally, the particles and their smoothed weights {(x_k^i, ŵ_{s,k}^i),
i = 1, 2, ..., N} are resampled using the traditional resampling algorithm.
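The steps above can be sketched with a small hand-written one-hidden-layer network trained by gradient descent. The network size, learning rate, epoch count, and the synthetic weight spike are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def smooth_weights(particles, weights, hidden=10, lr=0.05, epochs=2000):
    """Fit a small one-hidden-layer network (particle value -> weight) by
    gradient descent and return its predictions as smoothed weights."""
    x = (particles - particles.mean()) / (particles.std() + 1e-12)
    y = weights / weights.max()                  # scale targets to [0, 1]
    W1 = rng.normal(0, 0.5, (1, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    X = x[:, None]
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                 # hidden layer
        out = (h @ W2 + b2).ravel()              # linear output
        err = out - y                            # MSE gradient terms
        gW2 = h.T @ err[:, None] / len(x); gb2 = err.mean(keepdims=True)
        dh = (err[:, None] @ W2.T) * (1 - h ** 2)
        gW1 = X.T @ dh / len(x); gb1 = dh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
    sm = (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()
    sm = np.clip(sm, 1e-12, None)                # keep weights positive
    return sm / sm.sum()                         # renormalize

# Particle set with one singular weight spike; smoothing spreads it out
particles = np.linspace(-1, 1, 50)
w = np.exp(-0.5 * (particles / 0.5) ** 2); w[25] *= 20.0; w /= w.sum()
ws = smooth_weights(particles, w)
print("max weight before/after:", w.max(), ws.max())
```

Because the smooth network under-fits the isolated spike, the abnormally large weight is pulled toward the surrounding trend, which is the intended effect of the smoothing step.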

Fig. 6.22 Flow chart of particle smoothing algorithm based on neural network

In each iteration of the particle filter prediction, the particle smoothing algorithm is
applied once before resampling to map the probability density distribution of the
particle weights from the discrete space to a continuous space, and new importance
weight points are sampled in the generated continuous space. After this process,
abnormally large particle weights are smoothed and the large differences between
the weights of different particles are reduced, so that particle diversity is preserved
after the resampling operation.

6.8 Enhanced Particle Filter

Combining the above adaptive importance density function selection method and
the neural-network-based particle smoothing algorithm with the traditional particle
filter, an enhanced particle filter algorithm is proposed; the flow chart of the
algorithm is shown in Fig. 6.23.
The specific algorithm steps are as follows:
(1) firstly, the particles are initialized at k = 0 to prepare for subsequent iterations;
(2) at the next time, the system dynamic equation is used to update the particles,
and the process noise is corrected by the variance value determined at the last
time;
(3) then the mean and variance of the particle swarm are calculated, and the distribu-
tion function of the particle at the current time is determined as the importance
density function by the adaptive importance density function selection method,
and the weight of each particle is calculated and standardized;
(4) next, the weight of particles is smoothed by the particle smoothing algorithm
based on BP neural network to keep the diversity of particles, and then the
particles are resampled by the resampling algorithm according to the smoothed
weight to obtain the new particle swarm with equal weight;
Fig. 6.23 Flow chart of enhanced particle filter algorithm

(5) finally, the new particle swarm is used to estimate the current state of the system
through Eq. 6.31.
After one prediction is completed, the above steps are repeated to predict the state of
the system at the next time, until the stopping condition k = k_end is reached.

6.9 Enhanced Particle Filter Based Machinery Components Residual Useful Life Prediction

In order to use the enhanced particle filter algorithm to predict the life of mechanical
rotating parts, the system dynamic equation is the first problem to be solved. The
dynamic equation must describe the state of the system well yet be simple to
construct, to meet the real-time requirement of online prediction. In addition, the
running state of mechanical parts is related not only to the state at the previous time
but also to the states at the p consecutive preceding times; therefore, unlike the
first-order state equation given in Eq. 6.21, a multi-order state equation is needed to
describe the state change of the system. In the previous section, the validity of the
autoregressive (AR) model in predicting the initial failure time point was verified;
moreover, the AR model can establish the relationship between the state of the
system at the current moment and its states at the previous successive moments.
Therefore, the AR model is used to construct the multi-order state equation, as
shown in Eq. 6.33:

x_k = f(x_{k-1}, x_{k-2}, ..., x_{k-p}) + ω_{k-1} = Σ_{j=1}^{p} a_j x_{k-j} + ε_{k-1}        (6.33)

where the variables have the same meanings as defined earlier and a_j are the
autoregressive coefficients. The AIC criterion and the Burg algorithm are again used
to determine the order of the model. In addition, the measurement equation takes
the form:

z_k = x_k + v_k        (6.34)

where the selection of v_k is related to the predicted feature.
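A minimal sketch of constructing the AR state equation of Eq. 6.33: here the coefficients are estimated by ordinary least squares rather than the Burg algorithm used in the text, and the AR(2) test signal with coefficients 1.5 and -0.7 is an illustrative assumption:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares estimate of the AR(p) coefficients a_1..a_p in
    x_k = sum_j a_j * x_{k-j} (the text uses Burg + AIC; this is a sketch)."""
    # column j holds x_{k-1-j}; the target is x_k for k = p..len(x)-1
    X = np.column_stack([x[p - j - 1 : len(x) - j - 1] for j in range(p)])
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

def ar_predict(x, a, steps):
    """Iterate the state equation (Eq. 6.33, noise-free) for multi-step prediction."""
    hist = list(x[-len(a):][::-1])       # most recent value first
    out = []
    for _ in range(steps):
        nxt = float(np.dot(a, hist))
        out.append(nxt)
        hist = [nxt] + hist[:-1]
    return np.array(out)

# Example: AR(2) process x_k = 1.5 x_{k-1} - 0.7 x_{k-2} + noise
rng = np.random.default_rng(3)
x = np.zeros(300)
for k in range(2, 300):
    x[k] = 1.5 * x[k - 1] - 0.7 * x[k - 2] + rng.normal(0, 0.1)
a = fit_ar(x, 2)
print("estimated coefficients:", a)      # close to [1.5, -0.7]
```

The fitted coefficients recover the generating model, and `ar_predict` then plays the role of the process function f(•) inside the particle filter.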


With the AR state equation, the enhanced particle filter algorithm can be applied to
the life prediction of mechanical rotating parts. Generally speaking, the whole life
cycle of a mechanical rotating part can be divided into three stages: the healthy state,
the fault degradation state, and the serious fault state. The health threshold for
degradation tracking introduced in Sect. 6.3 distinguishes the healthy state of a
component from the fault degradation state, and the initial failure occurrence time
marks the beginning of the fault degradation state. Unlike the prediction of the
initial failure time, the prediction of remaining useful life is carried out under the
fault degradation condition. Specifically, a prediction model (here, the enhanced
particle filter) is built using the data up to the current time to predict the state of
the system forward until a predicted state exceeds a predetermined fault threshold
X_th. This threshold marks the component entering the serious fault state. The
remaining useful life of the component at the current moment, RUL_t, is then
obtained as:

RUL_t = t_r − t        (6.35)

where t is the current moment and t_r is the time at which the predicted state first
exceeds the fault threshold.
Therefore, the AR model and the enhanced particle filter algorithm are combined to
establish a prediction framework for the remaining service life of mechanical rotating
parts; the flow chart is shown in Fig. 6.24.
The algorithm flow is described as follows:
(1) extract the nonlinear features describing the degradation of the components,
obtain the health threshold by the threshold-setting method of Sect. 6.3, and
then obtain the time of initial failure, which marks the component entering the
fault degradation stage;
(2) start predicting the remaining useful life from the beginning of the fault
degradation state;
(3) at time t, select the n features in the time period {t − n + 1, t − n + 2, ..., t} to
construct the AR model, generate the particle swarm from a prior distribution
using the feature at the current moment, and use the enhanced particle filter
algorithm to predict forward for multiple steps {t + 1, t + 2, ...} until the
predicted feature is greater than the fault threshold X_th. If the time
corresponding to that feature is t + m, then m is the predicted remaining useful
life of the component at time t;
(4) at time t + 1, obtain the new observed feature, update the AR model by selecting
the n features in the time period {t − n + 2, t − n + 3, ..., t, t + 1}, and obtain
the predicted remaining useful life of the component at time t + 1 with the
enhanced particle filter algorithm in the same way as in step (3);
(5) repeat steps (3) and (4) until the new observed feature is larger than the fault
threshold X_th;
(6) finally, plot the remaining useful life curve from the predicted remaining useful
life at each time.
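The prediction loop of steps (3)–(5) can be sketched as follows. This noise-free version iterates the fitted AR model directly instead of propagating a particle swarm, and the function name, window sizes, and synthetic feature are illustrative assumptions:

```python
import numpy as np

def predict_rul(feature, t, n, p, x_th, dt=10.0, max_steps=2000):
    """RUL at time index t: fit an AR(p) model on the last n feature points by
    least squares and iterate it forward, noise-free, until the prediction
    exceeds the fault threshold x_th (the text propagates a particle swarm
    instead; this is a simplified sketch)."""
    window = feature[t - n + 1 : t + 1]
    X = np.column_stack([window[p - j - 1 : len(window) - j - 1] for j in range(p)])
    a, *_ = np.linalg.lstsq(X, window[p:], rcond=None)
    hist = list(window[-p:][::-1])        # most recent value first
    for m in range(1, max_steps + 1):
        nxt = float(np.dot(a, hist))
        if nxt > x_th:
            return m * dt                 # Eq. 6.35: RUL = t_r - t
        hist = [nxt] + hist[:-1]
    return None                           # no threshold crossing in the horizon

# Illustrative degradation feature drifting toward a fault threshold of 2.0;
# with noisy data the fitted model may be stable and return None.
rng = np.random.default_rng(4)
feat = 1.6 + 0.002 * np.arange(300) + rng.normal(0, 0.001, 300)
print("RUL at point 100 (min):", predict_rul(feat, t=100, n=60, p=5, x_th=2.0))
```

Re-running the function at each new time index, with the window shifted by one point, reproduces the sliding-window update of step (4).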
In order to verify the validity of the enhanced-particle-filter-based remaining useful
life prediction algorithm for mechanical rotating parts, the same bearing degradation
experimental data as in Sect. 6.3 are used, that is, the bearing 3 degradation test data
containing 2000 data files. The recurrence entropy feature derived from the improved
recurrence quantitative analysis method is still used as the state characteristic
parameter, and the remaining useful life is then predicted by the enhanced particle
filter method.

Fig. 6.24 Flow chart of residual service life prediction for mechanical rotating parts based on
enhanced particle filter

The degradation tracking curve of bearing 3 based on recurrence entropy is shown
in Fig. 6.25. Each point in the figure corresponds to a data file, with a 10-min interval
between two points. According to the experimental results in Sect. 6.3, the health
threshold is set to 1.6046, and the initial bearing failure occurs at point 1833. In
addition, because the degradation curve shows a significant upward trend after point
2120, point 2120 is taken as the time point of serious bearing failure, and the
corresponding recurrence entropy value 2.1242 is used as the fault threshold for life
prediction.

The prediction of the remaining useful life of the bearing starts from point 1833.
First, the AR model is established as the dynamic equation using the 60 recurrence
entropy features from points 1833 to 1892, and 100 (N = 100) particles are generated
as the prior particle swarm. Using the enhanced particle filter algorithm to predict
the remaining useful life according to the steps described in Fig. 6.24, the 232nd
prediction point exceeded the fault threshold for the first time; therefore the predicted
remaining useful life at point 1892 was 2320 min (232 × 10 min = 2320 min). Using
the same procedure to predict the remaining useful life for all points in the fault
degradation state, 228 remaining life prediction points were obtained (the 2120th
point is the prediction stop point). The predicted results are shown in Fig. 6.27, in
which the horizontal axis represents the starting point of each life prediction and the
vertical axis represents the remaining useful life corresponding to that starting point.
As can be seen from the figure, the predicted remaining useful life curve can basically
reflect the trend of the true remaining useful life curve, and the closer to the
prediction end point, the smaller the fluctuation of the predicted remaining useful
life curve; this means that the closer the bearing state is to the failure time point,

Fig. 6.25 Degradation tracking curve of bearing 3



Fig. 6.26 Traditional particle filter residual service life prediction results of bearing 3

the higher the accuracy of remaining useful life prediction is. For comparison, this
experiment also uses the traditional particle filter method and a common prediction
model, the support vector regression method; the same data segment is used to train
the AR model and the regression network, the remaining useful life of bearing 3 is
predicted by the same steps, and the results are shown in Figs. 6.26 and 6.28.
In order to quantitatively evaluate the prediction error, Eqs. 6.36 and 6.37 are used
to calculate the mean error e_3 and the root-mean-square error e_4:

e_3 = (1/M) Σ_{i=1}^{M} |RUL_p(i) − RUL_r(i)|        (6.36)

e_4 = sqrt( (1/M) Σ_{i=1}^{M} (RUL_p(i) − RUL_r(i))^2 )        (6.37)

where RUL_p is the predicted remaining useful life, RUL_r is the true remaining
useful life, and M is the number of remaining useful life points used for evaluation,
which is 228 in this experiment.
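The two error measures of Eqs. 6.36 and 6.37 can be computed directly; the sample RUL values below are illustrative, not the experimental data:

```python
import numpy as np

def mean_error(rul_pred, rul_true):
    """e_3 of Eq. 6.36: mean absolute error over the M prediction points."""
    return np.mean(np.abs(rul_pred - rul_true))

def rms_error(rul_pred, rul_true):
    """e_4 of Eq. 6.37: root-mean-square error over the M prediction points."""
    return np.sqrt(np.mean((rul_pred - rul_true) ** 2))

rul_true = np.array([40.0, 35.0, 30.0, 25.0])   # illustrative RUL values (hours)
rul_pred = np.array([42.0, 31.0, 31.0, 24.0])
print(mean_error(rul_pred, rul_true))            # → 2.0
print(rms_error(rul_pred, rul_true))
```

Because e_4 squares the deviations, it penalizes occasional large prediction errors more heavily than e_3.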
In addition, similar to the simulation experiment above, the effective particle number
and the standard deviation of the particles are also calculated for each remaining
useful life prediction point, and the average values of these two parameters over

Fig. 6.27 Enhanced particle filter residual life prediction results of bearing 3

all remaining useful life prediction points are computed; the results are shown in
Table 6.5.
As can be seen from Fig. 6.26 and Table 6.5, although the remaining useful life prediction curve obtained by the traditional particle filter method can also reflect the trend of the real remaining useful life curve, both the fluctuation of the prediction curve and the prediction error are obviously higher than those of the enhanced particle filter algorithm. In addition, the average effective particle number and the average standard deviation of the particle swarm of the enhanced particle filter algorithm are larger than those of the traditional particle filter algorithm. These comparison results show that the adaptive importance density function selection method and the neural-network-based particle smoothing algorithm can effectively reduce particle degeneracy and keep the diversity of particles; therefore, the prediction accuracy of the particle filter is improved. Furthermore, as can be seen from Fig. 6.28 and Table 6.5, the fluctuation and prediction error of the remaining useful life prediction curve obtained by the support vector regression algorithm are larger than those obtained by the particle filter algorithms. This can be attributed to applying the support vector regression algorithm to remaining useful life prediction under conditions of insufficient training samples and long prediction horizons; to meet the requirement of long-term prediction, the selection of the kernel function and regression parameters should be further optimized.

Table 6.5 Quantitative evaluation parameters of bearing 3 in different prediction methods

                                        Enhanced          Traditional       Support vector
                                        particle filter   particle filter   regression
  Average error e3 (in hours)           1.67              3.89              5.09
  Root-mean-square error e4 (in hours)  2.43              4.89              6.16
  Average number of effective particles 75                59                /
  Mean standard deviation of particles  0.0608            0.0291            /

Fig. 6.28 Support vector regression residual life prediction results of bearing 3
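The effective particle number reported in Table 6.5 is the standard effective sample size of importance sampling, computed from the normalized particle weights; a minimal sketch (the weight vectors are illustrative):

```python
def effective_particle_number(weights):
    """Effective sample size N_eff = 1 / sum(w_i^2) of normalized importance weights.

    N_eff equals the particle count for uniform weights and tends towards 1 as
    the weight mass collapses onto a single particle (particle degeneracy).
    """
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize defensively
    return 1.0 / sum(wi * wi for wi in w)

print(effective_particle_number([0.01] * 100))           # uniform weights -> 100.0
print(effective_particle_number([0.97, 0.01, 0.01, 0.01]))  # degenerate -> close to 1
```

A larger average N_eff, as obtained by the enhanced algorithm, indicates that more particles carry significant weight.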

References

1. Meng, Q.: Nonlinear Dynamical Time Series Analysis Methods and Its Application. Shandong
University (in Chinese), Communication and Information Systems (2008)
2. Packard, N.H., Crutchfield, J.P., Farmer, J.D., et al.: Geometry from a time series. Phys. Rev.
Lett. 45(9), 712–716 (1980)
3. Takens, F.: Detecting strange attractors in turbulence. In: Dynamical Systems and Turbulence,
Warwick 1980, pp. 366–381. Springer, Heidelberg (1981)
4. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge University Press, UK
(1997)
5. Broomhead, D.S., King, G.P.: Extracting qualitative dynamics from experimental data.
Physica D 20, 217–236 (1986)
6. Kennel, M.B., Brown, R., Abarbanel, H.D.I.: Determining embedding dimension for phase
space reconstruction using a geometrical construction. Phys. Rev. A 45(6), 3403–3411 (1992)
7. Webber, C.L., Zbilut, J.P.: Dynamical assessment of physiological systems and states using
recurrence plot strategies. J. Appl. Physiol. 76(2), 965–973 (1994)

8. Zbilut, J.P.: Detecting deterministic signals in exceptionally noisy environments using cross-
recurrence quantification. Phys. Lett. A 246(1), 122–128 (1998)
9. Nichols, J.M., Trickey, S.T., Seaver, M.: Damage detection using multivariate recurrence
quantification analysis. Mech. Syst. Signal Process. 20(2), 421–437 (2006)
10. Marwan, N.: Encounters with Neighbors: Current Developments of Concepts Based on
Recurrence Plots and Their Applications. University of Potsdam, Potsdam (2003)
11. Bearing Data Center Seeded Fault Test Data: The Case Western Reserve University Bearing
Data Center Website. http://csegroups.case.edu/bearingdatacenter/pages/welcome-case-western-reserve-university-bearing-data-center-website
12. Marwan, N., Romano, M.C., Thiel, M., et al.: Recurrence plots for the analysis of complex
systems. Phys. Rep. 438(5–6), 237–329 (2007)
13. Thiel, M., Romano, M.C., Kurths, J., et al.: Influence of observational noise on the recurrence
quantification analysis. Physica D 171(3), 138–152 (2002)
14. Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82,
35–45 (1960)
15. Goto, S., Nakamura, M., Uosaki, K.: Online spectral estimation of nonstationary time-series
based on AR model parameter-estimation and order selection with a forgetting factor. IEEE
Trans. Signal Process. 43(6), 1519–1522 (1995)
16. Zhang, Y., Zhou, G., Shi, X., et al.: Application of Burg algorithm in time-frequency analysis
of doppler blood flow signal based on AR modeling. J. Biomed. Eng. 22(3), 481–485 (2005)
17. Qiu, H., Lee, J., Lin, J., et al.: Robust performance degradation assessment methods for
enhanced rolling element bearings prognostics. Adv. Eng. Inform. 17(3–4), 127–140 (2003)
18. Qiu, H., Lee, J., Lin, J.: Wavelet filter-based weak signature detection method and its application
on roller bearing prognostics. J. Sound Vib. 289(4–5), 1066–1090 (2006)
19. Zhu, Z.: Particle Filtering Algorithm and Its Application (in Chinese). Science Press, Beijing
(2010)
20. Samantaray, S.R., Dash, P.K.: High impedance fault detection in distribution feeders using
extended kalman filter and support vector machine. Eur. Trans. Electrical Power 20(3), 382–393
(2010)
21. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/ non-Gaussian
Bayesian state estimation. IEE Proceedings-F Radar Signal Process. 140(2), 107–113 (1993)
22. Sun, L., Jia, Y., Cai, L., et al.: Residual useful life prediction of gearbox based on particle
filtering parameter estimation method (in Chinese). Vib. Shock 32(6), 6–12 (2013)
23. Chen, C., Vachtsevanos, G., Orchard, M.: Machine remaining useful life prediction: an inte-
grated adaptive neuro-fuzzy and high-order particle filtering approach. Mech. Syst. Signal
Process. 28, 597–607 (2012)
24. Arulampalam, M.S., Maskell, S., Gordon, N., et al.: A tutorial on particle filters for online
nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 50(2), 174–188 (2002)
25. Yoon, J.H., Kim, D.Y., Yoon, K.J.: Gaussian mixture importance sampling function for
unscented SMC-PHD filter. Signal Process. 93(9), 2664–2670 (2013)
26. Li, T., Zhao, D., Huang, Z., et al.: A wavelet-based grey particle filter for self-estimating the
trajectory of manoeuvring autonomous underwater vehicle. Meas. Control 36(3), 321–325
(2014)
27. Cao, B.: Research on Improved Algorithms and Applications Based on Particle Filter (in
Chinese). Xi’an Institute of Optics and Precision Mechanics of CAS (2013)
28. Rigatos, G.G.: Particle filtering for state estimation in nonlinear industrial systems. IEEE Trans.
Instrum. Meas. 58(11), 3885–3900 (2009)
29. Bolic, M., Djuric, P.M., Hong, S.J.: Resampling algorithms and architectures for distributed
particle filters. IEEE Trans. Signal Process. 53(7), 2442–2450 (2005)
30. Li, Q., Yu, J., Mu, B., et al.: BP neural network prediction of the mechanical properties of
porous NiTi shape memory alloy prepared by thermal explosion reaction. Mater. Sci. Eng. A
419(1–2), 214–217 (2006)
Chapter 7
Complex Electro-Mechanical System
Operational Reliability Assessment
and Health Maintenance

7.1 Complex Electro-Mechanical System Operational Reliability Assessment

The mechanical equipment manufacturing industry is an important embodiment of national integrated strength and defense strength. As the key tool and resource of the mechanical equipment manufacturing industry, the complex electro-mechanical system
is the foundation of the national economy and industry. However, under conditions of long-term operation, variable loads and multi-physical-field coupling, the performance of complex electro-mechanical systems gradually degrades, which impairs their reliability and remaining useful life. Typically, an aircraft engine suffers from safety hazards such as rotor cracks, bearing damage, and rubbing between rotating and stationary parts, owing to its inherently complex structure and severe working conditions, thus affecting its reliability. According to statistics from the Air Force Materials Laboratory (AFML), more than 40% of flight accidents caused by mechanical excitation
are related to aircraft engines. In aero-engine accidents, the failure damage of engine
rotors (including rotating parts such as shafts, bearings, discs and blades) accounts
for more than 74%. At present, the overhaul period of Chinese aero engines is half
of that of American aero engines, and the total service life is only 1/4 of that of
American aero engines. For example, the refurbishment life of J-10 power is 300
flight hours and the total life is 900 flight hours. The overhaul period of the 3rd gener-
ation turbofan engines F100 and F110 in the US is about 800–1000 flight hours, and
the total service life is about 2000–4000 flight hours. Another typical example is
construction machinery: it works in the field all year round, even in extreme environ-
ments such as high altitude, hypoxia and drought, with variable working conditions,
those damages or destructions seriously endanger the reliability and service life of
electro-mechanical systems. According to statistics, engine failures account for about
30% of engineering machinery failures, compared to 20% by transmission system
failures, 25–35% by hydraulic system failures, and 15–25% by braking system fail-
ures and structural weld cracking. The mean time between failures (MTBF) in the 1000 h reliability test and the “three guarantees” (warranty) period abroad is 500–800 h, with a maximum of more than 2000 h. The MTBF in China is 150–300 h, among which the mean time between failures of wheel loaders is 297.1 h, 400 h at most and 100 h at least.

© National Defense Industry Press 2023
W. Li et al., Intelligent Fault Diagnosis and Health Assessment for Complex
Electro-Mechanical Systems, https://doi.org/10.1007/978-981-99-3537-6_7
To sum up, high failure rate, low reliability and short service life are the bottlenecks that restrict the international competitiveness and influence of Chinese mechanical equipment, and overcoming them is the key to transforming China from a manufacturing giant into a manufacturing power. The difficulties are as follows:
(1) Reliability research and controllable life design in China are at the initial stage.
The mechanical equipment R&D in China has always been imitating foreign
products through surveying and mapping for design and development. Thus,
there is a lack of basic data such as load spectrum, reliability and life test of key
parts, and the relationship between load spectrum, service condition parameters,
reliability and life has not been established, making it impossible to control the
service life and reliability of products in the design stage.
(2) Traditional reliability theory and life test research mainly rely on the classical probability and statistics method, which must meet three prerequisites: a large number of samples, as required by the law of large numbers; probability repeatability of the samples; and freedom from human-factor disturbance. However, these three prerequisites are difficult to meet in the reliability and life test analysis of mechanical equipment: in view of economic and time costs, it is unrealistic to obtain reliability life data from a large number of samples, and, due to differences in operation and maintenance among machines, it is difficult to ensure probability repeatability and the absence of human-factor disturbance.
(3) Traditional reliability assessment is mainly based on the binary hypothesis or
finite state hypothesis, and generally, it is considered that there are only two
states (namely, normal and failure) or finite states. However, the health status
of electro-mechanical system has the characteristics of continuous progressive
degradation and random decentralized failure, and the binary hypothesis and
finite state hypothesis are insufficient to reveal the health attributes of mechanical
equipment.
(4) Changes in service conditions and operating parameters (such as temperature, vibration, load, pressure, electrical load, etc.) often affect the operating reliability of electro-mechanical systems; reliability is affected whenever any single parameter exceeds its limit or fails. However, the machine service conditions and parameters rarely follow traditional mathematical distribution forms, so their changes are often more difficult to deal with.
In order to reveal the failure rules of parts, structures and equipment, both domestic
and foreign scholars have conducted in-depth research, explored the physical mech-
anism of mechanical structure performance degradation and failure evolution, and
put forward some reliability prediction models that can reflect the failure correlation
among components. These models mainly estimate and predict the failure character-
istics of the population (such as mean time to failure, reliable operation probability,
etc.) based on a large number of historical failure data. The literature [1–3] summarizes the reliability prediction methods and theories. Common reliability prediction

methods based on fault events include linear model, polynomial model, exponen-
tial model, time series model, regression model, etc. Due to the rich information
resources contained in the operation state data, some scholars have fused reliability
methods with fault prediction techniques in recent years, making the fault prediction
results more scientific and complete.

7.1.1 Definitions of Reliability

Reliability: the ability or possibility of components, products and systems to perform specified functions without failure within a certain period of time and under certain conditions. The reliability of products is usually evaluated by the degree of reliability, failure rate, MTBF and other indicators. Mathematically:

Degree of reliability: refers to the probability that the product completes its predetermined function within the specified time and under the specified conditions. Assuming that the specified time is t and the product life is T, the reliability is usually expressed as the probability that T > t:

R(t) = P(T > t) (7.1)

Failure probability: refers to the probability that a product loses its intended function under specified conditions and within a specified time, also known as failure rate, unreliability, etc., which is recorded as F(t) and usually expressed as:

F(t) = P(T ≤ t) (7.2)

Obviously, the relationship between failure probability and reliability is:

F(t) = 1 − R(t) (7.3)

Operation reliability: a normalized health measurement index for completing the scheduled function, determined from operation status information, under the specified conditions and within the service time.
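The relations in Eqs. (7.1)–(7.3) can be illustrated with an empirical estimate over a set of observed lifetimes; a minimal Python sketch (the lifetime values are hypothetical):

```python
def reliability(lifetimes, t):
    """Empirical degree of reliability R(t) = P(T > t), Eq. (7.1)."""
    return sum(1 for T in lifetimes if T > t) / len(lifetimes)

def failure_probability(lifetimes, t):
    """Empirical failure probability F(t) = P(T <= t), Eq. (7.2)."""
    return sum(1 for T in lifetimes if T <= t) / len(lifetimes)

# Hypothetical observed lifetimes (in hours) of eight nominally identical products
lifetimes = [120, 340, 560, 210, 480, 90, 300, 650]
t = 300
print(reliability(lifetimes, t))                                      # 0.5
print(reliability(lifetimes, t) + failure_probability(lifetimes, t))  # 1.0, Eq. (7.3)
```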

7.1.2 Operational Reliability Assessment

7.1.2.1 Condition Monitoring and Signal Acquisition for Machinery

As we all know, machinery has experienced a series of degradation states from normal
states to failure, and the operation process can be monitored by some measurable
variables. Therefore, it is very important to establish an internal relationship between

mechanical operation reliability and condition monitoring information, such as vibration signal, temperature, pressure, etc. Vibration signal, the most commonly used data, is acquired by sensors. The operation reliability evaluation starts with the monitoring data obtained by sensors and the data acquisition system, which aims at collecting the mechanical state information in certain operating states.

7.1.2.2 Lifting Wavelet Packet Transform

In order to extract the mechanical dynamic signal characteristics from the collected excitation response signal, the wavelet transform can be employed. It has at least the following two outstanding advantages:
(1) The multi-resolution ability of wavelet transform can observe the signal at
different scales (resolutions), decompose the signal into different frequency
bands, and see the full picture of the signal as well as the details of the signal.
(2) The orthogonal property of wavelet transform can decompose any signal into
its own independent frequency bands so that the decomposed signals in these
independent frequency bands carry different mechanical state information.
The lifting wavelet packet inherits the good multi-resolution and time–frequency localization characteristics of the first-generation wavelet transform, and has the advantages of faster and more efficient execution, simple structure, and low computational complexity [4]. Daubechies proved that any wavelet transform can be executed by a lifting scheme [5]. The lifting wavelet packet transform is
a lifting scheme based on wavelet packet transform, which includes forward trans-
formation (decomposition) and inverse transformation (reconstruction). The inverse
transformation can be realized by running the forward transformation in the reverse
direction. The specific process is explained as follows:
(1) Decomposition
The forward transformation of the lifting wavelet packet for signal decomposition
includes three steps: split, prediction and update.
Split: suppose there is an original signal S = {x(k), k ∈ Z}. The original signal can be divided into two sub-sequences: an even sequence se = {se (k), k ∈ Z} and an odd sequence so = {so (k), k ∈ Z}, where x(k) is the kth sample in sequence S and Z is the set of positive integers.

se (k) = x(2k), k ∈ Z (7.4)

so (k) = x(2k + 1), k ∈ Z (7.5)

where k is the sample order for sub-sequence se and so .



The reason for splitting an original signal into two series is that adjacent samples
are much more correlated than those far from each other. Therefore, the odd and even
series are highly correlated.
Prediction and update: some even series samples can be used to predict specific
samples of odd series, and the difference of prediction is called detail signal. The
even series can be updated by the obtained detail signal, and the improved even signal
is called the approximate signal:

sl1 = s(l−1)1o − P(s(l−1)1e ) (7.6)

sl2 = s(l−1)1e + U (sl1 ) (7.7)

sl(2l −1) = s(l−1)2l−1 o − P(s(l−1)2l−1 e ) (7.8)

sl2l = s(l−1)2l−1 e + U (sl(2l −1) ) (7.9)

After the lth decomposition, sl1 , sl2 , . . . , sl2l are the decomposed signals on each frequency band; s(l−1)1o , . . . , s(l−1)2l−1 o are the odd sequences after the (l − 1)th decomposition; s(l−1)1e , . . . , s(l−1)2l−1 e are the even sequences after the (l − 1)th decomposition; P is the prediction operator with N coefficients p1 , p2 , . . . , p N ; U is the update operator with Ñ coefficients u 1 , u 2 , . . . , u Ñ . The forward transformation of the lifting wavelet packet is illustrated in Fig. 7.1.

(2) Reconstruction

The inverse transform for signal reconstruction can be derived from the forward transform by running the lifting scheme illustrated in Fig. 7.1 backwards. The signal in the frequency band to be reconstructed is retained, and the others are set to zero. The signal reconstruction of the second-generation wavelet packet transform for an appointed frequency band is carried out as follows:

Fig. 7.1 Lifting wavelet packet transform

s(l−1)2l−1 e = sl2l − U (sl(2l −1) ) (7.10)

s(l−1)2l−1 o = sl(2l −1) + P(s(l−1)2l−1 e ) (7.11)

s(l−1)2l−1 (2k) = s(l−1)2l−1 e (k), k ∈ Z (7.12)

s(l−1)2l−1 (2k + 1) = s(l−1)2l−1 o (k), k ∈ Z (7.13)

s(l−1)1e = sl2 − U (sl1 ) (7.14)

s(l−1)1o = sl1 + P(s(l−1)1e ) (7.15)

s(l−1)1 (2k) = s(l−1)1e (k), k ∈ Z (7.16)

s(l−1)1 (2k + 1) = s(l−1)1o (k), k ∈ Z (7.17)

Therefore, this chapter will use the lifting wavelet packet method, which divides
the signal into multi-level frequency bands in the whole frequency band, to analyze
the mechanical vibration signal more precisely.
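The split, prediction and update steps of the forward transform, together with the reversed steps of the inverse transform, can be sketched for a single decomposition level. The predictor and updater below are the simplest Haar-style choices (P(e) = e, U(d) = d/2); they are illustrative assumptions, not the operators used in this book's experiments:

```python
def lifting_forward(x):
    """One level of the lifting scheme: split, predict, update."""
    even, odd = list(x[0::2]), list(x[1::2])            # split
    detail = [o - e for o, e in zip(odd, even)]          # predict odd from even
    approx = [e + d / 2 for e, d in zip(even, detail)]   # update the even sequence
    return approx, detail

def lifting_inverse(approx, detail):
    """Inverse transform: run the lifting steps backwards, then merge."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    merged = []
    for e, o in zip(even, odd):                          # inverse of the split step
        merged.extend([e, o])
    return merged

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a, d = lifting_forward(x)
print(lifting_inverse(a, d) == x)  # True: perfect reconstruction
```

Because each lifting step is inverted exactly by subtracting what was added, perfect reconstruction holds for any choice of predictor and updater.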

7.1.2.3 Energy Distribution of Lifting Wavelet Packet Transformation

Since the orthogonal basis of the lifting wavelet packet follows the law of conservation of energy [6], each of the 2^l bands obtained after the lth decomposition and reconstruction has the same bandwidth, and the bands are connected end-to-end. Let sl,i (k) be the ith band after the lth decomposition, whose energy El,i and relative energy Ẽl,i are defined as follows:

$$E_{l,i} = \frac{1}{n-1} \sum_{k=1}^{n} \big( s_{l,i}(k) \big)^2, \quad i = 1, 2, \cdots, 2^l, \; k = 1, 2, \cdots, n, \; n \in Z \tag{7.18}$$

$$\tilde{E}_{l,i} = E_{l,i} \left( \sum_{i=1}^{2^l} E_{l,i} \right)^{-1} \tag{7.19}$$
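Given the reconstructed band signals, Eqs. (7.18) and (7.19) reduce to a few lines of code; a sketch with made-up band signals:

```python
def band_energies(bands):
    """Band energy E_{l,i} (Eq. 7.18) and relative energy (Eq. 7.19) per band."""
    energies = [sum(s * s for s in band) / (len(band) - 1) for band in bands]
    total = sum(energies)
    relative = [e / total for e in energies]
    return energies, relative

# Four hypothetical reconstructed band signals after a level-2 decomposition
bands = [
    [1.0, -1.0, 1.0],     # energetic low-frequency band
    [0.5, 0.5, -0.5],
    [0.1, 0.0, -0.1],
    [0.01, 0.0, -0.01],   # nearly empty high-frequency band
]
E, E_rel = band_energies(bands)
print(sum(E_rel))  # relative energies sum to one (up to rounding)
```

The relative energies form a probability-like distribution over the frequency bands, which is what the entropy measure in the next subsection operates on.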

7.1.2.4 Entropy and Measurement of Operational Reliability

Entropy is a commonly used measure of uncertainty and an important concept in modern dynamical systems and ergodic theory. Einstein once called the law of entropy “the first law of the whole science”. Information entropy was proposed by the American scholar C. E. Shannon in 1948 to measure the uncertainty in a system by introducing thermodynamic entropy into information theory [7–9]. In information theory,
information entropy represents the average amount of information provided by each
symbol and the average uncertainty of the information source. Given an uncertain
system X = {xn }, its information entropy can be expressed as [10]:


$$S_v(X) = -\sum_{i=1}^{n} p_i \log(p_i) \tag{7.20}$$

where $\{p_i\}$ stands for the probability distribution of $\{x_n\}$, and $\sum_{i=1}^{n} p_i = 1$. Information entropy is used to describe the uncertainty of the system and evaluate the
complexity of random signals. According to this theory, the more uncertain the probability distribution (the equal-probability distribution being the extreme case), the larger the entropy value, and vice versa; the magnitude of information entropy thus reflects the uniformity of the probability distribution. Therefore, using information entropy as a dimensionless index can measure the irregularity and complexity of mechanical signals in real time and evaluate the reliability of the mechanical state. The operational reliability and health maintenance block diagram of complex electro-mechanical systems is shown in Fig. 7.2.
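Equation (7.20), applied for example to the relative band energies and divided by its maximum value log n, yields a dimensionless index in [0, 1]; a minimal sketch:

```python
import math

def normalized_entropy(p):
    """Shannon entropy of a distribution, scaled to [0, 1] by its maximum log(n)."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(len(p))

print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform -> 1.0 (most uncertain)
print(normalized_entropy([0.97, 0.01, 0.01, 0.01]))  # concentrated -> near 0
```

A healthy machine with vibration energy spread evenly across bands gives an index near 1, while energy concentrating in a few fault-related bands drives the index down.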

7.2 Reliability Assessment and Health Maintenance of Turbo Generator Set in Power Plant

Turbo generators produce a large amount of electrical energy, are an important part of the electric power system, and are widely used in the electric power industry all over the world. With a detailed, long-term maintenance plan in place, utilities can ensure that their facilities will safely deliver as much reliable power to the grid as possible. The criteria for a turbo generator are high reliability and high performance, with many starts and flexible operation throughout the service life. In addition, modern
turbo generators are built to last between 30 and 40 years. With aging generator
units and mechanical components, reliability and safety evaluation are imperative
indicators for a plant to prevent failures.
Relevant research is studied worldwide by many researchers and engineers.
Matteson proposed a dynamic multi-criteria optimization framework for the sustain-
ability and reliability evaluation of power systems [11]. Lo Prete proposed a model
to assess and quantify the sustainability and reliability of different power production
scenarios [12]. Moharil et al. analyzed the generator system reliability with wind

Fig. 7.2 Illustration of operational reliability and health maintenance of complex electro-mechanical system

energy penetration in the conventional grid [13]. Since turbo generator faults have
a significant impact on safety, Whyatt et al. identified failure modes experienced
by turbo generators and described their reliability [14]. Tsvetkov et al. presented a
mathematical model for the analysis of generator reliability, including the develop-
ment of defects [15]. Generally speaking, traditional approaches entail collecting
sufficient failure samples to estimate the general probability of the system or compo-
nent failures and the distribution of the time-to-failure. It is usually difficult to use
probability and statistics for turbo generator safety analysis due to the lack of failure
samples and time-to-failure data. The failure rate of a generator includes all the fail-
ures which cause the generator to shut down and also depends on the maintenance
and operating policy of utilities. In fact, turbo generators are usually set on different
operating parameters and conditions such as temperatures, vibration, load and stress.
The variations of the operating parameters can affect operational safety whenever a
single parameter or condition is out of limit and failures can also be caused by the
interaction of operating parameters. It has been realized from the real-time operation
that a component will experience more failures during heavy loading conditions than
during light loading conditions, which means that the failure rate of a component
in real-time operation is not constant and varies with operating parameters [16].

Depending on the operating parameters and conditions, the constitutive components of a turbo generator will go through a series of degradation states evolving from
functioning to failure. Therefore, there is a great demand for ways of assessing the
operational safety of turbo generators with time-varying operational parameters and
conditions during their whole life span, which is beneficial for implementing optimal
condition-based maintenance schedules with low failure risk.
When condition monitoring is performed during plant operational transients, the
intrinsically dynamic behavior of the monitored time-varying signals should be taken
into account. Monitoring the condition of a component is typically based on several
sensors that estimate the values of some measurable parameters (signals) and trigger a
fault alarm when the measured signal is out of the limit. To this purpose, Baraldi et al.
proposed approaches based on the development of several reconstruction models and
the signals were preprocessed using Haar wavelet transforms for a gas turbine during
start-up transients [17]. Lu et al. proposed a simplified on-board model with sensor
fault diagnostic logic for turbo-shaft engines [18]. Li et al. established a hybrid model
for hydraulic turbine-generator units based on nonlinear vibration [19]. The above
operational safety diagnosis and evaluation methods have mainly utilized dynamic
monitored information, so how to process the monitored information and associate
it with operational safety is very essential.
Information entropy is an effective indicator to measure a system’s degree of
uncertainty. Based on the information entropy theory, the most uncertain probability
distribution (such as the equal probability distribution) has the largest entropy, and
the most certain probability distribution has the smallest entropy. On this basis, the
use of information entropy is widespread in engineering applications, such as topo-
logical entropy of a given interval map, spatial entropy of pixels, weighted multiscale
permutation entropy of nonlinear time series, Shannon differential entropy for distri-
butions, min- and max-entropies, collision entropy, permutation entropy [20], time entropy [21], multiscale entropy, wavelet entropy [22], and so on; different types of information entropy have been defined in accordance with their own usage.
Entropy is well used in machinery fault diagnosis. Sawalhi et al. used minimum
entropy and spectral kurtosis for fault detection in rolling element bearings [23].
Tafreshi et al. proposed a machinery fault diagnosis method utilizing entropy measure
and energy map [24]. He et al. used approximate entropy as a nonlinear feature parameter for fault diagnosis of rotating machinery [25]. Wu et al. proposed a bearing
fault diagnosis method based on multiscale permutation entropy and support vector
machine [26].
In the branch of information entropy, Rényi entropy was introduced by Alfréd
Rényi in 1960 [27], which is known as a parameterized family of uncertainty
measures. It is noteworthy that the classical Shannon entropy is the special case of Rényi entropy obtained in the limit as the order α tends to one. Similarly, other
entropy measures that have appeared in various kinds of literature are also special
cases of Rényi’s entropy [28]. Besides being of theoretical interest as a unification
of several distinct entropy measures, Rényi entropy has found various applications
in statistics and probability [29], pattern recognition [30], quantum chemistry [31],
biomedicine [32], etc.
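For reference, Rényi entropy of order α is H_α(p) = log(Σ_i p_i^α)/(1 − α) for α ≠ 1; the sketch below checks numerically that it approaches the Shannon entropy as α → 1:

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) = log(sum p_i^alpha) / (1 - alpha), alpha != 1."""
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
# As the order alpha approaches 1, the Rényi entropy approaches the Shannon entropy
print(renyi_entropy(p, 1.0001), shannon_entropy(p))
```

The order α weights the contribution of high-probability events: α > 1 emphasizes dominant components of the distribution, while α < 1 emphasizes rare ones.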

Therefore, a new method of operational reliability evaluation and health maintenance is proposed. This method extracts the relative energy of each frequency band of the reconstructed signals through lifting wavelet packet decomposition and reconstruction
and maps it to the [0, 1] interval by Rényi entropy. Firstly, the sensor-dependent vibra-
tion signals reflecting the time-varying characteristic of an individual turbo generator
are acquired by professional sensors and then analyzed by the lifting wavelet packet
since the wavelet transform excels in analyzing unsteady signals in both the time
domain and frequency domain. The relative energy of the decomposed and recon-
structed signals in each frequency band describes the energy distribution character-
istics of the signals in different frequency bands, and the signal characteristics are
mapped to the reliability [0, 1] interval by defining the wavelet Rényi entropy, which
is applied in the 50 MW turbo generator.

7.2.1 Condition Monitoring and Vibration Signal Acquisition

After a 50 MW turbo generator unit (shown in Fig. 7.3) is repaired, in order to ensure its normal start-up and operation, an MDS-2 portable vibration monitoring
system and professional sensors are used for vibration monitoring for the #1 and #2
bearing bushings in the high-pressure cylinder, #3 and #4 bearing bushings in the
low-pressure cylinder and #5 and #6 bearing bushings in the electric generator. The
structure of this turbo generator unit is shown in Fig. 7.4 and it mainly consists of
a high-pressure cylinder, a low-pressure cylinder, an electric generator and #1 ~ #6
bearing bushings.
With the increased speed and load in the start-up process, all the bearing bushings
are in normal states since the peak-to-peak vibration in the vertical direction is
less than 50 μm, except for the vibration of the #4 bearing bushing in the low-
pressure cylinder which is out of limit. Therefore, the condition monitoring emphasis
is focused on the vertical vibration of the #4 bearing bushing. In the start-up process
with an empty load, the peak-to-peak vibration in the vertical direction of #4 bearing
bushing is 24.7 μm at the speed of 740 r/min. Moreover, the peak-to-peak vibration

Fig. 7.3 The illustration of the 50 MW turbo generator unit

Fig. 7.4 The structure diagram of the 50 MW turbo generator unit

Fig. 7.5 The waveform of #4 bearing bushing vibration signal in time domain

in the vertical direction is increased to 63.2 μm at the speed of 3000 r/min and even
to 86.0 μm at the speed of 3360 r/min.
Afterward, vibration monitoring is conducted at a stable speed of 3000 r/min
with several given loads. The peak-to-peak vibration is about 74 μm with a load
of 6 MW, 104 μm with a load of 16 MW, and even increases to 132 μm with a
load of 20 MW. The vibration is too severe to increase the load more, so the load is
decreased to 6 MW and the peak-to-peak vibration is about 75–82 μm. The acquired
vibration waveform is shown in Fig. 7.5, which shows disorder and dissymmetry at
the top and bottom of the vibration signal. The sampling frequency is 2 kHz. The FFT
spectrum of the vibration signal is shown in Fig. 7.6. It can be seen that the amplitude
of the running frequency of 50 Hz is the largest in the whole frequency range. In
the 100–500 Hz frequency band, there are a large number of harmonic frequency
components from 2 times running frequency to 10 times running frequency, and
their amplitudes are also large. Generally speaking, the 50 Hz running frequency
component can represent rotor imbalance, and the 100 Hz double running frequency
mainly reflects the misalignment of shafting. For the rest of the high-frequency
harmonic components, it is difficult to analyze the current health status of the unit
based on the FFT spectrum alone.
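The spectrum reading above (a dominant 50 Hz running-frequency line plus harmonics from 2 to 10 times the running frequency in a 2 kHz-sampled signal) can be reproduced with a short script. The sketch below is illustrative rather than the authors' code: the function name, the synthetic test signal and the nearest-bin lookup are assumptions.

```python
import numpy as np

def harmonic_amplitudes(x, fs, f0=50.0, n_harmonics=10):
    """Single-sided FFT amplitude spectrum of x, plus the amplitudes at
    the running frequency f0 and its harmonics (nearest FFT bins).
    fs = 2 kHz and f0 = 50 Hz follow the monitoring setup in the text."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x)) * 2.0 / n        # amplitude spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)         # bin frequencies in Hz
    amps = {}
    for k in range(1, n_harmonics + 1):
        idx = int(np.argmin(np.abs(freqs - k * f0)))
        amps[k * f0] = spec[idx]
    return freqs, spec, amps
```

For a 1 s record containing, say, a 60 μm component at 50 Hz and a 20 μm component at 100 Hz, `amps` recovers those amplitudes under the corresponding frequency keys.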

7.2.2 Vibration Signal Analysis

In order to further analyze the sensor-dependent vibration signal, a lifting wavelet packet is adopted to decompose the original signal to the extent of level 2, level 3
and level 4, respectively. Figure 7.7 shows the four signals obtained by the lifting
430 7 Complex Electro-Mechanical System Operational Reliability …

Fig. 7.6 The FFT spectrum of #4 bearing bushing in the turbo generator unit

wavelet packet in level 2, which correspond to the frequency bands of 0–250 Hz,
250–500 Hz, 500–750 Hz and 750–1000 Hz respectively.
Figure 7.8 shows the relative energy distribution of lifting wavelet packet recon-
structed signal in level 2. The first band has the largest relative energy, while the
second band has much more relative energy than the remaining two bands. On the
basis of level 2 wavelet packet analysis, the original signal is further decomposed
and reconstructed in level 3 and eight signals are obtained, as shown in Fig. 7.9. They
correspond to frequency bands 0–125 Hz, 125–250 Hz, 250–375 Hz, 375–500 Hz,
500–625 Hz, 625–750 Hz, 750–875 Hz and 875–1000 Hz, respectively. The relative
energy distribution of the eight signals is shown in Fig. 7.10. The energy of the first
four frequency bands is much larger than that of the last four frequency bands.
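The band-energy computation used in this subsection can be sketched with a toy wavelet packet built from the Haar lifting step (split, predict, update). This is a simplified stand-in, not the authors' implementation: the normalization of the orthonormal Haar transform is omitted, and the 2^l bands follow the natural tree order rather than strictly ascending frequency.

```python
import numpy as np

def haar_lifting_step(x):
    """One Haar lifting step: split into even/odd samples, predict the
    detail d, then update the approximation s."""
    even, odd = x[0::2], x[1::2]
    d = odd - even            # predict: detail (high-frequency) part
    s = even + d / 2.0        # update: approximation (low-frequency) part
    return s, d

def wavelet_packet_energies(x, level):
    """Split x into 2**level wavelet packet bands and return each band's
    share of the total energy (cf. the relative energy of Eq. 7.19)."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(level):
        bands = [half for b in bands for half in haar_lifting_step(b)]
    e = np.array([np.sum(b ** 2) for b in bands])
    return e / e.sum()
```

A purely low-frequency (constant) signal puts all relative energy in the first band, while a broadband signal spreads it across bands, mirroring the distributions discussed here.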
On the basis of the level 3 signal analysis, the original signal is further decomposed
to the extent of level 4 as shown in Fig. 7.11. The obtained sixteen signals correspond
to the frequency bands of 0–62.5 Hz, 62.5–125 Hz, 125–187.5 Hz, 187.5–250 Hz,

Fig. 7.7 Lifting wavelet packet reconstructed signal in level 2



Fig. 7.8 The relative energy distribution of lifting wavelet packet reconstructed signal in level 2

Fig. 7.9 Lifting wavelet packet reconstructed signal in level 3

250–312.5 Hz, 312.5–375 Hz, 375–437.5 Hz, 437.5–500 Hz, 500–562.5 Hz, 562.5–
625 Hz, 625–687.5 Hz, 687.5–750 Hz, 750–812.5 Hz, 812.5–875 Hz, 875–937.5 Hz
and 937.5–1000 Hz. The signal’s relative energy of each frequency band is shown
in Fig. 7.12. It can be seen from the figure that the low-frequency band accounts for
a large amount of signal energy. Among them, the first reconstructed signal energy
accounts for the largest proportion, followed by the fourth to eighth frequency band,
and the second to third frequency band accounts for a small amount of signal energy.

Fig. 7.10 The relative energy distribution of lifting wavelet packet reconstructed signal in level 3

Fig. 7.11 Lifting wavelet packet reconstructed signal in level 4



Fig. 7.12 The relative energy distribution of lifting wavelet packet reconstructed signal in level 4

7.2.3 Operational Reliability Assessment and Health Maintenance

After the lifting wavelet packet is used to decompose and reconstruct the vibration signal of the turbo generator unit, the original signal is decomposed into several independent frequency band signals, and each frequency band has corresponding relative energy distribution characteristics. As a dimensionless index, information entropy can describe the average amount of relative energy information provided by each reconstructed signal and the average uncertainty of the information source, measure the irregularity and complexity of mechanical signals in real time, and evaluate the reliability of mechanical conditions.

7.2.3.1 Probability Space and Random Variable

As usual, a finite probability space is given by a non-empty finite set Ω and a probability function P: Ω → [0, 1] with Σ_{ω∈Ω} P(ω) = 1, taking it as understood that the σ-algebra is given by the power set of Ω. A random variable is a function X: Ω → χ, where the range χ is assumed to be finite. The distribution of X is denoted by P_X: χ → [0, 1], i.e., P_X(x) = P(X = x), where X = x is a shorthand for the event {ω ∈ Ω | X(ω) = x}. The standard interval notation is used, e.g., [0, 1] = {r ∈ R | 0 ≤ r ≤ 1} and [1, ∞) = {r ∈ R | r ≥ 1}.

7.2.3.2 Rényi Entropy

Rényi entropy unifies all the distinct entropy measures. For a parameter α ∈ [0, 1) ∪
(1, ∞) and a random variable X, the Rényi entropy of X is defined as:

H_α(X) = (1/(1 − α)) log Σ_x P_X(x)^α    (7.21)

where the sum is over all x ∈ supp(P_X).


It is well-known and not hard to verify that this definition of H_α is consistent with the respective definitions of H_0 and H_2, that lim_{α→1} H_α(X) = H(X), and that lim_{α→∞} H_α(X) = H_∞(X). Furthermore, it is known that the Rényi entropy is non-increasing in α, i.e., H_β(X) ≤ H_α(X) for 0 ≤ α ≤ β ≤ ∞.
For α ∈ [0, 1) ∪ (1, ∞), it will be convenient to re-write H_α(X) as H_α(X) = −log Ren_α(X) with:

Ren_α(X) = (Σ_x P_X(x)^α)^{1/(α−1)} = ||P_X||_α^{α/(α−1)}    (7.22)

where ||P_X||_α is the α-norm of P_X: χ → [0, 1] ⊂ R. We call Ren_α(X) the Rényi probability (of order α) of X. For completeness, we also define Ren_0(X) = |supp(P_X)|^{−1} and Ren_1(X) = 2^{−H(X)}, which is consistent with taking the limits.
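A direct numerical transcription of Eqs. (7.21) and (7.22) might look as follows; this is an illustrative sketch (the function names are assumptions), with the α → 1 and α → ∞ limits handled explicitly.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha of a discrete distribution p, base-2 logs
    (Eq. 7.21); the alpha -> 1 and alpha -> infinity limit cases are
    handled explicitly."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # sum over supp(P_X) only
    if alpha == 1:                        # Shannon entropy (limit case)
        return float(-np.sum(p * np.log2(p)))
    if alpha == np.inf:                   # min-entropy H_inf
        return float(-np.log2(p.max()))
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

def renyi_probability(p, alpha):
    """Rényi probability Ren_alpha(X) = 2**(-H_alpha(X)) of Eq. (7.22)."""
    return 2.0 ** (-renyi_entropy(p, alpha))
```

A quick check of the monotonicity property: for any non-uniform distribution, H_α evaluated at a smaller α is at least as large as at a larger α.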

7.2.3.3 Operational Reliability Degree

The relative energy distribution Ẽ_{l,i} of the lifting wavelet packet analysis signal is calculated according to Eq. (7.19). Within the framework of the Rényi entropy calculation, the
energy characteristics of mechanical operation state signals are mapped into a dimen-
sionless indicator in the interval of [0,1], and the operational reliability is defined as
follows:


R = 1 − (1/(1 − α)) log_{2^l} Σ_{i=1}^{2^l} (Ẽ_{l,i})^α    (7.23)

The operational reliability degree is calculated according to this equation from level l = 2 to level l = 4, respectively. From Table 7.1, it is seen that the current operational reliability degree of the turbo generator unit is low: all of the calculated reliability degrees from level l = 2 to level l = 4 are under 0.4. It is therefore inferred that the current health condition of the turbo generator unit is poor and that potential faults and dangerous parameters exist. These make the mechanical operational condition unstable and increase the uncertainty of the probability distribution of the vibration signals obtained from monitoring, which leads to the low value of mechanical operational reliability.
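Equation (7.23) can be transcribed directly; since the order α used in the numerical results is not stated in this passage, α = 2 below is an illustrative assumption. The essential point is that taking the logarithm to base 2^l confines the entropy, and hence R, to [0, 1].

```python
import numpy as np

def reliability_renyi(band_energies, alpha=2.0):
    """Operational reliability degree R of Eq. (7.23) from the relative
    band energies E_{l,i} of the 2**l reconstructed bands (alpha != 1).
    R = 1 when one band carries all energy, R = 0 for a uniform spread."""
    e = np.asarray(band_energies, dtype=float)
    e = e / e.sum()                  # enforce a probability distribution
    n = e.size                       # n = 2**l frequency bands
    e = e[e > 0]
    h = np.log(np.sum(e ** alpha)) / np.log(n) / (1.0 - alpha)
    return 1.0 - h
```

Intermediate band-energy distributions, like the measured ones, yield values strictly between these two extremes.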

Table 7.1 Operational reliability degree

Decomposition level   l = 2    l = 3    l = 4
Reliability           0.3363   0.2467   0.2812

As shown in the above analysis, the amplitude of the running frequency is the largest in the whole frequency range, and some harmonic components from two times to ten times the running frequency are also large in Fig. 7.6. The signals analyzed by the lifting wavelet packet and their energy distributions in level 2, level 3 and level 4 exhibit non-stationary, nonlinear and colored-noise characteristics.
Considering the start-up process with no load and loading operation conditions, the
vertical vibrations of the #3 and #5 bearings which are adjacent to the #4 bearing
are not high (under 20 μm). Different from the #3 and #5 bearings, the vibration
of the #4 bearing increases with increased speed and load. It is concluded that the
vibration is not caused by imbalance and misalignment for the reason that vibrations
would be out of limits in multiple bearing positions if an imbalance or misalignment
fault occurs. Therefore, the problem is focused on the #4 bearing itself. It is inferred
that the monitored non-stationary and nonlinear components in the vibration signal
of the #4 bearing may be caused by mechanical looseness and local friction, so the
bearing force and support status of the sizing block and bearing lodgement must be
checked.
Based on the above analysis, the turbo generator unit is stopped and overhauled. The preload of the #4 bearing bushing is about 0.11 mm, far below the required 0.25 mm. The gaps of the left and right sizing blocks are checked with a feeler gauge: the 0.05 mm feeler gauge can be inserted 30 mm into the left sizing block and 25 mm into the right one, and the gap at the bottom of the #4 bearing bushing is also far from the specified gap of 0.05 mm. Therefore, the gaps of the #4 bearing bushing are re-corrected, the preload is restored to the required 0.25 mm, and the unit is operated again. After maintenance, the vibration of the #4 bearing bushing
decreased obviously in the process of increasing speed with no load. At 3000 r/min,
the load is gradually increased to 45 MW, and the peak-to-peak vertical vibration of
#4 bearing bushing is basically stable in the range of 46–57 μm, which is much better
than the previous situation. In order to assess the operational reliability of the turbo
generator unit after maintenance, vibration monitoring via sensors is conducted at a
stable speed of 3000 r/min with a load of 6 MW, which is the same as the case before
maintenance. The waveform of the acquired vibration signal shown in Fig. 7.13
shows some differences compared with the vibration signal before maintenance in
Fig. 7.5, such as the symmetry between the top and bottom of the vibration signal
is much better than before and the peak-to-peak vibration is about 45 μm, which
falls in the permissible range. The FFT spectrum of the vibration signal is shown in
Fig. 7.14, which is different from the spectrum before maintenance shown in Fig. 7.6.
The amplitudes of the harmonic frequency components from two times the running
frequency to ten times the running frequency are decreased. The lifting wavelet
packet is adopted to analyze the acquired vibration signal on level 2, level 3 and level
4, respectively. Afterward, the relative energy of the corresponding frequency band
analyzed by the lifting wavelet packet is computed.
The four obtained signals analyzed by the lifting wavelet packet in level 2 are
illustrated in Fig. 7.15. The signals’ relative energy after maintenance is concentrated
in the first frequency band shown in Fig. 7.16 and the relative energy of the last three
frequency bands is very little, which is quite different from Fig. 7.8. It is shown that


Fig. 7.13 The waveform of vibration signal in time domain after maintenance

Fig. 7.14 The FFT spectrum after maintenance

the relative energy in the second frequency band of Fig. 7.8 before maintenance is
generated by the machinery fault information of the gaps in the #4 bearing bushing
and the whole energy is decentralized in frequency bands. It can be seen that faults
can affect the frequency band energy distribution of the signal.
Figure 7.17 shows the signal of the lifting wavelet packet decomposition and
reconstruction in level 3. Figure 7.18 shows the signal relative energy distribution of
the lifting wavelet packet decomposition and reconstruction in level 3. Different from
Fig. 7.10 which is before maintenance, the first frequency band has the largest energy,
while the relative energy of the next seven frequency bands is small. By comparing
the operational conditions of the turbo generator unit before maintenance as shown
in Fig. 7.10, it indicates that the larger energy distribution from the second frequency
band to the fourth frequency band in Fig. 7.10 is caused by the fault information of
#4 bearing bushing.
The lifting wavelet packet decomposition and reconstruction in level 4 is further
proceeded, as shown in Fig. 7.19. Figure 7.20 shows the relative energy distribution
of lifting wavelet packet reconstructed signal after maintenance. Different from the
energy distribution of lifting wavelet packet decomposition and reconstructed vibra-
tion signal before maintenance is shown in Fig. 7.12, the relative energy of other
frequency bands in Fig. 7.20 is very small except for the first frequency band.


Fig. 7.15 Lifting wavelet packet reconstructed signal in level 2 after maintenance

Fig. 7.16 The relative energy distribution of lifting wavelet packet reconstructed signal in level 2 after maintenance

It is inferred that the relative energy from the fourth frequency band to the ninth
frequency band in Fig. 7.12 before maintenance is caused by the machinery fault
information of the gaps in the #4 bearing bushing. It is concluded that machinery fault information can spoil the energy convergence of the lifting wavelet packet transform and thus induce dispersion of the wavelet energy distribution. It is thereby verified that the second-generation wavelet packet transform can process the vibration signals in different frequency bands to effectively reveal the machinery operating conditions.
To summarize, it is diagnosed that the excessive vibration is caused by looseness of the #4 bearing, poor support and insufficient tension force. The vibration grows with increasing speed and load and exhibits non-stationary, nonlinear characteristics with colored noise because of the friction induced by the looseness fault.
According to Eq. (7.23), after the maintenance of the turbo generator unit, the rela-
tive energy of vibration monitoring signals and lifting wavelet packet reconstructed


Fig. 7.17 Lifting wavelet packet reconstructed signal in level 3 after maintenance

Fig. 7.18 The relative energy distribution of lifting wavelet packet reconstructed signal in level 3
after maintenance

signals are used to evaluate its current operational reliability. The calculation results
are shown in Table 7.2. It can be seen that the operational reliability has been improved
after maintenance, which is basically above 0.8.


Fig. 7.19 Lifting wavelet packet reconstructed signal in level 4 after maintenance

Fig. 7.20 The relative energy distribution of lifting wavelet packet reconstructed signal in level 4
after maintenance

Table 7.2 The operational reliability degree after maintenance

Decomposition level              l = 2    l = 3    l = 4
Operational reliability degree   0.8627   0.8278   0.8060

7.2.4 Analysis and Discussion

7.2.4.1 Comparison and Analysis of Operational Reliability Degree Before and After the Maintenance of Turbo Generator Unit

When the turbo generator is degrading and in dangerous conditions, the stability of
the system is reduced, which causes the safe operational conditions to become more
and more uncertain. Entropy is a measure of “uncertainty”, and the most uncertain
probability distribution (such as equal probability distribution) has the largest entropy
value. Therefore, the magnitude of information entropy also reflects the uniformity of
probability distribution, which can measure the irregularity of mechanical monitoring
signals in real-time and reflect the reliability of mechanical operational conditions.
Before maintenance, due to the loosening fault of the #4 bearing bushing, the peak-to-
peak value of vibration monitoring exceeds the limit. The relative energy distribution
based on the lifting wavelet packet decomposition and reconstruction is loose, the
Rényi entropy value is large, and the calculation result of operational reliability is
small. From Table 7.1, it is shown that all of the calculated operational reliability degrees from level l = 2 to level l = 4 are under 0.4, the lowest of which is 0.2467 at level l = 3. This indicates that the reliability of the current turbo generator unit is very poor and that maintenance is needed. When the machine is stopped for maintenance, it is
found that the preloading force of the #4 bearing bushing is lower than the standard
value, causing mechanical loosening and local friction, which induced a disorder
phenomenon in the vibration signals acquired by the sensors.
After maintenance, all of the calculated operational reliability degrees from level l = 2 to level l = 4 are over 0.8. This shows that, through vibration monitoring, evaluation and diagnosis, followed by shutdown and maintenance, the unit's health condition and operational reliability indicators are improved; timely maintenance can thus improve operational safety and avoid accidents. In general, the operational reliability evaluation of complex electro-mechanical systems based on vibration condition
monitoring can be realized by monitoring the vibration condition of turbo generator
unit and obtaining the relative energy distribution of each frequency band through
lifting wavelet packet analysis of signals, and then calculating Rényi entropy and
mapping it to probability [0, 1] space. Through the example analysis of a steam
turbo generator unit in a thermal power plant, it is found that the operational relia-
bility evaluation method proposed by wavelet Rényi entropy can provide guidance
for the health monitoring and maintenance of the steam turbo generator unit and lay
the foundation for diagnosis.

7.2.4.2 Influence Analysis of the Number of Lifting Wavelet Packet Decomposition Layers on Operational Reliability Evaluation

It can be seen from Tables 7.1 and 7.2 that the degree of operational reliability is
related to the number of decomposition levels in the lifting wavelet packet analysis.
With each additional layer of decomposition, the number of frequency bands after
decomposition is doubled. As the number of frequency bands increases, the relatively
concentrated energy of each frequency band is allocated to the additional frequency
bands of the next layer, so that the relative energy distribution of the signal becomes
uncertain, the entropy increases and the reliability degree decreases. Therefore, the
operational reliability degree of the steam turbo generator unit after maintenance is
shown in Table 7.2: as the number of decomposition layers increases from l = 2 to l = 4, the operational reliability decreases monotonically. Before the maintenance, however, as shown in Table 7.1, the vibration monitoring signal is irregular and its uncertainty is higher due to mechanical instability and fault information; after the lifting wavelet packet decomposition and reconstruction, the relative energy distribution of the signal is therefore scattered, and the operational reliability degree does not follow a simple monotonically decreasing relation with the number of decomposition layers. Therefore, an appropriate number of layers l can be selected to measure the operational reliability of complex electro-mechanical systems in a specific analysis.
Since the number of layers l = 3 is intermediate between the number of layers l =
2 and l = 4, it is more appropriate to choose the number of layers l = 3 in practical
engineering applications.

7.3 Reliability Assessment and Health Maintenance of Compressor Gearbox in Steel Mill

Nowadays, reliability is a multi-disciplinary scientific discipline that aims at system safety. The fundamental issues of reliability engineering lie in the representation
and load modeling, quantification analysis, evolution and uncertainty assessment
of system models. To assess the operational reliability of the compressor gearbox
in a steel mill, the machinery vibration signals with time-varying operational char-
acteristics are first decomposed and reconstructed by means of a lifting wavelet
packet transform. The relative energy of every reconstructed signal is computed as
an energy percentage of the reconstructed signal in the whole signal energy. More-
over, a normalized lifting wavelet entropy is defined by the relative energy to reveal
the machinery’s operational uncertainty. Finally, the operational reliability degree is
defined by the quantitative value obtained by the normalized lifting wavelet entropy
belonging to the range of [0, 1].

7.3.1 Condition Monitoring and Vibration Signal Acquisition

Consider the oxy-generator compressor shown in Fig. 7.21. The rotating speed of the output shaft of the overdrive gearbox is 14,885 r/min. In service, the vibration of the gearbox increases and high-frequency noise appears. As shown in Fig. 7.21, the gearbox contains four sliding bearings, which are monitored by an acceleration transducer with a 20 kHz sampling frequency. It is found that the vibration of the #3 bearing is the highest among the four sliding bearings; meanwhile, the temperature of the #3 bearing bush, at more than 50 °C, is also the highest. The vibration signal acquired at the #3 bearing is shown in Fig. 7.22 and contains considerable noise. The frequency spectrum of the vibration signal is shown in Fig. 7.23. It can be seen that the frequency components are spread throughout the spectrum, containing rich vibration information in the higher frequency band.

Fig. 7.21 Schematic drawing of compressor gearbox

Fig. 7.22 Time domain waveform of vibration signal

Fig. 7.23 Frequency spectrum of vibration signal

7.3.2 Vibration Signal Analysis

In order to further analyze the vibration signal to obtain more condition information,
the lifting wavelet packet transform is adopted to decompose the original signal to the
extent of level 2, level 3 and level 4, respectively. The obtained four signals analyzed
in level 2 are illustrated in Fig. 7.24, which correspond to the frequency bands 0–
2500 Hz, 2500–5000 Hz, 5000–7500 Hz and 7500–10,000 Hz, respectively. Then
the relative energy of each signal is calculated separately according to Eq. (7.19),
and the distribution of the whole signal accounted for by each signal is shown in
Fig. 7.25, from which it is evident that the first band has the largest energy, and those
in the second and third are comparable.
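The band limits quoted above follow directly from the Nyquist range: a level-l packet splits [0, fs/2] into 2^l equal bands. A small helper (illustrative, not from the text) reproduces them:

```python
def packet_band_edges(fs, level):
    """Frequency band edges (low, high) in Hz of a level-`level` wavelet
    packet decomposition of data sampled at fs Hz: 2**level equal-width
    bands covering [0, fs/2]."""
    n_bands = 2 ** level
    width = (fs / 2.0) / n_bands
    return [(i * width, (i + 1) * width) for i in range(n_bands)]
```

For example, `packet_band_edges(20000, 2)` gives the four bands 0–2500 Hz through 7500–10,000 Hz used here, while `packet_band_edges(2000, 3)` gives the 125 Hz-wide bands of Sect. 7.2.2.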
Furthermore, the original signal is further processed to the extent of level 3 as
shown in Fig. 7.26, corresponding to the frequency bands 0–1250 Hz, 1250–2500 Hz,
2500–3750 Hz, 3750–5000 Hz, 5000–6250 Hz, 6250–7500 Hz, 7500–8750 Hz and
8750–10,000 Hz, respectively. The relative energy of each frequency band is shown
in Fig. 7.27. The energy of the first four bands is much larger than that of the last
four bands. It is worth noting that after three levels of decomposition and reconstruc-
tion, the signal is decomposed into eight bands, and the relative energy distribution
occupied by each reconstructed signal is uneven. The second band signal is the most
energetic, followed by the first band. The signal energy is mainly concentrated in

Fig. 7.24 Reconstructed signals in level 2

Fig. 7.25 Relative energy distribution of reconstructed signals in level 2

the low frequency, while the energy of the sixth band is generated by the machinery degradation and fault information.
The original signal is further processed to the extent of level 4 to obtain 16 signals, as shown in Fig. 7.28. The relative energy of each corresponding frequency band is
shown in Fig. 7.29. Among them, the largest energy lies in the third band, followed
by the second, fourth, sixth, eleventh, twelfth and first bands, while the remaining
bands have a smaller energy share. Through the above analysis, it can be seen that the

Fig. 7.26 Reconstructed signals in level 3



Fig. 7.27 Relative energy distribution of reconstructed signals in level 3

relative energy distribution of the current signals in each band is relatively disordered,
and mechanical anomalies are suspected.

7.3.3 Operational Reliability Assessment and Health Maintenance

After decomposing and reconstructing the vibration signal of the compressor gearbox
bearing using lifting wavelet packet transform, the original signal is decomposed into
several band-independent signals, each band having corresponding relative energy
distribution characteristics.
Shannon entropy is a measure of the uncertainty associated with a random variable; specifically, it quantifies the expected value of the information contained. The Shannon entropy of a random variable X is defined as in Eq. (7.24), where x_i indicates the i-th of the n possible values of X and P_i, defined in Eq. (7.25), denotes the probability that X = x_i:


H(X) = H(P_1, …, P_n) = −Σ_{i=1}^{n} P_i log_2 P_i    (7.24)

P_i = Pr(X = x_i)    (7.25)

Shannon entropy satisfies, among others, the following properties:
(1) Bounded: 0 ≤ H(X) ≤ log_2 n
(2) Symmetry: H(P_1, P_2, …, P_n) = H(P_2, P_1, …, P_n) = …

Fig. 7.28 Reconstructed signals in level 4

Fig. 7.29 Relative energy distribution of reconstructed signals in level 4



(3) Grouping: H(P_1, P_2, …, P_n) = H(P_1 + P_2, P_3, …, P_n) + (P_1 + P_2) H(P_1/(P_1 + P_2), P_2/(P_1 + P_2))
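Equation (7.24) and the listed properties are easy to check numerically; the sketch below is illustrative (the function name is an assumption).

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(P_1, ..., P_n) of Eq. (7.24), in bits; terms
    with P_i = 0 contribute nothing (0 * log 0 := 0 by convention)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

On small distributions one can verify, for instance, that the uniform distribution attains the log_2 n bound and that the grouping identity holds exactly.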
Within the framework of Shannon's entropy definition, the relative energy distribution of the signal calculated through the lifting wavelet packet transform (Eq. 7.19) is mapped into a dimensionless reliability index in the interval [0, 1]:

R = 1 − (−Σ_{i=1}^{2^l} Ẽ_{l,i} log_{2^l} Ẽ_{l,i})^2    (7.26)

Since mechanical performance degradation and faults can make machinery conditions uncertain, the probability distribution of the monitored condition information will become uncertain too. If the energy of the 2^l transformed frequency bands follows a uniform distribution, then Ẽ_{l,i} = 1/2^l, the Shannon entropy in the brackets equals 1, and the normalized lifting wavelet entropy reliability is equal to 0. On the contrary, if a single band concentrates the whole energy of the 2^l bands, its relative energy equals 1 (the most certain probability distribution) and the normalized lifting wavelet entropy reliability is equal to 1. On the basis of information entropy theory, the most uncertain probability distribution (such as the equal distribution) has the largest entropy, and the opposite also holds; information entropy is therefore a measure of uncertainty and provides a practical criterion for analyzing the similarity or dissimilarity between probability distributions as mechanical equipment passes through different operating states from functioning to failure. Since wavelets meet the demands of transient signal analysis and entropy is associated with measures of information uncertainty, it is useful to evaluate the operational reliability of mechanical equipment with wavelet entropy computed from condition monitoring information.
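The two limiting cases just described (uniform band energies giving R = 0, a single energy-carrying band giving R = 1) can be confirmed with a direct transcription of Eq. (7.26); the function name is illustrative.

```python
import numpy as np

def reliability_shannon(band_energies):
    """Normalized lifting wavelet entropy reliability R of Eq. (7.26).
    The log base 2**l (the number of bands) normalises the Shannon
    entropy in the brackets to [0, 1]."""
    e = np.asarray(band_energies, dtype=float)
    e = e / e.sum()                  # relative energies sum to 1
    n = e.size                       # n = 2**l frequency bands
    e = e[e > 0]
    h = -np.sum(e * np.log(e)) / np.log(n)   # Shannon entropy, base 2**l
    return 1.0 - h ** 2
```

Measured band-energy distributions, which are neither uniform nor fully concentrated, yield intermediate values of R such as those in Table 7.3.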
The degree of reliability is calculated according to Eq. (7.26) from level l = 2 to level l = 4, respectively. From Table 7.3, it is seen that the machinery performance has degraded during long-time operation and repair is needed, since all degrees from level l = 2 to level l = 4 are only around 0.5–0.6. It can be inferred that the current operating condition of the machinery is less certain, resulting in uncertainty in the probability distribution of the vibration signals obtained from monitoring, and consequently in low values of the reliability measure.
The oxy-generator compressor in the steel mill is stopped and overhauled, and many cracks and fragments are found on the #3 bearing bush of the gearbox. After the #3 bearing is replaced and the machine is maintained, the oxy-generator is started again; the vibration and the high-frequency noise are reduced. The waveform and frequency spectrum of the acquired vibration signal are shown in Figs. 7.30 and 7.31.

Table 7.3 Degree of reliability

Lifting wavelet packet level   l = 2    l = 3    l = 4
Degree of reliability          0.6120   0.4924   0.4854

Fig. 7.30 Time domain waveform after maintenance

Fig. 7.31 Frequency spectrum after maintenance

Figures 7.32 and 7.33 show the reconstructed signals and their relative energy in level 2 using the lifting wavelet packet transform, respectively. Different from the relative energy distribution acquired before maintenance in Fig. 7.25, the energy after maintenance is concentrated in the 1st frequency band, the 2nd band still occupies a certain amount of energy, and the relative energy of the 3rd band is very small, whereas before maintenance the 3rd band also occupied a large amount of energy. It is therefore speculated that the large relative energy of the 3rd band in Fig. 7.25 before maintenance may be fault information generated by the many cracks and fragments of the #3 bearing bush, which made the energy distribution of the four bands relatively dispersed.
Fig. 7.32 Reconstructed signals in level 2 (after maintenance): sub-signals a21-a24 (A/um vs. t/s)

Fig. 7.33 Relative energy distribution of reconstructed signals in level 2 (after maintenance)

Figures 7.34 and 7.35 show the reconstructed signals and their relative energy in
level 3 using the lifting wavelet packet transform, respectively. After maintenance,
the 2nd frequency band has the largest energy and proportion, the 1st, 3rd and 4th
bands carry little energy, and the relative energy of the remaining four bands is very
small. In contrast, before maintenance (Fig. 7.27) the 6th band also accounted for a
certain amount of energy, the relative energy proportions of the 2nd and 4th bands
were large, and the 2nd band's share was not as dominant as after maintenance. The
comparison shows that the 1st, 4th and 6th bands in Fig. 7.27 before maintenance
occupy large relative energy. It is therefore speculated that the fault was caused by
the many cracks and fragments in the #3 bearing bush, which made the abnormal
vibration cover a wide range of frequency bands. After maintenance, the machine runs
stably, and the relative energy distribution of the monitored signal becomes
concentrated again.
Figures 7.36 and 7.37 show the reconstructed signals and their relative energy in
level 4 using the lifting wavelet packet transform, respectively. The main energy is
concentrated in the 3rd frequency band, while the 2nd, 4th, 5th and 6th bands carry
small amounts. This distribution differs markedly from that before maintenance
(Fig. 7.29), where the 2nd band accounted for a proportion close to that of the 3rd
band, and the 1st, 4th, 5th, 6th, 9th, 11th and 12th bands all carried some relative
energy, leaving the distribution dispersed.
Fig. 7.34 Reconstructed signals in level 3 (after maintenance): sub-signals a31-a38 (A/um vs. t/s)

Fig. 7.35 Relative energy distribution of reconstructed signals in level 3 (after maintenance)

The difference between the relative energy distributions before and after maintenance
shows that mechanical fault information is reflected in the energy distribution of the
signals reconstructed by the lifting wavelet packet and makes that distribution more
dispersed. The relative energy distribution after maintenance is calculated from the
acquired signals to measure the operational reliability, as shown in Table 7.4; the
reliability after maintenance has clearly improved. Therefore, the lifting wavelet
packet transform can effectively reveal the health status of complex electro-mechanical
systems by decomposing signals and reconstructing them in different frequency bands.
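As a minimal illustration of this decompose-and-measure workflow, the sketch below builds a full wavelet-packet tree with a simple Haar-style predict/update lifting pair. This is a stand-in for the book's lifting wavelet; the function names and the choice of predictor are assumptions for illustration only.

```python
def lifting_step(signal):
    """One lifting step: split into even/odd samples (the 'lazy wavelet'),
    predict the odd samples from the even ones, then update the even samples.
    With this predict/update pair the step is an (unnormalized) Haar transform."""
    even, odd = signal[0::2], signal[1::2]
    detail = [o - e for o, e in zip(odd, even)]          # predict
    approx = [e + d / 2 for e, d in zip(even, detail)]   # update
    return approx, detail

def packet_relative_energies(signal, level):
    """Full wavelet-packet tree: decompose BOTH branches at every level,
    then return the relative energy of each of the 2**level sub-bands
    (natural order; true frequency ordering needs Gray-code reindexing)."""
    bands = [list(signal)]
    for _ in range(level):
        bands = [half for b in bands for half in lifting_step(b)]
    energies = [sum(x * x for x in b) for b in bands]
    total = sum(energies)
    return [e / total for e in energies]
```

For a constant (perfectly regular) signal, all energy ends up in the first, lowest band; a faulty, irregular signal spreads energy over many bands, which is exactly the dispersion effect described above.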
Fig. 7.36 Reconstructed signals in level 4 (after maintenance): sub-signals a41-a416 (A/um vs. t/s)

7.3.4 Analysis and Discussion

The above analysis shows that when mechanical equipment experiences performance
degradation or faults occur, its working stability is reduced and its operating state
becomes increasingly uncertain. The monitored signals then contain fault information,
which scatters the relative energy distribution of the signals reconstructed by lifting
wavelet packet decomposition across the frequency bands, thus reducing the degree of
operational reliability.

Fig. 7.37 Relative energy distribution of reconstructed signals in level 4 (after maintenance)

Table 7.4 Operational reliability after maintenance

  Lifting wavelet packet level   l = 2     l = 3     l = 4
  Degree of reliability          0.9444    0.8698    0.7968

By comparing the operational reliability before and after maintenance, it is found
that the latter is better than the former across different decomposition levels l. Specifically, it can be seen from Table 7.3 that the degree of operational reliability from
the 2nd level to the 4th level is about 0.5, which indicates that the current health
state is uncertain and needs maintenance. After maintenance, the operational relia-
bility of all layers in Table 7.4 exceeds 0.7, and the maximum reliability is 0.9444
in the 2nd level. It can be seen that timely repair and maintenance can improve
the operational reliability of electro-mechanical equipment and prevent accidents.
The operational reliability assessment based on condition monitoring can provide
a basis for condition-based maintenance and ensure the operational safety of the
electro-mechanical system.
From Tables 7.3 and 7.4, it is seen that the degree of reliability decreases
monotonically from level l = 2 to level l = 4. As the decomposition level l increases,
the number of frequency bands grows and the initially concentrated energy is scattered
over more bands; each band then carries some energy and the distribution becomes more
uncertain. Since a more uncertain probability distribution has higher entropy, the
normalized lifting wavelet entropy increases and the operational reliability decreases
with the number of frequency bands. Therefore, the normalized lifting wavelet entropy
should be computed at an appropriate level l in order to obtain a reasonable operational
reliability. In this case, level l = 3 is considered the most suitable level.
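Since Eqs. (7.19) and (7.26) are not reproduced in this excerpt, the mapping from relative band energies to a reliability degree can only be sketched; the version below assumes the common normalized-entropy form R = 1 - H/Hmax, which reproduces the qualitative behavior described above.

```python
import math

def reliability_degree(rel_energies):
    """Map a relative-energy distribution over wavelet-packet bands to [0, 1].
    Concentrated energy (low entropy, certain state)  -> R near 1;
    energy spread evenly over all bands (max entropy) -> R near 0."""
    probs = [p for p in rel_energies if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(rel_energies))  # entropy of the uniform distribution
    return 1.0 - entropy / max_entropy
```

Because Hmax = log(number of bands) changes with the decomposition level l while the once-concentrated energy spreads over more sub-bands, a reliability degree computed this way can drift downward from l = 2 to l = 4, mirroring the trend in Tables 7.3 and 7.4.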

7.4 Aero-Engine Rotor Assembly Reliability Assessment and Health Maintenance

The aero-engine, the heart of an aircraft, is thermal machinery that operates in
long-term service under the harsh environment of high temperature, high pressure and
high speed, and it plays an extremely important role in aircraft performance. Liu
Daxiang, a member of the Chinese Academy of Engineering, pointed out at the
National Aircraft Manufacturing Technology Forum in April 2008 that aero-engines are
characterized by the "3 highs" (high reliability, high performance, and high safety),
the "4 lows" (low fuel consumption, low pollution, low noise, and low cost), and the
"1 long" (long life). Aero-engine technology has always been a field that the world's
military powers prioritize for development, highly monopolize and tightly guard, and
it is one of the important symbols of a country's military equipment level, industrial
strength and comprehensive national strength. From 1963 to 1975, USAF fighter jets
suffered 3824 flight accidents, of which 1664 were caused by engines, accounting
for 43.5% of the total number of flight accidents; from 1989 to 1993, there were
279 major flight accidents in world air transport, of which more than 20% were
caused by engine failures. Therefore, the aero-engine is the focus of aviation flight
safety and maintenance support.
At present, many countries and airlines attach great importance to the development
of safety technologies related to aero-engines. Table 7.5 [33] summarizes typical
aero-engine faults together with their common diagnostic methods and monitored
parameters. Boeing B747, B767, Airbus A310 and other aircraft are equipped with
complete condition monitoring and fault diagnosis systems that track more than 15
engine parameters. The diagnosis system of the F100 engine records and monitors 38
engine and flight parameters, covering abnormal conditions such as over-rotation,
over-temperature, abnormal oil return pressure, engine stall, surge, main fuel pump
failure, afterburner failure and flameout. The effectiveness of the F100-PW-220 engine
condition monitoring system is 99.3%, with a condition monitoring error rate below 1%
per 1 million flight hours. The JT9D engine used on the Boeing B747 employs a status
monitoring system to build a trend analysis model and determine the source of engine
performance deterioration. The PW4000 engines on the Boeing B767, B777 and McDonnell
Douglas MD-11, and the V2500 on the A320 and MD-90, use integrated control and
monitoring systems with self-inspection, fault isolation and precise thrust adjustment
capabilities, improving flight reliability and maintainability [33]. Intelligent
testing technology is now the inevitable trend of aero-engine development. The USAF's
engine-integrated management system applies the Expert Maintenance System (XMAN), the
Jet Engine Troubleshooting Expert System (JET-X), and the Turbo Engine Expert
Maintenance System (TEXMAX). Artificial intelligence technology can help reduce the
burden on personnel working with complex systems in built-in test, automatic test
equipment and engine state monitoring. The diagnosis system based on the expert system can
reduce maintenance man-hours by 30%, the parts-failure replacement rate by 50%, and
maintenance tests by 50%. The artificial intelligence fault monitoring system of the
U.S. Army AH-64 helicopter detects, identifies and diagnoses faults through an
intelligent fault diagnosis and positioning device, locates the fault, and guides
maintenance. The entire system is controlled by an airborne computer.
Although aircraft and aero-engines are equipped with complete condition monitoring
and fault diagnosis systems, aero-engine failures and breakdowns have continued to
occur for many years across many types of aircraft. For example, in 1989, a B-1B
bomber crashed due to the fracture of the rear shaft sealing
labyrinth of the high-pressure turbine of the F101 engine.

Table 7.5 Typical fault diagnosis of aero-engine

  Fault                           Diagnosis                                Monitored parameters
  Blade damage                    Pneumatic parameter measurement;         Rotation rate; exhaust temperature;
                                  vibration and noise analysis; borescope  vibration; noise spectrum
  Fatigue crack                   Vibration; borescope; noise; ultrasonic  Amplitude; frequency; noise spectrum
  Damping table damage            Vibration; noise                         Blade spacing; vibration
  Mechanical erosion              Pneumatic parameter measurement          Thrust; fuel flow; temperature before
                                                                           turbine; rotation rate of high/low pressure
  Surge                           Pneumatic parameter measurement;         Fuel flow; air flow; temperature of turbine
                                  vibration                                inlet and outlet; rotation speed
  Control failure of vent valve   Pneumatic parameter measurement;         Temperature before turbine; air flow;
  or deflector                    noise                                    rotation rate
  Compressor sealing worn         Pneumatic parameter measurement          Air flow; rotation rate; compressor outlet
                                                                           temperature; turbine outlet temperature; thrust
  Blade icing                     Pneumatic parameter measurement;         Rotation rate; air flow; compressor outlet
                                  rotor flexibility                        temperature; pressure

In less than two months since July 1994, the main US fighter F-16 lost four aircraft consecutively (two from
the Egyptian Air Force and two from the Israeli Air Force). Four consecutive aviation
accidents caused by the same fault within such a short time is very rare in world
aviation history. The joint accident investigation by the US Air Force and the
aero-engine manufacturer GE showed that all 4 F-16s were lost because the sealing
labyrinth on the rear shaft of the high-pressure turbine of the F110 engine fractured,
and the fragments damaged the low-pressure turbine, eventually destroying the engine.
Among past crashes of aircraft equipped with F101 and F110 engines, 8 aircraft of
different models, including B-1B, F-14 and F-16, were all lost to fracture of the
sealing labyrinth [33]. In response, the US Air Force and GE took various measures,
such as adjusting the clearance and replacing the original snap-ring damping ring with
a damping bushing. To address the sealing labyrinth crack problem, since the end of
1994 the U.S. Air Force has grounded 150 F-16s, other countries' air forces 200 F-16s,
two of the five B-2 bombers, and some F-14Ds. A certain type of aero-engine in China
also suffers from the same sealing labyrinth safety problem.
Because the sealing labyrinth mainly bears the torque transmitted during rotation, it
resists that torque through the friction generated between the joint surfaces once the
bolted connection is pre-tightened. However, the tightening torque controlled by a
torque-measuring or constant-torque wrench is strongly affected by fluctuations of the
friction coefficient, so the bolt preloads distributed along the circumference of the
wheel disc may differ. Uneven pre-tightening forces preload the sealing labyrinth disc
in a particular direction, generating initial stress and an additional bending moment
on the rotor shaft system. They also leave the coupled shaft system non-concentric or
misaligned, putting the rotor in an unbalanced state, causing repeated bending and
internal stress in the rotor, and thus generating excessive vibration, accelerating
the wear of parts, and finally producing cracks. In addition, after repeated removal
and installation of the bolts, or after a period of service, the fit clearance between
the assembly holes and the bolts increases, reducing the fit accuracy; this likewise
leaves the coupled shafting off-center or misaligned and causes the rotor components
to become loose. In summary, rupture accidents of the sealing ring gear at home and
abroad remain a difficult problem that the academic and engineering communities are
keen to solve. Therefore, it is necessary to develop an effective evaluation method
for aero-engine rotor assembly reliability to ensure the reliability and safety of
aero-engine rotors.
Assembly is the last link of product manufacturing, and ensuring the reliability and
safety of mechanical equipment from the moment of assembly is one of the main issues
for industry and academia. According to statistics from the automobile assembly
industry, failures introduced during assembly account for 40-100% of the failures of
new products [33]. At present, lacking automatic methods and advanced technologies to
effectively assess the assembly reliability of aero-engine rotors, engineers can only
evaluate assembly performance indirectly through whole-machine test runs, and
sometimes the procedure of test-run failure, disassembly and reassembly must even be
repeated. Developing aero-engine rotor assembly reliability evaluation technology can
alleviate this problem, shorten assembly and maintenance time, reduce costs, and
ultimately ensure flight safety. Therefore, aero-engine rotor assembly reliability
assessment is an important research topic in the field of manufacturing reliability,
with great significance for the safety, economy and maintainability of aircraft.
Accordingly, based on the structural characteristics and assembly process of an
aero-engine rotor and the causes of loose bolt assembly, this section studies the
assessment of assembly reliability. Through excitation tests on the rotor, the lifting
wavelet packet is used to analyze the excitation response signals, extract the
relative energy distribution of the reconstructed signals, and map it to the
reliability interval [0, 1].

7.4.1 The Structural Characteristics of the Aero-Engine Rotor

The aero-engine considered here is a turbofan engine, mainly composed of a
low-pressure compressor, a high-pressure compressor, a high-pressure turbine and a
low-pressure turbine. The airflow enters the engine through the inlet and is
pressurized by the low-pressure compressor; the flow then divides into two streams,
one of which is discharged through the bypass duct, while the other is further
pressurized by the high-pressure compressor and then heated by combustion in the
combustion chamber. The hot gas expands through the high-pressure turbine to do work
and drive the front-end compressor, and expands further through the tail nozzle to
generate flight thrust. A set of stud bolts tightens the drum and the discs at all
stages, from the 2nd-stage blade disc through the 3rd stage to the high-pressure rotor
shaft, and the torque is transmitted by friction at the end faces, as shown in
Fig. 7.38.
Fig. 7.38 Schematic diagram of aero-engine rotor

Fig. 7.39 Assembly reliability assessment of aero-engine rotor

In order to simulate the influence of small fluctuations of the tightening torque on
the assembly reliability of the stay bolts during manual assembly, 3 kinds of assembly
experiments are set for the assembly process of the pull-rod bolts: assembly state 1
(torque M1), state 2 (torque M2), and state 3 (torque M3), where M1 < M2 < M3
and only the tightening torque of state 3 meets the requirements.
When the bolts are loosely assembled, the stiffness of the rotor structure is lower
and the system vibrates easily. Owing to damping during vibration transmission, the
dynamic response signal of the engine rotor attenuates quickly. As the bolt assembly
state changes from loose to tight, the rotor structural stiffness gradually increases,
the internal damping of the structure decreases, and the high-frequency components of
the dynamic response signal grow. Therefore, targeting the rotor assembly looseness
fault, exploiting the fact that different bolt preloads lead to different rotor
dynamic characteristics, and considering the structural characteristics and assembly
process of the aero-engine rotor, the research on assembly reliability assessment and
health maintenance comprises three steps, as shown in Fig. 7.39:
(1) Through an advanced data acquisition system, dynamic excitation tests are
carried out on the rotor under different assembly states;
(2) The lifting wavelet packet transform is used to analyze the excitation response
signals of the high-pressure compressor rotor under each assembly state, and the
relative energy features of the sub-signals reconstructed from the wavelet
packet decomposition are extracted;
(3) The relative energy distribution of the signal is mapped to the assembly
reliability interval within the framework of information entropy.
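Step (3) produces a single reliability degree, which can then be compared with reference degrees obtained from calibration assemblies. A minimal sketch follows (the function name is hypothetical; the reference values are those reported for the three assembly states in Table 7.8):

```python
# Reference reliability degrees of the three calibration assembly states (Table 7.8)
REFERENCE_STATES = {
    "state 1 (loose)": 0.5197,
    "state 2 (intermediate)": 0.8693,
    "state 3 (qualified)": 0.9486,
}

def nearest_assembly_state(r, references=REFERENCE_STATES):
    """Return the reference assembly state whose reliability degree
    is closest to the measured degree r."""
    return min(references, key=lambda state: abs(references[state] - r))
```

For the in-service rotor assessed in Sect. 7.4.4 (R = 0.7914), this lookup picks the intermediate state, matching the book's conclusion that the bolts have loosened below the qualified state.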

7.4.2 Aero-Engine Rotor Assembly Reliability Assessment Test System

In order to study the assembly performance of an aero-engine rotor, an assembly
performance testing system is built. It uses an advanced data acquisition system to
apply dynamic excitation tests to the rotor in different assembly states. The test
system consists of five parts: the aero-engine high-pressure compressor rotor, a
vibration exciter, a signal generator, sensors and a data acquisition system. A JZK-5
vibration exciter produced by Jiangsu Lianneng Electronics Co., Ltd. of China is used
to excite the high-pressure compressor rotor and test its dynamic response under
different assembly states. The specific performance parameters of the vibration
exciter are listed in Table 7.6. The exciter is installed at the lower end of the
high-pressure shaft of the high-pressure compressor rotor, as shown in Fig. 7.40.
In this case, the DF1631 power function signal generator is selected. Repeated
experiments showed that driving the exciter with a square-wave signal produces a
better excitation response, so the square wave generated by the DF1631 power function
signal generator is used as the excitation signal of the exciter. The working
frequency is 1 Hz, and the amplitude of the output signal is the full-scale value.

Table 7.6 Performance parameters of vibration exciter JZK-5

  Maximum excitation force (N)           >= 50
  Maximum amplitude (mm)                 +/- 7.5
  Maximum acceleration (g)               20
  Maximum input current (Arms)           <= 7
  Frequency range (Hz)                   DC-5k
  Force constant (N/A)                   7.2
  Overall dimension (mm)                 Φ138 × 160
  Mass (kg)                              8.1
  Mass of movable parts (kg)             0.25
  Output                                 Rod
  First-order resonance frequency (Hz)   50
  DC resistance of moving coil (Ω)       0.7

Fig. 7.40 Assembly reliability assessment test of an aero-engine rotor

Table 7.7 Performance parameters of accelerometer 333B32-ICP

  Range (pk)                         +/- 50 g
  Sensitivity (mV/g)                 100
  Resolution (rms)                   0.00015 g
  Frequency range (Hz)               0.5-3000
  Operating temperature range (°C)   −18 to +66
  Mass (g)                           4
A 333B32 ICP acceleration sensor made by the US company PCB is used to measure the
excitation response of the aero-engine rotor under the exciter's excitation. The main
performance parameters of the sensor are shown in Table 7.7; its accuracy class
ensures that the test results contain 4 significant digits. Since the purpose of the
excitation test is to study how bolt tightness affects the dynamic response and rotor
assembly reliability, the sensors are installed on plane A-A close to the bolt
installation, as shown in Fig. 7.40, with 4 sensors arranged evenly along the
circumference. A Sony EX system is used to collect and store the excitation response
of the aero-engine high-pressure compressor rotor.

7.4.3 Experiment and Analysis

The vibration exciter is used to measure the vibration response of the aero-engine
rotor in the three assembly states. Taking sensor I in Fig. 7.40 as an example, the
measured time domain waveforms and spectra under the three states are shown in
Fig. 7.41; the sampling frequency is 6400 Hz. The time domain waveforms of the
excitation responses in all three states are oscillating, attenuating signals. The
amplitude is smallest in assembly state 1; with increasing preload, the amplitudes in
states 2 and 3 are slightly larger. In the spectra, the largest spectral peak lies
near 2000 Hz. As the bolt pre-tightening force increases, the stiffness of the rotor
structure rises and the high-frequency content of the spectrum grows; hence a second
spectral peak appears near 2600 Hz in the spectrum of state 3 (i.e., the qualified
state).
Next, the lifting wavelet packet transform is applied to the excitation response
signals measured under the three assembly states, yielding excitation response
sub-signals a31, a32, ..., a38 in eight frequency bands, as shown in Fig. 7.42. The
excitation response signal of each state is thus decomposed into different frequency
bands. Since the sampling frequency is 6400 Hz, the sub-signals a31-a38 correspond to
the bands 0-400 Hz, 400-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz,
2400-2800 Hz and 2800-3200 Hz, respectively.
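These band boundaries follow directly from the sampling frequency: at level l, the Nyquist range fs/2 is split into 2^l equal sub-bands. A quick check (assuming an ideal, alias-free band split):

```python
def band_edges(fs, level):
    """Frequency interval (Hz) covered by each of the 2**level
    wavelet-packet sub-bands, for sampling frequency fs."""
    width = (fs / 2) / (2 ** level)  # each sub-band spans an equal slice of 0..fs/2
    return [(k * width, (k + 1) * width) for k in range(2 ** level)]
```

band_edges(6400, 3) returns eight 400 Hz-wide bands; the fifth entry, 1600-2000 Hz, is the band of sub-signal a35 discussed below.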

Fig. 7.41 Time domain waveform and frequency spectrum of aero-engine rotor under 3 assembly
states (from Sensor I)

Fig. 7.42 Reconstructed signals from lifting wavelet packet transform of an aero-engine rotor under
3 assembly states (from Sensor I)

Because the lifting wavelet packet decomposes the excitation response signal into
independent, orthogonal frequency bands, the sub-signals of the individual bands carry
different assembly information, and the relative energy of each band therefore
reflects how the dynamic response changes with the bolt preload.
According to Eq. (7.19) described in Sect. 7.1.2, the relative energies of the
sub-signals a31-a38 reconstructed by the lifting wavelet packet transform are
calculated for the three assembly states. The distributions in Fig. 7.43 show that,
in all three states, the sub-signal a35 (corresponding to the 1600-2000 Hz band)
carries the largest energy among the eight sub-signals a31-a38; it contains the main
signal components of the excitation response of the stay bolts. From assembly state 1
to 3, the energy amplitude of a35 is positively correlated with the bolt preload. The
sub-signal a36 (corresponding to 2000-2400 Hz) also carries a considerable amount of
energy, which gradually declines as the bolt preload increases. Note that as the bolt
preload and hence the structural stiffness grow from assembly state 1 to 3, the energy
of a35, which contains the main components of the rotor excitation response (such as
the rotor's natural frequency), becomes more and more concentrated, while the energy
of bands containing non-principal components (such as a36) is reduced. Therefore, in
assembly state 3 (i.e., the qualified assembly state), the energy of a35 of the
aero-engine high-pressure compressor rotor structure is significantly greater than in
the other two assembly states, and the energy is highly concentrated.
For the three assembly states, the rotor assembly reliability can be obtained from the
above relative energy distributions by Eq. (7.26); the results are shown in Table 7.8.
As the bolts go from loose to tight across the three states, the assembly reliability
of the aero-engine rotor increases monotonically, which is consistent with the
physical law that a tighter bolt preload gradually increases the stiffness of the
aero-engine rotor.

Fig. 7.43 Relative energy distribution of reconstructed signals via lifting wavelet packet transform
under 3 assembly states (from Sensor I)

Table 7.8 Assembly reliability of aero-engine rotor

  Assembly state                   1         2         3
  Degree of assembly reliability   0.5197    0.8693    0.9486

In assembly state 1, the rotor stiffness is the smallest. In addition to the main signal components such as the rotor's natural
frequency, its excitation response signal contains more of other dynamic response
information, such as that caused by assembly looseness. The closer the various
frequency components are to an equal-probability distribution, the greater the
information entropy and the smaller the assembly reliability. In contrast, in assembly
state 3 (the qualified assembly state) the rotor stiffness is the largest, the
excitation response signal is dominated by components such as the rotor's natural
frequency, and the energy is concentrated. The energy of other dynamic response
information is small, so the probability distribution of the signal is relatively
determined, and thus the assembly reliability of the aero-engine rotor in state 3 is
the largest. For the intermediate case, assembly state 2, whose bolt tightness lies
between states 1 and 3, the assembly reliability likewise lies between those of
states 1 and 3.

7.4.4 In-Service Aero-Engine Rotor Assembly Reliability Assessment and Health Maintenance

In China, the service life of in-service aero-engines is mostly controlled by working
hours and calendar life; when either reaches its predefined value, the engine is
returned to the factory for overhaul. If the vibration of an aero-engine exceeds the
standard level, the vibration source must be traced and health maintenance conducted.
First, an excitation test was carried out on the high-pressure compressor rotor of
the aero-engine, with a sampling frequency of 6400 Hz. The time domain waveform of the
excitation response measured by sensor I (see Fig. 7.40) is shown in Fig. 7.44a; the
signal decays rapidly. The frequency spectrum of the excitation response, shown in
Fig. 7.44b, contains many components, of which the largest lies near 2200 Hz and three
other large components near 1000, 1400 and 2600 Hz.
After three levels of decomposition and reconstruction using the lifting wavelet
packet transform, the excitation response sub-signals a31-a38 of eight frequency bands
are obtained, as shown in Fig. 7.44c. The sub-signals of the individual bands carry
different information. The relative energies of the eight sub-signals a31-a38 are
computed according to Eq. (7.19); their distribution is shown in Fig. 7.44d. The 5th
frequency band still has the largest energy, but the 3rd band also carries substantial
energy. Comparison with Fig. 7.43 shows that the relative energy distribution of the
current signal is scattered, with the scattered energy appearing mainly in the 3rd and
8th bands. These energies may be response information generated due
464 7 Complex Electro-Mechanical System Operational Reliability …

Fig. 7.44 Vibration response signals of the aero-engine rotor in service (from Sensor I)
References 465

Fig. 7.45 Dye penetration


of balanced holes of
labyrinth #32

to the looseness of the rotor. Preliminary diagnosis suggests that the current rotor
may have a looseness fault. The operational reliability index R = 0.7914 is shown
in Eq. (7.19), which is between the assembly states 1 and 2 as shown in Table 7.8,
and much closer to state 2. It can be seen that the assembly tightness of the pull rod
bolts has deteriorated in service, and the bolts have loosened, which has not met the
requirements for the optimal use of the engine. Loose pull rod bolts will result in
more vibration and may exceed the standard level, thus causing fatigue cracks and
fracture accidents of the wheel disc.
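The band relative-energy feature used above can be sketched in a few lines. The snippet below is a minimal illustration that uses a plain Haar filter bank rather than the chapter's lifting wavelet packet, and it assumes Eq. (7.19) is the usual energy ratio E_i / Σ_j E_j (an assumption, since the equation itself is not reproduced in this excerpt); the toy signal is likewise hypothetical.

```python
import numpy as np

def haar_split(x):
    """One analysis step: orthonormal Haar lowpass/highpass, half length each."""
    x = x[: len(x) // 2 * 2]                   # truncate to even length
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation band
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail band
    return low, high

def wavelet_packet_bands(x, levels=3):
    """Full wavelet-packet tree: split every band again at every level.

    Returns 2**levels sub-band signals in natural (tree) order; note that
    natural order is not strictly increasing in frequency."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        bands = [half for b in bands for half in haar_split(b)]
    return bands

def relative_energies(bands):
    """p_i = E_i / sum_j E_j, the relative energy of each sub-band."""
    e = np.array([np.sum(b * b) for b in bands])
    return e / e.sum()

# Toy excitation response: a decaying tone plus a little noise.
rng = np.random.default_rng(0)
n = np.arange(1024)
x = np.exp(-n / 400.0) * np.sin(2 * np.pi * 0.22 * n) \
    + 0.05 * rng.standard_normal(1024)
p = relative_energies(wavelet_packet_bands(x, levels=3))
print(np.round(p, 3))  # eight ratios that sum to 1
```

Because the Haar transform is orthonormal, the total sub-band energy equals the signal energy, so the ratios form a proper distribution; the lifting scheme of Refs. [4, 5] reaches an equivalent decomposition through in-place predict/update steps.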
After the engine rotor was returned to the factory for maintenance, seven balance holes in the 9th labyrinth disc of the rotor were found to have cracks of varying severity; Fig. 7.45 shows the cracks revealed by dye-penetrant inspection of the balance holes.
The assembly-performance prediction for the high-pressure compressor rotor of the aero-engine and the factory's maintenance findings jointly show that the engine in field service, affected by maneuver tasks, flight loads, airframe vibration, and other factors, undergoes gradual degradation of rotor performance, in particular loosening of the pull-rod bolts. The increased vibration produced cracks in the balance holes of the 9th labyrinth disc and thereby shortened the service life of the engine. The dynamic assembly information of the aero-engine rotor is obtained from excitation-vibration tests; the lifting wavelet packet method analyzes the rotor vibration response signal and extracts the relative-energy features of the reconstructed sub-signals. Mapping the relative-energy distribution into [0, 1] from the perspective of information entropy effectively evaluates the reliability and health status of the aero-engine rotor and provides a new tool for life prediction and health management.
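The entropy-based mapping into [0, 1] can be sketched as follows. Since the chapter's exact index (Eq. (7.19)) is not reproduced in this excerpt, the form R = 1 − H(p)/ln n, with H the Shannon entropy of the relative-energy distribution, is an assumption chosen purely for illustration: energy concentrated in one band gives R near 1, while a scattered distribution drives R toward 0.

```python
import numpy as np

def reliability_index(p):
    """Map a relative-energy distribution p (ratios summing to 1) into [0, 1].

    Illustrative form only: R = 1 - H(p) / ln(n), H = Shannon entropy.
      H = 0      (all energy in one band)  -> R = 1
      H = ln(n)  (uniform distribution)    -> R = 0
    """
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                    # treat 0 * log 0 as 0
    h = -np.sum(nz * np.log(nz))     # Shannon entropy in nats
    return 1.0 - h / np.log(p.size)

print(reliability_index([1, 0, 0, 0, 0, 0, 0, 0]))  # 1.0 (fully concentrated)
print(reliability_index(np.full(8, 1.0 / 8)))       # ~0.0 (fully scattered)
```

A measured distribution that is only partly scattered lands in between, analogous to the R = 0.7914 reported for the in-service rotor; the orientation of the scale (whether large R means tight or loose assembly) must of course follow the convention of Table 7.8.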

References

1. Bazovsky, I.: Reliability Theory and Practice. Prentice-Hall, Englewood Cliffs (1961)
2. Dupow, H., Blount, G.: A review of reliability prediction. Aircr. Eng. Aerosp. Technol. 69(4),
356–362 (1997)
3. Denson, W.: The history of reliability prediction. IEEE Trans. Reliab. 47(3), 321–328 (1998)

4. Sweldens, W.: The lifting scheme: a construction of second generation wavelets. SIAM J. Math.
Anal. 29(2), 511–546 (1998)
5. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into lifting steps. J. Fourier Anal.
Appl. 4(3), 247–269 (1998)
6. He, Z.J., Zi, Y.Y., Zhang, X.N.: Modern Signal Processing Technology and Its Application (in
Chinese). Xi’an Jiaotong University Press, Xi’an (2006)
7. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(4), 623–656
(1948)
8. Shannon, C.E.: Communication theory of secrecy systems. Bell Syst. Tech. J. 28(4), 656–715
(1949)
9. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423
(1948)
10. Chen, P.C., Chen, C.W., Chiang, W.L., et al.: GA-based decoupled adaptive FSMC for nonlinear
systems by a singular perturbation scheme. Neural Comput. Appl. 20(4), 517–526 (2011)
11. Matteson, S.: Methods for multi-criteria sustainability and reliability assessments of power
systems. Energy 71, 130–136 (2014)
12. Lo Prete, C., Hobbs, B.F., Norman, C.S., et al.: Sustainability and reliability assessment of
microgrids in a regional electricity market. Energy 41(1), 192–202 (2012)
13. Moharil, R.M., Kulkarni, P.S.: Generator system reliability analysis including wind generators using hourly mean wind speed. Electr. Power Compon. Syst. 36(1), 1–16 (2008)
14. Whyatt, P., Horrocks, P., Mills, L.: Steam generator reliability—implications for APWR codes and standards. Nucl. Energy—J. Br. Nucl. Energy Soc. 34(4), 217–228 (1995)
15. Tsvetkov, V.A.: A mathematical model for analysis of generator reliability, including development of defects. Electr. Technol. 4, 107–112 (1992)
16. Sun, Y., Wang, P., Cheng, L., et al.: Operational reliability assessment of power systems
considering condition-dependent failure rate. IET Gener. Transm. Distrib. 4(1), 60–72 (2010)
17. Baraldi, P., Di Maio, F., Pappaglione, L., et al.: Condition monitoring of electrical power
plant components during operational transients. Proc. Inst. Mech. Eng. Part O—J. Risk Reliab.
226(O6), 568–583 (2012)
18. Lu, F., Huang, J.Q., Xing, Y.D.: Fault diagnostics for turbo-shaft engine sensors based on a
simplified on-board model. Sensors 12(8), 11061–11076 (2012)
19. Li, Z.J., Liu, Y., Liu, F.X., et al.: Hybrid reliability model of hydraulic turbine-generator unit
based on nonlinear vibration. Proc. Inst. Mech. Eng. Part C—J. Mech. Eng. Sci. 228(11),
1880–1887 (2014)
20. Qu, J.X., Zhang, Z.S., Wen, J.P., et al.: State recognition of the viscoelastic sandwich structure
based on the adaptive redundant lifting wavelet packet transform, permutation entropy and the
wavelet support vector machine. Smart Mater. Struct. 23(8) (2014)
21. Si, Y., Zhang, Z.S., Liu, Q., et al.: Detecting the bonding state of explosive welding structures
based on EEMD and sensitive IMF time entropy. Smart Mater. Struct. 23(7) (2014)
22. Yu, B., Liu, D.D., Zhang, T.H.: Fault diagnosis for micro-gas turbine engine sensors via wavelet
entropy. Sensors 11(10), 9928–9941 (2011)
23. Sawalhi, N., Randall, R.B., Endo, H.: The enhancement of fault detection and diagnosis
in rolling element bearings using minimum entropy deconvolution combined with spectral
kurtosis. Mech. Syst. Signal Process. 21(6), 2616–2633 (2007)
24. Tafreshi, R., Sassani, F., Ahmadi, H., et al.: An approach for the construction of entropy measure and energy map in machine fault diagnosis. J. Vib. Acoust.—Trans. ASME 131(2) (2009)
25. He, Y.Y., Huang, J., Zhang, B.: Approximate entropy as a nonlinear feature parameter for fault
diagnosis in rotating machinery. Measurement Sci. Technol. 23(4) (2012)
26. Wu, S.D., Wu, P.H., Wu, C.W., et al.: Bearing fault diagnosis based on multiscale permutation
entropy and support vector machine. Entropy 14(8), 1343–1356 (2012)
27. Rényi, A.: On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Statist.
Prob. 1, 547–561 (1961)
28. Fehr, S., Berens, S.: On the conditional Rényi entropy. IEEE Trans. Inf. Theor. 60(11), 6801–
6810 (2014)

29. Nanda, A.K., Sankaran, P.G., Sunoj, S.M.: Rényi’s residual entropy: a quantile approach. Statist.
Probab. Lett. 85, 114–121 (2014)
30. Endo, T., Omura, K., Kudo, M.: Analysis of relationship between Rényi entropy and marginal
Bayes error and its application to weighted Naive Bayes classifiers. Int. J. Pattern Recogn.
Artif. Intell. 28(7) (2014)
31. Nagy, A., Romera, E.: Relative Rényi entropy for atoms. Int. J. Quantum Chem. 109(11),
2490–2494 (2009)
32. Lake, D.E.: Rényi entropy measures of heart rate Gaussianity. IEEE Trans. Biomed. Eng. 53(1),
21–27 (2006)
33. Zhang, B.C.: Test and Measurement Technology for Aero Engines (in Chinese). Beihang
University Press, Beijing (2005)
