Physics-Informed HPs Selection
ARTICLE INFO
Keywords: Gearbox; Fault detection; Long short-term memory; Physics-informed hyperparameters selection

ABSTRACT
A situation often encountered in the condition monitoring (CM) and health management of gearboxes is that a large volume of CM data (e.g., vibration signals) collected from a healthy state is available, but CM data from a faulty state are unavailable. Fault detection in such a situation is usually tackled by modeling the baseline CM data and then detecting the fault by examining any deviation of newly monitored data from the baseline model. Given that the CM data are mostly time series, the long short-term memory (LSTM) neural network can be employed for baseline CM data modeling. The LSTM is free from the choice of the number of lagged input time series values and can also store both long-term and short-term time series dependency information. However, we found that an LSTM with its hyperparameters selected by minimizing the validation mean squared error (VAMSE) does not differentiate the faulty and healthy states well, so there is still room to improve detectability. In this paper, we propose a physics-informed hyperparameters selection strategy for the LSTM identification and subsequently the fault detection of gearboxes. The key idea of the proposed strategy is to select hyperparameters by maximizing the discrepancy between healthy and physics-informed faulty states, as opposed to minimizing VAMSE. Case studies have been conducted to detect gear tooth crack and tooth wear using laboratory test rigs. The results show that the proposed physics-informed hyperparameters selection strategy returns an LSTM that detects these faults better than the LSTM returned from minimizing VAMSE.
1. Introduction
In the condition monitoring (CM) and health management of gearboxes, engineering practitioners often find it easy to collect a large volume of baseline CM data (e.g., vibration signals) from a healthy state, but hard to collect CM data from a faulty state. An example of such a situation is when a fleet of newly designed wind turbines has just been commissioned. The baseline CM data accumulate quickly, but the wind turbines have not yet had time to experience any fault, and hence no CM data from a faulty state are available. In such a situation, engineering practitioners still wish to detect any early fault once it emerges.
The abovementioned fault detection problem is usually tackled by modeling the baseline CM data and then detecting the fault by examining any deviation of newly monitored data from the baseline model [1]. The deviation is often measured by a score that may or
* Corresponding author.
E-mail address: ming.zuo@ualberta.ca (M.J. Zuo).
https://doi.org/10.1016/j.ymssp.2022.108907
Received 13 June 2021; Received in revised form 28 December 2021; Accepted 24 January 2022
Available online 7 February 2022
0888-3270/© 2022 Elsevier Ltd. All rights reserved.
Y. Chen et al. Mechanical Systems and Signal Processing 171 (2022) 108907
Nomenclature
AR autoregression
AUC area under the ROC curve
BIC Bayesian information criterion
CII crack induced impulse
CM condition monitoring
HNRES harmonic to noise ratio in envelope spectrum
LSTM long short-term memory
ROC receiver operating characteristic
RMS root mean square
SNR signal to noise ratio
VAMSE validation mean squared error
may not be probabilistic [1]. When the score exceeds a predefined decision threshold, the gearbox is deemed to be faulty [1].
Given that CM data are mostly time series, time series models have been widely employed for modeling the underlying baseline data generation process. Classical linear time series models include autoregression (AR), moving average, autoregressive moving average, and autoregressive integrated moving average models [2,3]. However, the underlying data generation process of gearboxes may be nonlinear. Therefore, various nonlinear time series models have been reported, including the bilinear model, interaction product terms-based AR models [4,5], generalized functional-coefficient AR [6,7], the varying index coefficient AR model [8], and AR neural networks [9,10]. The performance of these models depends critically on the choice of the number of lagged input time series values, and they also have difficulty storing long-term dependency information.
Recurrent neural networks [11] have been widely employed for nonlinear time series modeling. Typical recurrent neural networks include the long short-term memory (LSTM), LSTM with coupled input and forget gates, LSTM with peephole connections, the gated recurrent unit [12], and mutated units #1 to #3 [13]. Recurrent neural networks are free from the choice of the number of lagged input time series values. In particular, LSTM can also store both long-term and short-term time series dependency information by incorporating a proper gating mechanism. These recurrent neural networks have proven successful in many applications such as natural language processing [12], remaining useful life prediction [14,15], traffic flow prediction [16], and financial time series prediction [17]. In the context of fault detection, Wang et al. [18] used a single-layer LSTM to model the time synchronous averaged vibration signal from a fixed-axis gearbox, and used the LSTM prediction error to detect a gear bore crack. Liu et al. [19] proposed a nonlinear predictive gated recurrent unit-based denoising autoencoder for bearing fault diagnosis, and compared its performance with the autoencoder, stacked autoencoder, denoising autoencoder, and stacked denoising autoencoder. For acoustic anomaly detection, Marchi et al. [20] comprehensively compared the LSTM, bidirectional LSTM, and multilayer perceptron when used as a basic autoencoder, compressed autoencoder, or denoising autoencoder, each configured either as nonlinear predictive or not.
The performance of LSTM is critically determined by hyperparameters selection. The hyperparameters in this context mainly include architecture parameters such as the number of hidden states and hidden layers, and training parameters such as the l-2 regulator, batch size, maximum epoch, and learning rate. In the past decade, a large volume of studies has been devoted to facilitating the optimization search process. Reported optimization search strategies include grid search, random search [21], evolutionary (metaheuristic) algorithms [22], Bayesian optimization [23], differentiable architecture search [24], architecture search with reinforcement learning [25], etc. These optimization methods [18–25] select hyperparameters by minimizing the validation mean squared error (VAMSE), which is an estimator of the generalization error. However, we found that an LSTM with its hyperparameters selected by minimizing VAMSE may not differentiate the faulty and healthy states well. Sacrificing a little VAMSE can often return an LSTM that better differentiates the faulty and healthy states. Therefore, there is still room to improve detectability.
The focus of this paper is to investigate how physics knowledge can be integrated into the hyperparameters selection process so that the resulting LSTM can better differentiate healthy and faulty states. The motivation of this study is the ever-increasing amount of research on physics-informed machine learning (also referred to as physics-guided machine learning) [26,27]. Physics-informed machine learning learns a model from a hybrid source of information that consists of data and physics knowledge [26]. Given CM data from both healthy and faulty states, several pioneering studies have been reported in the machine health management area to address fault diagnosis or severity assessment tasks. Sadoughi and Hu [28] designed a convolutional layer based on knowledge of the fault-induced impulses and then integrated this layer into a convolutional neural network for the classification of rolling element bearing faults. Wang et al. [29] presented a physics-informed neural network for tool wear prediction. A cross physics-data fusion scheme was proposed to fuse the information from physics-based and data-driven models. In addition, wear prediction consistency was introduced as an additive term in the loss function.
Physics knowledge of the fault-induced signatures of gearboxes is well established. For example, localized faults induce impulses in the vibration signals [30], and distributed faults raise the sidebands around the gear meshing frequency and its harmonics [31]. If such physics knowledge can be integrated into the hyperparameters selection process when no CM data from a faulty gearbox is available, the resulting LSTM is expected to better differentiate the faulty and healthy states.
In this paper, we propose a physics-informed hyperparameters selection strategy for LSTM and subsequently the fault detection of
gearboxes. The key idea of the proposed strategy is to select hyperparameters based on maximizing the discrepancy between healthy
and physics-informed faulty states, as opposed to conventionally minimizing VAMSE. Specifically, physics knowledge is used to simulate fault-induced signatures, which are then added to the validation signals to form the physics-informed faulty data. The resulting LSTM is hypothesized to have better fault detection capability than the LSTM returned from the conventional VAMSE-based strategy. Without loss of generality, we consider the detection of gear tooth crack and tooth wear using the LSTM-based method in this paper. Other types of faults are left for future studies.
The rest of this paper is organized as follows: Section 2 details the fundamentals of the LSTM and its conventional hyperparameters selection strategy; Section 3 presents the proposed physics-informed hyperparameters selection strategy for the LSTM identification; Section 4 presents two gear tooth crack detection case studies and one gear tooth wear detection case study, using laboratory datasets; discussions and concluding remarks are given in Section 5 and Section 6, respectively.
2. Fundamentals
The fundamentals of the LSTM and its conventional hyperparameters selection strategy are introduced in this section.
LSTM is a special type of recurrent neural network. Compared with conventional recurrent neural networks, LSTM introduces an internal cell state to tackle the vanishing or exploding gradient problem [32]. Details of an LSTM block can be found in Ref. [32]. When applying LSTM to time series modeling/prediction, we treat the data point x_t at time t as the input. A fully connected layer is added to aggregate the hidden state h_t of an LSTM block and output the predicted data point x̂_{t+1} at time t + 1. LSTM layers may be stacked to form a deep structure for time series modeling, as shown in Fig. 1. In this case, the hidden state output by the lower LSTM layer is taken as the input of the higher LSTM layer. The fully connected layer is added only to the topmost LSTM layer.
Let N_l denote the number of LSTM layers in a deep LSTM, and let W^l, R^l, and b^l denote the weight matrices, recurrent weight matrices, and bias matrices of the lth layer. The loss function for training a deep LSTM includes the weight matrices of each layer in the l-2 regularization term:
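$$
\text{loss} = \sum_{t=2}^{N} \left(\hat{x}_t - x_t\right)^2 + \lambda \sum_{l=1}^{N_l} \left( \left\| W^{l} \right\| + \left\| R^{l} \right\| + \left\| b^{l} \right\| \right) \tag{1}
$$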
where N is the length of the data sequence, λ is the l-2 regulator, and ‖ ⋅ ‖ denotes the l-2 norm. Training a deep LSTM neural network then amounts to solving
$$
\left( W^{1}, R^{1}, b^{1}, \cdots, W^{N_l}, R^{N_l}, b^{N_l} \right) = \underset{W^{1}, R^{1}, b^{1}, \cdots, W^{N_l}, R^{N_l}, b^{N_l}}{\arg\min}\; \text{loss} \tag{2}
$$
The expression of the gradient with respect to W^l, R^l, and b^l for solving the above optimization problem can be obtained by the truncated backpropagation through time method [32]. With the gradient expression, Adam [33], an effective stochastic gradient method, can be employed to solve the above optimization problem.
As mentioned in the Introduction, proper selection of hyperparameters is critical for the success of an LSTM neural network. The architecture hyperparameters involved include the number of layers Nl and the number of hidden states in each layer, whereas the training hyperparameters include the l-2 regulator λ, mini-batch size, maximum number of epochs, learning rate, and Adam parameters (β1, β2, and ε, usually set to 0.9, 0.999, and 1 × 10^-7, respectively). To reduce the computational cost of optimizing the architecture hyperparameters, researchers often adopt a halving strategy, in which the number of hidden states is successively halved from the lower to the higher layers [34]. For instance, if the number of hidden layers is Nl = 3 and the number of hidden states of the first LSTM layer is Nh1 = 100, then the number of hidden states is Nh2 = 50 for the second LSTM layer and Nh3 = 25 for the third layer.
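The halving strategy can be illustrated with a short sketch. The function name `halved_hidden_sizes` is hypothetical, and the use of integer (floor) division for odd sizes is an assumption, since the text does not say how odd values are rounded.

```python
def halved_hidden_sizes(n_layers: int, first_layer_size: int) -> list[int]:
    """Return the number of hidden states per LSTM layer under the halving strategy.

    Assumption: odd sizes are rounded down, and a layer never has fewer than 1 state.
    """
    sizes = []
    size = first_layer_size
    for _ in range(n_layers):
        sizes.append(max(1, size))
        size //= 2  # halve for the next (higher) layer
    return sizes


# Example from the text: Nl = 3 and Nh1 = 100
print(halved_hidden_sizes(3, 100))  # [100, 50, 25]
```

With this convention, only Nl and Nh1 need to be searched, which is what keeps the architecture search space small.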
Mathematically, the hyperparameters selection can be described as [21]
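$$
\lambda^{*} = \underset{\lambda \in \Lambda}{\arg\min} \left\{ \underset{x \in X_{\text{valid}}}{\operatorname{mean}}\; L\left(x;\, A_{\lambda}\left(X_{\text{train}}\right)\right) \right\} \tag{3}
$$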
where λ denotes the hyperparameters set that includes the number of layers Nl, the number of hidden states, the l-2 regulator, mini-batch size, maximum number of epochs, learning rate, etc.; λ* is the optimal hyperparameters set; Λ denotes the search space of the hyperparameters λ; Xtrain and Xvalid are the training and validation data; L(•) denotes the loss function (VAMSE [18–25]); Aλ denotes a learning algorithm (Adam in this paper) that maps the training data Xtrain to an LSTM predictor.
Figure 2 shows the scheme of the hyperparameters selection strategy based on minimizing VAMSE. Although many optimization search strategies have been reported for solving the minimization problem described in equation (3), these search strategies are not the focus of this paper. Here, we adopt the grid search because it is simple and deterministic, although it is computationally costly. Random search and evolutionary algorithms (e.g., genetic algorithms) involve randomness during the search process. For a fair comparison between the proposed physics-informed hyperparameters selection and the VAMSE-based hyperparameters selection, adopting the grid search rules out the possibility that an improvement in fault detection performance is induced by the optimization algorithm. We acknowledge that other methods, such as Bayesian optimization and differentiable architecture search, are more efficient than the grid search, although theoretically more complex. These algorithms are left for our future studies.
The grid search method partitions the search space Λ discretely and then evaluates all combinations of the hyperparameters. The detailed procedure of the hyperparameters selection strategy is described below:
(a) In the beginning, training and validation data (Xtrain and Xvalid) need to be prepared, both collected under a healthy state of the target gearbox.
(b) Candidate sets for each hyperparameter should be configured. For instance, the set for Nh1 can be a sequence from 50 to 500 with a step size of 50.
(c) A loop is executed to train the LSTM using the training data, apply the trained LSTM to the validation data, and obtain the VAMSE under all possible combinations of the hyperparameter candidates.
(d) The hyperparameters that give the minimal VAMSE are chosen, and the resulting LSTM is taken as the final model for gearbox fault detection. This step is not shown in the schematic figure.
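Steps (a) to (d) can be sketched as a plain loop over all candidate combinations. This is a minimal illustration rather than the authors' implementation: `train_and_validate` is a hypothetical callback that would train an LSTM with the given setting and return its VAMSE, and `toy_vamse` is a made-up stand-in so that the snippet runs without a deep learning library. Only the Nh1 range (50 to 500 in steps of 50) comes from the text; the other candidate sets are assumptions.

```python
from itertools import product


def grid_search_min_vamse(candidates, train_and_validate):
    """Evaluate every hyperparameter combination and keep the one with minimal VAMSE.

    candidates: dict mapping a hyperparameter name to its list of candidate values.
    train_and_validate: callable taking a setting dict and returning the VAMSE (float).
    """
    names = list(candidates)
    best_setting, best_vamse = None, float("inf")
    for values in product(*(candidates[name] for name in names)):
        setting = dict(zip(names, values))
        vamse = train_and_validate(setting)  # step (c): train, validate, score
        if vamse < best_vamse:               # step (d): keep the minimum
            best_setting, best_vamse = setting, vamse
    return best_setting, best_vamse


def toy_vamse(setting):
    """Hypothetical stand-in for training an LSTM; NOT a real validation error."""
    return ((setting["n_h1"] - 200) ** 2 * 1e-6
            + abs(setting["n_layers"] - 2) * 0.01
            + setting["l2"])


candidates = {
    "n_layers": [1, 2, 3, 4],           # assumed candidate set
    "n_h1": list(range(50, 501, 50)),   # 50, 100, ..., 500 as in step (b)
    "l2": [1e-4, 1e-3, 1e-2],           # assumed candidate set
}
best, score = grid_search_min_vamse(candidates, toy_vamse)
print(best, score)
```

Because every combination is visited in a fixed order, repeated runs return the same setting as long as the scoring callback itself is deterministic.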
In this section, we first introduce the main scheme of the proposed physics-informed hyperparameters selection strategy, then
introduce the crack-induced impulse generation method, as well as the harmonic to noise ratio in the envelope spectrum.
where Xphysics denotes the physics-informed fault signature; Ind(·) denotes an indicator function that quantifies the fault signature in the LSTM residuals; discrep(·) denotes a discrepancy measure between two sequences a and b. In this paper, we use the following measure:
$$
\operatorname{discrep}(a, b) = \frac{\operatorname{mean}(a) - \operatorname{mean}(b)}{\operatorname{std}(b)} \tag{5}
$$
The discrepancy measure specified in equation (5) is the standardized Euclidean distance between the mean of the indicator sequence a (calculated from the residual of Xvalid + Xphysics) and the distribution of the indicator sequence b (calculated from the residual of Xvalid). The discrepancy measure is also equivalent to the Mahalanobis distance [35], since the number of random variables equals 1.
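Equation (5) is a one-line computation once the two indicator sequences are available. The sketch below assumes NumPy and uses the sample standard deviation (`ddof=1`), which is an assumption because the text does not specify the estimator; the numbers are toy values, not results from the paper.

```python
import numpy as np


def discrep(a, b):
    """Standardized distance between mean(a) and the distribution of b, as in equation (5).

    a: indicator values from the residual of Xvalid + Xphysics (physics-informed faulty).
    b: indicator values from the residual of Xvalid (healthy).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a.mean() - b.mean()) / b.std(ddof=1)  # ddof=1 is an assumed choice


# Toy indicator sequences for illustration only
healthy = [1.0, 2.0, 3.0]
faulty = [5.0, 6.0, 7.0]
print(discrep(faulty, healthy))  # 4.0
```

A larger value means the indicator under the simulated faulty state sits further from the healthy distribution, which is the quantity that the proposed strategy maximizes over the hyperparameter candidates.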
The key point of the proposed physics-informed hyperparameters selection strategy is that we generate the fault signature Xphysics based on physical knowledge and add it to Xvalid to mimic vibration data under a faulty state. Using both Xvalid + Xphysics and Xvalid, the discrepancy can be calculated, which measures how well an LSTM differentiates healthy and faulty states. The hyperparameters setting that returns the LSTM that best differentiates healthy and faulty states is eventually selected.
Figure 3 shows the scheme of the proposed physics-informed hyperparameters selection strategy, which is mostly the same as what was described in Section 2.2 except for two major differences: first, we simulate the physics-informed fault signature Xphysics and add it to Xvalid to mimic vibration data under a faulty state; second, VAMSE is replaced with the discrepancy measure between the healthy state and the physics-informed faulty state.
In this paper, we consider two frequently occurring faults in gearbox operation, namely gear tooth crack and gear tooth wear. The physical knowledge of the signatures that these two faults induce is as follows: gear tooth crack induces periodic impulses in the vibration signals [30], and tooth wear raises the gear meshing frequency and its harmonics, as well as their sidebands, in the frequency spectrum of the vibration signals [31]. The simulation of the crack induced impulse (CII) will be detailed in Section 3.2. We use the harmonic to noise ratio in envelope spectrum (HNRES) as the indicator to quantify CIIs in the LSTM residual. The HNRES will be detailed in Section 3.3. As for the gear tooth wear induced signature, it is straightforward to edit the frequency spectrum of the vibration signal and then conduct an inverse Fourier transform to obtain the corresponding tooth wear-informed fault signature Xphysics.
The proposed physics-informed strategy utilizes additional physics information on the fault signature when selecting hyperparameters, compared with the conventional VAMSE-based strategy. Therefore, the resulting LSTM is expected to have better fault
detection capability compared to the LSTM returned from the conventional VAMSE-based strategy.
Note that we must train different LSTMs with optimized hyperparameters for different types of faults. An LSTM with optimized hyperparameters is used for detecting a designated fault type. For instance, an LSTM with its hyperparameters optimized by utilizing crack induced impulses is used for detecting the gear crack fault. An LSTM with its hyperparameters optimized by utilizing the gear tooth wear induced energy increases in the gear meshing frequency, its harmonics, and their sidebands is used for detecting the gear tooth wear fault.
A tooth crack reduces the meshing stiffness and therefore generates an impulse in the gear meshing force. Such an impulse is transmitted through the path of gear, shaft, and casing, and eventually reaches the accelerometer [31]. In this paper, we use the unit impulse sequence scaled by an amplitude factor to model the gear meshing force changes due to a tooth crack. Next, we identify the transmission path effect via operational modal analysis and then generate CIIs. Fig. 4 shows the logic of CII generation.
The scaled unit impulse sequence is as follows
$$
A\,\delta_{t-LT}, \quad \text{for } L = 1, 2, \ldots, N_c,\; t = 1, 2, \ldots, N \tag{6}
$$
where δt is the unit impulse function (also called the Dirac delta function), with δt = 1 when t = 0 and δt = 0 elsewhere; A denotes the scaling amplitude; T is the period of the gear revolution; L is an integer that indexes the Lth gear revolution cycle; Nc is the total number of gear revolution cycles.
The transmission path effect can be obtained by operational modal analysis using vibration signals collected under the healthy state
[36]. We adopt the AR model [37] to obtain the transmission path effect. The AR model is a typical time series model, which is
expressed by the following formula [38]
$$
x_t = \sum_{i=1}^{n_a} \alpha_i x_{t-i} + \varepsilon_t \tag{7}
$$
where xt and xt-i denote the data points at time t and time t-i, respectively, of a time series (e.g., a measured vibration signal); na
specifies the order of the AR terms; εt is the error term at time t; αi represents the AR parameters. In equation (7), we have assumed that
the data point xt is centered, and thus no constant term β0 is needed.
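Substituting the data for t = 1, …, N and discarding the first na points yields
$$
x = \Phi\theta + e \tag{8}
$$
with
$$
x = \begin{bmatrix} x_{n_a+1} \\ x_{n_a+2} \\ \vdots \\ x_{N} \end{bmatrix}_{(N-n_a)\times 1}, \quad
\Phi = \begin{bmatrix}
x_{n_a} & x_{n_a-1} & \cdots & x_{2} & x_{1} \\
x_{n_a+1} & x_{n_a} & \cdots & x_{3} & x_{2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{N-1} & x_{N-2} & \cdots & x_{N-n_a+1} & x_{N-n_a}
\end{bmatrix}_{(N-n_a)\times n_a}
$$
$$
\theta = \begin{bmatrix} \alpha_{1} \\ \alpha_{2} \\ \vdots \\ \alpha_{n_a} \end{bmatrix}_{n_a\times 1}, \quad
e = \begin{bmatrix} \varepsilon_{n_a+1} \\ \varepsilon_{n_a+2} \\ \vdots \\ \varepsilon_{N} \end{bmatrix}_{(N-n_a)\times 1}
$$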
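The parameter vector θ can be estimated by minimizing the residual sum of squared errors (i.e., the ordinary least squares estimator [38]) as
$$
\begin{aligned}
\hat{\theta} &= \underset{\theta}{\arg\min} \left\{ e^{T} e \right\} \\
&= \underset{\theta}{\arg\min} \left\{ (x - \Phi\theta)^{T} (x - \Phi\theta) \right\} \\
&= \underset{\theta}{\arg\min} \left\{ x^{T}x - 2x^{T}\Phi\theta + \theta^{T}\Phi^{T}\Phi\theta \right\}
\end{aligned} \tag{9}
$$
where the last step uses the scalar identity $x^{T}\Phi\theta = \theta^{T}\Phi^{T}x$.
Taking the partial derivative with respect to the vector θ and setting it equal to 0 gives the following solution for the parameter vector θ:
$$
\begin{aligned}
&\frac{\partial}{\partial\theta}\left( x^{T}x - 2x^{T}\Phi\theta + \theta^{T}\Phi^{T}\Phi\theta \right) = -2x^{T}\Phi + 2\theta^{T}\Phi^{T}\Phi = 0 \\
&\Rightarrow \hat{\theta}^{T} = x^{T}\Phi\left(\Phi^{T}\Phi\right)^{-1} \;\Rightarrow\; \hat{\theta} = \left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}x
\end{aligned} \tag{10}
$$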
The model order na is selected by minimizing the Bayesian information criterion (BIC) [11]:
$$
\text{BIC} = N \ln\left(\text{MSE}(e)\right) + n_a \ln(N) \tag{11}
$$
This BIC-based method evaluates na over a sufficiently large set of candidates and then chooses the one that minimizes BIC.
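The AR identification in equations (7) to (11) can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function names are hypothetical, `np.linalg.lstsq` is used in place of the explicit normal-equation solution of equation (10) because it solves the same least squares problem more stably, and the signal is a simulated AR(2) series rather than gearbox data.

```python
import numpy as np


def fit_ar(x, na):
    """Least squares AR(na) fit; returns the coefficients and the residual e."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Column i holds x_{t-i}; rows run over the targets x_{na+1}, ..., x_N
    phi = np.column_stack([x[na - i:n - i] for i in range(1, na + 1)])
    target = x[na:]
    theta, *_ = np.linalg.lstsq(phi, target, rcond=None)
    return theta, target - phi @ theta


def bic(resid, na, n):
    """Equation (11): N ln(MSE(e)) + na ln(N)."""
    return n * np.log(np.mean(resid ** 2)) + na * np.log(n)


def select_order(x, max_order):
    """Evaluate BIC for na = 1..max_order and return the minimizer and all scores."""
    scores = {na: bic(fit_ar(x, na)[1], na, len(x)) for na in range(1, max_order + 1)}
    return min(scores, key=scores.get), scores


# Simulated AR(2) signal (illustrative only): x_t = 1.2 x_{t-1} - 0.5 x_{t-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(2, x.size):
    x[t] = 1.2 * x[t - 1] - 0.5 * x[t - 2] + rng.standard_normal()

theta_hat, _ = fit_ar(x, 2)
best_order, scores = select_order(x, 6)
print(theta_hat, best_order)
```

In practice the candidate range for na would be much larger than 6 for a measured vibration signal; the small range here only keeps the example fast.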
We first remove the gear meshing frequency and its harmonics and then use the resulting vibration signals for AR model identification. Specifically, we transform the vibration signals collected under a healthy state into the frequency domain, manually set the amplitudes of the gear meshing frequency and its harmonics to zero, and then transform the vibration signals back to the time domain via the inverse Fourier transform.
The transmission path effects are buried in the AR model, and the eventual CII st is obtained by feeding the impulse sequence into
the AR model as follows
$$
s_t = A\,\delta_{t-LT} + \sum_{i=1}^{n_a} \alpha_i s_{t-i} \tag{12}
$$
The above equation is obtained from equation (7) by replacing εt with the scaled unit impulse sequence Aδt-LT. The initial value s0 can be chosen as 0 to start the autoregressive calculation.
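The recursion in equation (12) can be sketched as follows. The function name `simulate_cii` is hypothetical, T is assumed to be given in samples, and the single AR coefficient in the example is a toy value rather than an identified transmission path.

```python
import numpy as np


def simulate_cii(alpha, amplitude, period, n_samples):
    """Drive an AR model with a scaled unit impulse train, as in equation (12).

    alpha: AR coefficients alpha_1..alpha_na (the identified transmission path effect).
    amplitude: scaling amplitude A.
    period: impulse period T in samples (assumed unit).
    """
    alpha = np.asarray(alpha, dtype=float)
    drive = np.zeros(n_samples)
    drive[period::period] = amplitude  # impulses at t = T, 2T, ...
    s = np.zeros(n_samples)            # values before the start are taken as 0
    for t in range(n_samples):
        k = min(alpha.size, t)         # number of past samples available
        past = np.dot(alpha[:k], s[t - k:t][::-1])  # sum_i alpha_i * s_{t-i}
        s[t] = drive[t] + past
    return drive, s


# Toy example: a single decaying pole, one impulse inside the window
drive, s = simulate_cii(alpha=[0.5], amplitude=1.0, period=4, n_samples=8)
print(s)  # [0, 0, 0, 0, 1, 0.5, 0.25, 0.125]
```

Each impulse excites the AR model, so the simulated CII inherits the resonances of the identified transmission path.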
Fig. 5. Experiment setup: (a) gearbox test rig, (b) schematic of the 2nd stage speed-up gearbox, (c) four sensor locations.
We calculate the harmonic to noise ratio in envelope spectrum (HNRES) to quantify the significance of CIIs in the LSTM residual
signals. HNRES represents the energy portion of CIIs in a signal. The envelope Et of the xt is first obtained as
$$
E_t = \frac{1}{W+1}\left( x_{t-W/2}^{2} + \cdots + x_{t}^{2} + \cdots + x_{t+W/2}^{2} \right) \tag{13}
$$
where W is the window size which is set as the period of one tooth pair mesh. The mean of Et is then removed. Next, the zero-mean
envelope is transformed to the frequency domain via the fast Fourier transform. Finally, HNRES value p is calculated from the envelope
spectrum as follows.
$$
p = 100\% \times \sum_{i=1}^{N_f} A_{if} \Big/ \sum_{i=1}^{N/2} A_i \tag{14}
$$
where f is the occurrence frequency of CIIs; Nf is the number of harmonics of the occurrence frequency; Ai is the envelope spectrum amplitude at frequency component i. In equation (14), N/2 denotes the number of frequency components in the halved discrete envelope spectrum. Theoretically, p ∈ [0, 100]. The higher the p value is, the more significant the CIIs are in a signal, and thus the more severe the tooth crack fault is.
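The HNRES calculation can be sketched as below. Several details are assumptions made for illustration: the envelope is taken as a centered moving average of the squared signal over W + 1 samples, the impulse period is given in samples, the kth harmonic is mapped to the nearest FFT bin k·N/period, and the zero-frequency bin is excluded. The function name `hnres` is hypothetical.

```python
import numpy as np


def hnres(x, window, impulse_period, n_harmonics):
    """Harmonic to noise ratio in envelope spectrum, in percent.

    window: W (even), so each envelope point averages W + 1 squared samples.
    impulse_period: period of the impulses in samples (assumed unit).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    kernel = np.ones(window + 1) / (window + 1)
    envelope = np.convolve(x ** 2, kernel, mode="same")  # moving average, zero-padded edges
    envelope -= envelope.mean()                          # remove the mean
    spectrum = np.abs(np.fft.rfft(envelope))[1:n // 2 + 1]  # bins 1..N/2
    # Nearest bin of each harmonic of the impulse occurrence frequency
    bins = np.rint(np.arange(1, n_harmonics + 1) * n / impulse_period).astype(int)
    bins = bins[bins <= n // 2]
    return 100.0 * spectrum[bins - 1].sum() / spectrum.sum()


# Toy comparison: a clean impulse train versus white noise
impulses = np.zeros(1024)
impulses[32::64] = 1.0
rng = np.random.default_rng(0)
noise = rng.standard_normal(1024)

p_impulses = hnres(impulses, window=8, impulse_period=64, n_harmonics=10)
p_noise = hnres(noise, window=8, impulse_period=64, n_harmonics=10)
print(p_impulses, p_noise)
```

The periodic impulse train concentrates its envelope spectrum at the harmonic bins and therefore scores much higher than noise, which is the behavior the indicator relies on.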
4. Case studies
In this section, we present two case studies to validate the effectiveness of the proposed physics-informed hyperparameters selection strategy.
Table 1
Object gear specifications.
Parameters Values
Fig. 6. Gear with a tooth root crack (αc = 60°; q0 = 0.2q; wo = 0.2w).
Table 2
UofA training, validation, and testing data.
Health states Number of segments Cycles per segment
whilst minimizing BIC. Fig. 7 shows the frequency spectrum of the data used in comparison with the frequency response of the AR model, as well as the AR residual. From Fig. 7(a), we can observe a good match between the frequency spectrum of the data used and the frequency response of the AR model. For comparison, the amplitude of the AR model frequency response is multiplied by a factor of 10^-2.7, which is the average amplitude of the AR residual as shown in Fig. 7(b).
Fig. 7. (a) modal shape versus the spectrum of raw signal (b) AR residual.
With the obtained transmission path effect, the CIIs st are obtained by feeding the scaled unit impulse sequence into the AR model,
as described in equation (12). Fig. 8 shows an example of the impulse sequence and simulated CIIs. The parameters are configured as A
= 1, T = 10 cycles, and Nc = 10. In the next subsection, T and Nc will be kept the same whereas the value of amplitude A will be
discussed.
Fig. 9. LSTM hyperparameters selection via maximizing discrepancy. Effects of the crack induced impulse amplitude: (a) A = 3; (b) A = 4; (c) A = 5; (d) A = 6. Each window bounded by short dashed blue lines ( − − − ) contains the discrepancy measures obtained under 16 values of the number of hidden states, Nh1 = [50, 100, …, 800], with the number of layers Nl and the l-2 regulator as marked in subplot (d). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 10. Fault detection performance of different LSTMs obtained when the amplitude A of the crack induced impulses is configured differently.
Fig. 11. An example of the validation data added with the crack induced impulses with different levels of amplitude, namely the Xvalid + Xphysics : (a)
raw validation data (A = 0); (b) A = 3; (c) A = 4; (d) A = 5; (e) A = 6.
of when CIIs with different levels of amplitude A are added to a segment of validation data. As shown in Fig. 11(d, e), the validation
data added with CIIs when A = 5 or 6 shows apparent visible periodic spikes in the time waveform. But when A = 3 or 4, spikes in
between 5 and 10 cycles are invisible and submerged in the time waveform, as shown in Fig. 11(b, c). From these observations, the
principle for A determination is concluded as: a value of A that is high enough to produce visible spikes in the time waveform of the
resulted Xvalid + Xphysics is needed for the physics-informed hyperparameters selection. In this case study, A = 6 is used.
Although the above analysis of a suitable A value used data collected under the faulty state, it is important to note that the determination principle is applicable to other gearboxes. Engineering practitioners do not need to repeat the same analysis when choosing A for a specific gearbox; thus, the proposed physics-informed hyperparameters selection strategy remains free from the requirement of collecting data under the faulty state. In Section 4.2, we will show the applicability to another gearbox without data collected under a faulty state.
Table 3
LSTM hyperparameters selected by the reported and proposed strategies in UofA case study.
Strategy                    SNR     Nl   Nh1   λ
Proposed strategy           Raw     4    800   10^-2
                            10 dB   2    700   10^-3
                            1 dB    4    550   10^-2
Reported strategy [18–25]   Raw     2    400   10^-4
                            10 dB   3    600   10^-4
                            1 dB    4    300   10^-2
For comparison, the LSTM was identified whilst minimizing the VAMSE (hereafter, we call this method the reported strategy
[18–22,24,25]).
The robustness to noise of both the proposed and reported methods was examined. Two levels of signal-to-noise ratio (SNR) are considered, namely 10 dB and 1 dB. Zero-mean Gaussian random noise with a standard deviation of 10^(-10/20) or 10^(-1/20), respectively, is added to the standardized vibration data.
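The noise levels follow from the definition SNR = 20 log10(σ_signal / σ_noise) with σ_signal = 1 for standardized data. The sketch below assumes NumPy; the function names are hypothetical and the signal is a placeholder array of zeros.

```python
import numpy as np


def noise_std_for_snr(snr_db):
    """Noise standard deviation giving the target SNR for a unit-variance signal."""
    return 10.0 ** (-snr_db / 20.0)


def add_gaussian_noise(x_standardized, snr_db, rng):
    """Add zero-mean Gaussian noise to standardized data at the requested SNR."""
    x_standardized = np.asarray(x_standardized, dtype=float)
    sigma = noise_std_for_snr(snr_db)
    return x_standardized + rng.normal(0.0, sigma, size=x_standardized.shape)


rng = np.random.default_rng(0)
noisy = add_gaussian_noise(np.zeros(1000), snr_db=10, rng=rng)
print(noise_std_for_snr(10), noise_std_for_snr(1))  # about 0.316 and 0.891
```

At 1 dB the noise standard deviation is close to that of the signal itself, which explains why this case is much harder than the 10 dB case.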
The LSTM hyperparameters selected by the reported and proposed strategies are listed in Table 3.
Figure 12 shows the fault detection performance of the physics-informed LSTM (i.e., the LSTM with its hyperparameters selected by the physics-informed selection strategy) in comparison with the reported LSTM (i.e., the LSTM with its hyperparameters selected by minimizing VAMSE), for the raw data, 10 dB SNR, and 1 dB SNR cases. The number of harmonics of the occurrence frequency is set to 25 in the HNRES calculation. Fig. 12(a, b) shows the HNRES series and the receiver operating characteristic (ROC) curve, respectively, for the raw data case. Fig. 12(c, d) shows the 10 dB SNR case. Fig. 12(e, f) shows the 1 dB SNR case. In the ROC plot, each point gives the missed alarm and false alarm rates when a certain fault detection threshold is used. Thus, the ROC curve comprehensively describes the fault detection performance over all possible selections of the fault detection threshold. Moreover, the closer the ROC curve is to the top-left corner, the smaller the missed alarm and false alarm rates are.
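How an ROC curve and its area are obtained from two sets of indicator values can be sketched as follows. This is a generic threshold sweep with trapezoidal integration, assuming NumPy; the function name and the score values are illustrative and are not taken from the case study.

```python
import numpy as np


def roc_curve_and_auc(healthy_scores, faulty_scores):
    """Sweep the detection threshold and return (false alarm rate, detection rate, AUC).

    A sample is flagged as faulty when its score is greater than or equal to the threshold.
    The missed alarm rate at each threshold is 1 minus the detection rate.
    """
    healthy = np.asarray(healthy_scores, dtype=float)
    faulty = np.asarray(faulty_scores, dtype=float)
    levels = np.unique(np.concatenate((healthy, faulty)))[::-1]  # descending
    thresholds = np.concatenate(([np.inf], levels))
    fpr = np.array([(healthy >= th).mean() for th in thresholds])
    tpr = np.array([(faulty >= th).mean() for th in thresholds])
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoidal rule
    return fpr, tpr, auc


# Toy scores: fully separated, then partially overlapping
_, _, auc_separated = roc_curve_and_auc([1, 2, 3], [4, 5, 6])
_, _, auc_overlap = roc_curve_and_auc([1, 2, 3, 4], [3, 4, 5, 6])
print(auc_separated, auc_overlap)  # 1.0 0.875
```

An AUC of 1 corresponds to indicator values that can be split by a single threshold with no missed alarms and no false alarms.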
It is evident that the proposed physics-informed LSTM performs better than the reported LSTM in fault detection. The HNRES series obtained by using the physics-informed LSTM under the faulty state is higher than that of the reported LSTM, as observed in Fig. 12(a, c, e), and the corresponding ROC curves are also closer to the top-left corner, as observed in Fig. 12(b, d, f). We further calculate the area under the ROC curve (AUC) to provide a quantitative comparison. For the raw data and 10 dB cases, the AUC of the physics-informed LSTM is 1, which means that the incipient tooth crack fault can be detected in every case with no false alarm. In contrast, the reported method has smaller AUC values of 0.9664 and 0.9559, respectively. For the 1 dB case, the AUC of the physics-informed LSTM is 0.8945, which is less than 1 due to the high noise level, but the reported method has a much smaller AUC of only 0.6195. The noise acts like a shrinkage operation on the HNRES under the faulty state.
Fig. 12. Fault detection performance of the physics-informed LSTM and reported LSTM on the UofA dataset, under 3 levels of the signal to noise ratio. (a, b) show the HNRES series and the receiver operating characteristic (ROC) curve, respectively, for the raw data case. (c, d) show the 10 dB SNR case. (e, f) show the 1 dB SNR case. In (b, d, f), the AUC value on the right-hand side corresponds to the physics-informed LSTM, and the one on the left-hand side to the reported LSTM. In the ROC plots, the false positive rate denotes the type I error rate, i.e., the rate at which a healthy state is identified as a faulty state. The true positive rate equals one minus the type II error rate, where a type II error means that a true fault is not detected.
To support our finding that sacrificing VAMSE can return an LSTM which better differentiates the faulty and healthy states, we report
the VAMSE of both the proposed physics-informed LSTM and the reported LSTM. Table 4 lists the VAMSEs of both LSTMs. The
proposed physics-informed LSTM has a larger VAMSE than the reported LSTM in all three SNR cases, while it has
better fault detection performance, as presented in Fig. 12. We conclude that sacrificing VAMSE can indeed return an LSTM which
better differentiates the faulty and healthy states.
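The contrast between the two selection criteria can be sketched as follows. This is only an illustration under stated assumptions: the `discrepancy` score used here (ratio of RMS residual under the simulated fault to RMS residual under the healthy state) is a hypothetical stand-in for the discrepancy measure of the proposed strategy, and the residual values for the two candidate hyperparameter sets are made up.

```python
import math


def rms(values):
    """Root mean square of a residual series."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def vamse(healthy_residuals):
    """Validation mean squared error on healthy data."""
    return sum(r * r for r in healthy_residuals) / len(healthy_residuals)


def discrepancy(healthy_residuals, faulty_residuals):
    """Hypothetical discrepancy score (an assumption for this sketch):
    RMS residual on physics-informed faulty data over RMS residual on
    healthy data."""
    return rms(faulty_residuals) / rms(healthy_residuals)


def select(candidates):
    """Return (choice by minimum VAMSE, choice by maximum discrepancy).

    `candidates` maps a hyperparameter-set name to a pair
    (healthy_residuals, faulty_residuals) from the trained model.
    """
    by_vamse = min(candidates, key=lambda k: vamse(candidates[k][0]))
    by_discrepancy = max(candidates, key=lambda k: discrepancy(*candidates[k]))
    return by_vamse, by_discrepancy


# Made-up residuals for two candidate hyperparameter sets (illustration only).
candidates = {
    "A": ([0.10, -0.10, 0.10, -0.10], [0.12, -0.12, 0.12, -0.12]),
    "B": ([0.20, -0.20, 0.20, -0.20], [0.60, -0.60, 0.60, -0.60]),
}

print(select(candidates))  # ('A', 'B')
```

In this toy case candidate A fits the healthy data best, yet candidate B separates the two states more clearly, which mirrors the trade-off described above.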
Table 4
VAMSE, in (m/s²)², of the reported and proposed LSTMs in the UofA case study.
SNR Raw 10 dB 1 dB
Fig. 13. Spur gearbox test rig at the University of New South Wales: (a) overall view; (b) schematic diagram showing the motor, coupling, encoders,
gearbox, accelerometer, torque meter, and EMP brake. Source: [43].
Table 5
UNSW dataset: partition of training, validation, and testing data.
Data set      Health state               Number of segments    Cycles per segment
Training      Healthy                    64                    1
Validation    Healthy                    24                    10
Testing       Healthy                    36                    10
Testing       Pitted area = 4.45 mm²     36                    10
Testing       Pitted area = 7.52 mm²     36                    10
Fig. 14. Raised amplitudes: the first three gear mesh harmonics (fgm, 2fgm, 3fgm) and their first-order sidebands (fgm ± fr, 2fgm ± fr, 3fgm ± fr).
In the figure, fgm denotes the gear meshing order and fr denotes the rotating order. The red dashed line denotes the raised energy.
crack cases, sacrificing VAMSE can indeed return an LSTM that better differentiates the faulty and healthy states.
5. Discussions
The hypothesis of this study is that the LSTM returned from the proposed physics-informed strategy will have better fault detection
capability than the LSTM returned from the conventional VAMSE-based strategy. Section 4 has shown that the proposed
physics-informed strategy outperforms the conventional VAMSE-based strategy in detecting tooth crack and wear faults. The
reason behind this improved performance may be as follows. Although the conventional VAMSE-based strategy returns an LSTM that
best generalizes the healthy state, the best generalization of the healthy state does not guarantee the best differentiation of healthy and
Table 6
LSTM hyperparameters selected by the reported and proposed strategies in the UNSW case study.
Strategy                      SNR      Nl    Nh1    λ
Reported strategy [18–25]     Raw      3     600    10⁻³
Reported strategy [18–25]     10 dB    2     700    10⁻⁴
Reported strategy [18–25]     1 dB     4     450    10⁻⁴
Proposed strategy             Raw      4     750    10⁻³
Proposed strategy             10 dB    4     700    10⁻⁴
Proposed strategy             1 dB     4     600    10⁻³
Fig. 15. Fault detection performance of the proposed physics-informed LSTM and the reported LSTM on the UNSW gear wear dataset under three
signal-to-noise ratio (SNR) levels. (a, b) show the RMS series and the receiver operating characteristic (ROC) curve, respectively, for the raw data case;
(c, d) show the 10 dB SNR case; (e, f) show the 1 dB SNR case. The plotted health states are healthy, pitted area = 4.45 mm², and pitted area = 7.52 mm².
In (b, d, f), the AUC value on the right-hand side corresponds to the physics-informed LSTM, and the value on the left-hand side to the reported LSTM.
Table 7
VAMSE, in (m/s²)², of the reported and proposed LSTMs in the UNSW case study.
SNR Raw 10 dB 1 dB
faulty states, as the LSTM simply has no sense of what the faulty state would look like. In the proposed physics-informed
strategy, additional physics information on the fault signature has been utilized when selecting hyperparameters. The learned
LSTM has, in effect, seen what an “out-of-sample” (faulty) state looks like, in addition to the healthy state. Therefore,
the learned LSTM has better fault detection capability than the LSTM returned from the conventional VAMSE-based strategy.
An advantage of the proposed physics-informed hyperparameters selection strategy is that it does not require a highly accurate
simulation of the fault-induced signatures. In the gear tooth crack case studies, the scaled unit impulse sequence is a simplified model
of the gear meshing force changes caused by a tooth crack. Crack-induced impulses are then generated by further considering
the transmission path effect, as illustrated in Fig. 4. Other ways of simulating crack-induced impulses, such as dynamic modeling,
are also available. Likewise, in the gear tooth wear case study, we used a simplified model of the wear-induced signatures
in the vibration signals, namely a rise in energy at the gear meshing frequency, its harmonics, and their first-order sidebands.
Even these simplified simulations of fault-induced signatures already yield better fault detection performance, as shown in the
case studies presented in Section 4. It is therefore arguable that the most accurate simulation of the fault-induced signatures
is not needed. This advantage makes the proposed physics-informed hyperparameters selection strategy easy to implement.
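A minimal sketch of the simplified crack model described above is given below: a scaled unit impulse sequence is passed through a transmission path and superimposed on a healthy signal. The single-degree-of-freedom decaying sinusoid used for the transmission path, and every parameter value, are assumptions made for illustration and are not the settings used in the case studies.

```python
import math


def impulse_train(n_samples, period, scale):
    """Scaled unit impulse sequence: one impulse every `period` samples."""
    return [scale if n % period == 0 else 0.0 for n in range(n_samples)]


def path_response(length, fs, f_res, zeta):
    """Impulse response of an assumed single-degree-of-freedom path:
    a decaying sinusoid with resonance f_res (Hz) and damping ratio zeta."""
    wn = 2.0 * math.pi * f_res
    wd = wn * math.sqrt(1.0 - zeta ** 2)
    return [math.exp(-zeta * wn * k / fs) * math.sin(wd * k / fs)
            for k in range(length)]


def convolve(x, h):
    """Causal convolution of x with h, truncated to len(x)."""
    y = [0.0] * len(x)
    for n, xn in enumerate(x):
        if xn == 0.0:
            continue  # skip samples without an impulse
        for k, hk in enumerate(h):
            if n + k >= len(x):
                break
            y[n + k] += xn * hk
    return y


def add_simulated_crack(healthy, period, scale, fs, f_res, zeta, h_len):
    """Superimpose simulated crack-induced impulses on a healthy signal."""
    impulses = impulse_train(len(healthy), period, scale)
    signature = convolve(impulses, path_response(h_len, fs, f_res, zeta))
    return [s + c for s, c in zip(healthy, signature)]


# Illustrative values only: a zero "healthy" signal makes the signature visible.
healthy = [0.0] * 200
faulty = add_simulated_crack(healthy, period=100, scale=0.5,
                             fs=1000.0, f_res=100.0, zeta=0.05, h_len=50)
print(max(abs(v) for v in faulty) > 0.0)  # True
```

For the wear case, the analogous step would raise the amplitudes at the gear mesh harmonics and their first-order sidebands instead of adding impulses.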
There may be other ways to integrate physics knowledge into machine learning, as pointed out by one of the reviewers. One
possible way is to integrate physics knowledge into the loss function used for LSTM training. A modified loss function would then be
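\[
\mathrm{loss} = \sum_{x \in X_{\mathrm{Valid}}} \mathrm{loss}(\hat{x} - x) + \gamma \sum_{x \in X_{\mathrm{Valid}} + X_{\mathrm{Physics}}} \mathrm{loss}(\hat{x} - x)^{-1} + \lambda R(W) \qquad (15)
\]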
where the first term is the prediction error on validation data collected under the healthy state; the second term, weighted by γ, is the inverse of
the prediction error on data under the physics-informed faulty state; and the third term, weighted by λ, is a regularization on the weight matrices. Such a
loss function will minimize the generalization error on validation data and maximize the prediction error on physics-informed faulty
data. Therefore, the returned LSTM is expected to better differentiate healthy and faulty states. Whether or not such a loss function can
indeed help the LSTM to better differentiate healthy and faulty states is subject to future study.
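To make the structure of Eq. (15) concrete, the following sketch evaluates the three terms for given predictions. It is a hedged illustration, not a tested training objective: the per-segment error is assumed to be a mean squared error, R(W) is assumed to be a sum of squared weights, and the small constant `eps` is added here only to avoid division by zero.

```python
def mse(predicted, measured):
    """Mean squared prediction error of one segment."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)


def modified_loss(healthy_pairs, faulty_pairs, weights, gamma, lam, eps=1e-12):
    """Evaluate an Eq. (15)-style loss.

    healthy_pairs / faulty_pairs: lists of (predicted, measured) segments
    from the healthy validation data and from the physics-informed faulty
    data. `eps` is an assumption of this sketch, not part of Eq. (15).
    """
    healthy_term = sum(mse(p, m) for p, m in healthy_pairs)
    faulty_term = sum(1.0 / (mse(p, m) + eps) for p, m in faulty_pairs)
    penalty = sum(w * w for w in weights)  # assumed form of R(W)
    return healthy_term + gamma * faulty_term + lam * penalty


# Made-up numbers, for illustration only.
healthy_pairs = [([1.0, 2.0], [1.0, 2.5])]   # MSE = 0.125
faulty_pairs = [([1.0, 2.0], [2.0, 3.0])]    # MSE = 1.0, inverse = 1.0
weights = [0.5, -0.5]                        # sum of squares = 0.5

value = modified_loss(healthy_pairs, faulty_pairs, weights, gamma=0.1, lam=0.01)
print(round(value, 6))  # 0.23
```

A larger prediction error on the faulty data shrinks the second term, so minimizing this loss pushes the model toward small errors on healthy data and large errors on physics-informed faulty data.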
Although this has not yet been validated, we believe that the proposed physics-informed LSTM-based method can also be used to detect
other types of faults. We will apply the proposed strategy to the detection of other types of faults in future work.
The efficiency of the proposed strategy can be further improved by adopting other advanced search algorithms, such as random
search, Bayesian optimization, and differentiable architecture search, as mentioned in Section 2.2. These algorithms are left for future
study.
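As one example of such a search algorithm, a random search over LSTM hyperparameters could look like the sketch below. The search ranges, the dictionary keys, and the `toy_score` function are hypothetical; in practice the score would come from training an LSTM with the sampled hyperparameters and evaluating the discrepancy between healthy and physics-informed faulty states.

```python
import random


def random_search(score, n_trials, seed=0):
    """Sample hyperparameter sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {
            "n_layers": rng.randint(1, 4),                       # assumed range
            "n_hidden": rng.randrange(50, 801, 50),              # assumed range
            "l2_weight": 10.0 ** rng.choice([-5, -4, -3, -2]),   # assumed range
        }
        s = score(config)
        if s > best_score:
            best_config, best_score = config, s
    return best_config, best_score


def toy_score(config):
    """Placeholder objective (an assumption): stands in for the discrepancy
    obtained after training an LSTM with `config`."""
    return -abs(config["n_hidden"] - 600) / 100.0 - abs(config["n_layers"] - 4)


best, best_value = random_search(toy_score, n_trials=50, seed=1)
print(best, best_value)
```

Unlike an exhaustive grid, the number of trained models is fixed by `n_trials` regardless of how many hyperparameters are searched.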
The requirement for physical information on fault-induced signatures may be a limitation of the proposed strategy, as such
information may not be available for some assets, especially newly designed models.
6. Conclusions
In this paper, we focused on the fault detection problem in which baseline CM data from a healthy state are easy to collect
but CM data from a faulty state are not, a situation often met in the CM and health management of gearboxes. We proposed a
physics-informed hyperparameters selection strategy for LSTM and subsequently the fault detection of gearboxes. The physics
knowledge on the fault-induced signature in CM data has been utilized to mimic the data under the faulty state. The novelty of the
proposed strategy lies in the selection of hyperparameters based on maximizing the discrepancy between healthy and physics-
informed faulty states, as opposed to minimizing VAMSE. Case studies have been conducted to detect the gear tooth crack and
tooth wear using laboratory test rigs. Results have shown that the proposed physics-informed hyperparameters selection strategy
returns an LSTM that can better detect the crack and wear faults than the LSTM returned from minimizing VAMSE.
The major contributions are as follows: 1) a physics-informed hyperparameters selection strategy is proposed for LSTM and
subsequently the fault detection of gearboxes; 2) the performance of the physics-informed hyperparameters selection strategy is
comparatively and comprehensively assessed using two case studies.
Yuejian Chen: Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft.
Meng Rao: Investigation, Validation, Writing – review & editing. Ke Feng: Validation, Writing – review & editing. Ming J. Zuo:
Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review & editing.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Acknowledgement
Sponsored by Future Energy Systems under the Canada First Research Excellence Fund (# FES-T11-P01 and FES-T14-P02), and the Natural
Sciences and Engineering Research Council of Canada (Grant #RGPIN-2021-02900). Reviewers’ and the Editor’s efforts are also much
appreciated.
References
[1] M.A.F. Pimentel, D.A. Clifton, L. Clifton, L. Tarassenko, A review of novelty detection, Signal Process. 99 (2014) 215–249, https://doi.org/10.1016/j.
sigpro.2013.12.026.
[2] B. Assaad, M. Eltabach, J. Antoni, Vibration based condition monitoring of a multistage epicyclic gearbox in lifting cranes, Mech. Syst. Signal Process. 42 (1)
(2014) 351–367, https://doi.org/10.1016/j.ymssp.2013.06.032.
[3] L. Yip, ‘Analysis and modeling of planetary gearbox vibration data for early fault detection’, M.A.Sc., University of Toronto (Canada), Canada, 2011. Accessed:
Dec. 15, 2017.
[4] J. Ma, F. Xu, K. Huang, R. Huang, GNAR-GARCH model and its application in feature extraction for rolling bearing fault diagnosis, Mech. Syst. Signal Process. 93
(Supplement C) (2017) 175–203, https://doi.org/10.1016/j.ymssp.2017.01.043.
[5] Y. Kim, J. Kim, Y.H. Kim, J. Chong, H.S. Park, System identification of smart buildings under ambient excitations, Measurement 87 (Supplement C) (2016)
294–302, https://doi.org/10.1016/j.measurement.2016.02.028.
[6] Z. Cai, J. Fan, Q. Yao, Functional-coefficient regression models for nonlinear time series, J. Am. Stat. Assoc. 95 (451) (2000) 941–956, https://doi.org/10.1080/
01621459.2000.10474284.
[7] J. Fan, Q. Yao, Z. Cai, Adaptive varying-coefficient linear models, J. Royal Stat. Soc.: Ser. B (Statistical Methodology) 65 (1) (2003) 57–80, https://doi.org/
10.1111/1467-9868.00372.
[8] S. Ma, P.-X.-K. Song, Varying index coefficient models, J. Am. Stat. Assoc. 110 (509) (2015) 341–356, https://doi.org/10.1080/01621459.2014.903185.
[9] M. Gan, H. Peng, X. Peng, X. Chen, G. Inoussa, A locally linear RBF network-based state-dependent AR model for nonlinear time series modeling, Inf. Sci. 180
(22) (2010) 4370–4383, https://doi.org/10.1016/j.ins.2010.07.012.
[10] M. Gan, C.L.P. Chen, H.X. Li, L. Chen, Gradient radial basis function based varying-coefficient autoregressive model for nonlinear and nonstationary time series,
IEEE Signal Process Lett. 22 (7) (2015) 809–812, https://doi.org/10.1109/LSP.2014.2369415.
[11] Y. Yu, X. Si, C. Hu, J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Comput. 31 (7) (2019) 1235–1270, https://
doi.org/10.1162/neco_a_01199.
[12] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, IEEE Trans. Neural Networks Learn. Syst. 28 (10) (2017)
2222–2232, https://doi.org/10.1109/TNNLS.2016.2582924.
[13] R. Jozefowicz, W. Zaremba, I. Sutskever, An empirical exploration of recurrent network architectures, in: Proceedings of the 32nd International Conference on
Machine Learning, 2015, pp. 2342–2350.
[14] W. Mao, J. He, J. Tang, Y. Li, Predicting remaining useful life of rolling bearings based on deep feature representation and long short-term memory neural
network, Adv. Mech. Eng. 10 (12) (2018), https://doi.org/10.1177/1687814018817184.
[15] W. Yu, I.Y. Kim, C. Mechefske, Remaining useful life estimation using a bidirectional recurrent neural network based autoencoder scheme, Mech. Syst. Sig.
Process. 129 (2019) 764–780, https://doi.org/10.1016/j.ymssp.2019.05.005.
[16] X. Ma, Z. Tao, Y. Wang, H. Yu, Y. Wang, Long short-term memory neural network for traffic speed prediction using remote microwave sensor data, Transp. Res.
Part C: Emerg. Technol. 54 (2015) 187–197, https://doi.org/10.1016/j.trc.2015.03.014.
[17] S. Siami-Namini, N. Tavakoli, A.S. Namin, A comparison of ARIMA and LSTM in forecasting time series, in: 2018 17th IEEE International Conference on Machine
Learning and Applications (ICMLA), Dec. 2018, pp. 1394–1401, https://doi.org/10.1109/ICMLA.2018.00227.
[18] W. Wang, F.A. Galati, D. Szibbo, ‘LSTM Residual Signal for Gear Tooth Crack Diagnosis’, in Advances in Asset Management and Condition Monitoring, Cham, 2020,
pp. 1075–1090. 10.1007/978-3-030-57745-2_89.
[19] H. Liu, J. Zhou, Y. Zheng, W. Jiang, Y. Zhang, Fault diagnosis of rolling bearings with recurrent neural network-based autoencoders, ISA Trans. 77 (2018)
167–178, https://doi.org/10.1016/j.isatra.2018.04.005.
[20] E. Marchi, F. Vesperini, S. Squartini, B. Schuller, Deep recurrent neural network-based autoencoders for acoustic novelty detection, Comput. Intell. Neurosci.
2017 (2017) 1–14, https://doi.org/10.1155/2017/4694860.
[21] J. Bergstra, Y. Bengio, ‘Random Search for Hyper-Parameter Optimization’, p. 25, 2012.
[22] L.M.R. Rere, M.I. Fanany, A.M. Arymurthy, Metaheuristic algorithms for convolution neural network, Comput. Intell. Neurosci. 2016 (2016) 1–13, https://doi.
org/10.1155/2016/1537325.
[23] M.A. Gelbart, J. Snoek, R.P. Adams, ‘Bayesian Optimization with Unknown Constraints’, arXiv:1403.5607 [cs, stat], Mar. 2014, Accessed: Sep. 11, 2021.
[Online]. Available: http://arxiv.org/abs/1403.5607.
[24] H. Liu, K. Simonyan, Y. Yang, ‘DARTS: Differentiable Architecture Search’, arXiv:1806.09055 [cs, stat], Apr. 2019, Accessed: Apr. 14, 2021. [Online]. Available:
http://arxiv.org/abs/1806.09055.
[25] B. Zoph, Q.V. Le, ‘Neural Architecture Search with Reinforcement Learning’, arXiv:1611.01578 [cs], Feb. 2017, Accessed: Apr. 14, 2021. [Online]. Available:
http://arxiv.org/abs/1611.01578.
[26] L. von Rueden et al., ‘Informed Machine Learning – A Taxonomy and Survey of Integrating Knowledge into Learning Systems’, arXiv:1903.12394 [cs, stat], Feb.
2020, Accessed: Apr. 14, 2021. [Online]. Available: http://arxiv.org/abs/1903.12394.
[27] J. Willard, X. Jia, S. Xu, M. Steinbach, V. Kumar, ‘Integrating Physics-Based Modeling with Machine Learning: A Survey’, arXiv:2003.04919 [physics, stat], Jul.
2020, Accessed: Apr. 14, 2021. [Online]. Available: http://arxiv.org/abs/2003.04919.
[28] M. Sadoughi, C. Hu, Physics-based convolutional neural network for fault diagnosis of rolling element bearings, IEEE Sens. J. 19 (11) (2019) 4181–4192,
https://doi.org/10.1109/JSEN.2019.2898634.
[29] J. Wang, Y. Li, R. Zhao, R.X. Gao, Physics guided neural network for machining tool wear prediction, J. Manuf. Syst. 57 (2020) 298–310, https://doi.org/
10.1016/j.jmsy.2020.09.005.
[30] R.B. Randall, A new method of modeling gear faults, J. Mech. Des. 104 (2) (1982) 259–267, https://doi.org/10.1115/1.3256334.
[31] R.B. Randall, Vibration-based Condition Monitoring: Industrial, Aerospace and Automotive Applications, John Wiley & Sons, 2011.
[32] R.C. Staudemeyer, E.R. Morris, ‘Understanding LSTM – a tutorial into Long Short-Term Memory Recurrent Neural Networks’, arXiv:1909.09586 [cs], Sep. 2019,
Accessed: Dec. 15, 2020. [Online]. Available: http://arxiv.org/abs/1909.09586.
[33] D.P. Kingma, J. Ba, ‘Adam: a method for stochastic optimization’, arXiv:1412.6980 [cs], Jan. 2017, Accessed: Dec. 15, 2020. [Online]. Available: http://arxiv.
org/abs/1412.6980.
[34] Z. Zhao, T. Li, J. Wu, C. Sun, S. Wang, R. Yan, X. Chen, Deep learning algorithms for rotating machinery intelligent diagnosis: an open source benchmark study,
ISA Trans. 107 (2020) 224–255, https://doi.org/10.1016/j.isatra.2020.08.010.
[35] R. De Maesschalck, D. Jouan-Rimbaud, D.L. Massart, The mahalanobis distance, Chemometr. Intell. Lab. Syst. 50 (1) (2000) 1–18, https://doi.org/10.1016/
S0169-7439(99)00047-7.
[36] D. Hanson, R.B. Randall, J. Antoni, D.J. Thompson, T.P. Waters, R.A.J. Ford, Cyclostationarity and the cepstrum for operational modal analysis of mimo
systems—Part I: modal parameter identification, Mech. Syst. Sig. Process. 21 (6) (2007) 2441–2458, https://doi.org/10.1016/j.ymssp.2006.11.008.
[37] V.H. Vu, M. Thomas, A.A. Lakis, L. Marcouiller, Operational modal analysis by updating autoregressive model, Mech. Syst. Sig. Process. 25 (3) (2011)
1028–1044, https://doi.org/10.1016/j.ymssp.2010.08.014.
[38] J.D. Cryer, K.-S. Chan, Time Series Analysis - With Applications in R. Springer, 2008. Accessed: Feb. 02, 2018. [Online]. Available: //www.springer.com/gp/book/
9780387759586.
[39] Y. Chen, X. Liang, M.J. Zuo, An improved singular value decomposition-based method for gear tooth crack detection and severity assessment, J. Sound Vib. 468
(2020), 115068, https://doi.org/10.1016/j.jsv.2019.115068.
[40] Z. Chen, Y. Shao, Dynamic simulation of spur gear with tooth root crack propagating along tooth width and crack depth, Eng. Fail. Anal. 18 (8) (2011)
2149–2164, https://doi.org/10.1016/j.engfailanal.2011.07.006.
[41] K. Feng, ‘Gear wear run-to-failure dataset’, vol. 1, Aug. 2021, 10.17632/p2yryg9k6z.1.
[42] K. Feng, W.A. Smith, P. Borghesani, R.B. Randall, Z. Peng, Use of cyclostationary properties of vibration signals to identify gear wear mechanisms and track wear
evolution, Mech. Syst. Sig. Process. 150 (2021), 107258, https://doi.org/10.1016/j.ymssp.2020.107258.
[43] Y. Chen, K. Feng, R.B. Randall, P. Borghesani, M.J. Zuo, ‘Use of autoregressive conditional heteroskedasticity model to assess the tooth surface roughness of a
gearbox’, in 2020 Asia-Pacific International Symposium on Advanced Reliability and Maintenance Modeling (APARM), Aug. 2020, pp. 1–4. 10.1109/
APARM49247.2020.9209389.
[44] H. Chang, P. Borghesani, W.A. Smith, Z. Peng, Application of surface replication combined with image analysis to investigate wear evolution on gear teeth – A
case study, Wear 430–431 (2019) 355–368, https://doi.org/10.1016/j.wear.2019.05.024.
[45] S. Ziaran, R. Darula, Determination of the state of wear of high contact ratio gear sets by means of spectrum and cepstrum analysis, J. Vib. Acoust. 135 (2) (2013),
https://doi.org/10.1115/1.4023208.