
Neural Networks, 2017 Special Issue
journal homepage: www.elsevier.com/locate/neunet
Evaluating deep learning architectures for Speech Emotion Recognition

Haytham M. Fayek a,∗, Margaret Lech a, Lawrence Cavedon b
a School of Engineering, RMIT University, Melbourne VIC 3001, Australia
b School of Science, RMIT University, Melbourne VIC 3001, Australia
∗ Corresponding author. E-mail addresses: haytham.fayek@ieee.org (H.M. Fayek), margaret.lech@rmit.edu.au (M. Lech), lawrence.cavedon@rmit.edu.au (L. Cavedon).

article info

Article history: Available online xxxx
Keywords: Affective computing; Deep learning; Emotion recognition; Neural networks; Speech recognition

abstract

Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances.

© 2017 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.neunet.2017.02.013

1. Introduction

In recent years, deep learning in neural networks has achieved tremendous success in various domains, which has led to multiple deep learning architectures emerging as effective models across numerous tasks. Feed-forward architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (ConvNets) have been particularly successful in image and video processing as well as speech recognition, while recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) RNNs have been effective in speech recognition and natural language processing (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). These architectures process and model information in different ways and have their own advantages and limitations. For instance, ConvNets are able to deal with high-dimensional inputs and learn features that are invariant to small variations and distortions (Krizhevsky, Sutskever, & Hinton, 2012), whereas LSTM-RNNs are able to deal with variable length inputs and model sequential data with long range context (Graves, 2008).

In this paper, we investigate the application of end-to-end deep learning to Speech Emotion Recognition (SER) and critically explore how each of these architectures can be employed in this task. SER can be regarded as a static or dynamic classification problem, which has motivated two popular formulations of the task in the literature (Ververidis & Kotropoulos, 2006): turn-based processing (also known as static modeling), which aims to recognize emotions from a complete utterance; or frame-based processing (also known as dynamic modeling), which aims to recognize emotions at the frame level. In either formulation, SER can be employed in stand-alone applications, e.g. emotion monitoring, or integrated into other systems for emotional awareness, e.g. integrating SER into Automatic Speech Recognition (ASR) to improve its capability in dealing with emotional speech (Cowie et al., 2001; Fayek, Lech, & Cavedon, 2016b; Fernandez, 2004). Frame-based processing is more robust since it does not rely on segmenting the input speech into utterances and can model intra-utterance emotion dynamics (Arias, Busso, & Yoma, 2013; Fayek, Lech, & Cavedon, 2015). However, empirical comparisons between frame-based processing and turn-based processing in prior work have demonstrated the superiority of the latter (Schuller, Vlasenko, Eyben, Rigoll, & Wendemuth, 2009; Vlasenko, Schuller, Wendemuth, & Rigoll, 2007).

Whether performing turn-based processing or frame-based processing, most of the research effort in the last decade has been devoted to selecting an optimal set of features (Schuller et al., 2010). Despite the effort, little success has been achieved in realizing such a set of features that performs consistently over different conditions and multiple data sets (Eyben, Scherer et al., 2015). Thus, researchers have resorted to brute-force high-dimensional
feature sets that comprise many acoustic parameters in an attempt to capture all variances (Tahon & Devillers, 2016). Such high-dimensional feature sets complicate the learning process in most machine learning algorithms, increase the likelihood of overfitting and hinder generalization. Moreover, the computation of many acoustic parameters is computationally expensive and may be difficult to apply on a large scale or with limited resources (Eyben, Huber, Marchi, Schuller, & Schuller, 2015). Therefore, it is highly pertinent to investigate the application of deep learning to SER to alleviate the problem of feature engineering and selection and achieve an SER system with a simple pipeline and low latency. Moreover, SER is an excellent test bed for exploring various deep learning architectures since the task itself can be formulated in multiple ways.

Deep learning has been applied to SER in prior work, as discussed in Section 2. However, with different data subsets and under various experiment conditions involved in prior studies, it is difficult to directly compare various deep learning models. To the best of our knowledge, our work provides the first empirical exploration of various deep learning formulations and architectures applied to SER. As a result, we report state-of-the-art results on the popular Interactive Emotional Dyadic Motion Capture (IEMOCAP) database (Busso et al., 2008) for speaker-independent SER.

The remainder of this paper is divided into seven sections. In the following section, related work is reviewed, highlighting recent advances. In Section 3, a review of deep learning is presented, focusing on the architectures and methods used in this paper. In Section 4, the proposed SER system is explained. In Section 5, the experimental setup is described, depicting the data, its preprocessing, the computational setup and the training recipe. Experiments performed and their results are presented in Section 6 and discussed in Section 7. Finally, the paper is concluded in Section 8.

2. Related work

Work on SER prior to 2011 is well reviewed in the literature (Ayadi, Kamel, & Karray, 2011; Petta, Pelachaud, & Cowie, 2011; Ververidis & Kotropoulos, 2006). Since DNNs displaced Gaussian Mixture Models (GMMs) for acoustic modeling in ASR (Hinton et al., 2012; Mohamed, Dahl, & Hinton, 2012), researchers have attempted to employ DNNs for other speech applications as well, and specifically for SER. Stuhlsatz et al. (2011) proposed a DNN Generalized Discriminant Analysis to deal with high-dimensional feature sets in SER, demonstrating better performance than Support Vector Machines (SVM) on the same set of features. In Li et al. (2013) a hybrid DNN–Hidden Markov Model (HMM) trained on Mel-Frequency Cepstral Coefficients (MFCCs) was proposed for SER and compared to a GMM–HMM, indicating improved results. Han, Yu, and Tashev (2014) used a DNN to extract features from speech segments, which were then used to construct utterance-level SER features that were fed into an Extreme Learning Machine (ELM) for utterance-level classification, outperforming other techniques. In Fayek, Lech, and Cavedon (2016a), a DNN was used to learn a mapping from Fourier-transform based filter banks to emotion classes using soft labels generated from multiple annotators to model the subjectiveness in emotion recognition, which yielded improved performance compared to ground truth labels obtained by majority voting between the same annotators.

More recently, alternative neural network architectures for SER were also investigated. Mao, Dong, Huang, and Zhan (2014) used a ConvNet in a two-stage SER scheme that involves learning local invariant features using a sparse auto-encoder from speech spectrograms, processed using Principal Component Analysis (PCA) followed by salient discriminative feature analysis to extract discriminative features, demonstrating competitive results. Tian, Moore, and Lai (2015) compared knowledge-inspired disfluency and non-verbal vocalization features in emotional speech against a feature set comprising acoustic parameters aggregated using statistical functionals, by using LSTM-RNNs as well as SVM, where the former was shown to yield better results given enough data.

This study differs from prior studies in several ways. We focus on a frame-based formulation for SER, aiming to achieve a system with a simple pipeline and low latency by modeling intra-utterance emotion dynamics. Moreover, most previous studies relied on some form of high-level features, while in this paper we strive for minimal speech processing and rely on deep learning to automate the process of feature extraction. Furthermore, we use uniform data subsets and experiment conditions promoting comparisons across various deep learning models, which has not been investigated in previous studies.

3. Deep learning: An overview

Deep learning in neural networks is the approach of composing networks into multiple layers of processing with the aim of learning multiple levels of abstraction (Goodfellow, Bengio, & Courville, 2016; LeCun et al., 2015). In doing so, the network can adaptively learn low-level features from raw data and higher-level features from low-level ones in a hierarchical manner, nullifying the over-dependence of shallow networks on feature engineering. The remainder of this section reviews the architectures, learning procedures and regularization methods used in this paper.

3.1. Architectures

The two most popular neural network architectures are the feed-forward (acyclic) architecture and the recurrent (cyclic) architecture (Schmidhuber, 2015). Feed-forward neural network architectures comprise multiple layers of transformations and nonlinearity with the output of each layer feeding the subsequent layer. A feed-forward fully-connected multi-layer neural network — also known as Deep Neural Network (DNN) — can be modeled by iterating over Eqs. (1) and (2):

$$ h^{(l)} = y^{(l-1)} W^{(l)} + b^{(l)} \quad (1) $$
$$ y^{(l)} = \varphi\left( h^{(l)} \right) \quad (2) $$

where l ∈ {1, . . . , L} denotes the lth layer, h^{(l)} ∈ R^{n_o} is a vector of preactivations of layer l, y^{(l-1)} ∈ R^{n_i} is the output of the previous layer (l − 1) and input to layer l, W^{(l)} ∈ R^{n_i × n_o} is a matrix of learnable weights of layer l, b^{(l)} ∈ R^{n_o} is a vector of learnable biases of layer l, y^{(l)} ∈ R^{n_o} is the output of layer l, y^{(0)} is the input to the model, y^{(L)} is the output of the final layer L and the model, and φ is a nonlinear activation function applied element-wise. The activation function used in this paper for feed-forward architectures is the Rectified Linear Unit (ReLU) as in Eq. (3) due to its advantages over other activation functions, such as computational simplicity and faster learning convergence (Glorot, Bordes, & Bengio, 2011).

$$ \varphi(z) = \max(0, z). \quad (3) $$

To provide a probabilistic interpretation of the model's output, the output layer L utilizes a softmax nonlinearity instead of the nonlinear function used in previous layers, as in Eq. (4):

$$ \mathrm{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \quad (4) $$

where K is the number of output classes.
A popular variant of the feed-forward neural network architecture is the Convolutional Neural Network (ConvNet) (LeCun et al., 1990), which leverages three ideas: sparse interactions; parameter sharing; and equivariant representations. This can be achieved by replacing the affine transformation in Eq. (1) with a convolution operation as in Eq. (5) and adding another layer called pooling, which aims to merge semantically similar features using a subsampling operation such as maximization.

$$ h^{(l)} = y^{(l-1)} \ast W^{(l)} + b^{(l)} \quad (5) $$

where in this case, W^{(l)} ∈ R^{m × j × k} is a tensor of m learnable filters, each of which is of height j and width k. Following recent work (He, Zhang, Ren, & Sun, 2016; Simonyan & Zisserman, 2015), subsampling is performed in this work by adjusting the stride in convolution layers rather than an explicit pooling layer.

The recurrent architecture extends the notion of a typical feed-forward architecture by adding inter-layer and self connections to units in the recurrent layer (Graves, 2008), which can be modeled using Eq. (6) in place of Eq. (1). This makes such architectures particularly suitable for tasks that involve sequential inputs such as speech.

$$ h_t^{(l)} = y_t^{(l-1)} W_y^{(l)} + s_{t-1}^{(l)} W_s^{(l)} + b^{(l)} \quad (6) $$

where t denotes the time step, h_t^{(l)} ∈ R^{n_o} is a vector of preactivations of layer l at time step t, y_t^{(l-1)} ∈ R^{n_i} is the output of the previous layer (l − 1) at time step t and input to layer l at time step t, W_y^{(l)} ∈ R^{n_i × n_o} is a matrix of learnable weights of layer l, s_{t-1}^{(l)} ∈ R^{n_o} is the state of layer l at the previous time step (t − 1), W_s^{(l)} ∈ R^{n_o × n_o} is a matrix of learnable weights of layer l, and b^{(l)} ∈ R^{n_o} is a vector of learnable biases of layer l. For recurrent architectures, sigmoid functions such as the logistic function in Eq. (7) and the hyperbolic tangent (tanh) function were used as the activation function instead of ReLUs, as ReLUs amplify the exploding gradient problem in recurrent architectures due to their unbounded nature.

$$ \sigma(z) = \frac{1}{1 + e^{-z}} \quad (7) $$

A popular variant of the recurrent architectures is the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), which uses an explicit memory cell to better learn long-term dependencies. An LSTM cell can be modeled using Eqs. (8)–(12):

$$ i_t^{(l)} = \sigma\left( W_{yi} y_t^{(l-1)} + W_{hi} h_{t-1}^{(l)} + W_{ci} c_{t-1}^{(l)} + b_i^{(l)} \right) \quad (8) $$
$$ f_t^{(l)} = \sigma\left( W_{yf} y_t^{(l-1)} + W_{hf} h_{t-1}^{(l)} + W_{cf} c_{t-1}^{(l)} + b_f^{(l)} \right) \quad (9) $$
$$ c_t^{(l)} = f_t^{(l)} c_{t-1}^{(l)} + i_t^{(l)} \tanh\left( W_{yc} y_t^{(l-1)} + W_{hc} h_{t-1}^{(l)} + b_c^{(l)} \right) \quad (10) $$
$$ o_t^{(l)} = \sigma\left( W_{yo} y_t^{(l-1)} + W_{ho} h_{t-1}^{(l)} + W_{co} c_t^{(l)} + b_o^{(l)} \right) \quad (11) $$
$$ h_t^{(l)} = o_t^{(l)} \tanh\left( c_t^{(l)} \right) \quad (12) $$

where σ is the logistic sigmoid function in Eq. (7), and i, f, o and c are the input gate, forget gate, output gate and cell activation vectors respectively, all of which are the same size as the vector h. The weight matrices from the cell to gate vectors, e.g. W_{ci}, are diagonal, such that each element in each gate vector only receives input from the same element of the cell vector (Graves, Mohamed, & Hinton, 2013).
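The following is a minimal NumPy sketch of a single LSTM step as written in Eqs. (8)–(12); the dimensions and random weights are illustrative. The peephole weights W_ci, W_cf and W_co are stored as vectors, i.e. diagonal matrices, so that each gate element only sees the matching cell element, as noted above.

```python
import numpy as np

def sigmoid(z):
    # Eq. (7): logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y, h_prev, c_prev, p):
    """One LSTM step, Eqs. (8)-(12). p holds the weights of a single layer."""
    i = sigmoid(y @ p["Wyi"] + h_prev @ p["Whi"] + p["wci"] * c_prev + p["bi"])   # Eq. (8)
    f = sigmoid(y @ p["Wyf"] + h_prev @ p["Whf"] + p["wcf"] * c_prev + p["bf"])   # Eq. (9)
    c = f * c_prev + i * np.tanh(y @ p["Wyc"] + h_prev @ p["Whc"] + p["bc"])      # Eq. (10)
    o = sigmoid(y @ p["Wyo"] + h_prev @ p["Who"] + p["wco"] * c + p["bo"])        # Eq. (11)
    h = o * np.tanh(c)                                                            # Eq. (12)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 40, 8                      # e.g. 40 filter banks per frame, 8 hidden units
p = {k: rng.normal(0, 0.1, (n_in, n_hid)) for k in ("Wyi", "Wyf", "Wyc", "Wyo")}
p.update({k: rng.normal(0, 0.1, (n_hid, n_hid)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.normal(0, 0.1, n_hid) for k in ("wci", "wcf", "wco")})  # diagonal peepholes
p.update({k: np.zeros(n_hid) for k in ("bi", "bf", "bc", "bo")})

h = c = np.zeros(n_hid)
for frame in rng.normal(size=(100, n_in)):   # run the cell over a 100-frame sequence
    h, c = lstm_step(frame, h, c, p)
```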
3.2. Learning

Learning is formulated as an optimization problem to minimize a cost function. The cost function used in this paper is the cross-entropy cost function in Eq. (13):

$$ C = -\sum_{k=1}^{K} \hat{y}_k \log\left( y_k^{(L)} \right) \quad (13) $$

where ŷ ∈ {0, 1}^K is a one-of-K encoded label and y^{(L)} is the output of the model.

The gradients are computed by differentiating the cost function with respect to the model parameters using a mini-batch of data examples sampled from the training data and backpropagated to prior layers using the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). Training recurrent architectures requires modification to the backpropagation algorithm to compute the gradients with respect to the parameters and states of the model, which is known as the backpropagation through time algorithm (Werbos, 1988).

Gradient descent or one of its variants is used to update the parameters of the model using the gradients computed. A per-parameter adaptive variant of gradient descent called RMSProp (Dauphin, de Vries, & Bengio, 2015; Tieleman & Hinton, 2012) was used in this paper, which uses gradient information to adjust the learning rate as in Eqs. (14) and (15):

$$ r := \eta r + (1 - \eta)\left( \partial C / \partial w \right)^2 \quad (14) $$
$$ w := w - \alpha \, \frac{\partial C / \partial w}{\sqrt{r} + \epsilon} \quad (15) $$

where r is a leaky moving average of the squared gradient and η and α are hyperparameters denoting the decay rate and the learning rate respectively.
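A short sketch of the RMSProp update of Eqs. (14) and (15) applied to one parameter array; the gradient here is a synthetic stand-in for ∂C/∂w, and the hyperparameter values simply mirror those reported later in Section 5.3.

```python
import numpy as np

def rmsprop_update(w, grad, r, alpha=1e-2, eta=0.99, eps=1e-8):
    """One RMSProp step: Eq. (14) updates the leaky average r, Eq. (15) updates w."""
    r = eta * r + (1.0 - eta) * grad ** 2          # Eq. (14)
    w = w - alpha * grad / (np.sqrt(r) + eps)      # Eq. (15)
    return w, r

rng = np.random.default_rng(0)
w = rng.normal(size=(40, 5))        # some weight matrix
r = np.zeros_like(w)                # running average of squared gradients
for _ in range(10):
    grad = rng.normal(size=w.shape)     # stand-in for dC/dw from backpropagation
    w, r = rmsprop_update(w, grad, r)
```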
3.3. Regularization

Deep architectures are prone to overfitting, which makes regularization an essential ingredient in their success. In this paper, three regularization techniques were used: l2 weight decay, which penalizes the l2 norm of the weights of the model; dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014), which stochastically omits units in the model during training, preventing co-adaptation of units; and Batch Normalization (BatchNorm) (Ioffe & Szegedy, 2015), which aims to reduce the internal covariate shift in deep architectures by normalizing the means and standard deviations of layer preactivations as in Eq. (16):

$$ \mathrm{BatchNorm}\left( z^{(l)}; \gamma, \beta \right) = \beta + \gamma \, \frac{z^{(l)} - \hat{E}\left( z^{(l)} \right)}{\sqrt{\widehat{\mathrm{Var}}\left( z^{(l)} \right) + \epsilon}} \quad (16) $$

where γ ∈ R^{n_o} and β ∈ R^{n_o} are model parameters that determine the mean and standard deviation of the layer preactivations respectively, and Ê and V̂ar are estimates of the sample mean and sample variance of the normalized preactivations respectively.

Unlike l2 weight decay, which is employed by simply modifying the cost function, dropout and BatchNorm require modifying the architecture of the model, in that BatchNorm can be treated as an additional layer added before the nonlinearity layer, while dropout is applied after the nonlinearity layer.
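A sketch of Eq. (16) applied to a mini-batch of preactivations at training time, assuming the sample mean and variance are estimated over the batch dimension; in a full implementation γ and β would be learned along with the other parameters and running estimates would be kept for evaluation.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Eq. (16): normalize each preactivation over the mini-batch, then scale and shift."""
    mean = z.mean(axis=0)                 # estimate of E(z) over the batch
    var = z.var(axis=0)                   # estimate of Var(z) over the batch
    z_hat = (z - mean) / np.sqrt(var + eps)
    return beta + gamma * z_hat

rng = np.random.default_rng(0)
z = rng.normal(2.0, 3.0, size=(256, 1024))     # batch of 256 examples, layer of 1024 units
gamma, beta = np.ones(1024), np.zeros(1024)    # initial scale and shift
out = batch_norm(z, gamma, beta)               # approximately zero mean, unit variance per unit
```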
4. Proposed speech emotion recognition system

Fig. 1 is a sketch of the proposed SER system, which follows a frame-based processing formulation that utilizes Fourier-transform based filter bank speech spectrograms and a deep multi-layered neural network to predict emotion class probabilities for each frame in the input utterance.

Let X ∈ R^{N × T} be a speech utterance or speech stream sliced into a time-series sequence of T frames, each of which is an R^N vector of audio features. The aim is to rely on minimal speech processing and thus each frame is represented by Fourier-transform based log Mel-scaled N filter banks. The goal of the model is to predict p(y_t | x), where x ∈ X is a number of concatenated frames, x_{t-l} ∥ · · · ∥ x_t ∥ · · · ∥ x_{t+r}, where x_t is the target frame, l is the number
of past context frames and r is the number of future context frames, and y_t is the predicted output for frame x_t, which is either one emotion class or silence. Silence was added to the output classes since silence as well as unvoiced speech were not removed from the input speech utterance, as it has been shown that silence and other disfluencies are effective cues in emotion recognition (Tian, Lai, & Moore, 2015).

Fig. 1. Overview of the proposed SER system. A deep multi-layered neural network composed of several fully-connected, convolutional or recurrent layers ingests a target frame (solid line) concatenated with a number of context frames (dotted line) to predict the posterior class probabilities corresponding to the target frame.

The proposed model is a deep multi-layered neural network. We experiment with several neural network architectures as presented in Section 6. It is important to note that the model is able to deal with utterances of variable length, independent of the choice of architecture, since the model predicts p(y_t | x), ∀t ∈ {1, . . . , T}: this only requires the target frame x_t and the past l and future r context frames, which are fixed a priori. Since emotions manifest in speech in a slow manner, one may not necessarily predict the class of every single frame in an utterance or speech stream but may rely on predicting the class of a frame sampled every few frames, depending on application requirements. The output of the model may be aggregated over the entire utterance to perform utterance-level classification if desired.
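The sketch below illustrates how the concatenated input x_{t−l} ∥ · · · ∥ x_t ∥ · · · ∥ x_{t+r} can be assembled for every target frame of an utterance; padding the utterance boundaries by repeating the edge frames is our illustrative assumption, not a detail prescribed by the system.

```python
import numpy as np

def context_windows(frames, l, r):
    """frames: (T, N) log filter-bank frames. Returns (T, (l + 1 + r) * N) inputs,
    one concatenated window x_{t-l} || ... || x_t || ... || x_{t+r} per target frame."""
    T, N = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], l, axis=0),   # repeat first frame l times
                             frames,
                             np.repeat(frames[-1:], r, axis=0)]) # repeat last frame r times
    return np.stack([padded[t:t + l + 1 + r].reshape(-1) for t in range(T)])

utterance = np.random.default_rng(0).normal(size=(450, 40))  # ~4.5 s at 10 ms per frame
x = context_windows(utterance, l=259, r=0)                   # past context only
print(x.shape)  # (450, 10400) = 260 frames x 40 filter banks per prediction
```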
5. Experimental setup

In this section, we introduce the data and its preprocessing, the computational setup and the training recipe used in this paper.

5.1. Data and preprocessing

The popular Interactive Emotional Dyadic Motion Capture (IEMOCAP) database (Busso et al., 2008) was used. It comprises 12 h of audio-visual recordings divided into five sessions. Each session is composed of two actors, a male and a female, performing emotional scripts as well as improvised scenarios. In total, the database comprises 10 039 utterances with an average duration of 4.5 s. Utterances were labeled by three annotators using categorical labels. The database predominantly focused on five emotions: anger, happiness, sadness, neutral and frustration; however, annotators were not limited to these emotions during annotation. Ground truth labels were obtained by majority voting, where 74.6% of the utterances were agreed upon by at least two annotators. Utterances that were labeled differently by all three annotators were discarded in this study. To be consistent with other studies on the same database (Kim, Lee, & Provost, 2013; Mariooryad & Busso, 2013; Shah, Chakrabarti, & Spanias, 2014), utterances that bore only the following four emotions were included: anger, happiness, sadness and neutral, with excitement considered as happiness.

An eight-fold Leave-One-Speaker-Out (LOSO) cross-validation scheme (Schuller, Vlasenko et al., 2009) was employed in all experiments using the eight speakers in the first four sessions. Both speakers in the fifth session were used to cross-validate the hyperparameters of the models and to apply early stopping during training, and were therefore not included in the cross-validation folds so as to not bias the results (Refaeilzadeh, Tang, & Liu, 2009).

Audio was analyzed using a 25 ms Hamming window with a stride of 10 ms. Log Fourier-transform based filter banks with 40 coefficients distributed on a Mel scale were extracted from each frame. The mean and variance were normalized per coefficient for each fold using the mean and variance computed using the training subset only. No speaker-dependent operations were performed.
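One way to reproduce this front end is sketched below, assuming the librosa library and 16 kHz audio; the exact windowing and normalization details of the Kaldi-based pipeline used here may differ slightly, and the waveforms are synthetic stand-ins.

```python
import numpy as np
import librosa

def log_mel_filterbanks(wav, sr=16000, n_mels=40):
    """25 ms Hamming window, 10 ms stride, 40 log Mel filter-bank coefficients per frame."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        win_length=int(0.025 * sr), window="hamming", n_mels=n_mels)
    return np.log(mel + 1e-10).T          # (T frames, 40 coefficients)

rng = np.random.default_rng(0)
wavs = [rng.normal(size=16000 * 3).astype(np.float32) for _ in range(4)]  # stand-in utterances
train, test = wavs[:3], wavs[3:]

# Per-coefficient mean/variance normalization, computed on the training subset only.
train_feats = [log_mel_filterbanks(w) for w in train]
mean = np.concatenate(train_feats).mean(axis=0)
std = np.concatenate(train_feats).std(axis=0)
test_feats = [(log_mel_filterbanks(w) - mean) / std for w in test]
```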
Since the data was labeled at an utterance level, all frames in an utterance inherited the utterance label. A voice activity detector was then used to label silent frames, and silence was added as an additional class to the four previously mentioned emotion classes; i.e. a frame has either the same label as its parent utterance or the silence label. The underlying assumption here is that frames in an utterance convey the same emotion as the parent utterance, which concurs with the same assumption made when a categorical label was assigned to the entire utterance; nevertheless, this assumption is eased by labeling silent and unvoiced frames as silence.

5.2. Computational setup

Due to the large number of experiments carried out in this paper, several computational resources were exploited at different stages. Some experiments were carried out on a cluster of CPUs, while others were carried out using Graphics Processing Units (GPUs) to accelerate the training process. The Kaldi toolkit (Povey et al., 2011) was used for speech processing and analysis. The neural networks and training algorithms were implemented in Matlab and C. Training time varied significantly between different models with an average duration of 2 days; however, the largest model took 14 days to train on a GPU. The code is available at http://github.com/haythamfayek/SER.

5.3. Training recipe

The parameters of the neural networks were initialized from a Gaussian distribution with zero mean and a standard deviation of √(2/n_i), where n_i is the number of inputs to the layer, as recommended by He et al. (2016). Mini-batch stochastic gradient descent with a batch size of 256 and the RMSProp per-parameter adaptive learning rate were used to optimize the parameters with respect to a cross-entropy cost function. The base learning rate was set to α = 1 × 10^{-2} and annealed by a factor of 10 when the error plateaus. The decay rate was set to η = 0.99. Fully-connected layers were regularized using dropout with a retention probability P = 0.5. Convolutional layers were regularized using l2 weight decay with penalty λ = 1 × 10^{-3}. LSTM layers were regularized using dropout with a retention probability P = 0.5 and the gradients were clipped to lie in the range [−5, 5]. BatchNorm was used after every fully-connected or convolutional layer. The validation set was used to perform early stopping during training, such that training halts when the learning rate reaches α = 1 × 10^{-8}; the model with the best accuracy on the validation set during training was selected. These hyperparameters were chosen based on experimental trials using the validation set.
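For reference, the recipe above can be summarized in code roughly as follows; the initialization function and the hyperparameter table are a sketch of the stated settings, not the actual training implementation (which was written in Matlab and C).

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # Zero-mean Gaussian with standard deviation sqrt(2 / n_in), following He et al.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

HYPERPARAMS = {
    "batch_size": 256,
    "base_learning_rate": 1e-2,      # annealed by a factor of 10 when the error plateaus
    "final_learning_rate": 1e-8,     # training halts once the learning rate reaches this value
    "rmsprop_decay": 0.99,
    "dropout_retention": 0.5,        # fully-connected and LSTM layers
    "l2_penalty": 1e-3,              # convolutional layers
    "gradient_clip": (-5.0, 5.0),    # LSTM gradients
}

rng = np.random.default_rng(0)
W = he_init(1024, 1024, rng)         # e.g. one fully-connected layer of the DNN
```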
6. Experiments and results

As is standard practice in the field of automatic emotion recognition, results are reported using Accuracy and Unweighted Average Recall (UAR) to reflect imbalanced classes (Schuller, Steidl, & Batliner, 2009). Both metrics are reported for the average of the eight-fold LOSO cross-validation scheme. The output of the model during evaluation was the class with the highest posterior probability, as in Eq. (17):

$$ \ell_t = \operatorname*{argmax}_{k=1,\ldots,K} \, p(y_t \mid x) \quad (17) $$

where K = 5 is the number of output classes and ℓ_t is the predicted label of frame t.
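For clarity, the two metrics can be computed as sketched below: accuracy is the fraction of frames classified correctly, whereas UAR averages the per-class recalls so that each of the K = 5 classes contributes equally regardless of its frequency; the frame posteriors and labels below are synthetic.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

def unweighted_average_recall(y_true, y_pred, num_classes=5):
    # Mean of per-class recalls over the classes that actually occur.
    recalls = [np.mean(y_pred[y_true == k] == k)
               for k in range(num_classes) if np.any(y_true == k)]
    return np.mean(recalls)

# Frame-level predictions via Eq. (17): the class with the highest posterior probability.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(5), size=1000)   # stand-in for p(y_t | x) over 1000 frames
y_pred = np.argmax(posteriors, axis=1)
y_true = rng.integers(0, 5, size=1000)
print(accuracy(y_true, y_pred), unweighted_average_recall(y_true, y_pred))
```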
6.1. Feed-forward architectures

The first experiment was conducted to investigate the number of context frames required for SER. We hypothesized that, unlike ASR, SER does not rely on future context but does require a large number of past context frames. Therefore, the models were trained in two configurations: (1) the first configuration was to predict p(y_t | x), where x is x_{t-c} ∥ · · · ∥ x_t ∥ · · · ∥ x_{t+c} for various values of c, i.e. predict the class label of the center frame; (2) the second configuration was to predict p(y_t | x), where x is x_{t-c} ∥ · · · ∥ x_t for various values of c, i.e. predict the class label of the final frame. The model used in this experiment was a DNN with 5 hidden layers; each was composed of 1024 fully-connected units with BatchNorm, ReLU and dropout layers interspersed in-between, and a softmax output layer. This architecture was selected based on the best UAR on the validation set, which was excluded from the cross-validation scheme. Fig. 2 is a plot of the test accuracy and UAR of the model in both configurations for various numbers of context frames.

Fig. 2. Test accuracy and UAR of a DNN with various numbers of context frames.

Two observations are immediately evident from the results in Fig. 2: (1) the performance of the system is directly proportional to the number of context frames until it starts to plateau after 220 frames; and (2) the future context has a minor contribution to the performance of the system, as hypothesized. Since a large number of context frames leads to an increase in the dimensionality of the input and may increase overfitting (as shown in Fig. 2), a good trade-off between the number of context frames and the performance of the system would lie between 2–3 s of speech.

ConvNets are able to learn features that are insensitive to small variations in the input speech, which can help in disentangling speaker-dependent variations as well as other sources of distortion such as noise. Moreover, ConvNets are able to deal with high-dimensional input, which in this case is due to the large number of context frames required. In this experiment, we present an in-depth exploration of various ConvNet architectures and demonstrate the effect of the number of convolutional and fully-connected layers, number of filters, size of filters and type of convolution (spatial vs temporal) on the performance of the system. Table 1 lists various ConvNet architectures and their respective test accuracy and UAR. All experiments were conducted using 259 past context frames and no future context frames, which correspond to approximately 2.6 s of speech; i.e. the input dimensionality is 40 filter banks × 260 frames.

From the results listed in the first segment of Table 1, the benefit of network depth can be observed. The best results were obtained using 2 convolutional layers followed by 2–3 fully-connected layers. Adding more layers to the network did not yield any performance gain and resulted in overfitting. The results in the second segment of Table 1 demonstrate the effect of the filter size on the performance of the model. It can be seen that, similar to other speech applications, SER requires a relatively large filter, with an optimal size of 10 × 10. Temporal convolution did perform slightly worse than spatial convolution, as demonstrated in the final segment of Table 1.

Table 1
Test accuracy and UAR of various ConvNet architectures. FC(n_o) denotes a fully-connected layer of n_o units followed by BatchNorm, ReLUs and dropout. Conv(m × j × k) and Conv1D(m × j × k) denote a spatial convolutional layer and a temporal convolutional layer respectively of m filters, each of size j × k, with a stride of 2, followed by BatchNorm and ReLUs. Softmax(n_o) denotes a softmax output layer of n_o units followed by a softmax operation.

Architecture | Test Accuracy (%) | Test UAR (%)
Conv(32 × 4 × 4) — FC(1024) — Softmax(1024) | 62.27 | 58.30
Conv(32 × 4 × 4) — FC(1024)× 2 — Softmax(1024) | 62.78 | 58.87
Conv(32 × 4 × 4) — Conv(64 × 3 × 3) — FC(1024) — Softmax(1024) | 62.58 | 58.71
Conv(32 × 4 × 4) — Conv(64 × 3 × 3) — FC(1024)× 2 — Softmax(1024) | 63.16 | 58.56
Conv(16 × 4 × 4) — Conv(32 × 3 × 3) — FC(716)× 2 — Softmax(716) | 63.34 | 59.30
Conv(32 × 4 × 4) — Conv(64 × 3 × 3) — FC(1024)× 3 — Softmax(1024) | 63.82 | 58.92
Conv(16 × 4 × 4) — Conv(32 × 3 × 3) — FC(716)× 3 — Softmax(716) | 62.90 | 58.17

Conv(16 × 6 × 6) — Conv(32 × 6 × 6) — FC(716)× 2 — Softmax(716) | 63.51 | 59.50
Conv(16 × 10 × 10) — Conv(32 × 10 × 10) — FC(716)× 2 — Softmax(716) | 64.78 | 60.89
Conv(16 × 14 × 14) — Conv(32 × 14 × 14) — FC(716)× 2 — Softmax(716) | 62.84 | 58.30
Conv(16 × 10 × 18) — Conv(32 × 10 × 18) — FC(716)× 2 — Softmax(716) | 63.07 | 58.79

Conv1D(64 × 40 × 4) — FC(1024)× 2 — Softmax(1024) | 62.41 | 58.38
Conv1D(64 × 40 × 8) — FC(1024)× 2 — Softmax(1024) | 62.98 | 59.07
Conv1D(64 × 40 × 16) — FC(1024)× 2 — Softmax(1024) | 62.91 | 58.49

6.2. Recurrent architectures

In the next set of experiments, we investigate how LSTM-RNNs can be employed for the proposed SER system. LSTM-RNNs can be trained in several ways, such as Sequence-to-Sequence, where a model is trained to ingest a sequence of frames and output a sequence of class labels; or Sequence-to-One, where a model is trained to ingest a sequence of frames and output a class label. Sequence-to-Sequence training may seem to be a better fit to the proposed system; however, preliminary experiments demonstrated the superiority of Sequence-to-One training, as Sequence-to-Sequence training failed to converge in most cases or had poor performance otherwise. Therefore, Sequence-to-One training was used in our experiments: the model was trained to ingest a sequence of frames, frame-by-frame, and predict a class label for the final frame, p(y_t | x), where x is x_{t-c} ∥ · · · ∥ x_t and c is the number of context frames (sequence length).
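The sketch below shows one way Sequence-to-One training examples could be sliced from an utterance under this formulation: each example is a window of c consecutive frames whose target is the label of the final frame; the sliding step of 10 frames is an illustrative choice rather than a reported detail.

```python
import numpy as np

def sequence_to_one_examples(frames, frame_labels, c, step=10):
    """frames: (T, N); frame_labels: (T,). Returns sequences of length c whose
    target is the label of the last frame in each sequence."""
    xs, ys = [], []
    for end in range(c, len(frames) + 1, step):
        xs.append(frames[end - c:end])
        ys.append(frame_labels[end - 1])
    return np.stack(xs), np.array(ys)

rng = np.random.default_rng(0)
frames = rng.normal(size=(450, 40))       # one utterance of log filter-bank frames
labels = rng.integers(0, 5, size=450)     # frame labels (4 emotions + silence)
x, y = sequence_to_one_examples(frames, labels, c=100)
print(x.shape, y.shape)                   # (36, 100, 40) (36,)
```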
validation scheme. Fig. 2 is a plot of the test accuracy and UAR of used in this experiment was a 2-layered LSTM-RNN with 256 units
the model in both configurations for various numbers of context in each hidden layer and dropout interspersed in-between and
frames. a softmax output layer. This architecture was selected based on
Two observations are immediately evident from the results in the best UAR on the validation set, which was excluded from the
Fig. 2: (1) the performance of the system is directly proportional cross-validation scheme. Figs. 3 and 4 depict the accuracy and UAR
to the number of context frames until it starts to plateau after respectively of the LSTM-RNNs trained and evaluated on various
220 frames and (2) the future context has a minor contribution sequence lengths.
to the performance of the system as hypothesized. Since a large Results in Figs. 3 and 4 demonstrate a similar trend in that
number of context frames lead to an increase in the dimensionality models trained on short sequences did not perform as well on
of the input and may increase overfitting (as shown in Fig. 2), long sequences and vice versa. In addition, noticeable gains in
a good trade-off between the number of context frames and the performance could be achieved by increasing the number of
performance of the system would lie between 2–3 s of speech. context frames (test sequence length). The best performance at
ConvNets are able to learn features that are insensitive to small each test sequence length was obtained by the model trained on
variations in the input speech which can help in disentangling the same sequence length and the performance degraded gradually
speaker-dependent variations as well as other sources of distortion as the test sequence length deviated from the training sequence
length. Moreover, by varying the sequence length when training LSTM-RNN-R, the model did learn to perform well on various test sequence lengths. On average, the best UAR averaged over all test sequence lengths was obtained by one of the models trained on a fixed sequence length, followed by LSTM-RNN-R.

Fig. 3. Test accuracy of an LSTM-RNN with various numbers of context frames. LSTM-RNN-c denotes the sequence length which the model was trained on. The no. of frames denotes the sequence length which the model was evaluated on.

Fig. 4. Test UAR of an LSTM-RNN with various numbers of context frames. LSTM-RNN-c denotes the sequence length which the model was trained on. The no. of frames denotes the sequence length which the model was evaluated on.

7. Discussion

The number of context frames was a major contributing factor in the performance of the system. All architectures benefited from using a large number of context frames. Key to harnessing the information in these frames without overfitting was using recent advances in regularization methods such as dropout and BatchNorm; otherwise the accompanying increase in dimensionality of the input data would have been problematic.

Table 2 lists the best model from each architecture and their respective accuracy and UAR, trained and evaluated under the same data subsets and experiment conditions. As stated earlier, SER can be regarded as a static or dynamic classification problem, which makes it an excellent test bed for conducting a comparison between these architectures. In this case, the ConvNet and the DNN can be regarded in this formulation as static classifiers that process a number of concatenated frames jointly to predict a class label, whereas the LSTM-RNN can be regarded in this formulation as a dynamic classifier that processes a sequence of frames, frame-by-frame, to predict a class label. The results in Table 2 suggest that the static component in speech is more discriminative for SER than the dynamic component. This is likely due to the dominant presence of the linguistic aspect in the dynamic component of speech, which hinders the recognition of paralinguistic components such as emotions. We speculate that for this reason, and due to the ConvNet's ability to learn discriminative features invariant to small variations, the ConvNet yielded the best accuracy and UAR, followed by the DNN then the LSTM-RNN.

Table 2
Test accuracy and UAR for various network architectures.

Model | Test Accuracy (%) | Test UAR (%)
FC(1024)× 5 — Softmax(1024) | 62.55 | 58.78
Conv(16 × 10 × 10) — Conv(32 × 10 × 10) — FC(716)× 2 — Softmax(716) | 64.78 | 60.89
LSTM-RNN(256)× 2 — Softmax(256) | 61.71 | 58.05

Trends in Table 1 and the best ConvNet architecture reported in this work are similar to those reported in other speech applications and particularly in ASR (Abdel-Hamid et al., 2014). This may pave the way for a single architecture to be used in a multi-task setting for speech processing (Fayek et al., 2016b).

Fig. 5 illustrates the output of the proposed SER system using a ConvNet for a number of selected utterances, all of which were from the test subset of the data. Qualitative assessment of the system's output indicates that the network has learned to model the intra-utterance emotion dynamics with high confidence and is able to transition smoothly from one class to another, capturing brief pauses and mixed emotions as shown in Fig. 5. It is particularly interesting to note the system's output in Fig. 5(e), which has classified the first half of the utterance as neutral
and the second half of the utterance as happy, conforming to our manual inspection, whereas the database annotators assigned the happy label to the entire utterance. In some cases, as in Fig. 5(d,e), the system did not predict the correct output class with high confidence across the entire utterance, which may suggest that a smoothing function over the network output may offer additional improvements in some cases; however, this is beyond the scope of this work.

Fig. 5. Input speech utterances (top) and corresponding aligned output (below) of the proposed SER system for a number of utterances from the test subset. The output is the posterior class probabilities p(y_t) denoting the confidence of the model. Transcripts: (a): Oh, laugh at me all you like but why does this happen every night she comes back? She goes to sleep in his room and his memorial breaks in pieces. Look at it, Joe look.: Angry. (b): I will never forgive you. All I'd done was sit around wondering if I was crazy waiting so long, wondering if you were thinking about me.: Happy. (c): Okay. So I am putting out the pets, getting the car out of the garage.: Neutral. (d): They didn't die. They killed themselves for each other. I mean that, exactly. Just a little more selfish and they would all be here today.: Sad. (e): Oh yeah, that would be. Well, depends on what type of car you had, though too. I guess it would be worth it. Helicopter. Yeah, helicopter. There is a helipad there, right? Yeah, exactly.: Happy.

To compare the results reported in this paper with prior work in the literature, which mostly relies on utterance-based classification, the posterior class probabilities computed for each frame in an utterance were averaged across all frames in that utterance, and an utterance-based label was selected based on the maximum average class probability, ignoring the silence label, as per Eq. (18):

$$ \ell_u = \operatorname*{argmax}_{k=1,\ldots,K} \, \frac{1}{T} \sum_{t=1}^{T} p(y_t \mid x) \quad (18) $$

where K = 4 is the number of output classes, i.e. ignoring the silence class, and ℓ_u is the utterance-level predicted label.
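A small sketch of Eq. (18): frame-level posteriors are averaged over the utterance, the silence class is dropped, and the most probable of the four emotion classes is selected; the class ordering with silence as the last index is an assumption made for illustration.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]   # silence assumed to be index 4

def utterance_label(frame_posteriors):
    """frame_posteriors: (T, 5) frame-level p(y_t | x). Average over frames,
    drop the silence class, and pick the most probable emotion, as in Eq. (18)."""
    mean_probs = frame_posteriors.mean(axis=0)
    return EMOTIONS[int(np.argmax(mean_probs[:4]))]

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(5), size=300)   # stand-in for 300 frames of model output
print(utterance_label(posteriors))
```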
Table 3 shows the SER results reported in prior work on the IEMOCAP database. Note that differences in data subsets used and other experiment conditions should be taken into consideration when comparing these results against each other; c.f. Han et al. (2014), Lee, Mower, Busso, Lee, and Narayanan (2011), Mariooryad and Busso (2013) and Shah et al. (2014) for more details. As can be seen from Table 3, the proposed SER system outperforms all other speaker-independent methods. In addition, the proposed SER system offers other advantages such as real-time output, since
it does not depend on future context. Moreover, the system is able to deal with utterances of arbitrary length with no degradation in performance. Furthermore, the system can handle utterances that contain more than one emotion class, as demonstrated in Fig. 5(e), which would not be possible in an utterance-based formulation.

Table 3
SER results reported in prior work on the IEMOCAP database. Note that differences in data subsets used and other experiment conditions should be taken into consideration when comparing the following results against each other, c.f. references for more details.

Method | Test Accuracy (%) | Test UAR (%) | Notes
DNN + ELM (Han et al., 2014) | 54.3 | 48.2 | –
SVM (Mariooryad & Busso, 2013) | 53.99 | 50.64 | Better performance was reported by incorporating other modalities.
Replicated Softmax Models + SVM (Shah et al., 2014) | – | 57.39 | Better performance was reported by incorporating other modalities.
Hierarchical Binary Decision Tree (Lee et al., 2011) | – | 58.46 | Speaker-dependent normalization.
Proposed SER system (Utterance-Based) | 57.74 | 58.28 |
Proposed SER system (Frame-Based) | 64.78 | 60.89 |

8. Conclusion

Various deep learning architectures were explored on a Speech Emotion Recognition (SER) task. Experiments conducted illuminate how feed-forward and recurrent neural network architectures and their variants could be employed for paralinguistic speech recognition, particularly emotion recognition. Convolutional Neural Networks (ConvNets) demonstrated better discriminative performance compared to other architectures. As a result of our exploration, the proposed SER system, which relies on minimal speech processing and end-to-end deep learning in a frame-based formulation, yields state-of-the-art results on the IEMOCAP database for speaker-independent SER.

Future work can be pursued in several directions. The proposed SER system can be integrated with automatic speech recognition, employing joint knowledge of the linguistic and paralinguistic components of speech to achieve a unified model for speech processing. More generally, observations made in this work as a result of exploring various architectures could be beneficial for devising further architectural innovations in deep learning that can exploit the advantages of current models and address their limitations.

Acknowledgments

This research was funded by the Vice-Chancellor's Ph.D. Scholarship (VCPS) from RMIT University. We gratefully acknowledge the support of NVIDIA Corporation with the donation of one of the Tesla K40 GPUs used in this research.

References

Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533–1545.
Arias, J. P., Busso, C., & Yoma, N. B. (2013). Energy and F0 contour modeling with functional data analysis for emotional speech detection. In Interspeech (pp. 2871–2875).
Ayadi, M. E., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44, 572–587.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42, 335–359. http://dx.doi.org/10.1007/s10579-008-9076-6.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human–computer interaction. IEEE Signal Processing Magazine, 18, 32–80. http://dx.doi.org/10.1109/79.911197.
Dauphin, Y., de Vries, H., & Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. In Advances in neural information processing systems (pp. 1504–1512).
Eyben, F., Huber, B., Marchi, E., Schuller, D., & Schuller, B. (2015). Real-time robust recognition of speakers' emotions and characteristics on mobile platforms. In 2015 international conference on affective computing and intelligent interaction (ACII) (pp. 778–780). http://dx.doi.org/10.1109/ACII.2015.7344658.
Eyben, F., Scherer, K., Schuller, B., Sundberg, J., Andre, E., Busso, C., et al. (2015). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 190–202. http://dx.doi.org/10.1109/TAFFC.2015.2457417.
Fayek, H. M., Lech, M., & Cavedon, L. (2015). Towards real-time speech emotion recognition using deep neural networks. In 2015 9th international conference on signal processing and communication systems (ICSPCS) (pp. 1–5). http://dx.doi.org/10.1109/ICSPCS.2015.7391796.
Fayek, H. M., Lech, M., & Cavedon, L. (2016a). Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels. In 2016 international joint conference on neural networks (IJCNN) (pp. 566–570). http://dx.doi.org/10.1109/IJCNN.2016.7727250.
Fayek, H. M., Lech, M., & Cavedon, L. (2016b). On the correlation and transferability of features between automatic speech recognition and speech emotion recognition. In Interspeech 2016 (pp. 3618–3622). http://dx.doi.org/10.21437/Interspeech.2016-868.
Fernandez, R. (2004). A computational model for the automatic recognition of affect in speech. (Ph.D. thesis), School of Architecture and Planning, Massachusetts Institute of Technology.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the fourteenth international conference on artificial intelligence and statistics, AISTATS 2011.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Graves, A. (2008). Supervised sequence labelling with recurrent neural networks. (Ph.D. thesis), Technische Universitat Munchen.
Graves, A., Mohamed, A.-R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649). http://dx.doi.org/10.1109/ICASSP.2013.6638947.
Han, K., Yu, D., & Tashev, I. (2014). Speech emotion recognition using deep neural network and extreme learning machine. In Interspeech 2014.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). http://dx.doi.org/10.1109/CVPR.2016.90.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 82–97. http://dx.doi.org/10.1109/MSP.2012.2205597.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456).
Kim, Y., Lee, H., & Provost, E. M. (2013). Deep learning for robust feature generation in audiovisual emotion recognition. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 3687–3691). http://dx.doi.org/10.1109/ICASSP.2013.6638346.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Neural information processing systems (pp. 1097–1105).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., et al. (1990). Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems 2 (pp. 396–404).
Lee, C.-C., Mower, E., Busso, C., Lee, S., & Narayanan, S. (2011). Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53, 1162–1171.
Li, L., Zhao, Y., Jiang, D., Zhang, Y., Wang, F., Gonzalez, I., et al. (2013). Hybrid deep neural network–hidden Markov model (DNN-HMM) based speech emotion recognition. In 2013 humaine association conference on affective computing and intelligent interaction (ACII) (pp. 312–317).
Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16, 2203–2213.
Mariooryad, S., & Busso, C. (2013). Exploring cross-modality affective reactions for audiovisual emotion recognition. IEEE Transactions on Affective Computing, 4, 183–196. http://dx.doi.org/10.1109/T-AFFC.2013.11.
Mohamed, A., Dahl, G., & Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20, 14–22. http://dx.doi.org/10.1109/TASL.2011.2109382.
Petta, P., Pelachaud, C., & Cowie, R. (2011). Emotion-oriented systems. In The humaine handbook.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. In Encyclopedia of database systems (pp. 532–538). Springer.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. http://dx.doi.org/10.1016/j.neunet.2014.09.003.
Schuller, B., Steidl, S., & Batliner, A. (2009). The Interspeech 2009 emotion challenge. In Interspeech, vol. 2009 (pp. 312–315).
Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C. A., et al. (2010). The Interspeech 2010 paralinguistic challenge. In Interspeech (pp. 2794–2797).
Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., & Wendemuth, A. (2009). Acoustic emotion recognition: A benchmark comparison of performances. In IEEE workshop on automatic speech recognition and understanding, 2009 (pp. 552–557).
Shah, M., Chakrabarti, C., & Spanias, A. (2014). A multi-modal approach to emotion recognition using undirected topic models. In IEEE international symposium on circuits and systems (ISCAS) (pp. 754–757).
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., & Schuller, B. (2011). Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5688–5691).
Tahon, M., & Devillers, L. (2016). Towards a small set of robust acoustic features for emotion recognition: Challenges. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 16–28.
Tian, L., Lai, C., & Moore, J. (2015). Recognising emotions in dialogues with disfluencies and non-verbal vocalisations. In R. Lickley (Ed.), The 7th workshop on disfluency in spontaneous speech.
Tian, L., Moore, J., & Lai, C. (2015). Emotion recognition in spontaneous and acted dialogues. In 2015 international conference on affective computing and intelligent interaction (ACII) (pp. 698–704).
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.
Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48, 1162–1181.
Vlasenko, B., Schuller, B., Wendemuth, A., & Rigoll, G. (2007). Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing. In Affective computing and intelligent interaction (pp. 139–147). Springer.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–356. http://dx.doi.org/10.1016/0893-6080(88)90007-X.
