feature sets that comprise many acoustic parameters in an attempt to capture all variances (Tahon & Devillers, 2016). Such high-dimensional feature sets complicate the learning process in most machine learning algorithms, increase the likelihood of overfitting and hinder generalization. Moreover, the computation of many acoustic parameters is computationally expensive and may be difficult to apply on a large scale or with limited resources (Eyben, Huber, Marchi, Schuller, & Schuller, 2015). Therefore, it is highly pertinent to investigate the application of deep learning to SER, to alleviate the problem of feature engineering and selection and to achieve an SER system with a simple pipeline and low latency. Moreover, SER is an excellent test bed for exploring various deep learning architectures, since the task itself can be formulated in multiple ways.

Deep learning has been applied to SER in prior work, as discussed in Section 2. However, with the different data subsets and various experiment conditions involved in prior studies, it is difficult to directly compare deep learning models. To the best of our knowledge, our work provides the first empirical exploration of various deep learning formulations and architectures applied to SER. As a result, we report state-of-the-art results on the popular Interactive Emotional Dyadic Motion Capture (IEMOCAP) database (Busso et al., 2008) for speaker-independent SER.

The remainder of this paper is divided into seven sections. In the following section, related work is reviewed, highlighting recent advances. In Section 3, a review of deep learning is presented, focusing on the architectures and methods used in this paper. In Section 4, the proposed SER system is explained. In Section 5, the experimental setup is described, depicting the data, its preprocessing, the computational setup and the training recipe. Experiments performed and their results are presented in Section 6 and discussed in Section 7. Finally, the paper is concluded in Section 8.

2. Related work

Work on SER prior to 2011 is well reviewed in the literature (Ayadi, Kamel, & Karray, 2011; Petta, Pelachaud, & Cowie, 2011; Ververidis & Kotropoulos, 2006). Since DNNs displaced Gaussian Mixture Models (GMMs) for acoustic modeling in ASR (Hinton et al., 2012; Mohamed, Dahl, & Hinton, 2012), researchers have attempted to employ DNNs for other speech applications as well, and specifically for SER. Stuhlsatz et al. (2011) proposed a DNN Generalized Discriminant Analysis to deal with high-dimensional feature sets in SER, demonstrating better performance than Support Vector Machines (SVMs) on the same set of features. In Li et al. (2013), a hybrid DNN-Hidden Markov Model (HMM) trained on Mel-Frequency Cepstral Coefficients (MFCCs) was proposed for SER and compared to a GMM-HMM, indicating improved results. Han, Yu, and Tashev (2014) used a DNN to extract features from speech segments, which were then used to construct utterance-level SER features that were fed into an Extreme Learning Machine (ELM) for utterance-level classification, outperforming other techniques. In Fayek, Lech, and Cavedon (2016a), a DNN was used to learn a mapping from Fourier-transform based filter banks to emotion classes using soft labels generated from multiple annotators to model the subjectiveness in emotion recognition, which yielded improved performance compared to ground truth labels obtained by majority voting between the same annotators.

More recently, alternative neural network architectures for SER were also investigated. Mao, Dong, Huang, and Zhan (2014) used a ConvNet in a two-stage SER scheme that involves learning local invariant features using a sparse auto-encoder from speech spectrograms processed using Principal Component Analysis (PCA), followed by salient discriminative feature analysis to extract discriminative features, demonstrating competitive results. Tian, Moore, and Lai (2015) compared knowledge-inspired disfluency and non-verbal vocalization features in emotional speech against a feature set comprising acoustic parameters aggregated using statistical functionals, using LSTM-RNNs as well as SVMs, where the former was shown to yield better results given enough data.

This study differs from prior studies in several ways. We focus on a frame-based formulation for SER, aiming to achieve a system with a simple pipeline and low latency by modeling intra-utterance emotion dynamics. Moreover, most previous studies relied on some form of high-level features, while in this paper we strive for minimal speech processing and rely on deep learning to automate the process of feature extraction. Furthermore, we use uniform data subsets and experiment conditions, promoting comparisons across various deep learning models, which has not been investigated in previous studies.

3. Deep learning: An overview

Deep learning in neural networks is the approach of composing networks into multiple layers of processing with the aim of learning multiple levels of abstraction (Goodfellow, Bengio, & Courville, 2016; LeCun et al., 2015). In doing so, the network can adaptively learn low-level features from raw data and higher-level features from low-level ones in a hierarchical manner, nullifying the over-dependence of shallow networks on feature engineering. The remainder of this section reviews the architectures, learning procedures and regularization methods used in this paper.

3.1. Architectures

The two most popular neural network architectures are the feed-forward (acyclic) architecture and the recurrent (cyclic) architecture (Schmidhuber, 2015). Feed-forward neural network architectures comprise multiple layers of transformations and nonlinearities, with the output of each layer feeding the subsequent layer. A feed-forward fully-connected multi-layer neural network, also known as a Deep Neural Network (DNN), can be modeled by iterating over Eqs. (1) and (2):

h^{(l)} = y^{(l-1)} W^{(l)} + b^{(l)}    (1)
y^{(l)} = φ(h^{(l)})    (2)

where l ∈ {1, ..., L} denotes the lth layer, h^{(l)} ∈ R^{n_o} is a vector of preactivations of layer l, y^{(l-1)} ∈ R^{n_i} is the output of the previous layer (l − 1) and input to layer l, W^{(l)} ∈ R^{n_i × n_o} is a matrix of learnable weights of layer l, b^{(l)} ∈ R^{n_o} is a vector of learnable biases of layer l, y^{(l)} ∈ R^{n_o} is the output of layer l, y^{(0)} is the input to the model, y^{(L)} is the output of the final layer L and of the model, and φ is a nonlinear activation function applied element-wise. The activation function used in this paper for feed-forward architectures is the Rectified Linear Unit (ReLU) in Eq. (3), due to its advantages over other activation functions, such as computational simplicity and faster learning convergence (Glorot, Bordes, & Bengio, 2011):

φ(z) = max(0, z).    (3)

To provide a probabilistic interpretation of the model's output, the output layer L utilizes a softmax nonlinearity instead of the nonlinear function used in previous layers, as in Eq. (4):

softmax(z_k) = e^{z_k} / Σ_{k'=1}^{K} e^{z_{k'}}    (4)

where K is the number of output classes.
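For illustration, the forward pass defined by Eqs. (1)–(4) can be sketched in NumPy as follows; the layer sizes, weight initialization and the four-class output are placeholders chosen for the example rather than the configuration used in this paper.

```python
import numpy as np

def relu(z):
    # Eq. (3): element-wise rectifier.
    return np.maximum(0.0, z)

def softmax(z):
    # Eq. (4): subtract the maximum for numerical stability.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def dnn_forward(y0, weights, biases):
    """Iterate Eqs. (1) and (2) over L layers; softmax on the final layer."""
    y = y0
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        h = y @ W + b                                      # Eq. (1): affine preactivation
        y = softmax(h) if l == len(weights) else relu(h)   # Eq. (2) / Eq. (4)
    return y

# Toy example: 40 input features, two hidden layers, 4 emotion classes.
rng = np.random.default_rng(0)
sizes = [40, 128, 128, 4]
weights = [rng.normal(0, 0.01, (i, o)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
probs = dnn_forward(rng.normal(size=(1, 40)), weights, biases)
print(probs.sum())  # ~1.0: the output is a distribution over classes
```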
A popular variant of the feed-forward neural network architecture is the Convolutional Neural Network (ConvNet) (LeCun et al., 1990), which leverages three ideas: sparse interactions, parameter sharing and equivariant representations. This can be achieved by replacing the affine transformation in Eq. (1) with a convolution operation as in Eq. (5) and adding another layer, called pooling, which aims to merge semantically similar features using a subsampling operation such as maximization:

h^{(l)} = y^{(l-1)} ∗ W^{(l)} + b^{(l)}    (5)

where in this case, W^{(l)} ∈ R^{m × j × k} is a tensor of m learnable filters, each of which is of height j and width k. Following recent work (He, Zhang, Ren, & Sun, 2016; Simonyan & Zisserman, 2015), subsampling is performed in this work by adjusting the stride in convolution layers rather than using an explicit pooling layer.
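As an illustrative sketch of Eq. (5) with subsampling by stride, the following naive NumPy implementation applies m filters to a single-channel input (for example, a spectrogram excerpt) using a stride of 2 in place of a pooling layer; the filter count, filter size and input dimensions are assumed values.

```python
import numpy as np

def conv2d_strided(y, W, b, stride=2):
    """Eq. (5) with subsampling by stride: y is (H, W_in), W is (m, j, k), b is (m,).
    Returns feature maps of shape (m, H_out, W_out)."""
    m, j, k = W.shape
    H, Wi = y.shape
    H_out = (H - j) // stride + 1
    W_out = (Wi - k) // stride + 1
    h = np.zeros((m, H_out, W_out))
    for f in range(m):                       # one output map per filter
        for r in range(H_out):
            for c in range(W_out):
                patch = y[r * stride:r * stride + j, c * stride:c * stride + k]
                h[f, r, c] = np.sum(patch * W[f]) + b[f]
    return h

rng = np.random.default_rng(0)
spectrogram = rng.normal(size=(40, 100))        # e.g. 40 filter banks x 100 frames
filters = rng.normal(0, 0.1, size=(8, 5, 5))    # 8 filters of size 5 x 5
maps = conv2d_strided(spectrogram, filters, np.zeros(8), stride=2)
print(maps.shape)   # (8, 18, 48): downsampled by the stride, no pooling layer
```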
The recurrent architecture extends the notion of a typical feed-forward architecture by adding inter-layer and self connections to units in the recurrent layer (Graves, 2008), which can be modeled using Eq. (6) in place of Eq. (1). This makes this type of architecture particularly suitable for tasks that involve sequential inputs such as speech:

h_t^{(l)} = y_t^{(l-1)} W_y^{(l)} + s_{t-1}^{(l)} W_s^{(l)} + b^{(l)}    (6)

where t denotes the time step, h_t^{(l)} ∈ R^{n_o} is a vector of preactivations of layer l at time step t, y_t^{(l-1)} ∈ R^{n_i} is the output of the previous layer (l − 1) at time step t and input to layer l at time step t, W_y^{(l)} ∈ R^{n_i × n_o} is a matrix of learnable weights of layer l, s_{t-1}^{(l)} ∈ R^{n_o} is the state of layer l at the previous time step (t − 1), W_s^{(l)} ∈ R^{n_o × n_o} is a matrix of learnable recurrent weights of layer l, and b^{(l)} ∈ R^{n_o} is a vector of learnable biases of layer l. For recurrent architectures, sigmoid functions such as the logistic function in Eq. (7) and the hyperbolic tangent (tanh) function were used as the activation function instead of ReLUs, as ReLUs amplify the exploding gradient problem in recurrent architectures due to their unbounded nature:

σ(z) = 1 / (1 + e^{-z}).    (7)
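The recurrence of Eq. (6) can be sketched as below, assuming a tanh activation and a zero initial state, and taking the layer state to be its activated output at the previous step; all dimensions are illustrative.

```python
import numpy as np

def rnn_layer(y_seq, W_y, W_s, b):
    """Eq. (6) unrolled over time: y_seq is (T, n_i); returns outputs (T, n_o).
    The layer state s_t is taken to be the activated output at step t."""
    T, n_o = y_seq.shape[0], b.shape[0]
    s = np.zeros(n_o)                       # zero initial state
    outputs = np.zeros((T, n_o))
    for t in range(T):
        h = y_seq[t] @ W_y + s @ W_s + b    # Eq. (6)
        s = np.tanh(h)                      # bounded activation, cf. Eq. (7)
        outputs[t] = s
    return outputs

rng = np.random.default_rng(0)
n_i, n_o, T = 40, 64, 100
out = rnn_layer(rng.normal(size=(T, n_i)),
                rng.normal(0, 0.1, (n_i, n_o)),
                rng.normal(0, 0.1, (n_o, n_o)),
                np.zeros(n_o))
print(out.shape)  # (100, 64): one output vector per time step
```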
A popular variant of the recurrent architecture is the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), which uses an explicit memory cell to better learn long-term dependencies. An LSTM cell can be modeled using Eqs. (8)–(12):

i_t^{(l)} = σ(W_{yi} y_t^{(l-1)} + W_{hi} h_{t-1}^{(l)} + W_{ci} c_{t-1}^{(l)} + b_i^{(l)})    (8)
f_t^{(l)} = σ(W_{yf} y_t^{(l-1)} + W_{hf} h_{t-1}^{(l)} + W_{cf} c_{t-1}^{(l)} + b_f^{(l)})    (9)
c_t^{(l)} = f_t c_{t-1}^{(l)} + i_t tanh(W_{yc} y_t^{(l-1)} + W_{hc} h_{t-1}^{(l)} + b_c^{(l)})    (10)
o_t^{(l)} = σ(W_{yo} y_t^{(l-1)} + W_{ho} h_{t-1}^{(l)} + W_{co} c_t^{(l)} + b_o^{(l)})    (11)
h_t^{(l)} = o_t^{(l)} tanh(c_t^{(l)})    (12)

where σ is the logistic sigmoid function in Eq. (7), and i, f, o and c are the input gate, forget gate, output gate and cell activation vectors respectively, all of which are the same size as the vector h. The weight matrices from the cell to the gate vectors, e.g. W_{ci}, are diagonal, such that each element in each gate vector only receives input from the same element of the cell vector (Graves, Mohamed, & Hinton, 2013).
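A single time step of the LSTM cell in Eqs. (8)–(12) may be sketched as follows; the diagonal cell-to-gate matrices are represented as vectors applied element-wise, and the weight values and sizes are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # Eq. (7)

def lstm_step(y, h_prev, c_prev, p):
    """One LSTM step, Eqs. (8)-(12). p is a dict of parameters; the cell-to-gate
    weights w_ci, w_cf, w_co are vectors, i.e. diagonal matrices."""
    i = sigmoid(y @ p["W_yi"] + h_prev @ p["W_hi"] + p["w_ci"] * c_prev + p["b_i"])  # (8)
    f = sigmoid(y @ p["W_yf"] + h_prev @ p["W_hf"] + p["w_cf"] * c_prev + p["b_f"])  # (9)
    c = f * c_prev + i * np.tanh(y @ p["W_yc"] + h_prev @ p["W_hc"] + p["b_c"])      # (10)
    o = sigmoid(y @ p["W_yo"] + h_prev @ p["W_ho"] + p["w_co"] * c + p["b_o"])       # (11)
    h = o * np.tanh(c)                                                               # (12)
    return h, c

rng = np.random.default_rng(0)
n_i, n_o = 40, 64
p = {k: rng.normal(0, 0.1, (n_i, n_o)) for k in ("W_yi", "W_yf", "W_yc", "W_yo")}
p.update({k: rng.normal(0, 0.1, (n_o, n_o)) for k in ("W_hi", "W_hf", "W_hc", "W_ho")})
p.update({k: rng.normal(0, 0.1, n_o) for k in ("w_ci", "w_cf", "w_co")})
p.update({k: np.zeros(n_o) for k in ("b_i", "b_f", "b_c", "b_o")})
h, c = lstm_step(rng.normal(size=n_i), np.zeros(n_o), np.zeros(n_o), p)
print(h.shape, c.shape)  # (64,) (64,)
```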
3.2. Learning

Learning is formulated as an optimization problem to minimize a cost function. The cost function used in this paper is the cross-entropy cost function in Eq. (13):

C = − Σ_{k=1}^{K} ŷ_k log(y_k^{(L)})    (13)

where ŷ ∈ {0, 1}^K is a one-of-K encoded label and y^{(L)} is the output of the model.

The gradients are computed by differentiating the cost function with respect to the model parameters using a mini-batch of data examples sampled from the training data, and backpropagated to prior layers using the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). Training recurrent architectures requires a modification to the backpropagation algorithm to compute the gradients with respect to the parameters and states of the model, which is known as the backpropagation through time algorithm (Werbos, 1988).
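As a small worked example of Eq. (13), the cross-entropy between a one-of-K encoded label and a softmax output can be computed as follows; the class count and probability values are arbitrary.

```python
import numpy as np

def cross_entropy(y_hat, y_out, eps=1e-12):
    """Eq. (13): y_hat is a one-of-K encoded label, y_out is the model output y(L)."""
    return -np.sum(y_hat * np.log(y_out + eps))

y_hat = np.array([0, 0, 1, 0])             # one-of-K label, K = 4 emotion classes
y_out = np.array([0.1, 0.2, 0.6, 0.1])     # softmax output (sums to 1)
print(cross_entropy(y_hat, y_out))         # -log(0.6) ~ 0.511
```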
Gradient descent or one of its variants is used to update the parameters of the model using the computed gradients. A per-parameter adaptive variant of gradient descent called RMSProp (Dauphin, de Vries, & Bengio, 2015; Tieleman & Hinton, 2012) was used in this paper, which uses gradient information to adjust the learning rate as in Eqs. (14) and (15):

r := η r + (1 − η) (∂C/∂w)^2    (14)
w := w − α (∂C/∂w) / √(r + ε)    (15)

where r is a leaky moving average of the squared gradient, and η and α are hyperparameters denoting the decay rate and the learning rate respectively.
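A minimal sketch of the RMSProp update in Eqs. (14) and (15) follows; the decay rate, learning rate and ε below are common illustrative defaults, not necessarily the values used in this paper.

```python
import numpy as np

def rmsprop_update(w, grad, r, eta=0.9, alpha=1e-3, eps=1e-8):
    """Apply Eqs. (14) and (15) to one parameter array w given its gradient."""
    r = eta * r + (1.0 - eta) * grad ** 2        # Eq. (14): leaky average of squared gradient
    w = w - alpha * grad / np.sqrt(r + eps)      # Eq. (15): per-parameter scaled step
    return w, r

w = np.zeros(5)
r = np.zeros(5)
for step in range(100):
    grad = 2.0 * (w - 1.0)        # gradient of a toy quadratic cost ||w - 1||^2
    w, r = rmsprop_update(w, grad, r)
print(w)                          # moves toward the minimizer at 1.0
```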
3.3. Regularization

Deep architectures are prone to overfitting, which makes regularization an essential ingredient in their success. In this paper, three regularization techniques were used: l2 weight decay, which penalizes the l2 norm of the weights of the model; dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014), which stochastically omits units in the model during training, preventing co-adaptation of units; and Batch Normalization (BatchNorm) (Ioffe & Szegedy, 2015), which aims to reduce the internal covariate shift in deep architectures by normalizing the means and standard deviations of layer preactivations as in Eq. (16):

BatchNorm(z^{(l)}; γ, β) = β + γ (z^{(l)} − E[z^{(l)}]) / √(Var[z^{(l)}] + ε)    (16)

where γ ∈ R^{n_o} and β ∈ R^{n_o} are model parameters that determine the mean and standard deviation of the normalized layer preactivations respectively, and E and Var are estimates of the sample mean and sample variance of the preactivations.

Unlike l2 weight decay, which is employed by simply modifying the cost function, dropout and BatchNorm require modifying the architecture of the model, in that BatchNorm can be treated as an additional layer added before the nonlinearity layer, while dropout is applied after the nonlinearity layer.
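The per-feature normalization of Eq. (16) computed over a mini-batch can be sketched as follows; in practice γ and β are learned, whereas here they are fixed placeholders.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Eq. (16): normalize each preactivation over the mini-batch (axis 0),
    then rescale by gamma and shift by beta."""
    mean = z.mean(axis=0)                 # estimate of the sample mean
    var = z.var(axis=0)                   # estimate of the sample variance
    z_hat = (z - mean) / np.sqrt(var + eps)
    return beta + gamma * z_hat

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(32, 64))   # mini-batch of 32 preactivation vectors
out = batch_norm(z, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])    # ~0 means, ~1 standard deviations
```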
4. Proposed speech emotion recognition system

Fig. 1 is a sketch of the proposed SER system, which follows a frame-based processing formulation that utilizes Fourier-transform based filter bank speech spectrograms and a deep multi-layered neural network to predict emotion class probabilities for each frame in the input utterance.

Let X ∈ R^{N×T} be a speech utterance or speech stream sliced into a time-series sequence of T frames, each of which is an R^N vector of audio features. The aim is to rely on minimal speech processing, and thus each frame is represented by N Fourier-transform based log Mel-scaled filter banks. The goal of the model is to predict p(y_t | x), where x ∈ X is a number of concatenated frames, x_{t-l} ∥ · · · ∥ x_t ∥ · · · ∥ x_{t+r}, where x_t is the target frame, l is the number of past context frames and r is the number of future context frames.
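The frame-based formulation can be illustrated with the following sketch, which stacks each target frame with l past and r future frames from a filter bank matrix X ∈ R^{N×T} and applies a frame-level classifier to every stacked input; the context sizes, the edge padding by frame repetition and the dummy classifier are assumptions made only for this illustration.

```python
import numpy as np

def stack_context(X, l=2, r=2):
    """X has shape (N, T): N filter banks by T frames. Returns (T, (l+1+r)*N),
    one concatenated input x_{t-l} || ... || x_t || ... || x_{t+r} per frame.
    Utterance edges are handled here by repeating the first/last frame."""
    N, T = X.shape
    padded = np.concatenate([np.repeat(X[:, :1], l, axis=1), X,
                             np.repeat(X[:, -1:], r, axis=1)], axis=1)
    return np.stack([padded[:, t:t + l + 1 + r].T.reshape(-1) for t in range(T)])

def frame_posteriors(inputs, classify):
    """Apply a frame-level classifier (e.g. the DNN sketched earlier) to every frame."""
    return np.stack([classify(x) for x in inputs])

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))                   # 40 log Mel filter banks, 300 frames
inputs = stack_context(X, l=2, r=2)
print(inputs.shape)                              # (300, 200): stacked input per frame
# Dummy classifier stand-in: uniform probabilities over 4 emotion classes.
posteriors = frame_posteriors(inputs, lambda x: np.full(4, 0.25))
print(posteriors.shape)                          # (300, 4): p(y_t | x) for every frame
```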
Table 1: Test accuracy and UAR of various ConvNet architectures. FC(n_o) denotes a fully-connected layer of n_o units followed by BatchNorm, ReLUs and dropout. Conv(m × j × k) and Conv1D(m × j × k) denote a spatial convolutional layer and a temporal convolutional layer respectively, of m filters each of size j × k with a stride of 2, followed by BatchNorm and ReLUs. Softmax(n_o) denotes a softmax output layer of n_o units followed by a softmax operation. (Columns: Architecture; Test Accuracy (%); Test UAR (%).)

Table 2: Test accuracy and UAR for various network architectures. (Columns: Model; Test Accuracy (%); Test UAR (%).)

7. Discussion

Fig. 5: Input speech utterances (top) and corresponding aligned output (below) of the proposed SER system for a number of utterances from the test subset. The output is the posterior class probabilities p(y_t) denoting the confidence of the model. Transcripts: (a) "Oh, laugh at me all you like but why does this happen every night she comes back? She goes to sleep in his room and his memorial breaks in pieces. Look at it, Joe, look.": Angry. (b) "I will never forgive you. All I'd done was sit around wondering if I was crazy waiting so long, wondering if you were thinking about me.": Happy. (c) "Okay. So I am putting out the pets, getting the car out the garage.": Neutral. (d) "They didn't die. They killed themselves for each other. I mean that, exactly. Just a little more selfish and they would all be here today.": Sad. (e) "Oh yeah, that would be. Well, depends on what type of car you had, though too. I guess it would be worth it. Helicopter. Yeah, helicopter. There is a helipad there, right? Yeah, exactly.": Happy.

Table 3: SER results reported in prior work on the IEMOCAP database. Note that differences in data subsets used and other experiment conditions should be taken into consideration when comparing the following results against each other; cf. references for more details. (Columns: Method; Test Accuracy (%); Test UAR (%); Notes.)
it does not depend on future context. Moreover, the system is able to deal with utterances of arbitrary length with no degradation in performance. Furthermore, the system can handle utterances that contain more than one emotion class, as demonstrated in Fig. 5(e), which would not be possible in an utterance-based formulation.
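Although the model emits a posterior per frame, an utterance-level decision can be derived by aggregating the frame posteriors, for example by averaging them over time as sketched below; this particular aggregation is an illustrative assumption rather than a procedure prescribed by the paper.

```python
import numpy as np

def utterance_decision(frame_posteriors):
    """frame_posteriors has shape (T, K): p(y_t) for each of T frames.
    Average over time and pick the most probable class."""
    mean_posterior = frame_posteriors.mean(axis=0)
    return int(np.argmax(mean_posterior)), mean_posterior

rng = np.random.default_rng(0)
T, K = 250, 4
posteriors = rng.dirichlet(np.ones(K), size=T)   # stand-in for per-frame model outputs
label, confidence = utterance_decision(posteriors)
print(label, confidence)
```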
8. Conclusion

Various deep learning architectures were explored on a Speech Emotion Recognition (SER) task. The experiments conducted illuminate how feed-forward and recurrent neural network architectures and their variants could be employed for paralinguistic speech recognition, particularly emotion recognition. Convolutional Neural Networks (ConvNets) demonstrated better discriminative performance compared to the other architectures. As a result of our exploration, the proposed SER system, which relies on minimal speech processing and end-to-end deep learning in a frame-based formulation, yields state-of-the-art results on the IEMOCAP database for speaker-independent SER.

Future work can be pursued in several directions. The proposed SER system can be integrated with automatic speech recognition, employing joint knowledge of the linguistic and paralinguistic components of speech to achieve a unified model for speech processing. More generally, observations made in this work as a result of exploring various architectures could be beneficial for devising further architectural innovations in deep learning that can exploit the advantages of current models and address their limitations.

Acknowledgments

This research was funded by the Vice-Chancellor's Ph.D. Scholarship (VCPS) from RMIT University. We gratefully acknowledge the support of NVIDIA Corporation with the donation of one of the Tesla K40 GPUs used in this research.

References

Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533–1545.
Arias, J. P., Busso, C., & Yoma, N. B. (2013). Energy and F0 contour modeling with functional data analysis for emotional speech detection. In Interspeech (pp. 2871–2875).
Ayadi, M. E., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44, 572–587.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42, 335–359. http://dx.doi.org/10.1007/s10579-008-9076-6.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human–computer interaction. IEEE Signal Processing Magazine, 18, 32–80. http://dx.doi.org/10.1109/79.911197.
Dauphin, Y., de Vries, H., & Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. In Advances in neural information processing systems (pp. 1504–1512).
Eyben, F., Huber, B., Marchi, E., Schuller, D., & Schuller, B. (2015). Real-time robust recognition of speakers' emotions and characteristics on mobile platforms. In 2015 international conference on affective computing and intelligent interaction (ACII) (pp. 778–780). http://dx.doi.org/10.1109/ACII.2015.7344658.
Eyben, F., Scherer, K., Schuller, B., Sundberg, J., Andre, E., Busso, C., et al. (2015). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 190–202. http://dx.doi.org/10.1109/TAFFC.2015.2457417.
Fayek, H. M., Lech, M., & Cavedon, L. (2015). Towards real-time speech emotion recognition using deep neural networks. In 2015 9th international conference on signal processing and communication systems (ICSPCS) (pp. 1–5). http://dx.doi.org/10.1109/ICSPCS.2015.7391796.
Fayek, H. M., Lech, M., & Cavedon, L. (2016a). Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels. In 2016 international joint conference on neural networks (IJCNN) (pp. 566–570). http://dx.doi.org/10.1109/IJCNN.2016.7727250.
Fayek, H. M., Lech, M., & Cavedon, L. (2016b). On the correlation and transferability of features between automatic speech recognition and speech emotion recognition. In Interspeech 2016 (pp. 3618–3622). http://dx.doi.org/10.21437/Interspeech.2016-868.
Fernandez, R. (2004). A computational model for the automatic recognition of affect in speech. (Ph.D. thesis), School of Architecture and Planning, Massachusetts Institute of Technology.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the fourteenth international conference on artificial intelligence and statistics, AISTATS 2011.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Graves, A. (2008). Supervised sequence labelling with recurrent neural networks. (Ph.D. thesis), Technische Universitat Munchen.
Graves, A., Mohamed, A.-R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649). http://dx.doi.org/10.1109/ICASSP.2013.6638947.
Han, K., Yu, D., & Tashev, I. (2014). Speech emotion recognition using deep neural network and extreme learning machine. In Interspeech 2014.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). http://dx.doi.org/10.1109/CVPR.2016.90.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 82–97. http://dx.doi.org/10.1109/MSP.2012.2205597.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456).
Kim, Y., Lee, H., & Provost, E. M. (2013). Deep learning for robust feature generation in audiovisual emotion recognition. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 3687–3691). http://dx.doi.org/10.1109/ICASSP.2013.6638346.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Neural information processing systems (pp. 1097–1105).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., et al. (1990). Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems 2 (pp. 396–404).
Lee, C.-C., Mower, E., Busso, C., Lee, S., & Narayanan, S. (2011). Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53, 1162–1171.
Li, L., Zhao, Y., Jiang, D., Zhang, Y., Wang, F., Gonzalez, I., et al. (2013). Hybrid deep neural network–hidden Markov model (DNN-HMM) based speech emotion recognition. In 2013 humaine association conference on affective computing and intelligent interaction (ACII) (pp. 312–317).
Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16, 2203–2213.
Mariooryad, S., & Busso, C. (2013). Exploring cross-modality affective reactions for audiovisual emotion recognition. IEEE Transactions on Affective Computing, 4, 183–196. http://dx.doi.org/10.1109/T-AFFC.2013.11.
Mohamed, A., Dahl, G., & Hinton, G. (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20, 14–22. http://dx.doi.org/10.1109/TASL.2011.2109382.
Petta, P., Pelachaud, C., & Cowie, R. (2011). Emotion-oriented systems. In The humaine handbook.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. In Encyclopedia of database systems (pp. 532–538). Springer.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. http://dx.doi.org/10.1016/j.neunet.2014.09.003.
Schuller, B., Steidl, S., & Batliner, A. (2009). The Interspeech 2009 emotion challenge. In Interspeech, vol. 2009 (pp. 312–315).
Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C. A., et al. (2010). The Interspeech 2010 paralinguistic challenge. In Interspeech (pp. 2794–2797).
Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., & Wendemuth, A. (2009). Acoustic emotion recognition: A benchmark comparison of performances. In IEEE workshop on automatic speech recognition and understanding, 2009 (pp. 552–557).
Shah, M., Chakrabarti, C., & Spanias, A. (2014). A multi-modal approach to emotion recognition using undirected topic models. In IEEE international symposium on circuits and systems (ISCAS) (pp. 754–757).
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., & Schuller, B. (2011). Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5688–5691).
Tahon, M., & Devillers, L. (2016). Towards a small set of robust acoustic features for emotion recognition: Challenges. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 16–28.
Tian, L., Lai, C., & Moore, J. (2015). Recognising emotions in dialogues with disfluencies and non-verbal vocalisations. In R. Lickley (Ed.), The 7th workshop on disfluency in spontaneous speech.
Tian, L., Moore, J., & Lai, C. (2015). Emotion recognition in spontaneous and acted dialogues. In 2015 international conference on affective computing and intelligent interaction (ACII) (pp. 698–704).
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.
Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48, 1162–1181.
Vlasenko, B., Schuller, B., Wendemuth, A., & Rigoll, G. (2007). Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing. In Affective computing and intelligent interaction (pp. 139–147). Springer.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–356. http://dx.doi.org/10.1016/0893-6080(88)90007-X.