Multimodal Speech Emotion Recognition and Ambiguity Resolution
Abstract—Identifying emotion from speech is a non-trivial task owing to the ambiguous definition of emotion itself. In this work, we adopt a feature-engineering based approach to tackle the task of speech emotion recognition. Formalizing our problem as a multi-class classification problem, we compare the performance of two categories of models. For both, we extract eight hand-crafted features from the audio signal. In the first approach, the extracted features are used to train six traditional machine learning classifiers, whereas the second approach is based on deep learning, wherein a baseline feed-forward neural network and an LSTM-based classifier are trained over the same features. In order to resolve ambiguity in communication, we also include features from the text domain. We report accuracy, F-score, precision and recall for the different experiment settings in which we evaluated our models. Overall, we show that lighter machine learning models trained over a few hand-crafted features are able to achieve performance comparable to the current deep learning based state-of-the-art method for emotion recognition.

Index Terms—multimodal speech emotion recognition, machine learning, deep learning

I. INTRODUCTION

Communication is key to human existence, and more often than not we have to deal with ambiguous situations. For instance, the phrase “This is awesome” could be said in either a happy or a sad setting. Humans are able to resolve such ambiguity in most cases because we efficiently combine information from multiple domains (henceforth referred to as modalities), namely speech, text and visual. With the rise of deep learning algorithms, there have been multiple attempts to tackle the task of Speech Emotion Recognition (SER), as in [1], [2] and [3]. However, this rise has made practitioners rely more on the raw power of deep learning models than on domain knowledge for constructing meaningful features and building models that are both accurate and interpretable. In this work, we explore the implications of hand-crafted features for SER and compare the performance of lighter machine learning models with heavily data-reliant deep learning models. Furthermore, we also combine features from the textual modality to understand the correlation between the different modalities and to aid ambiguity resolution. More formally, we pose our task as a multi-class classification problem and employ two classes of models to solve it. For both approaches, we first extract hand-crafted features from the time domain of the audio signal and train the respective models on them.

In the first approach, we train traditional machine learning classifiers, namely Random Forests, Gradient Boosting, Support Vector Machines, Naive Bayes and Logistic Regression. In the second approach, we build a Multi-Layer Perceptron and an LSTM [4] classifier to recognize emotion given a speech signal. The models are evaluated on the IEMOCAP [5] dataset under three settings, namely Audio-only, Text-only and Audio + Text.¹

The rest of the paper is organized as follows: Section II describes existing methods in the literature for the task of speech emotion recognition; Section III gives an overview of the dataset used in this work and the pre-processing steps applied before feature extraction; Section IV describes the proposed models and implementation details; results are reported in Section V, followed by the conclusion and future scope of this work in Section VI.

¹ Code for all the experiments is available at http://tinyurl.com/y55dlc3m
II. LITERATURE REVIEW

In this section, we review some of the work that has been done in the field of speech emotion recognition (SER). The task of SER is not new and has been studied in the literature for quite some time. A majority of the early approaches ([6], [7]) used Hidden Markov Models (HMMs) [8] for identifying emotion from speech. The recent introduction of deep neural networks to the domain has also significantly improved the state-of-the-art performance. For instance, [3] and [9] use recurrent autoencoders to solve the task. More recently, methods such as Tensor Fusion Networks [10] and Low-rank Multimodal Fusion [11] have been proposed to combine features from multiple domains more efficiently than trivial concatenation.

This work aims to provide a comparative study between 1) deep learning based models that are trained end-to-end, and 2) lighter machine learning and deep learning based models trained over hand-crafted features. We also investigate the information residing in the individual modalities and how their combination affects performance.

III. DATASET

In this work, we use the IEMOCAP dataset [5], released in 2008 by researchers at the University of Southern California (USC). It contains five recorded sessions of conversations from ten speakers and amounts to nearly 12 hours of audio-visual information along with transcriptions. It is annotated with eight categorical emotion labels, namely anger, happiness, sadness, neutral, surprise, fear, frustration and excited. It also contains dimensional labels such as activation and valence values from 1 to 5; however, these are not used in this work.

The dataset is already split into multiple utterances for each session, and we further split each utterance file to obtain one wav file per sentence. This was done using the start and end timestamps provided for the transcribed sentences. The procedure results in a total of ∼10K audio files, which are then used to extract features.

IV. METHODOLOGY

This section describes the data pre-processing steps, followed by a detailed description of the features extracted and the two classes of models applied to the classification problem.

A. Data Pre-processing

a) Audio: A preliminary frequency analysis revealed that the dataset is not balanced: the emotions “fear” and “surprise” are under-represented, so we use upsampling techniques to alleviate the issue. We then merged examples from the “happy” and “excited” classes, as “happy” was under-represented and the two emotions closely resemble each other. In addition, we discard examples labeled “others”; these correspond to examples that were ambiguous even for a human annotator. Applying the aforementioned operations resulted in 7837 examples in total. The final sample distribution for each of the emotions is shown in Table I.

TABLE I: Number of examples for each emotion

Class      Count
Angry        860
Happy       1309
Sad         2327
Fear        1007
Surprise     949
Neutral     1385
Total       7837

b) Text: The available transcriptions were first normalized to lowercase and any special symbols were removed.
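To make the text pipeline concrete, a minimal sketch of this normalization and of the TFIDF featurization used later as the text features is given below; the regular expression and the vectorizer defaults are assumptions of this sketch, not settings reported in the paper.

# Minimal sketch of transcript pre-processing and TFIDF featurization.
# The cleaning regex and the TfidfVectorizer settings are illustrative assumptions.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize(transcript: str) -> str:
    """Lowercase the transcript and drop any special symbols."""
    transcript = transcript.lower()
    return re.sub(r"[^a-z0-9\s']", " ", transcript)

transcripts = ["This is awesome!", "I can't believe it..."]  # toy examples
cleaned = [normalize(t) for t in transcripts]

# The resulting TFIDF vectors later serve as the text feature vectors.
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(cleaned)  # shape: (n_sentences, vocab_size)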
B. Feature Extraction

We now describe the hand-crafted features used to train both the ML- and the DL-based models.

1) Audio Features:

a) Pitch: Pitch is important because the waveforms produced by our vocal cords change depending on our emotion. Many algorithms for estimating the pitch signal exist; we use the most common method, based on the autocorrelation of center-clipped frames [12]. Formally, the input signal y[n] is center-clipped to give a resultant signal y_clipped[n]:

    y_clipped[n] = y[n] − C_l   if y[n] ≥ C_l
                 = 0            if |y[n]| < C_l        (1)
                 = y[n] + C_l   if y[n] ≤ −C_l

Typically, C_l is set to roughly half the mean of the input signal, and [·] denotes the discrete nature of the input signal. The autocorrelation of the clipped signal y_clipped is then computed and normalized, and its peak values are associated with the pitch of the given input y[n]. We found that center-clipping the input signal results in more distinct autocorrelation peaks.
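To make the pitch computation concrete, a minimal sketch of center-clipping followed by normalized autocorrelation is given below; the frame length, hop size and clipping level are illustrative assumptions rather than the values used in the paper.

# Per-frame pitch estimation via autocorrelation of center-clipped frames.
# Frame/hop sizes and the clipping fraction are assumptions for this sketch.
import numpy as np

def pitch_track(y, sr, frame_len=2048, hop=512, clip_frac=0.5):
    """Return a per-frame pitch estimate (Hz) for a mono signal y at rate sr."""
    pitches = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len]
        c_l = clip_frac * np.mean(np.abs(frame))  # plays the role of C_l in Eq. (1)

        # Center-clip the frame as in Eq. (1).
        clipped = np.where(frame >= c_l, frame - c_l,
                  np.where(frame <= -c_l, frame + c_l, 0.0))

        # Normalized autocorrelation of the clipped frame.
        ac = np.correlate(clipped, clipped, mode="full")[frame_len - 1:]
        ac = ac / (ac[0] + 1e-8)

        # Pick the strongest peak in a plausible pitch range (50-400 Hz).
        lo, hi = int(sr / 400), int(sr / 50)
        lag = lo + np.argmax(ac[lo:hi])
        pitches.append(sr / lag)
    return np.array(pitches)

# Usage: y, sr = librosa.load("sentence.wav", sr=None); f0 = pitch_track(y, sr)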
b) Harmonics: In the emotional state of anger, or for stressed speech, there are additional excitation signals other than pitch ([13], [14]). This additional excitation is apparent in the spectrum as harmonics (see Figure 1) and cross-harmonics. We calculate the harmonic content using the median-filtering based separation described in [15], in which median filters of a given window size l are applied to the spectrogram.

The pause (silence) feature captures the proportion of the signal that is effectively silent:

    Pause = Pr(y[n] < t)        (5)

where t is a carefully chosen threshold of ≈ 0.4 · E, E being the RMSE (root-mean-square energy) of the signal.

e) Central moments: Finally, we use the mean and standard deviation of the amplitude of the signal to incorporate “summarized” information about the input.
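A compact sketch of how the harmonic, pause and central-moment features could be computed with librosa and numpy is given below; librosa.effects.harmonic applies the median-filtering separation of [15], while summarizing the harmonic component by its mean and thresholding the magnitude for the pause feature are our own simplifications, not the paper's exact definitions.

# Illustrative computation of the harmonic, pause and central-moment features.
import numpy as np
import librosa

def audio_summary_features(path):
    y, sr = librosa.load(path, sr=None)

    # Harmonic component via median-filtering HPSS [15]; summarizing it by its
    # mean is an assumption of this sketch.
    harmonic = np.mean(librosa.effects.harmonic(y))

    # Pause: fraction of samples whose magnitude falls below ~0.4 * RMS energy (Eq. 5).
    rms = np.sqrt(np.mean(y ** 2))
    pause = np.mean(np.abs(y) < 0.4 * rms)

    # Central moments of the amplitude.
    amp_mean, amp_std = np.mean(y), np.std(y)

    return np.array([harmonic, pause, amp_mean, amp_std])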
    o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)        (9)

    h_t = o_t · σ_h(c_t)        (11)

where the initial values are c_0 = 0 and h_0 = 0, · denotes the element-wise product, t denotes the time step (each element in a sequence belongs to one time step), x_t is the input vector to the LSTM unit, f_t is the forget gate's activation vector, i_t is the input gate's activation vector, o_t is the output gate's activation vector, h_t is the hidden state vector (which is typically used to map a vector from the feature space to a lower-dimensional latent space), c_t is the cell state vector, and W, U and b are the weight and bias matrices, which are learned during training.

Fig. 4: LSTM classifier

Figure 4 shows the network implemented in this work. We feed the feature vectors as input to the network and finally pass the output of the LSTM through a softmax layer to obtain probability scores for each of the six emotion classes. Since we use feature vectors as input, we do not need a decoder network to transform the hidden representation back to the output space, thereby reducing the network size.
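A minimal PyTorch sketch of such a classifier is given below; the hidden size, dropout rate and single-layer configuration are illustrative assumptions (the actual hyperparameters are available in the released repository), not a re-implementation of the exact network in Figure 4.

# Sketch of an LSTM classifier over feature vectors, followed by a softmax layer.
# Hidden size, dropout and layer count are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, input_dim=8, hidden_dim=64, num_classes=6, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)        # regularizes the hidden space
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                         # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)                # h_n: (1, batch, hidden_dim)
        logits = self.out(self.dropout(h_n[-1]))
        return torch.softmax(logits, dim=-1)      # per-class probability scores

# Example: a batch of 4 sequences, each a single 8-dimensional feature vector.
probs = LSTMClassifier()(torch.randn(4, 1, 8))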
Fig. 5: Performance of different models; E1: Ensemble (RF + XGB + MLP); E2: Ensemble (RF + XGB + MLP + MNB + LR)

Fig. 6: Confusion matrices of our ensemble models; E1: Ensemble (RF + XGB + MLP); E2: Ensemble (RF + XGB + MLP + MNB + LR)
E. Experiments

Here, we describe the three different settings in which we conducted our experiments:

• Audio-only: In this setting, we train all the classifiers using only the audio feature vectors described earlier.

• Text-only: In this setting, we train all the classifiers using only the text feature vectors (TFIDF vectors).

• Audio+Text: In this setting, we fuse the feature vectors from the two modalities. Although methods have been proposed to fuse vectors from multiple modalities more efficiently, we simply concatenate the feature vectors from audio and text to obtain the combined feature vectors (a minimal sketch of this fusion follows the list). Through this experiment, we can infer how much information is contained in each of the modalities and how fusion influences the model's performance.
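The concatenation-based fusion can be sketched as follows; the array names, dimensionalities and use of scipy.sparse are our own illustrative choices.

# Illustrative concatenation of the 8-dimensional audio features with the
# sparse TFIDF text features; shapes and names are assumptions for this sketch.
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_samples = 4
audio_features = np.random.rand(n_samples, 8)               # hand-crafted audio features
text_features = csr_matrix(np.random.rand(n_samples, 300))  # TFIDF vectors

# Audio+Text setting: trivial fusion by feature concatenation.
fused = hstack([csr_matrix(audio_features), text_features]).tocsr()
print(fused.shape)  # (4, 308)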
F. Implementation Details

In this section, we describe the implementation details adopted in this work.

• We use librosa [23], a Python library, to process the audio files and extract features from them.

• We use scikit-learn [24] and xgboost [25], machine learning libraries for Python, to implement all the ML classifiers (RF, XGB, SVM, MNB, and LR) as well as the MLP (a brief sketch of such a setup follows this list).

• We use PyTorch [26] to implement the LSTM classifiers described earlier.

• In order to regularize the hidden space of the LSTM classifiers, we use a shut-off mechanism called dropout [27], in which a fraction of the neurons is not used for the final prediction. This has been shown to increase the robustness of the network and prevent overfitting.
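As an illustration of how these libraries fit together, a minimal sketch of an E1-style ensemble (RF + XGB + MLP, cf. Figure 5) is given below; the soft-voting combination rule and all hyperparameters are assumptions of this sketch, not the configuration used in the paper.

# Sketch of an RF + XGB + MLP ensemble, combined here by soft voting.
# Hyperparameters are library defaults, not the paper's settings.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

e1 = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("xgb", XGBClassifier()),
        ("mlp", MLPClassifier(max_iter=500)),
    ],
    voting="soft",  # average the predicted class probabilities
)
# e1.fit(X_train, y_train); e1.predict(X_test)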
We randomly split our dataset into a train (80%) and a test (20%) set. The same split is used for all the experiments to ensure a fair comparison. The LSTM classifiers were trained on an NVIDIA Titan X GPU for faster processing. We stop training when we see no improvement in validation performance for more than 10 epochs, where one epoch refers to one iteration over all the training samples. Different batch sizes were used for different models. Hyperparameters for all the models under the three experiment settings can be found in the released repository.

G. Evaluation Metrics

In this section, we first describe the evaluation metrics used and then report results for the three experiment settings.

a) Accuracy: The percentage of test samples that are classified correctly.

b) Precision: This measure tells us how many of the predictions for a class are actually present in the ground truth (i.e., the labels). It is calculated as:

    Precision = tp / (tp + fp)        (12)

c) Recall: This measure tells us how many of the correct labels are recovered in the predicted output. It is calculated as:

    Recall = tp / (tp + fn)        (13)

Here, tp, fp, and fn stand for true positives, false positives and false negatives respectively; these values can be computed from the confusion matrix.

d) F-score: The harmonic mean of precision and recall. This measure is included because accuracy alone is not a complete measure of a model's predictive power, whereas the F-score, balancing precision and recall, is better normalized.
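These metrics can be computed directly with scikit-learn, as sketched below; the toy labels and the macro averaging are assumptions of this sketch, since the averaging mode is not restated here.

# Computing accuracy, precision, recall and F-score for a multi-class problem.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 2, 1, 3, 4, 5, 2, 2]   # ground-truth emotion classes (toy data)
y_pred = [0, 2, 1, 3, 5, 5, 1, 2]   # model predictions (toy data)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print(confusion_matrix(y_true, y_pred))   # tp/fp/fn per class can be read off this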
We compare our best performing models with the current state-of-the-art method described in [2]. They employ three types of recurrent encoders, namely ARE, TRE and MDRE, denoting Audio-, Text- and Multimodal Dual Recurrent Encoders respectively. It is important to mention that [2] only considers four emotions for classification, namely angry, happy, sad and neutral, as opposed to six in our case. In order to present a fair comparison of our method with theirs, we also run the experiments for these four classes (models labeled 4-class in Figure 5).

V. RESULTS

In this section, we discuss the performance of the models described in Section IV. From Figure 5, we can see that our simpler and lighter ML models either outperform or are comparable to the much heavier current state-of-the-art on this dataset. A more detailed analysis follows.

a) Audio-only results: Results are especially interesting for this setting. The performance of the LSTM and ARE reveals that deep models indeed need a lot of information to learn features: the LSTM classifier trained on eight-dimensional features achieves very low accuracy compared to the end-to-end trained ARE. However, neither of them is able to beat the lighter E1 model (ensemble of RF, XGB and MLP), which was trained on the same eight-dimensional audio feature vectors. A look at the confusion matrix (Fig. 6a) reveals that detecting “neutral”, and distinguishing between “angry”, “happy” and “sad”, is the most difficult for the model.

b) Text-only results: We observe that the performance of all the models in this setting is similar. This could be attributed to the richness of TFIDF vectors, which are known to capture word-sentence correlations. We see from the confusion matrix (Fig. 6b) that our text-based models are able to distinguish the six emotions fairly well, as is the end-to-end trained TRE. We observe that “sad” is the toughest emotion for textual features to identify clearly.
c) Audio+Text results: We see that combining audio and text features gives us a boost of ∼14% across all the metrics. This is clear evidence of the strong correlation between text and speech features. This is also the only case in which the recurrent encoders perform slightly better in terms of accuracy, although at the cost of precision. The lower performance of E1 may be attributed to the trivial fusion method (concatenation) we use: a simple concatenation fed to an ML model still contains many modality-specific connections instead of the desired inter-modal connections. The promising result here is that combining features from both modalities indeed helped resolve the ambiguity observed for the modality-specific models, as shown in Fig. 6c. The textual features helped the correct classification of the “angry” and “happy” classes, whereas the audio features enabled the model to detect “sad” better.

Overall, we can conclude that our simple ML methods are robust, having achieved comparable performance even though they are modeled to predict six classes as opposed to four in previous work.

A. Most Important Features

In this section, we investigate which features contribute the most to prediction in this classification task. We chose the XGB model for this study and rank the eight audio features. We see that Harmonic, which is directly related to the excitation in signals, contributes the most. It is interesting to see that “silence”, captured by the Pause feature, is almost as significant as the standard deviation of the autocorrelated signal (related to pitch). The low contribution of the central moments is expected, as a signal is very diverse and such global, coarse features are unable to capture the nuances present in it.

Fig. 7: Most important audio features
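The ranking shown in Figure 7 can be reproduced in spirit from a trained XGBoost model's built-in feature importances, as sketched below; the feature names and the synthetic data are placeholders for illustration, not the actual features or data of this work.

# Ranking hand-crafted audio features with a trained XGBoost model.
# Feature names and data here are hypothetical placeholders.
import numpy as np
from xgboost import XGBClassifier

feature_names = ["pitch_mean", "pitch_std", "harmonic", "pause",
                 "amp_mean", "amp_std", "energy", "autocorr_peak"]

X = np.random.rand(200, 8)             # placeholder feature matrix
y = np.random.randint(0, 6, size=200)  # placeholder emotion labels

model = XGBClassifier().fit(X, y)
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")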
VI. CONCLUSION AND FUTURE WORK

In this work, we tackle the task of speech emotion recognition and study the contribution of different modalities towards ambiguity resolution on the IEMOCAP dataset. We compare both ML- and DL-based models and show that even lighter, more interpretable ML models can achieve performance close to that of DL-based models. We also show that ensembling multiple ML models leads to some improvement in performance. We only extract a handful of time-domain features from the audio signals; the audio feature space could be made even richer by including frequency-domain features such as Mel-Frequency Cepstral Coefficients (MFCC) [28] and spectral roll-off, as well as additional time-domain features such as the Zero Crossing Rate (ZCR) [29]. Better fusion methods such as TFN [10] and LMF [11] could also be employed to combine the speech and text vectors more effectively. It would also be interesting to see how the performance of the ML models scales relative to the DL models if more data were included.
REFERENCES

[1] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, “Speech emotion recognition using cnn,” in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 801–804, ACM, 2014.
[2] S. Yoon, S. Byun, and K. Jung, “Multimodal speech emotion recognition using audio and text,” in 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 112–118, IEEE, 2018.
[3] K. Han, D. Yu, and I. Tashev, “Speech emotion recognition using deep neural network and extreme learning machine,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[6] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, “Speech emotion recognition using hidden markov models,” in Seventh European Conference on Speech Communication and Technology, 2001.
[7] B. Schuller, G. Rigoll, and M. Lang, “Hidden markov model-based speech emotion recognition,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), vol. 2, pp. II–1, IEEE, 2003.
[8] L. R. Rabiner and B.-H. Juang, “An introduction to hidden markov models,” IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[9] Y. Kim, H. Lee, and E. M. Provost, “Deep learning for robust feature generation in audiovisual emotion recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687–3691, IEEE, 2013.
[10] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” arXiv preprint arXiv:1707.07250, 2017.
[11] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Zadeh, and L.-P. Morency, “Efficient low-rank multimodal fusion with modality-specific factors,” arXiv preprint arXiv:1806.00064, 2018.
[12] M. Sondhi, “New methods of pitch extraction,” IEEE Transactions on Audio and Electroacoustics, vol. 16, no. 2, pp. 262–266, 1968.
[13] H. Teager and S. Teager, “Evidence for nonlinear sound production mechanisms in the vocal tract,” in Speech Production and Speech Modelling, pp. 241–261, Springer, 1990.
[14] G. Zhou, J. H. Hansen, and J. F. Kaiser, “Nonlinear feature based classification of speech under stress,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 201–216, 2001.
[15] D. Fitzgerald, “Harmonic/percussive separation using median filtering,” 2010.
[16] Y. Amit, D. Geman, and K. Wilder, “Joint induction of shape features and tree classifiers,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 1300–1305, 1997.
[17] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[18] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. Springer Series in Statistics, New York, 2001.
[19] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
[20] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[21] A. M. Kibriya, E. Frank, B. Pfahringer, and G. Holmes, “Multinomial naive bayes for text categorization revisited,” in Australasian Joint Conference on Artificial Intelligence, pp. 488–499, Springer, 2004.
[22] G. King and L. Zeng, “Logistic regression in rare events data,” Political Analysis, vol. 9, no. 2, pp. 137–163, 2001.
[23] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th Python in Science Conference, pp. 18–25, 2015.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] T. Chen, “Scalable, portable and distributed gradient boosting (gbdt, gbrt or gbm) library, for python, r, java, scala, c++ and more. runs on single machine, hadoop, spark, flink and dataflow,” 2014.
[26] Facebook, “Pytorch,” 2017.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[28] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[29] F. Gouyon, F. Pachet, O. Delerue, et al., “On the use of zero-crossing rate for an application of classification of percussive sounds,” in Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, 2000.