https://doi.org/10.1007/s10772-022-09985-6
Received: 16 April 2021 / Accepted: 17 June 2022 / Published online: 8 July 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Speech emotion recognition is one of the fastest growing areas of interest in the field of affective computing. Emotion detection aids human–computer interaction and finds application in a wide gamut of sectors, ranging from healthcare to retail to education. The present work strives to provide a speech emotion recognition framework that is both reliable and efficient enough to work in real-time environments. Speech emotion recognition can be performed using linguistic as well as paralinguistic aspects of speech; this work focusses on the latter, using non-lexical or paralinguistic attributes of speech like pitch, intensity and mel-frequency cepstral coefficients to train supervised machine learning models for emotion recognition. A combination of prosodic and spectral features is used for experimental analysis, and classification is performed using algorithms like Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron. The choice of these ML models was based on the swiftness with which they could be trained, making them more suitable for real-time applications. Comparative analysis of the models reveals SVM and MLP to be the best performers, with accuracies of 77.86% and 79.62% respectively. The performance of these classifiers is compared with benchmark results in the literature, and a significant improvement over state-of-the-art models is presented. The observations and findings of this work can be applied to design real-time emotion recognition frameworks and to develop applications and technologies for various domains.
Keywords Emotion recognition · Affective computing · Support vector machine · Multilayer perceptron · Paralinguistic
acoustic features
the emotional state of patients during the counselling session would enable the psychiatrist to converse in a better manner, thereby enhancing the efficacy of such sessions. Realizing the wide-ranging applications of speech emotion recognition, it becomes critical to understand and develop systems that work efficiently and can be used in real-time environments.

1.1 Speech emotion recognition

Emotion recognition from speech is generally performed in one of two ways. The first is to extract the linguistic (lexical) information contained in speech, that is, to use Automatic Speech Recognition (ASR) software to capture words or phrases from speech. These words and phrases can then be looked up in a dictionary of salient words and phrases designed for this purpose, where each word or phrase is associated with a set of discrete emotional states. The presence of the word 'great', for example, can be mapped to positive emotional states like happiness or surprise in the dictionary. The other way of detecting emotion from speech relies on non-lexical features of speech like pitch, intensity and audio quality aspects such as jitter and shimmer. In this approach, we try to correlate the pronunciation of words articulated in different emotional states with their corresponding emotions. In the present work, the latter approach has been taken to design an SER system. It is also possible to design an SER system based on non-lexical acoustic features that is assisted by an ASR contributing to the classification using lexical aspects of speech. Such a framework would, however, need a corpus of spontaneous speech audio with corresponding emotional state information.

A good number of corpora exist for emotion recognition from speech. Of these, most are acted SER databases, in which a handful of actors articulate a fixed statement in a given set of emotional states. It is noted that actors often exaggerate emotional speech, making the task of emotion recognition harder in natural or spontaneous speech.

1.2 Paralinguistic acoustic features

As discussed earlier, speech contains linguistic (explicit) as well as paralinguistic (implicit) components. The paralinguistic aspects of speech include pronunciation, loudness of speech, intonation, shimmer, jitter, harmonics-to-noise ratio (HNR) and variations in pitch and intensity. Since the use of paralinguistic acoustic features is independent of lexical content, an SER system trained on a corpus of English sentences can very well be used to detect emotion from speech in other languages. Acoustic features are broadly categorized into two types based on their temporal structure, namely segmental and supra-segmental. Segmental features are computed once for every time frame (20-30 ms) windowed from an utterance. The resultant dimension of a segmental feature is thus equal to the number of time frames the speech utterance is windowed into. Supra-segmental features, on the other hand, are computed once for every utterance. Some examples of segmental features include frame intensity, mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs), while fundamental frequency, shimmer, jitter and speech rate are good examples of supra-segmental features. Vector features can be further classified into low-level descriptors (LLDs) and functionals. LLDs are the set of all segmental and supra-segmental features. Functionals are obtained from LLDs by applying statistical measures like mean, maximum, variance and kurtosis (Anagnostopoulos et al., 2015).

1.3 Machine learning for speech emotion recognition

Classifying speech into emotions involves the extraction of useful and meaningful patterns from audio signals and learning to identify these patterns in data. This is done through machine learning techniques. A number of pattern recognition algorithms can be used for classification. Previous works in this area have used Decision Trees, Support Vector Machines, Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and other soft computing approaches (Agrawal & Christopher, 2020; Agrawal et al., 2021; Gupta et al., 2020). In this paper, we experiment with Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron as models for classification. Like any other typical machine learning problem, the task of emotion recognition is broken down into three main steps: feature extraction, feature subset selection and classification. This study uses the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone & Russo, 2018) to classify speech into one of eight possible emotions: happy, sad, neutral, calm, disgust, surprised, anger and fearful. The performance of SER systems largely depends on the features extracted from speech. The most challenging aspect of speech emotion recognition is finding the golden set of features that can discriminate between emotions correctly and yet not overfit the machine learning model. Mel-frequency Cepstral Coefficients (MFCCs), Linear Frequency Cepstral Coefficients (LFCCs), pitch, intensity and formants, along with their derivatives and double derivatives, have been extracted in the feature extraction phase. In the feature selection phase, mutual information between each feature and the target attribute is computed so as to rank the features in decreasing order of importance. The top 25% of features are then retained and classification is performed on this subset of features. Feature selection is an essential step in the machine learning pipeline. Feature selection methods are
primarily of three types: Filter, Wrapper and Embedded. In this study, a Filter-based approach is used, as it is simpler, faster and independent of the choice of classifier. For the classification part, results have been presented for the two best performing classifiers, i.e., a support vector machine with a Gaussian kernel and a multilayer perceptron. The optimal hyperparameters of the considered models are obtained with the help of grid search and, finally, overall and class-wise validation accuracies and ROC curves are presented and used to evaluate the ML models considered in the study.

The rest of the paper is organised as follows. Section 2 reviews the work done in the literature pertaining to speech emotion recognition. Section 3 describes the framework in detail: the methodology, the different components of the pipeline, and the implementation details ranging from feature extraction to data pre-processing to classification, along with a brief description of the performance metrics used to evaluate the ML models. The results are summarised in Sect. 4, together with comparisons between the accuracy of the proposed method and the accuracies obtained by other techniques reported in the literature. Section 5 concludes the paper and identifies potential areas for improvement.

2 Literature review

In this section, a summary of earlier works in speech emotion recognition (SER) is presented. This includes a review of feature extraction methods, feature subset selection and classification methodologies, covering both the classical machine learning and deep learning approaches used by recent state-of-the-art SER systems.

2.1 Review on feature extraction methods

The most crucial step in the design of SER systems is deciding upon the set of acoustic features to extract. Most works to date have used Mel-frequency Cepstral Coefficients (MFCCs) and their derivatives, either alone or in combination with other spectral and prosodic features. Many researchers have remarked that MFCCs are more useful for emotion detection than other acoustic features like Linear Predictive Coefficients (LPCs), formants or energy. Many recent works have used MFCCs and their derivatives in conjunction with features like shimmer, jitter, energy, zero crossing rate and spectral centroids. Bhavan et al. (2019) used spectral sub-band centroids along with the first 13 MFCCs and their derivatives as the initial set of features. In Kwon et al. (2003), the authors selected band energies, MFCCs (along with their first and second derivatives), pitch, fundamental frequency, three formant frequencies (F0, F1, F2) and log-energy as features. While MFCCs are dominantly used in both speaker recognition and emotion detection tasks, Liu (2018) suggested that a feature set of Gammatone Frequency Cepstral Coefficients (GFCCs) performed better than MFCCs for emotion detection, reporting a 3.6% increase in overall accuracy when GFCCs were used in place of MFCCs. In Zhou et al. (2011), the authors drew a comparison between MFCCs and LFCCs for speaker recognition tasks, and noted that LFCCs consistently outperformed MFCCs, especially for female speech. LFCCs tend to capture more spectral information in the high frequency range, and hence perform better for female speech signals, since female speakers have shorter vocal tracts and therefore higher formant frequencies. Shegokar and Sircar (2016) performed the continuous wavelet transform (CWT) on audio signals and used Principal Component Analysis (PCA) for feature selection. In Pan et al. (2012), the authors used a combination of MFCCs, LPCCs, Mel Energy Spectrum Dynamic Coefficients (MEDCs), energy and pitch features for emotion recognition. In addition to these features, they also added the first and second derivatives of the above segmental features to extract useful information from temporally changing data. In Koduru et al. (2020), the authors used MFCCs, the Discrete Wavelet Transform (DWT), pitch, energy and zero-crossing rate in the feature extraction phase. Surampudi et al. (2019) provide novel approaches to extract acoustic features for sound events, which could be utilised for speech emotion recognition as well.

2.2 Review on classical machine learning techniques

Early works in the area of speech emotion recognition (1990s) made use of Maximum Likelihood Probability (MLP) classifiers and Linear Discriminant Classifiers (LDC). It was in the 2000s that Neural Network (NN) classifiers gained popularity for emotion recognition tasks (Rong et al., 2009). In recent times, however, the focus has shifted back to classical machine learning algorithms like Support Vector Machines (SVMs) over deep learning techniques, primarily because of the simplicity and ease of training SVMs when compared to computationally intensive models like ANNs and DNNs. Petrushin (2000) performed speech emotion recognition using the following machine learning approaches: a k-nearest neighbours classifier, a neural network classifier and an ensemble of neural networks. The ensemble consisted of 7 to 15 base models, and a majority-voting mechanism was used for the final classification. In Bhavan et al. (2019), the authors used an ensemble of SVMs for classification. They compared the overall accuracy of the model using a single SVM against a bagged ensemble of 20 SVMs and reported a 3% improvement in emotion recognition on the RAVDESS
dataset. For feature selection, they used a method called Boruta. Shegokar and Sircar (2016) used SVMs with different kernels (linear, quadratic, cubic and Gaussian) for classification, with the highest accuracy of 60.1% obtained using a quadratic kernel on the RAVDESS dataset. Pan et al. (2012) also used an SVM for the classification of speech into emotions. They experimented with different combinations of features like LPCCs, MFCCs and MEDCs. In 2020, Chen et al. (2020) proposed a two-layer fuzzy multiple random forest model that fuses personalized and non-personalized features during emotion recognition. Validating their framework on the CASIA and Berlin Emo-DB corpora, the authors demonstrated appreciable improvements of 1.39% to 7.64% and 4.06% to 4.30% over back propagation and random forest models respectively. In 2021, T. Tuncer et al. proposed a non-linear multi-level feature generation model based on a cryptographic structure called a shuffle box. First, the Tunable Q Wavelet Transform is used for multi-level feature generation, then iterative neighbourhood component analysis is employed to select discriminative features in the feature subset selection phase, and finally, high, medium and low-level coefficients are generated via the wavelet transform method. The RAVDESS, Emo-DB, SAVEE and EMOVO datasets are used to validate the system's performance, and accuracy scores of 87.43%, 90.09%, 84.79% and 79.08% are reported respectively. Kwon et al. (2003) experimented with different classifiers based on Hidden Markov Models (HMM), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and SVM. The authors used Forward Selection and Backward Elimination in the feature selection phase of the ML pipeline to identify the subset of features to be used in classification. Rong et al. (2009) proposed a novel method of feature selection for the problem of speech emotion recognition and compared their results with commonly used feature selection algorithms like PCA, Best First Search, Sequential Forward Selection and Multidimensional Scaling, to name a few.

2.3 Review on deep learning techniques

Many recent works in SER have gravitated towards the use of artificial neural networks (ANNs) for classification. The authors in Nantasri et al. (2020) extracted MFCCs and their derivatives from speech and trained a light-weight ANN to perform classification. What made the ANN light-weight was the small number of parameters and the limited number of features it used. This model used a neural network as a classifier after the feature extraction and selection phases. Similarly, Issa et al. (2020) also limited the use of their deep learning model to the classification phase. They used MFCCs, chromagram, Mel-scale spectrogram, Tonnetz representation and various other spectral contrast features as inputs to a Convolutional Neural Network (CNN). Following an incremental approach to classification, their proposed framework achieved commendable accuracy scores of 71.61%, 86.1% and 64.3% when validated on the RAVDESS, Emo-DB and IEMOCAP datasets respectively.

More recently, deep learning models have gained popularity as end-to-end frameworks for speech emotion recognition. The authors in Tzirakis et al. (2018) designed an end-to-end SER system using a CNN and fed the raw audio input to the model. The CNN took care of all the phases of emotion recognition, from feature extraction to selection to classification. In 2021, Mustaqeem et al. proposed a simple lightweight deep learning-based Self-Attention Module (SAM) for their SER system. The extracted features were given as input to the SAM to produce channel and spatial attention maps. This was followed by the use of a multilayer perceptron and a convolutional neural network in the channel attention and spatial attention branches to extract global cues and spatial information respectively. Both channel and spatial attention maps were used to generate attention weights. The SAM was placed in the middle of the convolutional neural network while performing end-to-end training. When tested on the IEMOCAP, RAVDESS and Emo-DB datasets, the SAM-CNN framework achieved average recall scores of 98%, 80% and 93% respectively. In another of their works (Kwon et al., 2021), the authors designed a one-dimensional dilated convolutional neural network based on a multi-learning strategy for their SER framework. This model attained accuracies of 73% and 90% on the IEMOCAP and Emo-DB datasets respectively. In 2021, M. Xu et al. proposed a multi-head attention mechanism for their SER system. The authors evaluated the performance of their attention-based CNN model on the IEMOCAP dataset. With an accuracy of 76.18%, this was noted to be a significant improvement over benchmark results for the given dataset. Moreover, they experimented on speech data by injecting 50 different types of commonly occurring noises to test the robustness of the model.

The speech emotion recognition works mentioned so far have used deep learning either as an end-to-end SER framework or as a simple classifier. A few works in SER utilise different deep learning architectures for different stages of the pipeline, such as feature subset selection. For instance, in a recent study (Kwon et al., 2021), Mustaqeem et al. proposed a framework based on key sequence segment selection and a radial basis function network. Here, a CNN is used in the feature subset selection phase to extract salient features from an input spectrogram. The selected features are then normalised and forwarded to a bidirectional long short-term memory (LSTM) neural network for classification. This framework achieved accuracies of 72.25%, 85.57% and 77.02% on the IEMOCAP, Emo-DB and RAVDESS datasets respectively.
signal is first segmented into frames of equal length (25 ms) with some overlap (50%). A Fast Fourier Transform (FFT) is performed for each frame. The resultant frequency response is then passed through a set of N triangular band-pass filters (the Mel filter bank), the output of which is then subjected to the Discrete Cosine Transform (DCT). The coefficients thus obtained are the MFCCs.

Linear Frequency Cepstral Coefficients (LFCC) Linear Frequency Cepstral Coefficients are obtained in a way similar to the MFCCs, with one major point of difference: the triangular band-pass filters in the filter bank used to derive LFCCs are equally spaced on the linear frequency scale. LFCCs are known to carry better information about vocal tract excitation in the higher frequency region of the human voice.

Spectral Centroids Spectral centroids are the weighted average of the frequencies present in a sound, with the magnitude of each frequency component in the spectrum of the signal acting as its weight. The spectral centroid gives an idea of where the centre of mass of the spectrum is located.

Formants Formants are acoustic resonances of the vocal tract. They are local maxima that can be observed in spectrograms as broad peaks, occurring at multiples of the fundamental frequency. F1 is the formant with the lowest frequency, followed by F2 and F3. The first two formants, F1 and F2, are important in determining the quality of vowels.

Pitch Speech is of two types: voiced and unvoiced. Unvoiced speech segments are characterized by a turbulent, noise-like sound; these segments are often articulations of consonants. Voiced segments, on the other hand, are articulations of vowels and are characterized by a dominant, low frequency signal. This fundamental frequency is what we call pitch.

Intensity Intensity is defined as the power carried by sound per unit area. It is an important feature in emotion recognition systems. High arousal emotions like anger and surprise tend to have higher intensity than low arousal emotions like sadness and boredom.

Since speech is a non-stationary signal, it does not suffice to consider the MFCC, LFCC, pitch and intensity values alone for each of the frames. It is also important to capture the rate of change of these features with time, to get a better understanding of the dynamics of the speech signal. We thus consider the velocity (delta) as well as the acceleration (delta-delta) features for MFCCs, LFCCs, intensity and pitch. The delta coefficient for the t-th frame is simply the difference between the value at the t-th frame and the value at the (t - 1)-th frame. The delta-delta coefficients are calculated in the same way by applying the same logic to the delta coefficients. In total, 20 MFCCs, 13 LFCCs, formants 1, 2 and 3 (F1, F2 and F3), spectral centroids, pitch and intensity features are extracted. In addition, the delta and delta-delta coefficients for MFCCs, LFCCs, intensity and pitch are also computed. Feature extraction has been done with the help of the Librosa (McFee et al., 2015) and Parselmouth (Jadoul et al., 2018) libraries. Six statistics, namely mean, minimum, maximum, standard deviation, kurtosis and skewness, are computed on the segmental features derived from MFCCs, LFCCs, intensity, pitch and formants 1, 2 and 3, while four statistics, namely minimum, maximum, mean and standard deviation, are calculated for the spectral centroids. This is done because the audio signals are of different lengths, so different numbers of frames must be processed together and converted into a feature vector of uniform length. Calculating functionals (statistics like mean, variance and higher moments) on the low-level descriptors serves this purpose. In our experimental setup, a total of 652 features were computed. Of these, the MFCC- and LFCC-related features numbered 360 and 234 respectively. Pitch and intensity features contributed 18 each, and 18 were formant-based features. Finally, 4 spectral centroid features were appended to the feature vector, making it a 652-dimensional vector.
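To make the extraction step concrete, the sketch below computes frame-level MFCCs with Librosa and collapses them into fixed-length functionals. It is a minimal illustration rather than the authors' exact pipeline: the frame and hop sizes follow the text, but the LFCC, pitch, intensity and formant tracks (extracted with Parselmouth in the study) are omitted, and the function name is hypothetical.

```python
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def mfcc_functionals(path, n_mfcc=20, frame_ms=25, hop_ms=12.5):
    """Frame-level MFCCs -> one fixed-length vector of summary statistics (functionals)."""
    y, sr = librosa.load(path, sr=None)               # keep the native sampling rate
    n_fft = int(sr * frame_ms / 1000)                 # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)                     # 50% overlap between frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)               # velocity coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)     # acceleration coefficients
    feats = np.vstack([mfcc, delta, delta2])          # shape: (3 * n_mfcc, n_frames)
    # six functionals per coefficient track: mean, min, max, std, kurtosis, skewness
    stats = [feats.mean(axis=1), feats.min(axis=1), feats.max(axis=1),
             feats.std(axis=1), kurtosis(feats, axis=1), skew(feats, axis=1)]
    return np.concatenate(stats)                      # one vector per utterance
```

With 20 coefficients, their delta and delta-delta tracks, and six functionals per track, this yields the 360 MFCC-related features mentioned above.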
3.2 Feature subset selection

Feature subset selection is a critical step in designing any kind of machine learning model. Its purpose is to reduce the dimensionality of the feature space, which not only makes the training of the model much faster but also, in most cases, improves the performance of the classifier. There are three primary approaches to feature subset selection: Filter, Wrapper and Embedded. While filter approaches work independently of the classifier being used, wrapper methods evaluate subsets of features based on their performance on specific models. Wrapper methods are computationally very expensive, as they involve iteratively searching for subsets of features (whose number grows exponentially with the number of features) and evaluating their performance using a classifier. While exhaustive approaches are guaranteed to give the optimal subset of features, it is computationally not feasible to apply them, since the number of subsets to be evaluated in a feature space of dimensionality N is of the order of 2^N. Sequential methods of feature selection like Forward Selection and Backward Elimination are also popular choices. These are, however, greedy approaches and do not guarantee the optimal subset of features. In embedded feature subset selection, feature selection is integrated into the machine learning model itself. It combines the advantages of both filter- and wrapper-based methods and sits in between the two approaches in terms of time complexity.

Table 1 provides a brief overview of the different feature selection methods that have been used by recent works in the area of speech emotion recognition. Shegokar and Sircar (2016) used Principal Component Analysis (PCA) with 3 and 10 principal directions prior to training linear, quadratic, cubic and Gaussian SVMs on the feature matrices. PCA has also
been employed in the SER system developed by Changqin et al. (Quan et al., 2017) to improve the system's performance. In principal component analysis, data points are projected onto the first few principal components in a way that maximises the variance retained in the projected data. Kwon et al. (2003) used forward selection and backward elimination to rank features and choose the optimal subset of features. Unlike PCA, which results in a set of new features, both forward selection and backward elimination return subsets of the original set of features. Bhavan et al. (2019) used Boruta, a wrapper-based feature selection method. It is an algorithm that addresses the problem of finding an all-relevant subset of features, instead of the minimal-optimal subset. The algorithm is explained in detail in the work of Jankowski et al. in Kursa et al. (2010). Many recent works in the area of speech emotion recognition have gravitated towards the use of deep learning techniques, like CNNs, for classification and hence do not have an explicit feature subset selection module; feature subset selection, along with classification, is taken care of by the deep learning model. This is exemplified in the works of Zeng et al. (2019), Zamil et al. (2019) and Christy et al. (2020). Other works, like those of Gomathy (2021) and Daneshfar and Kabudian (2020), have utilised swarm-based optimisation techniques for feature selection. Gomathy (2021) applies Cat Swarm Optimisation (CSO), while Daneshfar et al. (Daneshfar & Kabudian, 2020) employ a modified quantum-behaved particle swarm optimisation (QPSO) to get a near-optimal subset of features prior to classification.

In this study, a filter-based approach to feature subset selection is employed, because filter-based approaches are fast, simple, less prone to overfitting and not model-specific, unlike wrapper methods. The aim here is to obtain a subspace of m features, R^m, such that the statistical dependence of the target attribute on this subset of features is maximised (Peng et al., 2005). To realise this scheme of maximal dependency, features are first ranked by their relevance to the target attribute c and the top x% of them are picked. Relevance to the target attribute is measured with the help of mutual information (MI). For this, sklearn's GenericUnivariateSelect class is used. Since there was a high imbalance in the number of features extracted from the different feature groups (MFCCs and their derivatives, spectral centroids, etc.), it was decided to retain the top 20% of the MFCC- and LFCC-related features; all other features were included without any selection mechanism. Table 2 provides a summary of the different feature groups before and after feature subset selection.

Table 2 Features before and after feature subset selection

Description         | All features | Selected features
MFCCs               | 360          | 72
LFCCs               | 234          | 47
Spectral centroids  | 4            | 4
I, II, III formants | 18           | 18
Pitch               | 18           | 18
Intensity           | 18           | 18
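A minimal sketch of this selection step with scikit-learn's mutual-information scoring is given below; the 20% retention ratio follows the text, while the function and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, mutual_info_classif

def keep_top_fraction(X_group, y, fraction=0.20):
    """Rank one feature group (e.g. the MFCC- or LFCC-derived columns) by mutual
    information with the emotion labels and retain the top fraction of columns."""
    selector = GenericUnivariateSelect(score_func=mutual_info_classif,
                                       mode='percentile', param=fraction * 100)
    return selector.fit_transform(X_group, y), selector

# Hypothetical usage: a 20% cut on the MFCC (360) and LFCC (234) groups gives roughly
# 72 and 47 columns; stacking them with the 58 untouched columns (spectral centroids,
# formants, pitch, intensity) yields the 177 features summarised in Table 2.
# X_mfcc_sel, _ = keep_top_fraction(X_mfcc, y)
# X_lfcc_sel, _ = keep_top_fraction(X_lfcc, y)
# X_selected = np.hstack([X_mfcc_sel, X_lfcc_sel, X_rest])
```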
3.3 Classification

After feature extraction and feature subset selection, the next step in the pipeline is to run the selected subset of features through five machine learning models to classify speech into one of the 8 emotions. These models are then compared to each other on the basis of overall accuracy and class-wise
accuracies. Each of the five models described below is also run independently on the entire set of features (without feature subset selection) and the results are compared using stratified k-fold cross-validation (k = 10). The five models considered in this study, with their corresponding results, are explained next. For the implementation of these models, the scikit-learn library is used (Pedregosa et al., 2011).

3.3.1 Gaussian Naïve Bayes Classifier

The Naïve Bayes algorithm is a simple classification technique that relies on Bayes' theorem for calculating the posterior probability. What makes the algorithm naïve is the assumption that the predictors or features used in the classification model are all conditionally independent of each other. Mathematically, this means that the likelihood of a class, given a feature vector, can be expressed as a product of the likelihoods of the class given each feature of the feature vector.

The posterior probability of the d-dimensional feature vector [x_1, x_2, x_3, \ldots, x_d] belonging to class C_i is given by the following equation:

P(C_i \mid [x_1, x_2, x_3, \ldots, x_d]) = \frac{P([x_1, x_2, x_3, \ldots, x_d] \mid C_i)}{P([x_1, x_2, x_3, \ldots, x_d])} \times P(C_i)

where P([x_1, x_2, x_3, \ldots, x_d]) is the probability of the feature vector, called the predictor prior probability, P(C_i) is the prior probability of the i-th class, and P([x_1, x_2, x_3, \ldots, x_d] \mid C_i) is the likelihood of the feature vector given class C_i. Under the assumption of conditional independence of the features, the above equation reduces to:

P(C_i \mid [x_1, x_2, x_3, \ldots, x_d]) = \frac{P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times P(x_3 \mid C_i) \times \cdots \times P(x_d \mid C_i)}{P([x_1, x_2, x_3, \ldots, x_d])} \times P(C_i)
Posterior probabilities are calculated for each class, and the class with the maximum posterior probability is selected and assigned to the feature vector. This is called the maximum a posteriori (MAP) rule. In this study, the Gaussian Naïve Bayes (GNB) classifier is used. The Gaussian Naïve Bayes algorithm is a simple extension of the vanilla Naïve Bayes algorithm to real-valued attributes: each feature is assumed to follow a normal distribution whose mean and standard deviation are estimated from the training data. The Naïve Bayes algorithm is sought for its computational simplicity and low training times. However, it is known to be a poor classifier, as the assumption of conditional independence does not hold for most practical applications. In the SER application, the classifier shows an extremely poor performance: the overall stratified tenfold cross-validation accuracy given by this classifier is 43.48% when the model is run with all 652 features obtained by the feature extraction module. When run on the subset of 177 features, the accuracy increases to 44.98%, which is still very low compared to the other models. Algorithm 1 presents the Gaussian Naïve Bayes classifier.
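A minimal scikit-learn version of this classifier, assuming a prepared feature matrix and emotion labels (X_train, y_train and X_test are placeholder names):

```python
from sklearn.naive_bayes import GaussianNB

# Each feature is modelled by a class-conditional normal distribution whose mean and
# standard deviation are estimated from the training data; prediction picks the class
# with the maximum posterior probability (the MAP rule).
gnb = GaussianNB(var_smoothing=1e-9)       # smoothing value as listed in Table 4
gnb.fit(X_train, y_train)
posteriors = gnb.predict_proba(X_test)     # per-class posterior probabilities
predictions = gnb.predict(X_test)          # argmax over the posteriors
```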
3.3.2 k-Nearest Neighbours classifier

The k-Nearest Neighbours (k-NN) classifier is a simple ML algorithm that employs a lazy-learner approach to classification. It stores all the training examples at training time without building an explicit model from them. At classification time, a search is performed to find the k points in the dataset that are closest to the test example. The Euclidean distance is used as the distance metric in this study. The Euclidean distance between two d-dimensional feature vectors f_1 and f_2 is given by

\text{Euclidean distance}(f_1, f_2) = \sqrt{\sum_{i=1}^{d} (f_{1i} - f_{2i})^2}
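The corresponding scikit-learn sketch, again assuming prepared train and test splits, with k = 1 as reported in Table 4:

```python
from sklearn.neighbors import KNeighborsClassifier

# Lazy learner: fit() only stores the training examples; at prediction time the k
# nearest stored points under the Euclidean metric vote on the emotion label.
knn = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```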
3.3.4 Support Vector Machine

Support Vector Machines (SVMs) are robust binary classifiers that find the hyperplane between two classes of data such that the margin between the two classes is maximized. This hyperplane is constructed in a higher dimensional space by what is known as the kernel trick. A kernel function is suitably chosen for the application, such that the training points are linearly separable in the higher dimensional space. The kernel trick is that the kernel function, when applied to two vectors p and q, gives a value that is the dot product of the two vectors when each of them is projected onto the higher dimensional space. What this higher dimensional space is exactly is not of our concern, and the entire prediction process can be carried out by simply knowing the dot product values, which can be conveniently found using the kernel function. Considering two vectors x_i and x_j and their transformed data points \phi(x_i) and \phi(x_j), the dot product of the two transformed vectors can be found by simply applying the kernel function to x_i and x_j, as shown:

K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)

Choosing the appropriate kernel function is thus of utmost importance. In this study, we use the Gaussian kernel SVM. The Gaussian kernel, or Radial Basis Function (RBF) kernel, applied to two d-dimensional vectors x_i and x_j is given by the following equation:

K(x_i, x_j) = \exp\left( -\frac{\sum_{k=1}^{d} \lVert x_{ik} - x_{jk} \rVert^2}{2\sigma^2} \right)

In the above equation, the term \sum_{k=1}^{d} \lVert x_{ik} - x_{jk} \rVert^2 is the squared L2-norm distance between the vectors x_i and x_j, and the kernel coefficient \gamma is given by \frac{1}{2\sigma^2}. Algorithm 4 presents the support vector machine classifier.

Using grid search and a stratified tenfold cross-validation strategy, we find the most suitable values of the kernel coefficient (\gamma) and the penalty term (C) to be 0.0029 and 100 respectively. With these hyperparameters, the model gives an overall accuracy of 67.57%, which is significantly higher than the accuracy scores given by all the previous models. This is when the model has been trained on the entire set of 652 features. When trained on the 177 selected features, a substantial improvement in performance is observed: the model does exceedingly well and gives an overall accuracy of 77.86%.
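The tuning procedure can be sketched as follows with scikit-learn; the grid values other than the reported optimum (gamma = 0.0029, C = 100) are illustrative assumptions, as is the name of the feature matrix.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# RBF-kernel SVM tuned by grid search under stratified tenfold cross-validation.
param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.0001, 0.001, 0.0029, 0.01]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring='accuracy')
search.fit(X_selected, y)      # X_selected: the 177-feature matrix (assumed prepared)
print(search.best_params_, search.best_score_)
```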
⋆ Details of the BPA (back propagation algorithm) can be found in Rojas (1996).

3.4 Performance evaluation

After training the machine learning models, a comparative analysis of their performance was done. Emotion-wise accuracies were noted using the tenfold cross-validation strategy for each of the classifiers. Models were compared to each other on the basis of their overall accuracies (averaged over all classes or emotions). The overall validation accuracy for a multi-class classification problem with k classes is defined as the ratio of the number of samples that were correctly classified to the total number of samples:

\text{Overall validation accuracy} = \frac{\sum_{i=1}^{k} \text{samples correctly classified as class } i}{\sum_{i=1}^{k} \text{samples belonging to class } i}

Class-wise precision, sensitivity, specificity and F1-scores were also computed for the two contending models and tabulated for reference. These metrics are defined as follows:

\text{Precision} = \frac{TP}{TP + FP}

\text{Sensitivity} = \frac{TP}{TP + FN}

\text{Specificity} = \frac{TN}{TN + FP}

\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}

On the basis of these performance metrics, the Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron classifiers were evaluated and the most suitable model was chosen for the SER framework. The range of metrics reported would also enable users to choose an appropriate learning model according to their requirements.
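The sketch below shows how these class-wise metrics follow from a confusion matrix in a one-vs-rest sense; it is an illustration rather than the authors' evaluation code, and the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, labels):
    """Precision, sensitivity (recall), specificity and F1 for each class, one-vs-rest."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        tn = cm.sum() - tp - fp - fn
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        f1 = (2 * precision * sensitivity / (precision + sensitivity)
              if (precision + sensitivity) else 0.0)
        results[label] = (precision, sensitivity, specificity, f1)
    return results
```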
Table 4 Validation accuracies before and after feature subset selection

Classifier | Hyperparameters                             | Accuracy score (652 features) | Accuracy score (177 features)
GNB        | Smoothing parameter = 1 × 10^-9             | 43.48                         | 44.26
k-NN       | k = 1                                       | 53.25                         | 65.17
RF         | k = 100, Gini split                         | 61.84                         | 64.06
SVM        | γ = 0.0029, C = 100                         | 67.57                         | 77.86
MLP        | 2 hidden layers: (640, relu), (32, sigmoid) | 73.76                         | 79.62
intensity. In total, the speech-only dataset comprises 1440 audio files in wav format, recorded with a sampling frequency of 48 kHz.

4.2 Data pre-processing

Data is read using Python's Librosa library, and features are extracted with the help of Librosa and Parselmouth. The features obtained are then scaled with respect to a standard normal distribution, with mean 0 and variance 1. After this, the MFCC and LFCC features are ranked in decreasing order of their mutual information with the target attribute, and the top 20% of these features are retained and added to the other features (spectral centroids, formants, pitch and intensity); i.e., 177 of the 652 features are retained in the feature subset selection step. After this, to account for the class imbalance in the RAVDESS dataset and balance the number of samples across all emotions, we use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class data. The neutral class is oversampled using SMOTE, and a total of 1536 samples are considered for classification, with each emotion represented by 192 samples. Imblearn (Lemaître et al., 2017) is used for the implementation of SMOTE.
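A compact sketch of this pre-processing, assuming the feature matrix has already been assembled and reduced to the 177 selected columns (X_selected and y are placeholder names):

```python
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Scale every feature to zero mean and unit variance, then oversample the
# under-represented neutral class so that all eight emotions have 192 samples each.
X_scaled = StandardScaler().fit_transform(X_selected)   # X_selected: 1440 x 177 (assumed)
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X_scaled, y)
```

When this is embedded in cross-validation, the scaler and the oversampler are usually fitted on the training folds only; the sketch above simply mirrors the sequential description in the text.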
4.3 Model training

The models are trained and evaluated using a stratified k-fold cross-validation strategy. In stratified sampling, the data is divided into homogeneous sub-populations. The aim is to have each subset or sample be representative of the entire population in terms of the percentage composition of each class. This is preferred over random sampling for machine learning problems, since the training data should be representative of the entire dataset in order to predict unseen examples correctly. K-fold cross-validation is an approach wherein the entire dataset is first divided into k equal folds or subsets of data. Then, in each run, one of the k folds acts as the test set, while the other k - 1 folds are used to train the model. The model is evaluated using the test set and the accuracy is noted. This experiment is performed k times, with each
of the k folds acting as the test set exactly once. The aver-
age accuracy of the k experiments is reported as the k-fold
accuracy. In this study, a stratified tenfold cross-validation
strategy is used to evaluate the classification models.
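A sketch of the evaluation loop is given below; the model objects and the balanced data arrays are placeholders for the classifiers and matrices described above.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified tenfold cross-validation: every fold keeps the class proportions of the
# full dataset and serves as the test set exactly once; the mean accuracy is reported.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in [('GNB', gnb), ('k-NN', knn), ('RF', rf), ('SVM', svm), ('MLP', mlp)]:
    scores = cross_val_score(model, X_balanced, y_balanced, cv=cv, scoring='accuracy')
    print(name, round(scores.mean() * 100, 2))
```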
4.4 Experimental results
This shows that Random Forest is not as sensitive as k-NN to the presence of irrelevant features in the data, and performs satisfactorily even without feature subset selection. While the Random Forest and k-NN classifiers do a good job of classifying speech into emotions, as can be verified from the state-of-the-art approaches tabulated in Table 4, the best accuracies are obtained by the support vector machine and the multilayer perceptron. With the SVM, the overall accuracy under stratified tenfold cross-validation is a considerable 77.86% post feature subset selection, with the maximum fold accuracy being 86.27%. With the MLP, an overall accuracy of 79.62% is achieved, with a maximum of 84.96%. From these results, it is quite clear that SVM and MLP are the best performers in terms of validation accuracy and the only real contenders for the SER framework.

In addition to studying overall accuracies, it is imperative to look at the emotion-wise performance of the classifiers as well. Figure 2 depicts the class-wise performance of all five algorithms. The similarity across these classifiers is that each of them performs well on the neutral class, while the sad class shows the lowest performance scores. The MLP and SVM show fairly decent accuracy scores for the sad class, with recalls of 60.82% and 57.81% respectively. The second worst performing algorithm, namely the k-NN classifier, performs best on the neutral class, correctly recognising 96.88% of all true neutral class samples. This is followed by the MLP, SVM and Random Forest models, with recalls of 91.67%, 91.15% and 87.5% respectively. The confusion matrices of the two best performing algorithms, namely the Support Vector Machine and the Multilayer Perceptron, are depicted in Figs. 3 and 4.

From these confusion matrices, performance metrics like precision, sensitivity, specificity and F1-score are computed and tabulated in Table 5. While the class-wise precision, recall and F1-scores of the MLP are consistently higher than those of the SVM, the class-wise specificity scores of the SVM are greater than those of the MLP, except for the neutral and sad classes. The two algorithms are also compared on the basis of their ROC curves. The area under the ROC curve is a good indicator of a model's performance: the greater the area under the curve (AUC) for a given emotion, the better the model's ability to distinguish between test instances of that emotion. Figure 5 provides the ROC curves for each of the 8 classes and their macro-average for the SVM model, and Figure 6 provides the ROC curves for the MLP model. A comparison of the two plots reveals that the MLP model's AUC is slightly better than the SVM model's on average.
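The per-emotion ROC curves can be obtained in a one-vs-rest fashion, roughly as sketched below; clf stands for any fitted classifier that exposes class-membership scores (for an SVC this requires probability=True), and the data names are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = np.unique(y_balanced)                   # the eight emotion labels (assumed)
y_bin = label_binarize(y_test, classes=classes)   # one indicator column per emotion
scores = clf.predict_proba(X_test)                # class-membership scores
per_class_auc = []
for i in range(len(classes)):
    fpr, tpr, _ = roc_curve(y_bin[:, i], scores[:, i])
    per_class_auc.append(auc(fpr, tpr))           # area under the ROC curve per emotion
macro_auc = float(np.mean(per_class_auc))         # macro-average across the 8 classes
```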
Table 6 Comparison of overall accuracies with other SER approaches reported in the literature for the RAVDESS dataset

Work                       | Features                                                        | Classifier                                                                 | Overall accuracy
Shegokar and Sircar (2016) | Continuous Wavelet Transform (CWT) and other prosodic features  | SVM (linear, quadratic, cubic, Gaussian kernels)                           | Linear SVM: 52.6%; Quadratic SVM: 60.1%; Cubic SVM: 56.3%; Gaussian SVM: 52.3%
Zeng et al. (2019)         | Spectrograms                                                    | Gated Residual Networks (GResNets)                                         | 64.48%
Zamil et al. (2019)        | MFCCs                                                           | Logistic Model Tree (LMT)                                                  | 70%
Bhavan et al. (2019) (a)   | MFCCs and spectral centroids                                    | SVM (Gaussian kernel)                                                      | 75.69%
Christy et al. (2020)      | MFCCs and modulation spectral (MS) features                     | CNN                                                                        | 78.20%
Proposed work              | MFCCs, LFCCs, spectral centroids, formants, pitch and intensity | SVM (Gaussian kernel); MLP (two hidden layers: (640, relu), (32, sigmoid)) | SVM: 77.86%; MLP: 79.62%

(a) Reports the best accuracy with a 90:10 train-test split; in general, average accuracy is reported
Interestingly, for low energy emotions like calm, disgust and fearful, the SVM model does a more sensitive classification (better recall) compared to the MLP model.

Comparing only the overall validation accuracies of the five models from Table 4 might compel us to conclude that MLP is the best model for classification on the basis of mean accuracies, completely neglecting the SVM. It is, however, constructive to look at the training times of the SVM and MLP models to make a fair comparison between the two and decide on the better classifier for the framework. While training a single fold of data (out of the 10 folds) takes 17 μs on average for the SVM, it takes 1.000023 × 10^6 μs on average for the MLP. While one second and 23 microseconds is not a lot in absolute terms, it is several orders of magnitude larger than 17 μs. Moreover, the computational complexity of MLPs tends to be inherently higher due to the large number of parameters (weights) to update in each iteration of the back propagation algorithm.

To test the significance of the difference between the SVM and the MLP, a non-parametric statistical hypothesis test called the Wilcoxon signed-rank test is used. This test evaluates whether two dependent samples are drawn from the same population, and is considered an alternative to the paired Student's t-test for dependent samples. Pairwise tests are carried out for the k-NN, Random Forest, SVM and MLP classifiers; the Naïve Bayes classifier was excluded from this analysis due to its poor performance. These tests are performed using k-fold cross-validation with k = 20, and the level of significance is chosen as 5% (α = 0.05). The null hypothesis for each pairwise test is that the performance of the two classifiers, in terms of their accuracies, does not differ significantly. Figure 7 provides the p-values for all the pairwise tests. From Figure 7, it is clear that the SVM and MLP models are both significantly different from the Random Forest and k-NN models, since the p-values for these four pairs are very close to zero. At the 5% level of significance, there is no significant difference in the performance of the Random Forest and k-NN classifiers. Similarly, there is no significant difference in the performance of the SVM and MLP models.

Fig. 7 Wilcoxon signed-rank test correlation matrix
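A minimal sketch of one such pairwise test with SciPy; the two arrays are placeholders for the matched per-fold accuracies (k = 20) of the classifiers being compared.

```python
from scipy.stats import wilcoxon

# Paired non-parametric test on matched per-fold accuracies of two classifiers.
stat, p_value = wilcoxon(svm_fold_accuracies, mlp_fold_accuracies)
if p_value > 0.05:
    print('No significant difference at the 5% level')   # null hypothesis retained
else:
    print('Performances differ significantly')
```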
4.5 Discussion

We now compare our SER framework, with both the SVM and MLP models, to various other SER frameworks developed in the literature. Table 6 provides a comparison between the overall accuracy of the SVM and MLP models and the overall accuracies of other SER models reported in the literature for the RAVDESS dataset. All our results have been tabulated with the tenfold stratified cross-validation approach. Shegokar and Sircar (2016) extracted features based on the continuous wavelet transform (CWT) and used SVMs with different kernel functions for classification; the highest overall accuracy they reported for the RAVDESS dataset was 60.1%. Zeng et al. (2019) used spectrograms with a deep neural network-based classifier and observed an accuracy of 64.48%. In Zamil et al. (2019), a 13-dimensional feature vector comprising MFCCs was used to train a Logistic Model Tree on the data; the best overall accuracy they reported is 70%. It is important to note here that MFCCs alone do not contain all the emotional information in speech: their delta and delta-delta coefficients, along with other spectral features, play an important role in classifying speech into emotions correctly. Our proposed models perform better than the aforementioned models on the RAVDESS dataset. It should also be noted that emotion recognition on the RAVDESS dataset is not a simple task, with the human accuracy rate reported as 67% for this dataset (Livingstone & Russo, 2018).

The authors in Bhavan et al. (2019) used a bagged ensemble of support vector machines for classification, with MFCCs and spectral centroids as their primary features. With this, their best performance with a 90:10 train-test split is reported as 75.69% on the RAVDESS dataset. In contrast, our best performance is 84.96% using the MLP and 86.27% using the SVM. Christy et al. (2020) used a convolutional neural network to perform speech emotion recognition using a combination of MFCCs and modulation spectral features; their best accuracy was reported as 78.20%. While this is a close competitor to the proposed methods of classification, it is worth going with the SVM model for classification (accuracy = 77.86%) due to its faster training time.

Hence, it can be concluded that the SER framework developed in this study surpasses the reported frameworks in terms of overall accuracy. With a mere 177 features retained by the feature subset selection module, the proposed SVM framework is computationally light and highly reliable. Due to its lower training times compared to its MLP counterpart, the proposed SVM framework is suitable for incorporation into real-time applications for speech emotion recognition.

5 Conclusion

In this paper, a speech emotion recognition framework has been designed and implemented using machine learning algorithms like Gaussian Naïve Bayes, k-Nearest Neighbours, Random Forest, Support Vector Machine and Multilayer Perceptron. A combination of spectral and prosodic features is examined. Of the 652 features, comprising MFCCs, LFCCs, formants, spectral centroids, intensity and pitch, along with their velocity and acceleration components, 177 are retained for further processing based on mutual information with the target attribute. Filter-based
approaches like selecting the top x features based on some criterion like mutual information or Pearson's coefficient are faster and computationally less intensive than wrapper-based approaches. Thus, the choice of this feature subset selection algorithm is fitting for an SER framework designed to work in real-time applications. The features retained after feature subset selection are then used to train the aforementioned ML models, and evaluation of these models is carried out using a stratified ten-fold cross-validation strategy.

Comparative analysis of the models reveals that both the MLP and SVM classifiers perform remarkably well compared to the other three classifiers in terms of validation accuracy. The Wilcoxon signed-rank test is used to statistically determine, for each pair of models, how significantly their performances differ. It is noted that MLP and SVM both perform significantly better than the other classifiers. With overall accuracies of 79.62% and 77.86% respectively, a pairwise test between these two classifiers reports no significant difference in their performances. While the performance of the SVM is only slightly lower than that of the MLP, it is on average around 1000 times faster in terms of training time. Since the aim of the paper is to provide an SER framework that is both reliable and fast enough to be used in real-time applications, the SVM provides a better alternative to the MLP. The performance of both these models is noted to be a significant improvement over other state-of-the-art models reported in the literature for the RAVDESS dataset. A number of neural network-based approaches have been reported in the last couple of years. While these models have obtained accuracies as high as 82.3% on the RAVDESS dataset (Nantasri et al., 2020), it should be noted that the proposed model that uses an SVM is computationally much less taxing to train, and is also much faster compared to a DNN.

5.1 Future work

Further work can be done to incorporate audio quality features like shimmer, jitter and harmonics-to-noise ratio (HNR) to enhance the performance of the system. Moreover, experiments can be performed varying the number of MFCC and LFCC coefficients to see how these affect the performance of the system. The real challenge in designing speech emotion recognition systems is to discover and explore the right set of features to be used, and to focus more on the quality rather than the quantity of acoustic features. While the present study has focused only on paralinguistic aspects of speech, linguistic (lexical) information can also be incorporated via an Automatic Speech Recognition (ASR) system to facilitate emotion recognition. Anagnostopoulos et al. discuss the inclusion of non-linguistic vocalisations to improve SER systems (Anagnostopoulos et al., 2015). These rely on detecting cues like laughter, sighs, yawns and cries, which can be directly linked to one or more emotional states.

Another area with scope for improvement would be trying out different feature selection algorithms. Boruta could be tried to reduce the dimensionality of the extracted features; this algorithm was used by Bhavan et al. (2019) to extract the most emotionally informative subset of features for their bagged ensemble of SVMs. Evolutionary algorithms like the Genetic Algorithm (GA) and swarm-based meta-heuristics like particle swarm optimisation (PSO) could also be experimented with for selecting the most informative subset of features. Filter-based approaches that assign scores to features based on correlation coefficients or the Chi-squared test can also be tried. Moreover, other frameworks of knowledge-based decision systems can be utilised for the feature subset selection and classification phases in SER (Christopher et al., 2016; Kavya et al., 2021). Apart from this, further research can be done on how gender-based information can be exploited to design gender-dependent SER systems, and how they can be an improvement over gender-neutral SER systems.

References

Agrawal, E., & Christopher, J. (2020). Emotion recognition from periocular features. In International conference on machine learning, image processing, network security and data sciences (pp. 194–208). Springer.
Agrawal, E., Christopher, J. J., & Arunachalam, V. (2021). Emotion recognition through voting on expressions in multiple facial regions. ICAART, 2, 1038–1045.
Anagnostopoulos, C.-N., Iliou, T., & Giannoukos, I. (2015). Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artificial Intelligence Review, 43(2), 155–177.
Bhavan, A., Chauhan, P., Shah, R. R., et al. (2019). Bagged support vector machines for emotion recognition from speech. Knowledge-Based Systems, 184, 104886.
Chen, L., Su, W., Feng, Y., Wu, M., She, J., & Hirota, K. (2020). Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction. Information Sciences, 509, 150–163.
Christopher, J. J., Nehemiah, K. H., & Arputharaj, K. (2016). Knowledge-based systems and interestingness measures: Analysis with clinical datasets. Journal of Computing and Information Technology, 24(1), 65–78.
Christy, A., Vaithyasubramanian, S., Jesudoss, A., & Praveena, M. A. (2020). Multimodal speech emotion recognition and classification using convolutional neural network techniques. International Journal of Speech Technology, 23, 381–388.
Daneshfar, F., & Kabudian, S. J. (2020). Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm. Multimedia Tools and Applications, 79(1), 1261–1289.
Gomathy, M. (2021). Optimal feature selection for speech emotion recognition using enhanced cat swarm optimization algorithm. International Journal of Speech Technology, 24(1), 155–163.
Gupta, K., Gupta, M., Christopher, J., & Arunachalam, V. (2020). Fuzzy system for facial emotion recognition. In International conference on intelligent systems design and applications (pp. 536–552). Springer.
Issa, D., Demirci, M. F., & Yazici, A. (2020). Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control, 59, 101894.
Jadoul, Y., Thompson, B., & De Boer, B. (2018). Introducing parselmouth: A python interface to praat. Journal of Phonetics, 71, 1–15.
Kavya, R., Christopher, J., Panda, S., & Lazarus, Y. B. (2021). Machine learning and XAI approaches for allergy diagnosis. Biomedical Signal Processing and Control, 69, 102681.
Koduru, A., Valiveti, H. B., & Budati, A. K. (2020). Feature extraction algorithms to improve the speech emotion recognition rate. International Journal of Speech Technology, 23(1), 45–55.
Kursa, M. B., Jankowski, A., & Rudnicki, W. R. (2010). Boruta - a system for feature selection. Fundamenta Informaticae, 101(4), 271–285.
Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signals. In Eighth European conference on speech communication and technology.
Kwon, S., et al. (2020). CLSTM: Deep feature-based speech emotion recognition using the hierarchical convLSTM network. Mathematics, 8(12), 2133.
Kwon, S., et al. (2021). Mlt-dnet: Speech emotion recognition using 1d dilated CNN based on multi-learning trick approach. Expert Systems with Applications, 167, 114177.
Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559–563.
Liu, G. K. (2018). Evaluating gammatone frequency cepstral coefficients with neural networks for emotion recognition from speech. arXiv preprint arXiv:1806.09010.
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), e0196391.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., & Nieto, O. (2015). Librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, Vol. 8 (pp. 18–25). Citeseer.
Nantasri, P., Phaisangittisagul, E., Karnjana, J., Boonkla, S., Keerativittayanun, S., Rugchatjaroen, A., Usanavasin, S., & Shinozaki, T. (2020). A light-weight artificial neural network for speech emotion recognition using average values of MFCCs and their derivatives. In 2020 17th International conference on electrical engineering/electronics, computer, telecommunications and information technology (ECTI-CON) (pp. 41–44). IEEE.
Pan, Y., Shen, P., & Shen, L. (2012). Speech emotion recognition using support vector machine. International Journal of Smart Home, 6(2), 101–108.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12, 2825–2830.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Petrushin, V. A. (2000). Emotion recognition in speech signal: Experimental study, development, and application. In Sixth international conference on spoken language processing.
Picard, R. W. (2000). Affective computing. MIT Press.
Quan, C., Zhang, B., Sun, X., & Ren, F. (2017). A combined cepstral distance method for emotional speech recognition. International Journal of Advanced Robotic Systems, 14(4), 1729881417719836.
Rojas, R. (1996). The backpropagation algorithm. In Neural networks (pp. 149–182). Springer.
Rong, J., Li, G., & Chen, Y.-P. P. (2009). Acoustic feature selection for automatic emotion recognition from speech. Information Processing & Management, 45(3), 315–328.
Shegokar, P., & Sircar, P. (2016). Continuous wavelet transform based speech emotion recognition. In 2016 10th international conference on signal processing and communication systems (ICSPCS) (pp. 1–8). IEEE.
Surampudi, N., Srirangan, M., & Christopher, J. (2019). Enhanced feature extraction approaches for detection of sound events. In 2019 IEEE 9th international conference on advanced computing (IACC) (pp. 223–229). IEEE.
Tzirakis, P., Zhang, J., & Schuller, B. W. (2018). End-to-end speech emotion recognition using deep neural networks. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5089–5093). IEEE.
Vogt, T., & André, E. (2006). Improving automatic emotion recognition from speech via gender differentiation. In LREC (pp. 1123–1126).
Zamil, A. A. A., Hasan, S., Baki, S. M. J., Adam, J. M., & Zaman, I. (2019). Emotion detection from speech signals using voting mechanism on classified frames. In 2019 international conference on robotics, electrical and signal processing techniques (ICREST) (pp. 281–285). IEEE.
Zeng, Y., Mao, H., Peng, D., & Yi, Z. (2019). Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 78(3), 3705–3722.
Zhou, X., Garcia-Romero, D., Duraiswami, R., Espy-Wilson, C., & Shamma, S. (2011). Linear versus mel frequency cepstral coefficients for speaker recognition. In 2011 IEEE workshop on automatic speech recognition & understanding (pp. 559–564). IEEE.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.