https://doi.org/10.1007/s10772-022-09985-6
Received: 16 April 2021 / Accepted: 17 June 2022 / Published online: 8 July 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Speech emotion recognition is one of the fastest growing areas of interest in the field of affective computing. Emotion detection aids human–computer interaction and finds application in a wide gamut of sectors, ranging from healthcare to retail to education. The present work strives to provide a speech emotion recognition framework that is both reliable and efficient enough to work in real-time environments. Speech emotion recognition can be performed using linguistic as well as paralinguistic aspects of speech; this work focusses on the latter, using non-lexical or paralinguistic attributes of speech like pitch, intensity and mel-frequency cepstral coefficients to train supervised machine learning models for emotion recognition. A combination of prosodic and spectral features is used for experimental analysis, and classification is performed using algorithms like Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron. The choice of these ML models was based on the swiftness with which they could be trained, making them more suitable for real-time applications. Comparative analysis of the models reveals SVM and MLP to be the best performers, with accuracies of 77.86% and 79.62% respectively. The performance of these classifiers is compared with benchmark results in the literature, and a significant improvement over state-of-the-art models is presented. The observations and findings of this work can be applied to design real-time emotion recognition frameworks and to develop applications and technologies for various domains.
Keywords Emotion recognition · Affective computing · Support vector machine · Multilayer perceptron · Paralinguistic
acoustic features
the emotional state of patients during the counselling session would enable the psychiatrist to converse in a better manner, thereby enhancing the efficacy of such sessions. Realizing the wide-ranging applications of speech emotion recognition, it becomes critical to understand and develop systems that work efficiently and can be used in real-time environments.

1.1 Speech emotion recognition

Emotion recognition from speech is generally performed in one of two ways. The first is to extract the linguistic (lexical) information contained in speech, that is, to use Automatic Speech Recognition (ASR) software to capture words or phrases from speech. These words and phrases can then be looked up in a dictionary of salient words and phrases designed for this purpose, where each word or phrase is associated with a set of discrete emotional states. The presence of the word 'great', for example, can be mapped to positive emotional states like happiness or surprise in the dictionary. The other way of detecting emotion from speech relies on non-lexical features of speech like pitch, intensity and audio quality aspects such as jitter and shimmer. In this approach, we try to correlate the pronunciation of words articulated in different emotional states with their corresponding emotions. In the present work, the latter approach has been taken to design an SER system. It is also possible to design an SER system based on non-lexical acoustic features that is assisted by an ASR contributing to the classification using lexical aspects of speech. Such a framework would, however, need a corpus of spontaneous speech audio with corresponding emotional state information.

A good number of corpora exist for emotion recognition from speech. Of these, most are acted SER databases, in which a handful of actors articulate a fixed statement in a given set of emotional states. It is noted that actors often exaggerate emotional speech, making the task of emotion recognition harder in natural or spontaneous speech.

1.2 Paralinguistic acoustic features

As discussed earlier, speech contains linguistic (explicit) as well as paralinguistic (implicit) components. The paralinguistic aspects of speech include pronunciation, loudness of speech, intonation, shimmer, jitter, harmonics-to-noise ratio (HNR) and variations in pitch and intensity. Since the use of paralinguistic acoustic features is independent of lexical content, an SER system trained on a corpus of English sentences can very well be used to detect emotion from speech in other languages. Acoustic features are broadly categorized into two types based on their temporal structure, namely segmental and supra-segmental. Segmental features are computed once for every time frame (20-30 ms) windowed from an utterance. The resultant dimension of a segmental feature is thus equal to the number of time frames the speech utterance is windowed into. Supra-segmental features, on the other hand, are computed once for every utterance. Some examples of segmental features include frame intensity, mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs), while fundamental frequency, shimmer, jitter and speech rate are good examples of supra-segmental features. Vector features can be further classified into low-level descriptors (LLDs) and functionals. LLDs are the set of all segmental and supra-segmental features. Functionals are obtained from LLDs by applying statistical measures like mean, maximum, variance and kurtosis (Anagnostopoulos et al., 2015).

1.3 Machine learning for speech emotion recognition

Classifying speech into emotions involves the extraction of useful and meaningful patterns from audio signals and learning to identify these patterns in data. This is done through machine learning techniques. A number of pattern recognition algorithms can be used for classification. Previous works in this area have used Decision Trees, Support Vector Machines, Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and other soft computing approaches (Agrawal & Christopher, 2020; Agrawal et al., 2021; Gupta et al., 2020). In this paper, we experiment with Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron as models for classification. Like any other typical machine learning problem, the task of emotion recognition is broken down into three main steps: feature extraction, feature subset selection and classification. This study uses the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone & Russo, 2018) to classify speech into one of eight possible emotions: happy, sad, neutral, calm, disgust, surprised, anger and fearful. The performance of SER systems largely depends on the features extracted from speech. The most challenging aspect of speech emotion recognition is finding the golden set of features that can discriminate between emotions correctly and yet not overfit the machine learning model. Mel-frequency Cepstral Coefficients (MFCCs), Linear Frequency Cepstral Coefficients (LFCCs), pitch, intensity and formants, along with their derivatives and double derivatives, have been extracted in the feature extraction phase. In the feature selection phase, mutual information between each feature and the target attribute is computed so as to rank the features in decreasing order of importance. The top 25% of features are then retained and classification is performed on this subset of features. Feature selection is an essential step in the machine learning pipeline. Feature selection methods are
primarily of three types: Filter, Wrapper and Embedded. In this study, a Filter-based approach is used, as it is simpler, faster and independent of the choice of classifier. For the classification part, results have been presented for the two best performing classifiers, i.e., a support vector machine with a Gaussian kernel and a multilayer perceptron. The optimal hyperparameters of the considered models are obtained with the help of grid search and, finally, overall and class-wise validation accuracies and ROC curves are presented and used to evaluate the ML models considered in the study.

The rest of the paper is organised as follows. Section 2 reviews the work done in the literature pertaining to speech emotion recognition. Section 3 describes the framework in detail: the methodology, the different components of the pipeline, and the implementation details ranging from feature extraction to data pre-processing to classification, along with a brief description of the performance metrics used to evaluate the ML models. The results are summarised in Sect. 4, together with comparisons between the accuracy of the proposed method and the accuracies obtained by other techniques reported in the literature. Section 5 concludes the paper and identifies potential areas for improvement.

2 Literature review

In this section, a summary of earlier works in speech emotion recognition (SER) is presented. This includes a review of feature extraction methods, feature subset selection and classification methodologies, covering both the classical machine learning and deep learning approaches used by recent state-of-the-art SER systems.

2.1 Review on feature extraction methods

The most crucial step in the design of SER systems is deciding upon the set of acoustic features to extract. Most works to date have used Mel-frequency Cepstral Coefficients (MFCCs) and their derivatives, either alone or in combination with other spectral and prosodic features. Many researchers have remarked that MFCCs are more useful for emotion detection than other acoustic features like Linear Predictive Coefficients (LPCs), formants or energy. Many recent works have used MFCCs and their derivatives in conjunction with features like shimmer, jitter, energy, zero crossing rate and spectral centroids. Bhavan et al. (2019) used spectral sub-band centroids along with the first 13 MFCCs and their derivatives as the initial set of features. In Kwon et al. (2003), the authors selected band energies, MFCCs (along with their first and second derivatives), pitch, fundamental frequency, three formant frequencies (F0, F1, F2) and log-energy as features. While MFCCs are dominantly used in both speaker recognition and emotion detection tasks, Liu (2018) suggested that a feature set of Gammatone Frequency Cepstral Coefficients (GFCCs) performed better than MFCCs for emotion detection, reporting a 3.6% increase in overall accuracy when GFCCs were used in place of MFCCs. In Zhou et al. (2011), the authors drew a comparison between MFCCs and LFCCs for speaker recognition tasks, and noted that LFCCs consistently outperformed MFCCs, especially for female speech. LFCCs tend to capture more spectral information in the high frequency range, and hence perform better for female speech signals, since female speakers have shorter vocal tracts and therefore higher formant frequencies. Shegokar and Sircar (2016) performed the continuous wavelet transform (CWT) on audio signals and used Principal Component Analysis (PCA) for feature selection. In Pan et al. (2012), the authors used a combination of MFCCs, LPCCs, Mel Energy Spectrum Dynamic Coefficients (MEDCs), energy and pitch features for emotion recognition. In addition to these features, they also added the first and second derivatives of the above segmental features to extract useful information from temporally changing data. In Koduru et al. (2020), the authors used MFCCs, the Discrete Wavelet Transform (DWT), pitch, energy and zero-crossing rate in the feature extraction phase. Surampudi et al. (2019) provide novel approaches to extract acoustic features for sound events, which could be utilised for speech emotion recognition as well.

2.2 Review on classical machine learning techniques

Early works in the area of speech emotion recognition (1990s) made use of Maximum Likelihood Probability (MLP) classifiers and Linear Discriminant Classifiers (LDC). It was in the 2000s that Neural Network (NN) classifiers gained popularity for emotion recognition tasks (Rong et al., 2009). In recent times, however, the focus has shifted back to classical machine learning algorithms like Support Vector Machines (SVMs) over deep learning techniques, primarily because of the simplicity and ease of training SVMs when compared to computationally intensive models like ANNs and DNNs. Petrushin (2000) performed speech emotion recognition using the following machine learning approaches: a k-nearest neighbours classifier, a neural network classifier and an ensemble of neural networks. The ensemble consisted of 7 to 15 base models, and a majority-voting mechanism was used for the final classification. In Bhavan et al. (2019), the authors used an ensemble of SVMs for classification. They compared the overall accuracy of the model using a single SVM against a bagged ensemble of 20 SVMs and reported a 3% improvement in emotion recognition on the RAVDESS
dataset. For feature selection, they used a method called Boruta. Shegokar and Sircar (2016) used SVMs with different kernels (linear, quadratic, cubic and Gaussian) for classification, with the highest accuracy of 60.1% obtained using a quadratic kernel on the RAVDESS dataset. Pan et al. (2012) also used an SVM for the classification of speech into emotions. They experimented with different combinations of features like LPCCs, MFCCs and MEDCs. In 2020, Chen et al. (2020) proposed a two-layer fuzzy multiple random forest model that fuses personalized and non-personalized features during emotion recognition. Validating their framework on the CASIA and Berlin Emo-DB corpora, the authors demonstrated appreciable improvements of 1.39% to 7.64% and 4.06% to 4.30% over back propagation and random forest models respectively. In 2021, T. Tuncer et al. proposed a non-linear multi-level feature generation model based on a cryptographic structure called a shuffle box. First, the Tunable Q Wavelet Transform is used for multi-level feature generation, then iterative neighbourhood component analysis is employed to select discriminative features in the feature subset selection phase, and finally, high, medium and low-level coefficients are generated via the wavelet transform method. The RAVDESS, Emo-DB, SAVEE and EMOVO datasets are used to validate the system's performance, and accuracy scores of 87.43%, 90.09%, 84.79% and 79.08% are reported respectively. Kwon et al. (2003) experimented with different classifiers based on Hidden Markov Models (HMM), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and SVM. The authors used Forward Selection and Backward Elimination in the feature selection phase of the ML pipeline to identify the subset of features to be used in classification. Rong et al. (2009) proposed a novel method of feature selection for the problem of speech emotion recognition and compared their results with commonly used feature selection algorithms like PCA, Best First Search, Sequential Forward Selection and Multidimensional Scaling, to name a few.

2.3 Review on deep learning techniques

Many recent works in SER have gravitated towards the use of artificial neural networks (ANNs) for classification. The authors in Nantasri et al. (2020) extracted MFCCs and their derivatives from speech and trained a light-weight ANN to perform classification. What made the ANN light-weight was the small number of parameters and the limited number of features it used. This model used a neural network as a classifier after the feature extraction and selection phases. Similarly, Issa et al. (2020) also limited the use of their deep learning model to the classification phase. They used MFCCs, chromagram, Mel-scale spectrogram, Tonnetz representation and various other spectral contrast features as inputs to a Convolutional Neural Network (CNN). Following an incremental approach to classification, their proposed framework achieved commendable accuracy scores of 71.61%, 86.1% and 64.3% when validated on the RAVDESS, Emo-DB and IEMOCAP datasets respectively.

More recently, deep learning models have gained popularity as end-to-end frameworks for speech emotion recognition. The authors in Tzirakis et al. (2018) designed an end-to-end SER system using a CNN and fed the raw audio input to the model. The CNN took care of all the phases of emotion recognition, from feature extraction to selection to classification. In 2021, Mustaqeem et al. proposed a simple lightweight deep learning-based Self-Attention Module (SAM) for their SER system. The extracted features were given as input to the SAM to produce channel and spatial attention maps. This was followed by the use of a multilayer perceptron and a convolutional neural network in the channel attention and spatial attention branches to extract global cues and spatial information respectively. Both channel and spatial attention maps were used to generate attention weights. The SAM was placed in the middle of the convolutional neural network while performing end-to-end training. When tested on the IEMOCAP, RAVDESS and Emo-DB datasets, the SAM-CNN framework achieved average recall scores of 98%, 80% and 93% respectively. In another of their works (Kwon et al., 2021), the authors designed a one-dimensional dilated convolutional neural network based on a multi-learning strategy for their SER framework. This model attained accuracies of 73% and 90% on the IEMOCAP and Emo-DB datasets respectively. In 2021, M. Xu et al. proposed a multi-head attention mechanism for their SER system. The authors evaluated the performance of their attention-based CNN model on the IEMOCAP dataset. With an accuracy of 76.18%, this was noted to be a significant improvement over benchmark results for the given dataset. Moreover, they experimented on speech data by injecting 50 different types of commonly occurring noises to test the robustness of the model.

The speech emotion recognition works mentioned so far have used deep learning either as an end-to-end SER framework or as a simple classifier. A few works in SER utilise different deep learning architectures for different stages of the pipeline, such as feature subset selection. For instance, in a recent study (Kwon et al., 2021), Mustaqeem et al. proposed a framework based on key sequence segment selection and a radial basis function network. Here, a CNN is used in the feature subset selection phase to extract salient features from an input spectrogram. The selected features are then normalised and forwarded to a bidirectional long short-term memory (LSTM) neural network for classification. This framework achieved accuracies of 72.25%, 85.57% and 77.02% on the IEMOCAP, Emo-DB and RAVDESS datasets respectively.
signal is first segmented into frames of equal length (25 ms) with some overlap (50%). A Fast Fourier Transform (FFT) is performed for each frame. The resultant frequency response is then passed through a set of N triangular band-pass filters (the Mel filter bank), the output of which is then subjected to the Discrete Cosine Transform (DCT). The coefficients thus obtained are the MFCCs.

Linear Frequency Cepstral Coefficients (LFCC) Linear Frequency Cepstral Coefficients are obtained in a way similar to the MFCCs, with one major point of difference: the triangular band-pass filters in the filter bank used to derive LFCCs are equally spaced on the linear frequency scale. LFCCs are known to carry better information about vocal tract excitation in the higher frequency region of the human voice.

Spectral Centroids Spectral centroids are the weighted average of the frequencies present in a sound, with the magnitude of each frequency component in the spectrum of the signal acting as its weight. The spectral centroid gives an idea of where the centre of mass of the spectrum is located.

Formants Formants are acoustic resonances of the vocal tract. They are local maxima that can be observed in spectrograms as broad peaks, occurring at multiples of the fundamental frequency. F1 is the formant with the lowest frequency, followed by F2 and F3. The first two formants, F1 and F2, are important in determining the quality of vowels.

Pitch Speech is of two types: voiced and unvoiced. Unvoiced speech segments are characterized by a turbulent, noise-like sound; these segments are often articulations of consonants. Voiced segments, on the other hand, are articulations of vowels and are characterized by a dominant, low frequency signal. This fundamental frequency is what we call pitch.

Intensity Intensity is defined as the power carried by sound per unit area. It is an important feature in emotion recognition systems. High arousal emotions like anger and surprise tend to have higher intensity than low arousal emotions like sadness and boredom.

Since speech is a non-stationary signal, it does not suffice to consider the MFCC, LFCC, pitch and intensity values alone for each of the frames. It is also important to capture the rate of change of these features with time, to get a better understanding of the dynamics of the speech signal. We thus consider the velocity (delta) as well as the acceleration (delta-delta) features for MFCCs, LFCCs, intensity and pitch. The delta coefficient for the t-th frame is simply the difference between the value at the t-th frame and the value at the (t - 1)-th frame. The delta-delta coefficients are calculated in the same way by applying the same logic to the delta coefficients. In total, 20 MFCCs, 13 LFCCs, formants 1, 2 and 3 (F1, F2 and F3), spectral centroids, pitch and intensity features are extracted. In addition, the delta and delta-delta coefficients for MFCCs, LFCCs, intensity and pitch are also computed. Feature extraction has been done with the help of the Librosa (McFee et al., 2015) and Parselmouth (Jadoul et al., 2018) libraries. Six statistics, namely mean, minimum, maximum, standard deviation, kurtosis and skewness, are computed on the segmental features derived from MFCCs, LFCCs, intensity, pitch and formants 1, 2 and 3, while four statistics, namely minimum, maximum, mean and standard deviation, are calculated for the spectral centroids. This is done because the audio signals are of different lengths, so different numbers of frames must be processed together and converted into a feature vector of uniform length. Calculating functionals (statistics like mean, variance and higher moments) on the low-level descriptors serves this purpose. In our experimental setup, a total of 652 features were computed. Of these, the MFCC- and LFCC-related features numbered 360 and 234 respectively. Pitch and intensity features contributed 18 each, and 18 were formant-based features. Finally, 4 spectral centroid features were appended to the feature vector, making it a 652-dimensional vector.
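To make the extraction step concrete, the sketch below computes frame-level MFCCs with Librosa and collapses them into fixed-length functionals. It is a minimal illustration rather than the authors' exact pipeline: the frame and hop sizes follow the text, but the LFCC, pitch, intensity and formant tracks (extracted with Parselmouth in the study) are omitted, and the function name is hypothetical.

```python
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def mfcc_functionals(path, n_mfcc=20, frame_ms=25, hop_ms=12.5):
    """Frame-level MFCCs -> one fixed-length vector of summary statistics (functionals)."""
    y, sr = librosa.load(path, sr=None)               # keep the native sampling rate
    n_fft = int(sr * frame_ms / 1000)                 # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)                     # 50% overlap between frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)               # velocity coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)     # acceleration coefficients
    feats = np.vstack([mfcc, delta, delta2])          # shape: (3 * n_mfcc, n_frames)
    # six functionals per coefficient track: mean, min, max, std, kurtosis, skewness
    stats = [feats.mean(axis=1), feats.min(axis=1), feats.max(axis=1),
             feats.std(axis=1), kurtosis(feats, axis=1), skew(feats, axis=1)]
    return np.concatenate(stats)                      # one vector per utterance
```

With 20 coefficients, their delta and delta-delta tracks, and six functionals per track, this yields the 360 MFCC-related features mentioned above.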
3.2 Feature subset selection

Feature subset selection is a critical step in designing any kind of machine learning model. Its purpose is to reduce the dimensionality of the feature space, which not only makes the training of the model much faster but also, in most cases, improves the performance of the classifier. There are three primary approaches to feature subset selection: Filter, Wrapper and Embedded. While filter approaches work independently of the classifier being used, wrapper methods evaluate subsets of features based on their performance on specific models. Wrapper methods are computationally very expensive, as they involve iteratively searching for subsets of features (whose number grows exponentially with the number of features) and evaluating their performance using a classifier. While exhaustive approaches are guaranteed to give the optimal subset of features, it is computationally not feasible to apply them, since the number of subsets to be evaluated in a feature space of dimensionality N is of the order of 2^N. Sequential methods of feature selection like Forward Selection and Backward Elimination are also popular choices. These are, however, greedy approaches and do not guarantee the optimal subset of features. In embedded feature subset selection, feature selection is integrated into the machine learning model itself. It combines the advantages of both filter- and wrapper-based methods and sits in between the two approaches in terms of time complexity.

Table 1 provides a brief overview of the different feature selection methods that have been used by recent works in the area of speech emotion recognition. Shegokar and Sircar (2016) used Principal Component Analysis (PCA) with 3 and 10 principal directions prior to training linear, quadratic, cubic and Gaussian SVMs on the feature matrices. PCA has also
been employed in the SER system developed by Changqin et al. (Quan et al., 2017) to improve the system's performance. In principal component analysis, data points are projected onto the first few principal components in a way that maximises the variance retained in the projected data. Kwon et al. (2003) used forward selection and backward elimination to rank features and choose the optimal subset of features. Unlike PCA, which results in a set of new features, both forward selection and backward elimination return subsets of the original set of features. Bhavan et al. (2019) used Boruta, a wrapper-based feature selection method. It is an algorithm that addresses the problem of finding an all-relevant subset of features, instead of the minimal-optimal subset. The algorithm is explained in detail in the work of Jankowski et al. in Kursa et al. (2010). Many recent works in the area of speech emotion recognition have gravitated towards the use of deep learning techniques, like CNNs, for classification and hence do not have an explicit feature subset selection module; feature subset selection, along with classification, is taken care of by the deep learning model. This is exemplified in the works of Zeng et al. (2019), Zamil et al. (2019) and Christy et al. (2020). Other works, like those of Gomathy (2021) and Daneshfar and Kabudian (2020), have utilised swarm-based optimisation techniques for feature selection. Gomathy (2021) applies Cat Swarm Optimisation (CSO), while Daneshfar et al. (Daneshfar & Kabudian, 2020) employ a modified quantum-behaved particle swarm optimisation (QPSO) to get a near-optimal subset of features prior to classification.

In this study, a filter-based approach to feature subset selection is employed, because filter-based approaches are fast, simple, less prone to overfitting and not model-specific, unlike wrapper methods. The aim here is to obtain a subspace of m features, R^m, such that the statistical dependence of the target attribute on this subset of features is maximised (Peng et al., 2005). To realise this scheme of maximal dependency, features are first ranked by their relevance to the target attribute c and the top x% of them are picked. Relevance to the target attribute is measured with the help of mutual information (MI). For this, sklearn's GenericUnivariateSelect class is used. Since there was a high imbalance in the number of features extracted from the different feature groups (MFCCs and their derivatives, spectral centroids, etc.), it was decided to retain the top 20% of the MFCC- and LFCC-related features; all other features were included without any selection mechanism. Table 2 provides a summary of the different feature groups before and after feature subset selection.

Table 2 Features before and after feature subset selection

Description         | All features | Selected features
MFCCs               | 360          | 72
LFCCs               | 234          | 47
Spectral centroids  | 4            | 4
I, II, III formants | 18           | 18
Pitch               | 18           | 18
Intensity           | 18           | 18
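A minimal sketch of this selection step with scikit-learn's mutual-information scoring is given below; the 20% retention ratio follows the text, while the function and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, mutual_info_classif

def keep_top_fraction(X_group, y, fraction=0.20):
    """Rank one feature group (e.g. the MFCC- or LFCC-derived columns) by mutual
    information with the emotion labels and retain the top fraction of columns."""
    selector = GenericUnivariateSelect(score_func=mutual_info_classif,
                                       mode='percentile', param=fraction * 100)
    return selector.fit_transform(X_group, y), selector

# Hypothetical usage: a 20% cut on the MFCC (360) and LFCC (234) groups gives roughly
# 72 and 47 columns; stacking them with the 58 untouched columns (spectral centroids,
# formants, pitch, intensity) yields the 177 features summarised in Table 2.
# X_mfcc_sel, _ = keep_top_fraction(X_mfcc, y)
# X_lfcc_sel, _ = keep_top_fraction(X_lfcc, y)
# X_selected = np.hstack([X_mfcc_sel, X_lfcc_sel, X_rest])
```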
3.3 Classification

After feature extraction and feature subset selection, the next step in the pipeline is to run the selected subset of features through five machine learning models to classify speech into one of the 8 emotions. These models are then compared to each other on the basis of overall accuracy and class-wise
accuracies. Each of the five models described below is also run independently on the entire set of features (without feature subset selection) and the results are compared using stratified k-fold cross-validation (k = 10). The five models considered in this study, with their corresponding results, are explained next. For the implementation of these models, the scikit-learn library is used (Pedregosa et al., 2011).

3.3.1 Gaussian Naïve Bayes Classifier

The Naïve Bayes algorithm is a simple classification technique that relies on Bayes' theorem for calculating the posterior probability. What makes the algorithm naïve is the assumption that the predictors or features used in the classification model are all conditionally independent of each other. Mathematically, this means that the likelihood of a class, given a feature vector, can be expressed as a product of the likelihoods of the class given each feature of the feature vector.

The posterior probability of the d-dimensional feature vector [x_1, x_2, x_3, \ldots, x_d] belonging to class C_i is given by the following equation:

P(C_i \mid [x_1, x_2, x_3, \ldots, x_d]) = \frac{P([x_1, x_2, x_3, \ldots, x_d] \mid C_i)}{P([x_1, x_2, x_3, \ldots, x_d])} \times P(C_i)

where P([x_1, x_2, x_3, \ldots, x_d]) is the probability of the feature vector, called the predictor prior probability, P(C_i) is the prior probability of the i-th class, and P([x_1, x_2, x_3, \ldots, x_d] \mid C_i) is the likelihood of the feature vector given class C_i. Under the assumption of conditional independence of the features, the above equation reduces to:

P(C_i \mid [x_1, x_2, x_3, \ldots, x_d]) = \frac{P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times P(x_3 \mid C_i) \times \cdots \times P(x_d \mid C_i)}{P([x_1, x_2, x_3, \ldots, x_d])} \times P(C_i)
Posterior probabilities are calculated for each class, and the class with the maximum posterior probability is selected and assigned to the feature vector. This is called the maximum a posteriori (MAP) rule. In this study, the Gaussian Naïve Bayes (GNB) classifier is used. The Gaussian Naïve Bayes algorithm is a simple extension of the vanilla Naïve Bayes algorithm to real-valued attributes: each feature is assumed to follow a normal distribution whose mean and standard deviation are estimated from the training data. The Naïve Bayes algorithm is sought for its computational simplicity and low training times. However, it is known to be a poor classifier, as the assumption of conditional independence does not hold for most practical applications. In the SER application, the classifier shows an extremely poor performance: the overall stratified tenfold cross-validation accuracy given by this classifier is 43.48% when the model is run with all 652 features obtained by the feature extraction module. When run on the subset of 177 features, the accuracy increases to 44.98%, which is still very low compared to the other models. Algorithm 1 presents the Gaussian Naïve Bayes classifier.
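A minimal scikit-learn version of this classifier, assuming a prepared feature matrix and emotion labels (X_train, y_train and X_test are placeholder names):

```python
from sklearn.naive_bayes import GaussianNB

# Each feature is modelled by a class-conditional normal distribution whose mean and
# standard deviation are estimated from the training data; prediction picks the class
# with the maximum posterior probability (the MAP rule).
gnb = GaussianNB(var_smoothing=1e-9)       # smoothing value as listed in Table 4
gnb.fit(X_train, y_train)
posteriors = gnb.predict_proba(X_test)     # per-class posterior probabilities
predictions = gnb.predict(X_test)          # argmax over the posteriors
```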
3.3.2 k-Nearest Neighbours classifier

The k-Nearest Neighbours (k-NN) classifier is a simple ML algorithm that employs a lazy-learner approach to classification. It stores all the training examples at training time without building an explicit model from them. At classification time, a search is performed to find the k points in the dataset that are closest to the test example. The Euclidean distance is used as the distance metric in this study. The Euclidean distance between two d-dimensional feature vectors f_1 and f_2 is given by

\text{Euclidean distance}(f_1, f_2) = \sqrt{\sum_{i=1}^{d} (f_{1i} - f_{2i})^2}
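The corresponding scikit-learn sketch, again assuming prepared train and test splits, with k = 1 as reported in Table 4:

```python
from sklearn.neighbors import KNeighborsClassifier

# Lazy learner: fit() only stores the training examples; at prediction time the k
# nearest stored points under the Euclidean metric vote on the emotion label.
knn = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```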
3.3.4 Support Vector Machine

Support Vector Machines (SVMs) are robust binary classifiers that find the hyperplane between two classes of data such that the margin between the two classes is maximized. This hyperplane is constructed in a higher dimensional space by what is known as the kernel trick. A kernel function is suitably chosen for the application, such that the training points are linearly separable in the higher dimensional space. The kernel trick is that the kernel function, when applied to two vectors p and q, gives a value that is the dot product of the two vectors when each of them is projected onto the higher dimensional space. What this higher dimensional space is exactly is not of our concern, and the entire prediction process can be carried out by simply knowing the dot product values, which can be conveniently found using the kernel function. Considering two vectors x_i and x_j and their transformed data points \phi(x_i) and \phi(x_j), the dot product of the two transformed vectors can be found by simply applying the kernel function to x_i and x_j, as shown:

K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)

Choosing the appropriate kernel function is thus of utmost importance. In this study, we use the Gaussian kernel SVM. The Gaussian kernel, or Radial Basis Function (RBF) kernel, applied to two d-dimensional vectors x_i and x_j is given by the following equation:

K(x_i, x_j) = \exp\left( -\frac{\sum_{k=1}^{d} \lVert x_{ik} - x_{jk} \rVert^2}{2\sigma^2} \right)

In the above equation, the term \sum_{k=1}^{d} \lVert x_{ik} - x_{jk} \rVert^2 is the squared L2-norm distance between the vectors x_i and x_j, and the kernel coefficient \gamma is given by \frac{1}{2\sigma^2}. Algorithm 4 presents the support vector machine classifier.

Using grid search and a stratified tenfold cross-validation strategy, we find the most suitable values of the kernel coefficient (\gamma) and the penalty term (C) to be 0.0029 and 100 respectively. With these hyperparameters, the model gives an overall accuracy of 67.57%, which is significantly higher than the accuracy scores given by all the previous models. This is when the model has been trained on the entire set of 652 features. When trained on the 177 selected features, a substantial improvement in performance is observed: the model does exceedingly well and gives an overall accuracy of 77.86%.
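The tuning procedure can be sketched as follows with scikit-learn; the grid values other than the reported optimum (gamma = 0.0029, C = 100) are illustrative assumptions, as is the name of the feature matrix.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# RBF-kernel SVM tuned by grid search under stratified tenfold cross-validation.
param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.0001, 0.001, 0.0029, 0.01]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring='accuracy')
search.fit(X_selected, y)      # X_selected: the 177-feature matrix (assumed prepared)
print(search.best_params_, search.best_score_)
```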
⋆ Details of the BPA (back propagation algorithm) can be found in Rojas (1996).

3.4 Performance evaluation

After training the machine learning models, a comparative analysis of their performance was done. Emotion-wise accuracies were noted using the tenfold cross-validation strategy for each of the classifiers. Models were compared to each other on the basis of their overall accuracies (averaged over all classes or emotions). The overall validation accuracy for a multi-class classification problem with k classes is defined as the ratio of the number of samples that were correctly classified to the total number of samples:

\text{Overall validation accuracy} = \frac{\sum_{i=1}^{k} \text{samples correctly classified as class } i}{\sum_{i=1}^{k} \text{samples belonging to class } i}

Class-wise precision, sensitivity, specificity and F1-scores were also computed for the two contending models and tabulated for reference. These metrics are defined as follows:

\text{Precision} = \frac{TP}{TP + FP}

\text{Sensitivity} = \frac{TP}{TP + FN}

\text{Specificity} = \frac{TN}{TN + FP}

\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}

On the basis of these performance metrics, the Gaussian Naïve Bayes, Random Forest, k-Nearest Neighbours, Support Vector Machine and Multilayer Perceptron classifiers were evaluated and the most suitable model was chosen for the SER framework. The range of metrics reported would also enable users to choose an appropriate learning model according to their requirements.
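The sketch below shows how these class-wise metrics follow from a confusion matrix in a one-vs-rest sense; it is an illustration rather than the authors' evaluation code, and the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, labels):
    """Precision, sensitivity (recall), specificity and F1 for each class, one-vs-rest."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        tn = cm.sum() - tp - fp - fn
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        f1 = (2 * precision * sensitivity / (precision + sensitivity)
              if (precision + sensitivity) else 0.0)
        results[label] = (precision, sensitivity, specificity, f1)
    return results
```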
Table 4 Validation accuracies before and after feature subset selection

Classifier | Hyperparameters                             | Accuracy score (652 features) | Accuracy score (177 features)
GNB        | Smoothing parameter = 1 × 10^-9             | 43.48                         | 44.26
k-NN       | k = 1                                       | 53.25                         | 65.17
RF         | k = 100, Gini split                         | 61.84                         | 64.06
SVM        | γ = 0.0029, C = 100                         | 67.57                         | 77.86
MLP        | 2 hidden layers: (640, relu), (32, sigmoid) | 73.76                         | 79.62
intensity. In total, the speech-only dataset comprises 1440 audio files in wav format, recorded with a sampling frequency of 48 kHz.

4.2 Data pre-processing

Data is read using Python's Librosa library, and features are extracted with the help of Librosa and Parselmouth. The features obtained are then scaled with respect to a standard normal distribution, with mean 0 and variance 1. After this, the MFCC and LFCC features are ranked in decreasing order of their mutual information with the target attribute, and the top 20% of these features are retained and added to the other features (spectral centroids, formants, pitch and intensity); i.e., 177 of the 652 features are retained in the feature subset selection step. After this, to account for the class imbalance in the RAVDESS dataset and balance the number of samples across all emotions, we use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class data. The neutral class is oversampled using SMOTE, and a total of 1536 samples are considered for classification, with each emotion represented by 192 samples. Imblearn (Lemaître et al., 2017) is used for the implementation of SMOTE.
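A compact sketch of this pre-processing, assuming the feature matrix has already been assembled and reduced to the 177 selected columns (X_selected and y are placeholder names):

```python
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Scale every feature to zero mean and unit variance, then oversample the
# under-represented neutral class so that all eight emotions have 192 samples each.
X_scaled = StandardScaler().fit_transform(X_selected)   # X_selected: 1440 x 177 (assumed)
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X_scaled, y)
```

When this is embedded in cross-validation, the scaler and the oversampler are usually fitted on the training folds only; the sketch above simply mirrors the sequential description in the text.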
4.3 Model training

The models are trained and evaluated using a stratified k-fold cross-validation strategy. In stratified sampling, the data is divided into homogeneous sub-populations. The aim is to have each subset or sample be representative of the entire population in terms of the percentage composition of each class. This is preferred over random sampling for machine learning problems, since the training data should be representative of the entire dataset in order to predict unseen examples correctly. K-fold cross-validation is an approach wherein the entire dataset is first divided into k equal folds or subsets of data. Then, in each run, one of the k folds acts as the test set, while the other k - 1 folds are used to train the model. The model is evaluated using the test set and the accuracy is noted. This experiment is performed k times, with each
of the k folds acting as the test set exactly once. The aver-
age accuracy of the k experiments is reported as the k-fold
accuracy. In this study, a stratified tenfold cross-validation
strategy is used to evaluate the classification models.
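A sketch of the evaluation loop is given below; the model objects and the balanced data arrays are placeholders for the classifiers and matrices described above.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified tenfold cross-validation: every fold keeps the class proportions of the
# full dataset and serves as the test set exactly once; the mean accuracy is reported.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in [('GNB', gnb), ('k-NN', knn), ('RF', rf), ('SVM', svm), ('MLP', mlp)]:
    scores = cross_val_score(model, X_balanced, y_balanced, cv=cv, scoring='accuracy')
    print(name, round(scores.mean() * 100, 2))
```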
4.4 Experimental results
This shows that Random Forest is not as sensitive as k-NN to the presence of irrelevant features in the data, and performs satisfactorily even without feature subset selection. While the Random Forest and k-NN classifiers do a good job of classifying speech into emotions, as can be verified from the state-of-the-art approaches tabulated in Table 4, the best accuracies are obtained by the support vector machine and the multilayer perceptron. With the SVM, the overall accuracy under stratified tenfold cross-validation is a considerable 77.86% post feature subset selection, with the maximum fold accuracy being 86.27%. With the MLP, an overall accuracy of 79.62% is achieved, with a maximum of 84.96%. From these results, it is quite clear that SVM and MLP are the best performers in terms of validation accuracy and the only real contenders for the SER framework.

In addition to studying overall accuracies, it is imperative to look at the emotion-wise performance of the classifiers as well. Figure 2 depicts the class-wise performance of all five algorithms. The similarity across these classifiers is that each of them performs well on the neutral class, while the sad class shows the lowest performance scores. The MLP and SVM show fairly decent accuracy scores for the sad class, with recalls of 60.82% and 57.81% respectively. The second worst performing algorithm, namely the k-NN classifier, performs best on the neutral class, correctly recognising 96.88% of all true neutral class samples. This is followed by the MLP, SVM and Random Forest models, with recalls of 91.67%, 91.15% and 87.5% respectively. The confusion matrices of the two best performing algorithms, namely the Support Vector Machine and the Multilayer Perceptron, are depicted in Figs. 3 and 4.

From these confusion matrices, performance metrics like precision, sensitivity, specificity and F1-score are computed and tabulated in Table 5. While the class-wise precision, recall and F1-scores of the MLP are consistently higher than those of the SVM, the class-wise specificity scores of the SVM are greater than those of the MLP, except for the neutral and sad classes. The two algorithms are also compared on the basis of their ROC curves. The area under the ROC curve is a good indicator of a model's performance: the greater the area under the curve (AUC) for a given emotion, the better the model's ability to distinguish between test instances of that emotion. Figure 5 provides the ROC curves for each of the 8 classes and their macro-average for the SVM model, and Figure 6 provides the ROC curves for the MLP model. A comparison of the two plots reveals that the MLP model's AUC is slightly better than the SVM model's on average.
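The per-emotion ROC curves can be obtained in a one-vs-rest fashion, roughly as sketched below; clf stands for any fitted classifier that exposes class-membership scores (for an SVC this requires probability=True), and the data names are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = np.unique(y_balanced)                   # the eight emotion labels (assumed)
y_bin = label_binarize(y_test, classes=classes)   # one indicator column per emotion
scores = clf.predict_proba(X_test)                # class-membership scores
per_class_auc = []
for i in range(len(classes)):
    fpr, tpr, _ = roc_curve(y_bin[:, i], scores[:, i])
    per_class_auc.append(auc(fpr, tpr))           # area under the ROC curve per emotion
macro_auc = float(np.mean(per_class_auc))         # macro-average across the 8 classes
```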
Table 6 Comparison of overall accuracies with other SER approaches reported in the literature for the RAVDESS dataset

Work                       | Features                                                        | Classifier                                                                 | Overall accuracy
Shegokar and Sircar (2016) | Continuous Wavelet Transform (CWT) and other prosodic features  | SVM (linear, quadratic, cubic, Gaussian kernels)                           | Linear SVM: 52.6%; Quadratic SVM: 60.1%; Cubic SVM: 56.3%; Gaussian SVM: 52.3%
Zeng et al. (2019)         | Spectrograms                                                    | Gated Residual Networks (GResNets)                                         | 64.48%
Zamil et al. (2019)        | MFCCs                                                           | Logistic Model Tree (LMT)                                                  | 70%
Bhavan et al. (2019) (a)   | MFCCs and spectral centroids                                    | SVM (Gaussian kernel)                                                      | 75.69%
Christy et al. (2020)      | MFCCs and modulation spectral (MS) features                     | CNN                                                                        | 78.20%
Proposed work              | MFCCs, LFCCs, spectral centroids, formants, pitch and intensity | SVM (Gaussian kernel); MLP (two hidden layers: (640, relu), (32, sigmoid)) | SVM: 77.86%; MLP: 79.62%

(a) Reports the best accuracy with a 90:10 train-test split; in general, average accuracy is reported
Interestingly, for low energy emotions like calm, disgust and fearful, the SVM model does a more sensitive classification (better recall) compared to the MLP model.

Comparing only the overall validation accuracies of the five models from Table 4 might compel us to conclude that MLP is the best model for classification on the basis of mean accuracies, completely neglecting the SVM. It is, however, constructive to look at the training times of the SVM and MLP models to make a fair comparison between the two and decide on the better classifier for the framework. While training a single fold of data (out of the 10 folds) takes 17 μs on average for the SVM, it takes 1.000023 × 10^6 μs on average for the MLP. While one second and 23 microseconds is not a lot in absolute terms, it is several orders of magnitude larger than 17 μs. Moreover, the computational complexity of MLPs tends to be inherently higher due to the large number of parameters (weights) to update in each iteration of the back propagation algorithm.

To test the significance of the difference between the SVM and the MLP, a non-parametric statistical hypothesis test called the Wilcoxon signed-rank test is used. This test evaluates whether two dependent samples are drawn from the same population, and is considered an alternative to the paired Student's t-test for dependent samples. Pairwise tests are carried out for the k-NN, Random Forest, SVM and MLP classifiers; the Naïve Bayes classifier was excluded from this analysis due to its poor performance. These tests are performed using k-fold cross-validation with k = 20, and the level of significance is chosen as 5% (α = 0.05). The null hypothesis for each pairwise test is that the performance of the two classifiers, in terms of their accuracies, does not differ significantly. Figure 7 provides the p-values for all the pairwise tests. From Figure 7, it is clear that the SVM and MLP models are both significantly different from the Random Forest and k-NN models, since the p-values for these four pairs are very close to zero. At the 5% level of significance, there is no significant difference in the performance of the Random Forest and k-NN classifiers. Similarly, there is no significant difference in the performance of the SVM and MLP models.

Fig. 7 Wilcoxon signed-rank test correlation matrix
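A minimal sketch of one such pairwise test with SciPy; the two arrays are placeholders for the matched per-fold accuracies (k = 20) of the classifiers being compared.

```python
from scipy.stats import wilcoxon

# Paired non-parametric test on matched per-fold accuracies of two classifiers.
stat, p_value = wilcoxon(svm_fold_accuracies, mlp_fold_accuracies)
if p_value > 0.05:
    print('No significant difference at the 5% level')   # null hypothesis retained
else:
    print('Performances differ significantly')
```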
4.5 Discussion

We now compare our SER framework, with both the SVM and MLP models, to various other SER frameworks developed in the literature. Table 6 provides a comparison between the overall accuracy of the SVM and MLP models and the overall accuracies of other SER models reported in the literature for the RAVDESS dataset. All our results have been tabulated with the tenfold stratified cross-validation approach. Shegokar and Sircar (2016) extracted features based on the continuous wavelet transform (CWT) and used SVMs with different kernel functions for classification; the highest overall accuracy they reported for the RAVDESS dataset was 60.1%. Zeng et al. (2019) used spectrograms with a deep neural network-based classifier and observed an accuracy of 64.48%. In Zamil et al. (2019), a 13-dimensional feature vector comprising MFCCs was used to train a Logistic Model Tree on the data; the best overall accuracy they reported is 70%. It is important to note here that MFCCs alone do not contain all the emotional information in speech: their delta and delta-delta coefficients, along with other spectral features, play an important role in classifying speech into emotions correctly. Our proposed models perform better than the aforementioned models on the RAVDESS dataset. It should also be noted that emotion recognition on the RAVDESS dataset is not a simple task, with the human accuracy rate reported as 67% for this dataset (Livingstone & Russo, 2018).

The authors in Bhavan et al. (2019) used a bagged ensemble of support vector machines for classification, with MFCCs and spectral centroids as their primary features. With this, their best performance with a 90:10 train-test split is reported as 75.69% on the RAVDESS dataset. In contrast, our best performance is 84.96% using the MLP and 86.27% using the SVM. Christy et al. (2020) used a convolutional neural network to perform speech emotion recognition using a combination of MFCCs and modulation spectral features; their best accuracy was reported as 78.20%. While this is a close competitor to the proposed methods of classification, it is worth going with the SVM model for classification (accuracy = 77.86%) due to its faster training time.

Hence, it can be concluded that the SER framework developed in this study surpasses the reported frameworks in terms of overall accuracy. With a mere 177 features retained by the feature subset selection module, the proposed SVM framework is computationally light and highly reliable. Due to its lower training times compared to its MLP counterpart, the proposed SVM framework is suitable for incorporation into real-time applications for speech emotion recognition.

5 Conclusion

In this paper, a speech emotion recognition framework has been designed and implemented using machine learning algorithms like Gaussian Naïve Bayes, k-Nearest Neighbours, Random Forest, Support Vector Machine and Multilayer Perceptron. A combination of spectral and prosodic features is examined. Of the 652 features, comprising MFCCs, LFCCs, formants, spectral centroids, intensity and pitch, along with their velocity and acceleration components, 177 are retained for further processing based on mutual information with the target attribute. Filter-based
approaches like selecting the top x features based on some criterion like mutual information or Pearson's coefficient are faster and computationally less intensive than wrapper-based approaches. Thus, the choice of this feature subset selection algorithm is fitting for an SER framework designed to work in real-time applications. The features retained after feature subset selection are then used to train the aforementioned ML models, and evaluation of these models is carried out using a stratified ten-fold cross-validation strategy.

Comparative analysis of the models reveals that both the MLP and SVM classifiers perform remarkably well compared to the other three classifiers in terms of validation accuracy. The Wilcoxon signed-rank test is used to statistically determine, for each pair of models, how significantly their performances differ. It is noted that MLP and SVM both perform significantly better than the other classifiers. With overall accuracies of 79.62% and 77.86% respectively, a pairwise test between these two classifiers reports no significant difference in their performances. While the performance of the SVM is only slightly lower than that of the MLP, it is on average around 1000 times faster in terms of training time. Since the aim of the paper is to provide an SER framework that is both reliable and fast enough to be used in real-time applications, the SVM provides a better alternative to the MLP. The performance of both these models is noted to be a significant improvement over other state-of-the-art models reported in the literature for the RAVDESS dataset. A number of neural network-based approaches have been reported in the last couple of years. While these models have obtained accuracies as high as 82.3% on the RAVDESS dataset (Nantasri et al., 2020), it should be noted that the proposed model that uses an SVM is computationally much less taxing to train, and is also much faster compared to a DNN.

5.1 Future work

Further work can be done to incorporate audio quality features like shimmer, jitter and harmonics-to-noise ratio (HNR) to enhance the performance of the system. Moreover, experiments can be performed varying the number of MFCC and LFCC coefficients to see how these affect the performance of the system. The real challenge in designing speech emotion recognition systems is to discover and explore the right set of features to be used, and to focus more on the quality rather than the quantity of acoustic features. While the present study has focused only on paralinguistic aspects of speech, linguistic (lexical) information can also be incorporated via an Automatic Speech Recognition (ASR) system to facilitate emotion recognition. Anagnostopoulos et al. discuss the inclusion of non-linguistic vocalisations to improve SER systems (Anagnostopoulos et al., 2015). These rely on detecting cues like laughter, sighs, yawns and cries, which can be directly linked to one or more emotional states.

Another area with scope for improvement would be trying out different feature selection algorithms. Boruta could be tried to reduce the dimensionality of the extracted features; this algorithm was used by Bhavan et al. (2019) to extract the most emotionally informative subset of features for their bagged ensemble of SVMs. Evolutionary algorithms like the Genetic Algorithm (GA) and swarm-based meta-heuristics like particle swarm optimisation (PSO) could also be experimented with for selecting the most informative subset of features. Filter-based approaches that assign scores to features based on correlation coefficients or the Chi-squared test can also be tried. Moreover, other frameworks of knowledge-based decision systems can be utilised for the feature subset selection and classification phases in SER (Christopher et al., 2016; Kavya et al., 2021). Apart from this, further research can be done on how gender-based information can be exploited to design gender-dependent SER systems, and how they can be an improvement over gender-neutral SER systems.

References

Agrawal, E., & Christopher, J. (2020). Emotion recognition from periocular features. In International conference on machine learning, image processing, network security and data sciences (pp. 194–208). Springer.
Agrawal, E., Christopher, J. J., & Arunachalam, V. (2021). Emotion recognition through voting on expressions in multiple facial regions. ICAART, 2, 1038–1045.
Anagnostopoulos, C.-N., Iliou, T., & Giannoukos, I. (2015). Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artificial Intelligence Review, 43(2), 155–177.
Bhavan, A., Chauhan, P., Shah, R. R., et al. (2019). Bagged support vector machines for emotion recognition from speech. Knowledge-Based Systems, 184, 104886.
Chen, L., Su, W., Feng, Y., Wu, M., She, J., & Hirota, K. (2020). Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction. Information Sciences, 509, 150–163.
Christopher, J. J., Nehemiah, K. H., & Arputharaj, K. (2016). Knowledge-based systems and interestingness measures: Analysis with clinical datasets. Journal of Computing and Information Technology, 24(1), 65–78.
Christy, A., Vaithyasubramanian, S., Jesudoss, A., & Praveena, M. A. (2020). Multimodal speech emotion recognition and classification using convolutional neural network techniques. International Journal of Speech Technology, 23, 381–388.
Daneshfar, F., & Kabudian, S. J. (2020). Speech emotion recognition using discriminative dimension reduction by employing a modified quantum-behaved particle swarm optimization algorithm. Multimedia Tools and Applications, 79(1), 1261–1289.
Gomathy, M. (2021). Optimal feature selection for speech emotion recognition using enhanced cat swarm optimization algorithm. International Journal of Speech Technology, 24(1), 155–163.
Gupta, K., Gupta, M., Christopher, J., & Arunachalam, V. (2020). Fuzzy system for facial emotion recognition. In International conference on intelligent systems design and applications (pp. 536–552). Springer.
Issa, D., Demirci, M. F., & Yazici, A. (2020). Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control, 59, 101894.
Jadoul, Y., Thompson, B., & De Boer, B. (2018). Introducing parselmouth: A python interface to praat. Journal of Phonetics, 71, 1–15.
Kavya, R., Christopher, J., Panda, S., & Lazarus, Y. B. (2021). Machine learning and XAI approaches for allergy diagnosis. Biomedical Signal Processing and Control, 69, 102681.
Koduru, A., Valiveti, H. B., & Budati, A. K. (2020). Feature extraction algorithms to improve the speech emotion recognition rate. International Journal of Speech Technology, 23(1), 45–55.
Kursa, M. B., Jankowski, A., & Rudnicki, W. R. (2010). Boruta - a system for feature selection. Fundamenta Informaticae, 101(4), 271–285.
Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signals. In Eighth European conference on speech communication and technology.
Kwon, S., et al. (2020). CLSTM: Deep feature-based speech emotion recognition using the hierarchical convLSTM network. Mathematics, 8(12), 2133.
Kwon, S., et al. (2021). Mlt-dnet: Speech emotion recognition using 1d dilated CNN based on multi-learning trick approach. Expert Systems with Applications, 167, 114177.
Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559–563.
Liu, G. K. (2018). Evaluating gammatone frequency cepstral coefficients with neural networks for emotion recognition from speech. arXiv preprint arXiv:1806.09010.
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), e0196391.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., & Nieto, O. (2015). Librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, Vol. 8 (pp. 18–25). Citeseer.
Nantasri, P., Phaisangittisagul, E., Karnjana, J., Boonkla, S., Keerativittayanun, S., Rugchatjaroen, A., Usanavasin, S., & Shinozaki, T. (2020). A light-weight artificial neural network for speech emotion recognition using average values of MFCCs and their derivatives. In 2020 17th International conference on electrical engineering/electronics, computer, telecommunications and information technology (ECTI-CON) (pp. 41–44). IEEE.
Pan, Y., Shen, P., & Shen, L. (2012). Speech emotion recognition using support vector machine. International Journal of Smart Home, 6(2), 101–108.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12, 2825–2830.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Petrushin, V. A. (2000). Emotion recognition in speech signal: Experimental study, development, and application. In Sixth international conference on spoken language processing.
Picard, R. W. (2000). Affective computing. MIT Press.
Quan, C., Zhang, B., Sun, X., & Ren, F. (2017). A combined cepstral distance method for emotional speech recognition. International Journal of Advanced Robotic Systems, 14(4), 1729881417719836.
Rojas, R. (1996). The backpropagation algorithm. In Neural networks (pp. 149–182). Springer.
Rong, J., Li, G., & Chen, Y.-P. P. (2009). Acoustic feature selection for automatic emotion recognition from speech. Information Processing & Management, 45(3), 315–328.
Shegokar, P., & Sircar, P. (2016). Continuous wavelet transform based speech emotion recognition. In 2016 10th international conference on signal processing and communication systems (ICSPCS) (pp. 1–8). IEEE.
Surampudi, N., Srirangan, M., & Christopher, J. (2019). Enhanced feature extraction approaches for detection of sound events. In 2019 IEEE 9th international conference on advanced computing (IACC) (pp. 223–229). IEEE.
Tzirakis, P., Zhang, J., & Schuller, B. W. (2018). End-to-end speech emotion recognition using deep neural networks. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5089–5093). IEEE.
Vogt, T., & André, E. (2006). Improving automatic emotion recognition from speech via gender differentiation. In LREC (pp. 1123–1126).
Zamil, A. A. A., Hasan, S., Baki, S. M. J., Adam, J. M., & Zaman, I. (2019). Emotion detection from speech signals using voting mechanism on classified frames. In 2019 international conference on robotics, electrical and signal processing techniques (ICREST) (pp. 281–285). IEEE.
Zeng, Y., Mao, H., Peng, D., & Yi, Z. (2019). Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 78(3), 3705–3722.
Zhou, X., Garcia-Romero, D., Duraiswami, R., Espy-Wilson, C., & Shamma, S. (2011). Linear versus mel frequency cepstral coefficients for speaker recognition. In 2011 IEEE workshop on automatic speech recognition & understanding (pp. 559–564). IEEE.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.