
Multimedia Tools and Applications

https://doi.org/10.1007/s11042-023-16849-x

Speech emotion recognition and classification using hybrid deep CNN and BiLSTM model

Swami Mishra1 · Nehal Bhatnagar1 · Prakasam P1 · Sureshkumar T. R1

Received: 9 January 2022 / Revised: 16 August 2023 / Accepted: 4 September 2023


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023

Abstract
Accurate emotion detection from speech utterances has recently been a challenging and active
research area. Speech emotion recognition (SER) systems play an essential role in human-machine
interaction, virtual reality, emergency services, and many other real-time systems. It is an
open-ended problem, as subjects from different regions and lingual backgrounds convey emotions
altogether differently. The conventional approach used low-level periodic features from audio
samples, such as energy and pitch, for classification, but it was neither efficient enough to
detect emotions accurately nor well generalized. With recent advancements in computer vision and
neural networks, high-level features can be extracted and more accurate recognition achieved.
This study proposes an ensemble deep CNN + Bi-LSTM-based framework for speech emotion recognition
and the classification of seven different emotions. The paralinguistic log Mel-frequency spectral
coefficients (MFSC) are used as features to train the proposed architecture. The proposed hybrid
model is validated with the TESS and SAVEE datasets. Experimental results indicate a classification
accuracy of 96.36%. The proposed model is compared with existing models, proving the superiority
of the proposed hybrid deep CNN and Bi-LSTM model.

Keywords Speech emotion recognition · Deep convolutional neural networks · LSTM · MFSC · Ensemble learning

* Prakasam P
prakasam.p@vit.ac.in
Swami Mishra
swami.mishra2017@vitsudent.ac.in
Nehal Bhatnagar
nehal.bhatnagar2017@vitsudent.ac.in
Sureshkumar T. R
trsureshkumar@vit.ac.in
1 School of Electronics Engineering, Vellore Institute of Technology, Vellore, India


1 Introduction

Human communication has come a long way, from cave drawings to sophisticated
internet media. All humans convey emotions in one way or another, but speech remains
the most crucial channel, as it is the most natural and convenient way to express our feelings.
Speech usually carries two types of information, paralinguistic and linguistic. Linguistic
information conveys language, accent, and dialect, while paralinguistic information
conveys emotional state, context, gender, environment, and attitude; thus, using paralinguistic
features is beneficial, as it helps to understand the emotional state of the speaker.
However, each language has its own way of expressing feelings irrespective of the speaker,
and their interpretation is also subjective [1, 2].
Further, there is also the concept of overlapping feelings, for example, "calm" and
"neutral," or "excited" and "happy." A proper approach is therefore needed to handle
such problems, which is where deep learning-based algorithms come into the picture; they have
entirely changed the way we approach detection and classification problems. Earlier,
hand-crafted low-level features such as energy and pitch formants showed promising
results with SVM, KNN, random forest, and decision trees. However, high-level
paralinguistic features combined with deep-learning-based models have outperformed them in
nearly every segment. Lately, improvements in neural networks have introduced recurrent
neural networks, long short-term memory (LSTM), transformer architectures, and
gated recurrent units (GRU), which have further helped in building better systems
[3, 4].
Applications of speech emotion recognition systems are wide-ranging, from the home
to industry; they are already in use in voice assistants and smart speakers [5]. The human-
machine interface will reach new heights as SER systems increase accessibility and
usability and help people from all walks of life. SER systems will also be beneficial
in therapy sessions, emergency services, interactive classes, e-commerce, social education,
the service industry, call centers, etc.
The significant contributions of this research are

– Propose an efficient and novel ensemble deep learning framework that can learn
features from the log Mel-frequency spectral coefficients (MFSC) extracted from
acoustic speech signals.
– Develop a hybrid model by integrating a deep CNN and a BiLSTM to recognize and
classify the various emotions for an acoustic emotion classification system.
– Classify seven major emotions (neutral, fear, anger, disgust, sadness, surprise, and
happiness) using the proposed model.
– Train and test the proposed framework on the popular TESS and SAVEE corpora,
and observe that the proposed framework achieves good accuracy and outperforms
conventional machine learning techniques.

The rest of the manuscript is organized as follows. Section 2 presents an extensive
literature review on speech emotion recognition. Section 3 presents the proposed
hybrid deep CNN and BiLSTM methodology to recognize and classify the various emotions.
The experimental analysis and discussion of the different performance metrics are
dealt with in Section 4, and finally, Section 5 concludes the proposed research work and
outlines directions for future work.


2 Literature review

Due to the importance of the SER classification in social studies as well as biomedical
studies [6], many researchers have suggested DL-based techniques for emotional classifica-
tions [7, 8]. An extensive survey was carried out and addressed the challenges faced in SER
classification. Also, suitable future directions were suggested for further research. Hackers
have created much fake news in social media, and the SER classification is also a support-
ing tool to detect fake news. A deep attention-based CNN model [9] and hybrid Bi-LSTM
with a self-attention model [10] were proposed to detect the fake news in social media.
This will also support news aggregation and recommendations to particular people based
on speech, emotion, and recognition. Kumbhar and Bhandari [11] focused on extracting
MFCC Features from the RAVDESS corpus, which contains both speech utterances and
song samples, and trained on an LSTM-only model, which was able to achieve an accuracy
of 84.81% but observed a heavy loss of 67.21% in implementation and concluded the need
to reduce the false-positive rate to achieve better performance. Zehra et al. [12] presented
an ensemble learning approach for cross-corpus multi-lingual emotion recognition based
on the majority voting technique. The study used Urdu, Italian, English, and German cor-
pus and experimented with different classifiers such as SVM, random forest, and decision
trees. It achieved an accuracy of 70.14% on the within-corpus English-language SAVEE dataset.
Abbaschian et al. [13] reviewed the various deep learning methodologies. Also, they cov-
ered different viable datasets and feature extraction techniques, thus concluding that the
introduction of LSTMs has helped to take the solution of the SER problem to a new level,
and, in the future, dataset-independent solutions will be required. Three ensemble learn-
ing frameworks comprising convolutional neural networks, bi-directional LSTM, convolu-
tional recurrent neural networks, and GRU with different sample sizes of IEMOCAP cor-
pus and different features were discussed [14]. It achieves a maximum accuracy of 75%
and concludes that a personalized network model combined with customized features will
achieve better efficiency.
Meng et al. [15] presented an ADRNN network to recognize emotions from raw speech
samples. It learns local correlations and global contextual information from 3-dimensional
log-Mel features. It was tested on IEMOCAP and Berlin EMO-DB and achieved
63.84% recognition accuracy in the cross-corpus experiments. Zhang et al. [16] proposed a
new method for spontaneous emotion recognition using multi-CNNs (1D, 2D, 3D) fusion
with features such as raw waveform modeling, 2-dimensional time-frequency modeling,
and temporal-spatial dynamic modeling for each CNN. Experiments were conducted on
the AFEW 5.0 and BAUM-1 corpora, using average pooling for segment-level classification and
a score-level fusion strategy to integrate the segment-level predictions into the final results.
Mustaqeem et al. [17] processed only key segments rather than complete observations to
lessen the computational complexity of the overall model, and normalized the features before
actual processing, as this helps to identify spatio-temporal information; however, using key
segments can lead to under-representation and high class imbalance. FC-1000
layers from a CNN and a BiLSTM are then utilized on the IEMOCAP, EMO-DB, and RAVDESS
corpora to achieve 72.25%, 85.57%, and 77.02% accuracy, respectively.
Zhao et al. [18] proposed two models, the first one being 1D CNN-LSTM and the
other 2D CNN-LSTM, each with four local feature learning blocks that help to
learn local correlations and global contextual information. Tests were conducted
on the IEMOCAP and Berlin EMO-DB corpora with seven emotions in speaker-dependent
and speaker-independent settings and compared with earlier studies. A
self-attention transformer model [19] is proposed because it offers one or more orders
of magnitude lower computing cost, whereas traditional RNNs can retain only a limited
amount of information and suffer from exploding and vanishing gradients. The model
is tested with normalized engineered IS09 and eGeMAPS feature sets extracted using the openSMILE
framework, giving a weighted accuracy of approximately 68.1% and an unweighted
accuracy close to 63.8%. A novel deep dual recurrent neural network (RNN) encoder
model for audio and text data was suggested [20]. The model used transcripts, converted
IEMOCAP audio corpus to text using Google Cloud Speech API with a word error rate
of 5.53%, and extracted MFCC features using the openSMILE toolkit. The Multimodal
Dual Recurrent Encoder with Attention (MDREA) model outperforms the three other
models presented with an accuracy of 68.8%. Schuller [21] discussed this field of speech
emotion recognition from its birth in the 1970s to the latest innovations. The paper has
evaluated early studies on acoustic correlates of emotions to the first patent, the first
strongly influential paper in the field. It also discussed traditional approaches, ongoing
trends, features extraction methods, current SER benchmark engines, all the challenges
researchers face today, and how they can improve the lives of the general population.
Tzirakis et al. [22] demonstrated a new model using the RECOLA database,
which has a mere 46 different recordings divided into four subsets, i.e., audio, video,
electrocardiogram (ECG), and electro-dermal activity (EDA). The corpus also contains
speakers of French, Italian, German, and Portuguese; thus, age, mother tongue, and gender
are taken into account in the model, which has four CNN layers stacked upon two
LSTM RNN layers, achieving an accuracy of 78.7% using MFCC features. Mirsamadi
et al. [23] proposed a structure that combines BiLSTM and a singular pooling tech-
nique with the help of an attention mechanism that allows the network to direct towards
emotionally salient parts of an audio file. Abdelwahab and Busso [24] used a dataset
comprising over 30 h of data, MSP-Podcast, providing a perfect resource to answer
such questions. This paper explored various important architectures to predict arousal,
valence, and dominance scores. The same feature set has been used for the INTER-
SPEECH 2013 Computational Paralinguistics Challenge. This set consists of 65 low-
level descriptors (LLD) extracted from speech, including prosodic, spectral, energy, and
voice quality features. Neumann and Vu [25] presented extensive experiments which,
with the help of an attentive CNN and a multi-view learning objective function, pro-
duced good results. The system performance has been compared by using different
types of emotional speech, which can be improvised or scripted, different acoustic fea-
tures, and varying input lengths. These results suggested that the choice of features for a
CNN is less important than the proposed architecture and the amount and type of
training data used.
Harar et al. [26] proposed an architecture incorporating DNN with convolution, pooling,
and fully connected layers. The DNN, optimized using gradient descent, achieved an
average file-prediction confidence of 69.55%. Lotfidereshgi and Gournay [27] proposed
a technique that does not involve the complicated feature extraction step, as it operates
directly on the speech signal. Moreover, this technique combines the power of the recently
established liquid state machine (LSM), spiking neural networks (SNN), and the source-filter
model of human speech production. Tzinis
and Potamianos [28] used local and global features, i.e., LLDs (Low-Level Descriptors) and
statistical functionals, to train the LSTM unit and to probe the correct decision time scale for
each unit. The suggested model used a segment-wise learning approach combining
global and local features to improve the accuracy of the emotional-context decision.
Applying the method to the IEMOCAP database with a simple LSTM layer
gives an accuracy of 64.16% for speaker-independent SER.


Shegokar and Sircar [29] suggested two feature selection techniques. The first one
is Morlet continuous wavelet transform (MCWT), and the second is prosodic fea-
tures, including linear predictive coding coefficients, RMS energy, and zero-crossing
rate(ZCR). RAVDESS corpus was utilized with a reduced sampling rate of 16 kHz and
an SVM classifier for speaker-independent systems to identify eight emotions. Dangol
et al. [30] used a 3D log mel-spectrum of audio samples from speech emotion corpus
as features. They proposed two architectures, CNN and CNN with LSTM, for identify-
ing various emotional states. Singh et al. [31] proposed a method that achieved a
29.74% increment in classification accuracy when compared to baseline accuracy on
the primary dataset (CREMA-D). The lack of availability of a large amount of data to
train deep neural networks results in the SVM-Classifier giving better classification
results than the R-Classifier.
Zong et al. [32] studied the challenging cross-corpus SER problem and proposed a
novel Domain-adaptive Least Squares Regression (DaLSR) model based on their pre-
vious work. In this method, they pick a supplementary set of unlabeled testing samples
from the target dataset, which serves as an auxiliary set to jointly train the Least
Squares Regression (LSR) model together with the labeled training samples from
the Berlin, eNTERFACE, and AFEW 4.0 corpora, with a feature set provided by the INTER-
SPEECH 2009 Emotion Challenge that can be extracted using the openSMILE software.
Since the data distributions of speech feature points would differ between source and
target corpora, they propose adding a regularization restriction to the LSR objective
function to obtain DaLSR to alleviate such differences. In this case, the DaLSR model
learned from the labeled training data set and the unlabeled auxiliary data set would
apply to distinguishing the emotional states of the unlabeled target speech corpus.
Zeng et al. [33] proposed a CNN-based Gated Residual Neural Networks (GResNets)
model to study a shared representation among all spectrograms from multiple classification
tasks. GResNets is a variation of a deep Residual Networks (ResNets) model
with the extension of a gate mechanism. The above method concluded that multi-task
models for related speech classification tasks surpass the task-specific ones. Yadav
and Vishwakarma [34] reviewed the popular nature-inspired sentiment analysis algo-
rithms. It provided insights into how the animal kingdom views emotions and how it
shapes their daily lives. For example, ant-colony optimization gives details on how
emotions are utilized to find the shortest and quickest path; cuckoo search tells us how
these species survive without creating their nests, and firefly algorithms explain their
reproductive patterns based on the flickering and intensity they produce. In the end, it
concluded that it is tough to achieve good accuracy due to the complex nature of the
problem.
Mohan and Ramesh Babu [35] emphasized a different approach to SER and used
isolated word recognition. MFCCs were used as features, and DTW was used for sequence
identification by calculating distances over five different words and their combinations to
achieve a 100% efficient system; it was concluded that the DTW distance between two
distinct words could be 300 or more. The reviewed literature motivated the develop-
ment of a new artificial intelligence-based SER method to detect and classify emotion.
In this study, we integrate CNN with BiLSTM to develop an ensemble hybrid deep
CNN-BiLSTM classifier to detect and classify the emotion categories using log mel
frequency spectral coefficients from the voice utterances as input. The emotion classi-
fiers are then ensembled using MLP to obtain a final prediction.


Fig. 1  Hybrid deep CNN and BiLSTM architecture for speech emotion recognition and classification

3 Materials and methods

This section discusses the overall framework of the proposed hybrid deep CNN and BiL-
STM for speech emotion classification, illustrated in Fig. 1.

3.1 Data collection

This study is conducted on a combination of widely used and publicly available speech
emotion datasets – TESS and SAVEE.
Toronto Emotional Speech Set (TESS):1 This dataset was created in an acted environ-
ment comprising two native English-speaking female actors. The dataset consists of the
following emotions: angry, pleasant surprise, disgust, happy, sad, fear, and neutral. The
dataset contains a total of 2800 samples, with each emotion having 400 samples. For our
purpose, the pleasant surprise class was considered as the surprise emotion.
Surrey Audio-Visual Expressed Emotion (SAVEE) Database:2 This dataset consists of
480 vocal utterances by four British actors, which differs from the North American accent
used in the TESS dataset. The emotions have been categorized into seven classes, namely:
anger, neutral, surprise, fear, sadness, happiness, and disgust.
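
As a rough illustration of how the two corpora can be combined into a single labeled index, a minimal Python sketch is given below. The directory layout, the SAVEE file-name emotion codes, and the mapping of TESS pleasant surprise to surprise are assumptions based on the description above, not the authors' code, and may need adjusting to the downloaded copies.

    import os
    import pandas as pd

    # Assumed SAVEE file-name prefixes; adjust if the files carry speaker prefixes.
    SAVEE_CODES = {"a": "angry", "d": "disgust", "f": "fear", "h": "happy",
                   "n": "neutral", "sa": "sad", "su": "surprise"}

    def index_tess(root):
        # Assumes one folder per (speaker, emotion), e.g. "OAF_angry", "YAF_pleasant_surprise".
        rows = []
        for folder in os.listdir(root):
            emotion = folder.split("_", 1)[-1].lower()
            if emotion.startswith("pleasant"):      # treated as 'surprise' in this study
                emotion = "surprise"
            for fname in os.listdir(os.path.join(root, folder)):
                rows.append((os.path.join(root, folder, fname), emotion))
        return rows

    def index_savee(root):
        # Assumes files named like "a01.wav", "sa05.wav" inside one folder per actor.
        rows = []
        for fname in os.listdir(root):
            code = fname[:2] if fname[:2] in SAVEE_CODES else fname[:1]
            rows.append((os.path.join(root, fname), SAVEE_CODES[code]))
        return rows

    df = pd.DataFrame(index_tess("TESS/") + index_savee("SAVEE/DC/"),
                      columns=["path", "emotion"])
    print(df["emotion"].value_counts())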

3.2 Feature extraction

The log Mel-frequency spectral coefficient (MFSC) features are extracted from the
input speech audio files to train the proposed hybrid deep CNN and BiLSTM model. The
primary reason to choose MFSC over MFCC, the most extensively used feature for speech
and emotion recognition, is that MFSC has been reported to outperform MFCC [36].
Missing values arose in the data because the spectral coefficients were taken over the
complete length of each sample and a mean value was taken for each time step, which led
to variable input shapes. The missing values were filled with zeros for all recordings to
obtain a uniform input shape. A sample speech signal and the corresponding spectrogram
and chromagram are shown in Fig. 2.
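
A minimal sketch of this feature pipeline is given below, assuming librosa is used for the log Mel spectrogram. The frame parameters, the per-frame mean across Mel bands, and the fixed length of 308 time steps are assumptions consistent with the description above and the (308, 1) input shape in Table 1, not values reported by the authors.

    import numpy as np
    import librosa

    MAX_FRAMES = 308  # assumed fixed sequence length, matching the (308, 1) input in Table 1

    def extract_mfsc_sequence(wav_path, sr=16000, n_mels=64):
        y, _ = librosa.load(wav_path, sr=sr)
        # Mel filter-bank energies (MFSC) with log compression
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)          # shape: (n_mels, n_frames)
        seq = log_mel.mean(axis=0)                  # one mean value per time step
        # Zero-fill (or truncate) so every recording has a uniform input shape
        if len(seq) < MAX_FRAMES:
            seq = np.pad(seq, (0, MAX_FRAMES - len(seq)))
        else:
            seq = seq[:MAX_FRAMES]
        return seq.reshape(MAX_FRAMES, 1)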

1 https://tspace.library.utoronto.ca/handle/1807/24487.
2 http://kahlan.eps.surrey.ac.uk/savee/Download.html.


Fig. 2  a Speech Signal b Spectrogram and c Chromagram of a sample disgust emotion signal


Fig. 3  BiLSTM architecture model

3.3 BiLSTM model

The proposed BiLSTM model is illustrated in Fig. 3. The BiLSTM model uses two
LSTMs connected in the forward and backward directions: one LSTM processes the input
sequence from its beginning to its end, and the other processes it from the end back to the beginning.
At each time step t, the hidden state of the forward LSTM, $\overrightarrow{h}_t$, is obtained
from the previous hidden state $\overrightarrow{h}_{t-1}$ and the current input $x_t$. In the
backward LSTM, the hidden state $\overleftarrow{h}_t$ is obtained from the future hidden state
$\overleftarrow{h}_{t+1}$ and the current input $x_t$.
Therefore, the hidden layer value of the forward layer is given by

$\overrightarrow{h}_t = g\left(\overrightarrow{b}_i \, \overrightarrow{h}_{t-1} + \overrightarrow{W}_i \, x_t\right)$  (1)

Also, the hidden layer value of the backward layer is given by

$\overleftarrow{h}_t = g\left(\overleftarrow{b}_i \, \overleftarrow{h}_{t+1} + \overleftarrow{W}_i \, x_t\right)$  (2)

The forward hidden-layer output $\overrightarrow{h}_t$ and the backward hidden-layer output
$\overleftarrow{h}_t$ are combined to extract the predicted target features. After the target
features are extracted, they are passed through a few additional layers for emotion classification.
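
A minimal Keras sketch of such a bidirectional layer is given below; the 128 units per direction (giving the 256-dimensional concatenated output listed in Table 1) and the example input shape are assumptions, not the authors' exact code.

    import tensorflow as tf
    from tensorflow.keras import layers

    # A sequence of feature vectors (e.g. 4 time steps of 64 CNN features) is read by a
    # forward and a backward LSTM, and their final hidden states are concatenated.
    inputs = layers.Input(shape=(4, 64))
    bi_out = layers.Bidirectional(layers.LSTM(128))(inputs)   # output shape: (None, 256)
    model = tf.keras.Model(inputs, bi_out)
    model.summary()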


Table 1  Summary of the various layers of the proposed hybrid model

Layer (Type)          Output Shape       Param #
Input Layer           (None, 308, 1)     0
Conv1D                (None, 308, 256)   2304
Conv1D                (None, 308, 256)   524544
Dropout               (None, 308, 256)   0
MaxPooling1D          (None, 38, 256)    0
Conv1D                (None, 38, 128)    262272
Conv1D                (None, 38, 128)    131200
Conv1D                (None, 38, 128)    131200
Conv1D                (None, 38, 128)    131200
Batch Normalization   (None, 38, 128)    512
MaxPooling1D          (None, 4, 128)     0
Conv1D                (None, 4, 64)      65600
Conv1D                (None, 4, 64)      32832
Bi-LSTM               (None, 256)        197632
Dense                 (None, 256)        65792
Dropout               (None, 256)        0
Dense                 (None, 2)          514

3.4 Proposed hybrid deep CNN and BiLSTM model for speech emotion recognition
and classification

The proposed hybrid deep CNN and BiLSTM architecture learns from the log Mel-frequency
spectral coefficients (MFSC); seven binary emotion classifiers are trained for 50
epochs each in this proposed model. 70% of the dataset was used for training, 20% for
validation, and 10% for testing. The architecture summary of the proposed hybrid deep
CNN and BiLSTM is shown in Table 1. Each classifier is trained to identify its own emotion
class, while the rest of the emotions are labeled as 'other.'
In this model, 1D CNNs are used instead of 2D CNNs because 1D convolutions allow larger
filters, so learning an internal representation of sequential data becomes much easier.
Since audio samples are sequential data, a neural network that can remember past
dependencies is required. Thus, we use bidirectional long short-term memory units
(Bi-LSTM), which improve on traditional RNNs, perform well on massive datasets, and
mitigate the vanishing-gradient problem. Audio signals are continuous in the time domain,
and each frame carries emotional information. The Bi-LSTM allows our architecture
to preserve and propagate information between neighboring frames and thus reflect the
temporal continuity of the features.
After this step, all the binary classifier models are ensembled using an MLP layer.
The MLP model consists of an input layer with ReLU as the activation function, followed
by a dense layer with ReLU activation and an output layer with the SoftMax activation
function; the number of epochs is set to 50. Other hyperparameters include the Adam
optimizer, categorical cross-entropy loss, and accuracy as the metric.
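
A sketch of one such binary classifier, following the layer list in Table 1, is given below. The kernel size of 8, the 'same' padding, and the pool size of 8 are inferred from the parameter counts and output shapes in Table 1, and the dropout rates are assumptions; the MLP used to ensemble the seven classifiers is not shown.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_emotion_classifier(input_len=308):
        # One binary (target emotion vs. 'other') deep CNN + BiLSTM classifier (cf. Table 1).
        return models.Sequential([
            layers.Input(shape=(input_len, 1)),
            layers.Conv1D(256, 8, padding="same", activation="relu"),
            layers.Conv1D(256, 8, padding="same", activation="relu"),
            layers.Dropout(0.2),
            layers.MaxPooling1D(pool_size=8),                     # 308 -> 38
            layers.Conv1D(128, 8, padding="same", activation="relu"),
            layers.Conv1D(128, 8, padding="same", activation="relu"),
            layers.Conv1D(128, 8, padding="same", activation="relu"),
            layers.Conv1D(128, 8, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling1D(pool_size=8),                     # 38 -> 4
            layers.Conv1D(64, 8, padding="same", activation="relu"),
            layers.Conv1D(64, 8, padding="same", activation="relu"),
            layers.Bidirectional(layers.LSTM(128)),               # (None, 256)
            layers.Dense(256, activation="relu"),
            layers.Dropout(0.3),
            layers.Dense(2, activation="softmax"),
        ])

    model = build_emotion_classifier()
    model.summary()   # parameter counts reproduce those in Table 1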


Table 2  Simulation environment/parameters

Parameter           Type/Range
Optimizer           Adam optimizer
Loss function       Cross-entropy
Learning rate       0.003
Step size           300
Batch size          64
Bias value          0
Number of epochs    50

4 Results and discussion

The experimental analysis of the proposed hybrid deep CNN and BiLSTM model is done in
two phases. In the first phase, training is done with the training dataset, and the network is
exposed to all the labels; in the next phase, testing is done with the test dataset. The model is
built in a Jupyter Notebook (Anaconda) in Python. The datasets used in this research are
TESS and SAVEE; the TESS dataset contains 2800 audio files and SAVEE contains 480
audio files. The combined 3280 samples have been divided into 2296 for training, 656 for
validation, and 328 for testing the proposed model.

4.1 Simulation environment

The computational setup used for the proposed architecture is a GPU-powered Windows
10 machine equipped with an Intel Core i7 processor (9th generation), 16 GB of memory,
and an NVIDIA GeForce 1660 Ti GPU with 6 GB of onboard memory, driven by the Keras
API of the TensorFlow library and the NVIDIA cuDNN GPU-accelerated toolkit. The various
simulation parameters used to simulate the proposed hybrid deep CNN and BiLSTM model
are illustrated in Table 2.
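
A minimal training sketch with the settings of Table 2 is shown below, reusing the classifier sketched in Section 3.4. The random arrays exist only to make the snippet runnable, and the mapping of the reported step size of 300 onto a Keras setting is not specified by the authors, so it is omitted.

    import numpy as np
    import tensorflow as tf

    # Placeholder data with the (308, 1) input shape; replace with the real MFSC features.
    x_train = np.random.randn(128, 308, 1).astype("float32")
    y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, 128), num_classes=2)

    # Adam with learning rate 0.003, cross-entropy loss, batch size 64, 50 epochs (Table 2).
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, validation_split=0.2, batch_size=64, epochs=50)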
The performance evaluation metrics used to analyze the proposed model are explained
in the following subsection.

4.2 Evaluation metrics

The performance of the proposed Hybrid Deep CNN and BiLSTM is examined using the
confusion matrix, Recall, Precision, F1 Score, and AUC-ROC curve.

4.2.1 Confusion matrix

The confusion matrix summarizes the predictions of a classification problem. Counts of
correct and incorrect predictions are summarized, giving insight not only into the errors
but also into the type of errors made by the model. Table 3
represents the confusion matrix formation of the Deep Learning model.
where TP stands for True positive, i.e., the label is positive and predicted to be positive.
FP stands for False-positive, i.e., the label is negative but is predicted positive.
FN stands for False-negative, i.e., the label is positive but is predicted negative.


Table 3  Confusion matrix formation

                   Class 1 (predicted)   Class 2 (predicted)
Class 1 (actual)   TP                    FN
Class 2 (actual)   FP                    TN

TN stands for True negative, i.e., the label is negative and is predicted to be negative.

4.2.2 Precision

Precision is a statistical measure for machine learning classification models that quantifies
correct positive predictions. It tells us what proportion of the files predicted as a specific
class label actually belong to that class. It is the ratio of correctly predicted positive samples
to the total number of samples predicted as positive. A high precision value corresponds to a
low false-positive rate. This metric is particularly important when false positives carry a high cost.

Precision = TP / (TP + FP)    (3)

4.2.3 Recall

Recall is another performance metric that tells us what proportion of files of a particular
class label has been predicted as files of that particular class label. The recall summarizes
how well a given class has been predicted. The recall is the ratio of correctly predicted
positive samples to all the samples in the actual class.
Recall = TP / (TP + FN)    (4)

4.2.4 F1‑ score

The F1 score is a performance metric incorporating precision and recall into a single
quantity. It is computed as the harmonic mean of recall and precision and is one of the most
commonly used metrics for multi-class classification problems.

Table 4  Performance metrics of the proposed hybrid deep CNN and BiLSTM model

Emotion class   Precision   Recall   F1 score   Accuracy %
Fear            0.87        0.77     0.82       96.10
Neutral         1.00        0.94     0.97       99.02
Angry           0.91        0.73     0.81       93.83
Disgust         0.97        0.73     0.84       95.77
Sad             0.91        0.95     0.93       98.05
Surprise        0.98        0.98     0.98       96.42
Happy           0.68        0.90     0.78       95.12


Fig. 4  Classification matrix and ROC curve of angry (a, b), disgust (c, d), fear (e, f), happy (g, h), neutral (i, j), surprise (k, l), sad (m, n) of the proposed model

F1-Score = (2 × Recall × Precision) / (Recall + Precision)    (5)
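
As a generic illustration of how these quantities can be computed with scikit-learn (the labels below are made up purely for illustration and are not the study's results):

    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = target emotion, 0 = 'other'
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall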

4.2.5 AUC‑ROC curve

Since each of the proposed binary classifiers distinguishes a target emotion from the rest,
the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is also used
for performance evaluation. The ROC is a probability curve, and the AUC is a measure of
separability: it determines how capable the model is of distinguishing between classes.
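
A corresponding scikit-learn sketch for the ROC curve and AUC of one binary emotion classifier (the scores below are again made up for illustration):

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
    y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probability of the target emotion

    fpr, tpr, _ = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))

    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.title("ROC curve of one binary emotion classifier")
    plt.show()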
Table 4 shows the detailed performance metrics in terms of Precision, Recall, F1 score,
and Accuracy of the binary classifiers for each emotion class from the TESS and SAVEE
combined dataset. The proposed model is trained and tested to detect and classify the seven
different emotions: neutral, fear, anger, disgust, sad, surprise, and happy.
From Table 4, it is found that the neutral emotion shows the highest accuracy of
99.02%, while the disgust and sad emotions show 95.77% and 98.05% accuracy, and the
happy emotion shows an accuracy of 95.12%. Precision for the fear and happy emotions is
relatively lower than for their counterparts; one reason for the higher misrecognition of
positive instances may be that features of fear and happiness were confused with those of
other emotions, and another may be the limited amount of data. Research in recent years
has focused on extracting unique, discriminative features for each emotion. After combining
the seven emotions for the multi-class classification, the training accuracy and the testing
loss are 96.36% and 0.036, respectively. The receiver operating characteristic (ROC) curve
measures how well each emotion is uniquely identified against the others; as can be
interpreted from the ROC graphs, all emotion classifiers perform very well, and their area
under the curve is high. The confusion matrix and the ROC curve for the different emotions
are shown in Fig. 4.

4.3 Performance comparison

The performance of the proposed hybrid deep CNN and BiLSTM model is compared
with existing architectures to demonstrate its superiority. The various methods/models
considered for the comparison are ADRNN [15], CB-SER [17], DeepNet [37], a pre-trained
image classification network [38], CNN-BiLSTM [39], a CNN + self-attention model [40],
DCRNN + ensemble classifier [41], IMEMD + CRNN [42], and DNN [43]. The performance
comparison in terms of testing accuracy, precision, F1 score, and inference & feature
generation time is tabulated in Table 5.
Also, for better visualization, the comparison of the testing accuracy for the proposed
Hybrid Deep CNN and BiLSTM Model, along with a few reported models, is illustrated in
Fig. 5.
Table 5 and Fig. 5 show that the proposed hybrid deep CNN and BiLSTM model achieves
good accuracy, precision, and F1 scores of 96.36%, 0.9303, and 0.9257, respectively. This
indicates that, using the proposed model, the various emotions have been classified more
accurately. Also, to measure the time complexity of the proposed model, the inference and
feature generation time has been calculated and compared with the existing methods. The
proposed model requires only 990 ms, whereas the other reported models/methods take
more time. This is due to the 1D CNN being a very simple model to configure.


Table 5  Comparison of the proposed architecture with other methods

Classifier/Model                                 Accuracy   Precision   F1 score   Inference & feature generation time (ms)
ADRNN [15]                                       84.99%     -           -          7187
CB-SER [17]                                      85.57%     0.86        0.811      5396
DeepNet [37]                                     94.0%      -           -          -
Pre-trained image classification network [38]    80.5%      0.803       0.796      2950
CNN + BiLSTM [39]                                92.2%      0.902       0.919      1260
CNN + Self attention model [40]                  90.0%      0.88        0.9        -
DCRNN + Ensemble classifier [41]                 85.31%     -           0.842      -
IMEMD + CRNN [42]                                93.54%     0.91        0.93       1168
DNN [43]                                         92.62%     -           -          -
Proposed hybrid deep CNN + BiLSTM                96.36%     0.9303      0.9257     990

5 Conclusion

Speech emotion recognition systems are still an open ground for research as emotions are
highly dependent on and influenced by external factors. In this research, a hybrid deep
CNN and BiLSTM model is trained using log Mel-frequency spectral coefficients (MFSC)
extracted from the speech signal as features. The proposed model is verified and validated
with the TESS and SAVEE datasets. The validation found that the proposed model can
classify seven different emotions (neutral, fear, angry, disgust, sad, surprise, and happy)
with 96.36% accuracy. It is observed that other classification
models with different features worked differently on different datasets, which raises a natu-
ral question of which classifier and feature work best on all kinds of datasets. This study
concludes that the ensemble learning approach combined with CNN and BiLSTM outper-
forms single learners and achieves high accuracy.

[Fig. 5: bar chart of the testing accuracy (%) of each method listed in Table 5.]

Fig. 5  Performance comparison of the various methods – testing accuracy

To enable the proposed architecture to be used in real-world scenarios, it should be
evaluated on corpora of different regional languages recorded in natural environments. In
the future, our main priority will be to diversify the model with a flexible structure that
can adapt to various speech styles in different environments and exploit other
paralinguistic features.

Authors’ contributions Prakasam P and Sureshkumar T R devised and proofread the main conceptual ideas.
Swami Mishra and Nehal Bhatnagar worked on almost all the technical details, devising the model, data
collection, and experimentation. Prakasam P performed the evaluation metrics to validate the proposed
model. Sureshkumar T R, Swami Mishra, and Nehal Bhatnagar prepared the entire manuscript, which
was verified by Prakasam P.

Data availability Will be made available on request.

Code availability Will be made available on request.

Declarations
Conflicts of interest We hereby declare that there is no conflict of interest in this research work/paper.

References
1. Chen J, Wang C, Wang K et al (2021) HEU Emotion: a large-scale database for multimodal emo-
tion recognition in the wild. Neural Comput Appl 33:8669–8685. https://​doi.​org/​10.​1007/​
s00521-​020-​05616-w
2. Zeng Y, Mao H, Peng D (2019) Spectrogram-based multi-task audio classification. Multimed Tools
Appl 78:3705–3722. https://​doi.​org/​10.​1007/​s11042-​017-​5539-3
3. Jahangir R, Teh YW, Hanif F et al (2021) Deep learning approaches for speech emotion recognition:
state of the art and research challenges. Multimed Tools Appl 80:23745–23812. https://​doi.​org/​10.​
1007/​s11042-​020-​09874-7
4. Jaiswal S, Nandi GC (2020) Robust real-time emotion detection system using CNN architecture. Neu-
ral Comput Appl 32:11253–11262. https://​doi.​org/​10.​1007/​s00521-​019-​04564-4
5. Likitha MS, Gupta SRR, Hasitha K, Raju AU (2017) Speech-based human emotion recognition using
MFCC. In: 2017 international conference on wireless communications, signal processing and network-
ing (WiSPNET), pp 2257–2260. https://​doi.​org/​10.​1109/​WiSPN​ET.​2017.​83001​61
6. Atmaja BT, Sasou A, Akagi M (2022) Survey on bimodal speech emotion recognition from acoustic
and linguistic information fusion. Speech Commun 140:11–28. https://​doi.​org/​10.​1016/j.​specom.​2022.​
03.​002
7. Monisha A, Tamanna S, Sadia S (2022) A review of the advancement in speech emotion recognition
for Indo-Aryan and Dravidian Languages. Adv Hum-Comput Interact 2022:9602429. https://​doi.​org/​
10.​1155/​2022/​96024​29
8. Lope JD, Graña M (2023) An ongoing review of speech emotion recognition. Neurocomputing 528:1–
11. https://​doi.​org/​10.​1016/j.​neucom.​2023.​01.​002
9. Luvembe AM, Li W, Li S, Liu F, Xu G (2023) Dual emotion based fake news detection: a deep atten-
tion-weight update approach. Inf Process Manag 60(4):103354. https://​doi.​org/​10.​1016/j.​ipm.​2023.​
103354
10. Mohapatra A, Thota N, Prakasam P (2022) Fake news detection and classification using hybrid BiL-
STM and self-attention model. Multimed Tools Appl 81:18503–18519. https://​doi.​org/​10.​1007/​
s11042-​022-​12764-9
11. Kumbhar HS, Bhandari SU (2019) Speech emotion recognition using MFCC features and LSTM net-
work. In: 2019 5th international conference on computing, communication, control and automation
(ICCUBEA), pp 1–3. https://​doi.​org/​10.​1109/​ICCUB​EA475​91.​2019.​91290​67
12. Zehra W, Javed AR, Jalil Z (2021) Cross corpus multi-lingual speech emotion recognition using
ensemble learning. Complex Intell Syst. https://​doi.​org/​10.​1007/​s40747-​020-​00250-4
13. Abbaschian BJ, Sierra-Sosa D, Elmaghraby A (2021) Deep learning techniques for speech emotion
recognition, from databases to models. Sensors 21:1249. https://​doi.​org/​10.​3390/​s2104​1249


14. Zheng C, Wang C, Jia N (2020) An ensemble model for multi-level speech emotion recognition. Appl
Sci 10:205. https://​doi.​org/​10.​3390/​app10​010205
15. Meng H, Yan T, Yuan F, Wei H (2019) Speech emotion recognition from 3D Log-Mel spectrograms
with deep learning network. IEEE Access 7:125868–125881. https://​doi.​org/​10.​1109/​ACCESS.​2019.​
29380​07
16. Zhang S, Tao X, Chuang Y, Zhao X (2021) Learning deep multimodal affective features for spontane-
ous speech emotion recognition. Speech Commun 127:73–81. https://​doi.​org/​10.​1016/j.​specom.​2020.​
12.​009
17. Mustaqeem, Sajjad M, Kwon S (2020) Clustering-based speech emotion recognition by incorporating
learned features and deep BiLSTM. IEEE Access 8:79861-79875. https://​doi.​org/​10.​1109/​ACCESS.​
2020.​29904​05
18. Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM net-
works. Biomed Signal Process Control 47:312–323. https://​doi.​org/​10.​1016/j.​bspc.​2018.​08.​035
19. Tarantino L, Garner PN, Lazaridis A (2019) Self-attention for speech emotion recognition. Proc Inter-
speech 2019:2578–2582. https://​doi.​org/​10.​21437/​Inter​speech.​2019-​2822
20. Yoon S, Byun S, Jung K (2018) Multimodal speech emotion recognition using audio and text. 2018
IEEE spoken language technology workshop (SLT), 112–118. https://​doi.​org/​10.​1109/​SLT.​2018.​86395​
83
21. Schuller BW (2018) Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing
trends. Commun ACM 61(5):90–99. https://​doi.​org/​10.​1145/​31293​40
22. Tzirakis P, Zhang J, Schuller BW (2018) End-to-end speech emotion recognition using deep neural net-
works. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP),
pp 5089–5093. https://​doi.​org/​10.​1109/​ICASSP.​2018.​84626​77
23. Mirsamadi S, Barsoum E, Zhang C (2017)Automatic speech emotion recognition using recurrent neu-
ral networks with local attention. In: 2017 IEEE international conference on acoustics, speech and
signal processing (ICASSP), pp 2227–2231. https://​doi.​org/​10.​1109/​ICASSP.​2017.​79525​52
24. Abdelwahab M, Busso C (2018) Study of dense network approaches for speech emotion recognition.
In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp
5084–5088. https://​doi.​org/​10.​1109/​ICASSP.​2018.​84618​66
25. Neumann M, Vu NT (2017) Attentive convolutional neural network based speech emotion recogni-
tion: a study on the impact of input features, signal length, and acted speech. In: Proceedings of the
Annual Conference of the International Speech Communication Association (INTERSPEECH), pp
1263–1267. https://​doi.​org/​10.​21437/​Inter​speech.​2017-​917
26. Harár P, Burget R, Dutta MK (2017) Speech emotion recognition with deep learning. In: 2017 4th
international conference on signal processing and integrated networks (SPIN), pp 137–140. https://​doi.​
org/​10.​1109/​SPIN.​2017.​80499​31
27. Lotfidereshgi R, Gournay P (2017) Biologically inspired speech emotion recognition. In: 2017 IEEE
international conference on acoustics, speech and signal processing (ICASSP), pp 5135–5139. https://​
doi.​org/​10.​1109/​ICASSP.​2017.​79531​35
28. Tzinis E, Potamianos A (2017) Segment-based speech emotion recognition using recurrent neural
networks. In: 2017 seventh international conference on affective computing and intelligent interac-
tion (ACII), pp 190–195. https://​doi.​org/​10.​1109/​ACII.​2017.​82735​99
29. Shegokar P, Sircar P (2016) Continuous wavelet transform based speech emotion recognition. In:
10th international conference on signal processing and communication systems (ICSPCS), pp 1–8.
https://​doi.​org/​10.​1109/​ICSPCS.​2016.​78433​06
30. Dangol R, Alsadoon A, Prasad PWC et al (2020) Speech emotion recognition using convolutional
neural network and long-short TermMemory. Multimed Tools Appl 79:32917–32934. https://​doi.​
org/​10.​1007/​s11042-​020-​09693-w
31. Singh R, Puri H, Aggarwal N, Gupta V (2020) An efficient language-independent acoustic emotion
classification system. Arab J Sci Eng 45:3111–3121
32. Zong Y, Zheng W, Zhang T, Huang X (2016) Cross-corpus speech emotion recognition based on
domain-adaptive least-squares regression. IEEE Signal Process Lett 23(5):585–589. https://​doi.​org/​
10.​1109/​LSP.​2016.​25379​26
33. Zeng Y, Mao H, Peng D, Yi Z (2017) Spectrogram based multi-task audio classification. Multimed
Tools Appl 78:3705–3722
34. Yadav A, Vishwakarma DK (2020) A comparative study on bio-inspired algorithms for sentiment
analysis. Clust Comput 23:2969–2989. https://​doi.​org/​10.​1007/​s10586-​020-​03062-w
35. Mohan BJ, Ramesh Babu N (2014) Speech Recognition using MFCC and DTW. In: International con-
ference on advances in electrical engineering (ICAEE), pp 1–4. https://​doi.​org/​10.​1109/​ICAEE.​2014.​
68385​64


36. Yenigalla P, Kumar A, Tripathi S, Singh C, Kar S, Vepa J (2018) Speech emotion recognition using
spectrogram & phoneme embedding. Proc Interspeech 2018:3688–3692. https://​doi.​org/​10.​21437/​Inter​
speech.​2018-​1811
37. Anvarjon T, Mustaqeem, Kwon S (2020) Deep-Net: a lightweight CNN-based speech emotion recog-
nition system using deep frequency features. Sensors 20:5212. https://​doi.​org/​10.​3390/​s2018​5212
38. Lech M, Stolar M, Best C, Bolia R (2020) Real-time speech emotion recognition using a pre-trained
image classification network: effects of bandwidth reduction and companding. Front Comput Sci
2(14). https://​doi.​org/​10.​3389/​fcomp.​2020.​00014
39. Yadav A, Vishwakarma DK (2020) A Multi-lingual Framework of CNN and Bi-LSTM for Emotion
Classification. In: 2020 11th international conference on computing, communication and networking
technologies (ICCCNT), pp 1–6. https://​doi.​org/​10.​1109/​ICCCN​T49239.​2020.​92256​14
40. Singh J, Saheer LB, Faust O (2023) Speech emotion recognition using attention model. Int J Environ
Res Public Health 20(6):5140. https://​doi.​org/​10.​3390/​ijerp​h2006​5140
41. Swain M, Maji B, Kabisatpathy P et al (2022) A DCRNN-based ensemble classifier for speech
emotion recognition in Odia language. Complex Intell Syst 8:4237–4249. https://​doi.​org/​10.​1007/​
s40747-​022-​00713-w
42. Sun C, Li H, Ma L (2023) Speech emotion recognition based on improved masking EMD and convolu-
tional recurrent neural network. Front Psychol 13:2022. https://​doi.​org/​10.​3389/​fpsyg.​2022.​10756​24
43. Ullah S, Sahib QA, Faizullah, Ullah S, Haq IU, Ullah I (2022) Speech emotion recognition using deep
neural networks. In: Proceedings of the IEEE international conference on IT and industrial technolo-
gies (ICIT), pp 01–06. https://​doi.​org/​10.​1109/​ICIT5​6493.​2022.​99891​97

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.

