

Speech Emotion Recognition using Deep Learning Techniques: A Review

RUHUL AMIN KHALIL 1, EDWARD JONES 2, MOHAMMAD INAYATULLAH BABAR 3, TARIQULLAH JAN 4, MOHAMMAD HASEEB ZAFAR 5, AND THAMER ALHUSSAIN 6

1, 3, 4 Department of Electrical Engineering, Faculty of Electrical and Computer Engineering, University of Engineering and Technology Peshawar, Pakistan (e-mail: ruhulamin@uetpeshawar.edu.pk, babar@uetpeshawar.edu.pk, tariqullahjan@uetpeshawar.edu.pk)
2 Department of Electrical and Electronics Engineering, National University of Ireland, Galway, Ireland (e-mail: edward.jones@nuigalway.ie)
5 Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Kingdom of Saudi Arabia (e-mail: mzafar@kau.edu.sa)
6 Department of E-Commerce, Saudi Electronic University (SEU), Kingdom of Saudi Arabia (e-mail: talhussain@seu.edu.sa)

Corresponding author: Ruhul Amin Khalil (e-mail: ruhulamin@uetpeshawar.edu.pk).

This article was supported and funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia. The authors, therefore, acknowledge with thanks DSR for technical and financial support.

ABSTRACT Emotion recognition from speech signals is an important but challenging component of Human-Computer Interaction (HCI). In the literature of speech emotion recognition (SER), many techniques have been utilized to extract emotions from signals, including many well-established speech analysis and classification techniques. Deep Learning techniques have been recently proposed as an alternative to traditional techniques in SER. This paper presents an overview of Deep Learning techniques and discusses some recent literature where these methods are utilized for speech-based emotion recognition. The review covers databases used, emotions extracted, contributions made toward speech emotion recognition, and limitations related to it.

INDEX TERMS Speech Emotion Recognition, Deep Learning, Deep Neural Network, Deep Boltzmann Machine, Recurrent Neural Network, Deep Belief Network, Convolutional Neural Network

I. INTRODUCTION

Emotion recognition from speech has evolved from being a niche to an important component of Human-Computer Interaction (HCI) [1]–[4]. These systems aim to facilitate natural interaction with machines by direct voice interaction, instead of using traditional input devices, to understand verbal content and make it easy for human listeners to react [5]–[7]. Some applications include spoken-language dialogue systems such as call center conversations, onboard vehicle driving systems, and utilization of emotion patterns from speech in medical applications [8]. Nonetheless, there are many problems in HCI systems that still need to be properly addressed, particularly as these systems move from lab testing to real-world application [9]–[11]. Hence, efforts are required to effectively solve such problems and achieve better emotion recognition by machines.

Determining the emotional state of humans is an idiosyncratic task and may be used as a standard for any emotion recognition model [12]. Amongst the numerous models used for categorization of these emotions, a discrete emotional approach is considered one of the fundamental approaches. It uses various emotions such as anger, boredom, disgust, surprise, fear, joy, happiness, neutral, and sadness [13], [14]. Another important model is a three-dimensional continuous space with parameters such as arousal, valence, and potency.

The approach for speech emotion recognition (SER) primarily comprises two phases, known as the feature extraction and feature classification phases [15]. In the field of speech processing, researchers have derived several features such as source-based excitation features, prosodic features, vocal tract factors, and other hybrid features [16]. The second phase includes feature classification using linear and non-linear classifiers [17]. The most commonly used linear classifiers for emotion recognition include Bayesian Networks (BN), the Maximum Likelihood Principle (MLP), and the Support Vector Machine (SVM). Usually, the speech signal is considered to be non-stationary; hence, it is considered that non-linear classifiers work effectively for SER [17].
There are many non-linear classifiers available for SER, including the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM) [18]. These are widely used for classification of information that is derived from basic-level features. Energy-based features such as Linear Predictor Coefficients (LPC), Mel Energy-spectrum Dynamic Coefficients (MEDC), Mel-Frequency Cepstrum Coefficients (MFCC), and Perceptual Linear Prediction cepstrum coefficients (PLP) are often used for effective emotion recognition from speech. Other classifiers, including K-Nearest Neighbor (KNN), Principal Component Analysis (PCA), and Decision Trees, are also applied for emotion recognition [18].

Deep Learning has been considered an emerging research field in machine learning and has gained more attention in recent years [19]. Deep Learning techniques for SER have several advantages over traditional methods, including their capability to detect complex structure and features without the need for manual feature extraction and tuning, their tendency toward extraction of low-level features from the given raw data, and their ability to deal with unlabeled data [19]. Deep Neural Networks (DNNs) are based on feed-forward structures comprised of one or more hidden layers between inputs and outputs. Feed-forward architectures such as DNNs and Convolutional Neural Networks (CNNs) provide efficient results for image and video processing. On the other hand, recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are more effective in speech-based classification tasks such as natural language processing (NLP) and SER [20]. Apart from their effective way of classification, these models do have some limitations. For instance, the positive aspect of CNNs is that they learn features from high-dimensional input data, but on the other hand, they also learn features from small variations and distortions and hence require large storage capability. Similarly, LSTM-based RNNs are able to handle variable-length input data and model long-range sequential data.

The organization of this paper is as follows. A review of background for speech-based emotion detection and recognition using traditional classification techniques is given in Section II. Section III reviews the need for deep learning techniques utilized in a different context for SER. In Section IV, different deep learning techniques are discussed on the basis of their layer-wise architecture for SER. Further, Section V provides a summary of the papers based on these deep learning techniques for SER along with detailed discussion and future directions. Finally, concluding remarks are presented in Section VI.

A list of nomenclature used throughout this review paper is provided in Table 1.

TABLE 1. List of nomenclature used in this review paper.

ABC: Airplane Behavior Corpus
AE: Auto Encoder
ANN: Artificial Neural Network
AVB: Adversarial Variational Bayes
AVEC: Audio/Visual Emotion Challenge
BN: Bayesian Networks
CAM3D: Cohn-Kanade dataset
CAS: Chinese Academy of Science database
CNN: Convolutional Neural Network
ComParE: Computational Paralinguistic challenge
DBM: Deep Boltzmann Machine
DBN: Deep Belief Network
DCNN: Deep Convolutional Neural Network
DES: Danish Emotional Speech Database
DNN: Deep Neural Network
eGeMAPS: extended Geneva Minimalistic Acoustic Parameter Set
ELM: Extreme Learning Machine
Emo-DB: Berlin Emotional Database
FAU-AEC: FAU Aibo Emotion Corpus
GMM: Gaussian Mixture Model
HCI: Human-Computer Interaction
HMM: Hidden Markov Model
HRI: Human-Robot Interaction
IEMOCAP: Interactive Emotional Dyadic Motion Capture database
KNN: K-Nearest Neighbor
LIF: Localized Invariant Features
LPC: Linear Predictor Coefficients
LSTM: Long Short-Term Memory
MEDC: Mel Energy-spectrum Dynamic Coefficient
MFCC: Mel-Frequency Cepstrum Coefficient
MLP: Maximum Likelihood Principle
MT-SHL-DNN: Multi-Tasking Shared Hidden Layers Deep Neural Network
PCA: Principal Component Analysis
PLP: Perceptual Linear Prediction cepstrum coefficient
RBM: Restricted Boltzmann Machine
RE: Reconstruction-Error-based
RvNN: Recursive Neural Network
RECOLA: Remote Collaborative and Affective Interactions database
RNN: Recurrent Neural Network
RCNN: Recurrent Convolutional Neural Network
SAE: Stacked Auto Encoder
SPAE: Sparse Auto Encoder
SAVEE: Surrey Audio-Visual Expressed Emotion
SDFA: Salient Discriminative Feature Analysis
SER: Speech Emotion Recognition
SVM: Support Vector Machine
VAE: Variational Auto Encoder

II. TRADITIONAL TECHNIQUES FOR SER
Emotion recognition systems based on digitized speech are comprised of three fundamental components: signal preprocessing, feature extraction, and classification [21]. Acoustic preprocessing such as denoising, as well as segmentation, is carried out to determine meaningful units of the signal [22]. Feature extraction is utilized to identify the relevant features available in the signal. Lastly, the mapping of extracted feature vectors to relevant emotions is carried out by classifiers. In this section, a detailed discussion of speech signal processing, feature extraction, and classification is provided [23], [24]. Also, the differences between spontaneous and acted speech are discussed due to their relevance to the topic [25].
FIGURE 1. Traditional Speech Emotion Recognition System.

Figure 1 depicts a simplified system utilized for speech-based emotion recognition. In the first stage of speech-based signal processing, speech enhancement is carried out where the noisy components are removed. The second stage involves two parts, feature extraction and feature selection. The required features are extracted from the preprocessed speech signal and the selection is made from the extracted features. Such feature extraction and selection is usually based on the analysis of speech signals in the time and frequency domains. During the third stage, various classifiers such as GMM and HMM are utilized for classification of these features. Lastly, based on feature classification, different emotions are recognized.

TABLE 2. Summarized form of some acoustic variations observed based on emotions.

Emotion | Pitch | Intensity | Speaking rate | Voice quality
Anger | abrupt, on stress | much higher | marginally faster | breathy, chest
Disgust | wide, downward inflections | lower | very much faster | grumble chest tone
Fear | wide, normal | lower | much faster | irregular voicing
Happiness | much wider, upward inflections | higher | faster/slower | breathy, blaring tone
Joy | high mean, wide range | higher | faster | breathy; blaring timbre
Sadness | slightly narrower, downward inflections | lower | - | resonant

A. ENHANCEMENT OF INPUT SPEECH DATA IN SER
The input data collected for emotion recognition is often corrupted by noise during the capturing phase [26]. Due to these impairments, feature extraction and classification become less accurate [27]. This means that the enhancement of the input data is a critical step in emotion detection and recognition systems. In this preprocessing stage, the emotional discrimination is kept, while the speaker and recording variation is eliminated [28].
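As an illustration of this preprocessing stage, the sketch below applies a simple spectral-subtraction denoiser to an input signal. It is a minimal example only, not the enhancement method of any specific work cited here; the assumption that the first few frames are noise-only, the STFT settings, and the synthetic test signal are all choices made for illustration.

```python
import numpy as np
import librosa

def spectral_subtraction(y, sr, noise_frames=10):
    """Very simple spectral-subtraction enhancement (illustrative only)."""
    stft = librosa.stft(y, n_fft=512, hop_length=128)      # complex spectrogram
    mag, phase = np.abs(stft), np.angle(stft)
    # Assume the first few frames contain only background noise.
    noise_profile = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate and floor the result to avoid negative magnitudes.
    enhanced_mag = np.maximum(mag - noise_profile, 0.05 * mag)
    # Rebuild the waveform using the original phase.
    return librosa.istft(enhanced_mag * np.exp(1j * phase), hop_length=128)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr * 2) / sr
    clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)             # stand-in "speech"
    noisy = (clean + 0.05 * np.random.randn(t.size)).astype(np.float32)
    enhanced = spectral_subtraction(noisy, sr)
    print("input samples:", noisy.size, "enhanced samples:", enhanced.size)
```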
B. FEATURE EXTRACTION AND SELECTION IN SER
The speech signal after enhancement is characterized into meaningful units called segments [29], [30]. Relevant features are extracted and classified into various categories based on the information extracted. One type of classification is short-term classification, based on short-period characteristics such as energy, formants, and pitch [31]. The other is known as long-term classification; mean and standard deviation are two of the often-used long-term features [32]. Among prosodic features, the intensity, pitch, rate of spoken words, and variance are usually important to identify various types of emotions from the input speech signal [33]. A few of the acoustic characteristics of emotional speech are presented in Table 2.

C. MEASURES FOR ACOUSTICS IN SER
Information about emotions is encoded in every aspect of language and its variations. The vocal parameters and their relation to emotion recognition are among the most researched topics in this field. Parameters such as intensity, pitch, rate of spoken words, and quality of voice are frequently considered [34]. Often, a straightforward view of emotion is taken, wherein emotions are assumed to exist as discrete categories. These discrete emotions sometimes have relatively clear relationships with acoustic parameters, for example, as indicated in Table 2 for a subset of emotions. Often, intensity and pitch are correlated to activation, so that the value of intensity increases along with high pitch and gets low with low pitch [35], [36]. Factors that affect the mapping from acoustic variables to emotion include whether the speaker is acting, whether there are high speaker variations, and the mood or personality of the individual [37], [38].

In HCI, emotions are usually spontaneous and are generally not the prototypical discrete emotions; rather, they are often weakly expressed, mixed, and hard to distinguish from each other [39]. In the literature, emotional statements are termed as positive and negative based on the emotions expressed by an individual [40]. Other experiments show that listener-rated acted emotions are much stronger and more accurate than natural emotions, which may suggest that actors exaggerate the expression of emotions. According to the study in [41], the fundamental emotions can be described by areas
within the space defined by the axes of arousal and valence, as provided in Figure 2. Arousal represents the intensity of calmness or excitement, whereas valence represents the effect of positivity and negativity in the emotions.

FIGURE 2. A two dimensional basic emotional space.

D. CLASSIFICATION OF FEATURES IN SER
In the literature, various classifiers have been investigated to develop systems such as SER, speech recognition, and speaker verification, to name a few [41]. On the other hand, the justification for choosing a particular classifier for a specific speech task is often not mentioned in most of the applications. Typically, classifiers are selected on either a rule of thumb or empirical evaluation of some indicators, as mentioned earlier.

Normally, pattern recognition classifiers used for SER can be broadly categorized into two main types, namely linear classifiers and non-linear classifiers. Linear classifiers usually perform classification based on object features with a linear arrangement of various objects [42]. These objects are mostly evaluated in the form of an array termed a feature vector. In contrast, non-linear classifiers are utilized for object characterization by developing a non-linear weighted combination of such objects.

Table 3 depicts a few traditional linear and non-linear classifiers used for SER.

TABLE 3. Few linear and non-linear classifiers used for SER.

Classifier | Linear/Non-Linear | References
Bayes classifier | Linear | [41]
K-Nearest Neighbor classifier | Linear | [41]
GMM classifier | Non-Linear | [42]
HMM classifier | Non-Linear | [43], [44]
PCA classifier | Linear/Non-Linear | [45]
SVM classifier | Linear/Non-Linear | [46], [47], [48]
ELM classifier | Linear/Non-Linear | [89], [121]
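As a concrete, minimal illustration of the traditional feature-extraction-plus-classifier pipeline described above, the sketch below computes utterance-level MFCC statistics and trains an SVM, one of the non-linear classifiers listed in Table 3. It is an assumed example rather than the setup of any reviewed work: the directory layout (one sub-folder per emotion), the 13-coefficient MFCC setting, and the use of librosa and scikit-learn are illustrative choices.

```python
import glob, os
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def utterance_features(wav_path, n_mfcc=13):
    """Frame-level MFCCs summarized by their mean and standard deviation."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Assumed layout: data/<emotion>/<file>.wav
X, y = [], []
for path in glob.glob("data/*/*.wav"):
    X.append(utterance_features(path))
    y.append(os.path.basename(os.path.dirname(path)))        # emotion label from folder

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=10.0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```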
E. DATABASES USED FOR SER
Speech emotional databases are used by many researchers in a variety of research activities [49], [50]. The quality of the databases utilized and the performance achieved are the most important factors in evaluation for emotion recognition [54], [55]. The methods available and objectives in the collection of speech databases vary depending on the motivation for speech systems development. Table 4 provides the characteristics of various freely available emotional speech databases. For the development of emotional speech systems, speech databases are categorized into three main types. The categorization of databases can also be described according to the continuum shown in Figure 3.

FIGURE 3. Emotion recognition databases and their difficulty level.

• Simulated database: In these databases, the speech data has been recorded by well-trained and experienced performers [53], [54]. Among all databases, this is considered the simplest way to obtain a speech-based dataset of various emotions. It is considered that almost 60% of speech databases are gathered by this technique.
• Induced database: This is another type of database in which the emotional set is collected by creating an artificial emotional situation [55], [56]. This is done without the knowledge of the performer or speaker. As compared to an actor-based database, this is a more naturalistic database. However, an issue of ethics may apply, because the speaker should know that they have been recorded for research-based activities.
• Natural database: While most realistic, these databases are hard to obtain due to the difficulty in recognition [52].
TABLE 4. Characteristics of various freely available emotional speech databases.

S.No. | Database | Language | Emotions | Size | Source | Access | References
1 | Berlin emotional database | German | Happiness, sadness, boredom, neutral, disgust and anger | 800 utterances | Professional actors | Public and free | [49], [50], [51]
2 | Danish emotional database | Danish | Anger, sadness, surprise, neutral, and joy | 4 actors × 5 emotions | Non-professional actors | Free license | [49], [50]
3 | Interactive Emotional Dyadic Motion Capture | English | Happiness, anger, sadness, frustration, surprise, fear, disgust, excited and neutral state | 10 actors (5 male and 5 female) | Professional actors | Free license | [49], [50]
4 | INTERFACE05 | English, Spanish, French and Slovenian | Neutral, disgust, fear, joy, sadness | English=186, Spanish=184, French=175 and Slovenian=190 utterances, respectively | Actors | Commercially available | [49], [50], [53]
5 | LDC Emotional Speech and Transcripts | English | Despair, sadness, neutral, interest, joy, panic, anger, shame, contempt, elation, pride and cold anger | 7 actors × 15 emotions × 10 utterances | Professional actors | Commercially available | [49], [50], [52]
Natural emotional speech databases are usually recorded from general public conversation, call center conversations, and so on.

In the early 1990s, when the research on speech-based emotion recognition developed in earnest, researchers often commenced with acted databases and later moved to realistic databases [53]. Among acted databases, the most commonly used are the Berlin Emotional Speech Database (Emo-DB) and the Danish Emotional Speech Database (DES), which contains the recorded voices of 10 performers. This includes 4 persons for testing, who were asked to speak various sentences in 5 different emotional states. The data comprises the German-Aibo emotion and Smart-Kom data, where the actors' voices are recorded in a laboratory. Additionally, call center conversations in a fully realistic environment from live recordings have been used [51].

Literature suggests there is a large variation between the databases in the number of emotions recognized and the number of performers, purpose, and methodology. Speech emotional databases are employed in psychological studies for knowing the patient's behavior, as well as in situations where automation in emotion recognition is desired [52], [57]. The system becomes complex and emotion recognition is hard to achieve when real-time data is employed [57].

III. NEED OF DEEP LEARNING TECHNIQUES FOR SER
Speech processing usually functions in a straightforward manner on an audio signal [58]. It is considered significant and necessary for various speech-based applications such as SER, speech denoising, and music classification [59]. With recent advancements, SER has gained much significance. However, it still requires accurate methodologies to mimic human-like behavior for interaction with human beings [60].

As discussed earlier, a SER system is made up of various components that include feature selection and extraction, feature classification, acoustic modeling, recognition per unit, and most importantly language-based modeling. The traditional SER systems typically incorporate various classification models such as GMMs and HMMs [61], [62]. The GMMs are utilized for illustration of the acoustic features of sound units, while the HMMs are utilized for dealing with temporal variations occurring in speech signals.

Deep learning methods are comprised of various non-linear components that perform computation on a parallel basis [63]. However, these methods need to be structured with deeper layers of architecture to overcome the limitations of other techniques. Deep learning techniques such as the Deep Boltzmann Machine (DBM), Recurrent Neural Network (RNN), Recursive Neural Network (RvNN), Deep Belief Network (DBN), Convolutional Neural Network (CNN) and Auto Encoder (AE) are considered a few of the fundamental deep learning techniques used for SER that significantly improve the overall performance of the designed system [63].

Deep learning is an emerging research field in machine learning and has gained much attention in recent years [64]. A few researchers have used DNNs to train their respective models for SER. Figure 4 depicts the difference between traditional machine learning flow and deep learning flow mechanisms for SER.
FIGURE 4. Traditional Machine Learning Flow vs Deep Learning Flow.

Table 5 shows a detailed comparative analysis of traditional algorithms with a deep learning algorithm, i.e., the Deep Convolutional Neural Network (DCNN), in the context of measuring various emotions using the IEMOCAP, Emo-DB and SAVEE datasets, recognizing emotions such as happiness, anger, and sadness [65]. It is deduced that deep learning algorithms perform well in emotion recognition as compared to traditional techniques.

TABLE 5. Comparative analysis of different classifiers in SER [65].

Algorithm | Anger | Happy | Sad
k-nearest neighbor | 93% | 55% | 77%
Linear discriminant analysis | 68% | 49% | 72%
Support vector machine | 74% | 70% | 93%
Regularized discriminant analysis | 83% | 73% | 97%
Deep convolutional neural network | 99% | 99% | 96%
FIGURE 5. Generic layer-wise Deep Neural Network (DNN) Architecture.
In the next section, the paper aims to discuss various deep learning techniques in the context of SER. These methods provide accurate results as compared to traditional techniques but are computationally complex. This section provides literature-based support to researchers and readers to assess the HCI feasibility and help them analyze the user's emotional voice in the given scenario. The real-time applications of these techniques are much more complex; however, emotion recognition from speech input data is a feasible option [66]. These methods do have limitations; however, a combination of two or more of these classifiers results in a new step and can possibly improve the detection of emotions.

IV. DEEP LEARNING TECHNIQUES FOR SER
Deep learning is derived from the family of machine learning, which is a broader learning technique for data representations such as emotions [67]. Deep learning can be unsupervised, semi-supervised, or fully supervised. Figure 5 illustrates a generic layer-wise architecture for a DNN.

Currently, deep learning is a fast-growing research area due to its multi-layered structure and efficient results delivery. These research areas include speech emotion recognition, speech and image recognition, natural language processing, and pattern recognition [68]. In this section, various deep learning algorithms such as DBMs, DBNs, CNNs, RNNs, RvNNs, and AEs are discussed.

A. DEEP BOLTZMANN MACHINE (DBM)
DBMs are basically derived from Markov Random Fields and are comprised of various hidden layers [69], [70]. These layers are based on randomly chosen variables and coupled with stochastic entities. The domain of visible entities is given by v ∈ {0, 1}^C, with a combination of hidden units h^(1) ∈ {0, 1}^{C_1}, h^(2) ∈ {0, 1}^{C_2}, ..., h^(L) ∈ {0, 1}^{C_L}, as shown in Figure 6 (a). On the contrary, in a Restricted Boltzmann Machine (RBM), there is no inter-connection between the entities of the same layer. A generalized three-layer RBM network is shown in Figure 6 (b). The probability allotted to a visible vector v is given by

p(v) = \frac{1}{Z} \sum_{h} \exp\Big( \sum_{ab} W^{(1)}_{ab} v_a h^{(1)}_b + \sum_{bc} W^{(2)}_{bc} h^{(1)}_b h^{(2)}_c + \sum_{cd} W^{(3)}_{cd} h^{(2)}_c h^{(3)}_d \Big)    (1)
FIGURE 6. Graphical representation of (a) Deep Boltzmann Machine (DBM) and (b) Restricted Boltzmann Machine (RBM).

where h = {h^(1), h^(2), h^(3)} represents the set of hidden-layer entities and θ = {W^(1), W^(2), W^(3)} corresponds to the symmetric interactions between visible and hidden units; it denotes the visible-hidden and hidden-hidden interactions. If W^(2) = W^(3) = 0, the system is termed an RBM.

The main advantage of DBM is its tendency to learn quickly and provide efficient representations. It achieves this by layer-to-layer pre-training [71]. This is the reason that DBM can provide better results for emotion recognition when speech is provided as input. Along with this, DBM has some disadvantages as well, such as restricted effectiveness in certain scenarios [72].
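The layer-to-layer pre-training mentioned above is commonly realized by training one RBM at a time with contrastive divergence. The sketch below shows a single contrastive-divergence (CD-1) update for a binary RBM in plain NumPy; the layer sizes, learning rate, and random stand-in data are illustrative assumptions rather than settings from any cited study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny binary RBM: C visible units, C1 hidden units (sizes are arbitrary here).
C, C1, lr = 64, 32, 0.05
W = 0.01 * rng.standard_normal((C, C1))   # visible-hidden weights
m = np.zeros(C)                           # visible biases
n = np.zeros(C1)                          # hidden biases

def cd1_update(v0):
    """One contrastive-divergence step on a single binary visible vector v0."""
    global W, m, n
    # Positive phase: sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + n)
    h0 = (rng.random(C1) < p_h0).astype(float)
    # Negative phase: reconstruct the visible units, then re-infer hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + m)
    v1 = (rng.random(C) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + n)
    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    m += lr * (v0 - v1)
    n += lr * (p_h0 - p_h1)

# Toy usage with random binary vectors standing in for binarized speech features.
for v in (rng.random((200, C)) > 0.5).astype(float):
    cd1_update(v)
```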
B. RECURRENT NEURAL NETWORK (RNN)
RNN is a branch of neural networks based on sequential information, where the outputs and inputs are interdependent [73]. Usually, this interdependency is useful in predicting the future state of the input. RNNs, like CNNs, require memory to store the overall information obtained in the sequential process of deep learning modeling, and generally work efficiently only for a few back-propagation steps. Figure 7 depicts the basic RNN architecture.

FIGURE 7. Basic architecture of Recurrent Neural Network.

Here x_t is the input, s_t is the underlying hidden state, and o_t is the output at time step t. The matrices U, V, and W are the parameters of the hidden layers and their values may vary for every time step. The hidden state is calculated as s_t = f(U x_t + W s_{t-1}). According to [74], RNNs are suitable for speech emotion recognition due to the short-time framing of acoustic features.

The main problem that affects the overall performance of the RNN is its sensitivity towards the vanishing of gradients [75]. In other words, the gradients may decay exponentially during the training phase as they get multiplied by many small or large derivatives. However, this sensitivity gets reduced over a while and results in forgetting the inputs provided at the initial level. To avoid such a situation, Long Short-Term Memory (LSTM) is utilized, providing a block between the recurrent connections. Each block of memory stores the network temporal states and includes gated units for controlling the inflow of new information. The residual connections are usually very deep and hence useful for reducing the gradient issue.
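To make the above concrete, the sketch below defines a small LSTM classifier over sequences of frame-level acoustic features (for example, MFCC frames) in PyTorch. The feature dimension, number of emotion classes, and the use of the final hidden state are illustrative assumptions, not the architecture of any specific paper reviewed here.

```python
import torch
import torch.nn as nn

class LSTMEmotionClassifier(nn.Module):
    """Sequence of acoustic frames -> emotion class scores."""
    def __init__(self, n_features=40, hidden_size=128, n_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_emotions)

    def forward(self, frames):               # frames: (batch, time, n_features)
        outputs, (h_n, c_n) = self.lstm(frames)
        return self.classifier(h_n[-1])      # use the top layer's final hidden state

# Toy usage: a batch of 8 utterances, 300 frames each, 40-dimensional features.
model = LSTMEmotionClassifier()
x = torch.randn(8, 300, 40)
logits = model(x)                            # (8, 7) emotion scores
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 7, (8,)))
loss.backward()
```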
C. RECURSIVE NEURAL NETWORK (RvNN)
RvNN is a hierarchical deep learning technique with no dependency on a pre-given tree-structured input sequence [76]. It can easily learn the parse tree of the provided data by dividing the input into small chunks. Its governing equation is provided in (2), and Figure 8 depicts an RvNN architecture.

p_{1,2} = \tanh(W[c_1; c_2])    (2)

where W is an n × 2n weight matrix. This structured network is well utilized for syntactic parsing in natural language processing, speech processing, speech emotion recognition, and pattern recognition in audio and visual input data.

RvNN is mostly used for natural language processing, but its architecture is able to handle different modalities such as SER and speech recognition. According to the study in [77], the RvNN can be used for the classification of natural language sentences as well as natural image processing. It initially computes the score of each possible pair for merging, in order to build a syntactic tree. The pair with the highest score is then combined into a vector known as a compositional vector. Once the pair gets merged, the RvNN then generates multiple units, the region-representing vectors, and the classification labels.

FIGURE 8. Basic architecture of Recursive Neural Network.
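A single RvNN merge step, following equation (2), can be written in a few lines. The sketch below composes two child vectors into a parent vector and scores the merge, then greedily builds a toy tree; the dimensionality and the separate scoring vector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16                                       # size of each node vector (assumed)
W = 0.1 * rng.standard_normal((n, 2 * n))    # composition matrix from equation (2)
u = 0.1 * rng.standard_normal(n)             # scoring vector for choosing merges

def compose(c1, c2):
    """Merge two children into a parent: p = tanh(W [c1; c2])."""
    parent = np.tanh(W @ np.concatenate([c1, c2]))
    score = float(u @ parent)                # how plausible this merge is
    return parent, score

# Greedily merge the highest-scoring adjacent pair of a toy sequence.
nodes = [rng.standard_normal(n) for _ in range(5)]
while len(nodes) > 1:
    scored = [(compose(nodes[i], nodes[i + 1]), i) for i in range(len(nodes) - 1)]
    (parent, _), i = max(scored, key=lambda t: t[0][1])
    nodes[i:i + 2] = [parent]                # replace the pair with its parent node
print("root vector shape:", nodes[0].shape)
```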
D. DEEP BELIEF NETWORK (DBN)
DBN is much more complicated in structure and is built from cascaded RBM structures [78]. DBN is an extension of RBMs, in which RBMs are trained layer by layer in a bottom-up manner. DBNs are usually used for speech emotion recognition due to their ability to learn the recognition parameters efficiently, no matter how large the number of parameters, and they avoid the non-linearity issues across layers [79]. DBNs are used to tackle the slow, localized convergence of back-propagation algorithms during training. Figure 9 represents the layer-wise architecture of the DBN, in which the RBMs are trained and evaluated layer-wise from bottom to top.

FIGURE 9. Layer-wise architecture of Deep Belief Network.

An RBM is usually generative stochastic because it provides probabilistic distributions as output for a given input. The configuration of an RBM based on its energy level, for determination of the output distribution function along with its weights and the state vector y of its visible layer, is expressed as

E(y, z) = -m^T y - n^T z - y^T W z    (3)

where z denotes the binary configuration units of the hidden layer, and m and n refer to the biases of the visible and hidden layers, respectively. The matrix W provides the connection weights between the layers. The probability over pairs of visible and hidden vectors is given by the following equation

P(y, z) = \frac{e^{-E(y, z)}}{L}    (4)

where L is the partition function, defined over all possible configurations to normalize the probability distribution to unity. As a single RBM is unable to model the original input data fully, the DBN uses a greedy algorithm to improve the generative model by allowing each subnetwork to receive a different representation of the data. Moreover, with the addition of a new layer into the DBN, the overall variational bound on the deeper layer is further improved as compared to the previous RBM block.

The first main advantage of DBN is the unsupervised nature of its pre-training techniques with large and unlabeled databases [80]. The second advantage of DBNs is that they can compute the required output weights of the variables using an approximate inference procedure. There lies some limitation as well, because the inference procedure of DBNs is restricted to a bottom-up pass only. There exists a greedy layer that learns features of a single layer and never re-adjusts with the remaining layers [81].
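The greedy, layer-by-layer pre-training described above can be sketched with scikit-learn's BernoulliRBM: each RBM is trained on the hidden representation of the one below it, and the stack is then used as a feature extractor for a supervised read-out classifier. Feature dimensions, layer sizes, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((500, 64))            # stand-in features scaled to [0, 1]
y = rng.integers(0, 4, 500)          # stand-in emotion labels (4 classes)

# Greedy bottom-up pre-training: each RBM models the layer below it.
layer_sizes = [48, 32]
rbms, H = [], X
for size in layer_sizes:
    rbm = BernoulliRBM(n_components=size, learning_rate=0.05, n_iter=20, random_state=0)
    H = rbm.fit_transform(H)         # hidden activations feed the next RBM
    rbms.append(rbm)

# Use the top-level representation for a supervised read-out layer.
clf = LogisticRegression(max_iter=1000).fit(H, y)

def dbn_features(X_new):
    """Propagate data bottom-up through the pre-trained stack."""
    H_new = X_new
    for rbm in rbms:
        H_new = rbm.transform(H_new)
    return H_new

print("train accuracy:", clf.score(dbn_features(X), y))
```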
FIGURE 10. Layer-wise Convolutional Neural Network Architecture.

E. CONVOLUTIONAL NEURAL NETWORK (CNN)
CNN is another type of deep learning technique, based solely on a feed-forward architecture [82], used for classification. CNNs are commonly used for pattern recognition and provide better data classification. These networks have small-size neurons present on every layer of the designed model architecture that process the input data in the form of receptive fields [83]. Figure 10 provides the layer-wise architecture of a basic CNN network.

Filters are the basis of the local connections that are convolved with the input and share the same parameters (weight W^i and bias n^i) to generate i feature maps (z^i), each of size a - b + 1. The convolutional layers compute the dot product between the weights and the provided inputs, so the feature maps z^i generated with weights W^i and biases n^i can be given as

z^i = g(W^i * r + n^i)    (5)

An activation function g, or another non-linear methodology, needs to be applied to obtain the output of the convolution layers. It should be noted that the inputs r are very small regions of the original volume, as depicted in Figure 10. Down-sampling is carried out at each subsampling layer on the feature maps to decrease the number of parameters in the network. This, in turn, controls overfitting and speeds up the training process. The pooling process is carried out over p × p elements (also known as the filter size) for adjoining regions of all the feature maps. In the final stage, the layers need to be fully connected, as in other neural networks. These later layers take the previous low-level and mid-level features and generate a high-level abstraction from the input speech data. The last layer, such as an SVM or Softmax, is utilized to generate the classification score in probabilistic terms, relating the input to a certain class.
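The sketch below is a minimal 2D CNN over log-mel spectrogram patches of the kind described above, written in PyTorch. The input resolution (64 mel bands by 128 frames), channel counts, and number of emotion classes are assumptions chosen only to illustrate the convolution, pooling, and softmax-classification pattern.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Log-mel spectrogram patch (1 x 64 x 128) -> emotion class scores."""
    def __init__(self, n_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # subsampling (pooling) layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 32, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),           # class scores (softmax via cross-entropy)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Toy usage: batch of 4 spectrogram patches.
model = SpectrogramCNN()
x = torch.randn(4, 1, 64, 128)
print(model(x).shape)                             # torch.Size([4, 7])
```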
F. AUTO ENCODER (AE)
AE is a type of neural network with a detailed built-in model representation [84], [85]. The generalized architecture of an AE is depicted in Figure 11.

FIGURE 11. Auto Encoder architecture.

The encoder function f_θ maps the provided input vector i to the hidden layer h. In the function f_θ, θ = {W_t, c}, where W_t is the weight matrix and c is the bias vector. As far as the decoder function g_θ is concerned, it maps the hidden representation h back to the reconstruction d of the input.

There are two main variations of AE, namely the Stacked Auto Encoder (SAE) and the Variational Auto Encoder (VAE). The VAE utilizes the log-likelihood of the data and derives a lower-bound estimator from a given graphical model with continuous underlying variables [86]. The parameters θ (generative model parameters) and φ (variational parameters) assist the overall process of approximation. Another variation, termed the Auto-Encoding Variational Bayes (AEVB) algorithm, further optimizes the parameters θ and φ for various probabilistic encoders q_φ(j|i) in any defined neural network. This in turn leads to approximation of the generative model p_θ(i, j), where j is a latent variable under a simplified distribution given as N(0, I), and I is the identity matrix. The aim is to enhance the probability of each i in the given training set under the underlying generative process given by

P_\theta(i) = \int p_\theta(j)\, p_\theta(i \mid j)\, dj    (6)
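A compact PyTorch sketch of a plain autoencoder over fixed-length acoustic feature vectors is shown below; the 384-dimensional input, the bottleneck size, and the mean-squared reconstruction loss are assumptions for illustration. A VAE would additionally parameterize a distribution q_φ(j|i) and add a KL-divergence term, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder f_theta and decoder g_theta for unsupervised feature learning."""
    def __init__(self, n_features=384, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, bottleneck),            # hidden representation h
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, n_features),            # reconstruction d of the input
        )

    def forward(self, i):
        h = self.encoder(i)
        return self.decoder(h), h

# Toy training step on random stand-in feature vectors.
model = AutoEncoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(16, 384)
recon, h = model(x)
loss = nn.MSELoss()(recon, x)                      # reconstruction error
loss.backward()
optim.step()
```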
In the next section, a summary of the papers based on different deep learning techniques for speech-based emotion recognition is presented. Also, discussion is made on the layer-wise working of these deep learning techniques, and future directions are provided.

V. SUMMARY OF THE LITERATURE, DISCUSSION AND FUTURE DIRECTIONS
For SER, many deep learning algorithms have been developed [87]–[91]. However, there exist meaningful prospects and fertile ground for future research opportunities, not only in SER but in many other domains [92]–[94], [96]. The layer-wise structure of neural networks adaptively learns features from the available raw data hierarchically [96]. The remainder of this section summarizes the literature on deep layer architectures, learning, and regularization methodologies discussed in the context of SER.

Deep learning techniques utilize some key features in various applications such as SER, natural language processing (NLP), and sequential information processing, as described in Table 6. In the case of SER, most of these techniques use supervised algorithms during their implementation; however, there is a shift to semi-supervised learning [87]. This will enhance the learning of real-world data without the need for manual human labels. Table 7 provides a summary of the deep learning techniques used by researchers along with their respective descriptive key features, the databases used, results and accuracy outcomes, and some commentary on future directions.

TABLE 6. Summary of deep learning techniques with descriptive key features.

Deep Learning Technique | Descriptive Key Features | References
DBM | Unsupervised learning for RBM and directionless connections | [69]–[72]
RNN | Efficient for sequential information processing such as NLP and SER | [73]
RvNN | Utilizes tree-like structure, specially for NLP | [76], [77]
DBN | Unsupervised learning and directed connections | [78]–[81]
CNN | Basically designed for image recognition, but can be extended to NLP, computer vision and speech processing | [82], [83], [97], [98]
AE/SAE/VAE | Unsupervised learning based on probabilistic graphical models | [84]–[86]

The traditional SER systems typically incorporate various classification models such as GMMs and HMMs. The GMMs are utilized for representation of the acoustic features of sound units. The HMMs, on the other hand, are utilized for dealing with temporal variations in speech signals [99]. The modeling process using such traditional techniques requires a larger dataset to achieve accuracy in emotion recognition and, hence, is time-consuming. In contrast, deep learning methods are comprised of various non-linear rudiments that perform computation on a parallel basis [100]–[102]. However, these methods need to be structured with deeper layered architectures [119].

A deep learning technique based on a discriminative pre-training modality using a DNN-HMM along with MFCC coefficients has been presented in [103]. The DNN-HMM has been combined with an RBM utilizing unsupervised training to recognize different speech emotions. The hybrid deep learning modality can achieve better results [132]. The same DNN-HMM has been presented and compared with the Gaussian Mixture Model (GMM). It is investigated along with the Restricted Boltzmann Machine (RBM) for a scenario where unsupervised and discriminative pre-training is concerned. The results obtained in both cases are then compared with those obtained for two-layer and multilayer perceptron GMM-HMMs and shallow-NN-HMMs. The hybrid DNN-HMMs with pre-training have an accuracy, using the eNTERFACE05 dataset, of 12.22% with unsupervised training, versus 11.67% for GMM-HMMs, 10.56% for MLP-HMMs and 17.22% for shallow-NN-HMMs, respectively. This suggests multimodality as a fruitful avenue for research, and also there is scope for improving the accuracy of emotion recognition, robustness, and efficiency of the recognition system [133].

The main problem that affects the overall performance of the RNN is its sensitivity towards the vanishing of gradients [100], [101]. In [104], an adaptive SER system based on a deep learning technique known as DRNN is used for SER. The learning stage of the model includes both frame-level and short-time acoustic features, due to their similar structure. Another multi-tasking deep neural network with shared hidden layers, named MT-SHL-DNN, is utilized in [119], where the transformation of features is shared. Here, the output layers have a separate association with each data set used. The DNN also helps in measuring the SER based on the nature of the speaker and gender. When DNNs are used for encoding segments into fixed-length vectors, this is done by pooling various hidden layers over the specified time. The design of the feature encoding procedure is done in such a manner that it can be used jointly with a segmental-level classifier for efficient classification.
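The segment-to-fixed-length encoding mentioned above is commonly realized by pooling frame-level hidden activations over time, for example using their mean and standard deviation, before a segment-level classifier. The sketch below is a generic illustration of that idea in PyTorch, with the frame feature size, hidden width, and number of classes chosen arbitrarily; it is not the MT-SHL-DNN of [119].

```python
import torch
import torch.nn as nn

class PooledSegmentClassifier(nn.Module):
    """Frame-level DNN -> temporal mean/std pooling -> segment-level classifier."""
    def __init__(self, n_features=40, hidden=256, n_emotions=7):
        super().__init__()
        self.frame_dnn = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.segment_classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, frames):                    # frames: (batch, time, n_features)
        h = self.frame_dnn(frames)                # per-frame hidden activations
        pooled = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)  # fixed-length vector
        return self.segment_classifier(pooled)

# Toy usage: utterances of different lengths still yield fixed-size encodings.
model = PooledSegmentClassifier()
print(model(torch.randn(4, 220, 40)).shape)       # torch.Size([4, 7])
print(model(torch.randn(4, 350, 40)).shape)       # torch.Size([4, 7])
```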
Convolutional Neural Networks (CNNs) also use the layer-wise structure and can categorize the seven universal emotions from speech spectrograms [119]. In [123], a technique for SER that is based on spectrograms and a deep CNN is presented. The model consists of three fully connected convolutional layers for extracting emotional features from the spectrogram images of the speech signal. In another adaptation, a technique that shares priors between related source and target classes (SPRST) is carried out in [124], using a two-layered neural network. Speech data collected from various sources and scenarios usually leads to mismatching and hence degrades the overall performance of the system. Initially, a pre-training of the weights is carried out for the first layer, and then the classification parameters of the second layer are enforced between the two taken classes. Classes with less labeled data in the target domain can borrow information from the associated source domain to compensate for the deficiencies.

The tendency of DNNs to learn specific features from various auditory emotion recognition systems is an-
TABLE 7. Summary of literature on deep learning techniques for SER with discussion and future directions.

S.No. | Reference | Emotions recognized | Databases used | Deep learning approach used | Contribution towards emotion recognition and accuracy | Future direction
1 | J. Zhao et al. (2019) [99] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | Berlin EmoDB and IEMOCAP | CNN and DBN with four LFLBs and one LSTM | Deep 1D and 2D CNN LSTM to achieve 91.6% and 92.9% accuracy | The presented model can be extended to multimodal emotion recognition
2 | S. Tripathi et al. (2018) [100] | Anger, Happiness, Sadness, Neutral | IEMOCAP database | LSTM-based RNN with 3 layers | Model is tested on MoCap data with overall accuracy of 71.04% | Future work may lead to inclusion of further layers to get improved results
3 | P. Yenigalla et al. (2018) [101] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | IEMOCAP database | 2D CNN with phoneme input data | Achieves accuracy in SER about 4% above average | It can be tested in other applications such as conversational chatbots and other databases
4 | E. Lakomkin et al. (2018) [102] | Anger, Happiness, Neutral and Sadness | IEMOCAP database | RNN and CNN | Combined RNN-CNN for the iCub robot and in-domain data with 83.2% accuracy | Future work may lead to use of generative models for real-time data input
5 | D. Tang et al. (2018) [103] | Anger, Happiness, Neutral and Sadness | EmotAsS (EMOTional Sensitivity ASsistance System for people with disabilities) dataset | Combined CNN and RNN with ResNet | The CNN+RNN model achieves 45.12% on the improvement dataset, and 42.27% on the test dataset | Future work may include the learning of features and augmentation
6 | C. W. Lee et al. (2018) [104] | Sadness, Happiness, Anger, Disgust, Surprise, and Fear | CMU-MOSEI dataset | LSTM-based CNN | The proposed model gives overall 83.11% accuracy in SER | Future work may lead to testing on multimodality
7 | S. Sahu et al. (2018) [105] | OpenSMILE features recognized | IEMOCAP database | Adversarial auto-encoders (AAE) | Accuracy with UAR of approximately 57.88% | Future work may lead to recognizing other emotions
8 | S. Latif et al. (2018) [106] | Motherese, Joyful, Neutral, Rest, Angry, Touchy, Emphatic and Sadness | FAU-AIBO, IEMOCAP, Emo-DB, SAVEE and EMOVO databases | RBM-based DBN | DBN is used for transfer learning for cross-corpus and cross-language SER | Future work may include application of other neural networks
9 | M. Chen et al. (2018) [107] | Anger, Sadness, Happy, Neutral, Fear, Disgust and Bored | IEMOCAP and Emo-DB databases | 3-D CNN with LSTM to learn discriminative features | The model achieves an overall accuracy of 86.99% | Future work includes testing on different databases
10 | M. Sarma et al. (2018) [108] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | IEMOCAP database | TDNN-LSTM for SER | TDNN-LSTM with time-restricted self-attention, achieving a weighted accuracy of 70.6%, versus 61.8% previously | Future work includes investigation in multi-dimensional space
11 | S. E. Eskimez et al. (2018) [109] | Anger, Frustration, Neutral, and Sadness | USC-IEMOCAP audio-visual dataset | CNN with VAE, AAE and AVB | Better achievement on F-1 score level of 47% | Future work may include RNN
12 | J. Zhao et al. (2018) [110] | Joy, Happiness, Neutral, Sadness, Disgust, Fear, and Anger | Berlin EmoDB and IEMOCAP | Deep Convolutional Neural Network (DCNN) | Merged deep 1D and 2D CNN for high-level learning of features from input audio and log-mel spectrograms with 92.71% accuracy | Future work may include addition of LSTM for better SER
13 | W. Zhang et al. (2017) [111] | Fear, Anger, Neutral, Joy, Surprise and Sadness | CAS emotional speech database | Feature fusion with SVM and Deep Belief Network for SER | DBN provides accuracy of 94.6% as compared to SVM at 84.54% | Future direction may lead to training the DBN further with a combination of lexical and audio features
14 | Z. Lianzhang et al. (2017) [112] | Anger, Fear, Happy, Neutral, Sad, Surprise, and Average | Chinese Academy of Sciences emotional speech database | Combined SVM and DBN for SER | The combined model achieves 94.6% accuracy | Future work includes testing on other databases
15 | P. Tzirakis et al. (2017) [113] | Anger, Happiness, Sadness, Neutral | Spontaneous emotional RECOLA and AVEC 2016 Database | Convolutional Neural Network (CNN) and ResNet of 50 layers for both audio-visual modalities along with LSTM | End-to-end methodology is used to recognize various emotions with 78.7% accuracy | Future work involves the application of the proposed model on other databases
16 | H. M. Fayek et al. (2017) [114] | Anger, Happy, Neutral, Sad and Silence | IEMOCAP Database | Feed-forward in combination with Recurrent Neural Network (RNN) and CNN to recognize emotion from speech | Proposed SER technique relies on minimal speech processing and frame-based end-to-end deep learning with 64.78% accuracy | Future work may be to apply the same model to other databases
17 | Q. Mao et al. (2017) [115] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | INTERSPEECH 2009 Emotion Challenge, ABC and Emo-DB | Two-layer Emotion-discriminative and Domain-invariant Feature Learning Method (EDFLM) | The proposed method effectively transferred knowledge and enhanced the classification performance with 65.62% and 61.63% accuracy | Future direction may lead to application on other databases
18 | S. Zhang et al. (2017) [116] | Anger, Fear, Happy, Neutral, Sad, Surprise, and Average | EMoDB, eNTERFACE05, RML and BAUM-1s | DCNN model with DTPM strategy in combination with temporal pyramid matching and Lp-norm pooling for feature representation | The model provides accuracy of 87.31% for Emo-DB, 69.70% for RML, 76.56% for eNTERFACE05 and 44.61% for BAUM-1s | Future directions may lead to CNN with LSTM modality for SER
19 | J. Deng et al. (2017) [117] | Neutral, Anger, Fear, Disgust, Sadness, Boredom and Happiness | ABC, EMoDB and Geneva Whispered Emotion Corpus (GeWEC) | Unsupervised Universum Autoencoders adaptive model | The model learns discriminative data and incorporates unlabeled learning with 62%, 63.3% and 62.8% accuracy | The future works may lead to paralinguistic computations
20 | S. Mirsamadi et al. (2017) [118] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | IEMOCAP database | Deep Recurrent Neural Network (RNN) | Deep RNNs for frame-level characterization achieve +5.7% and +3.1% accuracy in WA and UA, respectively, as compared to SVM | The presented model can be tested on other deep learning techniques and databases
21 | Y. Zhang et al. (2017) [119] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | ComParE and eGeMAPS feature sets | Multi-tasking DNNs with a few shared hidden layers (MT-SHL-DNN) | Multi-tasking DNN has been used for the very first time with 93.6% accuracy | Can use multiple Softmax in combination with linear layers for classification and regression tasks
22 | Y. Zhao et al. (2017) [120] | Neutral, Anger, Fear, Disgust, Sadness, Joy and Happiness | IEMOCAP for emotion recognition and TIMIT for phoneme recognition | Recurrent Convolutional Neural Network (RCNN) | This hybrid model has been applied to SER for the very first time with 83.4% accuracy | Possibility of making this model more generic and efficient for cross-modal deep learning
23 | Z. Q. Wang and I. Tashev (2017) [121] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | Mandarin dataset | DNN-based kernel extreme learning machine (ELM) | DNN-ELM approach provides 3.8% weighted accuracy and 2.94% unweighted accuracy gains in SER | Future works include testing on other corpora
24 | J. Han et al. (2017) [122] | Anger, Happiness, Sadness, Neutral | Spontaneous emotional RECOLA Database | Memory-enhanced Recurrent Neural Network (RNN) with an LSTM baseline | Novelty here is the Reconstruction-Error-based (RE-based) framework for emotion recognition on continuous data | Future work involves the utilization of BLSTM-RNN with regression such as Support Vector Regression

25. Abdul Malik et. al Anger, Boredom, Berlin Emotions Deep Convolutional Emotion recognition The model need to
(2017) [123] Disgust, Joy, Sadness Database Neural Network has been done using be trained enough to
and Neutral (CNN) spectrogram of the recognize the emo-
speech signal tion for fear. Also it
can be test with other
emotional databases
26. Q. Mao et al. (2016) [124]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: INTERSPEECH 2009 Emotion Challenge, FAU-AEC, Emo-DB. Approach: two-layer DNN for Sharing Priors between Related Source and Target classes (SPRST). Contribution: the proposed method effectively transferred knowledge and enhanced the classification performance for emotion recognition. Future direction: may move from a single-layer towards a multi-layer architecture.
27. P. Barros et al. (2016) [125]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: GTZAN, SAVEE and EmotiW corpora. Approach: cross-channel-architecture-based Deep Neural Network (DNN). Contribution: the model performs better for speech, music and complex audio signals from recorded clips. Future direction: the model needs training to achieve optimum results in generalized auditory scenarios.
28. W. Lim et al. (2016) [126]. Emotions: Neutral, Anger, Fear, Disgust, Sadness, Boredom and Happiness. Database: Berlin Emotional Database (Emo-DB). Approach: concatenation of CNNs and RNNs with no traditional hand-crafted features. Contribution: the deep hierarchical feature extraction architecture of CNNs is combined with LSTM network layers for better emotion recognition. Future direction: may lead to more concatenated CNNs and to multimodal (audio/video) emotion recognition.
29. W. Q. Zheng et al. (2015) [127]. Emotions: Neutral, Anger, Fear, Sadness, Joy and Happiness. Database: Interactive Emotional Dyadic Motion Capture database (IEMOCAP). Approach: Principal Component Analysis (PCA) based Deep Convolutional Neural Network (CNN). Contribution: better results obtained using the Deep CNN, with 40% higher accuracy; works efficiently with SVM. Future direction: the input audio length is fixed and needs extension to variable-length input data.
30. P. Barros et al. (2015) [128]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: Cohn-Kanade dataset for emotions, CAM3D corpus for spontaneous responses. Approach: Deep Convolutional Neural Network (CNN). Contribution: the obtained emotional expressions are spontaneous and can thus easily be classified as positive or negative. Future direction: improvement is possible to achieve accurate decisions for various emotions.
31. H. M. Fayek et al. (2015) [129]. Emotions: Anger, Boredom, Disgust, Joy, Sadness and Neutral. Databases: eNTERFACE and SAVEE. Approach: end-to-end architecture based on a DNN. Contribution: 60.53% accuracy for the eNTERFACE dataset (6 emotions) and 59.7% accuracy for the SAVEE dataset (all 7 emotions). Future direction: the model can be modified for real-time input speech data.
32. Q. Mao et al. (2014) [130]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: Emo-DB, MES, SAVEE, DES. Approach: Sparse Auto-Encoder (SAE) and Salient Discriminative Feature Analysis (SDFA) based Convolutional Neural Network (CNN). Contribution: trained the CNN with learned affect-salient features and achieved robust emotion recognition under varying speakers, languages and environments. Future direction: the model can be modified for naturalistic speech input to obtain real-time emotion recognition.
33. Z. Huang et al. (2014) [131]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: Emo-DB, MES, SAVEE, DES. Approach: unsupervised and semi-supervised CNN. Contribution: feature learning and emotion-salient feature learning using a semi-CNN. Future direction: the model can be modified for naturalistic speech input to obtain real-time emotion recognition.
34. J. Niu et al. (2014) [132]. Emotions: Anger, Boredom, Disgust, Fear, Joy, Sadness, Neutral. Database: TIMIT corpus. Approach: DNN based on acoustic features such as MFCCs, PLPs and FBANKs. Contribution: a three-hidden-layer DNN combining MFCC, PLP and FBANK acoustics, achieving 8.2% emotion recognition. Future direction: future work may involve increasing the number of hidden layers of the considered DNN.

35. L. Li et al. (2013) [133]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: eNTERFACE'05 and Berlin Database. Approach: HMM-based hybrid DNN (DNN-HMM). Contribution: using DNN-HMM provides 12.22% better results for eNTERFACE'05, 11.67% for GMM-HMM, 10.56% for MLP-HMM, and 17.22% for shallow-NN-HMM, respectively. Future direction: the authors intend to extend their system to audio-visual SER.
36. A. Stuhlsatz et al. (2011) [134]. Emotions: Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger. Databases: eNTERFACE and Emo-DB. Approach: Generalized Discriminant Analysis (GerDA) based Deep Neural Network (DNN). Contribution: comparative analysis of the GerDA-based DNN with SVM, obtaining better speech emotion recognition. Future direction: the comparison can be modeled and extended towards multi-modality.
37. L. Fu et al. (2008) [135]. Emotions: Anger, Boredom, Disgust, Fear, Joy, Sadness, Neutral. Databases: Berlin Database and Mandarin Database. Approach: hybrid-feature-based Artificial Neural Network (ANN). Contribution: achieved better results for relative features compared to absolute features for emotion recognition. Future direction: exploration and extraction of a more rationalized method for feature extraction.
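Several of the surveyed works above report results in terms of weighted accuracy (WA) and unweighted accuracy (UA), for instance [118] and [121]. As a brief illustration of the difference, the following NumPy sketch (using hypothetical label arrays) computes both metrics: WA is the overall fraction of correctly classified utterances, while UA is the average of the per-class recalls, so that under-represented emotions contribute equally.

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """WA: overall fraction of utterances classified correctly.
    UA: mean of the per-class recalls (rare emotions weigh equally)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return wa, float(np.mean(recalls))

# Hypothetical predictions over a small 4-class test set (0=angry, 1=happy, 2=sad, 3=neutral).
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 0, 2, 3]
print(weighted_unweighted_accuracy(y_true, y_pred))  # (0.75, 0.8125)
```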

alyzed in [125]. These features include voice and music-based recognition. Further, the utilization of a cross-channel architecture can improve the general performance in a complex environment. The model provided good results for human speech and music signals; however, the results for generalized auditory emotion recognition are not optimal [118]. The purpose of this cross-channel hierarchy is to extract specific features and combine them into a much more generalized scenario. These models can also be coupled with visual-based DNNs to improve automatic SER. The use of RNNs in such a scenario can further boost the performance for input data with time-dependent constraints.
According to the study in [128], the evaluation of the CNN has been assessed in a more autonomous scenario known as Human-Robot Interaction (HRI). The HRI setup used a humanoid robotic head that produced the desired emotional feedback. It runs several processes simultaneously: feature extraction based on shape, facial characteristics, and information regarding the subject's motion. The reception of shunting inhibitory fields internally increased the depth of the network for emotion extraction. Cross-channel learning is then used to correlate the static and dynamic streams in the same scenario. As the model works efficiently for a person producing spontaneous real-time expressions, it can be further extended to a multimodal system, where visual stimuli can also be used as input along with audio.
A hybrid deep learning modality may inherit the underlying properties of both RNNs and CNNs, with convolutional layers embedded within the RNN [102], [113], [126]. This enables the model to capture both the frequency and the temporal dependencies in a given speech signal. A memory-enhanced, reconstruction-error-based RNN can also be used for continuous speech emotion recognition [105]. This RNN model uses two components: an auto-encoder for the reconstruction of features, and a second component for the prediction of emotions. It can also be used to obtain further insights into the behavior of BLSTM-based RNNs using regression models such as SVR [103].
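As an illustration of this hybrid modality, the sketch below (PyTorch, with layer sizes, input shape and a four-class output chosen purely for illustration rather than taken from [102], [113] or [126]) stacks a small CNN over a log-mel spectrogram and feeds the resulting frame sequence to an LSTM, so that spectral structure and temporal dependency are handled by different stages.

```python
import torch
import torch.nn as nn

class CRNNEmotion(nn.Module):
    """Toy convolutional-recurrent SER model: the CNN summarizes spectral
    (frequency) structure, the LSTM models temporal dependency."""
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                       # pool over frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                            hidden_size=64, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, time)
        h = self.conv(spec)                   # (batch, 32, n_mels // 4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time, 32 * n_mels // 4)
        _, (h_n, _) = self.lstm(h)            # last hidden state summarizes the utterance
        return self.out(h_n[-1])              # (batch, n_classes) emotion logits

logits = CRNNEmotion()(torch.randn(2, 1, 64, 120))  # two 120-frame log-mel spectrograms
print(logits.shape)                                  # torch.Size([2, 4])
```

In practice the recurrent stage could equally be a BLSTM, and the final hidden state could feed a regressor such as SVR for continuous emotion prediction, as suggested above.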
SER algorithms based on CNNs and RNNs have been investigated in [126]. The deep hierarchical CNN architecture for feature extraction has been combined with LSTM network layers. It was found that CNNs with a time-distributed network provide results with greater accuracy. Similarly, in [127], a system based on a deep convolutional network (DCNN) that uses audio data as input is presented, named PCA-DCNNs-SER. It contains two convolutional layers and two pooling layers. Before the log-spectrogram is computed, the background interference is eliminated using a Principal Component Analysis (PCA) scheme. The noise-free spectrogram is then divided into non-overlapping components. Using hand-crafted acoustic features, this model also performed well for SVM-based classification. As emotional expressions are usually spontaneous, their classification into positive and negative domains is comparatively easy.
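The PCA preprocessing step described for [127] can be sketched as follows: each log-spectrogram frame is projected onto its leading principal components and reconstructed, so that low-variance directions, which mostly carry background interference, are discarded. The component count and the random stand-in data below are illustrative assumptions, not the settings used in [127].

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_denoise_spectrogram(log_spec, n_components=20):
    """log_spec: (n_frames, n_freq_bins) log-magnitude spectrogram.
    Keep only the leading components and reconstruct each frame."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(log_spec)   # (n_frames, n_components)
    return pca.inverse_transform(reduced)   # (n_frames, n_freq_bins), smoothed

rng = np.random.default_rng(0)
noisy = rng.normal(size=(300, 128))         # stand-in for a real log-spectrogram
print(pca_denoise_spectrogram(noisy).shape)  # (300, 128)
```

The denoised frames would then be split into non-overlapping segments and passed to the two-convolution, two-pooling network described above.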
Preprocessing is an additional task that is usually required before emotion can be recognized from the speech signal [129]. To eliminate this overhead, a real-time SER system is needed that can work on an end-to-end deep learning architecture. An example of spectrogram processing using a DNN is given in [130]. It requires a deep hierarchical architecture, regularization, and data augmentation.
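A minimal sketch of the kind of input pipeline such end-to-end systems rely on is given below, assuming librosa for feature extraction; the mel resolution, noise level and gain range are illustrative choices rather than values reported in [129] or [130].

```python
import numpy as np
import librosa

def log_mel(wave, sr=16000, n_mels=64):
    """Log-mel spectrogram, a common input representation for spectrogram-based SER."""
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)          # (n_mels, n_frames)

def augment(wave, rng, noise_std=0.005):
    """Very simple data augmentation: additive Gaussian noise and a random gain."""
    gain = rng.uniform(0.8, 1.2)
    return gain * wave + rng.normal(scale=noise_std, size=wave.shape)

rng = np.random.default_rng(0)
wave = rng.normal(scale=0.1, size=16000)     # 1 s of dummy audio instead of a real utterance
features = log_mel(augment(wave, rng))
print(features.shape)                        # (64, n_frames)
```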
Emotion recognition from speech using salient features has also been researched, where a CNN based on affective learning is used [131]. Initially, the CNN is trained with unlabeled samples to learn localized invariant features (LIF) using a sparse auto-encoder (SAE); a well-known variation of the auto-encoder is the Variational Auto-encoder (VAE) [130]. Afterward, the LIF are used as input for feature extraction with salient discriminative feature analysis (SDFA). The aim is to learn salient features that are discriminative and orthogonal for speech emotion recognition [131]. The results obtained with these experiments are more stable, accurate and robust in complex scenarios where there is variation in language and speaker, as well as other environmental distortion. Emotions such as happiness, joy, sadness, neutral, surprise, boredom, disgust, fear, and anger can be recognized with relative ease, but it becomes much harder when real-time emotion recognition is desired. A variation of this salient-feature-based emotion learning is presented in [131], where a semi-supervised convolutional neural network (semi-CNN) is proposed for SER. The same two-stage training is carried out, and the experimental results are robust, stable and accurate for emotion recognition in scenarios such as a distorted speech environment; the experiments, however, are carried out on synthetic data.
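The first, unsupervised stage can be illustrated with a small sparse auto-encoder: unlabeled spectrogram patches are reconstructed through a narrow code, and an L1 penalty on the code encourages the sparse, affect-salient features described above. Patch size, code size, sparsity weight and the random training data are assumptions, not the configuration of [130] or [131].

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Unlabeled spectrogram patches in, reconstructions out; an L1 penalty on the
    hidden code encourages sparse, emotion-salient features (cf. the SAE stage in [130])."""
    def __init__(self, patch_dim=256, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Linear(code_dim, patch_dim)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.randn(128, 256)              # stand-in for unlabeled spectrogram patches
for _ in range(10):                          # a few illustrative training steps
    recon, code = model(patches)
    loss = nn.functional.mse_loss(recon, patches) + 1e-3 * code.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```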
A comparison of DNNs with GMMs for SER is presented in [132], where both techniques are tested on the six universal emotions. The comparative analysis of GMM and DNN classifiers showed that the system performance could be improved after the introduction of deep learning. For system modeling, various sizes of deep hidden layers (512, 1024, 2048 and so on) and acoustic features such as MFCCs, PLPs and FBANKs are used. Combining these three acoustic features gave better performance, with a recognition rate of 92.3% compared to 92.1% for the single-feature case using MFCCs. The study concluded that emotion recognition systems based on deep learning architectures provide better performance than systems using GMMs as classifiers. A DNN based on Generalized Discriminant Analysis (GerDA) for SER is proposed in [133]–[135]. The model learns low-dimensional discriminative features optimized for fast classification of a large acoustic feature set. A comparative analysis is carried out between GerDA and competing linear classifiers such as SVMs to assess its performance.
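A compact way to see the kind of comparison carried out in [132] is to train a per-class GMM baseline and a small fully connected network on the same utterance-level acoustic features. The feature dimensionality, layer sizes and synthetic data below are placeholders, so the printed accuracies are meaningless except as a template for the experiment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-ins for utterance-level acoustic features (e.g., averaged MFCC/PLP/FBANK statistics).
X_train = rng.normal(size=(200, 13)); y_train = rng.integers(0, 6, size=200)
X_test = rng.normal(size=(50, 13));   y_test = rng.integers(0, 6, size=50)

# GMM baseline: one mixture per emotion class, classify by maximum average log-likelihood.
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(X_train[y_train == c])
        for c in np.unique(y_train)}
gmm_pred = np.array([max(gmms, key=lambda c: gmms[c].score(x[None, :])) for x in X_test])

# DNN counterpart: a small fully connected network over the same feature vectors.
dnn = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
dnn.fit(X_train, y_train)
print("GMM accuracy:", float(np.mean(gmm_pred == y_test)),
      "DNN accuracy:", dnn.score(X_test, y_test))
```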
Deep learning is mostly associated with natural language processing, but its architectures can handle different modalities such as SER and speech recognition. According to [136], the RvNN can be used for the classification of natural language sentences as well as for natural image processing. It first computes a score for every candidate pair of units that could be merged, and uses these scores to build a syntactic tree. The highest-scoring pair is combined into a vector known as the compositional vector [137]–[139]. Once a pair is merged, the RvNN generates multiple outputs: the new units, the vectors representing each region, and the classification labels. The RvNN tree structure is thus a compositional vector representation of the entire considered region.
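The greedy pair-merging procedure can be sketched as follows: adjacent node vectors are scored, the highest-scoring pair is merged through a shared composition layer into a compositional vector, and the process repeats until a single vector represents the whole sentence or image region, which is then classified. The dimensions and the single scoring layer below are illustrative assumptions rather than the exact formulation of [136].

```python
import torch
import torch.nn as nn

class TinyRvNN(nn.Module):
    """Greedy recursive composition: repeatedly merge the best-scoring adjacent pair."""
    def __init__(self, dim=32, n_classes=4):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())  # compositional vector
        self.score = nn.Linear(dim, 1)                                    # plausibility of a merge
        self.classify = nn.Linear(dim, n_classes)                         # label for the final node

    def forward(self, leaves):                        # leaves: list of (dim,) vectors
        nodes = list(leaves)
        while len(nodes) > 1:
            merged = [self.compose(torch.cat([a, b])) for a, b in zip(nodes, nodes[1:])]
            scores = torch.stack([self.score(m) for m in merged])
            best = int(torch.argmax(scores))          # highest-scoring adjacent pair
            nodes[best:best + 2] = [merged[best]]     # replace the pair by its parent
        return self.classify(nodes[0])                # label scores for the full tree/region

rvnn = TinyRvNN()
print(rvnn([torch.randn(32) for _ in range(5)]).shape)  # torch.Size([4])
```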
As a final remark, deep learning is rapidly becoming the method of choice over traditional techniques for SER. Moreover, much of the research is evolving towards multimodal and unsupervised SER, speech recognition and NLP [140]–[143]. Multimodal emotion recognition can use input data such as audio-visual streams at the same time, in an efficient way.

VI. CONCLUSIONS
This paper has provided a detailed review of deep learning techniques for SER. Deep learning techniques such as DBM, RNN, DBN, CNN, and AE have been the subject of much research in recent years. These deep learning methods and their layer-wise architectures are briefly elaborated based on the classification of various natural emotions such as happiness, joy, sadness, neutral, surprise, boredom, disgust, fear, and anger. These methods offer easy model training as well as the efficiency of shared weights. Limitations of deep learning techniques include their large layer-wise internal architecture, lower efficiency for temporally-varying input data, and over-learning during memorization of layer-wise information. This research work forms a base for evaluating the performance and limitations of current deep learning techniques, and it highlights some promising directions for better SER systems.

REFERENCES
[1] B. W. Schuller, "Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018.
[2] M. S. Hossain and G. Muhammad, "Emotion recognition using deep learning approach from audio–visual emotional big data," Information Fusion, vol. 49, pp. 69–78, 2019.
[3] M. Chen, P. Zhou, and G. Fortino, "Emotion communication system," IEEE Access, vol. 5, pp. 326–337, 2017.
[4] N. D. Lane and P. Georgiev, "Can deep learning revolutionize mobile sensing?" in Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications. ACM, 2015, pp. 117–122.
[5] J. G. Rázuri, D. Sundgren, R. Rahmani, A. Moran, I. Bonet, and A. Larsson, "Speech emotion recognition in emotional feedback for human-robot interaction," International Journal of Advanced Research in Artificial Intelligence (IJARAI), vol. 4, no. 2, pp. 20–27, 2015.
[6] D. Le and E. M. Provost, "Emotion recognition from spontaneous speech using hidden markov models with deep belief networks," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 216–221.
[7] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.
[8] J. G. Rázuri, D. Sundgren, R. Rahmani, A. Moran, I. Bonet, and A. Larsson, "Speech emotion recognition in emotional feedback for human-robot interaction," International Journal of Advanced Research in Artificial Intelligence (IJARAI), vol. 4, no. 2, pp. 20–27, 2015.
[9] S. Lalitha, A. Madhavan, B. Bhushan, and S. Saketh, "Speech emotion recognition," in Advances in Electronics, Computers and Communications (ICAECC), 2014 International Conference on. IEEE, 2014, pp. 1–4.
[10] K. R. Scherer, "What are emotions? And how can they be measured?" Social Science Information, vol. 44, no. 4, pp. 695–729, 2005.
[11] T. Balomenos, A. Raouzaiou, S. Ioannou, A. Drosopoulos, K. Karpouzis, and S. Kollias, "Emotion analysis in man-machine interaction systems," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2004, pp. 318–328.
[12] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
[13] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion recognition by speech signals," in Eighth European Conference on Speech Communication and Technology, 2003.
[14] R. W. Picard, Affective Computing. Perceptual Computing Section, Media Laboratory, Massachusetts Institute of Technology, 1995.
[15] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: a review," International Journal of Speech Technology, vol. 15, no. 2, pp. 99–117, 2012.

[16] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion [38] M. Alpert, E. R. Pouget, and R. R. Silva, “Reflections of depression in
recognition: Features, classification schemes, and databases,” Pattern acoustic measures of the patient’s speech,” Journal of affective disorders,
Recognition, vol. 44, no. 3, pp. 572–587, 2011. vol. 66, no. 1, pp. 59–69, 2001.
[17] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, “Emotion recognition by [39] Y. Zhou, Y. Sun, J. Zhang, and Y. Yan, “Speech emotion recognition
speech signals,” in Eighth European Conference on Speech Communica- using both spectral and prosodic features,” in Information Engineering
tion and Technology, 2003. and Computer Science, 2009. ICIECS 2009. International Conference on.
[18] A. D. Dileep and C. C. Sekhar, “Gmm-based intermediate matching IEEE, 2009, pp. 1–4.
kernel for classification of varying length patterns of long duration speech [40] S. Mozziconacci, “Prosody and emotions,” in Speech Prosody 2002,
using support vector machines,” IEEE transactions on neural networks International Conference, 2002.
and learning systems, vol. 25, no. 8, pp. 1421–1432, 2014. [41] O. Pierre-Yves, “The production and recognition of emotions in speech:
[19] L. Deng, D. Yu et al., “Deep learning: methods and applications,” features and algorithms,” International Journal of Human-Computer
Foundations and Trends® in Signal Processing, vol. 7, no. 3-4, pp. 197– Studies, vol. 59, no. 1-2, pp. 157–183, 2003.
387, 2014. [42] D. Neiberg, K. Elenius, and K. Laskowski, “Emotion recognition in
[20] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neu- spontaneous speech using gmms,” in Ninth International Conference on
ral networks, vol. 61, pp. 85–117, 2015. Spoken Language Processing, 2006.
[43] A. Dileep and C. C. Sekhar, “Hmm based intermediate matching kernel
[21] T. Vogt and E. André, “Comparing feature sets for acted and spontaneous
for classification of sequential patterns of speech using support vector
speech in view of automatic emotion recognition,” in Multimedia and
machines,” IEEE Transactions on Audio, Speech, and Language Process-
Expo, 2005. ICME 2005. IEEE International Conference on. IEEE,
ing, vol. 21, no. 12, pp. 2570–2582, 2013.
2005, pp. 474–477.
[44] G. Vyas, M. K. Dutta, K. Riha, J. Prinosil et al., “An automatic emotion
[22] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, “Features and
recognizer using mfccs and hidden markov models,” in Ultra Modern
classifiers for emotion recognition from speech: a survey from 2000 to
Telecommunications and Control Systems and Workshops (ICUMT),
2011,” Artificial Intelligence Review, vol. 43, no. 2, pp. 155–177, 2015.
2015 7th International Congress on. IEEE, 2015, pp. 320–324.
[23] A. Batliner, B. Schuller, D. Seppi, S. Steidl, L. Devillers, L. Vidrascu, [45] S. Wang, X. Ling, F. Zhang, and J. Tong, “Speech emotion recogni-
T. Vogt, V. Aharonson, and N. Amir, “The automatic recognition of tion based on principal component analysis and back propagation neu-
emotions in speech,” in Emotion-Oriented Systems. Springer, 2011, ral network,” in Measuring Technology and Mechatronics Automation
pp. 71–99. (ICMTMA), 2010 International Conference on, vol. 3. IEEE, 2010, pp.
[24] E. Mower, M. J. Mataric, and S. Narayanan, “A framework for automatic 437–440.
human emotion classification using emotion profiles,” IEEE Transactions [46] Y. Pan, P. Shen, and L. Shen, “Speech emotion recognition using support
on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1057– vector machine,” International Journal of Smart Home, vol. 6, no. 2, pp.
1070, 2011. 101–108, 2012.
[25] J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Prediction-based learning [47] Y. Chavhan, M. Dhore, and P. Yesaware, “Speech emotion recognition
for continuous emotion recognition in speech,” in Acoustics, Speech and using support vector machine,” International Journal of Computer Appli-
Signal Processing (ICASSP), 2017 IEEE International Conference on. cations, vol. 1, no. 20, pp. 6–9, 2010.
IEEE, 2017, pp. 5005–5009. [48] B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition com-
[26] ——, “Reconstruction-error-based learning for continuous emotion bining acoustic features and linguistic information in a hybrid support
recognition in speech,” in Acoustics, Speech and Signal Processing vector machine-belief network architecture,” in Acoustics, Speech, and
(ICASSP), 2017 IEEE International Conference on. IEEE, 2017. Signal Processing, 2004. Proceedings.(ICASSP’04). IEEE International
[27] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect Conference on, vol. 1. IEEE, 2004, pp. 1–577.
recognition methods: Audio, visual, and spontaneous expressions,” IEEE [49] M. Swain, A. Routray, and P. Kabisatpathy, “Databases, features and clas-
transactions on pattern analysis and machine intelligence, vol. 31, no. 1, sifiers for speech emotion recognition: a review,” International Journal of
pp. 39–58, 2009. Speech Technology, vol. 21, no. 1, pp. 93–120, 2018.
[28] T. Vogt, E. André, and J. Wagner, “Automatic recognition of emotions [50] D. Ververidis and C. Kotropoulos, “A state of the art review on emotional
from speech: a review of the literature and recommendations for prac- speech databases,” in Proceedings of 1st Richmedia Conference. Cite-
tical realisation,” in Affect and emotion in human-computer interaction. seer, 2003, pp. 109–119.
Springer, 2008, pp. 75–91. [51] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss,
[29] J. Deng, S. Frühholz, Z. Zhang, and B. Schuller, “Recognizing emotions “A database of german emotional speech,” in Ninth European Conference
from whispered speech based on acoustic feature transfer learning,” IEEE on Speech Communication and Technology, 2005.
Access, vol. 5, pp. 5235–5246, 2017. [52] J. A. Coan and J. J. Allen, Handbook of emotion elicitation and assess-
[30] S. Demircan and H. Kahramanlı, “Feature extraction from speech data for ment. Oxford university press, 2007.
emotion recognition,” Journal of advances in Computer Networks, vol. 2, [53] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The enterface’05 audio-visual
no. 1, pp. 28–30, 2014. emotion database,” in Data Engineering Workshops, 2006. Proceedings.
22nd International Conference on. IEEE, 2006, pp. 8–8.
[31] F. Dellaert, T. Polzin, and A. Waibel, “Recognizing emotion in speech,” in
[54] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Tor-
Fourth International Conference on Spoken Language Processing, 1996.
res Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “Avec 2016:
[32] Y. Zhou, Y. Sun, J. Zhang, and Y. Yan, “Speech emotion recognition
Depression, mood, and emotion recognition workshop and challenge,” in
using both spectral and prosodic features,” in Information Engineering
Proceedings of the 6th International Workshop on Audio/Visual Emotion
and Computer Science, 2009. ICIECS 2009. International Conference on.
Challenge, ser. AVEC ’16, no. 8. New York, NY, USA: ACM, 2016, pp.
IEEE, 2009, pp. 1–4.
3–10.
[33] S. Haq, P. J. Jackson, and J. Edge, “Audio-visual feature selection and [55] P. Jackson and S. Haq, “Surrey audio-visual expressed emotion (savee)
reduction for emotion classification,” in Proc. Int. Conf. on Auditory- database,” University of Surrey: Guildford, UK, 2014.
Visual Speech Processing (AVSP’08), Tangalooma, Australia, 2008. [56] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the
[34] M. Alpert, E. R. Pouget, and R. R. Silva, “Reflections of depression in recola multimodal corpus of remote collaborative and affective interac-
acoustic measures of the patient’s speech,” Journal of affective disorders, tions,” in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE
vol. 66, no. 1, pp. 59–69, 2001. International Conference and Workshops on. IEEE, 2013, pp. 1–8.
[35] D. Ververidis and C. Kotropoulos, “Emotional speech recognition: Re- [57] R. Cowie, E. Douglas-Cowie, and C. Cox, “Beyond emotion archetypes:
sources, features, and methods,” Speech communication, vol. 48, no. 9, Databases for emotion modeling using neural networks,” Neural net-
pp. 1162–1181, 2006. works, vol. 18, no. 4, pp. 371–388, 2005.
[36] S. Mozziconacci, “Prosody and emotions,” in Speech Prosody 2002, [58] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,
International Conference, 2002. no. 7553, p. 436, 2015.
[37] J. Hirschberg, S. Benus, J. M. Brenier, F. Enos, S. Friedman, S. Gilman, [59] W. Wang, Machine Audition: Principles, Algorithms and Systems: Prin-
C. Girand, M. Graciarena, A. Kathol, L. Michaelis et al., “Distinguishing ciples, Algorithms and Systems. IGI Global, 2010.
deceptive from non-deceptive speech,” in Ninth European Conference on [60] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic
Speech Communication and Technology, 2005. emotions and affect in speech: State of the art and lessons learnt from

the first challenge,” Speech communication, vol. 53, no. 9-10, pp. 1062– Technology in Automation, Control, and Intelligent Systems (CYBER),
1087, 2011. 2016 IEEE International Conference on. IEEE, 2016, pp. 308–312.
[61] M. Sidorov, S. Ultes, and A. Schmitt, “Emotions are a personal thing: [85] J. Deng, R. Xia, Z. Zhang, Y. Liu, and B. Schuller, “Introducing shared-
Towards speaker-adaptive emotion recognition,” in Acoustics, Speech hidden-layer autoencoders for transfer learning and their application
and Signal Processing (ICASSP), 2014 IEEE International Conference in acoustic emotion recognition,” in Proc. International Conference on
on. IEEE, 2014, pp. 4803–4807. Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 4851–
[62] M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan, 4855.
“Context-sensitive multimodal emotion recognition from speech and [86] J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoder-
facial expression using bidirectional lstm modeling,” in Proc. INTER- based feature transfer learning for speech emotion recognition,” in 2013
SPEECH 2010, Makuhari, Japan, 2010, pp. 2362–2365. Humaine Association Conference on Affective Computing and Intelli-
[63] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, gent Interaction. IEEE, 2013, pp. 511–516.
no. 7553, p. 436, 2015. [87] L. Cen, W. Ser, and Z. L. Yu, “Speech emotion recognition using canon-
[64] Y. Niu, D. Zou, Y. Niu, Z. He, and H. Tan, “A breakthrough in speech ical correlation analysis and probabilistic neural network,” in Machine
emotion recognition using deep retinal convolution neural networks,” Learning and Applications, 2008. ICMLA’08. Seventh International Con-
arXiv preprint arXiv:1707.09917, 2017. ference on. IEEE, 2008, pp. 859–862.
[65] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neu- [88] X. Zhou, J. Guo, and R. Bie, “Deep learning based affective model for
ral networks, vol. 61, pp. 85–117, 2015. speech emotion recognition,” in Ubiquitous Intelligence & Computing,
[66] L. Deng, D. Yu et al., “Deep learning: methods and applications,” Advanced and Trusted Computing, Scalable Computing and Communi-
Foundations and Trends® in Signal Processing, vol. 7, no. 3-4, pp. 197– cations, Cloud and Big Data Computing, Internet of People, and Smart
387, 2014. World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), 2016
[67] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and Intl IEEE Conferences. IEEE, 2016, pp. 841–846.
trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. [89] K. Han, D. Yu, and I. Tashev, “Speech emotion recognition using deep
[68] M. W. Bhatti, Y. Wang, and L. Guan, “A neural network approach for neural network and extreme learning machine,” in Fifteenth annual
human emotion recognition in speech,” in Circuits and Systems, 2004. conference of the international speech communication association, 2014.
ISCAS’04. Proceedings of the 2004 International Symposium on, vol. 2. [90] K.-C. Huang and Y.-H. Kuo, “A novel objective function to optimize
IEEE, 2004, pp. II–81. neural networks for emotion recognition from speech patterns,” in Nature
[69] K. Poon-Feng, D.-Y. Huang, M. Dong, and H. Li, “Acoustic emotion and Biologically Inspired Computing (NaBIC), 2010 Second World
recognition based on fusion of multiple feature-dependent deep boltz- Congress on. IEEE, 2010, pp. 413–417.
mann machines,” in Chinese Spoken Language Processing (ISCSLP), [91] E. M. Albornoz, M. Sánchez-Gutiérrez, F. Martinez-Licona, H. L.
2014 9th International Symposium on. IEEE, 2014, pp. 584–588. Rufiner, and J. Goddard, “Spoken emotion recognition using deep learn-
[70] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltz- ing,” in Iberoamerican Congress on Pattern Recognition. Springer, 2014,
mann machines,” in Proceedings of the 27th international conference on pp. 104–111.
machine learning (ICML-10), 2010, pp. 807–814. [92] D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide, “Feature learning in
[71] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm deep neural networks-studies on speech recognition tasks,” arXiv preprint
for deep belief nets,” Neural computation, MIT Press, vol. 18, no. 7, pp. arXiv:1301.3605, 2013.
1527–1554, 2006. [93] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,
[72] A.-r. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural
recognition,” in Nips workshop on deep learning for speech recognition networks for acoustic modeling in speech recognition: The shared views
and related applications, vol. 1, no. 9. Vancouver, Canada, 2009, p. 39. of four research groups,” IEEE Signal processing magazine, vol. 29,
[73] J. Lee and I. Tashev, “High-level feature representation using recurrent no. 6, pp. 82–97, 2012.
neural network for speech emotion recognition,” 2015. [94] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-
[74] V. Chernykh, G. Sterling, and P. Prihodko, “Emotion recogni- trained deep neural networks for large-vocabulary speech recognition,”
tion from speech with recurrent neural networks,” arXiv preprint IEEE Transactions on audio, speech, and language processing, vol. 20,
arXiv:1701.08071, 2017. no. 1, pp. 30–42, 2012.
[75] F. Weninger, F. Ringeval, E. Marchi, and B. W. Schuller, “Discrimi- [95] A. Satt, S. Rozenberg, and R. Hoory, “Efficient emotion recognition from
natively trained recurrent neural networks for continuous dimensional speech using deep learning on spectrograms.” in INTERSPEECH, 2017,
emotion recognition from audio.” in IJCAI, vol. 2016, 2016, pp. 2196– pp. 1089–1093.
2202. [96] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-
[76] Y. Kamp and M. Hasler, Recursive neural networks for associative dependent deep neural networks for conversational speech transcription,”
memory. John Wiley & Sons Chichester, 1990. in Automatic Speech Recognition and Understanding (ASRU), 2011
[77] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, “Parsing natural scenes IEEE Workshop on. IEEE, 2011, pp. 24–29.
and natural language with recursive neural networks,” in Proceedings of [97] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neu-
the 28th international conference on machine learning (ICML-11), 2011, ral network for modelling sentences,” arXiv preprint arXiv:1404.2188,
pp. 129–136. 2014.
[78] G. Wen, H. Li, J. Huang, D. Li, and E. Xun, “Random deep belief [98] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou,
networks for recognizing emotions from speech signals,” Computational B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emo-
intelligence and neuroscience, vol. 2017, 2017. tion recognition using a deep convolutional recurrent network,” in Acous-
[79] A.-r. Mohamed, G. E. Dahl, G. Hinton et al., “Acoustic modeling using tics, Speech and Signal Processing (ICASSP), 2016 IEEE International
deep belief networks,” IEEE Trans. Audio, Speech & Language Process- Conference on. IEEE, 2016, pp. 5200–5204.
ing, vol. 20, no. 1, pp. 14–22, 2012. [99] J. Zhao, X. Mao, and L. Chen, “Speech emotion recognition using deep
[80] C. Huang, W. Gong, W. Fu, and D. Feng, “A research of speech emo- 1d & 2d cnn lstm networks,” Biomedical Signal Processing and Control,
tion recognition based on deep belief network and svm,” Mathematical vol. 47, pp. 312–323, 2019.
Problems in Engineering, vol. 2014, 2014. [100] S. Tripathi and H. Beigi, “Multi-modal emotion recognition on iemocap
[81] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer- dataset using deep learning,” arXiv preprint arXiv:1804.05788, 2018.
wise training of deep networks,” in Advances in neural information [101] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, “Speech
processing systems, 2007, pp. 153–160. emotion recognition using spectrogram & phoneme embedding,” Proc.
[82] W. Zheng, J. Yu, and Y. Zou, “An experimental study of speech emotion Interspeech 2018, pp. 3688–3692, 2018.
recognition based on deep convolutional neural networks,” in Affective [102] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, “On
Computing and Intelligent Interaction (ACII), 2015 International Confer- the robustness of speech emotion recognition for human-robot interaction
ence on. IEEE, 2015, pp. 827–831. with deep neural networks,” in 2018 IEEE/RSJ International Conference
[83] Y. Kim, “Convolutional neural networks for sentence classification,” on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 854–860.
arXiv preprint arXiv:1408.5882, 2014. [103] D. Tang, J. Zeng, and M. Li, “An end-to-end deep learning framework
[84] W. Fei, X. Ye, Z. Sun, Y. Huang, X. Zhang, and S. Shang, “Research with speech emotion recognition of a typical individuals,” Proc. Inter-
on speech emotion recognition based on deep auto-encoder,” in Cyber speech 2018, pp. 162–166, 2018.

[104] C. W. Lee, K. Y. Song, J. Jeong, and W. Y. Choi, “Convolutional attention target classes,” in Acoustics, Speech and Signal Processing (ICASSP),
networks for multimodal emotion recognition from speech and text data,” 2016 IEEE International Conference on. IEEE, 2016, pp. 2608–2612.
arXiv preprint arXiv:1805.06606, 2018. [125] P. Barros, C. Weber, and S. Wermter, “Learning auditory neural repre-
[105] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, sentations for emotion recognition,” in Neural Networks (IJCNN), 2016
“Adversarial auto-encoders for speech based emotion recognition,” arXiv International Joint Conference on. IEEE, 2016, pp. 921–928.
preprint arXiv:1806.02146, 2018. [126] W. Lim, D. Jang, and T. Lee, “Speech emotion recognition using con-
[106] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Transfer learning volutional and recurrent neural networks,” in Signal and information
for improving speech emotion classification accuracy,” arXiv preprint processing association annual summit and conference (APSIPA), 2016
arXiv:1801.06353, 2018. Asia-Pacific. IEEE, 2016, pp. 1–4.
[107] M. Chen, X. He, J. Yang, and H. Zhang, “3-d convolutional recurrent [127] W. Zheng, J. Yu, and Y. Zou, “An experimental study of speech emotion
neural networks with attention model for speech emotion recognition,” recognition based on deep convolutional neural networks,” in 2015 in-
IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018. ternational conference on affective computing and intelligent interaction
[108] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and (ACII). IEEE, 2015, pp. 827–831.
N. Dehak, “Emotion identification from raw speech signals using dnns,” [128] P. Barros, C. Weber, and S. Wermter, “Emotional expression recognition
Proc. Interspeech 2018, pp. 3097–3101, 2018. with a cross-channel convolutional neural network for human-robot
[109] S. E. Eskimez, Z. Duan, and W. Heinzelman, “Unsupervised learning interaction,” in Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th
approach to feature analysis for automatic speech emotion recognition,” International Conference on. IEEE, 2015, pp. 582–587.
in 2018 IEEE International Conference on Acoustics, Speech and Signal [129] H. M. Fayek, M. Lech, and L. Cavedon, “Towards real-time speech emo-
Processing (ICASSP). IEEE, 2018, pp. 5099–5103. tion recognition using deep neural networks,” in Signal Processing and
[110] J. Zhao, X. Mao, and L. Chen, “Learning deep features to recognize Communication Systems (ICSPCS), 2015 9th International Conference
speech emotion using merged deep cnn,” IET Signal Processing, vol. 12, on. IEEE, 2015, pp. 1–5.
no. 6, pp. 713–721, 2018. [130] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient features for
[111] W. Zhang, D. Zhao, Z. Chai, L. T. Yang, X. Liu, F. Gong, and S. Yang, speech emotion recognition using convolutional neural networks,” IEEE
“Deep learning and svm-based emotion recognition from chinese speech Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
for smart affective services,” Software: Practice and Experience, vol. 47, [131] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, “Speech emotion recognition
no. 8, pp. 1127–1138, 2017. using cnn,” in Proceedings of the 22nd ACM international conference on
[112] L. Zhu, L. Chen, D. Zhao, J. Zhou, and W. Zhang, “Emotion recognition Multimedia. ACM, 2014, pp. 801–804.
from chinese speech for smart affective services using a combination of [132] J. Niu, Y. Qian, and K. Yu, “Acoustic emotion recognition using deep
svm and dbn,” Sensors, vol. 17, no. 7, p. 1694, 2017. neural network,” in Chinese Spoken Language Processing (ISCSLP),
[113] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and 2014 9th International Symposium on. IEEE, 2014, pp. 128–132.
S. Zafeiriou, “End-to-end multimodal emotion recognition using deep [133] L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin,
neural networks,” IEEE Journal of Selected Topics in Signal Processing, and H. Sahli, “Hybrid deep neural network–hidden markov model (dnn-
vol. 11, no. 8, pp. 1301–1309, 2017. hmm) based speech emotion recognition,” in Affective Computing and
[114] H. M. Fayek, M. Lech, and L. Cavedon, “Evaluating deep learning Intelligent Interaction (ACII), 2013 Humaine Association Conference on.
architectures for speech emotion recognition,” Neural Networks, vol. 92, IEEE, 2013, pp. 312–317.
pp. 60–68, 2017. [134] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller,
[115] Q. Mao, G. Xu, W. Xue, J. Gou, and Y. Zhan, “Learning emotion- “Deep neural networks for acoustic emotion recognition: raising the
discriminative and domain-invariant features for domain adaptation in benchmarks,” in Acoustics, speech and signal processing (ICASSP),
speech emotion recognition,” Speech Communication, vol. 93, pp. 1–10, 2011 IEEE international conference on. IEEE, 2011, pp. 5688–5691.
2017. [135] L. Fu, X. Mao, and L. Chen, “Relative speech emotion recognition based
[116] S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion recogni- artificial neural network,” in Computational Intelligence and Industrial
tion using deep convolutional neural network and discriminant temporal Application, 2008. PACIIA’08. Pacific-Asia Workshop on, vol. 2. IEEE,
pyramid matching,” IEEE Transactions on Multimedia, vol. 20, no. 6, pp. 2008, pp. 140–144.
1576–1590, 2017. [136] D. Le, Z. Aldeneh, and E. M. Provost, “Discretized continuous speech
[117] J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller, “Universum emotion recognition with multi-task deep recurrent neural network.” in
autoencoder-based domain adaptation for speech emotion recognition,” INTERSPEECH, 2017, pp. 1108–1112.
IEEE Signal Processing Letters, vol. 24, no. 4, pp. 500–504, 2017. [137] N. Morgan, “Deep and wide: Multiple layers in automatic speech recog-
[118] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion nition,” IEEE Transactions on Audio, Speech, and Language Processing,
recognition using recurrent neural networks with local attention,” in vol. 20, no. 1, pp. 7–13, 2012.
Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE Inter- [138] S. Steidl, M. Levit, A. Batliner, E. Noth, and H. Niemann, “"of all
national Conference on. IEEE, 2017, pp. 2227–2231. things the measure is man" automatic classification of emotions and inter-
[119] Y. Zhang, Y. Liu, F. Weninger, and B. Schuller, “Multi-task deep neural labeler consistency [speech-based emotion recognition],” in Acoustics,
network with shared hidden layers: Breaking down the wall between Speech, and Signal Processing, 2005. Proceedings.(ICASSP’05). IEEE
emotion representations,” in Acoustics, Speech and Signal Processing International Conference on, vol. 1. IEEE, 2005, pp. I–317.
(ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. [139] B. Schuller, J. Stadermann, and G. Rigoll, “Affect-robust speech recog-
4990–4994. nition by dynamic emotional adaptation,” in Proc. Speech Prosody 2006,
[120] Y. Zhao, X. Jin, and X. Hu, “Recurrent convolutional neural network Dresden, 2006.
for speech processing,” in Acoustics, Speech and Signal Processing [140] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang, “Semisu-
(ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. pervised learning of classifiers: Theory, algorithms, and their application
5300–5304. to human-computer interaction,” IEEE Transactions on Pattern Analysis
[121] Z.-Q. Wang and I. Tashev, “Learning utterance-level representations for and Machine Intelligence, vol. 26, no. 12, pp. 1553–1566, 2004.
speech emotion and age/gender recognition using deep neural networks,” [141] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michal-
in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE Inter- ski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-
national Conference on. IEEE, 2017, pp. 5150–5154. Lewandowski et al., “Emonets: Multimodal deep learning approaches for
[122] J. Han, Z. Zhang, F. Ringeval, and B. Schuller, “Reconstruction-error- emotion recognition in video,” Journal on Multimodal User Interfaces,
based learning for continuous emotion recognition in speech,” in Acous- vol. 10, no. 2, pp. 99–111, 2016.
tics, Speech and Signal Processing (ICASSP), 2017 IEEE International [142] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal,
Conference on. IEEE, 2017, pp. 2367–2371. “Recurrent neural networks for emotion recognition in video,” in Pro-
[123] A. M. Badshah, J. Ahmad, N. Rahim, and S. W. Baik, “Speech emotion ceedings of the 2015 ACM on International Conference on Multimodal
recognition from spectrograms with deep convolutional neural network,” Interaction. ACM, 2015, pp. 467–474.
in Platform Technology and Service (PlatCon), 2017 International Con- [143] H. Ranganathan, S. Chakraborty, and S. Panchanathan, “Multimodal
ference on. IEEE, 2017, pp. 1–5. emotion recognition using deep learning architectures,” in 2016 IEEE
[124] Q. Mao, W. Xue, Q. Rao, F. Zhang, and Y. Zhan, “Domain adaptation for Winter Conference on Applications of Computer Vision (WACV). IEEE,
speech emotion recognition by sharing priors between related source and 2016, pp. 1–9.

RUHUL AMIN KHALIL received his bachelor's and master's degrees in Electrical Engineering from the Department of Electrical Engineering, University of Engineering & Technology Peshawar, Pakistan, in 2013 and 2015, respectively. He is currently enrolled in the Ph.D. program in Electrical Engineering at the Department of Electrical Engineering, University of Engineering & Technology Peshawar, Pakistan, where he is also serving as Lecturer.
His research interests include audio signal processing and its applications, pattern recognition, machine learning, and wireless communication.

EDWARD JONES holds the BE degree in Electronic Engineering (First Class Honors) and the PhD degree in Electronic Engineering, both from National University of Ireland Galway. His PhD research topic was the development of computational auditory models for speech processing. He is currently a Professor in Electrical & Electronic Engineering at NUI Galway. From June 2010 to December 2016, he served as Vice-Dean of the College of Engineering & Informatics, with responsibility for Performance, Planning and Strategy. He also has a number of years of industrial experience in senior positions, in both indigenous start-up and multinational companies.
His current research interests are in DSP algorithm development and embedded implementation for applications in connected and autonomous vehicles, biomedical engineering, speech and audio processing, and environmental/agriculture applications.

MOHAMMAD INAYATULLAH BABAR received his Bachelor of Science degree in Electrical Engineering from the University of Engineering and Technology (UET), Peshawar, Pakistan in 1997. He received his Masters and Doctorate degrees in 2001 and 2005, respectively, from the School of Engineering and Applied Sciences, George Washington University, Washington DC, USA.
He has authored and co-authored more than 50 publications in reputable engineering conferences and journals. He is a member of IEEE USA and ACM USA. He also taught a number of Telecommunications Engineering courses at graduate level in the School of Engineering, Stratford University, Virginia, USA as adjunct faculty. Currently, he is working as Professor in the Department of Electrical Engineering, supervising postgraduate scholars in the field of wireless communication networks.

TARIQULLAH JAN received his PhD in the field of Electronic Engineering from the University of Surrey, United Kingdom in 2012. He received his Bachelor in Electrical Engineering from the University of Engineering & Technology Peshawar, Pakistan in 2002. He is currently serving as Associate Professor at the Department of Electrical Engineering, Faculty of Electrical and Computer Systems Engineering, University of Engineering & Technology Peshawar, Pakistan.
His research interests include blind signal processing, machine learning, blind reverberation time estimation, speech enhancement, multimodal approaches for blind source separation, compressed sensing, and non-negative matrix/tensor factorization for blind source separation.

MOHAMMAD HASEEB ZAFAR is a Professor in the Faculty of Computing and IT at King Abdulaziz University, Saudi Arabia. He is also a Visiting Researcher at the Centre for Intelligent Dynamic Communications (CIDCOM) in the Department of Electronic and Electrical Engineering (EEE), University of Strathclyde, Glasgow, UK. He earned his PhD degree in Electronic and Electrical Engineering (EEE) from the University of Strathclyde in 2009.
His main research interests lie in the performance analysis of diverse computer and wireless communication networks and systems. He is particularly interested in the design, deployment and analysis of Wireless Sensor Networks (WSNs), Mobile Ad-Hoc Networks (MANETs), Wireless Mesh Networks, Wireless Personal Area Networks (WPANs), Internet of Things (IoT), routing, network traffic estimation, Software Defined Networks, Machine-to-Machine communications, femtocells, and Intelligent Transportation Systems. He is a Senior Member of IEEE.

THAMER ALHUSSAIN is an Associate Professor in the Department of E-Commerce at Saudi Electronic University (SEU), Kingdom of Saudi Arabia. He obtained his Master and PhD degrees in Information and Communication Technology from Griffith University, Australia. He is currently serving as Vice President for Academic Affairs, Saudi Electronic University (SEU), Kingdom of Saudi Arabia.
His current research interests include the success of information systems, measuring IS effectiveness, mobile services, and e-learning.
