A Project Report
on
Speech Emotion Recognition using MLP Classifier
Submitted in partial fulfilment of the requirements for the award of the Degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By
M. HARIKA (18FE1A0591)    P. SOWMIKA (18FE1A05A2)
N. PAVAN KUMAR (18FE1A0592)    M. SUSANNA (18FE1A0588)
Mrs. V. SWAPNA, Assistant Professor
CERTIFICATE
External Examiner
DECLARATION
We hereby declare that the project report entitled “Speech Emotion Detection using MLP Classifier” is a record of original work done by us under the guidance of Mrs. V. SWAPNA, Assistant Professor of Computer Science and Engineering, and that this project report is submitted in partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering. The results embodied in this project report have not been submitted to any other University or Institute for the award of any Degree or Diploma.
Date:
M. HARIKA (18FE1A0591)
P. SOWMIKA (18FE1A05A2)
N. PAVAN KUMAR (18FE1A0592)
M. SUSANNA (18FE1A0588)
ACKNOWLEDGMENT
The satisfaction that accompanies the successful completion of any task would be incomplete without mentioning the people whose ceaseless cooperation made it possible, and whose constant guidance and encouragement crown all efforts with success.
We also express our thanks to Dr. K. PHANEENDRA KUMAR, Principal, Vignan’s Lara
Institute of Technology & Science for providing the resources to carry out the project.
We also express our sincere thanks to our beloved Chairman Dr. LAVU RATHAIAH for providing support and a stimulating environment for developing the project.
We also place on record our gratitude to all the other teaching staff and lab technicians for their constant support and advice throughout the project.
Project Members
M. HARIKA (18FE1A0591)
P. SOWMIKA (18FE1A05A2)
N. PAVAN KUMAR (18FE1A0592)
M. SUSANNA (18FE1A0588)
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT
Clarity and intelligibility in a speech signal demand the removal of noise and interference associated with the signal at the source. This poses a further challenge when the speech signal is coloured with human emotions. In this work, the authors have taken a novel step to enhance the emotional speech signal adaptively before classification. Popular adaptive algorithms such as Least Mean Square (LMS), Normalized Least Mean Squares (NLMS) and Recursive Least Squares (RLS) have been put to the test to obtain enhanced speech emotions. A neural-network-based Multilayer Perceptron (MLP) classifier is used to recognize the fear speech emotion against neutral voices using effective Linear Prediction Coefficients (LPCs). The accuracy improves to approximately 77% with the enhanced signal, and the largest increase is observed with the RLS algorithm compared with the corresponding noisy signal. In this work, we also analyse basic human emotions such as calm, happy, fearful and disgust from emotional speech signals. We use a Multilayer Perceptron classifier to categorize the given data into the respective groups. Mel-Frequency Cepstrum Coefficients (MFCC), chroma and Mel features are extracted from the speech signals and are used to train the MLP classifier. To achieve this objective, we use Python libraries such as librosa, sklearn, PyAudio, NumPy and soundfile to analyse the speech modulations and recognize the emotion.
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
LMS Least Mean Square
SS Spectral Subtraction
NLMS Normalized Least Mean Squares
TVLMS Time-Varying Least Mean Square
RLS Recursive Least Squares
BSS Blind Source Separation
SNR Signal-to-Noise Ratio
MSE Mean Square Error
ANN Artificial Neural Network
SER Speech Emotion Recognition
HCI Human-Computer Interaction
MLP Multilayer Perceptron
SVM Support Vector Machine
LPCC Linear Prediction Cepstrum Coefficient
MFCC Mel-Frequency Cepstrum Coefficient
KNN K-Nearest Neighbors
HMM Hidden Markov Model
GMM Gaussian Mixture Model
CHAPTER 1
INTRODUCTION
Rapid development and applications in the area of speech signal processing suffer from the inherent noise associated with the source of voice pick-up. In spite of the advanced acoustic chambers and adequate recording environments used for recording speech signals, noise and interference still make their presence felt. Hence, post-processing at the receiver end of the channel uses some form of adaptive filtering algorithm to enhance the speech signal affected by noise. As complete elimination of noise from a signal is not possible, speech researchers are challenged to devise new methodologies in the area of speech enhancement for better audibility. A few speech enhancement algorithms, such as spectral subtraction (SS), Least Mean Square (LMS), Normalized Least Mean Squares (NLMS), Time-Varying Least Mean Square (TVLMS), Recursive Least Squares (RLS) and Fast Transversal RLS, have been quite effective in providing enhanced speech to some extent [1]. The single-channel SS method of speech enhancement is attractive due to its simplicity. However, the method is unable to accommodate non-stationary noise, as its accuracy depends on the accuracy of the voice activity detector. Further, the technique tends to produce a type of randomly fluctuating noise with a narrow-band spectrum, a musical noise with a tone-like characteristic [2]. The LMS algorithm approaches noise cancellation adaptively, using the stochastic gradient descent method. The gradient search is based on the instantaneous squared error at the current time [3].
It is a simple and robust algorithm, but it is sensitive to the scaling of its input, which makes the selection of the learning-rate parameter difficult and hampers stability. The drawbacks of LMS can be eliminated to some extent by normalizing with the input power in the NLMS algorithm [4].
Contrarily, the RLS algorithm provides optimum performance in a dynamic environment. It recursively minimizes the weighted linear least squares cost function of the input signals to achieve the desired adaptability with faster convergence. Hence, this method is attractive for speech enhancement applications. However, the algorithm is computationally complex and can suffer from stability problems [5]. Bendoumia et al. proposed a new algorithm based on forward-and-backward (FB) structures and blind source separation (BSS) for noise reduction and speech enhancement [6]. The FB structures used the least-mean-square (LMS) algorithm in combination with two BSS structures. Two new two-channel variable-step-size FB algorithms (2C-VSSF and 2C-VSSB) were therefore developed to improve the previous LMS-based algorithm in the transient and steady-state phases. These two proposed FB algorithms were based on recursive formulas, which provide efficient estimation of the optimal step sizes of the cross-coupling filters. In [7], a new speech enhancement method was proposed that combines statistical models and non-negative matrix factorization (NMF) with the Kullback–Leibler divergence.
In time-varying noise environments, both the speech and noise bases of NMF were considered, with the help of the estimated speech presence probability, to obtain better-enhanced speech. In this work, three algorithms, LMS, NLMS and RLS, are compared for their ability to enhance emotional fear and neutral speech signals. Speech is always associated with some form of emotion. A few basic emotions such as anger, happiness, fear, sadness and surprise are easily identified during conversation [8]. However, they sometimes overlap in an utterance, making them difficult to recognize individually [9]. Speech plays a pivotal role, as it is the only effective channel of communication over the phone. Speech coloured with emotion will be associated with additional noise due to variability in expression and in the environment in which it is exhibited. Factors such as hiss, restlessness and other physiological components during an emotional speech encounter corrupt the speech. Hence, in this work, we have taken a step to remove unwanted noise from fear speech emotions recorded in a real-life situation along with neutral voices.
The utterances are initially enhanced against noise and interference using the chosen adaptive algorithms. Subsequently, the enhanced emotional speech is fed as input to an MLP classifier for recognition [10-11]. A comparison between the noisy emotional speech and the enhanced emotional speech has been made based on the signal-to-noise ratio (SNR) and mean square error (MSE), both at the filter end and at the classification stage [12]. A speech emotion processing and recognition system is generally composed of three parts: speech signal acquisition, feature extraction, and emotion recognition. Artificial Neural Networks are biologically inspired tools for information processing. Speech recognition modelling by artificial neural networks does not require any prior knowledge of the speech process, and this technique quickly became an attractive alternative to Hidden Markov Models. Conventional neural networks of the Multilayer Perceptron type have been increasingly used for speech recognition and for various other speech processing applications. Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of characters; these characters can also serve as input to further linguistic processing to achieve speech understanding. It is widely applied in human-computer interaction, interactive teaching, security and other fields.
Speech Emotion Recognition, abbreviated as SER, is the act of attempting to recognize human emotion and the associated affective states from speech [13]. It capitalizes on the fact that the voice often reflects the underlying emotion through tone and pitch. Emotion recognition has been a rapidly growing research domain in recent years. Unlike humans, machines lack the ability to perceive and show emotions, but human-computer interaction can be improved by implementing automated emotion recognition, thereby reducing the need for human intervention. In naturalistic human-computer interaction (HCI), speech emotion recognition (SER) is becoming increasingly important
in various applications. At present, speech emotion recognition is an emerging field at the crossing of artificial intelligence and artificial psychology, and a popular research topic in signal processing and pattern recognition [14]. The research is widely applied in human-computer interaction, interactive teaching, entertainment, security and other fields. A speech emotion processing and recognition system is generally composed of three parts: speech signal acquisition, feature extraction, and emotion recognition. The most promising technique for speech recognition is the neural-network-based approach. Artificial Neural Networks (ANN) are biologically inspired tools for information processing. Speech recognition modelling by artificial neural networks does not require any prior knowledge of the speech process, and the technique quickly became an attractive alternative to HMMs [15]. RNNs can learn the temporal relationships in speech data and are capable of modelling time-dependent phonemes. Conventional neural networks of the Multilayer Perceptron (MLP) type have been increasingly used for speech recognition and for various other speech processing applications. Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of characters, which can also serve as input to further linguistic processing to achieve speech understanding. As we know, speech recognition performs tasks
similar to those of the human brain [16]. Traditional emotional feature extraction was based on the analysis and comparison of various emotion characteristic parameters, selecting the characteristics with high emotional resolution for feature extraction. The traditional approach concentrates on the analysis of features of the speech such as its time structure, amplitude structure and frequency structure. The time structure refers to differences in pronunciation timing between emotions: different emotions have different pronunciation time periods, which can be recognized and analyzed by closely examining a few datasets. Such variations can also be found in the frequency and amplitude parameters of the respective audio signals. Although this method embodies the basic concept of categorizing emotions from speech, it also has many drawbacks: the time taken is high, the judging criteria may vary, and complex programming is required. Many models have also been proposed to improve the prediction accuracy of SER systems [17]. For example, the Support Vector Machine (SVM) is a classifier that mathematically computes the parameters of the audio signal in order to predict the emotion. This model has been very successful in the domain of SER, but its main disadvantage is that, in its basic form, it can only classify the data into two classes, either class 1 or class 2. Other disadvantages include processing time, noise leading to errors in prediction, and low accuracy [18].
CHAPTER 2
LITERATURE SURVEY
Hadhami Aouani and Yassine Ben Ayed [19], 2020, proposed an approach for automatically detecting emotions in speech that explores some characteristics of how speech signals are detected, calculating meta-information of the signals to label their emotion. Features such as 39 Mel Frequency Cepstral Coefficients, the Zero Crossing Rate and the Harmonic-to-Noise Ratio are used. In this paper, they used the SVM algorithm for training the machine. Yogesh Kumar and Manish Mahajan, 2019, proposed an approach to automatically detect emotions in the speech of a speaker that explores some characteristics of how speech signals are detected, together with meta-information of the signals. In this paper, they used the KNN algorithm for detecting emotions in speech signals, with features such as MFCC and LPCC for data preprocessing and feature extraction. Speech Emotion Recognition is one of the booming research topics in the computer science world. Emotion is a medium by which a person expresses how they feel and their state of mind. Predicting emotions is a tough task, as every individual has a different tone and intonation of speech; emotions are therefore difficult to extract using current machine learning systems. Consequently, many researchers have used Deep Learning and Machine Learning techniques to extract the emotions of speech signals.
There are many means of communication, but the speech signal is one of the fastest and most natural methods of communication between humans. Therefore, speech can also be a fast and efficient method of interaction between human and machine [10]. Humans have the natural ability to use all their available senses for maximum awareness of the received message, and through these senses people actually sense the emotional state of their communication partner. Emotion detection is natural for humans, but it is a very difficult task for a machine. The purpose of an emotion recognition system is therefore to use emotion-related knowledge in such a way that human-machine communication is improved [21]. In speech emotion recognition, the emotions are identified from the speech of male or female speakers [22]. In the past century, speech features such as the fundamental frequency, the Mel frequency cepstrum coefficient (MFCC) and the linear prediction cepstrum coefficient (LPCC) were studied, and they form the basis of speech processing even today. In one study, the spectrograms of real and acted emotional speech were compared and similar recognition rates were found for both, suggesting that the latter can be used for a speech emotion recognition system. In another study, a correlation between emotion and speech features was presented; human and machine emotion recognition rates were then compared, and similar recognition rates were found for both.
After this study, a speech emotion recognition system using the Hidden Markov Model was presented and achieved an accuracy of 70% for seven emotional states.
In another study, a Support Vector Machine for speech emotion recognition of four different emotions achieved an accuracy of 73% [23]. Emotion recognition from a speaker's speech is very difficult for the following reasons. It is not clear which particular speech features are most useful in differentiating between the various emotions. The existence of different sentences, speakers, speaking styles and speaking rates introduces acoustic variability, which directly affects the speech features. The same utterance may also express different emotions, with each emotion corresponding to a different portion of the spoken utterance.
It is therefore very difficult to differentiate these portions of the utterance. Another problem is that the expression of emotion depends on the speaker and his or her culture and environment. As the culture and environment change, the speaking style also changes, which is a further challenge for a speech emotion recognition system. There may also be two or more types of emotion, long-term and transient, and it is not clear which type the recognizer will detect [24]. Emotion recognition from speech may be speaker dependent or speaker independent. The different classifiers available include k-nearest neighbors (KNN), the Hidden Markov Model (HMM), the Support Vector Machine (SVM), Artificial Neural Networks (ANN) and the Gaussian Mixture Model (GMM); this survey reviews these classifiers [25]. Applications of the speech emotion recognition system include psychiatric diagnosis, intelligent toys, lie detection, call-centre conversations (the most important application of automated recognition of emotions from speech) and in-car board systems, where information about the mental state of the driver may be provided to the system to initiate safety measures [26].
Speech emotion recognition is one of the latest challenges in speech processing. Besides human facial expressions, speech has proven to be one of the most promising modalities for the automatic recognition of human emotions. Especially in the field of security systems, a growing interest can be observed in recent years. Besides this, the detection of lies, video games and psychiatric aid are often cited as further scenarios for emotion recognition [27]. Addressing classification from a practical point of view, it has to be considered that a technical approach can only rely on pragmatic decisions about the kind, extent and number of emotions suiting the situation. It seems reasonable to adapt and limit the number and kind of recognizable emotions to the requirements of the application in order to ensure robust classification.
Yet no standard exists for the classification of emotions in technical recognition. An often favoured way is to distinguish between a defined set of discrete emotions. However, as mentioned, no common opinion exists about their number and naming. A recent approach can be found in the MPEG-4 standard, which names the six emotions anger, disgust, fear, joy, sadness and surprise. The addition of a neutral state seems reasonable to capture the absence of any of these emotions. This classification is used as a basis for comparison throughout this work, also in anticipation of further comparisons. Most approaches in present-day speech emotion recognition use global statistics of a phrase as their basis [28].
However, first efforts in the recognition of instantaneous features also exist [29][30]. We present two working engines using both of the alternatives alluded to, by use of continuous hidden Markov models, which have evolved into a widespread standard technique in speech processing. There are many motivations for identifying the emotional state of speakers. In human-machine interaction, the machine can be made to produce more appropriate responses if the state of emotion of the person can be accurately identified. Most state-of-the-art automatic speech recognition systems resort to natural language understanding to improve the accuracy of recognition of the spoken words. Such language understanding can be further improved if the emotional state of the speaker can be extracted, and this in turn will enhance the accuracy of the system. In general, translation is required to carry out communication between different languages. Current automatic translation algorithms focus mainly on the semantic part of the speech. It would provide the communicating parties with additional useful information if the emotional state of the speaker could also be identified and presented, especially in non-face-to-face situations.
Other applications of automatic emotion recognition systems include tutoring, alerting, and entertainment (Cowie et al., 2001). Before delving into the details of automatic emotion recognition, it is appropriate to have some understanding of the psychological, biological, and linguistic aspects of emotion. The activation–evaluation space (Cowie et al., 2001) provides a simple approach to understanding and classifying emotions. In a nutshell, it considers the stimulus that excites the emotion, the cognitive ability of the agent to appraise the nature of the stimulus, and subsequently his or her mental and physical responses to the stimulus. The mental response is in the form of an emotional state. The physical response is in the form of fight or flight, or as described by Fox (1992), approach or withdrawal. From a biological perspective, Darwin (1965) looked at the emotional and physical responses as distinctive action patterns selected by evolution because of their survival value. Thus, emotional arousal will have an effect on the heart rate, skin resistivity, temperature, pupillary diameter, and muscle activity, as the agent prepares for fight or flight. As a result, the emotional state is also manifested in spoken words and facial expressions (Ekman and Friesen, 1975). Emotions also have a temporal structure (Oatley and Jenkins, 1996): for example, people with emotional disorders such as manic depression or pathological anxiety may be in those emotional states for months and years, one may be in a bad mood for weeks or months, or emotions such as anger and joy may be transient in nature and last no longer than a few minutes. Thus, emotion has a broad-sense and a narrow-sense effect. The broad sense reflects the underlying long-term emotion and the narrow sense refers to the short-term excitation of the mind that prompts people to action.
In automatic recognition of emotion, a machine does not distinguish whether the emotional state is due to a long-term or a short-term effect, so long as it is reflected in the speech or facial expression. The output of an automatic emotion recognizer will naturally consist of labels of emotion, and the choice of a suitable set of labels is important. Linguists have a large vocabulary of terms for describing emotional states; Schubiger (1958) and O'Connor and Arnold (1973) used 300 labels between them in their studies. The palette theory (Cowie et al., 2001) suggests that basic categories be identified to serve as primaries, and that mixing may be done in order to produce other emotions, similar to the mixing of primary colours to produce all other colours. The primary emotions that are often used include Joy, Sadness, Fear, Anger, Surprise and Disgust. They are often referred to as archetypal emotions. Although these archetypal emotions cover a rather small part of emotional life, they nevertheless represent the popularly known emotions and are recommended for testing the capabilities of an automatic recognizer. Cognitive theory would argue against equating emotion recognition with assigning category labels; instead, it would want to recognize the way a person perceives the world or key aspects of it. Perhaps it is true to say that category labels are not a sufficient representation of emotional state, but they are a better way to indicate the output from an automatic emotion recognition system.
It is to be noted that the emotional state of a speaker can be identified from facial expressions (Ekman, 1973; Davis and College, 1975; Scherer and Ekman, 1984), speech (McGilloway et al., 2000; Dellaert et al., 1996; Nicholson et al., 1999), perhaps brainwaves, and other biological features of the speaker. Ultimately, a combination of these features may be the way to achieve high recognition accuracy. In this work, the focus is on emotional speech recognition.
Emotion recognition from speech has evolved from a niche topic into an important component of Human-Computer Interaction (HCI) [31]–[32]. These systems aim to facilitate natural interaction with machines by direct voice interaction, instead of traditional input devices, to understand verbal content and make it easy for human listeners to react [33]–[34]. Some applications include dialogue systems for spoken languages such as call-center conversations, on-board vehicle driving systems, and the utilization of emotion patterns from speech in medical applications [35]. Nonetheless, there are many problems in HCI systems that still need to be properly addressed, particularly as these systems move from lab testing to real-world application
[36]–[37]. Hence, efforts are required to effectively solve such problems and achieve better emotion recognition by machines. Determining the emotional state of humans is an idiosyncratic task and may be used as a standard for any emotion recognition model [38]. Amongst the numerous models used for categorization of these emotions, the discrete emotional approach is considered one of the fundamental approaches. It uses various emotions such as anger, boredom, disgust, surprise, fear, joy, happiness, neutral and sadness [39], [40].
Another important model is a three-dimensional continuous space with parameters such as arousal, valence, and potency. The approach to speech emotion recognition (SER) primarily comprises two phases, known as feature extraction and feature classification [41]. In the field of speech processing, researchers have derived several features such as source-based excitation features, prosodic features, vocal tract factors, and other hybrid features [42]. The second phase involves feature classification using linear and non-linear classifiers [43]. The most commonly used linear classifiers for emotion recognition include Bayesian Networks (BN), the Maximum Likelihood Principle (MLP) and the Support Vector Machine (SVM). Usually, the speech signal is considered to be non-stationary; hence, non-linear classifiers are considered to work effectively for SER [44]. Many non-linear classifiers are available for SER, including the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM) [45].
These are widely used for the classification of information derived from basic-level features. Energy-based features such as Linear Predictor Coefficients (LPC), Mel Energy-spectrum Dynamic Coefficients (MEDC), Mel-Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Prediction cepstrum coefficients (PLP) are often used for effective emotion recognition from speech. Other classifiers, including K-Nearest Neighbor (KNN), Principal Component Analysis (PCA) and decision trees, are also applied for emotion recognition [46]. Deep Learning has been considered an emerging research field in machine learning and has gained more attention in recent years [47]. Deep Learning techniques for SER have several advantages over traditional methods, including their capability to detect complex structure and features without the need for manual feature extraction and tuning, their tendency toward extraction of low-level features from the given raw data, and their ability to deal with unlabelled data [48].
Deep Neural Networks (DNNs) are based on feed-forward structures comprising one or more hidden layers between inputs and outputs. Feed-forward architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) provide efficient results for image and video processing. On the other hand, recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are more effective in speech-based classification tasks such as natural language processing (NLP) and SER [49].
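As a hedged sketch of the recurrent approach mentioned above (not a model used or evaluated in this report), the Keras snippet below declares a small LSTM classifier over a sequence of MFCC frames; the sequence length of 300 frames, 40 coefficients per frame and 8 emotion classes are assumptions chosen for illustration.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(300, 40)),           # assumed: 300 frames of 40 MFCCs per utterance
    layers.LSTM(64),                         # summarises the temporal sequence
    layers.Dense(8, activation="softmax"),   # assumed: one output unit per emotion class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])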
Apart from their effective classification, these models do have some limitations. For instance, the positive aspect of CNNs is that they learn features from high-dimensional input data, but they also learn features from small variations and distortions and hence require large storage capability. Similarly, LSTM-based RNNs are able to handle variable-length input data and model long-range sequential data. Emotion recognition systems based on digitized speech are comprised of three fundamental components: signal preprocessing, feature extraction, and classification [50].
Acoustic preprocessing such as denoising, as well as segmentation, is carried out to determine
meaningful units of the signal [51]. Nomenclature referred to: ABC, Airplane Behavior Corpus; AE, Auto Encoders; ANN, Artificial Neural Network; AVB, Adversarial Variational Bayes; AVEC, Audio/Visual Emotion Challenge; BN, Bayesian Networks; CAM3D, Cohn-Kanade dataset; CAS, Chinese Academy of Science database; CNN, Convolutional Neural Network; ComParE, Computational Paralinguistics challenge; DBM, Deep Boltzmann Machine; DBN, Deep Belief Network; DCNN, Deep Convolutional Neural Network; DES, Danish Emotional Speech database; DNN, Deep Neural Network; eGeMAPS, extended Geneva Minimalistic Acoustic Parameter Set; ELM, Extreme Learning Machine; Emo-DB, Berlin Emotional database; FAU-AEC, FAU Aibo Emotion Corpus; GMM, Gaussian Mixture Model; HCI, Human-Computer Interaction; HMM, Hidden Markov Model; HRI, Human-Robot Interaction; IEMOCAP, Interactive Emotional Dyadic Motion Capture database; KNN, K-Nearest Neighbor; LIF, Localized Invariant Features; LPC, Linear Predictor Coefficients; LSTM, Long Short-Term Memory; MEDC, Mel Energy-spectrum Dynamic Coefficient; MFCC, Mel-Frequency Cepstrum Coefficient; MLP, Maximum Likelihood Principle; MT-SHL-DNN, Multi-Tasking Shared Hidden Layers Deep Neural Network; PCA, Principal Component Analysis; PLP, Perceptual Linear Prediction cepstrum coefficient; RBM, Restricted Boltzmann Machine; RE, Reconstruction-Error-based; RvNN, Recursive Neural Network; RECOLA, Remote Collaborative and Affective Interactions database; RNN/RCNN, Recurrent Neural Network; SAE, Stacked Auto Encoder; SPAE, Sparse Auto Encoders; SAVEE, Surrey Audio-Visual Expressed Emotion; SDFA, Salient Discriminative Feature Analysis; SER, Speech Emotion Recognition; SVM, Support Vector Machine; VAE, Variational Auto Encoder. Feature extraction is utilized to identify the relevant features available in the signal.
Lastly, the mapping of extracted feature vectors to the relevant emotions is carried out by classifiers. In this section, a detailed discussion of speech signal processing, feature extraction, and classification is provided [52], [53]. The differences between spontaneous and acted speech are also discussed due to their relevance to the topic [54]. Figure 1 depicts a simplified system for speech-based emotion recognition. In the first stage of speech-based signal processing, speech enhancement is carried out and the noisy components are removed.
The second stage involves two parts: feature extraction and feature selection. The required features are extracted from the preprocessed speech signal, and a selection is made from the extracted features. Such feature extraction and selection is usually based on the analysis of speech signals in the time and frequency domains. During the third stage, various classifiers such as GMM [55] and HMM [56] are utilized for the classification of these features. Lastly, based on the feature classification, the different emotions are recognized.
CHAPTER 3
PROPOSED METHODOLOGY
Speech Emotion Recognition is one of the booming research topics in the computer science world. Emotion is a medium by which a person expresses how they feel and their state of mind. Emotions are difficult to extract using current machine learning systems; therefore, many researchers have used neural network and machine learning techniques to extract the emotions of speech signals. Our proposed system consists of an MLP classifier, as shown in Fig. 1.1. In the Speech Emotion Recognition (SER) system, the audio files are given as the input. The dataset travels through a number of processing blocks that prepare it for the analysis of the speech parameters. The data is pre-processed to convert it into a suitable format and to extract the respective features. This process breaks down the audio files into numerical values representing the frequency, time, amplitude or other such parameters that can help in the analysis of the audio files. After the extraction of the required features from the audio files, the model is trained. We have used the RAVDESS dataset of audio files, which contains speech from 24 actors with variations in these parameters [57]. For training, we store the numerical labels of the emotions and their respective features in corresponding arrays. These arrays are given as input to the MLP classifier that has been initialized [58]. The classifier identifies different categories in the dataset and classifies them into different emotions. The model is then able to learn the ranges of values of the speech parameters that correspond to specific emotions. To test the performance of the model, an unseen test dataset is given as input; the model extracts the parameters and predicts the emotion according to the values learned from the training dataset. The accuracy of the system is displayed as a percentage, which is the final result of our project.
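The following is a hedged sketch of the training and testing stage described above, not the exact code of this project. It assumes X is an array of feature vectors (for example, built with an extract_features routine like the one sketched in the previous chapter) and y the matching emotion labels; the MLP hyper-parameters shown are illustrative choices.

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out part of the data so the model is evaluated on utterances it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=9)

model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01, batch_size=256,
                      learning_rate="adaptive", max_iter=500)
model.fit(X_train, y_train)                 # learn emotion categories from the features
y_pred = model.predict(X_test)              # predict emotions for the unseen utterances
print("Accuracy: {:.2f}%".format(100 * accuracy_score(y_test, y_pred)))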
Fig. 1.1 Proposed System Architecture
3.2 Technologies used
PYTHON
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It has a wide range of applications, from web development (Django, Bottle) and scientific and mathematical computing (Orange, SciPy, NumPy) to desktop graphical user interfaces (Pygame, Panda3D). Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. As an interpreted language, Python has a design philosophy that emphasizes code readability (notably using whitespace indentation to delimit code blocks rather than curly brackets or keywords), and a syntax that allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java.
It provides constructs that enable clear programming on both small and large scales.
Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based development
model, as do nearly all of its variant implementations. CPython is managed by the non-profit
Python Software Foundation. Python features a dynamic type system and automatic memory
management. It supports multiple programming paradigms, including object-oriented, imperative,
functional and procedural, and has a large and comprehensive standard library. Python is free and
quick to learn. It is high-level, dynamically typed and interpreted, it simplifies debugging, and it enables the rapid development of application designs, making it a convenient language in which to code. In 1989 Guido van Rossum created Python, stressing code readability and the DRY (Don't Repeat Yourself) principle. Python supports cross-platform operating systems, making it even easier to create applications. Many well-known applications such as YouTube, BitTorrent and Dropbox use Python.
Python is a well-known programming language, valued for its simple object-oriented design. Some of Python's other remarkable features are its library functions and modules, which make interactive development easier for developers. It offers dynamic type checking, easy access to database frameworks and simple user-interface programming, and it is available free and open source. It allows for expandability and scalability, and the code is easy to read, understand and write. Python is one of the popular open-source frameworks for object-oriented programming. Application creation, automated testing, fully developed programming libraries, data-storage framework usability, quick and readable code, suitability for complex software development processes, and support for test-driven application development frameworks are some of Python's strengths. Python is also used in technologies including robotics, web
crawling, scripting, artificial intelligence, data processing, machine learning, facial recognition, colour detection, 3D CAD applications, console applications, audio-based applications, video-based applications, business applications and image applications. With its support for a test-driven development process, Python enables coding and research to proceed together: test cases can be written before any code is created, and once application development has started, the written test cases concurrently test the application and report the results. They may also be used to verify or check requirements against the source code.
Python is freely accessible and open source, which also reduces the cost of software development. Python frameworks, libraries and programming tools are available as open-source software, so applications can be built without cost. Python web frameworks such as Django, Flask and Pyramid simplify and speed up the web application development process, and Python GUI frameworks are available for creating GUI-based applications.
Python is used as a general-purpose programming language to simplify the complex software development process. It is used to build complex applications such as scientific, numerical, desktop and web applications. Python has features for data analysis and visualization, which allow you to create custom solutions without requiring additional time and effort, and it lets you easily visualize and explore data.
Python code is simple to read and maintain, and it is easily reusable where required. Python has a simple syntax that allows different concepts to be developed without writing additional code. The code should be of high quality, easy to manage and simple to maintain for the software application. Python also emphasizes the readability of code, which is a great advantage in comparison with other languages. It helps create custom applications, and clean code helps to manage and upgrade applications without extra effort.
Python also makes it easy to work with databases. Interfaces for various databases, such as MySQL, Oracle, Microsoft SQL Server and PostgreSQL, are available for Python. It also has the Durus and ZODB object databases. These are used through the main database APIs and are freely available.
Compatible with the major platforms and frameworks, Python is primarily used for application development. Python code can run on many operating systems, using Python interpreters available for different platforms and tools. As Python is a high-level language that can operate on various platforms, code can be executed without recompilation: new and updated code can be run and its effect monitored immediately, which means the code does not have to be recompiled after every update. This feature helps to save developers' development time.
Python has a broad and robust standard library for application development, which also allows developers to use Python across many domains. The standard library provides a variety of Python modules, and these modules allow functionality to be added without writing extra code. Documentation on the main Python library can be consulted to learn about the different modules. The standard library documentation facilitates the creation of Internet programs, the implementation of online services, string operations and other applications such as interface protocols.
Python also provides continuous support for other programming paradigms, including object-oriented and structured programming, and it offers automatic memory management and a dynamic type system. You can build both small and large applications with the features of the Python language and its programming paradigms, and it can be used for complex applications. It is suitable for many kinds of applications, including web applications, interactive applications with a visual user interface, software development, scientific and numerical applications, network programming, games and 3D applications. It offers an interactive interface and easy application development.
Often, functions are too specialized to be supported by standard applications; that is where scripting comes in. Python enables developers to write custom automation scripts and reduce repetitive work, automating routine functions such as emailing and voice mail, file and directory organization, system initialization and form filling.
Python is extensively used in network programming. At a low level, there is basic socket support.
The language allows users to implement clients and servers for both connection-oriented and
connectionless protocols. The language supports libraries that provide higher level access to
specific application level network protocols.
Data science is an interdisciplinary domain consisting of three distinct and overlapping areas:
(a) how, like a statistician, to model and summarize data;
(b) how, like a computer scientist, to design algorithms to process and visualize data;
(c) how, like a domain expert, to formulate the right questions and present the answers appropriately.
Python is a trusted language for the study and visualization of broad datasets in scientific computing research. The use of Python in data science stems from its broad and active ecosystem of third-party packages, including NumPy for homogeneous arrays, Pandas for heterogeneous and labelled data manipulation, SciPy for scientific computation and scikit-learn for machine learning, among other applications. A data scientist therefore uses mathematical methods to analyze and interpret complicated data using the Python programming language.
AI and ML models and projects differ considerably from conventional software models and require different resources and technology than those used in traditional software projects. Python application development has also settled well within this fairly new area. Projects in artificial intelligence and machine learning require a language that is both stable and safe as well as versatile. Python is an all-embracing programming language, packed with tools to manage the special requirements of an artificial intelligence and machine learning project: Python packages can be used for data analysis, TensorFlow for machine learning and SciPy for advanced computing.
Computer vision helps computers to recognize objects by way of visual imagery or images. Implementing computer vision in Python allows developers to automate visual tasks; Python dominates this market, although other programming languages also support computer vision. 2D imaging software such as Inkscape, GIMP, Paint Shop Pro and Scribus uses Python. Furthermore, Python is also used in varying amounts in 3D animation applications such as Blender, 3ds Max, Cinema 4D, Houdini, Lightwave, and Maya.
Business-level software, or enterprise applications, differ significantly from standard applications, since features such as readability, extensibility and scalability are of prime importance. Basically, enterprise applications are designed to fulfil the requirements of an organization rather than of individual customers. Such software must also be able to interact with legacy systems such as existing databases and non-web software. When such applications are developed, the whole development process is very complex, taking into account the individual requirements that meet the particular needs of the organizational model of an organization.
In addition to the above applications, Python has a special place in image processing and graphic design. Globally, the language is used to design and build 2D graphics applications such as Inkscape, GIMP, Paint Shop Pro and Scribus. In addition, Python is used in 3D graphics applications including Blender, Houdini, 3ds Max, Maya, Cinema 4D, and Lightwave.
A console application is a computer program designed for text-only interfaces, such as the Unix command line or DOS. Advanced Python libraries support the creation of full-featured command-line applications.
Game development is one of the most interesting applications of Python programming. The two most common frameworks for game creation are PyGame and PyKyra, and different libraries are available for 3D rendering. Game developers can follow PyWeek, a semiannual game programming competition, to find out more about how Python is used to develop games.
Python is based on C, so embedded C programs can be used alongside it for embedded applications. This helps us run applications at higher rates on smaller computers that are able to run Python. The Raspberry Pi, which uses Python for computing, may be the most popular embedded application; it can be used for high-level computations as a computer or as a simple embedded board.
Data is valuable if you are able to extract knowledge from it that is important and can help you take calculated risks and maximize income. You review your data, operate on it and extract the necessary information. Libraries such as NumPy and pandas allow you to retrieve this knowledge, and visualization libraries such as Matplotlib and Seaborn are useful for plotting graphs. Python gives you this chance to become a data scientist.
KERAS
Keras is an open-source neural network library written in Python that runs on top of Theano or TensorFlow. It is designed to be modular, fast and easy to use. It was developed by François Chollet, a Google engineer. Keras does not handle low-level computation itself; instead, it uses another library to do it, called the "backend". Keras is thus a high-level API wrapper for the low-level API, capable of running on top of TensorFlow, CNTK, or Theano. Keras does not handle low-level operations such as building the computational graph or creating tensors and other variables, because these are handled by the backend engine. The Keras high-level API handles the way we define models and layers and set up multiple-input or multiple-output models; at this level, Keras also compiles our model with loss and optimizer functions and runs the training process with the fit function.
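To make the workflow described above concrete, the snippet below is a minimal, illustrative Keras model for classifying one feature vector per utterance; the 180-value input size and the 8 output classes are assumptions, and the compile and fit calls show the high-level API steps mentioned in the text.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(180,)),              # assumed size of an MFCC + chroma + Mel feature vector
    layers.Dense(256, activation="relu"),
    layers.Dense(8, activation="softmax"),   # assumed: one unit per emotion class
])
# The high-level API compiles the model with a loss and an optimizer,
# and model.fit(features, labels, epochs=...) would then run the training process.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])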
TENSORFLOW
TensorFlow is an open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google, and experience with TensorFlow is a standard expectation in the industry for machine learning work. TensorFlow was developed by the Google Brain team for internal Google use and was released under the Apache 2.0 open-source license on November 9, 2015.
OpenCV
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library, built to provide a common infrastructure for computer vision applications. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code.
The library has more than 2500 optimized algorithms, which includes a comprehensive set of
both classic and state-of-the-art computer vision and machine learning algorithms. These
algorithms can be used to detect and recognize faces, identify objects, classify human actions
in videos, track camera movements, track moving objects, extract 3D models of objects,
produce 3D point clouds from stereo cameras, stitch images together to produce a high
resolution image of an entire scene, find similar images from an image database, remove red
eyes from images taken using flash, follow eye movements, recognize scenery and establish
markers to overlay it with augmented reality, etc. The library is used extensively in companies,
research groups and by governmental bodies.
Along with well-established companies like Google, Yahoo, Microsoft, Intel, IBM, Sony,
Honda, Toyota that employ the library, there are many start-ups such as Applied Minds,
VideoSurf, and Zeitera, that make extensive use of OpenCV. OpenCV’s deployed uses span
the range from stitching street view images together, detecting intrusions in surveillance video
in Israel, monitoring mine equipment in China, helping robots navigate and pick up objects at
Willow Garage, detection of swimming pool drowning accidents in Europe, running interactive
art in Spain and New York, checking runways for debris in Turkey, inspecting labels on
products in factories around the world on to rapid face detection in Japan.
It has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and
Mac OS. OpenCV leans mostly towards real-time vision applications and takes advantage of
MMX and SSE instructions when available. Full-featured CUDA and OpenCL interfaces are being actively developed. There are over 500 algorithms and about 10 times as many
functions that compose or support those algorithms. OpenCV is written natively in C++ and
has a templated interface that works seamlessly with STL containers.
CHAPTER 4
SYSTEM DESIGN
4.2 UML DIAGRAMS
4.2.1 CLASS DIAGRAM
A class diagram in the Unified Modelling Language is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations, and the relationships among objects.
4.2.2 USE CASE DIAGRAM
A use case diagram, at its simplest, is a representation of a user's interaction with the system that shows the relationship between the user and the different use cases in which the user is involved.
4.2.3 SEQUENCE DIAGRAM
A sequence diagram shows object interactions arranged in time sequence. It depicts the objects and classes involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario.
4.2.4 STATE CHART DIAGRAM
A state diagram is a type of diagram used in computer science and related fields to describe the behaviour of systems. State diagrams require that the system described is composed of a finite number of states; sometimes this is indeed the case, while at other times it is a reasonable abstraction.
4.2.5 ACTIVITY DIAGRAM
An activity diagram is another important diagram in UML for describing the dynamic aspects of the system. It is basically a flowchart representing the flow from one activity to another, where an activity can be described as an operation of the system.
4.2.6 OBJECT DIAGRAM
Object diagrams are derived from class diagrams so object diagrams are dependent
upon class diagrams.
Object diagrams represent an instance of a class diagram. The basic concepts are similar
for class diagrams and object diagrams. Object diagrams also represent the static view of a
system but this static view is a snapshot of the system at a particular moment. Object diagrams
are used to render a set of objects and their relationships as an instance.
The purposes of object diagrams are similar to class diagrams. The difference is that a
class diagram represents an abstract model consisting of classes and their relationships.
However, an object diagram represents an instance at a particular moment, which is concrete
in nature.
CHAPTER 5
TESTING
Software testing is the process of executing an application with the intent of finding any software bugs. It is used to check whether the application meets its expectations and whether all the functionalities of the application are working. The final goal of testing is to check whether the application behaves the way it is supposed to under the specified conditions. All aspects of the code are examined to check the quality of the application. The primary purpose of testing is to detect software failures so that defects may be uncovered and corrected. The test cases are designed in such a way that the scope for finding bugs is maximized.
TESTING LEVELS
There are various testing levels based on the specificity of the test.
Unit Testing: -Unit testing refers to tests conducted on a section of code in order to verify the
functionality of that piece of code. This is done at the function level.
Integration Testing: -Integration testing is any type of software testing that seeks to verify the
interfaces between components of a software design. Its primary purpose is to expose the
defects associated with the interfacing of modules.
System Testing: -System testing tests a completely integrated system to verify that the system
meets its requirements.
Acceptance Testing: -Acceptance testing tests the readiness of the application, ensuring that it satisfies all requirements.
Performance Testing: -Performance testing is the process of determining the speed or effectiveness of a computer, network, software program or device, in terms of metrics such as response time or millions of instructions per second.
SYSTEM TEST CASES
A test case is a set of test data, preconditions, expected results and postconditions, developed for a test scenario to verify compliance with a specific requirement. We have designed and executed a few test cases to check whether the project meets the functional requirements.
Test with new data, rather than the original training data. If necessary, split your dataset into two groups: one used for training and one used for testing. Better still, obtain and use fresh data if you are able.
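One way to follow this advice, sketched below under the assumption that X and y are the feature and label arrays used for training, is k-fold cross-validation, so that every evaluation fold contains only data the model was not trained on; the choice of five folds is illustrative.

from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

scores = cross_val_score(MLPClassifier(max_iter=500), X, y, cv=5)  # five held-out folds
print("Fold accuracies:", scores)
print("Mean accuracy: {:.2f}%".format(100 * scores.mean()))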
Understand the architecture of the network as a part of the testing process. Testers won't necessarily understand how the neural network was constructed, but they need to understand whether it meets the requirements. Based on the measurements they are testing, they may have to recommend a radically different approach, or accept that the software is simply not capable of doing what it was asked to do with confidence.
CHAPTER 6
RESULTS
The fear-emotion speech is recorded in a closed room. Random noise of variance 0.01 is then added
to the speech signal, and the noisy signal is passed through three different adaptive filters. First,
the LMS algorithm is chosen for filtering; its result depends strongly on the step-size (μ) value in
this case [59]. Various μ values are tested to obtain a better enhanced result. The experiment shows
that increasing μ provides faster convergence and hence better accuracy in this case; however, too
large a μ degrades the performance of the filter. μ values of 0.4 and 0.2 gave the required
enhancement for LMS and NLMS respectively in this experiment. For the RLS algorithm, a forgetting
factor of one and a regularizing factor of 0.1 provided the desired accuracy. The mean-square-error
curves for the different adaptive algorithms are shown in Fig. 6.2, Fig. 6.4 and Fig. 6.6. The MSE
[60] of the LMS algorithm is the highest, followed by that of the NLMS algorithm (Fig. 6.2 and
Fig. 6.4); RLS gives the lowest error between the enhanced speech and the actual speech emotion
(Fig. 6.6). In all of these figures, the variation in MSE decreases as the number of iterations
increases. The lowest MSE is observed between 15 and 20 iterations for RLS and NLMS [61], and LMS
shows similar behaviour after 15 iterations. The enhanced speech signal is better for RLS than for
NLMS (Fig. 6.3 and Fig. 6.5), while the LMS algorithm provides the least enhancement of the
emotional speech signal among all the algorithms discussed (Fig. 6.1). Classification of the clean
speech and the noisy speech with the MLP classifier, using LPC feature extraction [62], is reported
in Table I and Table II. An improvement in accuracy and a decrease in MSE are observed for the fear
utterance using the MLP classifier [63-65]. The highest accuracy and the lowest MSE are obtained
with the RLS algorithm [66] in our case, as shown in these tables.
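The adaptive-filtering step itself is not reproduced in the Appendix, so the following NumPy
sketch is only an illustration of the LMS update described above (noise of variance 0.01, step
size μ = 0.4). The filter order, the synthetic placeholder signal and the way the reference
noise is obtained are assumptions for this example, not the authors' original implementation.

import numpy as np

def lms_enhance(noisy, reference, order=8, mu=0.4):
    # Adaptive noise cancellation: the error signal is taken as the enhanced speech
    w = np.zeros(order)                    # adaptive filter weights
    enhanced = np.zeros(len(noisy))
    sq_error = np.zeros(len(noisy))
    for n in range(order, len(noisy)):
        x = reference[n - order:n][::-1]   # most recent reference-noise samples
        y = np.dot(w, x)                   # current estimate of the noise
        e = noisy[n] - y                   # error sample = enhanced speech sample
        w = w + mu * e * x                 # LMS weight update with step size mu
        enhanced[n] = e
        sq_error[n] = e ** 2
    return enhanced, sq_error

# Example with a synthetic stand-in for the recorded fear utterance
clean_speech = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))   # placeholder signal
noise = np.sqrt(0.01) * np.random.randn(len(clean_speech))        # white noise of variance 0.01
enhanced, sq_error = lms_enhance(clean_speech + noise, noise)
print('Mean squared error over the last samples:', sq_error[-2000:].mean())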
Fig. 6.1 Enhancement of fear speech emotion using LMS algorithm
Fig. 6.2 Error minimization curve for fear speech emotion using LMS algorithm
Fig. 6.3 Enhancement of fear speech emotion using NLMS algorithm
Fig. 6.4 Error minimization curve for fear speech emotion using NLMS algorithm
Fig. 6.5 Enhancement of fear speech emotion using RLS algorithm
Fig. 6.6 Error minimization curve for fear speech emotion using RLS algorithm
CHAPTER 7
CONCLUSION AND FUTURE SCOPE
Speech enhancement has increased the classification accuracy for fear speech in this work by
increasing the SNR, and the MSE is correspondingly reduced. Among all the tested algorithms, RLS
gave the best performance in terms of MSE and classification accuracy, followed by NLMS, for the
reasons explained earlier. This held true for both the enhancement and the classification stages.
REFERENCES
[1] Jyoti Dhiman, Shadab Ahmad and Kuldeep Gulia, "Comparison between Adaptive filter Algorithms
(LMS, NLMS and RLS)", International Journal of Science, Engineering and Technology Research
(IJSETR), Vol. 2, Issue 5, pp. 1100–1103, May 2013.
[2] Anuja Chougule and V. V. Patil, "Survey of Noise Estimation Algorithms for Speech Enhancement
Using Spectral Subtraction", International Journal on Recent and Innovation Trends in Computing
and Communication, Vol. 2, Issue 12, pp. 4156–4160, Dec. 2014.
[3] S. Haykin, "Adaptive Filter Theory", 3rd Edition, Prentice Hall, Upper Saddle River, 1996.
[4] Allam Mousa, Marwa Qados and Sherin Bader, "Speech Signal Enhancement Using Adaptive Noise
Cancellation Techniques", No. 7, pp. 375–383, Sept. 2012.
[5] G. Iliev and N. Kasabov, "Adaptive filtering with averaging in noise cancellation for voice and
speech recognition", in Proc. ICONIP/ANZIIS/ANNES'99 Workshop, Dunedin, New Zealand, pp. 71–74,
22–23 Nov. 1999.
[6] Redha Bendoumia and Mohamed Djendi, "Two-channel variable-stepsize forward-and-backward
adaptive algorithms for acoustic noise reduction and speech enhancement", Signal Processing,
Elsevier, pp. 226–244, 2015.
[7] Meng Sun, Yinan Li, Jort F. Gemmeke and Xiongwei Zhang, "Speech Enhancement Under Low SNR
Conditions Via Noise Estimation Using Sparse and Low-Rank NMF with Kullback–Leibler Divergence",
IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 7, July 2015.
[8] H. K. Palo, M. N. Mohanty and M. Chandra, "Emotion recognition using MLP and GMM for Oriya
language", International Journal of Computational Vision and Robotics, Inderscience (in press), 2015.
[9] Hemanta Kumar Palo, Mihir Narayan Mohanty and Mahesh Chandra, "Efficient feature combination
techniques for emotional speech classification", International Journal of Speech Technology,
Vol. 19, pp. 135–150, Jan. 2016.
[10] H. K. Palo and Mihir Narayan Mohanty, "Classification of Emotional Speech of Children Using
Probabilistic Neural Network", International Journal of Electrical and Computer Engineering
(IJECE), Vol. 5, No. 2, pp. 311–317, April 2015.
[11] S. Haykin, "Neural Networks", 2nd Ed., Prentice Hall, 1999.
[12] Abbas A, Abdelsamea MM, Gaber MM (2020) Classification of COVID-19 in chest X-ray images
using DeTraC deep convolutional neural network. arXiv preprint arXiv:2003.13815
[13] Akçay MB, Oğuz K (2020) Speech emotion recognition: emotional models, databases, features,
preprocessing methods, supporting modalities, and classifiers. Speech Commun 116:56–76.
https://doi.org/10.1016/j.specom.2019.12.001
[14] Aouani H, Ayed YB (2019) Deep support vector machines for speech emotion recognition.
[15] Aouani H, Ben Ayed Y (2018) Emotion recognition in speech using MFCC with SVM, DSVM and
auto-encoder. In: 2018 4th International conference on advanced technologies for signal and image
processing (ATSIP), pp 1–5
[16] Badshah AM, Ahmad J, Rahim N, Baik SW (2017) Speech emotion recognition from spectrograms
with deep convolutional neural network. In: 2017 International conference on platform technology
and service (PlatCon), IEEE, pp 1–5
[17] Barra S, Carta SM, Corriga A, Podda AS, Recupero DR (2020) Deep learning and time series-to-
image encoding for financial forecasting. IEEE/CAA J Autom Sin 7(3):683–692
[18] Basu S, Chakraborty J, Bag A, Aftabuddin M (2017) A review on emotion recognition using
speech. In: 2017 International conference on inventive communication and computational
technologies (ICICCT), pp 109–114
[19] Bhavan A, Chauhan P, Hitkul SRR (2019) Bagged support vector machines for emotion recognition
from speech. Knowl Based Syst 184:104886. https://doi.org/10.1016/j.knosys.2019.104886
[20] Bhaykar M, Yadav J, Rao KS (2013) Speaker dependent, speaker independent and cross language
emotion recognition from speech using GMM and HMM. In: 2013 National conference on
communications (NCC), pp 1–5. https://doi.org/10.1109/NCC.2013.6487998
[21] Bojani M, Deli V, Karpov A (2020) Call redistribution for a call center based on speech emotion
recognition. Appl Sci 10(13):4653. https://doi.org/10.3390/app10134653
[22] Cen L, Wu F, Yu ZL, Hu F (2016) Chapter 2—A real-time speech emotion recognition system and
its application in online learning. In: Tettegah SY, Gartmeier M (eds) Emotions, technology, design,
and learning, emotions and technology. Academic Press, San Diego, pp 27–46.
https://doi.org/10.1016/B978-0-12-801856-9.00002-5
[23] Chen L, Mao X, Xue Y, Cheng LL (2012) Speech emotion recognition: features and classification
models. Digit Signal Process 22(6):1154–1160. https://doi.org/10.1016/j.dsp.2012.05.007
[24] Cibau N, Albornoz E, Rufiner H (2013) Speech emotion recognition using a deep autoencoder.
[25] Daneshfar F, Kabudian SJ (2019) Speech emotion recognition using discriminative dimension
reduction by employing a modified quantum-behaved particle swarm optimization algorithm.
Multimed Tools Appl 79:1261–1289
[26] Deb S, Dandapat S (2016) Emotion classification using residual sinusoidal peak amplitude. In: 2016
International conference on signal processing and communications (SPCOM), pp 1–5
[27] Deng J, Zhang Z, Marchi E, Schuller B (2013) Sparse autoencoder-based feature transfer learning
for speech emotion recognition. In: 2013 Humaine association conference on affective computing
and intelligent interaction, pp 511–516
[28] Deng J, Xu X, Zhang Z, Frühholz S, Schuller B (2017) Universum autoencoder-based domain
adaptation for speech emotion recognition. IEEE Signal Process Lett 24(4):500–504
[29] Deng J, Xu X, Zhang Z, Frühholz S, Schuller B (2018) Semisupervised autoencoders for speech
emotion recognition. IEEE/ACM Trans Audio Speech Lang Process 26(1):31–43
[30] Han T, Zhang J, Zhang Z, Sun G, Ye L, Ferdinando H, Alasaarela E, Seppänen T, Yu X, Yang S
(2018) Emotion recognition and school violence detection from children speech. EURASIP J Wirel
Commun Netw 1:235
[31] Huang C, Gong W, Fu W, Feng D (2014) A research of speech emotion recognition based on deep
belief network and SVM. Math Probl Eng 2014:1–7. https://doi.org/10.1155/2014/749604
[32] Jannat R, Tynes I, Lime LL, Adorno J, Canavan S (2018) Ubiquitous emotion recognition using
audio and video data. In: Proceedings of the 2018 ACM international joint conference and 2018
international symposium on pervasive and ubiquitous computing and wearable computers,
association for computing machinery, New York, NY, USA, UbiComp'18, pp 956–959.
https://doi.org/10.1145/3267305.3267689
[33] Kamaruddin N, Wahab A (2010) Driver behavior analysis through speech emotion understanding.
In: 2010 IEEE intelligent vehicles symposium, pp 238–243
[34] Likitha MS, Gupta SRR, Hasitha K, Raju AU (2017) Speech based human emotion recognition
using MFCC. In: 2017 International conference on wireless communications, signal processing and
networking (WiSPNET), pp 2257–2260
[35] Livingstone SR, Russo FA (2018) The Ryerson audio-visual database of emotional speech and song
(RAVDESS). https://doi.org/10.5281/zenodo.1188976. Funding Information: Natural Sciences
and Engineering Research Council of Canada: 2012-341583; Hear the world research chair in music
and emotional speech from Phonak
[36] Low DM, Bentley KH, Ghosh SS (2020) Automated assessment of psychiatric disorders using
speech: a systematic review. Laryngoscope Investig Otolaryngol 5(1):96–116
[37] Mansour A, Chenchah F, Lachiri Z (2019) Emotional speaker recognition in real life conditions
using multiple descriptors and I-vector speaker modeling technique. Multimed Tools Appl
78(6):6441–6458
[38] Martin GS, Droguett EL, Meruane V, das Chagas Moura M (2019) Deep variational auto-encoders:
a promising tool for dimensionality reduction and ball bearing elements fault diagnosis. Struct
Health Monit 18(4):1092–1128. https://doi.org/10.1177/1475921718788299
[39] Mirsamadi S, Barsoum E, Zhang C (2017) Automatic speech emotion recognition using recurrent
neural networks with local attention. In: 2017 IEEE International conference on acoustics, speech
and signal processing (ICASSP), pp 2227–2231
[40] Muljono M, Prasetya M, Harjoko A, Supriyanto C (2019) Speech emotion recognition of
Indonesian movie audio tracks based on MFCC and SVM. pp 22–25.
https://doi.org/10.1109/IC3I46837.2019.9055509
[41] Mustaqeem, Kwon S (2019) A CNN-assisted enhanced audio signal processing for speech emotion
recognition. Sensors 20(1):183. https://doi.org/10.3390/s20010183
[42] Narin A, Kaya C, Pamuk Z (2020) Automatic detection of coronavirus disease (COVID-19) using
X-ray images and deep convolutional neural networks. arXiv preprint arXiv:2003.10849
[43] Naviamos MP, Niguidula JD (2020) A study on determining household poverty status: SVM based
classification model. In: Proceedings of the 3rd international conference on software engineering
and information management, association for computing machinery, New York, NY, USA,
ICSIM'20, pp 79–84. https://doi.org/10.1145/3378936.3378969
[44] Pandey SK, Shekhawat HS, Prasanna SRM (2019) Deep learning techniques for speech emotion
recognition: a review. In: 2019 29th International conference Radioelektronika
(RADIOELEKTRONIKA), pp 1–6
[45] Pantazi XE, Moshou D, Bochtis D (2020) Chapter 2—Artificial intelligence in agriculture. In:
Pantazi XE, Moshou D, Bochtis D (eds) Intelligent data mining and fusion systems in agriculture.
Academic Press, pp 17–101. https://doi.org/10.1016/B978-0-12-814391-9.00002-9.
http://www.sciencedirect.com/science/article/pii/B9780128143919000029
[46] Pichora-Fuller MK, Dupuis K (2020) Toronto emotional speech set (TESS).
https://doi.org/10.5683/SP2/E8H2MF
[47] Polzehl T, Schmitt A, Metze F, Wagner M (2011) Anger recognition in speech using acoustic and
linguistic cues. Speech Commun 53:1198–1209
[48] Popova A, Rassadin A, Ponomarenko A (2018) Emotion recognition in sound. Neuroinformatics
736:117–124. https://doi.org/10.1007/978-3-319-66604-4_18
[49] Sahay R, Mahfuz R, Gamal AE (2019) Combatting adversarial attacks through denoising and
dimensionality reduction: a cascaded autoencoder approach. In: 2019 53rd Annual conference on
information sciences and systems (CISS), pp 1–6
[50] Schipor OA et al (2014) Improving computer assisted speech therapy through speech based emotion
recognition. In: Conference proceedings of eLearning and Software for Education (eLSE), Carol I
National Defence University Publishing House, 01, pp 101–104
[51] Shankar K, Lakshmanaprabu S, Gupta D, Maseleno A, De Albuquerque VHC (2020) Optimal
feature-based multi-kernel SVM approach for thyroid disease classification. J Supercomput
76(2):1128–1143
[52] Sonawane A, Inamdar MU, Bhangale KB (2017) Sound based human emotion recognition using
MFCC multiple SVM. In: 2017 International conference on information, communication,
instrumentation and control (ICICIC), pp 1–4
[53] Sowmya V, Rajeswari A (2020) Speech emotion recognition for Tamil language speakers. In:
Agarwal S, Verma S, Agrawal DP (eds) Mach Intell Signal Process. Springer, Singapore, pp 125–136
[54] Sun L, Fu S, Wang F (2019) Decision tree SVM model with fisher feature selection for speech
emotion recognition. EURASIP J Audio Speech Music Process 1:2
[55] Thomas SA, Race AM, Steven RT, Gilmore IS, Bunch J (2016) Dimensionality reduction of mass
spectrometry imaging data using autoencoders. In: 2016 IEEE symposium series on computational
intelligence (SSCI), pp 1–7
[56] Tomba K, Dumoulin J, Mugellini E, Khaled OA, Hawila S (2018) Stress detection through speech
analysis. In: Proceedings of the 15th International joint conference on e-Business and
telecommunications—Volume 1: ICETE, INSTICC, SciTePress, pp 394–398.
https://doi.org/10.5220/0006855803940398
[57] Vijayarajeswari R, Parthasarathy P, Vivekanandan S, Basha AA (2019) Classification of
mammogram for early detection of breast cancer using SVM classifier and Hough transform.
Measurement 146:800–805. https://doi.org/10.1016/j.measurement.2019.05.083
[58] Wang L, Wong A (2020) COVID-net: a tailored deep convolutional neural network design for
detection of COVID-19 cases from chest X-ray images. arXiv preprint arXiv:2003.09871
[59] Wang J, He H, Prokhorov DV (2012) A folded neural network autoencoder for dimensionality
reduction. Proced Comput Sci 13:120–127. https://doi.org/10.1016/j.procs.2012.09.120
(proceedings of the International Neural Network Society Winter Conference (INNS-WC2012))
[60] Wang W, Huang Y, Wang Y, Wang L (2014) Generalized autoencoder: a neural network
framework for dimensionality reduction. In: 2014 IEEE Conference on computer vision and pattern
recognition workshops, pp 496–503
[61] Xia R, Deng J, Schuller B, Liu Y (2014) Modeling gender information for emotion recognition
using denoising autoencoder. In: 2014 IEEE International conference on acoustics, speech and
signal processing (ICASSP), pp 990–994
[62] Zabalza J, Ren J, Zheng J, Zhao H, Qing C, Yang Z, Du P, Marshall S (2016) Novel segmented
stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral
imaging. Neurocomputing 185:1–10. https://doi.org/10.1016/j.neucom.2015.11.044
[63] Zhang B, Provost EM, Essl G (2016) Cross-corpus acoustic emotion recognition from singing and
speaking: a multi-task learning approach. In: 2016 IEEE International conference on acoustics,
speech and signal processing (ICASSP), pp 5805–5809
[64] Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM
networks. Biomed Signal Process Control 47:312–323
[65] Zheng L, Li Q, Ban H, Liu S (2018) Speech emotion recognition based on convolution neural
network combined with random forest. In: 2018 Chinese control and decision conference (CCDC),
pp 4143–4147
[66] Zhou DX (2020) Universality of deep convolutional neural networks. Appl Comput Harmonic Anal
48(2):787–794
BIBLIOGRAPHY
[1] https://www.google.co.in/books/edition/Automatic_Speech_Recognition/rUBTBQAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[2] https://www.google.co.in/books/edition/Robust_Automatic_Speech_Recognition/_uXIBAAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[3] https://www.google.co.in/books/edition/Readings_in_Speech_Recognition/DOfLYTOvTMQC?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[4] https://www.google.co.in/books/edition/Python_Machine_Learning_Cookbook/EwNwDQAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[5] https://www.google.co.in/books/edition/Statistical_Methods_for_Speech_Recogniti/1C9dzcJTWowC?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[6] https://www.google.co.in/books/edition/Deep_Learning_for_NLP_and_Speech_Recogni/8cmcDwAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[7] https://www.google.co.in/books/edition/Mechanisms_of_Speech_Recognition/7hSoBQAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[8] https://www.google.co.in/books/edition/Automatic_Speech_Recognition/KVwFCAAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[9] https://www.google.co.in/books/edition/Speech_Language_Processing/LCnx6xaDZsIC?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
[10] https://www.google.co.in/books/edition/Speech_Recognition_and_Coding/45nvCAAAQBAJ?hl=en&gbpv=1&dq=speech+recognition&printsec=frontcover
URLs
[1] https://www.analyticsinsight.net/speech-emotion-recognition-ser-through-machine-learning/
[2] https://data-flair.training/blogs/python-mini-project-speech-emotion-recognition/
[3] https://www.frontiersin.org/articles/10.3389/fcomp.2020.00014/full
[4] https://blog.dataiku.com/speech-emotion-recognition-deep-learning
[5] https://www.researchgate.net/publication/335360469_Speech_Emotion_Recognition_Using_Deep_Learning_Techniques_A_Review
[6] https://www.intechopen.com/chapters/65993
[7] https://www.ijrte.org/wp-content/uploads/papers/v7i4s/E1917017519.pdf
[8] https://www.sciencedirect.com/science/article/pii/S1877050920318512
[9] https://medium.com/analytics-vidhya/speech-emotion-recognition-using-machine-learning-df31f6fa8404
[10] https://towardsdatascience.com/self-supervised-voice-emotion-recognition-using-transfer-learning-d21ef7750a10
APPENDIX
MAIN CODE:
# Connect your Drive with Colab
from google.colab import drive
drive.mount('/content/drive/')

# Check where your dataset zip file is
!ls '/content/drive/My Drive/Important Extras/Data Science Works/_Data Science Work/Speech Emotion Recognition'

# Unzip the file contents
!unzip '/content/drive/My Drive/Important Extras/Data Science Works/_Data Science Work/Speech Emotion Recognition/speech-emotion-recognition-ravdess-data.zip'
!ls

# Install the audio libraries (Colab)
!pip install librosa soundfile

# Import all required libraries
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Function for extracting MFCC, chroma and mel features from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            # keyword arguments are required by recent librosa versions
            mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
        return result
# Define the emotions dictionary (RAVDESS file-name codes)
emotions = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

# Emotions we want to observe
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']
# Load the data and extract features for each sound file
def load_data(test_size=0.2):
    x, y = [], []
    for folder in glob.glob('/content/Actor_*'):
        print(folder)
        for file in glob.glob(folder + '/*.wav'):
            file_name = os.path.basename(file)
            # The third hyphen-separated field of the RAVDESS file name encodes the emotion
            emotion = emotions[file_name.split('-')[2]]
            if emotion not in observed_emotions:
                continue
            feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
            x.append(feature)
            y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

x_train, x_test, y_train, y_test = load_data(test_size=0.2)
# Shape of the train and test sets, and number of features extracted
print((x_train.shape[0], x_test.shape[0]))
print(f'Features extracted: {x_train.shape[1]}')

# Initialise the Multi Layer Perceptron classifier
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08,
                      hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500)
model.fit(x_train, y_train)

# Predict for the test set
y_pred = model.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))