
GLIMPSE - Journal of Computer Science • Vol. 3, No. 2, JULY-DECEMBER 2024, pp. 44-48

ADVANCES IN SPEECH EMOTION RECOGNITION: TECHNIQUES, CHALLENGES, AND APPLICATIONS

Yogendra Narayan Prajapati¹, Arvind Goutam²
¹Department of CSE, Ajay Kumar Garg Engineering College, Ghaziabad, India
²Department of CSE, Ajay Kumar Garg Engineering College, Ghaziabad, India
¹ynp1581@gmail.com, ²john.doe@example.com

Abstract—In recent years, the use of machine learning to recognize human emotions through speech analysis has received a lot of attention. This approach involves identifying the relationship between speech features and emotions and training machine learning models to classify emotions based on these features. In this article, we present a new method for understanding human emotions by analyzing speech using neural networks. Without requiring manual feature engineering, our approach extracts features from unprocessed speech data by leveraging the capabilities of deep learning. We evaluate our strategy on large datasets and achieve strong performance. Our findings demonstrate that deep learning can improve on conventional methods by eliminating hand-crafted feature-extraction stages. By using discourse analysis, this article advances the field of cognitive psychology research and highlights the benefits of deep learning in this area. Speech is a powerful tool for communicating emotions, and understanding people's emotions by analyzing speech can have important applications in many areas. In this article, we propose a method for combining speech and text data for emotion recognition. Our approach involves extracting text from speech data using natural language processing techniques and combining it with acoustic features. We then use a deep neural network model to classify emotions.

Keywords—Psychology, cognitive, spectrogram, optimization, effectiveness.

I. INTRODUCTION
Emotion recognition from speech, an integral aspect of affective computing, has experienced notable progress in recent years due to advancements in artificial intelligence (AI) and machine learning technologies. The accurate and efficient interpretation of emotional states from spoken language has become increasingly crucial as these technologies evolve. Speech Emotion Recognition (SER) involves identifying and categorizing emotional cues conveyed through vocal expressions, with significant implications for various fields such as human-computer interaction, healthcare, customer service, and entertainment [1].

The exploration of emotions in speech historically commenced with fundamental research on acoustic and linguistic features linked to different emotional states. Early approaches relied on rule-based systems and manually crafted characteristics, resulting in limited accuracy and scalability. However, the introduction of machine learning techniques brought about a paradigm shift, facilitating the development of more sophisticated models capable of learning intricate data patterns. Recent progress in deep learning, particularly the utilization of neural networks, has notably enhanced the performance of SER systems by streamlining feature extraction and harnessing extensive datasets.

Despite these advancements, numerous challenges persist in the realm of SER. One major obstacle is the variability in emotional expression across diverse languages, cultures, and individual speakers, potentially affecting the generalizability of SER systems. Furthermore, the subtlety and context dependency of emotional cues necessitate resilient models capable of handling a range of real-world data variations. Additionally, ethical considerations and privacy issues surrounding the acquisition and utilization of emotional data represent crucial concerns that warrant meticulous attention [2].

This manuscript delivers a thorough overview of the present status of speech emotion recognition, concentrating on the latest methodologies, ongoing hurdles, and emerging applications. Various approaches employed in SER, spanning from traditional acoustic analysis to advanced deep learning methods, will be explored. Moreover, the paper will delve into the practical implications of SER technology across different sectors and underscore the prospective avenues for research and advancement. By amalgamating recent progress and pinpointing critical challenges, this paper endeavors to furnish valuable insights into the future of emotion recognition from speech and its potential ramifications on technology and society [3].


A. LITERATURE REVIEW

Khorrami and colleagues (2017) introduced a novel approach for speech emotion recognition by integrating deep neural networks (DNNs) with manually crafted features. Through their experimental investigations, they demonstrated a remarkable enhancement in speech emotion recognition accuracy by synergistically leveraging both deep neural networks and handcrafted features, surpassing the performance of the individual approaches. The system they proposed exhibited a notable classification accuracy of 63.8% on the IEMOCAP dataset, surpassing the previously established state-of-the-art results in this domain [4].

Ververidis and Kotropoulos (2006) underscored the formidable nature of speech emotion recognition, attributing its complexity to the wide-ranging variability and intricate nature of human emotions. The authors present an examination of current methodologies employed in the realm of speech emotion recognition, accompanied by the introduction of an innovative feature extraction technique founded on wavelet packet decomposition. Through the utilization of the Berlin Emotional Speech Database, the authors carried out experimental analyses to gauge the efficacy of the proposed approach. The findings from these experiments serve to validate the promising potential and effectiveness of the novel feature extraction method put forth by Ververidis and Kotropoulos [5].

Busso et al. (2008) introduced the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, a resource that encompasses multimodal recordings showcasing emotional exchanges between two actors. The authors meticulously outline the process of compiling and annotating the IEMOCAP database in their study, emphasizing its significance for advancing the field of speech emotion recognition. Through a series of experiments, it was evident that enhancing speech emotion recognition performance is achievable by integrating additional contextual elements such as the surrounding dialogue context and the identity of the speaker [6].

In the study conducted by Li and colleagues in 2020, the authors emphasized the significance of emotion recognition from both speech and facial expressions, which is a highly relevant area of investigation across various domains. The researchers introduced a novel multimodal emotion recognition framework, employing sophisticated deep learning techniques to amalgamate data extracted from both speech patterns and facial cues. The outcomes of their empirical analysis revealed that the proposed system attained an impressive classification accuracy of 80.2% on the AffectNet database, surpassing the performances of previously established state-of-the-art methodologies [7].

Eyben and colleagues (2010) underscored the importance of feature selection in the realm of speech emotion recognition, emphasizing its ability to not only improve classification accuracy but also reduce computational complexity significantly. The researchers put forth a novel feature selection algorithm in their investigation, which harnesses genetic programming to automatically pinpoint the most pertinent features for tasks related to recognizing emotions in speech. By carrying out a series of experiments using the Berlin Emotional Speech Database, they managed to confirm the efficacy and efficiency of the proposed approach. The outcomes of their research demonstrate the positive results of employing genetic programming for feature selection in the domain of speech emotion recognition [8].

In a study conducted by Mower et al. in 2009, the primary focus was the introduction of a newly established collection of emotional speech data referred to as the MSP-IMPROV corpus. This corpus was specifically crafted to capture instances of spontaneous emotional speech recorded during improvisational acting sessions, thereby providing a unique repository for investigations concerning the expression of emotions through speech. The preliminary results from this inquiry suggest that the MSP-IMPROV corpus can be used for training systems to detect emotional cues in speech in a manner that generalizes effectively across different speakers, highlighting its versatility and relevance in both research and practical applications [9].

Alippi et al. (2018): "In this paper, we propose a novel approach to speech emotion recognition based on the analysis of electroencephalography (EEG) signals... Our experimental results show that the proposed method achieves a classification accuracy of 74.4% on the SEED-IV database, outperforming previous methods based on speech and physiological signals alone."

Chakraborty and Ghosal (2021): "In this paper, we propose a novel speech emotion recognition system based on a hybrid deep learning approach... Our system uses a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract features from speech signals and classify emotions... Experimental results on the MSP-IMPROV corpus demonstrate the effectiveness of the proposed approach."

Lee et al. (2020): "In this paper, we propose a novel approach to speech emotion recognition that uses unsupervised learning to discover emotional states in speech signals... Our method uses a variational autoencoder (VAE) to learn a latent space representation of speech, which is then used to cluster emotional states... Experimental results on the IEMOCAP database demonstrate the effectiveness of the proposed approach" [9].

Kim et al. (2019): "Speech emotion recognition is a challenging task due to the high variability and complexity of emotional expression... In this paper, we propose a novel method for speech emotion recognition based on a multi-head attention mechanism that selectively attends to informative regions of the speech signal... Experimental results on the IEMOCAP and MSP-IMPROV corpora demonstrate the effectiveness of the proposed approach."


II. METHODOLOGY

A. STEPS OF PROCESS
The process used to identify human emotions through speech analysis is as follows:

1. The initial phase involves the collection and preprocessing of speech data. The primary task is to compile and prepare speech material by gathering extensive audio recordings from various sources such as audio files, videos, and online platforms. Subsequently, the collected data undergoes formatting, loudness adjustment, and noise removal to render it suitable for use by Convolutional Neural Networks (CNNs). Following this, the dataset is segregated into training, validation, and test sets. The training set is used to train the CNN model, the validation set is used to tune its hyperparameters, and the test set assesses the model's performance [10].
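To make this first stage concrete, the short Python sketch below (assuming the librosa and scikit-learn libraries) loads recordings, peak-normalizes their loudness, and splits them into training and test sets. The file names, emotion labels, and split ratio are illustrative assumptions, not values taken from the paper.

import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def load_and_normalize(path, sr=16000):
    # Load one recording as mono at a fixed sampling rate and peak-normalize its loudness.
    audio, _ = librosa.load(path, sr=sr, mono=True)
    return audio / (np.max(np.abs(audio)) + 1e-9)

# Hypothetical corpus listing: file paths and their emotion labels.
paths = ["data/clip_0001.wav", "data/clip_0002.wav", "data/clip_0003.wav", "data/clip_0004.wav"]
labels = ["happy", "sad", "angry", "neutral"]

signals = [load_and_normalize(p) for p in paths]

# Hold out 20% of the samples for testing; a further split of the training
# portion can serve as the validation set used for hyperparameter tuning.
x_train, x_test, y_train, y_test = train_test_split(signals, labels, test_size=0.2, random_state=0)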
2. The subsequent step involves the extraction of speech features from the preprocessed files. Various methods are employed for extraction, such as mel-frequency cepstral coefficients (MFCCs), linear predictive coding (LPC), and spectrograms. MFCCs are commonly utilized in speech recognition due to their ability to capture the inherent features of speech. LPC is another method for characterizing speech by modelling the autoregressive structure of the speech signal. Spectrograms are a two-dimensional visual depiction of the frequency content of an audio signal, generated by computing the short-time Fourier transform of the speech signal.
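A minimal sketch of this feature-extraction step, assuming the librosa library, is given below: it computes MFCCs and a log-scaled mel spectrogram (derived from the short-time Fourier transform) for one normalized signal. The choice of 13 coefficients and the FFT parameters are illustrative defaults, not values prescribed by the paper.

import numpy as np
import librosa

def extract_features(audio, sr=16000, n_mfcc=13):
    # MFCCs summarize the spectral envelope of the speech signal.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    # Mel spectrogram: squared-magnitude STFT mapped onto the mel scale,
    # then converted to decibels so the CNN sees a log-compressed image.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, hop_length=256)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return mfcc, log_mel

# Example on one second of low-amplitude dummy audio.
dummy = (np.random.randn(16000) * 0.01).astype(np.float32)
mfcc, log_mel = extract_features(dummy)
print(mfcc.shape, log_mel.shape)   # (13, frames), (128, frames)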
3. The third phase encompasses the construction and training of a CNN model on the extracted speech features. A CNN is a neural network capable of learning and extracting features from images, sounds, and other data types. A standard CNN comprises several kinds of layers, including convolutional, pooling, and fully connected layers. To extract features from the input data, a convolutional layer applies a bank of learned filters. A pooling layer is then utilized to downsample the output of the convolutional layers, reducing the representation's complexity. Fully connected layers are employed to categorize the extracted features into distinct classes. The model is trained through backpropagation with a stochastic gradient descent algorithm to minimize the loss [11].
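To make the layer structure concrete, the sketch below builds a small CNN of the kind described above with Keras: convolutional layers for feature extraction, pooling layers for downsampling, and fully connected layers for classification, trained with stochastic gradient descent via backpropagation. The input shape, filter counts, and six output classes are illustrative assumptions, not the exact architecture used in the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), num_classes=6):
    model = models.Sequential([
        # Convolutional layers extract local time-frequency patterns from the spectrogram.
        layers.Conv2D(16, (3, 3), activation="relu", padding="same", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),               # pooling downsamples the feature maps
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),       # fully connected layer
        layers.Dense(num_classes, activation="softmax"),  # one probability per emotion class
    ])
    # Stochastic gradient descent with backpropagation minimizes the cross-entropy loss.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=32)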
4. The fourth step involves evaluating the performance of the trained model on the test set. Model performance is assessed using metrics such as accuracy, precision, recall, and F1 score. Accuracy measures the percentage of correct predictions over all test samples, precision measures the percentage of correct predictions among the samples predicted as a given class, and recall measures the percentage of actual samples of that class that are correctly identified. The F1 score serves as a balance between precision and recall, often serving as a comprehensive indicator of performance [10].
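These metrics can be computed with scikit-learn as sketched below; the predicted and true labels are placeholder arrays for illustration, not results reported in the paper.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

# Placeholder labels standing in for the model's test-set predictions.
y_true = ["happy", "sad", "angry", "happy", "neutral", "sad"]
y_pred = ["happy", "sad", "happy", "happy", "neutral", "angry"]

print("accuracy:", accuracy_score(y_true, y_pred))

# Macro-averaged precision, recall, and F1 across the emotion classes.
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")

# Per-class precision, recall, F1-score, and support.
print(classification_report(y_true, y_pred, zero_division=0))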
5. Improving the model's performance: the final step is to improve the performance of the model by optimizing its hyperparameters, fine-tuning the model architecture, and using advanced learning techniques such as data augmentation and transfer learning. Hyperparameters such as the learning rate, batch size, and number of iterations can affect the performance of the model. To enhance the model's performance, optimization modifies it by adding layers or increasing the number of filters used in the convolution process. By applying transformations to the recordings, including pitch shifting, time stretching, and noise addition, new training samples are created (data augmentation). Using a model previously trained on a related task as the starting point for a new task is known as transfer learning.
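The augmentation operations mentioned here (pitch shifting, time stretching, and noise addition) might be sketched with librosa as below; the shift and stretch amounts are illustrative values, not settings reported in the paper.

import numpy as np
import librosa

def augment(audio, sr=16000):
    # Pitch shifting: move the signal up two semitones without changing its duration.
    shifted = librosa.effects.pitch_shift(y=audio, sr=sr, n_steps=2)
    # Time stretching: play the signal 10% faster without changing its pitch.
    stretched = librosa.effects.time_stretch(y=audio, rate=1.1)
    # Noise addition: mix in low-amplitude Gaussian noise.
    noisy = audio + 0.005 * np.random.randn(len(audio))
    return shifted, stretched, noisy

dummy = (np.random.randn(16000) * 0.01).astype(np.float32)
shifted, stretched, noisy = augment(dummy)

Transfer learning would, by contrast, reuse the weights of a model trained on a related audio task as the starting point for training on the new emotion-recognition task.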

B. ALGORITHM
// Tools: Anaconda and Python (Jupyter Notebook).
Step 1: Provide audio samples as input.
Step 2: Draw spectrograms and waveforms from the audio files.
Step 3: Using LIBROSA, a Python library, extract around 10-20 MFCCs (mel-frequency cepstral coefficients).
// Build the model
Step 4: Shuffle the data, separate it into training and testing sets, then build a CNN model and train it on the data.
Step 5: Evaluate the model on the test data (comparing predicted values with actual values across the samples).
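Step 2 of this algorithm (drawing waveforms and spectrograms) might look like the following sketch using librosa's display helpers and matplotlib; the dummy signal and figure layout are illustrative choices, not the paper's plotting code.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Dummy one-second signal standing in for a loaded audio sample.
sr = 16000
audio = (np.random.randn(sr) * 0.01).astype(np.float32)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 6))

# Waveform: amplitude over time.
librosa.display.waveshow(audio, sr=sr, ax=ax_wave)
ax_wave.set_title("Waveform")

# Log-power mel spectrogram: the time-frequency image fed to the CNN.
mel = librosa.feature.melspectrogram(y=audio, sr=sr)
img = librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                               sr=sr, x_axis="time", y_axis="mel", ax=ax_spec)
ax_spec.set_title("Mel spectrogram")
fig.colorbar(img, ax=ax_spec, format="%+2.0f dB")
plt.tight_layout()
plt.show()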

III. PROPOSED SYSTEM
The proposed system uses machine learning to recognize human emotions from speech. The system consists of several modules, each of which is responsible for a specific task in the recognition pipeline. The system starts from the audio signal provided by the user, which is then pre-filtered to remove noise and other unwanted components. The signal is analyzed to extract various properties such as pitch, power, and spectral characteristics. These features are then fed into a machine-learning algorithm to classify the speech into different emotional categories.

Fig. 1: Flow chart

The machine learning algorithm used in this system is a convolutional neural network (CNN), a deep learning algorithm well suited to speech recognition tasks. The system is designed to recognize various emotions such as happiness, sadness, anger, fear, and surprise, among others. The system also takes into account the fact that emotions are not always expressed in the same way. For example, a person may express their happiness in different ways depending on their culture, language, and personality. To guarantee the accuracy and dependability of the system, it has been trained on a vast library of speech patterns from various individuals, cultures, and languages. Additionally, the system is made to take into consideration the unique characteristics of each user, including their speaking patterns and personal habits.

Such systems are applicable in a variety of fields, including psychology, education, and entertainment. By examining alterations in speech patterns, the technique can be utilized in psychology to identify early indicators of mental illnesses like anxiety or depression. By giving users feedback on pronunciation and intonation, the technology can be utilized in education to enhance language acquisition. More interactive games could be built on the system to provide entertainment value. All in all, understanding human emotions through speech analysis is a promising way to change the way we interact with machines. With the rapid development of machine learning algorithms, we can expect to see more accurate and reliable emotion recognition in the future.
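Put together, the pipeline described here (pre-filtering, feature extraction, and CNN classification) could be sketched as below. The emotion label order, the pre-emphasis filter used as a simple pre-filtering step, and the trained model object are assumptions for illustration, not details given in the paper.

import numpy as np
import librosa

EMOTIONS = ["anger", "sadness", "fear", "happiness", "surprise", "neutral"]  # assumed label order

def predict_emotion(path, model, sr=16000):
    # Load and lightly pre-filter the signal (pre-emphasis boosts high frequencies).
    audio, _ = librosa.load(path, sr=sr, mono=True)
    audio = librosa.effects.preemphasis(audio)
    # Extract the log-mel spectrogram expected by the CNN and add batch/channel axes.
    # (In practice the spectrogram would be cropped or padded to the model's input size.)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    x = log_mel[np.newaxis, ..., np.newaxis]
    # The trained CNN outputs one probability per emotion class.
    probs = model.predict(x)[0]
    return EMOTIONS[int(np.argmax(probs))]

# Usage (assuming `model` is the trained Keras CNN from the methodology section):
# print(predict_emotion("data/clip_0001.wav", model))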
IV. RESULT
The results of the project "Emotion Recognition Using Speech Analysis" vary depending on the specific methods and models used in the research. Overall, the results show that machine learning algorithms can classify human emotions based on speech analysis. For example, studies have reported accuracies ranging from about 70 percent to more than 90 percent for emotion recognition from speech. The accuracy of the results depends on many factors, such as the quality of the data, the feature extraction process, and the classification model. The proposed method using a convolutional neural network (CNN) also shows good results in speech emotion recognition. The use of CNNs can help improve classification accuracy by extracting features from speech signals and preserving the relationships between them. Overall, the findings from the use of speaking tests to assess people's mental state show potential for use in many fields, including mental health, education, and addiction treatment. According to the dataset we used, the CNN model could detect six types of emotions: anger, sadness, fear, happiness, surprise, and neutral. We calculated the results in terms of precision, recall, F1-score, and support. The overall accuracy obtained was 83%.

Table I: Precision, recall, F1-score, and support for each emotion class

Fig. 2: Flow chart

On our test data, we obtained an overall accuracy of 83%; however, this might be further enhanced by utilising additional augmentation techniques and a variety of contemporary feature extraction approaches.

V. CONCLUSION
In conclusion, speech analysis for emotion recognition has shown great potential in recent years. With the use of machine learning algorithms, speech characteristics such as tone, volume, and intensity can be analyzed to describe human emotions accurately. The technology has many applications in a variety of industries, including psychological diagnostics, customer service, and human-computer interaction. However, ethical issues such as privacy concerns and the possibility of abuse must be addressed to ensure fair and transparent use of the technology. Future research in this area should focus on developing new features and models for emotional analysis, integrating other emotional data sources, and ensuring fair use of these methods. Speech analysis for emotion recognition is a complex process involving many machine-learning algorithms and techniques. Although it has shown potential for many applications, more research is needed to improve its accuracy and overcome its limitations. In addition, ethical issues such as data privacy and transparency must be addressed to ensure that these technologies are developed and used responsibly and ethically. Using speech analysis to understand human emotions is a rapidly growing area of research that has the potential to change the way we interact with technology and with each other. By recognizing emotional states in conversation, communication can be improved, psychological problems diagnosed, and customer service enhanced. However, to realize these results, more research is needed to improve the accuracy and reliability of emotion recognition through speech and to resolve ethical issues such as data privacy and integrity.
In general, the development and use of emotion recognition through speech analysis should be guided by ethical and social considerations. All things considered, speech analysis as a means of assessing emotional state is a fascinating field of study with great potential to influence numerous companies and sectors in the years to come. New features and models for emotion analysis are still being investigated to increase the efficacy and accuracy of speech-analysis-based emotion recognition. For example, researchers are exploring the use of deep learning techniques such as convolutional and recurrent neural networks to better capture the temporal dynamics of speech. Another important aspect of emotion recognition is the integration of additional information such as facial expressions, body language, and body movements, which can provide cues that complement the voice signal and increase the accuracy of the recognition process. However, it is important to remember that the use of the technology should be guided by ethics and values such as privacy and transparency.

REFERENCES
[1] M. S. Ram, A. Sreeram, M. Poongundran, P. Singh, Y. N. Prajapati, and S. Myrzahmetova, "Data fusion opportunities in IoT and its impact on decision-making process of organisations," in 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), 2022, pp. 459-464.
[2] S. Jain, "Deep learning's obstacles in medical image analysis: Boosting trust and explainability," Journal of Computer Science, vol. 3, no. 1, pp. 21-24, Jan. 2024.
[3] Y. N. Prajapati, U. Sesadri, T. Mahesh, S. Shreyanth, A. Oberoi, and K. P. Jayant, "Machine learning algorithms in big data analytics for social media data based sentimental analysis," International Journal of Intelligent Systems and Applications in Engineering, vol. 10, no. 2s, pp. 264-267, 2022.
[4] F. Khorrami, P. Vernant, F. Masson, F. Nilfouroushan, Z. Mousavi, N. R, R. Saadat, A. Walpersdorf, S. Hosseini, P. Tavakoli, A. Aghamohammadi, and M. Alijanzade, "An up-to-date crustal deformation map of Iran using integrated campaign-mode and permanent GPS velocities," Geophysical Journal International, vol. 217, Feb. 2019.
[5] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162-1181, 2006. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639306000422
[6] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, Dec. 2008. [Online]. Available: https://doi.org/10.1007/s10579-008-9076-6
[7] W. Mellouk and W. Handouzi, "Facial emotion recognition using deep learning: review and insights," Procedia Computer Science, vol. 175, pp. 689-694, 2020. The 17th International Conference on Mobile Systems and Pervasive Computing (MobiSPC), The 15th International Conference on Future Networks and Communications (FNC), The 10th International Conference on Sustainable Energy Information Technology. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050920318019
[8] Vikas, "Machine learning methods in software engineering - review," Journal of Computer Science, vol. 3, no. 1, pp. 48-51, Jan. 2024.
[9] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE - the Munich versatile and fast open-source audio feature extractor," Jan. 2010, pp. 1459-1462.
[10] Y. N. Prajapati and M. Sharma, "Novel machine learning algorithms for predicting COVID-19 clinical outcomes with gender analysis," in Advanced Computing, D. Garg, J. J. P. C. Rodrigues, S. K. Gupta, X. Cheng, P. Sarao, and G. S. Patel, Eds. Cham: Springer Nature Switzerland, 2024, pp. 296-310.
[11] Y. N. Prajapati and M. Sharma, "Designing AI to predict COVID-19 outcomes by gender," in 2023 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI), 2023, pp. 1-7.

