Speech Emotion Recognition Using Machine Learning
Project Members:
1. K. Mounika (2022-2322012)
2. B. Shiny Grace (2022-2322029)
3. K. Priyanka (2022-2322038)
4. K. Thanushka (2022-2322060)

Project Guide:
Mrs. P. Ratna Pavani, Head of the Department of Computer Applications
Contents
• Introduction
• Abstract
• Existing System
• Proposed System
• Random Forest Algorithm
• Conclusion
ABSTRACT
• Emotions play a critical role in communication, influencing how messages are perceived and
interpreted. Speech Emotion Recognition (SER) goes beyond recognizing what is said: it analyzes
acoustic features such as pitch, tone, intensity, and rhythm to infer the speaker's underlying
emotional state, such as happiness, sadness, anger, fear, or neutrality.
Existing System
• Speech Emotion Recognition (SER) systems are designed to detect emotions such as happiness,
sadness, anger, or fear from a person's voice. Over the years, these systems have advanced
significantly, leveraging machine learning (ML) and deep learning (DL) techniques along with
diverse datasets and feature extraction methods to analyze speech signals and classify
emotions. Here is a breakdown of existing systems:
Traditional Machine Learning-Based Systems :
• Feature Extraction: Traditional SER systems rely on handcrafted acoustic features (see the
extraction sketch after this list) such as:
• Mel-frequency cepstral coefficients (MFCCs): Capture the spectral characteristics of speech.
• Pitch (Fundamental Frequency): Indicates vocal cord vibrations, useful for detecting
emotions like anger or excitement.
• Energy/Intensity: Reflects the loudness or intensity of speech.
• Spectral Features: Such as spectral centroid, bandwidth, and roll-off.
• Temporal Features: Including speech rate and pauses.
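A minimal feature-extraction sketch using the librosa library; the file path "speech.wav", the 16 kHz sampling rate, and all parameter values are illustrative assumptions, not part of any specific existing system:

import numpy as np
import librosa

# Load a speech clip and resample to 16 kHz (path is a placeholder).
y, sr = librosa.load("speech.wav", sr=16000)

# MFCCs: spectral characteristics (13 coefficients is a common choice).
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Pitch (fundamental frequency) via the YIN estimator.
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)

# Energy/intensity: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)

# Spectral features: centroid, bandwidth, and roll-off.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)

# One fixed-length vector per clip: mean of each feature over time.
features = np.concatenate([
    mfccs.mean(axis=1),
    [f0.mean(), rms.mean(), centroid.mean(), bandwidth.mean(), rolloff.mean()],
])
print(features.shape)  # (18,) = 13 MFCC means + 5 scalar summaries

Temporal features such as speech rate and pauses require segment-level analysis and are omitted here for brevity.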
Classification Algorithms (a comparison sketch follows this list):
• Support Vector Machines (SVM): Widely used for emotion classification due to its
effectiveness in handling high-dimensional data.
• Random Forests: Utilized for ensemble learning and feature importance analysis.
• k-Nearest Neighbors (k-NN): Simple yet effective for small datasets.
• Gaussian Mixture Models (GMMs): Used for modeling the distribution of acoustic features.
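A sketch comparing three of these classifiers on a pre-extracted feature matrix with scikit-learn; the data here is random placeholder input, whereas X and y would normally come from the feature-extraction step above:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: 200 clips x 18 features, 4 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))
y = rng.integers(0, 4, size=200)

models = {
    "SVM (RBF)": SVC(kernel="rbf", C=1.0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    # Feature scaling matters for SVM and k-NN; it is harmless for Random Forest.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: accuracy {scores.mean():.2f} +/- {scores.std():.2f}")

GMMs are omitted here because they are typically fit per emotion class rather than used as a single classifier object.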
• Datasets (a label-parsing sketch for RAVDESS follows this list):
• RAVDESS: Ryerson Audio-Visual Database of Emotional Speech and Song; 24 actors expressing 8
emotions (calm, happy, sad, angry, etc.).
• CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset; 7,442 clips from 91 actors
covering 6 emotions.
• TESS: Toronto Emotional Speech Set; two female speakers (aged 26 and 64) expressing 7 emotions.
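In RAVDESS, the emotion label is encoded in the filename itself (seven dash-separated fields, of which the third is the emotion code). A small parsing sketch, assuming that published naming convention:

from pathlib import Path

# RAVDESS emotion codes, per the dataset's documented naming convention.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path: str) -> str:
    """Return the emotion name encoded in a RAVDESS filename."""
    code = Path(path).stem.split("-")[2]  # third field is the emotion code
    return EMOTIONS[code]

print(ravdess_label("03-01-06-01-02-01-12.wav"))  # fearful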
Proposed System
• The proposed Speech Emotion Recognition (SER) system aims to address the limitations of existing
systems while improving accuracy, efficiency, and robustness. This section outlines its
architecture, workflow, and key innovations.
• The proposed system leverages advanced deep learning techniques, multimodal data fusion, and real-
time processing capabilities to accurately detect emotions from speech.
• It is designed to handle real-world challenges such as noise, variability in speech, and limited labeled
data.
Key Components of the Proposed System:
A) Data Preprocessing
• Input: Raw speech signals (audio files or real-time audio streams).
• Steps:
• Noise Reduction: Use noise-removal techniques (e.g., spectral gating) to clean the
audio.
• Normalization: Normalize audio signals to ensure consistent volume levels.
• Feature Extraction: Extract meaningful information from the cleaned signal, e.g.,
Mel-spectrograms or MFCCs, as input features for the deep learning models (a preprocessing
sketch follows this list).
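A minimal sketch of this preprocessing pipeline; using the noisereduce package for spectral-gating noise reduction is our assumption here, as are the sampling rate and Mel parameters:

import numpy as np
import librosa
import noisereduce as nr  # spectral-gating noise reduction (pip install noisereduce)

def preprocess(path: str, sr: int = 16000) -> np.ndarray:
    """Raw audio file -> log-Mel-spectrogram ready for a deep model."""
    y, _ = librosa.load(path, sr=sr)

    # Noise reduction: spectral gating estimates and subtracts the noise floor.
    y = nr.reduce_noise(y=y, sr=sr)

    # Normalization: peak-normalize for consistent volume levels.
    y = y / (np.max(np.abs(y)) + 1e-8)

    # Feature extraction: Mel-spectrogram in decibels as the model input.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(mel, ref=np.max)

log_mel = preprocess("clip.wav")  # placeholder path
print(log_mel.shape)              # (64, num_frames)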
B) Deep Learning Model Architecture:
• The proposed system uses a hybrid deep learning model combining the strengths of
Convolutional Neural Networks (CNNs) and Transformers.
• CNN Module:
• Processes Mel-spectrograms to capture spatial patterns in speech (e.g., frequency and tone
variations).
• Transformer Module:
• Captures long-range dependencies and temporal patterns in speech (e.g., how emotions evolve
over time).
• Fusion Layer:
• Combines features from the CNN and Transformer modules for the final emotion classification
(a minimal model sketch follows).
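A minimal PyTorch sketch of this hybrid idea; the layer sizes, the 8 output classes, and the choice to feed CNN features into the Transformer (rather than running the two branches in parallel) are illustrative assumptions:

import torch
import torch.nn as nn

class CNNTransformerSER(nn.Module):
    """CNN front-end over Mel-spectrograms + Transformer over time frames."""

    def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 8):
        super().__init__()
        # CNN module: captures local time-frequency patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # halve the Mel axis, keep the time axis
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Transformer module: captures long-range temporal dependencies.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fusion/classification head over time-pooled features.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, time)
        h = self.cnn(x)                       # (batch, 64, n_mels//4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time, 64 * n_mels//4)
        h = self.transformer(self.proj(h))    # (batch, time, d_model)
        return self.head(h.mean(dim=1))       # mean-pool over time -> logits

model = CNNTransformerSER()
logits = model(torch.randn(2, 1, 64, 100))  # 2 clips, 64 Mel bins, 100 frames
print(logits.shape)                          # torch.Size([2, 8])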
Random Forest Algorithm
Random Forest is a powerful ensemble learning algorithm that improves classification accuracy by combining
multiple decision trees. It is widely used in Speech Emotion Recognition (SER) to classify emotions based on
extracted speech features.
In this project, machine learning is used to recognize emotions from speech audio and to gain
insights into how humans express emotions through voice. This technology has many practical
applications, such as analyzing customer emotions in call centers, improving voice-based virtual
assistants and chatbots, and even assisting in linguistic research.
One exciting use case is detecting fake emotions in phone calls, which can help improve security and
fraud detection. However, a major challenge in building accurate models is overfitting, which can
occur when the model learns from too many noisy or redundant features. To address this, we enhance
accuracy with preprocessing steps such as data cleaning and dimensionality reduction, ensuring the
system focuses only on the most informative speech features (a pipeline sketch follows).
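A sketch of such a pipeline in scikit-learn, chaining normalization, PCA-based dimensionality reduction, and a Random Forest; the placeholder data and every hyperparameter value are illustrative assumptions:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder feature matrix: 500 clips x 40 acoustic features, 4 emotions.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 40))
y = rng.integers(0, 4, size=500)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # normalize features
    ("pca", PCA(n_components=0.95)),  # keep components explaining 95% of variance
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))

Reducing dimensionality before the forest mitigates the redundant-feature problem described above, at the cost of losing Random Forest's per-feature importance analysis in the original feature space.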