
UNIVERSITY OF SCIENCE AND TECHNOLOGY OF HANOI

DEPARTMENT OF INFORMATION AND COMMUNICATION TECHNOLOGY

MACHINE LEARNING AND DATA MINING II

Facial Emotion Recognition

Group Members

Tran Bao Nguyen 22BI13342

Nguyen Thu Trang 22BI13426

Le Quang Minh 22BI13286

May, 2025
Table of Contents

Abstract

1 Introduction
  1.1 Context and Motivation
  1.2 Expected Outcomes

2 Materials and Methods
  2.1 Data collection and pre-processing
  2.2 Model Design and Implementation
  2.3 Model Evaluation
    2.3.1 Accuracy
    2.3.2 Precision and Recall
    2.3.3 Top K Accuracy
    2.3.4 Live Detection

3 Results and Discussion
  3.1 Results
  3.2 Challenges Encountered
  3.3 Model Limitations

4 Conclusion and Future work
  4.1 Conclusion
  4.2 Future work

References

Abstract

Facial Expression Recognition (FER) plays a significant role in enhancing human-computer
interaction, mental health monitoring, and emotion-aware systems. Recognizing facial
emotions accurately remains a challenging task due to variations in expressions, lighting
conditions, and occlusions. Traditional approaches often struggle with real-time per-
formance and generalization across diverse datasets. In this study, we propose a deep
learning-based approach utilizing a Vision Transformer (ViT) model for multi-class facial
emotion classification. The workflow involves comprehensive data preprocessing, model
training with fine-tuned hyperparameters, and performance evaluation using metrics such
as Accuracy, Precision, Recall, and Top-K Accuracy. Our implementation leverages the
power of transfer learning and transformer-based architectures to achieve robust perfor-
mance across multiple emotion categories. This project contributes toward the devel-
opment of intelligent systems that can interpret and respond to human emotions more
effectively, with potential applications in healthcare, education, and human-centered AI
systems.

Chapter 1

Introduction

1.1 Context and Motivation

Facial expressions are among the most powerful and universal forms of non-verbal com-
munication. They play a vital role in human interaction, conveying a range of emotions,
such as happiness, sadness, anger, surprise, fear, and disgust, without the use of spoken
language. With the advancement of artificial intelligence and computer vision, auto-
matic recognition and interpretation of facial cues has become an increasingly important
research area known as Facial Emotion Recognition (FER).

FER systems aim to detect human faces from images or video input and classify the
facial expressions into discrete emotion categories. These systems have a wide array
of real-world applications, including mental health monitoring, driver alertness systems,
adaptive e-learning platforms, entertainment, and human-computer interaction.

Around the world, researchers and developers have explored both traditional and modern
deep learning techniques to tackle FER:

• Traditional approaches often rely on handcrafted features such as Local Binary Pat-
terns (LBP), Gabor filters, or geometric landmarks, followed by machine learning
models such as SVMs or decision trees. However, these methods typically strug-
gle in complex, real-world settings due to their sensitivity to lighting conditions,
occlusions, and facial variability.

• Modern approaches leverage deep learning models, particularly Convolutional Neural
Networks (CNNs), which automatically extract features from raw images. Recent
advances include the use of Residual Networks (ResNet), attention-based models,
and Vision Transformers (ViT), which offer superior generalization and accuracy
on challenging datasets.

To overcome the limitations of traditional methods, our project utilizes a deep learning-
based FER pipeline powered by a Vision Transformer (ViT). The model is trained on
a labeled emotion dataset to classify facial expressions from still images and is then
deployed in a real-time application using webcam input.

1.2 Expected Outcomes

The input to our system is a live video feed (webcam) capturing real-time facial expres-
sions. The output is a bounding box drawn around each detected face along with a
predicted emotion label such as “happy”, “sad”, “angry”, etc., updated frame-by-frame.
This allows for real-time facial emotion recognition and visualization.

Chapter 2

Materials and Methods

Our project follows this pipeline: collecting the facial expression data, analyzing and
pre-processing it, building and implementing a ViT-based classification model, and then
evaluating the performance of the model on several metrics.

Figure 2.1: Workflow diagram of Facial Emotion Recognition

2.1 Data collection and pre-processing

In this research, we collect data from the Facial Emotion Recognition (FER) dataset on Kaggle.

Data collection

The Facial Emotion Recognition dataset contains 35,887 grayscale images of human faces,
each labeled with one of seven basic emotion categories: Angry, Disgust, Fear, Happy,
Sad, Surprise, and Neutral. Each image is of size 48×48 pixels, capturing a wide range
of facial expressions in varied lighting conditions, poses, and backgrounds. This diversity
makes the dataset suitable for training robust and real-world applicable facial emotion
recognition models. The dataset is pre-split into a training set of 28,821 images and a
test set of 7,066 images, enabling effective model development and unbiased evaluation.

Figure 2.2: Class Distribution of Image per Label of Facial Emotion Recognition train
set

Figure 2.3: Class Distribution of Image per Label of Facial Emotion Recognition test set

The class distribution in the FER training set exhibits significant variation in the num-
ber of images across different emotion categories. While some classes such as Happy
contain a large number of samples (7164), others like Disgust are severely underrepre-
sented (436 images). This imbalance can introduce bias during model training, leading
to poor generalization and misclassification of minority emotions. Such skewed distribu-
tions may result in decision boundaries that overly favor majority classes. To mitigate
this issue, we apply oversampling techniques, which increase the number of instances in
underrepresented classes by replicating existing samples. This helps balance the dataset
and improves the model’s ability to learn from all emotion categories more effectively.

Data pre-processing

Facial Emotion Recognition dataset

The Facial Expression Recognition (FER) dataset from Kaggle is primarily used in this
research for training and testing. The original training set is split into two subsets: 80%
for training and 20% for validation. During preprocessing, each image is assigned a nu-
merical label based on its corresponding emotion class. This label encoding enables the
model to interpret categorical emotions as numeric targets during training. Each image
is resized and normalized to a standardized format to ensure consistent input dimensions
for the model.
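
As a rough illustration of these steps, the sketch below performs the 80/20 split, the label
encoding, and the resize-and-normalize transform. The sample list, the label ordering, and
the normalization statistics are assumptions for illustration rather than the project’s exact code.

from sklearn.model_selection import train_test_split
from torchvision import transforms

# Assumed emotion-to-integer encoding; the ordering is illustrative but must be used consistently.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
LABEL2ID = {name: i for i, name in enumerate(EMOTIONS)}

# Placeholder sample list; in practice it is built from the Kaggle archive's folder structure.
all_train_samples = [("img/angry_0001.png", "angry"), ("img/happy_0001.png", "happy")]
encoded = [(path, LABEL2ID[name]) for path, name in all_train_samples]  # label encoding

# 80/20 split of the original training set into training and validation subsets.
train_samples, val_samples = train_test_split(encoded, test_size=0.2, random_state=42)

# Resize and normalize every image to the fixed input format expected by the ViT backbone.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # FER images are grayscale; the backbone expects 3 channels
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats (ViT processor defaults)
])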

Due to significant class imbalance in the dataset where emotions like Happy are overrep-
resented and Disgust severely underrepresented, oversampling is applied to the minority
classes in the training set. This is done by replicating images in the underrepresented
classes until each class contains an approximately equal number of samples. This tech-
nique reduces bias toward majority classes and improves the model’s ability to generalize
across all emotion categories.

After oversampling, the dataset is restructured into a standardized format suitable for
deep learning workflows. Emotion labels are converted into consistent integer identifiers,
and the dataset is organized to ensure type consistency and efficient access during train-
ing. Data is subsequently loaded in mini-batches of 32 and shuffled at the beginning of
each epoch to promote training stability and improve generalization performance.
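
Continuing the sketch above, the following illustrates random oversampling by replication
and mini-batch loading. The FaceDataset wrapper is an assumed helper, not a class taken
from the project’s code.

import random
from collections import defaultdict
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class FaceDataset(Dataset):
    """Assumed wrapper: opens an image file and applies the `preprocess` transform defined earlier."""
    def __init__(self, samples, transform):
        self.samples, self.transform = samples, transform
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        path, label = self.samples[idx]
        return self.transform(Image.open(path)), label

def oversample(samples):
    """Replicate samples of minority classes until every class matches the largest class."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(random.choices(items, k=target - len(items)))  # replication with replacement
    random.shuffle(balanced)
    return balanced

balanced_train = oversample(train_samples)

# Mini-batches of 32; shuffle=True reshuffles the training data at the start of every epoch.
train_loader = DataLoader(FaceDataset(balanced_train, preprocess), batch_size=32, shuffle=True)
val_loader = DataLoader(FaceDataset(val_samples, preprocess), batch_size=32, shuffle=False)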

Figure 2.4: Class Distribution of Facial Emotion Recognition train set after augmentation

2.2 Model Design and Implementation

Figure 2.5: ViT model structure

We have developed a facial emotion classification model based on the Vision Transformer
(ViT) architecture, specifically using the pretrained version vit-base-patch16-224-in21k
developed by Google. The model is designed to extract hierarchical features from input
images and perform classification across seven basic emotions: sad, disgust, angry, neu-
tral, fear, surprise, and happy.
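
A minimal sketch of how such a classifier can be instantiated with the Hugging Face
transformers library is shown below; the label ordering is an assumption and must match
the encoding used during preprocessing, and the training loop itself is omitted.

from transformers import ViTForImageClassification, ViTImageProcessor

# Assumed label ordering (must match the integer encoding used in preprocessing).
id2label = {0: "angry", 1: "disgust", 2: "fear", 3: "happy", 4: "neutral", 5: "sad", 6: "surprise"}
label2id = {name: idx for idx, name in id2label.items()}

# Pretrained backbone plus a freshly initialized 7-class classification head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=7,
    id2label=id2label,
    label2id=label2id,
)
# The matching processor resizes inputs to 224x224 and normalizes them as the backbone expects.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")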

The Vision Transformer (ViT) model processes each input image by resizing it to 224×224
pixels and dividing it into non-overlapping patches of size 16×16 pixels. This results in
a total of 196 patches per image. Each patch is flattened into a vector and linearly
projected into a fixed-dimensional embedding. A learnable positional embedding is then
added to each patch embedding to preserve spatial structure across the image. These
196 positional patch embeddings are passed into a sequence of Transformer Encoder
layers. Each encoder layer consists of two main components: a Multi-Head Self-Attention
mechanism and a feed-forward neural network (MLP), both with layer normalization and
residual (skip) connections. Within the attention mechanism, each patch embedding is
transformed into a Query (Q), Key (K), and Value (V) vector. Attention scores are
computed for every pair of patches using the formula:
\[
\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
\]

These scores determine how strongly one patch should attend to others, allowing the
model to capture global contextual relationships across the image. Multi-head attention
enables the model to learn from different perspectives simultaneously. The outputs of
all attention heads are concatenated and passed through the MLP to extract deeper and
more complex features. At the final stage, a special classification token is used as the
aggregate representation of the image, which is passed through a separate MLP head to
produce the predicted emotion class.
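
To make the attention computation concrete, the short sketch below implements the scaled
dot-product attention formula for a batch of patch embeddings. It is a hand-written
illustration; the actual model uses the optimized multi-head implementation inside the
pretrained ViT.

import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, num_patches, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise patch-to-patch scores
    weights = torch.softmax(scores, dim=-1)             # how strongly each patch attends to the others
    return weights @ v                                   # weighted sum of value vectors

# Toy example: one image, 196 patches (14x14 grid of 16x16 patches from a 224x224 image),
# 64-dimensional embeddings per attention head.
x = torch.randn(1, 196, 64)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q, K, V all come from the same patches
print(out.shape)  # torch.Size([1, 196, 64])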

2.3 Model Evaluation

We evaluate the model’s effectiveness based on four metrics: Accuracy, Precision,
Recall, and Top-K Accuracy.

2.3.1 Accuracy

Accuracy measures the overall correctness of the model when predicting facial expressions
across all emotion classes.

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]

In this context, True Positive (TP) represents the number of correctly predicted samples
for a specific emotion class, while True Negative (TN) refers to the number of correctly
identified samples that do not belong to that class. False Positive (FP) occurs when the
model incorrectly predicts a face as belonging to a certain emotion, and False Negative
(FN) happens when the model fails to detect an emotion it should have.

However, Accuracy can sometimes be misleading in the presence of imbalanced datasets,
where some emotions (like Happy) are overrepresented, and others (like Disgust) are
underrepresented.

2.3.2 Precision and Recall

To account for class imbalances, we will compute these two metrics:

Precision is the ratio of correctly predicted samples for a given emotion class among all
samples predicted to belong to that class.

\[
\text{Precision} = \frac{TP}{TP + FP}
\]
Recall is the proportion of correctly predicted samples for a given emotion among all
actual samples of that emotion.

\[
\text{Recall} = \frac{TP}{TP + FN}
\]

2.3.3 Top K Accuracy

Top-K Accuracy extends the traditional accuracy metric by evaluating whether the cor-
rect emotion label is among the model’s top K predictions. For example, Top-1 Accuracy
corresponds to standard accuracy, while Top-3 Accuracy considers a prediction correct if
the true label is within the top three predicted classes.

This metric is especially valuable in facial expression recognition, where subtle differences
between emotions can lead to prediction ambiguity. Top-K Accuracy provides a more
forgiving and realistic assessment of model performance in such scenarios.
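
All four metrics can be computed with scikit-learn, as in the sketch below. The placeholder
arrays stand in for the labels and class probabilities collected from the model on the test
set, and the weighted-averaging choice for precision and recall is an assumption.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, top_k_accuracy_score

# Placeholder outputs; in practice these come from running the model on the test set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, size=200)          # true emotion labels
y_prob = rng.random((200, 7))
y_prob /= y_prob.sum(axis=1, keepdims=True)    # per-class probabilities
y_pred = y_prob.argmax(axis=1)                 # predicted labels

accuracy  = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="weighted")
recall    = recall_score(y_true, y_pred, average="weighted")
top3      = top_k_accuracy_score(y_true, y_prob, k=3, labels=np.arange(7))

print(f"Accuracy {accuracy:.4f}  Precision {precision:.4f}  Recall {recall:.4f}  Top-3 {top3:.4f}")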

2.3.4 Live Detection

Figure 2.6: System workflow diagram

Figure 2.6 illustrates the system workflow, which proceeds through the following stages; a minimal code sketch of the full pipeline is given after the list.

1. Model Setup
The system begins by loading a fine-tuned Vision Transformer (ViT) model specif-
ically trained for facial emotion classification. This model is capable of identifying
various emotional states such as happy, sad, angry, surprise, and more. To ensure
consistency and compatibility with the model’s expected input, an image proces-
sor is applied. This processor includes a transformation pipeline that standardizes
the input images by resizing, converting to tensors, and normalizing based on the
model’s training configuration.

2. Face Detector
The system employs OpenCV’s Haar Cascade classifier to identify faces in real-
time video frames. This lightweight and efficient face detection technique scans
each frame from the webcam and locates facial regions. Once detected, each face
is cropped from the original frame and prepared for emotion analysis in the next
stage.

3. Emotion Prediction
The cropped face images undergo preprocessing steps such as resizing, normaliza-
tion, and format conversion to align with the ViT model’s input requirements.
These processed images are then passed through the model, which predicts the
most probable emotion category.

4. Live Display and Feedback
For user interaction and real-time feedback, the system visually annotates the live
webcam feed. Bounding boxes are drawn around each detected face, and the pre-
dicted emotion label is displayed just above the box.

5. Data logging
To support further analysis and potential improvements, the system logs each detec-
tion event. It saves the coordinates of detected faces along with their corresponding
emotion predictions into a CSV file. Additionally, cropped face images are stored
locally, allowing users to examine individual samples or augment datasets for re-
training or evaluation purposes.
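
The sketch below ties these five stages together with OpenCV and the Hugging Face ViT
classes. The model directory, the CSV filename, and the detector parameters are placeholders,
and saving of the cropped face images is omitted for brevity.

import csv
import cv2
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

MODEL_DIR = "./fer-vit-finetuned"   # assumed path to the fine-tuned model
model = ViTForImageClassification.from_pretrained(MODEL_DIR).eval()
processor = ViTImageProcessor.from_pretrained(MODEL_DIR)

# Haar cascade face detector shipped with OpenCV.
face_detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)
log_file = open("detections.csv", "w", newline="")
log = csv.writer(log_file)
log.writerow(["x", "y", "w", "h", "emotion"])

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Crop the face, convert BGR -> RGB, and run it through the ViT classifier.
        face = Image.fromarray(cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB))
        inputs = processor(images=face, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        emotion = model.config.id2label[int(logits.argmax(-1))]
        # Annotate the live frame and log the detection.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, emotion, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        log.writerow([x, y, w, h, emotion])
    cv2.imshow("Facial Emotion Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
log_file.close()
cv2.destroyAllWindows()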

Chapter 3

Results and Discussion

This section presents the performance of our model on the test set from the Facial Ex-
pression Recognition (FER) dataset. The evaluation was conducted using multiple per-
formance metrics, including accuracy, precision, recall, and Top-K accuracy. These metrics
help assess the effectiveness of the model in handling both common and rare emotion
classes.

3.1 Results

The facial emotion recognition model demonstrates strong and consistent performance
across all major evaluation metrics. It achieved an overall accuracy of 87.93%, with
a precision of 87.91% and a recall of 87.93%. These metrics suggest
that the model is not only accurate but also reliable in maintaining a balance between
detecting true positives and minimizing false predictions.

Figure 3.1: Top K Accuracy for test set

Further insight is provided by the Top-K Accuracy analysis, which shows that the model’s
performance significantly improves when more prediction options are considered. While
the Top-1 accuracy is already high at around 88%, it climbs to 95% for Top-2, and
continues to increase, approaching 99% by Top-5. This indicates that the model can
still identify the correct emotion within its top predictions, which is especially useful in
applications where multi-label or ranked predictions are acceptable.

Figure 3.2: Confusion Matrix for test set

The confusion matrix offers a deeper look at class-wise performance. The model excels
at detecting emotions such as “disgust” and “sad”, which exhibit very few misclassifications.
However, some confusion remains among similar expressions; for instance, “angry” is
occasionally predicted as “happy” or “neutral”, and “happy” sometimes overlaps with
“surprise” or “fear”. These confusions are likely due to overlapping facial features between
these emotions, suggesting potential for improvement via better feature separation
or emotion-specific data augmentation.

To further evaluate our ViT model’s performance, we also implemented a pretrained
ResNet-18 on this dataset and then compared the results of the two models.

Model Accuracy Precision Recall
ResNet-18 71% 71% 71%
ViT 87.93% 87.91% 87.93%

Table 3.1: Comparison between our ViT model and the ResNet-18 baseline

ViT performs significantly better than the traditional CNN-based model (ResNet-18) because
Convolutional Neural Networks (CNNs) typically use small kernels to extract local features
from an image. While this is effective for capturing fine-grained patterns, it limits the
model’s ability to understand long-range dependencies across spatial regions. In contrast,
Vision Transformers (ViT) divide the image into patches and process all patches simul-
taneously using self-attention, allowing the model to learn global spatial relationships,
such as those between the eyes and mouth. In the context of facial emotion recognition
(FER), emotional expressions are often represented by a combination of changes across
multiple facial regions rather than localized features alone. Therefore, ViT offers a sig-
nificant advantage by modeling these broader spatial interactions, leading to improved
performance in recognizing complex emotions.
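
For reference, the ResNet-18 baseline can be adapted to the seven emotion classes roughly
as in the sketch below with torchvision; this is an illustrative setup, not necessarily the
exact configuration behind the numbers in Table 3.1.

import torch.nn as nn
from torchvision import models

# Pretrained ResNet-18 backbone with its final layer replaced by a 7-class head.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, 7)
# The baseline is then fine-tuned on the same oversampled training data and evaluated
# with the same metrics as the ViT model, so the two results are directly comparable.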

3.2 Challenges Encountered

During the development of the facial expression recognition model, we encountered sev-
eral challenges that negatively impacted performance. One of the major issues was class
imbalance, where the number of images across different emotion categories was highly un-
even. For example, the “Happy” class had significantly more images compared to classes
like “Disgust” or “Surprise” (as shown in Figure 2.2). This imbalance led the model to favor
majority classes during prediction, resulting in lower accuracy for minority classes. For
instance, the model was more likely to misclassify subtle emotions like “Disgust” as “Neutral”
or “Angry” due to the lack of sufficient training samples for the underrepresented
classes.

3.3 Model Limitations

Despite the strong performance of our model, there are still several limitations to consider.
First, the model’s accuracy is highly dependent on the diversity and representativeness
of the dataset. In real-world scenarios, variations such as lighting conditions, occlusions,
and head poses can negatively impact performance, especially if these variations are
underrepresented in the training data.

Chapter 4

Conclusion and Future work

4.1 Conclusion

In this project, we introduced a Vision Transformer (ViT)-based model for real-time facial
expression recognition using live webcam input. The system captures frames from a live
video stream, detects faces, and classifies emotions into seven categories: Angry, Disgust,
Fear, Happy, Neutral, Sad, and Surprise. Experimental results indicate that the ViT
model performs well in recognizing emotions across varying facial expressions, achieving
strong evaluation metrics, including a high accuracy of 87.93% and balanced precision and
recall scores.

4.2 Future work

Although the model achieved promising results, several areas can be improved in future
work. Firstly, we aim to expand the dataset to include more diverse facial expressions,
ethnicities, lighting conditions, and age groups to enhance the model’s generalization and
robustness in real-world scenarios.

Additionally, we want to improve the system’s performance under challenging conditions
such as occlusions, low resolution, or extreme facial angles.

Finally, we plan to develop a user-friendly application that integrates our FER model
with live webcam input, allowing real-time emotion detection with visual feedback. This
application would serve as a practical demonstration of the model’s potential in domains
such as education, healthcare, and human-computer interaction.

References

