Group Members
May, 2025
Abstract
Chapter 1
Introduction
Facial expressions are among the most powerful and universal forms of non-verbal communication. They play a vital role in human interaction, conveying a range of emotions, such as happiness, sadness, anger, surprise, fear, and disgust, without the use of spoken language. With the advancement of artificial intelligence and computer vision, automatic recognition and interpretation of facial cues has become an increasingly important research area known as Facial Emotion Recognition (FER).
FER systems aim to detect human faces from images or video input and classify the
facial expressions into discrete emotion categories. These systems have a wide array
of real-world applications, including mental health monitoring, driver alertness systems,
adaptive e-learning platforms, entertainment, and human-computer interaction.
Around the world, researchers and developers have explored both traditional and modern
deep learning techniques to tackle FER:
• Traditional approaches often rely on handcrafted features such as Local Binary Patterns (LBP), Gabor filters, or geometric landmarks, followed by machine learning models such as SVMs or decision trees. However, these methods typically struggle in complex, real-world settings due to their sensitivity to lighting conditions, occlusions, and facial variability.
• Modern deep learning approaches, by contrast, learn features directly from the data and have achieved considerably higher accuracy in challenging datasets.
To overcome the limitations of traditional methods, our project utilizes a deep learning-based FER pipeline powered by a Vision Transformer (ViT). The model is trained on a labeled emotion dataset to classify facial expressions from still images and is then deployed in a real-time application using webcam input.
The input to our system is a live video feed (webcam) capturing real-time facial expressions. The output is a bounding box drawn around each detected face along with a predicted emotion label such as "happy", "sad", "angry", etc., updated frame-by-frame. This allows for real-time facial emotion recognition and visualization.
Chapter 2
Our project follows this pipeline: collecting facial expression data, analyzing the data, pre-processing it, building and implementing a Vision Transformer (ViT) model, and then evaluating the model's performance on several metrics.
2.1 Data collection and pre-processing
Data collection
The Facial Emotion Recognition dataset contains 35,887 grayscale images of human faces, each labeled with one of seven basic emotion categories: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral. Each image is of size 48×48 pixels, capturing a wide range of facial expressions in varied lighting conditions, poses, and backgrounds. This diversity makes the dataset suitable for training robust and real-world applicable facial emotion recognition models. The dataset is pre-split into a training set of 28,821 images and a test set of 7,066 images, enabling effective model development and unbiased evaluation.
Figure 2.2: Class Distribution of Images per Label in the Facial Emotion Recognition train set
Figure 2.3: Class Distribution of Images per Label in the Facial Emotion Recognition test set
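As a concrete illustration, the per-class counts behind Figures 2.2 and 2.3 can be obtained with a few lines of torchvision code. This is a minimal sketch that assumes the Kaggle archive has been extracted into per-class data/train/ and data/test/ folders; the paths are assumptions about the local layout, not part of the dataset itself.

    from collections import Counter
    from torchvision import datasets

    train_ds = datasets.ImageFolder("data/train")
    test_ds = datasets.ImageFolder("data/test")
    print(f"train: {len(train_ds)} images, test: {len(test_ds)} images")

    # Images per emotion label in the training split
    counts = Counter(train_ds.targets)
    for idx, name in enumerate(train_ds.classes):
        print(f"{name:>9}: {counts[idx]}")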
The class distribution in the FER training set exhibits significant variation in the number of images across different emotion categories. While some classes such as Happy contain a large number of samples (7,164), others like Disgust are severely underrepresented (436 images). This imbalance can introduce bias during model training, leading to poor generalization and misclassification of minority emotions. Such skewed distributions may result in decision boundaries that overly favor majority classes. To mitigate this issue, we apply oversampling techniques, which increase the number of instances in underrepresented classes by replicating existing samples. This helps balance the dataset and improves the model's ability to learn from all emotion categories more effectively.
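A minimal sketch of this replication-based oversampling is shown below; it operates on a plain list of integer labels (such as the targets attribute from the previous sketch). The function name and interface are illustrative, not taken from our codebase.

    import random
    from collections import defaultdict

    def oversample_indices(targets, seed=42):
        """Return dataset indices in which every class appears equally often."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for idx, label in enumerate(targets):
            by_class[label].append(idx)

        max_count = max(len(idxs) for idxs in by_class.values())
        balanced = []
        for idxs in by_class.values():
            # keep every original sample, then pad with random duplicates
            extra = [rng.choice(idxs) for _ in range(max_count - len(idxs))]
            balanced.extend(idxs + extra)

        rng.shuffle(balanced)
        return balanced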
Data pre-processing
The Facial Expression Recognition (FER) dataset from Kaggle is primarily used in this research for training and testing. The original training set is split into two subsets: 80% for training and 20% for validation. During preprocessing, each image is assigned a numerical label based on its corresponding emotion class. This label encoding enables the model to interpret categorical emotions as numeric targets during training. Each image is resized and normalized to a standardized format to ensure consistent input dimensions for the model.
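One way to express the resize-and-normalize step is with torchvision transforms. The target size and normalization statistics below are assumptions based on the ViT processor described later, not values quoted from our configuration.

    from torchvision import transforms

    # Resize the 48x48 grayscale faces to the ViT input size and normalize.
    # A mean/std of 0.5 matches the default ViTImageProcessor configuration;
    # treat these values as assumptions if a different processor is used.
    preprocess = transforms.Compose([
        transforms.Grayscale(num_output_channels=3),   # ViT expects three channels
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])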
Due to significant class imbalance in the dataset, where emotions like Happy are overrepresented and Disgust is severely underrepresented, oversampling is applied to the minority classes in the training set. This is done by replicating images in the underrepresented classes until each class contains an approximately equal number of samples. This technique reduces bias toward majority classes and improves the model's ability to generalize across all emotion categories.
After oversampling, the dataset is restructured into a standardized format suitable for deep learning workflows. Emotion labels are converted into consistent integer identifiers, and the dataset is organized to ensure type consistency and efficient access during training. Data is subsequently loaded in mini-batches of 32 and shuffled at the beginning of each epoch to promote training stability and improve generalization performance.
Figure 2.4: Class Distribution of the Facial Emotion Recognition train set after oversampling
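Combining the previous sketches, the balanced training data could be wrapped in a PyTorch DataLoader as follows. This is again a sketch under the folder-layout, transform, and oversampling assumptions above, not a verbatim excerpt of our training script.

    from torch.utils.data import DataLoader, Subset
    from torchvision import datasets

    # Apply the resize/normalize transform from above and balance the classes
    # with the replicated index list produced by oversample_indices.
    train_ds = datasets.ImageFolder("data/train", transform=preprocess)
    balanced_train = Subset(train_ds, oversample_indices(train_ds.targets))

    # Mini-batches of 32, reshuffled at the start of every epoch.
    train_loader = DataLoader(balanced_train, batch_size=32, shuffle=True)

    images, labels = next(iter(train_loader))
    print(images.shape, labels.shape)   # torch.Size([32, 3, 224, 224]) torch.Size([32])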
2.2 Model Design and Implementation
We have developed a facial emotion classification model based on the Vision Transformer
(ViT) architecture, specifically using the pretrained version vit-base-patch16-224-in21k
developed by Google. The model is designed to extract hierarchical features from input
images and perform classification across seven basic emotions: sad, disgust, angry, neutral, fear, surprise, and happy.
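A minimal sketch of how such a model can be instantiated with the Hugging Face transformers library is shown below; the label ordering is illustrative, and the exact mapping in our trained checkpoint may differ.

    from transformers import ViTForImageClassification, ViTImageProcessor

    EMOTIONS = ["sad", "disgust", "angry", "neutral", "fear", "surprise", "happy"]

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    model = ViTForImageClassification.from_pretrained(
        "google/vit-base-patch16-224-in21k",
        num_labels=len(EMOTIONS),                              # seven emotion classes
        id2label={i: name for i, name in enumerate(EMOTIONS)},
        label2id={name: i for i, name in enumerate(EMOTIONS)},
    )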
The Vision Transformer (ViT) model processes each input image by resizing it to 224×224
pixels and dividing it into non-overlapping patches of size 16×16 pixels. This results in
a total of 196 patches per image. Each patch is flattened into a vector and linearly
projected into a fixed-dimensional embedding. A learnable positional embedding is then
added to each patch embedding to preserve spatial structure across the image. These
196 positional patch embeddings are passed into a sequence of Transformer Encoder
layers. Each encoder layer consists of two main components: a Multi-Head Self-Attention
mechanism and a feed-forward neural network (MLP), both with layer normalization and
residual (skip) connections. Within the attention mechanism, each patch embedding is
transformed into a Query (Q), Key (K), and Value (V) vector. Attention scores are
computed for every pair of patches using the formula:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
\]
These scores determine how strongly one patch should attend to others, allowing the
model to capture global contextual relationships across the image. Multi-head attention
enables the model to learn from different perspectives simultaneously. The outputs of
all attention heads are concatenated and passed through the MLP to extract deeper and
more complex features. At the final stage, a special classification token is used as the
aggregate representation of the image, which is passed through a separate MLP head to
produce the predicted emotion class.
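To make the formula concrete, here is a minimal, self-contained sketch of single-head scaled dot-product attention in PyTorch. It is for illustration only; the actual model uses the multi-head implementation inside the pretrained ViT.

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: tensors of shape (num_patches, d_k)."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise patch similarities
        weights = torch.softmax(scores, dim=-1)            # attention weights per patch
        return weights @ V                                 # weighted sum of value vectors

    # 196 patch embeddings with a toy dimension of 64
    x = torch.randn(196, 64)
    out = scaled_dot_product_attention(x, x, x)
    print(out.shape)  # torch.Size([196, 64])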
2.3 Evaluation Metrics
We evaluate the model's effectiveness based on four metrics: Accuracy, Precision, Recall, and Top-K Accuracy.
2.3.1 Accuracy
Accuracy measures the overall correctness of the model when predicting facial expressions
across all emotion classes.
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
In this context, True Positive (TP) represents the number of correctly predicted samples
for a specific emotion class, while True Negative (TN) refers to the number of correctly
identified samples that do not belong to that class. False Positive (FP) occurs when the
model incorrectly predicts a face as belonging to a certain emotion, and False Negative
(FN) happens when the model fails to detect an emotion it should have.
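Because the task has seven classes, these quantities are computed per class. The toy sketch below (a made-up 3×3 confusion matrix, not our results) shows how TP, FP, FN, and TN for one class are derived:

    import numpy as np

    # Toy 3-class confusion matrix: rows are true classes, columns are predictions
    cm = np.array([[50,  5,  2],
                   [ 4, 60,  6],
                   [ 3,  7, 40]])

    c = 0                               # class under evaluation, e.g. "angry"
    TP = cm[c, c]                       # class c predicted as c
    FP = cm[:, c].sum() - TP            # other classes predicted as c
    FN = cm[c, :].sum() - TP            # class c predicted as something else
    TN = cm.sum() - TP - FP - FN        # predictions not involving class c at all
    print(TP, FP, FN, TN)               # 50 7 7 113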
2.3.2 Precision
Precision is the ratio of correctly predicted samples for a given emotion class among all samples predicted to belong to that class.
\[
\mathrm{Precision} = \frac{TP}{TP + FP}
\]
2.3.3 Recall
Recall is the proportion of correctly predicted samples for a given emotion among all actual samples of that emotion.
\[
\mathrm{Recall} = \frac{TP}{TP + FN}
\]
2.3.4 Top-K Accuracy
Top-K Accuracy extends the traditional accuracy metric by evaluating whether the correct emotion label is among the model's top K predictions. For example, Top-1 Accuracy corresponds to standard accuracy, while Top-3 Accuracy considers a prediction correct if the true label is within the top three predicted classes.
This metric is especially valuable in facial expression recognition, where subtle differences
between emotions can lead to prediction ambiguity. Top-K Accuracy provides a more
forgiving and realistic assessment of model performance in such scenarios.
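These four metrics can be computed directly from the model's predictions, for example with scikit-learn. The sketch below uses random stand-in arrays for the test labels and the model's per-class scores, and the weighted averaging choice is an assumption rather than a documented detail of our evaluation.

    import numpy as np
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 top_k_accuracy_score)

    # Stand-ins for the real outputs: y_true are integer test labels (0..6) and
    # probs are per-class scores from the model (random here, for illustration).
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 7, size=100)
    probs = rng.random((100, 7))
    y_pred = probs.argmax(axis=1)

    accuracy = accuracy_score(y_true, y_pred)
    # Weighted averaging is one reasonable choice for imbalanced test classes.
    precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
    recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
    top3 = top_k_accuracy_score(y_true, probs, k=3, labels=list(range(7)))

    print(f"accuracy={accuracy:.4f}  precision={precision:.4f}  "
          f"recall={recall:.4f}  top-3={top3:.4f}")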
The system workflow diagram comprises the following stages; a code sketch of the complete loop follows the list.
1. Model Setup
The system begins by loading a fine-tuned Vision Transformer (ViT) model specifically trained for facial emotion classification. This model is capable of identifying various emotional states such as happy, sad, angry, surprise, and more. To ensure consistency and compatibility with the model's expected input, an image processor is applied. This processor includes a transformation pipeline that standardizes the input images by resizing, converting to tensors, and normalizing based on the model's training configuration.
2. Face Detector
The system employs OpenCV's Haar Cascade classifier to identify faces in real-time video frames. This lightweight and efficient face detection technique scans each frame from the webcam and locates facial regions. Once detected, each face is cropped from the original frame and prepared for emotion analysis in the next stage.
3. Emotion Prediction
The cropped face images undergo preprocessing steps such as resizing, normalization, and format conversion to align with the ViT model's input requirements. These processed images are then passed through the model, which predicts the most probable emotion category.
4. Visualization
Each detected face is highlighted with a bounding box, and the predicted emotion label is overlaid on the video frame, providing real-time visual feedback as described in the introduction.
5. Data Logging
To support further analysis and potential improvements, the system logs each detection event. It saves the coordinates of detected faces along with their corresponding emotion predictions into a CSV file. Additionally, cropped face images are stored locally, allowing users to examine individual samples or augment datasets for re-training or evaluation purposes.
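The sketch below ties these stages together, reusing the processor and model objects from the earlier model sketch. The webcam index, window name, file paths, and CSV layout are illustrative assumptions rather than excerpts from our application.

    import csv
    import os
    import time
    import cv2
    import torch
    from PIL import Image

    # Haar cascade face detector bundled with OpenCV
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    os.makedirs("faces", exist_ok=True)           # folder for cropped face images
    log_file = open("detections.csv", "w", newline="")
    log = csv.writer(log_file)
    log.writerow(["x", "y", "w", "h", "emotion"])

    cap = cv2.VideoCapture(0)                     # default webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

        for (x, y, w, h) in faces:
            crop = frame[y:y + h, x:x + w]
            face = Image.fromarray(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
            inputs = processor(images=face, return_tensors="pt")   # ViT processor from earlier
            with torch.no_grad():
                logits = model(**inputs).logits
            emotion = model.config.id2label[int(logits.argmax(-1))]

            # draw the bounding box and label, then log the detection
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, emotion, (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
            log.writerow([x, y, w, h, emotion])
            cv2.imwrite(f"faces/{int(time.time() * 1000)}_{emotion}.png", crop)

        cv2.imshow("FER", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):     # press q to quit
            break

    cap.release()
    cv2.destroyAllWindows()
    log_file.close()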
Chapter 3
This section presents the performance of our model on the test set from the Facial Expression Recognition (FER) dataset. The evaluation was conducted using multiple performance metrics, including accuracy, precision, recall, and Top-K accuracy. These metrics help assess the effectiveness of the model in handling both common and rare emotion classes.
3.1 Results
The facial emotion recognition model demonstrates strong and consistent performance across all major evaluation metrics. It achieved an overall accuracy of 87.93%, with precision and recall of 87.91% and 87.93%, respectively. These metrics suggest that the model is not only accurate but also reliable in maintaining a balance between detecting true positives and minimizing false predictions.
Figure 3.1: Top-K Accuracy on the test set
Further insight is provided by the Top-K Accuracy analysis, which shows that the model’s
performance significantly improves when more prediction options are considered. While
the Top-1 accuracy is already high at around 88%, it climbs to 95% for Top-2, and
continues to increase, approaching 99% by Top-5. This indicates that the model can
still identify the correct emotion within its top predictions, which is especially useful in
applications where multi-label or ranked predictions are acceptable.
Figure 3.2: Confusion Matrix for test set
The confusion matrix offers a deeper look at class-wise performance. The model excels at detecting emotions such as "disgust" and "sad", which exhibit very few misclassifications. However, some confusion remains among similar expressions; for instance, "angry" is occasionally predicted as "happy" or "neutral", and "happy" sometimes overlaps with "surprise" or "fear". These confusions are likely due to overlapping facial features between these emotions, suggesting potential for improvement via better feature separation or emotion-specific data augmentation.
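For reference, a confusion matrix like the one in Figure 3.2 can be produced with scikit-learn. The sketch reuses the y_true and y_pred arrays from the metrics example and the EMOTIONS list from the model sketch, and assumes the label indices follow that same ordering.

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    # y_true / y_pred as in the metrics sketch; EMOTIONS as in the model sketch
    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,
        labels=list(range(len(EMOTIONS))),
        display_labels=EMOTIONS,
    )
    plt.title("Confusion matrix on the FER test set")
    plt.tight_layout()
    plt.show()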
Model        Accuracy   Precision   Recall
ResNet-18    71%        71%         71%
ViT          87.93%     87.91%      87.93%
Table 3.1: Comparison of our ViT model with the ResNet-18 baseline
ViT performs significantly better than traditional CNN-based models such as ResNet-18 because Convolutional Neural Networks (CNNs) typically use small kernels to extract local features from an image. While this is effective for capturing fine-grained patterns, it limits the model's ability to understand long-range dependencies across spatial regions. In contrast, Vision Transformers divide the image into patches and process all patches simultaneously using self-attention, allowing the model to learn global spatial relationships, such as those between the eyes and mouth. In the context of facial emotion recognition (FER), emotional expressions are often represented by a combination of changes across multiple facial regions rather than localized features alone. Therefore, ViT offers a significant advantage by modeling these broader spatial interactions, leading to improved performance in recognizing complex emotions.
During the development of the facial expression recognition model, we encountered several challenges that negatively impacted performance. One of the major issues was class imbalance, where the number of images across different emotion categories was highly uneven. For example, the "Happy" class had significantly more images compared to classes like "Disgust" or "Surprise" (as shown in Figure X). This imbalance led the model to favor majority classes during prediction, resulting in lower accuracy for minority classes. For instance, the model was more likely to misclassify subtle emotions like "Disgust" as "Neutral" or "Angry" due to the lack of sufficient training samples for the underrepresented classes.
3.3 Model Limitations
Despite the strong performance of our model, there are still several limitations to consider. First, the model's accuracy is highly dependent on the diversity and representativeness of the dataset. In real-world scenarios, variations such as lighting conditions, occlusions, and head poses can negatively impact performance, especially if these variations are underrepresented in the training data.
Chapter 4
4.1 Conclusion
In this project, we introduced a Vision Transformer (ViT)-based model for real-time facial
expression recognition using live webcam input. The system captures frames from a live
video stream, detects faces, and classifies emotions into seven categories: Angry, Disgust,
Fear, Happy, Neutral, Sad, and Surprise. Experimental results indicate that the ViT
model performs well in recognizing emotions across varying facial expressions, achieving
strong evaluation results, including high accuracy (87.93%) and balanced precision and recall scores.
Although the model achieved promising results, several areas can be improved in future
work. Firstly, we aim to expand the dataset to include more diverse facial expressions,
ethnicities, lighting conditions, and age groups to enhance the model’s generalization and
robustness in real-world scenarios.
Finally, we plan to develop a user-friendly application that integrates our FER model
with live webcam input, allowing real-time emotion detection with visual feedback. This
application would serve as a practical demonstration of the model’s potential in domains
such as education, healthcare, and human-computer interaction.
References