
JADAVPUR UNIVERSITY

Smart Surveillance System


Ensuring Fairness in Exam Halls

Shounak Das

Supervisor: Prof. Sarmistha Neogy

Computer Science and Engineering


Jadavpur University
Kolkata, India

May 2025

Declaration

I, Shounak Das, bearing Roll No. 002110501109, hereby declare that the work in the project entitled Exam Hall Surveillance System: Smart System for Enhanced Security and Fairness has been done by me and I submit the same for evaluation.

Signature of Student

Certificate of Recommendation

This is to certify that the work in this project entitled Exam Hall Surveillance System: Smart System for Enhanced Security and Fairness has been satisfactorily completed by Shounak Das (Registration Number 158155 of 2021-2022, Class Roll No. 002110501109, Examination Roll No. CSE00258072). It is a bonafide piece of work carried out under my supervision and guidance at Jadavpur University, Kolkata-700032 for partial fulfilment of the requirements for the awarding of the Bachelor of Engineering in Computer Science and Engineering degree of the Department of Computer Science and Engineering, Jadavpur University, during the academic year 2024-2025.

Signature of the Supervisor


Contents

1 Introduction 2
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Outline of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Literature Survey 6
2.1 Human Activity Recognition (HAR) . . . . . . . . . . . . . . . . . . . 6
2.2 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Pose and Gesture Estimation . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Intelligent Surveillance Systems in Education . . . . . . . . . . . . . 8
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Methodology 10
3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 The TimeSformer Model . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 The Google MediaPipe . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 YOLO, You Only Look Once . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Video Acquisition and Dataset Preparation 23


4.1 Video Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Create a Dataset for TimeSFormer Training . . . . . . . . . . . . . . 24

5 Training the Model with TimeSformer 35


5.1 Load TimesFormer Model . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Define Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Check Dataset sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.4 Define the Transformations . . . . . . . . . . . . . . . . . . . . . . . . 38


5.5 MyTimeSformer Model Overview Initialize the Model . . . . . . . . 38
5.6 Train the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.7 Evaluate the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.8 Compute a Confusion Matrix Video Classification . . . . . . . . . . 41
5.9 Save the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6 Improvement by Capturing Labelled Data 45


6.1 Capturing Labelled Data . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Deep-learning Approach Using Mediapipe . . . . . . . . . . . . . . . 46

7 Discussion and Conclusion 49


List of Figures

3.1 The video self-attention blocks we investigate in this work. Each attention layer implements self-attention (Vaswani et al., 2017b) on a specified spatiotemporal neighborhood of frame-level patches (see Figure 3.2 for a visualization of the neighborhoods). We use residual connections to aggregate information from different attention layers within each block. A 1-hidden-layer MLP is applied at the end of each block. The final model is constructed by repeatedly stacking these blocks on top of each other. . . . . . . 13
3.2 Visualization of the five space-time self-attention schemes studied
in this work. Each video clip is viewed as a sequence of frame-
level patches with a size of 16 × 16 pixels. For illustration, we de-
note in blue the query patch and show in non-blue colors its self-
attention space-time neighborhood under each scheme. Patches
without color are not used for the self-attention computation of
the blue patch. Multiple colors within a scheme denote attentions
separately applied along different dimensions (e.g., space and
time for (T+S)) or over different neighborhoods (e.g., for (L+G)).
Note that self-attention is computed for every single patch in the
video clip, i.e., every patch serves as a query. We also note that
although the attention pattern is shown for only two adjacent
frames, it extends in the same fashion to all frames of the clip. . . 14
3.3 MediaPipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Face Landmark detected . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1 Data obtained from https://huggingface.co/datasets/ETHZurich/biwi_kinect_head_pose . . . . . . 25
4.2 CSE, Classroom at Jadavpur University . . . . . . . . . . . . . . . . . 26
4.3 YOLO finds the face with bounding box . . . . . . . . . . . . . . . . 27
4.4 YOLO finds the face with bounding box . . . . . . . . . . . . . . . . 27

4.5 We should not crop the YOLO box, but 80 % extended . . . . . . . 28

5.1 Epoch 11: Confusion Matrix Training Set . . . . . . . . . . . . . . . 43

6.1 Face moving towards left, Nose is the clear indicator of move-
ment, not the eye-ball . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Epoch 11: Confusion Matrix Training Set . . . . . . . . . . . . . . . 47
6.3 Epoch 11: Confusion Matrix Test Set, Loss: 0.2278, AUC-ROC:
0.9962, Accuracy: 0.9598 Validation Loss: 0.2236, Accuracy: 99.02% 47
6.4 Epoch 89: Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . 48
List of Tables

4.1 Frames are annotated into six labels . . . . . . . . . . . . . . . . . . 29


4.2 MediaPipe Face Mesh Key Landmark Indices . . . . . . . . . . . . . 30

Abstract

This project introduces the Exam Hall Surveillance System, a smart, AI-powered framework designed to enhance security and
ensure fairness during examinations. Leveraging recent advance-
ments in computer vision and deep learning, the system integrates
multiple state-of-the-art models to monitor student activity in real
time and detect anomalies indicative of unfair practices.
At the core of the system lies the TimeSformer model, a trans-
former-based architecture optimized for spatio-temporal video clas-
sification. It is complemented by Google MediaPipe, which enables
real-time pose and gesture estimation, and YOLO (You Only Look
Once), a fast and efficient object detection algorithm used to lo-
calize relevant entities within video frames. A custom dataset was
created to train and fine-tune the models for the specific environ-
ment and behavioral context of exam halls.
The training pipeline involves dataset preprocessing, model con-
figuration, evaluation using confusion matrices, and performance
optimization. Results demonstrate the system’s capability to ac-
curately classify activities and identify suspicious behavior under
varied conditions. This project highlights the potential of combin-
ing temporal modeling with real-time detection tools to build ro-
bust surveillance systems that support institutional integrity and
fairness in academic environments.
Chapter 1

Introduction

Examinations are a cornerstone of academic evaluation, designed to assess a student's knowledge, understanding, and application of concepts. As institutions
scale and exam halls become more crowded, ensuring fairness and maintaining
discipline during these assessments has grown increasingly challenging. While
traditional surveillance methods rely heavily on human invigilators and passive
CCTV monitoring, they often lack the efficiency, precision, and real-time respon-
siveness required to uphold academic integrity in large or distributed settings.

1.1 Overview

With recent advancements in computer vision, deep learning, and real-time video analytics, there lies a unique opportunity to augment traditional invigi-
lation methods using intelligent systems. This project, titled "Exam Hall Surveil-
lance System: Smart System for Enhanced Security and Fairness," aims to ex-
plore the fusion of multiple AI technologies to build an automated, context-
aware surveillance solution tailored specifically for academic environments.
Our system brings together three powerful components: TimeSformer, a
transformer-based model for capturing spatio-temporal patterns in video data;
Google MediaPipe, a lightweight tool for real-time pose and hand tracking; and
YOLO (You Only Look Once), an efficient object detection framework. By com-
bining the temporal understanding of human actions with spatial localization
and pose estimation, the system is capable of recognizing and classifying student
behavior, flagging suspicious activity, and assisting invigilators in maintaining
fairness during exams.
This work is not merely about replacing human invigilation but about en-
hancing it—offering intelligent assistance, reducing oversight fatigue, and min-
imizing subjective bias. In doing so, we also address broader themes of trust,
automation, and the responsible application of AI in educational spaces.

The following sections delve into the models and tools used, dataset prepa-
ration, training methodologies, and the system’s real-world performance, ulti-
mately evaluating its effectiveness in fostering a secure and fair examination
environment.

1.2 Background

Examinations have long served as the benchmark for assessing a student’s under-
standing, knowledge retention, and critical thinking abilities. In large academic
institutions, where hundreds or thousands of students are evaluated simultane-
ously, maintaining the integrity of these assessments becomes both logistically
and ethically critical. Traditionally, this has been managed through human in-
vigilators and passive CCTV setups, with the assumption that human presence
and retrospective video review are sufficient deterrents to malpractice.
However, as student populations grow and exam formats evolve (e.g., online
or hybrid), the traditional approach reveals several limitations. Human invig-
ilators are prone to fatigue, distraction, and oversight, especially in large or
high-pressure environments. CCTV footage, while helpful, often requires man-
ual review and lacks real-time responsiveness. Most importantly, these systems
operate passively—they monitor, but they do not interpret.
Advancements in computer vision, machine learning, and deep learning have
opened up new avenues for real-time, intelligent video analysis. These technolo-
gies allow machines to detect patterns, identify anomalies, and even classify
complex human behaviors with high accuracy. By integrating such technologies
into exam surveillance, institutions can move beyond reactive monitoring and
toward proactive, intelligent invigilation systems.

1.3 Motivation

The core motivation behind this project stems from the increasing need for scal-
able, intelligent, and context-aware surveillance solutions that can uphold aca-
demic integrity across large, distributed examination environments. The rise in
sophisticated cheating techniques—such as silent coordination, hidden devices,
and impersonation—demands a system that goes beyond traditional monitoring
and incorporates behavioral understanding and real-time decision-making.
Furthermore, manual invigilation not only imposes a significant resource
burden on institutions but also introduces inconsistency and subjectivity into
the process. A smart surveillance system can offer uniformity in monitoring, re-

duce the need for excessive manpower, and generate actionable alerts instantly,
thereby enhancing both efficiency and fairness.
From a technical standpoint, this project presents an opportunity to fuse mul-
tiple AI domains—object detection, human pose estimation, anomaly detection,
and spatio-temporal modeling—into a unified system. Such integration not only
serves the academic sector but also contributes to broader research in intelligent
video surveillance and human behavior analysis.
Ultimately, the motivation is to empower educational institutions with a
modern, automated tool that enhances trust in assessments, respects student
privacy, and operates reliably in real-world conditions.

1.4 Outline of the Report

This report begins with an Introduction that highlights the critical role of exami-
nations in academic assessment and the growing challenges in ensuring fairness
and discipline within increasingly crowded or distributed exam environments.
It introduces the concept of leveraging AI and computer vision to augment tra-
ditional invigilation systems and sets the stage for the proposed solution.
The Background section explores the evolution of exam monitoring practices,
from manual invigilation to passive CCTV systems. It discusses the limitations
of these methods in real-time behavior analysis and the need for more intelli-
gent systems. Following this, the Motivation section delves into the rationale
behind the project, emphasizing the importance of real-time detection, context-
awareness, and scalable deployment to maintain academic integrity.
In the Problem Definition, the core challenge is articulated: the absence of a
smart surveillance solution capable of interpreting human behavior within the
exam context. It outlines the project’s objectives, such as behavior classification,
anomaly detection, and ensuring minimal false positives while maintaining pri-
vacy.
The Literature Survey reviews relevant academic works and commercial im-
plementations in AI-based proctoring. It identifies key technologies like pose
estimation, facial recognition, and deep learning-based action detection, draw-
ing insights from their limitations to guide the design of a more robust system.
The Proposed Methodology describes the architecture of the solution, which
includes modules for object and pose detection, temporal behavior analysis, and
real-time alert generation. It elaborates on the use of tools like YOLO, MediaPipe,
and TimeSformer, and how these components integrate to deliver an end-to-end intel-
ligent surveillance system.
In the Implementation section, the report discusses the data collection pro-
cess, system setup, and integration of machine learning models. It also covers
software frameworks used (e.g., PyTorch, OpenCV) and the computational in-
frastructure.
The Evaluation and Results section presents the system’s performance across
various metrics, including detection accuracy, false positive rates, and real-time
responsiveness. Sample cases of flagged behaviors are provided to demonstrate
system capabilities.
The Discussion reflects on the findings, addressing both the strengths and
limitations of the system, and discusses broader implications like privacy, ethical
considerations, and potential biases in detection.
Finally, the Conclusion and Future Work summarizes the contributions of the
project and proposes directions for further enhancement, such as integrating
audio cues, improving adaptability to different environments, or deploying in
online proctoring scenarios. The report ends with References and Appendices
detailing supporting materials and data.
Chapter 2

Literature Survey

Developing an intelligent surveillance system for examination monitoring involves integrating advances in human activity recognition (HAR), object detec-
tion, and pose estimation—each of which has matured significantly with the
advent of deep learning and computer vision techniques. This section explores
key developments in these domains and their applicability to educational surveil-
lance.

2.1 Human Activity Recognition (HAR)

Human Activity Recognition forms the foundation of behavior understanding in surveillance systems. Earlier approaches primarily relied on handcrafted fea-
tures such as background subtraction, motion vectors, and optical flow, which
often struggled with generalization and real-time application. With the rise of
deep learning, particularly Convolutional Neural Networks (CNNs) and transformer-
based models, HAR has advanced rapidly in terms of both accuracy and scala-
bility.
One of the early influential architectures was the CNN+LSTM model pro-
posed by Donahue et al. (2015) [1], which combined spatial feature extraction
from CNNs with the temporal modeling capabilities of LSTM networks. While
effective for short video clips, these models faced limitations in capturing long-
term dependencies and often required computationally expensive preprocessing.
To address temporal modeling more directly, Carreira and Zisserman [2] in-
troduced the I3D (Inflated 3D ConvNet) model, which expanded 2D convolu-
tional filters into 3D to directly process video frames. I3D demonstrated strong
performance on datasets like Kinetics-400 but proved too computationally de-
manding for real-time use in surveillance.
A significant leap in efficiency and accuracy was achieved with the TimeS-
former model by Bertasius et al. [3] (2021), which applied transformer architectures [4] to video by separating spatial and temporal attention. Inspired by
the success of Vision Transformers (ViTs), TimeSformer achieved state-of-the-art
results on multiple video benchmarks including Something-Something V2 and
Kinetics, with reduced computational overhead compared to I3D. For our sys-
tem, TimeSformer serves as the backbone for video-based activity classification
due to its superior capacity for modeling long-range temporal patterns.

2.2 Object Detection

Object detection is vital in identifying prohibited items such as mobile phones, notes, or books, which may indicate dishonest behavior during exams. The field
has seen a transition from region-based methods to real-time detection algo-
rithms.
The RCNN family (RCNN, Fast R-CNN, Faster R-CNN) initially led the field
by introducing region proposals and CNN-based classification, offering high ac-
curacy at the cost of inference speed. These were soon outpaced in real-time
settings by the YOLO (You Only Look Once) [5] family, introduced by Redmon
et al., which reformulated detection as a single regression problem, enabling ex-
tremely fast inference. YOLOv3 through YOLOv5 brought improvements in both
speed and accuracy, making them particularly suitable for continuous monitor-
ing in live surveillance scenarios.
Although alternatives such as SSD (Single Shot Detector) and RetinaNet
offered competitive performance, YOLO models consistently achieved higher
frame rates without sacrificing detection quality. In our implementation, YOLO
is utilized to detect objects and contextual cues such as hand gestures, book
visibility, or the use of electronic devices.

2.3 Pose and Gesture Estimation

Beyond object detection, understanding a student’s posture and gestures can re-
veal signs of distraction, collaboration, or potential misconduct. Pose estimation
thus adds a crucial behavioral layer to our surveillance system.
OpenPose [6], one of the pioneering tools for real-time pose estimation, in-
troduced Part Affinity Fields (PAFs) to associate human body parts and extract
skeletal keypoints. While OpenPose offered good accuracy, it was computation-
ally intensive and less practical for deployment in resource-constrained environ-
ments like classroom settings.
MediaPipe, developed by Google [7], presents a lightweight and efficient alternative, capable of tracking facial landmarks, hand gestures, and full-body
pose in real time [8]. Its cross-platform support and low latency make it ideal
for constrained environments such as examination halls where fixed camera an-
gles and limited occlusion are common. Prior research has successfully applied
MediaPipe in educational contexts for attention tracking and engagement anal-
ysis, validating its effectiveness in similar domains. Our system uses MediaPipe
to continuously monitor upper-body movements and detect behavioral patterns
indicative of cheating or inattentiveness.

2.4 Intelligent Surveillance Systems in Education

A growing body of research explores the application of AI in educational monitoring. Systems like Smart Eye and EduCam have used facial recognition, gaze
tracking, and motion analysis to evaluate student engagement and detect anoma-
lies. These solutions demonstrate the feasibility and effectiveness of intelligent
surveillance in structured academic settings.
Moreover, the PRISM system highlights the importance of privacy-aware de-
sign in educational surveillance. It integrates anonymization techniques and
emphasizes on-device inference to minimize privacy intrusions while maintain-
ing high detection accuracy.
Recent advances in Edge AI further support real-time applications. By lever-
aging model compression tools such as TensorRT and ONNX Runtime, deep
learning models like YOLO and MediaPipe have been successfully deployed on
edge devices such as NVIDIA Jetson, enabling low-latency, scalable surveillance
without requiring constant cloud connectivity.

2.5 Summary

This survey illustrates that while human activity recognition, object detection,
and pose estimation are individually well-developed fields, their cohesive appli-
cation in the context of exam hall surveillance is still nascent. Our work seeks to
bridge this gap by integrating state-of-the-art models—TimeSformer for tempo-
ral behavior analysis, YOLO for object detection, and MediaPipe for pose estima-
tion—into a unified pipeline designed for real-time, context-aware invigilation.
Through this multi-modal approach, the system offers a balanced solution
that ensures timely alerts, minimal false positives, and actionable insights for
human invigilators—thereby enhancing fairness, efficiency, and security in academic assessment environments.
Chapter 3

Methodology

Introduction to core models

In this chapter, we define the problem. We introduce and discuss the core mod-
els and tools that form the foundation of our proposed Exam Hall Surveillance
System. Building a smart and responsive system for monitoring real-time activi-
ties in an exam environment requires the integration of multiple state-of-the-art
technologies in computer vision and deep learning. Each component plays a dis-
tinct yet complementary role in achieving reliable detection, classification, and
interpretation of student behavior.
We explore three major tools—TimeSformer, YOLO, and MediaPipe —cho-
sen for their robustness, efficiency, and suitability in handling video data, object
detection, and pose estimation, respectively. TimeSformer, a transformer-based
video classification model, allows us to understand and classify complex tem-
poral patterns in video sequences. YOLO, known for its real-time object de-
tection capabilities, helps in identifying suspicious items or actions within the
frame. MediaPipe adds fine-grained motion and gesture tracking, enabling de-
tailed analysis of human posture and hand movements.
Together, these tools create a synergistic pipeline that enables our system to
detect cheating behavior with high accuracy while maintaining real-time per-
formance. The following sections provide an in-depth explanation of each tool,
their internal mechanisms, and their specific role in our surveillance framework.
By integrating deep learning models such as TimeSformer for temporal video
analysis, YOLO for object detection, and MediaPipe for pose and gesture estima-
tion, the system seeks to create a holistic framework that enhances the integrity
of examinations through intelligent automation.

3.1 Problem Definition

In academic institutions, ensuring a fair and secure environment during examinations is of paramount importance. Traditional surveillance methods, such as
manual invigilation or basic CCTV monitoring, often fall short when it comes to
identifying subtle or coordinated instances of malpractice. These systems lack
the ability to intelligently interpret student behavior, detect anomalies in real
time, or adapt to complex, dynamic scenarios that occur within an exam hall.
The core problem lies in the absence of a smart, automated system that can
not only monitor activities continuously but also interpret them with context-
aware intelligence—distinguishing between normal and suspicious behaviors
without relying solely on human observation. Additionally, real-time identifi-
cation of such events and alert generation is critical for proactive intervention.
This project aims to address this gap by developing an intelligent, multi-
model video analytics system for exam hall surveillance. The objective is to
design a solution capable of:

• Real-time detection and classification of human activity in video feeds.

• Identification of actions potentially indicative of cheating or disruptive behavior.

• Ensuring minimal false positives by leveraging spatial and temporal modeling of behavior.

• Maintaining student privacy and system efficiency while providing actionable insights to invigilators.

3.2 The TimeSformer Model

TimeSformer (Time-Space Transformer) is a cutting-edge deep learning model introduced for video understanding tasks, particularly action recognition. Un-
like traditional 3D convolutional neural networks that rely on spatial and tempo-
ral convolutions [9], TimeSformer leverages the transformer architecture—originally
popularized in natural language processing—to capture both spatial and tempo-
ral dependencies through self-attention mechanisms [10].
The model processes input video frames as a sequence of image patches (to-
kens), allowing it to learn long-range interactions across time and space more
effectively than convolution-based models. One of the key advantages of TimeS-
former is its divided space-time attention mechanism, which separately models
spatial features within each frame and temporal correlations across frames, mak-
ing it computationally efficient and scalable.
In the context of exam hall surveillance, TimeSformer is used to classify com-
plex student activities over time, such as head-turning, passing objects, or suspi-
cious hand movements, by analyzing the video feed holistically [11]. Its ability
to model nuanced temporal dynamics makes it particularly suited for detecting
behavior patterns that unfold gradually rather than instantaneously.
By integrating TimeSformer into our system, we achieve robust activity recog-
nition while maintaining interpretability and real-time responsiveness.

3.2.1 Input clip.

The TimeSformer takes as input a clip $X \in \mathbb{R}^{H \times W \times 3 \times F}$ consisting of $F$ RGB frames of size $H \times W$ sampled from the original video.

3.2.2 Decomposition into patches.

Following the ViT (Dosovitskiy et al., 2020), we decompose each frame into $N$ non-overlapping patches, each of size $P \times P$, such that the $N$ patches span the entire frame, i.e., $N = HW/P^2$. We flatten these patches into vectors $x_{(p,t)} \in \mathbb{R}^{3P^2}$ with $p = 1, \ldots, N$ denoting spatial locations and $t = 1, \ldots, F$ denoting an index over frames.
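For illustration, the following is a short PyTorch sketch of this decomposition with assumed toy sizes; it is written for this report and is not part of the TimeSformer implementation itself.

import torch

F, H, W, P = 8, 224, 224, 16                        # assumed clip and patch sizes
clip = torch.randn(F, 3, H, W)                      # F RGB frames of size H x W
patches = clip.unfold(2, P, P).unfold(3, P, P)      # (F, 3, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(F, (H // P) * (W // P), 3 * P * P)
# patches[t, p] is the flattened vector x_(p,t) of length 3*P^2, with N = HW/P^2 patches per frame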

3.2.3 Linear embedding.


We linearly map each patch $x_{(p,t)}$ into an embedding vector $z^{(0)}_{(p,t)} \in \mathbb{R}^{D}$ by means of a learnable matrix $E \in \mathbb{R}^{D \times 3P^2}$:

$$z^{(0)}_{(p,t)} = E\,x_{(p,t)} + e^{pos}_{(p,t)} \qquad (3.1)$$

where $e^{pos}_{(p,t)} \in \mathbb{R}^{D}$ represents a learnable positional embedding added to encode the spatiotemporal position of each patch. The resulting sequence of embedding vectors $z^{(0)}_{(p,t)}$ for $p = 1, \ldots, N$ and $t = 1, \ldots, F$ represents the input to the Transformer, and plays a role similar to the sequences of embedded words that are fed to text Transformers in NLP. As in the original BERT Transformer (Devlin et al., 2018) [12], we add in the first position of the sequence a special learnable vector $z^{(0)}_{(0,0)} \in \mathbb{R}^{D}$ representing the embedding of the classification token.

Figure 3.1: The video self-attention blocks we investigate in this work. Each
attention layer implements self-attention (Vaswani et al., 2017b) on a spec-
ified spatiotemporal neighborhood of frame-level patches (see Figure 3.2 for
a visualization of the neighborhoods). We use residual connections to ag-
gregate information from different attention layers within each block. A
1-hidden-layer MLP is applied at the end of each block. The final model is
constructed by repeatedly stacking these blocks on top of each other.

Figure 3.2: Visualization of the five space-time self-attention schemes studied in this work. Each video clip is viewed as a sequence of frame-level
patches with a size of 16 × 16 pixels. For illustration, we denote in blue
the query patch and show in non-blue colors its self-attention space-time
neighborhood under each scheme. Patches without color are not used for
the self-attention computation of the blue patch. Multiple colors within
a scheme denote attentions separately applied along different dimensions
(e.g., space and time for (T+S)) or over different neighborhoods (e.g., for
(L+G)). Note that self-attention is computed for every single patch in the
video clip, i.e., every patch serves as a query. We also note that although
the attention pattern is shown for only two adjacent frames, it extends in
the same fashion to all frames of the clip.

3.2.4 Query-Key-Value computation.

Our Transformer consists of $L$ encoding blocks. At each block $l$, a query/key/value vector is computed for each patch from the representation $z^{(l-1)}_{(p,t)}$ encoded by the preceding block:

$$q^{(l,a)}_{(p,t)} = W^{(l,a)}_Q \,\mathrm{LN}\!\left(z^{(l-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \qquad (3.2)$$

$$k^{(l,a)}_{(p,t)} = W^{(l,a)}_K \,\mathrm{LN}\!\left(z^{(l-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \qquad (3.3)$$

$$v^{(l,a)}_{(p,t)} = W^{(l,a)}_V \,\mathrm{LN}\!\left(z^{(l-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \qquad (3.4)$$

where $\mathrm{LN}(\cdot)$ denotes LayerNorm (Ba et al., 2016), $a = 1, \ldots, A$ is an index over multiple attention heads and $A$ denotes the total number of attention heads. The latent dimensionality of each attention head is set to $D_h = D/A$.

Self-attention computation. Self-attention weights are computed via dot-product. The self-attention weights $\alpha^{(l,a)}_{(p,t)} \in \mathbb{R}^{NF+1}$ for query patch $(p,t)$ are given by:

$$\alpha^{(l,a)}_{(p,t)} = \mathrm{SM}\!\left(\frac{q^{(l,a)}_{(p,t)}{}^{\top}}{\sqrt{D_h}} \cdot \left[\, k^{(l,a)}_{(0,0)} \;\; \left\{ k^{(l,a)}_{(p',t')} \right\}_{p'=1,\ldots,N,\; t'=1,\ldots,F} \right]\right) \qquad (3.5)$$

where $\mathrm{SM}$ denotes the softmax activation function. Note that when attention is computed over one dimension only (e.g., spatial-only or temporal-only), the computation is significantly reduced. For example, in the case of spatial attention, only $N+1$ query-key comparisons are made, using exclusively keys from the same frame as the query:

$$\alpha^{(l,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left(\frac{q^{(l,a)}_{(p,t)}{}^{\top}}{\sqrt{D_h}} \cdot \left[\, k^{(l,a)}_{(0,0)} \;\; \left\{ k^{(l,a)}_{(p',t)} \right\}_{p'=1,\ldots,N} \right]\right) \qquad (3.6)$$

3.2.5 Encoding.

The encoding $z^{(l)}_{(p,t)}$ at block $l$ is obtained by first computing the weighted sum of value vectors using self-attention coefficients from each attention head:

$$s^{(l,a)}_{(p,t)} = \alpha^{(l,a)}_{(p,t),(0,0)}\, v^{(l,a)}_{(0,0)} + \sum_{p'=1}^{N} \sum_{t'=1}^{F} \alpha^{(l,a)}_{(p,t),(p',t')}\, v^{(l,a)}_{(p',t')} \qquad (3.7)$$

Then, the concatenation of these vectors from all heads is projected and passed through an MLP, using residual connections after each operation:

$$z'^{(l)}_{(p,t)} = W_O \begin{bmatrix} s^{(l,1)}_{(p,t)} \\ \vdots \\ s^{(l,A)}_{(p,t)} \end{bmatrix} + z^{(l-1)}_{(p,t)} \qquad (3.8)$$

$$z^{(l)}_{(p,t)} = \mathrm{MLP}\!\left(\mathrm{LN}\!\left(z'^{(l)}_{(p,t)}\right)\right) + z'^{(l)}_{(p,t)} \qquad (3.9)$$

3.2.6 Classification embedding.

The final clip embedding is obtained from the final block for the classification token:

$$y = \mathrm{LN}\!\left(z^{(L)}_{(0,0)}\right) \in \mathbb{R}^{D} \qquad (3.10)$$

On top of this representation we append a 1-hidden-layer MLP, which is used to predict the final video classes.

Space-Time Self-Attention Models. We can reduce the computational cost by replacing the spatiotemporal attention of Eq. 3.5 with spatial attention within each frame only (Eq. 3.6). However, such a model neglects to capture temporal dependencies across frames. As shown in our experiments, this approach leads to degraded classification accuracy compared to full spatiotemporal attention, especially on benchmarks where strong temporal modeling is necessary.

We propose a more efficient architecture for spatiotemporal attention, named "Divided Space-Time Attention" (denoted with T+S), where temporal attention and spatial attention are separately applied one after the other. This architecture is compared to that of Space and Joint Space-Time attention in Fig. 3.1. A visualization of the different attention models on a video example is given in Fig. 3.2. For Divided Attention, within each block $l$, we first compute temporal attention by comparing each patch $(p,t)$ with all the patches at the same spatial location in the other frames:

$$\alpha^{(l,a)\,\mathrm{time}}_{(p,t)} = \mathrm{SM}\!\left(\frac{q^{(l,a)}_{(p,t)}{}^{\top}}{\sqrt{D_h}} \cdot \left[\, k^{(l,a)}_{(0,0)} \;\; \left\{ k^{(l,a)}_{(p,t')} \right\}_{t'=1,\ldots,F} \right]\right) \qquad (3.11)$$

The encoding $z'^{(l)\,\mathrm{time}}_{(p,t)}$ resulting from the application of Eq. 3.8 using temporal attention is then fed back for spatial attention computation instead of being passed to the MLP. In other words, new key/query/value vectors are obtained from $z'^{(l)\,\mathrm{time}}_{(p,t)}$ and spatial attention is then computed using Eq. 3.6. Finally, the resulting vector $z'^{(l)\,\mathrm{space}}_{(p,t)}$ is passed to the MLP of Eq. 3.9 to compute the final encoding $z^{(l)}_{(p,t)}$ of the patch at block $l$. For the model of divided attention, we learn distinct query/key/value matrices $\{W^{(l,a)}_{Q^{\mathrm{time}}}, W^{(l,a)}_{K^{\mathrm{time}}}, W^{(l,a)}_{V^{\mathrm{time}}}\}$ and $\{W^{(l,a)}_{Q^{\mathrm{space}}}, W^{(l,a)}_{K^{\mathrm{space}}}, W^{(l,a)}_{V^{\mathrm{space}}}\}$ over the time and space dimensions. Note that, compared to the $(NF+1)$ comparisons per patch needed by the joint spatiotemporal attention model of Eq. 3.5, Divided Attention performs only $(N+F+2)$ comparisons per patch. Our experiments demonstrate that this space-time factorization is not only more efficient but also leads to improved classification accuracy.
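To make the divided scheme concrete, the following is a minimal PyTorch sketch of one divided space-time block written for this report; it is not the released TimeSformer implementation, it omits the classification token, and it uses torch.nn.MultiheadAttention in place of the custom attention layers.

import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                     # z: (B, F, N, D) patch embeddings
        B, F, N, D = z.shape
        # Temporal attention: each patch attends to the same spatial
        # location in all F frames (residual connection after attention).
        zt = z.permute(0, 2, 1, 3).reshape(B * N, F, D)
        h = self.norm_t(zt)
        zt = zt + self.attn_t(h, h, h, need_weights=False)[0]
        z = zt.reshape(B, N, F, D).permute(0, 2, 1, 3)
        # Spatial attention: each patch attends to all N patches of its frame.
        zs = z.reshape(B * F, N, D)
        h = self.norm_s(zs)
        zs = zs + self.attn_s(h, h, h, need_weights=False)[0]
        z = zs.reshape(B, F, N, D)
        # 1-hidden-layer MLP with residual connection, as in each block.
        return z + self.mlp(self.norm_mlp(z))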

3.3 The Google MediaPipe

A wide range of potential Machine Learning applications today rely on several fundamental baseline Machine Learning tasks. For example, both gestural nav-
igation and sign language detectors rely on the ability of a program to identify
and track human hands. Given that building something like a hand tracking
model is time-consuming and resource-intensive, a developmental bottleneck
exists in the creation of all applications that rely on hand tracking. To address
this problem, Google invented MediaPipe [7].

3.3.1 MediaPipe Solutions.

MediaPipe Solutions provides a suite of libraries and tools for you to quickly
apply artificial intelligence (AI) and machine learning (ML) techniques in your
applications. You can plug these solutions into your applications immediately,
customize them to your needs, and use them across multiple development plat-
forms. MediaPipe Solutions is part of the MediaPipe open source project, so you
can further customize the solutions code to meet your application needs. The
MediaPipe Solutions suite includes the following:
These libraries and resources provide the core functionality for each Medi-
aPipe Solution:

• MediaPipe Tasks: Cross-platform APIs and libraries for deploying solutions.

Figure 3.3: MediaPipe

• MediaPipe Models: Pre-trained, ready-to-run models for use with each solution.

These tools let you customize and evaluate solutions:

• MediaPipe Model Maker: Customize models for solutions with your data.

• MediaPipe Studio: Visualize, evaluate, and benchmark solutions in your browser.

The Face Landmarker solution detects the most prominent face from an input image, then estimates 478 3D facial landmarks and 52 facial blendshape scores in real time. This solution can be used to create a virtual try-on experience or a virtual avatar that mimics a person's facial expressions.
The MediaPipe Face Landmarker task lets you detect face landmarks and fa-
cial expressions in images and videos. You can use this task to identify human
facial expressions, apply facial filters and effects, and create virtual avatars. This
task uses machine learning (ML) models that can work with single images or a
continuous stream of images. The task outputs 3-dimensional face landmarks,
blendshape scores (coefficients representing facial expression) to infer detailed
facial surfaces in real-time, and transformation matrices to perform the trans-
formations required for effects rendering.
Features:

• Input image processing - Processing includes image rotation, resizing, normalization, and color space conversion.

• Score threshold - Filter results based on prediction scores.



Figure 3.4: Face Landmark detected

3.3.2 Models- MediaPipe

The Face Landmarker uses a series of models to predict face landmarks. The
first model detects faces, a second model locates landmarks on the detected
faces, and a third model uses those landmarks to identify facial features and
expressions.
The following models are packaged together into a downloadable model
bundle:

• Face detection model: detects the presence of faces with a few key facial
landmarks.

• Face mesh model: adds a complete mapping of the face. The model out-
puts an estimate of 478 3-dimensional face landmarks.

• Blendshape prediction model: receives output from the face mesh model and predicts 52 blendshape scores, which are coefficients representing different facial expressions.
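As an illustration, a minimal sketch of invoking this model bundle through the MediaPipe Tasks Python API is shown below; the model file path and image name are placeholders, and the face_landmarker.task bundle must be downloaded separately.

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Load the downloaded Face Landmarker model bundle (path is a placeholder).
base_options = python.BaseOptions(model_asset_path="face_landmarker.task")
options = vision.FaceLandmarkerOptions(
    base_options=base_options,
    output_face_blendshapes=True,             # 52 blendshape scores
    output_facial_transformation_matrixes=True,
    num_faces=1,                              # most prominent face only
)
detector = vision.FaceLandmarker.create_from_options(options)

# Run detection on a single image (file name is a placeholder).
image = mp.Image.create_from_file("student.jpg")
result = detector.detect(image)
print(len(result.face_landmarks[0]))          # 478 landmarks for the detected face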

3.4 YOLO, You Only Look Once

YOLO was developed by Joseph Redmon and his team at the University of Wash-
ington and has become one of the most popular object detection algorithms used
in computer vision applications [5].

3.4.1 How It Works

Prior detection systems repurpose classifiers or localizers to perform detection. They apply the model to an image at multiple locations and scales. High scoring
regions of the image are considered detections.
We use a totally different approach. We apply a single neural network to the
full image. This network divides the image into regions and predicts bounding
boxes and probabilities for each region. These bounding boxes are weighted by
the predicted probabilities [13].
Our model has several advantages over classifier-based systems. It looks at
the whole image at test time so its predictions are informed by global context
in the image. It also makes predictions with a single network evaluation unlike
systems like R-CNN which require thousands for a single image. This makes it
extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast
R-CNN. See the original YOLO paper [13] for more details on the full system.

3.4.2 Set up the environment

We need a Python environment with OpenCV, a popular computer vision library, and YOLO installed. Install the necessary dependencies, such as ultralytics and opencv-python.

$ pip install opencv-python

We’ll capture frames from the webcam using OpenCV. This can be done using
the VideoCapture function in OpenCV.
import cv2

cap = cv2.VideoCapture(0)
cap.set(3, 640)   # frame width
cap.set(4, 480)   # frame height

while True:
    ret, img = cap.read()
    cv2.imshow('Webcam', img)

    if cv2.waitKey(1) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

3.4.3 Operating YOLO with ultralytics

We install the ultralytics library that makes working with YOLO very easy and
hassle-free.

$ pip install ultralytics

The YOLO model is loaded using the ultralytics library by specifying the location of the YOLO weights file, yolo-Weights/yolov8n.pt.
from ultralytics import YOLO

model = YOLO("yolo-Weights/yolov8n.pt")

We instantiate a classNames variable containing a list of object classes that the YOLO model is trained to detect.
classNames = ["person", "bicycle", "car", "motorbike", "aeroplane", "bus",
              "train", "truck", "boat", "traffic light", "fire hydrant",
              "stop sign", "parking meter", "bench", "bird", "cat", "dog",
              "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
              "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
              "skis", "snowboard", "sports ball", "kite", "baseball bat",
              "baseball glove", "skateboard", "surfboard", "tennis racket",
              "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl",
              "banana", "apple", "sandwich", "orange", "broccoli", "carrot",
              "hot dog", "pizza", "donut", "cake", "chair", "sofa",
              "pottedplant", "bed", "diningtable", "toilet", "tvmonitor",
              "laptop", "mouse", "remote", "keyboard", "cell phone",
              "microwave", "oven", "toaster", "sink", "refrigerator", "book",
              "clock", "vase", "scissors", "teddy bear", "hair drier",
              "toothbrush"]

The while loop starts and it reads each frame from the webcam using cap.read().
Then it passes the frame to the YOLO model for object detection. The results of
object detection are stored in the ‘results’ variable.
import cv2

cap = cv2.VideoCapture(0)
cap.set(3, 640)
cap.set(4, 480)

while True:
    ret, img = cap.read()
    results = model(img, stream=True)   # run YOLO detection on the frame

    cv2.imshow('Webcam', img)

    if cv2.waitKey(1) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

For each result, the code extracts the bounding box coordinates of the de-
tected object and draws a rectangle around it using cv2.rectangle(). It also prints
the confidence score and class name of the detected object on the console.
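That per-detection loop is not shown in the listing above; a minimal sketch of it, reusing the model, classNames, and webcam loop defined earlier, could look like the following.

for r in results:                                        # inside the while loop above
    for box in r.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])           # bounding-box corners
        cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 255), 3)
        conf = float(box.conf[0])                        # confidence score
        cls = int(box.cls[0])                            # class index
        print(classNames[cls], round(conf, 2))
        cv2.putText(img, classNames[cls], (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 255), 2)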
Chapter 4

Video Acquisition and Dataset Preparation

Dataset from Video for AI Training

Training AI systems to understand real-world human behavior requires rich, dynamic data—and videos serve as an ideal source. Unlike still images, video
datasets allow models to learn from sequences of actions, context transitions,
and subtle temporal cues. In this project, videos were carefully recorded to
simulate an actual exam hall environment, capturing both ordinary student be-
havior and staged suspicious activities.
The raw footage was then divided into manageable segments, each clip rep-
resenting a short time span where a specific behavior occurs. These clips were
labeled manually, providing ground truth for various categories such as "writ-
ing", "glancing sideways", or "passing objects". This kind of detailed annotation
helps models not only detect objects or poses in a frame, but also understand
what is happening over time—an essential aspect for models like TimeSformer
that focus on spatiotemporal learning.
Building a dataset from video is both art and science: it demands attention
to realism, diversity, and clarity. The final dataset used in this project forms the
backbone of the training pipeline, ensuring that the AI system can generalize
well to unseen situations in real exam environments.

4.1 Video Acquisition

To ensure the effectiveness and reliability of the Exam Hall Surveillance System,
it was crucial to collect real-world video data representative of typical examina-
tion environments. For this purpose, video footage was captured in controlled
indoor settings simulating an exam hall. The setup included students seated in
spaced rows, performing regular activities such as writing, looking around, ad-
justing posture, or in some staged cases, mimicking suspicious behavior (e.g.,
passing notes or glancing at neighbors). Videos were recorded using standard surveillance cameras with resolutions ranging from 720p to 1080p at 30 frames
per second to emulate practical deployment conditions in educational institu-
tions.
Multiple camera angles were experimented with, including overhead wide-
angle views and side views from fixed positions, to analyze which perspective
offered better visibility for behavior recognition. Natural lighting and ambient
classroom noise were retained during the recordings to reflect realistic exam-
time conditions. We also ensured diversity in the dataset by varying lighting
conditions slightly and recording across different times of the day. The partici-
pants, who were mostly friends and volunteers, gave informed consent and were
briefed on the purpose of the recordings. The video capturing process empha-
sized the balance between realism and ethical responsibility while generating
data suitable for training models in a robust and fair manner.
Once the videos were captured, they were segmented into clips based on
specific behaviors—such as writing, stretching, turning, whispering, or showing
potentially dishonest gestures. These segments were annotated manually and
stored in a structured format for downstream processing. Each clip was labeled
with corresponding class tags and metadata such as camera angle, timestamp,
and participant ID (anonymized). This meticulous process helped ensure that
the training data reflected nuanced temporal patterns of both normal and sus-
picious exam-time behavior, making it highly suitable for training models like
TimeSformer and YOLO in a real-world context.

4.2 Create a Dataset for TimeSFormer Training

To train a TimeSFormer, we need a well-prepared dataset. Here is a step-by-step methodology.

4.2.1 Define the Task and Data Requirements

• Decide the application (video understanding)

• Choose the required resolution, format, and data diversity

• Determine the dataset size based on the model’s complexity.



Figure 4.1: Data obtained from https://huggingface.co/datasets/ETHZurich/biwi_kinect_head_pose

4.2.2 Collect Video Data

• Use existing datasets: https://huggingface.co/datasets/ETHZurich/biwi_kinect_head_pose

• Capture our own videos and organize them in class-wise folders.

• Synthetic Data Generation

4.2.3 Convert Videos to Frames

TimeSFormer processes short video clips as input tensors, so videos need to be converted into sequences of frames.
import cv2
import os

def extract_frames(video_path, out_folder, frame_rate=1):
    os.makedirs(out_folder, exist_ok=True)   # create the output folder if needed
    cap = cv2.VideoCapture(video_path)
    count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if count % frame_rate == 0:
            cv2.imwrite(os.path.join(out_folder, f"frame_{count}.jpg"), frame)
        count += 1
    cap.release()

extract_frames("video.mp4", "output_frames/")

Figure 4.2: CSE, Classroom at Jadavpur University

4.2.4 Preprocess and Clean the Images

• Remove duplicates, blurry, or corrupted images.

• Ensure a balanced dataset (equal representation across classes).

• Normalize image sizes (e.g., 224×224 for ViTs)

To find faces, we used YOLOv8, which gives us the coordinates of a bounding box around each face. We then crop not the detected box itself but a region extended by 80%, as sketched below.
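A minimal sketch of such an extended crop is given below; the interpretation of "80% extended" (enlarging the box width and height by 80% around its center) and the face-detection weights file name are assumptions.

import cv2
from ultralytics import YOLO

def crop_extended_face(img, box, scale=0.8):
    # Enlarge the detected box by `scale` around its center, clamped to the image bounds.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    pad_w, pad_h = int(w * scale / 2), int(h * scale / 2)
    H, W = img.shape[:2]
    x1, y1 = max(0, x1 - pad_w), max(0, y1 - pad_h)
    x2, y2 = min(W, x2 + pad_w), min(H, y2 + pad_h)
    return img[y1:y2, x1:x2]

model = YOLO("yolov8n-face.pt")          # hypothetical face-detection weights file
img = cv2.imread("frame_0.jpg")          # placeholder frame
for r in model(img):
    for b in r.boxes.xyxy.int().tolist():
        face = crop_extended_face(img, b)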

4.2.5 Annotate the individual frames

• If training for classification: Assign labels to each image.

• If training for object detection/segmentation: Use tools like LabelImg or CVAT for bounding boxes/masks.

In our work, we need to annotate each frame of the captured video. This is a huge task, so we took a semi-automated approach to annotating the frames. Each video has different parts to annotate. We have used MediaPipe and OpenCV to annotate the video frames. For this prototype model we initially plan six classification labels, as given below.

Figure 4.3: YOLO finds the face with bounding box

Figure 4.4: YOLO finds the face with bounding box

Figure 4.5: We should not crop to the YOLO box itself, but to a box extended by 80%

Label Description
static Not much movement with time
right Steadily moving horizontally right
left Steadily moving horizontally left
up Steadily moving vertically up
down Steadily moving vertically down
angular Steadily moving non-axial direction

Table 4.1: Frames are annotated into six labels

Manually labeling individual frames of a video is a near-impossible task, so we had to adopt an automated data-labeling approach. A sensor-based approach would be best, since it would allow much finer-grained classes; as we do not have the hardware, we instead tried a semi-automated approach in which a pretrained MediaPipe model is used.
The MediaPipe Face Landmarker task lets you detect face landmarks and fa-
cial expressions in images and videos. This task uses machine learning (ML)
models that can work with single images or a continuous stream of images. The
task outputs 3-dimensional face landmarks, blendshape scores (coefficients rep-
resenting facial expression) to infer detailed facial surfaces in real-time, and
transformation matrices to perform the transformations required for effects ren-
dering. Basically: we first try to detect the face and then track landmarks as
long as confidence is above 0.5.
1 mp_face_mesh = mp.solutions.face_mesh
2 face_mesh = mp_face_mesh.FaceMesh(min_detection_confidence=0.5, min_tracking_confidence=0.5)

This line number 1 imports the Face Mesh module from the MediaPipe library
and assigns it the alias mp_face_mesh for convenience.

• Face Detection: Detects the presence of a face.

• Face Landmark Tracking: Predicts 468 landmarks on the face.

In line number 2, we initialize the Face Mesh pipeline.


min_detection_confidence=0.5: This sets the minimum confidence score for
the face detector to initially detect a face. 0.5 means at least 50% confidence is
required.

min_tracking_confidence=0.5: After the face is detected, this sets the confidence threshold for tracking landmarks in subsequent frames.

Landmark Index
Chin Tip 199
Nose Tip 1
Left Eye Center 468
Left Eye Inner Corner 33
Right Eye Center 473
Right Eye Inner Corner 263
Left Pupil 469
Right Pupil 474
Left Mouth Corner 61
Right Mouth Corner 291

Table 4.2: MediaPipe Face Mesh Key Landmark Indices

Now we have to process the input image (e.g., image_rgb) using MediaPipe Face Mesh to detect and extract 468 facial landmarks.

results = face_mesh.process(image_rgb)
image.flags.writeable = True   # to improve performance
print(results.multi_face_landmarks)

• Stores the detected facial landmarks.

• If faces are found, results.multi_face_landmarks will contain the coordinates of 468 landmarks for each face.

Each landmark has:

• x → Horizontal position (normalized between 0 and 1)

• y → Vertical position (normalized between 0 and 1)

• z → Depth (negative = closer to the camera)

The output will be of the form: [landmark x: 0.489120125 y: 0.55023412 z: -0.0034132 ... ]. Here, if lm denotes a landmark structure and img_w × img_h indicates the size of the image, then nose_2d and nose_3d are the landmarks in 2D and 3D respectively.

1  if idx in [33, 263, 1, 61, 291, 199]:
2      if idx == 1:
3          nose_2d = (lm.x * img_w, lm.y * img_h)
4          nose_3d = (lm.x * img_w, lm.y * img_h, lm.z * 3000)
5      x, y = int(lm.x * img_w), int(lm.y * img_h)
6      # get the 2D coordinates
7      face_2d.append([x, y])
8      # get the 3D coordinates
9      face_3d.append([x, y, lm.z])
10 # convert the 2D and 3D coordinates into numpy arrays
11 face_2d = np.array(face_2d, dtype=np.float64)
12 face_3d = np.array(face_3d, dtype=np.float64)
13
14 # the camera matrix
15 focal_length = 1 * img_w
16 # here (img_h/2, img_w/2) is the optical center
17 cam_matrix = np.array([[focal_length, 0, img_h / 2],
18                        [0, focal_length, img_w / 2],
19                        [0, 0, 1]])
20
21 # the distortion parameters (zeros, assuming no distortion)
22 dist_matrix = np.zeros((4, 1), dtype=np.float64)
23
24 # solve Perspective-n-Point (PnP)
25 success, rot_vec, trans_vec = cv2.solvePnP(face_3d, face_2d, cam_matrix, dist_matrix)
26 # convert the rotation vector into a 3x3 rotation matrix (the Jacobian is not used)
27 rmat, jacob = cv2.Rodrigues(rot_vec)
28 # decompose the rotation matrix into three Euler angles via RQ decomposition
29 angles, mtxR, mtxQ, Qx, Qy, Qz = cv2.RQDecomp3x3(rmat)
30
31 x, y, z = angles[0] * 360, angles[1] * 360, angles[2] * 360
32
33 return x, y, z  # return the extracted values

Landmark index 1 is the nose tip. In line number 25, cv2.solvePnP() is an OpenCV function used to estimate the 3D pose of an object (e.g., a face) given its 3D model points and corresponding 2D image points. It computes the rotation and translation of the object in 3D space relative to the camera, which helps determine where the object is positioned and how it is oriented in the scene.
Line number 27: cv2.Rodrigues() converts the rotation vector (rot_vec) into a 3×3 rotation matrix (rmat). Finally, we obtain the Euler angles in degrees.
Of the obtained x, y, z values, we take only the x and y values for our convenience. Now we have to determine the head movement, i.e., the turning of the face. The students' movements are captured at 30 fps, and the raw x, y values are jittery, so we take a moving average over a temporal window and then differentiate the noise-filtered x, y values to determine the direction of movement. The frames are then labelled according to this rule, as sketched below.
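A minimal sketch of this labelling rule follows; the window size, thresholds, and the mapping of the two angles to horizontal/vertical movement are illustrative assumptions, not the exact values used in the project.

import numpy as np

def label_frames(horiz, vert, window=15, thresh=0.05):
    """horiz, vert: per-frame head-pose angles (the x, y values above) at 30 fps."""
    kernel = np.ones(window) / window
    horiz = np.convolve(horiz, kernel, mode="same")   # moving average (noise filter)
    vert = np.convolve(vert, kernel, mode="same")
    dh, dv = np.gradient(horiz), np.gradient(vert)    # per-frame rate of change
    labels = []
    for gh, gv in zip(dh, dv):
        if abs(gh) < thresh and abs(gv) < thresh:
            labels.append("static")
        elif abs(gh) >= thresh and abs(gv) >= thresh:
            labels.append("angular")
        elif abs(gh) >= thresh:
            labels.append("right" if gh > 0 else "left")
        else:
            labels.append("up" if gv > 0 else "down")
    return labels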

4.2.6 Augment the Dataset

Apply transformations to increase diversity:

• Rotation, flipping, cropping.

• Color jittering, Gaussian noise.

• MixUp and CutMix (useful for transformers)
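A sketch of a simple frame-level augmentation pipeline with torchvision is shown below; the parameter values are illustrative, and note that horizontal flips would swap the left/right labels in our setting, so they are omitted here.

import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),                          # small rotations
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),           # random cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),   # Gaussian noise
])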

4.2.7 Split the Dataset

• Training set (70-80%): Used to train the model.

• Validation set (10-15%): Helps in tuning hyperparameters.

• Test set (10-15%): Used for final performance evaluation.
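A minimal sketch of such a split, done at the clip level so that frames of one clip never appear in two splits (the 80/10/10 ratio is one choice within the ranges above):

import random

def split_clips(clip_ids, seed=42):
    clips = list(clip_ids)
    random.Random(seed).shuffle(clips)
    n = len(clips)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (clips[:n_train],                    # training set
            clips[n_train:n_train + n_val],     # validation set
            clips[n_train + n_val:])            # test set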

4.2.8 Convert Data to Tensor Format

• Convert images into PyTorch tensors (torchvision.transforms) format.

• Apply normalization (e.g., the ImageNet mean/std used in the snippet below, or mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5] for RGB).

from torchvision import transforms

# Define transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),   # resize frames
    transforms.ToTensor(),           # convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),   # normalize
])

4.2.9 Create a Data Loader

• Use PyTorch's DataLoader for efficient batching and shuffling.

• Optimize data loading and GPU transfer (num_workers and pin_memory in the DataLoader).

from torch.utils.data import DataLoader

train_root_dir = "/home/allimages/datasettrain2"
val_root_dir = "/home/allimages/datasetval2"
test_root_dir = "/home/allimages/datasettest2"

# Create datasets
train_dataset = FrameDataset(train_root_dir, num_frames=8, transform=transform)
val_dataset = FrameDataset(val_root_dir, num_frames=8, transform=transform)
test_dataset = FrameDataset(test_root_dir, num_frames=8, transform=transform)

# Create DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # batch of 8 clip sequences
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_dataset,
    batch_size=8,
    shuffle=False,       # no need to shuffle validation data
    num_workers=4,
    pin_memory=True,
)

test_loader = DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,       # no need to shuffle test data
    num_workers=4,
    pin_memory=True,
)
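FrameDataset above is the project's custom dataset class. Its full definition is not repeated here; as a rough sketch of what such a class can look like, assuming each clip is stored as a folder of sequentially numbered frame images under a per-class directory, it could be implemented as follows (the directory layout and names are assumptions, not the exact project code):

import os
from PIL import Image
import torch
from torch.utils.data import Dataset

class FrameDataset(Dataset):
    # Loads fixed-length frame sequences stored as <root>/<class_name>/<clip_id>/frame_XXX.jpg
    def __init__(self, root_dir, num_frames=8, transform=None):
        self.num_frames = num_frames
        self.transform = transform
        self.samples = []                          # list of (clip_dir, class_index)
        self.classes = sorted(os.listdir(root_dir))
        for label, cls in enumerate(self.classes):
            cls_dir = os.path.join(root_dir, cls)
            for clip in sorted(os.listdir(cls_dir)):
                self.samples.append((os.path.join(cls_dir, clip), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        clip_dir, label = self.samples[idx]
        frame_files = sorted(os.listdir(clip_dir))[: self.num_frames]
        frames = []
        for name in frame_files:
            img = Image.open(os.path.join(clip_dir, name)).convert("RGB")
            if self.transform:
                img = self.transform(img)
            frames.append(img)
        clip = torch.stack(frames)                 # shape: (num_frames, C, H, W)
        return clip, label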

4.2.10 Store and Manage the Dataset

• Save images in structured folders (train/, val/, test/).

• Use HDF5, TFRecord, or LMDB format for large datasets.
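For very large frame collections, packing clips into a single container file avoids the overhead of millions of small image files. A minimal sketch using HDF5 (assuming the h5py package is installed; the path, group, and attribute names are illustrative):

import h5py
import numpy as np

def save_clip_to_hdf5(h5_path, clip_id, frames, label):
    # frames: uint8 array of shape (num_frames, H, W, 3)
    with h5py.File(h5_path, "a") as f:
        grp = f.require_group(clip_id)
        grp.create_dataset("frames", data=np.asarray(frames, dtype=np.uint8),
                           compression="gzip")
        grp.attrs["label"] = int(label)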

4.2.11 Perform Exploratory Data Analysis (EDA)

• Check class distributions.

• Visualize sample images and labels.

• Ensure there are no data leaks across train/val/test sets.


Chapter 5

Training the Model with TimeSformer

This chapter describes how the pre-trained TimeSformer model is loaded, adapted to our six-class dataset, and fine-tuned. It covers the training setup (configuration parameters, loss function, optimizer, and data loaders), the training loop, evaluation with a confusion matrix, and finally saving the trained model for later use.

5.1 Load TimesFormer Model

# Initialize the fine-tuned model
model = MyTimeSformer(
    img_size=224,
    patch_size=16,
    num_classes=400,          # original number of classes (Kinetics-400 checkpoint)
    num_frames=8,
    attention_type='divided_space_time',
    pretrained_model='/home/hari/timesfrmr_env/TimeSformer/timesformer/models/TimeSformer_divST_8x32_224_K400.pyth',
    finetune_num_classes=6,   # number of classes in the new dataset
)

We use the pre-trained model (built on timm's VisionTransformer implementation) and modify it for our dataset; the relevant configuration parameters are listed below.

5.2 Define Training Setup

• image_size (int, optional, defaults to 224) — The size (resolution) of each


image.

• patch_size (int, optional, defaults to 16) — The size (resolution) of each


patch.

• num_channels (int, optional, defaults to 3) — The number of input chan-


nels.

• num_frames (int, optional, defaults to 8) — The number of frames in


each video.

• hidden_size (int, optional, defaults to 768) — Dimensionality of the en-


coder layers and the pooler layer.

• num_hidden_layers (int, optional, defaults to 12) — Number of hidden


layers in the Transformer encoder.

• num_attention_heads (int, optional, defaults to 12) — Number of atten-


tion heads for each attention layer in the Transformer encoder.

• intermediate_size (int, optional, defaults to 3072) — Dimensionality of


the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

• hidden_act (str or function, optional, defaults to "gelu") — The non-linear


activation function (function or string) in the encoder and pooler. Sup-
ported values: "gelu", "relu", "selu", "gelu_new".

• hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout


probability for all fully connected layers in the embeddings, encoder, and
pooler.

• attention_probs_dropout_prob (float, optional, defaults to 0.0) — The


dropout ratio for the attention probabilities.

• initializer_range (float, optional, defaults to 0.02) — The standard devia-


tion of the truncated_normal_initializer for initializing all weight matrices.

• layer_norm_eps (float, optional, defaults to 1e-6) — The epsilon used by


the layer normalization layers.

• qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the


queries, keys, and values.

• attention_type (str, optional, defaults to "divided_space_time") — The


attention type to use. Must be one of "divided_space_time", "space_only",
"joint_space_time".

• drop_path_rate (float, optional, defaults to 0) — The dropout ratio for


stochastic depth.
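These parameters mirror the Hugging Face TimesformerConfig. Purely as an illustration (the project itself uses the MyTimeSformer wrapper shown in Section 5.1), the same setup could be expressed with the transformers library roughly as follows; loading the Kinetics-400 weights would still be a separate step.

from transformers import TimesformerConfig, TimesformerForVideoClassification

config = TimesformerConfig(
    image_size=224,
    patch_size=16,
    num_channels=3,
    num_frames=8,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    attention_type="divided_space_time",
    num_labels=6,        # six behaviour classes in our dataset
)
model = TimesformerForVideoClassification(config)   # randomly initialised at this point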

5.2.1 Loss Function & Optimizer

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

5.2.2 Create Data Loader

# Create DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # batch of 8 clip sequences
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=8,
    shuffle=False,       # no need to shuffle validation data
    num_workers=4,
    pin_memory=True,
)
test_loader = DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,       # no need to shuffle test data
    num_workers=4,
    pin_memory=True,
)

5.3 Check Dataset sizes

from collections import defaultdict

# Count how many clips of each class are in every split
train_counts = defaultdict(int)
for _, label in train_dataset:
    train_counts[label] += 1
print(train_counts.items())

val_counts = defaultdict(int)
for _, label in val_dataset:
    val_counts[label] += 1
print(val_counts.items())

test_counts = defaultdict(int)
for _, label in test_dataset:
    test_counts[label] += 1
print(test_counts.items())

dict_items([(5, 83), (3, 214), (4, 238), (1, 293), (2, 81), (0, 91)]) ← Train Set
dict_items([(3, 32), (1, 56), (5, 14), (0, 15), (4, 35), (2, 5)]) ← Val Set
dict_items([(4, 68), (5, 23), (3, 61), (0, 25), (1, 82), (2, 22)]) ← Test Set

5.4 Define the Transformations

from torchvision import transforms

# Define transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # resize frames
    transforms.ToTensor(),                            # convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # normalize with ImageNet statistics
])

5.5 MyTimeSformer Model Overview and Initialization

Base model: VisionTransformer

Input preprocessing:
• PatchEmbed with Conv2D (3 → 768 channels, kernel size 16×16, stride 16)

Dropouts:
• Spatial: Dropout(p=0.0)
• Temporal: Dropout(p=0.0)
• Positional: Dropout(p=0.0)

Transformer blocks (12 in total, Block 0 and Blocks 1–11):
• LayerNorm (input and MLP normalization)
• Attention layers (spatial and temporal attention):
  – qkv: Linear (768 → 2304)
  – proj: Linear (768 → 768)
  – Dropouts: proj_drop, attn_drop
  – Temporal FC: Linear (768 → 768)
• MLP block:
  – fc1: Linear (768 → 3072)
  – GELU activation
  – fc2: Linear (3072 → 768)
  – Dropout(p=0.0)
• Residual paths:
  – Identity in Block 0
  – DropPath in Blocks 1–11

Output layer:
• Final LayerNorm on the embeddings
• Head: Linear (768 → 6) for classification
# Initialize the fine-tuned model
model = MyTimeSformer(
    img_size=224,
    patch_size=16,
    num_classes=400,          # original number of classes
    num_frames=8,
    attention_type='divided_space_time',
    pretrained_model='/home/hari/timesfrmr_env/TimeSformer/timesformer/models/TimeSformer_divST_8x32_224_K400.pyth',
    finetune_num_classes=6,   # number of classes in the new dataset
)

# Freeze pre-trained layers (optional) and train only the classification head
for param in model.parameters():
    param.requires_grad = False
for param in model.model.head.parameters():
    param.requires_grad = True

# Move the model to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model.to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

5.6 Train the Model

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score

for epoch in range(30):  # number of epochs
    model.train()
    epoch_loss = 0.0
    all_labels = []
    all_preds = []
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}")

    for inputs, labels in progress_bar:
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        progress_bar.set_postfix(loss=loss.item())

        # Store predictions and labels for AUC-ROC and accuracy calculation
        preds = torch.softmax(outputs, dim=1).detach().cpu().numpy()
        labels_np = labels.cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels_np)

    # Per-epoch confusion matrix on the training set
    cm = confusion_matrix(all_labels, [p.argmax() for p in all_preds])
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(f'Confusion Matrix - Epoch {epoch + 1}')
    plt.show()

    # Compute AUC-ROC and Accuracy
    auc_roc = roc_auc_score(all_labels, all_preds, multi_class='ovr')
    accuracy = accuracy_score(all_labels, [p.argmax() for p in all_preds])

    print(f"Epoch {epoch + 1}, Loss: {epoch_loss / len(train_loader):.4f}, "
          f"AUC-ROC: {auc_roc:.4f}, Accuracy: {accuracy:.4f}")

5.7 Evaluate the Model

model.eval()
with torch.no_grad():
    val_loss = 0
    correct = 0
    total = 0
    val_labels = []
    val_preds = []

    for frames, labels in test_loader:
        frames, labels = frames.to(device), labels.to(device)
        outputs = model(frames)
        loss = criterion(outputs, labels)
        val_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

        val_labels.extend(labels.cpu().numpy())
        val_preds.extend(predicted.cpu().numpy())

    # Average the loss over the loader that was actually iterated (the held-out test set)
    print(f"Validation Loss: {val_loss / len(test_loader):.4f}, "
          f"Accuracy: {100. * correct / total:.2f}%")

5.8 Compute a Confusion Matrix for Video Classification

A confusion matrix is a table used to evaluate the performance of a classification


model. It compares the model’s predicted labels with actual labels to identify
correct and incorrect classifications. The matrix consists of True Positives (TP),
False Positives (FP), True Negatives (TN), and False Negatives (FN). The confu-
sion matrix is crucial for understanding class-wise errors and improving model
performance. A perfect model has a confusion matrix with values only along
the diagonal. Misclassifications appear as non-diagonal elements, highlighting
where the model fails. Normalization (percentage format) helps interpret per-
formance across different class sizes.

• Diagonal values → Correct predictions.



• Off-diagonal values → Misclassifications.

• Class-wise performance → Identify which classes are frequently misclassi-


fied.

• Precision = TP / (TP + FP) measures how many of the predicted positives are actually correct.

• Recall (Sensitivity) = TP / (TP + FN) indicates how well the model identifies actual positives.

• F1-Score = 2 × (Precision × Recall) / (Precision + Recall) balances precision and recall.
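These per-class metrics do not need to be computed by hand. Given the label and prediction lists collected in Section 5.8.2, scikit-learn can report them directly; a minimal sketch (the class names passed here are illustrative):

from sklearn.metrics import classification_report

print(classification_report(
    all_labels,
    all_predictions,
    target_names=["class_0", "class_1", "class_2", "class_3", "class_4", "class_5"],
    digits=3,
))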

5.8.1 Import Necessary Libraries

import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

5.8.2 Prepare Model Predictions

Run the trained model on the test dataset and collect predictions.
model.eval()
all_labels = []
all_predictions = []

with torch.no_grad():
    for videos, labels in test_loader:           # the test-set DataLoader defined earlier
        videos, labels = videos.to(device), labels.to(device)

        outputs = model(videos)
        _, predicted = torch.max(outputs, 1)     # class with the highest score

        all_labels.extend(labels.cpu().numpy())
        all_predictions.extend(predicted.cpu().numpy())

Figure 5.1: Epoch 11: Confusion Matrix Training Set

5.8.3 Compute the Confusion Matrix

cm = confusion_matrix(all_labels, all_predictions)

Here, all_labels holds the actual class labels and all_predictions holds the model's predicted class labels.

5.8.4 Visualize the Confusion Matrix

# Define class names (adjust based on dataset)
class_names = ["Class1", "Class2", "Class3", ..., "ClassN"]

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap=plt.cm.Blues, xticks_rotation="vertical")
plt.title("Confusion Matrix")
plt.show()

5.8.5 Normalize the Confusion Matrix

To normalize the matrix (showing percentages instead of absolute counts):


cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

disp = ConfusionMatrixDisplay(confusion_matrix=cm_normalized, display_labels=class_names)
disp.plot(cmap=plt.cm.Blues, xticks_rotation="vertical")
plt.title("Normalized Confusion Matrix")
plt.show()

5.9 Save the Model

Once the training process of a machine learning model is complete, saving the
trained model is a crucial final step. It preserves the learned parameters, archi-
tecture, and performance optimizations so that the model can be reused without
the need to retrain from scratch. This is especially important in video-based tasks
like surveillance, where training deep models like TimeSformer is computation-
ally intensive and time-consuming.
In this project, after evaluating the model’s performance using metrics such
as accuracy, confusion matrix, and loss curves, the best-performing version of the
trained model was serialized and stored using formats such as .pth (for PyTorch)
or .h5 (for Keras/TensorFlow). Along with the model weights, metadata such
as class labels, input size, and preprocessing requirements were also saved to
ensure seamless deployment and inference later.
Saving the model allows for future use in real-time monitoring systems, fine-
tuning with new data, or further research. It also ensures reproducibility—an
essential principle in scientific and academic work.
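In PyTorch this amounts to a single checkpoint file. The sketch below bundles the weights with the kind of metadata mentioned above; the file name, label names, and metadata keys are illustrative, not the exact ones used in the project.

import torch

checkpoint = {
    "model_state_dict": model.state_dict(),
    "class_labels": ["down", "left", "right", "up", "steady", "other"],  # illustrative names
    "input_size": (8, 3, 224, 224),            # (frames, channels, height, width)
    "normalization": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]},
}
torch.save(checkpoint, "timesformer_examhall_best.pth")

# Later, for inference:
#   checkpoint = torch.load("timesformer_examhall_best.pth", map_location=device)
#   model.load_state_dict(checkpoint["model_state_dict"])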
Chapter 6

Improvement by Capturing Labelled Data

Capturing labelled video data is essential for enhancing the performance of models like TimeSformer. Such data enables the model to learn temporal and spatial patterns across video sequences effectively. Through supervised training, the model associates video clips with specific actions, scenes, or categories. Accurate labels reduce noise and ambiguity, leading to better results in classification and detection tasks. For TimeSformer, an abundance of annotated video data strengthens its capacity to model long-range dependencies and supports better generalization across varied content. Manual annotation provides precise supervision but is time-consuming, while semi-automated tools and pre-trained embeddings can accelerate the process; in our case, however, neither approach alone yielded strong results. A more deliberate and structured data collection strategy ultimately led to a remarkable improvement in performance.

6.1 Capturing Labelled Data

Initially, we collected data in a semi-random manner, assuming that MediaPipe


would be effective in automatically labelling it. However, the results were not
satisfactory, prompting us to revisit our data collection strategy. We then sys-
tematically recorded four distinct video clips for each individual, carefully con-
trolling facial movements as follows:

• Moving from right to left

• Moving from left to right

• Moving from up to down

• Moving from down to up



We recorded each video clip for approximately 4 seconds at 40 frames per second, resulting in a total of 4 × 4 = 16 seconds of footage per individual. This provided us with well-labelled data suitable for training our TimeSformer-based model, as shown in Figure 6.1.

Figure 6.1: Face moving towards the left ((a) Frame 1, (b) Frame 5, (c) Frame 8); the nose, not the eyeball, is the clear indicator of movement.

This new dataset was then used to train our model, with only a minor modification to the output layer. The confusion matrices for the training set and the test set are shown in Figures 6.2 and 6.3 (training loss: 0.2278, AUC-ROC: 0.9962, accuracy: 0.9598; validation loss: 0.2236, accuracy: 99.02%).

6.2 Deep-learning Approach Using Mediapipe

After obtaining the excellent results described in the previous section, we explored a further approach in which the heavy TimeSformer framework is replaced by the lightweight MediaPipe. Since our earlier TimeSformer experiments (before the improved data collection) had given less than 75% accuracy, we decided to change the model: instead of TimeSformer, we use MediaPipe facial landmarks together with a deep-learning classifier, keeping TimeSformer's basic idea that temporal facial-landmark information from eight consecutive frames should be used. With this approach we obtained nearly 95% accuracy; the result is shown in Figure 6.4.
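The exact architecture of this lightweight classifier is not reproduced here. As an illustrative sketch only, assuming the per-frame head-pose angles (or other landmark features) have already been extracted with MediaPipe, a small network over eight consecutive frames could look like this:

import torch
import torch.nn as nn

class LandmarkSequenceClassifier(nn.Module):
    # Classifies a sequence of 8 frames of facial-landmark features (a sketch)
    def __init__(self, features_per_frame=2, num_frames=8, num_classes=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                    # (B, num_frames * features_per_frame)
            nn.Linear(num_frames * features_per_frame, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        # x: (batch, num_frames, features_per_frame), e.g. yaw and pitch per frame
        return self.net(x)

# Usage sketch:
#   clf = LandmarkSequenceClassifier()
#   logits = clf(torch.randn(4, 8, 2))   # a batch of 4 eight-frame sequences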

Figure 6.2: Epoch 11: Confusion Matrix Training Set

Figure 6.3: Epoch 11: Confusion Matrix Test Set, Loss: 0.2278, AUC-ROC:
0.9962, Accuracy: 0.9598 Validation Loss: 0.2236, Accuracy: 99.02%

Figure 6.4: Epoch 89: Confusion Matrix


Chapter 7

Discussion and Conclusion

The development of a smart surveillance system for exam halls is a timely re-
sponse to the increasing need for enhanced fairness, integrity, and security in
academic environments. In this project, we designed a real-time video analytics
framework utilizing state-of-the-art computer vision models like TimeSformer,
YOLO, and Google MediaPipe. These tools work together to provide compre-
hensive visual understanding of the exam hall environment, capturing student
activities, identifying suspicious behavior, and ensuring fairness in real-time.
Throughout the course of implementation, we faced a number of challenges,
including noisy environments, low-resolution video feeds, and the real-time pro-
cessing requirement. Each of these posed unique technical problems that needed
to be resolved through a combination of model fine-tuning, optimization strate-
gies, and efficient dataset curation. The integration of multiple models was not
straightforward, as we had to manage inter-model latency, ensure seamless data
flow between detection and classification stages, and handle a wide range of hu-
man postures and interactions.
The choice of TimeSformer was crucial, as it brought the power of transformer-
based architectures into video understanding. Unlike traditional CNN-based
models, TimeSformer leverages self-attention mechanisms along both spatial
and temporal dimensions. This made it particularly effective in recognizing sub-
tle temporal patterns of student behavior, such as frequent turning of heads, pro-
longed glances in a particular direction, or abnormal postures that may indicate
cheating.
YOLO was primarily employed for object detection. It allowed us to detect
unauthorized materials such as mobile phones, calculators, or slips of paper that
students may attempt to use unfairly. Its real-time processing speed, even on
modest hardware, made it an indispensable component of our system. YOLO’s
bounding box outputs also facilitated region-of-interest tracking, which could
then be further analyzed using TimeSformer for behavioral context.

Google MediaPipe was used for pose estimation, enabling us to capture body
landmarks of students across frames. This added a valuable layer of inter-
pretability to our system. For example, MediaPipe helped in distinguishing be-
tween normal writing behavior and suspicious posture shifts. Furthermore, com-
bining pose information with TimeSformer predictions allowed us to improve
the reliability of our behavior classifier through multimodal fusion.
One of the strengths of this project lies in the design of the dataset. Real-
world footage, curated and annotated with help from friends and volunteers,
ensured that the training data was representative of actual exam environments.
Diverse camera angles, lighting conditions, and student postures enriched the
dataset, making the model robust to domain variations.
During the evaluation phase, we achieved a high classification accuracy for
distinguishing between normal and suspicious behaviors. The confusion matrix
indicated that our system could accurately detect potential cheating events while
maintaining a low false-positive rate. However, some borderline cases—such
as students stretching or adjusting their seating—occasionally triggered false
alerts. This suggests that a soft probabilistic alerting system, rather than a binary
classifier, might be more appropriate in future iterations.
Another important insight came from latency measurement. While our mod-
els perform well individually, the end-to-end pipeline introduced cumulative de-
lays. To address this, we experimented with lightweight alternatives and prun-
ing strategies. Future work could benefit from deploying models on edge devices
like NVIDIA Jetson to further reduce reliance on cloud-based processing and in-
crease scalability.
Ethical considerations were central to the project. Surveillance in academic
settings raises questions about privacy and student comfort. Therefore, we en-
sured that all participants were informed and consented to video data collection.
Moreover, the system is designed to assist invigilators, not replace them. Final
decisions are left to human supervisors, and AI-generated alerts are only sug-
gestive, not definitive.
An unexpected benefit of the system is its potential for post-exam analysis.
Stored surveillance data, tagged and classified, could help academic authorities
analyze overall student behavior patterns and even improve seating arrange-
ments or room layouts for future exams. This secondary use of data opens up
new avenues for academic administration and behavioral research.
We also found that many current open-source tools, while powerful, lacked
integration-friendly documentation. This prompted us to write several custom
wrappers and pipeline orchestrators to link models efficiently. These contribu-

tions can be shared with the community in future open-source releases, con-
tributing to the body of work on surveillance AI.
On a broader note, this work touches upon the larger debate around au-
tomation in educational monitoring. While AI can enhance efficiency and re-
duce human error, it must be deployed responsibly. Trust, transparency, and
explainability are essential features of any system that makes high-stakes judg-
ments, especially in academic contexts. In this spirit, all components of our
system provide visual outputs—bounding boxes, pose overlays, and attention
heatmaps—to explain why certain behaviors were flagged [8].
In conclusion, this project demonstrates a successful fusion of modern deep
learning models into a real-world application with societal relevance. The exam
hall surveillance system is not just a technological solution; it is a tool aimed at
reinforcing academic fairness and reducing stress on human invigilators. Through
careful model selection, dataset preparation, and thoughtful system design, we
have built a robust prototype capable of real-time action recognition and object
detection in a high-stakes environment.
Looking forward, there are several exciting directions to explore. One is
the addition of audio-based analysis—whisper detection or speech recognition
could be used to detect collaboration between students. Another is integrating
the system into a multi-camera network where spatiotemporal behavior can be
tracked across rooms. Furthermore, reinforcement learning could be used to
adaptively tune alert thresholds based on invigilator feedback over time.
We also envision deploying this system on embedded edge devices, such
as Jetson Nano or Raspberry Pi clusters, enabling cost-effective deployment
in resource-constrained institutions. Federated learning approaches could help
preserve student privacy while continually improving the model across multiple
deployments.
This work serves as a foundation. It merges multiple domains—computer
vision, machine learning, real-time systems, and ethics—to address a persistent
real-world challenge. While there is still work to be done in terms of robust-
ness, fairness, and deployment, the results are promising and demonstrate the
potential of AI in reimagining educational infrastructure. We hope that this sys-
tem inspires further interdisciplinary research and responsible AI practices in
academic institutions across the globe.
References

1. Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan


S, Saenko K, et al. Long-term recurrent convolutional networks for visual
recognition and description. Proceedings of the IEEE conference on com-
puter vision and pattern recognition. 2015; p. 2625–2634.

2. Carreira J, Zisserman A. Quo vadis, action recognition? A new model and


the kinetics dataset. In: Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition; 2017. p. 6299–6308.

3. Bertasius G, Wang H, Torresani L. Is Space-Time Attention All You Need for


Video Understanding? In: Proceedings of the International Conference on
Machine Learning (ICML); 2021.

4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al.


Attention is All You Need. In: Advances in Neural Information Processing
Systems (NeurIPS); 2017.

5. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified,


Real-Time Object Detection. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR); 2016. p. 779–788.

6. Andriluka M, Pishchulin L, Gehler P, Schiele B. 2D Human Pose Estimation:


New Benchmark and State of the Art Analysis. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR); 2014.

7. Google Research. MediaPipe: Cross-platform Framework for Building Per-


ception Pipelines; 2020. https://github.com/google/mediapipe.

8. Cao Z, Hidalgo G, Simon T, Wei SE, Sheikh Y. OpenPose: Realtime Multi-


Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on
Pattern Analysis and Machine Intelligence. 2021;43(1):172–186.

9. Feichtenhofer C, Pinz A, Zisserman A. Convolutional Two-Stream Network


Fusion for Video Action Recognition. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR); 2016.

10. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning Spatiotemporal


Features with 3D Convolutional Networks. In: Proceedings of the IEEE
International Conference on Computer Vision (ICCV); 2015.

11. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object De-
tection. In: Proceedings of the IEEE International Conference on Computer
Vision (ICCV); 2017. p. 2980–2988.

12. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805; 2019.

13. Redmon J, Farhadi A. YOLOv3: An Incremental Improvement; 2018.


