Dissertation
Shounak Das
May 2025
Declaration
Signature of Student
Certificate of Recommendation
Contents
1 Introduction
  1.1 Overview
  1.2 Background
  1.3 Motivation
  1.4 Outline of the Report
2 Literature Survey
  2.1 Human Activity Recognition (HAR)
  2.2 Object Detection
  2.3 Pose and Gesture Estimation
  2.4 Intelligent Surveillance Systems in Education
  2.5 Summary
3 Methodology
  3.1 Problem Definition
  3.2 The TimeSformer Model
  3.3 The Google MediaPipe
  3.4 YOLO, You Only Look Once

List of Figures
6.1 Face moving towards the left; the nose is the clear indicator of movement, not the eyeball
6.2 Epoch 11: Confusion Matrix Training Set
6.3 Epoch 11: Confusion Matrix Test Set. Loss: 0.2278, AUC-ROC: 0.9962, Accuracy: 0.9598; Validation Loss: 0.2236, Accuracy: 99.02%
6.4 Epoch 89: Confusion Matrix
List of Tables
Abstract
Chapter 1
Introduction
1.1 Overview
The following sections delve into the models and tools used, dataset prepa-
ration, training methodologies, and the system’s real-world performance, ulti-
mately evaluating its effectiveness in fostering a secure and fair examination
environment.
1.2 Background
Examinations have long served as the benchmark for assessing a student’s under-
standing, knowledge retention, and critical thinking abilities. In large academic
institutions, where hundreds or thousands of students are evaluated simultane-
ously, maintaining the integrity of these assessments becomes both logistically
and ethically critical. Traditionally, this has been managed through human in-
vigilators and passive CCTV setups, with the assumption that human presence
and retrospective video review are sufficient deterrents to malpractice.
However, as student populations grow and exam formats evolve (e.g., online
or hybrid), the traditional approach reveals several limitations. Human invig-
ilators are prone to fatigue, distraction, and oversight, especially in large or
high-pressure environments. CCTV footage, while helpful, often requires man-
ual review and lacks real-time responsiveness. Most importantly, these systems
operate passively—they monitor, but they do not interpret.
Advancements in computer vision, machine learning, and deep learning have
opened up new avenues for real-time, intelligent video analysis. These technolo-
gies allow machines to detect patterns, identify anomalies, and even classify
complex human behaviors with high accuracy. By integrating such technologies
into exam surveillance, institutions can move beyond reactive monitoring and
toward proactive, intelligent invigilation systems.
1.3 Motivation
The core motivation behind this project stems from the increasing need for scal-
able, intelligent, and context-aware surveillance solutions that can uphold aca-
demic integrity across large, distributed examination environments. The rise in
sophisticated cheating techniques—such as silent coordination, hidden devices,
and impersonation—demands a system that goes beyond traditional monitoring
and incorporates behavioral understanding and real-time decision-making.
Furthermore, manual invigilation not only imposes a significant resource
burden on institutions but also introduces inconsistency and subjectivity into
the process. A smart surveillance system can offer uniformity in monitoring, re-
duce the need for excessive manpower, and generate actionable alerts instantly,
thereby enhancing both efficiency and fairness.
From a technical standpoint, this project presents an opportunity to fuse mul-
tiple AI domains—object detection, human pose estimation, anomaly detection,
and spatio-temporal modeling—into a unified system. Such integration not only
serves the academic sector but also contributes to broader research in intelligent
video surveillance and human behavior analysis.
Ultimately, the motivation is to empower educational institutions with a
modern, automated tool that enhances trust in assessments, respects student
privacy, and operates reliably in real-world conditions.
1.4 Outline of the Report
This report begins with an Introduction that highlights the critical role of exami-
nations in academic assessment and the growing challenges in ensuring fairness
and discipline within increasingly crowded or distributed exam environments.
It introduces the concept of leveraging AI and computer vision to augment tra-
ditional invigilation systems and sets the stage for the proposed solution.
The Background section explores the evolution of exam monitoring practices,
from manual invigilation to passive CCTV systems. It discusses the limitations
of these methods in real-time behavior analysis and the need for more intelli-
gent systems. Following this, the Motivation section delves into the rationale
behind the project, emphasizing the importance of real-time detection, context-
awareness, and scalable deployment to maintain academic integrity.
In the Problem Definition, the core challenge is articulated: the absence of a
smart surveillance solution capable of interpreting human behavior within the
exam context. It outlines the project’s objectives, such as behavior classification,
anomaly detection, and ensuring minimal false positives while maintaining pri-
vacy.
The Literature Survey reviews relevant academic works and commercial im-
plementations in AI-based proctoring. It identifies key technologies like pose
estimation, facial recognition, and deep learning-based action detection, draw-
ing insights from their limitations to guide the design of a more robust system.
The Proposed Methodology describes the architecture of the solution, which
includes modules for object and pose detection, temporal behavior analysis, and
real-time alert generation. It elaborates on the use of tools like YOLO, OpenPose,
and RNNs, and how these components integrate to deliver an end-to-end intel-
ligent surveillance system.
In the Implementation section, the report discusses the data collection pro-
cess, system setup, and integration of machine learning models. It also covers
software frameworks used (e.g., PyTorch, OpenCV) and the computational in-
frastructure.
The Evaluation and Results section presents the system’s performance across
various metrics, including detection accuracy, false positive rates, and real-time
responsiveness. Sample cases of flagged behaviors are provided to demonstrate
system capabilities.
The Discussion reflects on the findings, addressing both the strengths and
limitations of the system, and discusses broader implications like privacy, ethical
considerations, and potential biases in detection.
Finally, the Conclusion and Future Work summarizes the contributions of the
project and proposes directions for further enhancement, such as integrating
audio cues, improving adaptability to different environments, or deploying in
online proctoring scenarios. The report ends with References and Appendices
detailing supporting materials and data.
Chapter 2
Literature Survey
2.3 Pose and Gesture Estimation
Beyond object detection, understanding a student's posture and gestures can reveal signs of distraction, collaboration, or potential misconduct. Pose estimation thus adds a crucial behavioral layer to our surveillance system.
OpenPose [6], one of the pioneering tools for real-time pose estimation, in-
troduced Part Affinity Fields (PAFs) to associate human body parts and extract
skeletal keypoints. While OpenPose offered good accuracy, it was computation-
ally intensive and less practical for deployment in resource-constrained environ-
ments like classroom settings.
2.5 Summary
This survey illustrates that while human activity recognition, object detection,
and pose estimation are individually well-developed fields, their cohesive appli-
cation in the context of exam hall surveillance is still nascent. Our work seeks to
bridge this gap by integrating state-of-the-art models—TimeSformer for tempo-
ral behavior analysis, YOLO for object detection, and MediaPipe for pose estima-
tion—into a unified pipeline designed for real-time, context-aware invigilation.
Through this multi-modal approach, the system offers a balanced solution
that ensures timely alerts, minimal false positives, and actionable insights for invigilators.
Chapter 3
Methodology
In this chapter, we define the problem. We introduce and discuss the core mod-
els and tools that form the foundation of our proposed Exam Hall Surveillance
System. Building a smart and responsive system for monitoring real-time activi-
ties in an exam environment requires the integration of multiple state-of-the-art
technologies in computer vision and deep learning. Each component plays a dis-
tinct yet complementary role in achieving reliable detection, classification, and
interpretation of student behavior.
We explore three major tools—TimeSformer, YOLO, and MediaPipe —cho-
sen for their robustness, efficiency, and suitability in handling video data, object
detection, and pose estimation, respectively. TimeSformer, a transformer-based
video classification model, allows us to understand and classify complex tem-
poral patterns in video sequences. YOLO, known for its real-time object de-
tection capabilities, helps in identifying suspicious items or actions within the
frame. MediaPipe adds fine-grained motion and gesture tracking, enabling de-
tailed analysis of human posture and hand movements.
Together, these tools create a synergistic pipeline that enables our system to
detect cheating behavior with high accuracy while maintaining real-time per-
formance. The following sections provide an in-depth explanation of each tool,
their internal mechanisms, and their specific role in our surveillance framework.
By integrating deep learning models such as TimeSformer for temporal video
analysis, YOLO for object detection, and MediaPipe for pose and gesture estima-
tion, the system seeks to create a holistic framework that enhances the integrity
of examinations through intelligent automation.
3.2 The TimeSformer Model
TimeSformer is a transformer-based video classification model that applies self-attention over spatial features within each frame and temporal correlations across frames, making it computationally efficient and scalable.
In the context of exam hall surveillance, TimeSformer is used to classify com-
plex student activities over time, such as head-turning, passing objects, or suspi-
cious hand movements, by analyzing the video feed holistically [11]. Its ability
to model nuanced temporal dynamics makes it particularly suited for detecting
behavior patterns that unfold gradually rather than instantaneously.
By integrating TimeSformer into our system, we achieve robust activity recog-
nition while maintaining interpretability and real-time responsiveness.
The TimeSformer takes as input a clip $X \in \mathbb{R}^{H \times W \times 3 \times F}$ consisting of $F$ RGB frames of size $H \times W$ sampled from the original video.
Following the ViT (Dosovitskiy et al., 2020), we decompose each frame into $N$ non-overlapping patches, each of size $P \times P$, such that the $N$ patches span the entire frame, i.e., $N = HW/P^2$. We flatten these patches into vectors $x_{(p,t)} \in \mathbb{R}^{3P^2}$, with $p = 1, \ldots, N$ denoting spatial locations and $t = 1, \ldots, F$ depicting an index over frames.
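For instance, with the frame and patch sizes used later in this work ($H = W = 224$, $P = 16$), each frame yields $N = 224 \cdot 224 / 16^2 = 196$ patches, so a clip of $F = 8$ frames contributes $196 \times 8 = 1568$ patch tokens plus one classification token.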
Each patch is linearly mapped into an embedding vector:
\[
z^{(0)}_{(p,t)} = E\,x_{(p,t)} + e^{\mathrm{pos}}_{(p,t)} \tag{3.1}
\]
where $e^{\mathrm{pos}}_{(p,t)} \in \mathbb{R}^{D}$ represents a learnable positional embedding added to encode the spatiotemporal position of each patch. The resulting sequence of embedding vectors $z^{(0)}_{(p,t)}$ for $p = 1, \ldots, N$ and $t = 1, \ldots, F$ represents the input to the Transformer, and plays a role similar to the sequences of embedded words that are fed to text Transformers in NLP. As in the original BERT Transformer (Devlin et al., 2018) [12], we add in the first position of the sequence a special learnable vector $z^{(0)}_{(0,0)} \in \mathbb{R}^{D}$ representing the embedding of the classification token.
Figure 3.1: The video self-attention blocks we investigate in this work. Each
attention layer implements self-attention (Vaswani et al., 2017b) on a spec-
ified spatiotemporal neighborhood of frame-level patches (see Figure 2 for
a visualization of the neighborhoods). We use residual connections to ag-
gregate information from different attention layers within each block. A
1-hidden-layer MLP is applied at the end of each block. The final model is
constructed by repeatedly stacking these blocks on top of each other.
where SM denotes the softmax activation function. Note that when attention
is computed over one dimension only (e.g., spatial-only or temporal-only), the
computation is significantly reduced. For example, in the case of spatial atten-
tion, only N + 1 query-key comparisons are made, using exclusively keys from
the same frame as the query:
\[
\alpha^{(\ell,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left( \frac{ q^{(\ell,a)\,\top}_{(p,t)} }{ \sqrt{D_h} } \cdot \Bigl[\, k^{(\ell,a)}_{(0,0)} \;\; \bigl\{ k^{(\ell,a)}_{(p',t)} \bigr\}_{p'=1,\ldots,N} \,\Bigr] \right) \tag{3.6}
\]
3.2.5 Encoding.
\[
s^{(\ell,a)}_{(p,t)} = \alpha^{(\ell,a)}_{(p,t),(0,0)}\, v^{(\ell,a)}_{(0,0)} + \sum_{p'=1}^{N} \sum_{t'=1}^{F} \alpha^{(\ell,a)}_{(p,t),(p',t')}\, v^{(\ell,a)}_{(p',t')} \tag{3.7}
\]
Then, the concatenation of these vectors from all heads is projected and
passed through an MLP, using residual connections after each operation:
\[
z'^{\,(\ell)}_{(p,t)} = W_O \begin{bmatrix} s^{(\ell,1)}_{(p,t)} \\ \vdots \\ s^{(\ell,A)}_{(p,t)} \end{bmatrix} + z^{(\ell-1)}_{(p,t)} \tag{3.8}
\]
The final clip embedding is obtained from the final block for the classification
token:
\[
y = \mathrm{LN}\!\bigl( z^{(L)}_{(0,0)} \bigr) \in \mathbb{R}^{D} \tag{3.10}
\]
The encoding $z'^{\,(\ell)\,\mathrm{time}}_{(p,t)}$ resulting from the application of Eq. 3.8 using temporal attention is then fed back for spatial attention computation instead of being passed to the MLP. In other words, new key/query/value vectors are obtained from $z'^{\,(\ell)\,\mathrm{time}}_{(p,t)}$ and spatial attention is then computed using Eq. 3.6. Finally, the resulting vector $z'^{\,(\ell)\,\mathrm{space}}_{(p,t)}$ is passed to the MLP of Eq. 3.9 to compute the final encoding $z^{(\ell)}_{(p,t)}$ of the patch at block $\ell$. For the model of divided attention, we learn distinct query/key/value matrices $\{W_Q^{(\ell,a)\,\mathrm{time}}, W_K^{(\ell,a)\,\mathrm{time}}, W_V^{(\ell,a)\,\mathrm{time}}\}$ and $\{W_Q^{(\ell,a)\,\mathrm{space}}, W_K^{(\ell,a)\,\mathrm{space}}, W_V^{(\ell,a)\,\mathrm{space}}\}$ over the time and space dimensions. Note that, compared to the $(NF + 1)$ comparisons per patch needed by the joint spatiotemporal attention model of Eq. 3.5, divided attention performs only $(N + F + 2)$ comparisons per patch. Our experiments demonstrate that this space-time factorization is not only more efficient but also leads to improved classification accuracy.
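To make this concrete with the configuration used in this work ($N = 196$, $F = 8$): joint space-time attention requires $NF + 1 = 1569$ query-key comparisons per patch, whereas divided attention requires only $N + F + 2 = 206$, roughly a $7.6\times$ reduction per patch.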
3.3 The Google MediaPipe
MediaPipe Solutions provides a suite of libraries and tools for you to quickly
apply artificial intelligence (AI) and machine learning (ML) techniques in your
applications. You can plug these solutions into your applications immediately,
customize them to your needs, and use them across multiple development plat-
forms. MediaPipe Solutions is part of the MediaPipe open source project, so you
can further customize the solutions code to meet your application needs. The MediaPipe Solutions suite includes the following libraries and resources, which provide the core functionality for each MediaPipe Solution:
• MediaPipe Model Maker: Customize models for solutions with your data.
The Face Landmarker solution detects the most prominent face in an input image, then estimates 478 3D facial landmarks and 52 facial blendshape scores in real time. This solution can be used to create a virtual try-on experience or a virtual avatar that mimics a person's facial expressions.
The MediaPipe Face Landmarker task lets you detect face landmarks and fa-
cial expressions in images and videos. You can use this task to identify human
facial expressions, apply facial filters and effects, and create virtual avatars. This
task uses machine learning (ML) models that can work with single images or a
continuous stream of images. The task outputs 3-dimensional face landmarks,
blendshape scores (coefficients representing facial expression) to infer detailed
facial surfaces in real-time, and transformation matrices to perform the trans-
formations required for effects rendering.
The Face Landmarker uses a series of models to predict face landmarks. The
first model detects faces, a second model locates landmarks on the detected
faces, and a third model uses those landmarks to identify facial features and
expressions.
The following models are packaged together into a downloadable model
bundle:
• Face detection model: detects the presence of faces with a few key facial
landmarks.
• Face mesh model: adds a complete mapping of the face. The model out-
puts an estimate of 478 3-dimensional face landmarks.
• Blendshape prediction model: receives output from the face mesh model and predicts 52 blendshape scores, which are coefficients representing different facial expressions.
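As an illustration of how this bundle is consumed, here is a minimal sketch based on the MediaPipe Tasks Python API; the model file name face_landmarker.task and the frame path are placeholders for the assets used in an actual deployment.

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Load the downloadable model bundle (path is a placeholder)
base_options = python.BaseOptions(model_asset_path='face_landmarker.task')
options = vision.FaceLandmarkerOptions(
    base_options=base_options,
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

# Run detection on a single frame (placeholder file name)
image = mp.Image.create_from_file('frame.png')
result = landmarker.detect(image)
print(len(result.face_landmarks[0]))     # 478 landmarks for the detected face
print(len(result.face_blendshapes[0]))   # 52 blendshape scores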
3.4 YOLO, You Only Look Once
YOLO was developed by Joseph Redmon and his team at the University of Washington and has become one of the most popular object detection algorithms used in computer vision applications [5].
We’ll capture frames from the webcam using OpenCV. This can be done using
the VideoCapture function in OpenCV.
import cv2

# Open the default webcam (device index 0)
cap = cv2.VideoCapture(0)

while True:
    ret, img = cap.read()
    cv2.imshow('Webcam', img)

    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
We install the ultralytics library that makes working with YOLO very easy and
hassle-free.
The YOLO model is loaded using the ultralytics library, specifying the location of the YOLO weights file, yolo-Weights/yolov8n.pt.
from ultralytics import YOLO

model = YOLO("yolo-Weights/yolov8n.pt")
The while loop starts and it reads each frame from the webcam using cap.read().
Then it passes the frame to the YOLO model for object detection. The results of
object detection are stored in the ‘results’ variable.
import cv2
from ultralytics import YOLO

# Open the webcam and load the YOLOv8 model
cap = cv2.VideoCapture(0)
model = YOLO("yolo-Weights/yolov8n.pt")

while True:
    ret, img = cap.read()
    results = model(img, stream=True)   # Run YOLO object detection on the frame

    cv2.imshow('Webcam', img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
For each result, the code extracts the bounding box coordinates of the de-
tected object and draws a rectangle around it using cv2.rectangle(). It also prints
the confidence score and class name of the detected object on the console.
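A minimal sketch of that drawing step, continuing the loop above (so model, results, and img are the objects already defined there), is shown below; model.names is the class-index-to-name mapping provided by ultralytics.

for r in results:
    for box in r.boxes:
        # Bounding-box coordinates of the detected object
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 255), 3)

        # Confidence score and class name, printed to the console
        confidence = float(box.conf[0])
        class_name = model.names[int(box.cls[0])]
        print("Class:", class_name, "Confidence:", round(confidence, 2))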
Chapter 4
Create a Dataset for TimeSformer Training
To ensure the effectiveness and reliability of the Exam Hall Surveillance System,
it was crucial to collect real-world video data representative of typical examina-
tion environments. For this purpose, video footage was captured in controlled
indoor settings simulating an exam hall. The setup included students seated in
spaced rows, performing regular activities such as writing, looking around, ad-
justing posture, or, in some staged cases, mimicking suspicious behavior.
To find faces, we used YOLOv8, which gives us the coordinates of each detected face. We then crop a region extended by 80% beyond the detected face box.
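One reading of this 80% extension (enlarge the detected box's width and height by 80%, split evenly on each side, then clip to the frame) is sketched below; the function name and the exact enlargement convention are assumptions, not the project's actual code.

def crop_face_extended(frame, x1, y1, x2, y2, factor=0.8):
    # Enlarge the YOLO face box by `factor` of its size, split evenly on each side
    h, w = frame.shape[:2]
    dx = int((x2 - x1) * factor / 2)
    dy = int((y2 - y1) * factor / 2)
    nx1, ny1 = max(0, x1 - dx), max(0, y1 - dy)
    nx2, ny2 = min(w, x2 + dx), min(h, y2 + dy)
    return frame[ny1:ny2, nx1:nx2]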
Figure 4.5: The crop should not be limited to the YOLO detection box, but extended by 80%.

In our work we need to annotate each frame of the captured video. This is a huge task, so we adopted a semi-automated method to annotate the frames. Each video has different parts to annotate. We used MediaPipe and OpenCV to annotate the video frames. For this prototype model we initially plan six classification labels, as given below.
Label Description
static Not much movement with time
right Steadily moving horizontally right
left Steadily moving horizontally left
up Steadily moving vertically up
down Steadily moving vertically down
angular Steadily moving non-axial direction
The first step imports the Face Mesh module from the MediaPipe library and assigns it the alias mp_face_mesh for convenience, as sketched below.
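A minimal sketch of that import, assuming the standard MediaPipe Python package; the FaceMesh options shown (in particular refine_landmarks=True, which enables the iris indices 468-477 used in the table below) are assumptions about the exact configuration.

import mediapipe as mp

# Face Mesh module from MediaPipe, aliased for convenience
mp_face_mesh = mp.solutions.face_mesh
face_mesh = mp_face_mesh.FaceMesh(
    static_image_mode=False,   # video stream, not independent images
    max_num_faces=1,
    refine_landmarks=True,     # adds iris landmarks (indices 468-477)
)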
Landmark Index
Chin Tip 199
Nose Tip 1
Left Eye Center 468
Left Eye Inner Corner 33
Right Eye Center 473
Right Eye Inner Corner 263
Left Pupil 469
Right Pupil 474
Left Mouth Corner 61
Right Mouth Corner 291
Now we process the input image (e.g., image_rgb) using MediaPipe Face Mesh to detect and extract 468 facial landmarks.

results = face_mesh.process(image_rgb)
image.flags.writeable = True   # Re-enable writing so annotations can be drawn on the image
print(results.multi_face_landmarks)
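To illustrate how the landmark indices from the table are read from this result, here is a minimal sketch; it assumes a single detected face and that image holds the original frame.

NOSE_TIP = 1   # index from the landmark table above

if results.multi_face_landmarks:
    face_landmarks = results.multi_face_landmarks[0]
    lm = face_landmarks.landmark[NOSE_TIP]
    # Landmark coordinates are normalized to [0, 1]; convert to pixels
    h, w = image.shape[:2]
    nose_x, nose_y = int(lm.x * w), int(lm.y * h)
    print("Nose tip at", (nose_x, nose_y))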
• Apply normalization (e.g., mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5] for
RGB).
from torchvision import transforms
from torch.utils.data import DataLoader

# Define transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),   # Resize frames
    transforms.ToTensor(),           # Convert to tensor
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),   # Normalize RGB
])

# Create datasets (FrameDataset is the custom dataset class used in this project)
train_dataset = FrameDataset(train_root_dir, num_frames=8, transform=transform)
val_dataset = FrameDataset(val_root_dir, num_frames=8, transform=transform)
test_dataset = FrameDataset(test_root_dir, num_frames=8, transform=transform)

# Create DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # Batch of 8 sequences
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

val_loader = DataLoader(
    val_dataset,
    batch_size=8,
    shuffle=False,       # No need to shuffle validation data
    num_workers=4,
    pin_memory=True,
)

test_loader = DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,       # No need to shuffle test data
    num_workers=4,
    pin_memory=True,
)
Use the pre-trained model from timm and modify it for our dataset:

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(),
                        lr=1e-4, weight_decay=1e-4)
# Create DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=8,        # Batch of 8 sequences
    shuffle=True, num_workers=4, pin_memory=True,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=8,
    shuffle=False,       # No need to shuffle validation data
    num_workers=4, pin_memory=True,
)
dict_items([(5, 83), (3, 214), (4, 238), (1, 293), (2, 81), (0, 91)]) ← Train Set
dict_items([(3, 32), (1, 56), (5, 14), (0, 15), (4, 35), (2, 5)]) ← Val Set
dict_items([(4, 68), (5, 23), (3, 61), (0, 25), (1, 82), (2, 22)]) ← Test Set
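The per-class counts shown above can be produced with a small helper; the sketch below assumes each FrameDataset item is a (frames, label) pair with an integer label, and simply iterates each split.

from collections import Counter

for name, dataset in [("Train", train_dataset), ("Val", val_dataset), ("Test", test_dataset)]:
    counts = Counter(int(label) for _, label in dataset)   # iterates the whole split
    print(f"{name} Set:", counts.items())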
from torchvision import transforms

# Define transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),   # Resize frames
    transforms.ToTensor(),           # Convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),   # Normalize with ImageNet statistics
])
• Dropout(p=0.0)
• Residual paths: Identity in Block 0, DropPath in Blocks 1–11
• Output layer: final LayerNorm on the embeddings, followed by a classification head Linear(768 → 6)
# Initialize the fine-tuned model
model = MyTimeSformer(
    img_size=224,
    patch_size=16,
    num_classes=400,      # Original number of classes (Kinetics-400)
    num_frames=8,
    attention_type='divided_space_time',
    pretrained_model='/home/hari/timesfrmr_env/TimeSformer/timesformer/models/TimeSformer_divST_8x32_224_K400.pyth',
    finetune_num_classes=6,   # Number of classes in the new dataset
)
epoch_loss = 0.0
all_labels = []
all_preds = []
progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}")

# Training step over one epoch (loop body follows the standard PyTorch pattern;
# a device variable is assumed to be defined)
for inputs, labels in progress_bar:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward pass and weight update
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()
model.eval()
with torch.no_grad():
    val_loss = 0
    correct = 0
    total = 0
    val_labels = []
    val_preds = []
• Precision = TP / (TP + FP) measures how many predicted positives are actually correct.
• Recall (Sensitivity) = TP / (TP + FN) indicates how well the model identifies actual positives.
• F1-Score = 2 × (Precision × Recall) / (Precision + Recall) balances precision and recall.
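Once true and predicted labels have been collected, these metrics can be computed directly with scikit-learn; y_true and y_pred below are placeholders for those lists.

from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro')   # macro-average over the six classes
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")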
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
Run the trained model on the test dataset and collect predictions.

model.eval()
all_labels = []
all_predictions = []
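A minimal sketch of the collection loop, assuming test_loader yields (frames, label) batches and that a device variable is already defined, followed by the confusion-matrix plot using the imports listed earlier:

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs.to(device))
        preds = outputs.argmax(dim=1).cpu()
        all_predictions.extend(preds.numpy())
        all_labels.extend(labels.numpy())

cm = confusion_matrix(all_labels, all_predictions)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()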
Once the training process of a machine learning model is complete, saving the
trained model is a crucial final step. It preserves the learned parameters, archi-
tecture, and performance optimizations so that the model can be reused without
the need to retrain from scratch. This is especially important in video-based tasks
like surveillance, where training deep models like TimeSformer is computation-
ally intensive and time-consuming.
In this project, after evaluating the model’s performance using metrics such
as accuracy, confusion matrix, and loss curves, the best-performing version of the
trained model was serialized and stored using formats such as .pth (for PyTorch)
or .h5 (for Keras/TensorFlow). Along with the model weights, metadata such
as class labels, input size, and preprocessing requirements were also saved to
ensure seamless deployment and inference later.
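As an illustration of this step, a minimal PyTorch sketch follows; the file name and the exact metadata fields are placeholders rather than the project's actual artifact names.

import torch

checkpoint = {
    "model_state_dict": model.state_dict(),
    "class_labels": ["static", "right", "left", "up", "down", "angular"],
    "input_size": (8, 3, 224, 224),   # frames x channels x height x width
    "normalization": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]},
}
torch.save(checkpoint, "timesformer_exam_surveillance.pth")

# Later, for inference:
# checkpoint = torch.load("timesformer_exam_surveillance.pth", map_location="cpu")
# model.load_state_dict(checkpoint["model_state_dict"])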
Saving the model allows for future use in real-time monitoring systems, fine-
tuning with new data, or further research. It also ensures reproducibility—an
essential principle in scientific and academic work.
Chapter 6
We recorded each video clip for approximately 4 seconds at 40 frames per second, resulting in a total of 4 × 4 = 16 seconds of footage per individual. This provided us with well-labelled data suitable for training our model based on the TimeSformer framework, as shown in Figure 6.1.
Figure 6.1: Face moving towards the left; the nose is the clear indicator of movement, not the eyeball.
This new dataset was then used to train our model, with a minor modification to the output layer. The confusion matrices of the training set and test set are given in Figures 6.2 and 6.3 (Loss: 0.2278, AUC-ROC: 0.9962, Accuracy: 0.9598; Validation Loss: 0.2236, Accuracy: 99.02%).
Figure 6.3: Epoch 11: Confusion Matrix Test Set. Loss: 0.2278, AUC-ROC: 0.9962, Accuracy: 0.9598; Validation Loss: 0.2236, Accuracy: 99.02%.
Chapter 7
Discussion and Conclusion
The development of a smart surveillance system for exam halls is a timely re-
sponse to the increasing need for enhanced fairness, integrity, and security in
academic environments. In this project, we designed a real-time video analytics
framework utilizing state-of-the-art computer vision models like TimeSformer,
YOLO, and Google MediaPipe. These tools work together to provide compre-
hensive visual understanding of the exam hall environment, capturing student
activities, identifying suspicious behavior, and ensuring fairness in real-time.
Throughout the course of implementation, we faced a number of challenges,
including noisy environments, low-resolution video feeds, and the real-time pro-
cessing requirement. Each of these posed unique technical problems that needed
to be resolved through a combination of model fine-tuning, optimization strate-
gies, and efficient dataset curation. The integration of multiple models was not
straightforward, as we had to manage inter-model latency, ensure seamless data
flow between detection and classification stages, and handle a wide range of hu-
man postures and interactions.
The choice of TimeSformer was crucial, as it brought the power of transformer-
based architectures into video understanding. Unlike traditional CNN-based
models, TimeSformer leverages self-attention mechanisms along both spatial
and temporal dimensions. This made it particularly effective in recognizing sub-
tle temporal patterns of student behavior, such as frequent turning of heads, pro-
longed glances in a particular direction, or abnormal postures that may indicate
cheating.
YOLO was primarily employed for object detection. It allowed us to detect
unauthorized materials such as mobile phones, calculators, or slips of paper that
students may attempt to use unfairly. Its real-time processing speed, even on
modest hardware, made it an indispensable component of our system. YOLO’s
bounding box outputs also facilitated region-of-interest tracking, which could
then be further analyzed using TimeSformer for behavioral context.
Google MediaPipe was used for pose estimation, enabling us to capture body
landmarks of students across frames. This added a valuable layer of inter-
pretability to our system. For example, MediaPipe helped in distinguishing be-
tween normal writing behavior and suspicious posture shifts. Furthermore, com-
bining pose information with TimeSformer predictions allowed us to improve
the reliability of our behavior classifier through multimodal fusion.
One of the strengths of this project lies in the design of the dataset. Real-
world footage, curated and annotated with help from friends and volunteers,
ensured that the training data was representative of actual exam environments.
Diverse camera angles, lighting conditions, and student postures enriched the
dataset, making the model robust to domain variations.
During the evaluation phase, we achieved a high classification accuracy for
distinguishing between normal and suspicious behaviors. The confusion matrix
indicated that our system could accurately detect potential cheating events while
maintaining a low false-positive rate. However, some borderline cases—such
as students stretching or adjusting their seating—occasionally triggered false
alerts. This suggests that a soft probabilistic alerting system, rather than a binary
classifier, might be more appropriate in future iterations.
Another important insight came from latency measurement. While our mod-
els perform well individually, the end-to-end pipeline introduced cumulative de-
lays. To address this, we experimented with lightweight alternatives and prun-
ing strategies. Future work could benefit from deploying models on edge devices
like NVIDIA Jetson to further reduce reliance on cloud-based processing and in-
crease scalability.
Ethical considerations were central to the project. Surveillance in academic
settings raises questions about privacy and student comfort. Therefore, we en-
sured that all participants were informed and consented to video data collection.
Moreover, the system is designed to assist invigilators, not replace them. Final
decisions are left to human supervisors, and AI-generated alerts are only sug-
gestive, not definitive.
An unexpected benefit of the system is its potential for post-exam analysis.
Stored surveillance data, tagged and classified, could help academic authorities
analyze overall student behavior patterns and even improve seating arrange-
ments or room layouts for future exams. This secondary use of data opens up
new avenues for academic administration and behavioral research.
We also found that many current open-source tools, while powerful, lacked
integration-friendly documentation. This prompted us to write several custom
wrappers and pipeline orchestrators to link models efficiently. These contribu-
tions can be shared with the community in future open-source releases, con-
tributing to the body of work on surveillance AI.
On a broader note, this work touches upon the larger debate around au-
tomation in educational monitoring. While AI can enhance efficiency and re-
duce human error, it must be deployed responsibly. Trust, transparency, and
explainability are essential features of any system that makes high-stakes judg-
ments, especially in academic contexts. In this spirit, all components of our
system provide visual outputs—bounding boxes, pose overlays, and attention
heatmaps—to explain why certain behaviors were flagged [8].
In conclusion, this project demonstrates a successful fusion of modern deep
learning models into a real-world application with societal relevance. The exam
hall surveillance system is not just a technological solution; it is a tool aimed at
reinforcing academic fairness and reducing stress on human invigilators. Through
careful model selection, dataset preparation, and thoughtful system design, we
have built a robust prototype capable of real-time action recognition and object
detection in a high-stakes environment.
Looking forward, there are several exciting directions to explore. One is
the addition of audio-based analysis—whisper detection or speech recognition
could be used to detect collaboration between students. Another is integrating
the system into a multi-camera network where spatiotemporal behavior can be
tracked across rooms. Furthermore, reinforcement learning could be used to
adaptively tune alert thresholds based on invigilator feedback over time.
We also envision deploying this system on embedded edge devices, such
as Jetson Nano or Raspberry Pi clusters, enabling cost-effective deployment
in resource-constrained institutions. Federated learning approaches could help
preserve student privacy while continually improving the model across multiple
deployments.
This work serves as a foundation. It merges multiple domains—computer
vision, machine learning, real-time systems, and ethics—to address a persistent
real-world challenge. While there is still work to be done in terms of robust-
ness, fairness, and deployment, the results are promising and demonstrate the
potential of AI in reimagining educational infrastructure. We hope that this sys-
tem inspires further interdisciplinary research and responsible AI practices in
academic institutions across the globe.
References
11. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object De-
tection. In: Proceedings of the IEEE International Conference on Computer
Vision (ICCV); 2017. p. 2980–2988.