Industrial Oriented Mini Project Doc Template
Industrial Oriented Mini Project Doc Template
ON
(Size=14, Times New Roman)
AI-DRIVEN PHISHING DETECTION TOOL
(Size=16, Times New Roman)
submitted in partial fulfillment of the requirement. for the award of the degree of
(Size=14, Times New Roman,Italic)
BACHELOR OF TECHNOLOGY IN
Name1 22P61A05**
Name2 22P61A05**
Name3 22P65A05**
GUIDE NAME
Designation, Dept. of CSE
May-2025
i
DECLARATION
(Size=16, Times New Roman)
This is a record of bonafide work carried out by us and the results embodied in this
project have not been reproduced or copied from any source. The results embodied in this
project report have not been submitted to any other university or institute for the award
of any other degree or diploma.(Size=12, Times New Roman, Line spacing=1.5lines)
Name1 22P61A05**
Name2 22P61A05**
Name3 22P65A05**
i
Aushapur (V), Ghatkesar (M), Hyderabad, Medchal – Dist, Telangana – 501 301.
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that the industrial oriented mini project titled “AI-Driven Phishing
Detection Tool” Submitted by Name-1 (22P61A05**), Name-2 (22P61A05**), Name-
3 (22P65A05**) B. Tech, III- II semester, Department of Computer Science &
Engineering is a record of the bonafide work carried out by them.
The Design embodied in this report have not been submitted to any other University
for the award of any degree.
EXTERNAL EXAMINER
ii
ACKNOWLEDGEMENT
We are extremely thankful to our beloved Chairman, Dr. N. Goutham Rao and
Secretary, Dr. G. Manohar Reddy who took keen interest to provide us the
infrastructural facilities for carrying out the project work.
Self-confidence, hard work, commitment and planning are essential to carry out any
task. Possessing these qualities is sheer waste, if an opportunity does not exist. So, we
whole- heartedly thank Dr. P.V.S. Srinivas, Principal, and Dr. Dara Raju, Head of the
Department, Computer Science and Engineering for their encouragement and support
and guidance in carrying out the project.
We would like to express our indebtedness to the Overall Project Coordinator, Dr. M.
Venkateswara Rao, Professor, and Section Coordinators, Ms. P. Suvarna Puspha,
Associate Professor, Ms. A. Manasa, Associate Professor, Department of CSE, for their
valuable guidance during the course of project work.
We would like to express our sincere thanks to all the staff of Computer Science and
Engineering, VBIT, for their kind cooperation and timely help during the course of our
project. Finally, we would like to thank our parents and friends who have always stood
by us whenever we were in need of them.
ABSTRACT
The growing sophistication of cyber threats has led to an urgent need for intelligent,
real-time solutions capable of detecting and preventing phishing attacks. This study
presents an AI-Driven Phishing Detection System that leverages advanced machine
learning and deep learning methodologies to identify fraudulent emails and malicious
URLs with high accuracy. The system integrates Natural Language Processing (NLP)
for sentiment and intent analysis to uncover deceptive textual patterns commonly used in
phishing content. Convolutional Neural Networks (CNNs) are employed to detect
structural and visual anomalies in phishing websites, while Recurrent Neural Networks
(RNNs) analyze sequential patterns within email content to recognize suspicious
behavioural cues. To enhance detection accuracy and system resilience, classical
algorithms such as Random Forest and Logistic Regression are used for robust feature-
based analysis. An intuitive web-based interface provides seamless front-end and back-
end integration, allowing for real-time monitoring, threat visualization, and early
response mechanisms. This hybrid approach enables a comprehensive understanding of
phishing indicators across multiple data modalities, thereby improving predictive
capabilities and minimizing false positives. The system is trained on diverse phishing
datasets containing annotated emails and URLs, ensuring adaptability across various
real-world attack scenarios. Experimental evaluations demonstrate the model's
effectiveness in real-time threat detection, achieving high precision and recall. By
delivering an automated, intelligent phishing defense solution, the proposed system
enhances cybersecurity posture and empowers users with proactive digital protection.
Future enhancements will explore integration with browser extensions and cloud-based
threat intelligence to support broader deployment and continuous learning.
Keywords:
AI-driven phishing detection, cybersecurity, deep learning, malicious URLs, NLP,
CNNs, RNNs, Random Forest.
iv
VISION
To become, a Center for Excellence in Computer Science and Engineering with a
focused Research, Innovation through Skill Development and Social Responsibility.
MISSION
DM-2: Impact the skills necessary to amplify the pedagogy to grow technically and to
meet interdisciplinary needs with collaborations.
DM-3: Inculcate the habit of attaining the professional knowledge, firm ethical
values,
innovative research abilities and societal needs.
v
PROGRAM SPECIFIC OUTCOMES (PSOs)
PSO-01: Ability to explore emerging technologies in the field of computer science and
engineering.
PSO-03: Ability to gain knowledge to work on various platforms to develop useful and
secured applications to the society.
PO-02: Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
PO-05: Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
vi
PO-06: The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO-08: Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO-12: Life-long learning: Recognize the need for, and have the preparation and ability
to engage in independent and life-long learning in the broadest context of technological
change.
a) PO Mapping:
PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
Title 3 3 3 2 2 3 3 3 3 2 2 3
b) PSO Mapping:
vii
List of Figures
S.no. Title Page no.
viii
List of Tables
S.no. Title Page no.
1 Test cases 47
2 Accuracy Comparison 61
3 Precision Comparison 63
4 Recall Comparison 65
5 F-1 Score Comparison 67
6 Comparison of Accuracy 69
7 Comparison of Precision 70
8 Recall Comparison 72
9 F-1 Score Comparison 74
ix
Nomenclature
AI Artificial Intelligence
ML Machine learning
DL Deep Learning
ML Machine Learning
NLP Natural Language Processing
CNN Convolutional Neural Network
RNN Recurrent Neural Network
URL Uniform Resource Locator
HTTP HyperText Transfer Protocol
HTTPS HyperText Transfer Protocol Secure
x
TABLE OF CONTENTS
CONTENTS PAGE NO
Declaration ii
Certificate iii
Acknowledgements iv
Abstract v
Vision & Mission vi
List of Figures ix
List of Tables x
Nomenclature xi
Table of Contents xii
CHAPTER 1:
INTRODUCTION 1-7
1.1 Introduction to AI-Driven Phishing Detective tool 2
1.2 Motivation 4
1.3 Existing System 4
1.4 Proposed System 5
1.5 Problem definition 5
1.6 Objective 6
1.7 Scope 7
CHAPTER 2:
LITERATURE SURVEY 8-12
CHAPTER 3:
REQUIREMENT ANALYSIS 13-15
xi
3.3 Non-Functional Requirements 15
3.4 System Analysis 15
CHAPTER 4:
SYSTEM DESIGN 16-23
4.1 Technical Blueprint of AI-Driven Phishing 17
4.2 Sequence Diagram to represent Phishing URL Detection 19
4.3 Flow control of the system 21
CHAPTER 5:
IMPLEMENTATION 24
5.1 Explanation of key functions 25
5.3 MODULEs 33
xii
CHAPTER 6:
TESTING & VALIDATION 41-47
6.1 Testing process 42
6.1.1 Test planning 42
6.1.2 Test design 43
6.1.3 Test execution 44
6.1.4 Test reporting 44
6.2 Test cases 45
CHAPTER 7:
OUTPUT SCREENS 46-53
CHAPTER 8:
CONCLUSION AND FUTURE SCOPE 75-78
8.1 Conclusion 76
8.2 Future Enhancement 78
REFERENCES 79-80
xii
i
CHAPTER – 1
(Size=30, Times New Roman)
INTRODUCTION
(Size=25, Times New Roman)
1
CHAPTER – 1
(Size=16, Times New Roman)
INTRODUCTION
(Size=16, Times New Roman)
1.1 INTRODUCTION TO SMART SURVEILLANCE SYSTEM
(Size=14, Times New Roman)
The The rapid advancement of digital communication has significantly increased
the risks associated with cyber threats, particularly phishing attacks. As online
interactions and transactions continue to grow, cybercriminals have developed
increasingly sophisticated techniques to deceive users and exploit vulnerabilities.
Phishing is one of the most prevalent and dangerous cyber threats, where attackers
impersonate legitimate entities to trick individuals into revealing sensitive information
such as login credentials, banking details, and personal data. These attacks often involve
fraudulent emails, malicious websites, or deceptive messages containing links that lead
to compromised web pages designed to steal user information. Due to the evolving
nature of phishing tactics, traditional security measures are struggling to provide
adequate protection, leaving users and organizations vulnerable to data breaches, identity
theft, and financial fraud. (Spacing between each line =1.5’)
2
patterns with high accuracy. Unlike traditional methods, AI models can learn from
historical and real-time data, continuously improving their ability to detect evolving
phishing tactics. These models assess multiple aspects of an email, including its textual
content, sender behaviour, metadata, and embedded links, to determine whether it poses
a security threat. Machine learning techniques allow the system to recognize subtle
differences between legitimate and fraudulent messages, making it significantly more
reliable and adaptive [3] compared to conventional security measures.
By automating the phishing detection process, this system minimizes human error
and reduces the risk of falling victim to cyber threats. Unlike manual security reviews,
which are time-consuming and prone to oversight, AI-driven detection operates in real
time, enabling immediate identification and response to potential phishing attempts. The
system not only flags suspicious emails but also provides actionable insights and threat
mitigation strategies, allowing users and organizations to respond proactively. This
reduces financial losses, protects sensitive information, and strengthens overall
cybersecurity resilience. Furthermore, the scalability of the system ensures its
applicability across different user groups, including individuals, enterprises, and
government agencies, making it a versatile and effective solution for combating phishing
threats. The integration of AI-driven threat intelligence allows the system to stay ahead
of emerging cyber threats, continuously adapting to new attack vectors and enhancing its
detection capabilities. As phishing attacks continue to evolve, the implementation of an
AI-based phishing detection system serves as a crucial step toward building a more
3
secure and resilient digital environment.
1.2 MOTIVATION
Phishing attacks remain one of the most significant cybersecurity threats, with
attackers constantly refining their methods to bypass traditional security mechanisms.
The increasing sophistication of phishing emails, coupled with the rise in social
engineering tactics, makes manual detection inefficient and prone to errors.
Organizations and individuals frequently fall victim to such attacks, resulting in data
breaches, financial losses, and reputational damage.
This system aims to not only detect phishing attacks but also provide insights into
emerging threats, allowing organizations to continuously refine their security measures.
The ultimate goal is to minimize the impact of phishing attacks, protect sensitive
information, and foster a safer digital environment for individuals and businesses alike.
Static Rule-Based Filters: Many email security solutions use predefined rules to
identify phishing emails. However, these rules quickly become outdated as attackers
develop new tactics to bypass detection mechanisms.
Blacklisting: Many systems rely on maintaining lists of known malicious domains. This
approach fails when attackers use new, previously unseen domains to launch phishing
campaigns.
Manual Detection: Human analysts are required to verify phishing attempts, which is
time-consuming and inefficient, especially given the volume of emails received daily.
4
Lack of Adaptability: Traditional systems struggle to adapt to evolving phishing
techniques, making them less effective against sophisticated attacks.
Natural Language Processing (NLP): By examining linguistic patterns, the system can
detect suspicious emails that use social engineering tactics.
Adaptive Learning: The system continuously updates its detection algorithms based on
newly identified phishing trends, ensuring robustness against evolving threats.
5
The key challenges include:
High False Positive and False Negative Rates: Existing methods often misclassify
legitimate emails as phishing and vice versa, reducing efficiency.
Scalability Issues: With the increasing volume of digital communication, manual and
rule-based detection methods fail to scale effectively.
1.6 OBJECTIVE
The primary objective of the proposed system is to enhance cybersecurity by
providing real-time detection and prevention of phishing attacks using AI-driven
techniques.
High-Accuracy Detection: Detect phishing emails and malicious links with deep
learning models.
Real-Time Response: Provide real-time alerts and threat mitigation strategies to users
and organizations.
By achieving these objectives, the system will contribute to a more secure digital
ecosystem, protecting users from cyber threats and minimizing the impact of phishing
attacks.
6
1.7 SCOPE
The scope of the AI-Based Phishing Detection System extends across multiple
domains, including:
Financial Sector: Preventing fraudulent activities such as banking scams and financial
phishing attacks.
7
CHAPTER – 2
LITERATURE SURVEY
8
CHAPTER – 2
LITERATURE SURVEY
2.1 A COMPREHENSIVE STUDY ON AI BASED DETECTION
SYSTEMS
The escalating threat of phishing attacks has necessitated the development of advanced
detection systems. As cybercriminals continually refine their methods, traditional security
approaches struggle to keep pace. The rise of Artificial Intelligence (AI)-driven solutions has
significantly improved the accuracy and efficiency of phishing detection mechanisms. AI
techniques, particularly Machine Learning (ML) and Deep Learning (DL), have enabled
proactive threat identification by analyzing patterns in phishing attempts and adapting to new
attack strategies. This literature survey as summarized in the Table 2.1, examines current
methodologies, challenges, and innovations in AI-driven phishing detection systems, with a
focus on feature selection, model adaptability, and real-time detection capabilities.
9
Park and Kim (2024) conducted a meta-analysis of phishing detection models,
synthesizing findings from 50 research papers [4]. Their study highlights that deep
ensemble methods provide the best trade-off between accuracy and interpretability. The
research also discusses the limitations of traditional phishing detection techniques and
the growing role of AI in improving resilience against cyber threats.
10
enabled phishing attacks. The paper also discusses the ethical implications of AI-
generated phishing and the need for AI-driven countermeasures to detect synthetic
threats.
Huang and Patel (2023) examined hybrid AI models that integrate NLP, ML, and
graph-based detection techniques for phishing email analysis [42]. The study highlights
that hybrid approaches improve accuracy but require careful feature selection. The
authors also discuss the impact of graph-based techniques in detecting phishing email
patterns.
11
Divakaran and Oest (2022) explore ML and DL models for phishing detection,
discussing various data types and their respective advantages and disadvantages [15].
They present multiple deployment options to detect phishing attacks, highlighting the
need for continuous adaptation to counter rapidly evolving phishing strategies. The study
also provides an in-depth comparison of supervised and unsupervised learning
techniques, demonstrating their suitability for different phishing detection use cases.
Chen and Lee (2022) analysed the role of deep learning in phishing detection,
reviewing convolutional and recurrent neural networks [17]. The study finds that CNNs
and RNNs perform well in URL-based phishing detection but require large labelled
datasets for optimal performance. The authors also highlight the limitations of deep
learning models in detecting zero-day phishing attacks and propose semi-supervised
learning techniques as a potential solution.
Aleroud and Zhou (2020) review AI techniques, including ML, DL, Hybrid
Learning, and Scenario-based methods, for phishing attack detection [19]. The study
highlights the effectiveness of these approaches in identifying phishing activities but
notes challenges such as the need for large datasets and the adaptability of models to
evolving phishing tactics. Additionally, the paper discusses the integration of multiple AI
models to improve detection accuracy and mitigate false positives.
12
detection performance but acknowledges the challenge of maintaining accuracy with
evolving phishing techniques. The authors also analyse the trade-offs between
computational efficiency and detection accuracy.
13
CHAPTER – 3
REQUIREMENT ANALYSIS
14
CHAPTER – 3
REQUIREMENT ANALYSIS
CPU (Central Processing Unit): Intel Core i5/i7 or equivalent multi-core processors
ensure smooth execution of machine learning models and data processing tasks.
RAM (Random Access Memory): A minimum of 8GB RAM is required for handling
phishing data analysis, with 16GB preferred for optimal performance.
Storage: 32GB to 128GB storage capacity is necessary to store datasets, models, and
logs efficiently.
Operating System: The tool is compatible with Windows 10/11 and macOS
environments.
Programming Languages:
Python: Used for machine learning, data processing, and backend development.
15
Other Tools:
Phishing Detection: The system must analyse URLs, emails, and website content to
detect phishing attempts in real-time.
Machine Learning Model Integration: The backend must support the training and
deployment of ML models for classification.
Performance: The system must provide phishing detection results in under 2 seconds
for a seamless user experience.
Reliability: Ensures accurate detection with minimal false positives and negatives.
Security: Implements encryption for sensitive data and follows best practices for secure
database management.
16
CHAPTER - 4
SYSTEM ANALYSIS & DESIGN
17
CHAPTER – 4
18
to the user, allowing for real-time phishing detection in a seamless and automated
manner.
From the administrative perspective, the admin plays a crucial role in enhancing the
system’s efficacy and accuracy. The process begins with the uploading of datasets that
contain a mix of known phishing and legitimate URLs. The uploaded data undergoes
preprocessing, where it is cleaned, structured, and formatted to optimize learning. The
system is then trained using state-of-the-art machine learning and deep learning
algorithms, such as Random Forest, Ada Boost, Neural Networks, and CNN-based
approaches. The performance of the trained model is then rigorously evaluated using
various metrics, including accuracy, precision, recall, and F1-score, to ensure its
robustness before the final model is deployed for real-time URL predictions. This
continuous training and evaluation cycle helps in improving the model's ability to detect
new and evolving phishing threats with higher accuracy.
The use case diagram also highlights the importance of automation and efficiency in
phishing detection. By leveraging machine learning techniques, the system learns from
new data over time, making it more adept at identifying emerging phishing tactics. The
admin’s ability to monitor, train, and update the model ensures that the system remains
up-to-date and effective against sophisticated cyber threats. Additionally, the automated
nature of the detection process reduces the manual effort required for URL analysis,
making the system not only efficient but also scalable for large-scale implementation.
Beyond just serving as a technical blueprint, the use case diagram also fosters
collaboration among stakeholders. Developers gain valuable insights into how the system
should be implemented and integrated, while end-users and administrators gain a clear
understanding of its role in protecting them from cyber threats. This structured
visualization ensures that all stakeholders are aligned in achieving the system's primary
goal: enhancing cybersecurity by preventing phishing attacks effectively. By clearly
defining the roles of users and administrators, along with the interactions between them
and the system, the use case diagram bridges the gap between system design and real-
world application.
Ultimately, the AI-Driven Phishing Detection System use case diagram, as
illustrated in Fig. 4.1, serves as a fundamental tool in the design and development of a
secure, intelligent, and user-friendly anti-phishing system. With phishing attacks
becoming increasingly sophisticated, the demand for automated, AI-driven cybersecurity
solutions has never been greater. This diagram encapsulates the entire functionality of
19
the system, ensuring that it remains a valuable and reliable asset in the ongoing fight
against online fraud, phishing scams, and malicious cyber activities. By providing a
clear, structured, and user-centric approach to phishing detection, the use case diagram
serves as a cornerstone in building a highly secure digital environment for users
worldwide.
20
portrayal provides a structured view of the system's workflow, allowing stakeholders to
observe the active participation and responsibilities of each component in the phishing
detection process.
The interactions between these lifelines are represented through messages,
illustrating the flow of information and control. For example, the sequence diagram for
this system begins when a User submits a URL for phishing detection. The Phishing
Detection System preprocesses the submitted URL and extracts relevant features, such as
URL length, domain age, presence of suspicious keywords, and HTTPS status. These
extracted features are then sent to the Machine Learning Model, which utilizes them to
predict whether the URL is legitimate or a phishing attempt.
For enhanced accuracy, the Machine Learning Model forwards the extracted
features to the CNN Model, which processes the URL with deep learning techniques to
detect complex patterns and refine the phishing prediction. Once the CNN Model
generates an enhanced prediction, the refined result is returned to the Machine Learning
Model, which then sends the final phishing status back to the Phishing Detection System.
The detection result is displayed to the User, providing immediate feedback on whether
the submitted URL is safe or a phishing threat.
Additionally, the Admin plays a crucial role in maintaining and improving the
system's accuracy. If necessary, the admin initiates a model training process to ensure
that the system stays updated with new phishing trends. This involves updating the
training dataset with new phishing URLs and retraining the CNN Model, leading to an
enhancement of the model’s detection capabilities. The updated model is then integrated
into the system, ensuring that future phishing detection is more accurate and robust.
Sequence diagrams are invaluable for identifying dependencies and optimizing the
interactions within a system. For the Phishing URL Detection System, the diagram
highlights potential areas for performance improvement, such as reducing false positives
or enhancing the real-time processing of phishing URLs. It also helps pinpoint
bottlenecks in the workflow, such as delays in feature extraction or CNN processing, and
provides actionable insights for addressing these challenges effectively.
Additionally, sequence diagrams serve as crucial documentation for development
teams, offering a clear blueprint for implementation. By visually mapping out the
interactions and their chronological order, these diagrams ensure alignment between
design intentions and system development, facilitating better communication and
collaboration among stakeholders. They provide a shared understanding of system
21
behavior, ensuring that all parties involved—technical and non-technical—are on the
same page regarding the system’s design and functionality.
The sequence diagram as shown in Fig. 4.2 for the Phishing URL Detection System
is a dynamic and intuitive visualization tool that enhances comprehension and aids in the
development of a robust, efficient, and user-centric platform. By depicting the intricate
flow of interactions over time, it not only streamlines the design process but also
contributes to creating a reliable and effective system for detecting and preventing
phishing attacks in real-time.
22
learning and machine learning algorithms. These diagrams provide a visual
representation of the sequential flow of tasks, decision points, and interactions among
components, enabling a detailed examination of how the system operates and interacts
with various actors. By offering a structured view of activities, activity diagrams
facilitate clear communication among stakeholders, including developers, analysts, and
system designers, ensuring alignment and shared understanding of the system’s
functionality.
One of the strengths of activity diagrams is their ability to represent the flow of
control and data in a standardized and consistent manner. By adhering to UML
conventions, these diagrams ensure clarity in documenting and analyzing system
behavior, which is particularly beneficial for a project involving advanced machine
learning techniques. The diagram for this project illustrates how URLs progress through
different phases of the system, starting with dataset collection, followed by
preprocessing, and then being split into training and testing datasets. The feature
extraction phase plays a critical role in identifying key URL attributes such as domain
age, HTTPS presence, and keyword analysis. These features are then used to apply
machine learning and deep learning algorithms, leading to classification and final URL
prediction. The accuracy of the result is assessed at the final stage to validate the model's
effectiveness.
Activity diagrams also serve as a tool for simulating system behavior and evaluating
different scenarios. For instance, they can be used to analyze how the system performs
with different datasets, varying URL patterns, or new phishing threats. This capability is
crucial for refining the system’s performance and ensuring its reliability in real-world
applications. Moreover, by visualizing potential bottlenecks or inefficiencies,
23
stakeholders can identify areas for optimization and make informed decisions to enhance
the system's efficiency and effectiveness.
In the context of the Phishing URL Detection System, activity diagrams play a key
role in bridging the gap between technical and non-technical stakeholders. They provide
a clear, visual narrative of the system’s operations, making it easier for all parties to
collaborate on system design, requirement definition, and performance evaluation. This
shared understanding ensures that the final system meets the needs of both end-users and
technical teams responsible for its implementation.
Ultimately, activity diagrams for the Phishing URL Detection System are invaluable
for capturing the intricacies of its workflows while maintaining clarity and simplicity, as
detailed in Fig. 4.3. They serve not only as documentation tools but also as instruments
for analysis and design, enabling the development of a robust, efficient, and user-centric
phishing detection system. By detailing the interactions, tasks, and decision points within
the system, these diagrams provide a roadmap for both the current implementation and
future enhancements, ensuring the system’s long-term success and adaptability.
24
CHAPTER - 5
IMPLEMENTATION
25
CHAPTER – 5
IMPLEMENTATION
The first step in phishing detection is processing the input URL and extracting key
attributes that provide insights into its legitimacy. The system extracts features based on
various aspects such as protocol type (HTTP or HTTPS), domain structure (length,
subdomains), URL entropy, and phishing-specific keywords. This step is crucial as it
ensures that the input data is structured correctly before being fed into the classification
model.
Handling Missing Values: The system uses fillna() to address missing or incomplete
values, ensuring that all required features are available for training and prediction.
26
input, categorical features are converted into numerical representations using one-hot
encoding or label encoding.
By performing these preprocessing steps, the system ensures that data is well-prepared
for phishing classification while minimizing the risk of false predictions due to noisy or
missing data.
At the core of the system is a Random Forest Classifier, which is trained to differentiate
between legitimate and phishing URLs. Random Forest is an ensemble learning method
that constructs multiple decision trees and aggregates their outputs to make a more
accurate classification. The classifier takes extracted URL features as input and applies a
series of decision trees to predict whether the URL is safe or suspicious.
Feature Extraction & Input Processing: The extracted features are normalized and fed
into the Random Forest model for classification.
Training the Model: The system is trained using a dataset containing a mix of phishing
and legitimate URLs to learn patterns associated with malicious behaviour.
Prediction: Once trained, the model classifies input URLs as either "Legitimate" (Safe)
or "Suspicious" (Potentially Malicious) based on learned patterns.
The Random Forest model was selected for its ability to handle high-dimensional
data efficiently, its robustness against overfitting, and its interpretability compared to
deep learning models. Additionally, other machine learning models like Support Vector
Machines (SVM), Logistic Regression, and Neural Networks can be integrated for
further improvements in detection accuracy.
27
Attack Type Identification
E-commerce Scams: Fake shopping websites designed to trick users into entering
payment details.
Credential Stealing Attacks: Phishing websites that disguise themselves as login pages
for popular services (e.g., Gmail, Facebook, PayPal).
Social Media Fraud: URLs used to impersonate social media platforms, often used for
spreading malware or scams.
To achieve accurate attack type classification, the system uses predefined rule-based
patterns combined with machine learning classifiers that detect suspicious words, domain
typos, and unusual URL structures.
After training, the model undergoes a thorough evaluation using standard performance
metrics to measure its effectiveness in phishing detection. The system analyses the
following:
Precision: Evaluates how many of the URLs labelled as phishing are actually phishing
threats.
Recall: Measures how many actual phishing URLs were correctly identified.
To further enhance interpretability, the system visualizes phishing trends and model
28
performance using Seaborn and Matplotlib. These visualizations help in fine-tuning the
model by identifying patterns in misclassified URLs and optimizing feature selection
accordingly.
The system features an interactive Streamlit-based UI, allowing users to input URLs for
real-time risk assessment. The prediction results are displayed clearly, categorizing
URLs as safe or potentially malicious based on the classification model’s decision.
Additionally, the system generates detailed reports in PDF format using FPDF,
summarizing the following:
Attack Type Classification: If detected, the phishing category is included in the report.
This automated reporting feature enhances user awareness by providing clear, actionable
insights regarding the detected phishing threats.
High Detection Accuracy: By utilizing Random Forest Classifier, the system achieves
significantly higher accuracy compared to traditional blacklist-based approaches. Instead
of relying on predefined lists of phishing URLs, the system analyses each URL
dynamically, allowing it to detect zero-day attacks effectively.
User-Friendly and Scalable: The system is designed for easy accessibility via a cloud-
based implementation, removing the need for high-end local hardware. Its modular
architecture as shown in the Fig 5.1 ensures future expansion, enabling the integration of
advanced features like WHOIS lookups, domain age analysis, and deep learning models
for improved detection.
29
Fig 5.1: System Architecture Diagram
30
5.2 METHOD OF IMPLEMENTATION
The implementation of the AI-Driven Phishing Detective Tool is structured to ensure a
seamless, real-time phishing detection experience using Python and Streamlit, a
lightweight web framework. The tool integrates machine learning (ML) models, URL
feature extraction, attack classification, and automated report generation within a cloud-
based framework, making it easily accessible and scalable. This section details the step-
by-step process of implementation, covering data preprocessing, model training,
classification, user interaction, and system evaluation. By leveraging Scikit-learn,
Pandas, FPDF, and Streamlit, the tool provides an efficient and interactive solution for
detecting phishing threats.
The first step in building the phishing detection system is data collection. The
dataset consists of URLs labelled as legitimate or phishing, with associated features such
as protocol type, domain structure, URL length, subdomains, and presence of phishing-
related keywords. To ensure the dataset is suitable for training a machine learning model,
the following preprocessing techniques are applied:
Handling Missing Values: Missing values are filled using fillna(), ensuring a complete
dataset without null entries.
Data Storage: The cleaned dataset is stored in a Pandas DataFrame for efficient
processing and model training.
The dataset is loaded from a CSV file containing labelled URLs. Missing values are
identified and replaced with appropriate defaults. Feature extraction techniques analyse
URL text and structure to derive phishing-related indicators. The final dataset is
structured and prepared for training the classification model.
31
5.2.2 PHISHING DETECTION USING RANDOM FOREST CLASSIFIER
The Random Forest Classifier from Scikit-learn is used as the primary machine
learning model for phishing detection. This ensemble method improves classification
accuracy by combining multiple decision trees, reducing overfitting and enhancing
generalization.
Splitting Data into Features and Labels: The dataset is divided into features (X) and
labels (y) to separate input attributes from classification targets.
Training the Model: The Random Forest algorithm learns patterns associated with
phishing URLs by analysing the extracted features.
Hyperparameter Tuning: Parameters such as number of trees, tree depth, and feature
selection are optimized to improve detection accuracy.
Prediction & Classification: Once trained, the model predicts whether an input URL is
legitimate or phishing based on extracted features.
The dataset is split into training (80%) and testing (20%) sets. The model is trained
on the extracted URL features. Hyperparameter tuning is performed using GridSearchCV
for optimal performance. The trained model classifies new URLs based on learned
patterns, returning either "Legitimate (Safe)" or "Suspicious (Unsafe)".
Beyond basic phishing detection, the tool categorizes phishing threats based on
common attack types using a rule-based heuristic system. This helps in understanding the
nature of the phishing attack and improving security awareness.
Credential Stealing: Phishing pages mimicking login portals to capture user credentials.
Social Media Fraud: Fake social media pages aimed at identity theft or spreading
malware.
The input URL is analysed for keywords, domain names, and suspicious patterns.
32
The system checks for predefined phishing indicators related to known attack types.
Based on detected patterns, the URL is categorized into one of the phishing attack types.
The classification result is displayed alongside the phishing detection outcome.
Validation Mechanism: The system checks whether the URL format is valid using the
validators library.
Prediction Display: The classification model analyses the URL and displays a
Legitimate or Suspicious result.
Users enter a URL and select the protocol type (HTTP/HTTPS). The system validates
the URL format before processing. Upon clicking the Predict button, the trained model
analyses the URL and displays the phishing detection result. If the URL is classified as
phishing, the system highlights the possible attack type.
Analysed URL and classification result. Phishing indicators detected in the URL. Attack
Type (if applicable) to provide further context. Recommendations for safe browsing
practices.
After making a prediction, users can navigate to the Download Report section. Clicking
the Generate Report button creates a structured PDF report summarizing the phishing
analysis. Users can download and save the report for further investigation as depicted in
the Fig 5.2.
To ensure reliability and accuracy, the model undergoes rigorous performance evaluation
using standard ML metrics.
33
Metrics used include:
Precision & Recall: Evaluates how well phishing threats are identified.
A test dataset is used to evaluate the trained model. A confusion matrix and classification
report are generated to analyse model effectiveness. Areas of misclassification are
identified for further model refinement. The model undergoes continuous retraining with
updated datasets to enhance phishing detection accuracy.
5.3 MODULEs
The AI-Driven Phishing Detective Tool is divided into multiple functional modules,
ensuring a structured and efficient workflow. Each module is responsible for a specific
task, from data preprocessing and feature extraction to machine learning model training,
real-time URL detection, web-based interaction, and performance evaluation. By
maintaining a modular design, the tool achieves scalability, maintainability, and real-
time phishing detection while ensuring accuracy and user accessibility. The following
sections detail each module's implementation, workflow, and key functions.
34
5.3.1 MODULE A: DATA PREPROCESSING AND FEATURE EXTRACTION
This module processes raw URLs and converts them into structured feature
representations that can be used by the machine learning model. Since raw URLs cannot
be directly analysed, extracting relevant attributes helps in differentiating phishing and
legitimate URLs.
Key Tasks:
Feature Extraction
The module extracts three primary feature types:
Lexical Features: URL length, number of special characters (e.g., -, _, ., @), and
presence of suspicious keywords (e.g., bank, login, verify).
Host-based Features: Domain age, WHOIS information, and whether the URL uses an IP
address instead of a domain.
Content-based Features: HTTPS usage, SSL certificate validity, and URL redirection
patterns.
Normalizes numerical features (e.g., URL length) into a standard range (0 to 1).
Key Function
//Takes a list of URLs and extracts key features into a structured DataFrame.
This module trains the core classifier, which determines whether a given URL is
phishing or legitimate. It uses the Random Forest Classifier, an ensemble learning
method known for its robust performance and accuracy in detecting phishing threats.
35
Key Tasks:
Data Splitting
Divides the dataset into training (80%) and testing (20%) sets.
Model Selection
Model Training
Performance Evaluation
Analyzes results and fine-tunes model parameters for better detection rates.
Key Function
This module performs real-time classification, allowing users to input a URL and
receive an instant phishing risk assessment.
Key Tasks:
36
Feature Extraction on New URLs
Key Function
The Web Application module provides an interactive user interface where users can
input URLs and receive real-time phishing detection results. The UI is developed using
Streamlit, a Python-based web framework.
Key Tasks:
Visualization of Features
37
Key Function
def run_web_app():
To ensure the system performs reliably, this module evaluates model accuracy using
various performance metrics.
Key Tasks:
F1-Score: Balances precision and recall, particularly useful for imbalanced datasets.
Visualizes true positives, false positives, true negatives, and false negatives.
import pandas as pd
import validators
data = {
'URL': [
38
"www.marketplus.com.ar/cart/includes/local/1.php",
"www.qu100.com/phpmyadmin/778766777/index.html",
"uploads.boxify.me/83141/novo.ini",
],
'Protocol': [0.0, 0.0, 0.0, None, 2.0, 0.0, 0.0, 1.0, None, None],
'Label': [1, 1, 1, 0, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame(data)
df['Protocol'] = df['Protocol'].fillna(df['Protocol'].mean())
X = df[['Protocol']] # Features
clf = RandomForestClassifier(random_state=42)
clf.fit(X, y)
def analyze_url_type(url):
return "E-commerce Scam", "Fake e-commerce sites trick users into making
payments for non-existent products."
39
patterns (e.g., frame-by-frame features) to detect anomalies or violent behaviour.
The Random Forest Classifier is used as the core machine learning model due to
its robustness, accuracy, and ability to handle complex decision boundaries. It operates
by constructing multiple decision trees and aggregating their results to improve detection
accuracy while minimizing overfitting. The dataset is split into features (X) and labels
(y) before training the classifier, which learns patterns distinguishing legitimate from
phishing URLs. Once trained, the model is capable of predicting whether a given URL is
safe or suspicious based on extracted features. The use of an ensemble learning approach
ensures high detection accuracy and resilience to noisy data.
40
suspicious. Additionally, the tool generates downloadable phishing analysis reports using
FPDF, including key details such as the analysed URL, prediction results, attack type
classification, and a brief threat description. This feature is particularly useful for
organizations and security professionals who require documented phishing reports for
cybersecurity audits or investigations.
41
CHAPTER - 6
TESTING & VALIDATION
42
CHAPTER – 6
TESTING & VALIDATION
The testing process involves four major phases: Test Planning, Test Design, Test
Execution, and Test Reporting. Each phase plays a significant role in the validation and
verification of the system. Meticulous attention is given to every phase to ensure that the
tool performs optimally across different environments and data sets.
Test planning is the foundational phase where the testing strategy is formulated to
ensure that the phishing detection system meets its objectives. The planning phase is
crucial as it sets the roadmap for the entire testing process. It involves identifying the
scope, defining objectives, allocating resources, and establishing a timeline for
execution.
During the test planning phase, the scope of testing is clearly defined to include all
critical features and functionalities of the system. The primary components identified for
testing are the Home Page, Prediction Page, and Report Generation Module. In addition,
the system’s capability to detect various types of URLs, such as banking URLs, e-
commerce scam URLs, Google form URLs, legitimate URLs, and suspicious URLs, is
also emphasized.
43
organizing and managing test cases.
A detailed schedule and timeline are established to outline the testing activities,
including test case development, execution, defect tracking, and reporting. This schedule
ensures that the testing process is conducted within the project’s timeline, allowing room
for regression testing and fixing potential issues.
Moreover, potential risks and challenges are anticipated, and contingency plans are
formulated to address unexpected issues that may arise during testing. This proactive
approach helps minimize disruptions and ensures that testing proceeds efficiently and
systematically.
Test design is the phase where comprehensive and well-structured test cases are
created to evaluate the system's functionality and performance. The primary objective of
this phase is to develop test cases that effectively cover all possible scenarios and edge
cases, ensuring that the system is robust and reliable.
The test design process begins with the identification of test scenarios, where
potential situations that the system might encounter are outlined. These scenarios are
based on system requirements and real-world use cases. Scenarios include detecting
phishing URLs, classifying legitimate URLs, identifying suspicious patterns, and
generating detailed reports.
Once scenarios are identified, test cases are meticulously crafted to specify the
input data, expected outcomes, and precise steps to be followed during execution. Each
test case is designed to verify a specific functionality or feature of the phishing detection
system. Test cases are crafted to cover not only normal and expected inputs but also edge
cases, including malformed URLs, ambiguous URLs, and large data sets.
To maintain consistency and accuracy, test design tools such as Microsoft Excel
are used to document test cases and expected results. These tools facilitate organized
44
tracking and management of test cases throughout the testing process.
Test execution is the phase where the formulated test cases are systematically
executed to verify the system's performance and accuracy. This stage involves running
the test cases as specified, recording the outcomes, and comparing actual results with the
expected ones.
The execution process starts with setting up the testing environment to replicate
real-world conditions. This includes configuring the prediction model, preparing the
dataset, and initializing the web application. Once the environment is set, the test cases
are executed step by step as per the predefined procedure.
During test execution, the focus is on observing and recording the system's
responses to various inputs. Any deviations from expected outcomes are logged as
defects, including details about their severity and potential impact on the system.
Automated scripts are used where applicable to streamline the execution process,
especially for repetitive and large-scale testing.
An important aspect of this phase is defect reporting, where detected issues are
logged, analysed, and categorized. The defect management process ensures that each
identified issue is promptly addressed and resolved before deployment. Additionally,
regression testing is conducted to confirm that recent fixes do not adversely affect
existing functionalities.
Test reporting is the final phase of the testing process, focusing on consolidating
and presenting the test results. This phase involves compiling data from test execution,
analysing the outcomes, and creating a comprehensive report that summarizes the
system's performance.
The report includes a detailed summary of executed test cases, highlighting both
successful and failed cases. Each test case result is documented, including the input data,
expected results, actual outcomes, and the status (pass or fail). The report also contains
defect analysis, which categorizes and prioritizes issues based on their severity.
Additionally, test metrics such as defect density, test coverage, and execution
progress are calculated and analysed. These metrics provide valuable insights into the
45
overall quality of the system and highlight areas that may require further improvement.
The final test report is shared with stakeholders to provide a transparent overview of the
system’s reliability and performance.
The following test cases were conducted to evaluate the phishing detection system’s
performance and accuracy. Each test case is detailed with objective, steps, expected
outcomes, and actual results as shown in the Table 6.1.
Objective: Verify if the homepage loads correctly with all interactive elements and input
fields.
Steps: Open the web application and observe the homepage layout and functionality.
Expected Result: The homepage should display input fields and instructions correctly.
Actual Outcome: The homepage rendered correctly without issues.
Status: Pass
46
Actual Outcome: The system successfully flagged phishing URLs as suspicious.
Status: Pass
Objective: Verify that the system correctly detects phishing URLs related to banking
and payment fraud, particularly URLs containing keywords like "paypal".
Steps:
Expected Result:
The system should detect the URLs containing payment-related keywords as suspicious
and display the result as "Suspicious - Banking/Payment Fraud".
Actual Outcome:
The system correctly detected and classified as shown in the Table 6.1. the banking and
labelling them as "Suspicious - Banking/Payment Fraud".
Status: Pass
Expected Actual
Test Case Component Input Status
Outcome Outcome
Display the
Homepage Open the homepage with Displayed
Home Page Pass
Rendering web app input field and correctly
instructions
Display error
Improperly
URL Format Prediction message indicating Error
formatted Pass
Validation Page invalid URL displayed
URLs
format
Safe,
Legitimate URL Prediction Display result as Correctly
verified Pass
Detection Page "Legitimate" detected
URLs
47
Known
Suspicious URL Prediction Display result as Correctly
phishing Pass
Detection Page "Suspicious" detected
URLs
Display result as
URL
Banking/Payment Prediction "Suspicious - Correctly
containing Pass
URL Detection Page Banking/Payment detected
"paypal"
Fraud"
Display result as
Google Form Prediction Google Correctly
"Suspicious - Data Pass
URL Detection Page form URL detected
Collection Scam"
Generate a
URL and
Report Report downloadable PDF Report
analysis Pass
Generation Module report with generated
result
analysis details
48
CHAPTER - 7
OUTPUT SCREENS
49
CHAPTER – 7
OUTPUT SCREENS
In AI-driven phishing detection system, output screens play a crucial role in
showcasing the progression and outcomes of each phase of the project. These screens
serve as visual and textual representations of the system's operation, encompassing
processes from data acquisition and preprocessing to real-time detection and result
analysis. The primary purpose of these screens is to validate the system's functionality
while ensuring transparency and interpretability in the detection process.
By presenting outputs at each critical stage of the pipeline, the screens facilitate a
comprehensive understanding of the system's workflow. They assist in identifying
bottlenecks or errors that may occur during data processing, model training, evaluation,
and real-time prediction. The output screens follow a logical sequence to mirror the
natural flow of the phishing detection process, thereby enhancing the interpretability and
usability of the system.
This chapter provides a detailed exploration of each output screen, emphasizing the
key elements, underlying processes, and insights derived from them. The documentation
covers various stages such as data preprocessing, feature extraction, model evaluation,
and real-time phishing detection, demonstrating how each screen contributes to
conveying the system’s operational success and accuracy.
50
practices.
The home page is centred around the URL Detection Section, which enables users
to assess the legitimacy of any given URL. The following key components are
incorporated to enhance usability and functionality:
The URL Input Field allows users to enter the address they wish to analyse.
The field is designed to accept various types of URLs, ensuring compatibility with
different protocols and formats.
Protocol Selection:
The Select Protocol feature allows users to specify the protocol associated with the
URL, enhancing the precision of the analysis.
The default option is typically HTTP, but users can select from other available protocols
as needed.
This capability is crucial as certain cyber threats may specifically exploit insecure
protocols, making it vital to accurately categorize and analyse them.
Analyse Button:
The "Analyse" button, positioned prominently next to the input field, triggers the
detection process.
Upon clicking, the system processes the entered URL using advanced machine learning
algorithms to detect potential phishing attempts.
Once the analysis is complete, the prediction result is displayed directly below the
analysis section.
51
ensuring that users can easily interpret the outcome.
Additionally, the "Attack Type" field indicates the specific nature of the detected threat,
such as "Phishing", "Malware", or "None" if no threat is identified.
This immediate feedback is crucial for users to quickly assess the safety of the URL
being analysed.
The home page also features a dedicated Security Best Practices section, aimed at
promoting secure online behaviour and mitigating risks associated with phishing attacks.
This section includes practical guidelines such as:
Using Strong and Unique Passwords: Encouraging users to create complex passwords
that are difficult to guess or crack.
By integrating these best practices directly on the home page, the system not only
detects phishing attempts but also educates users on adopting proactive cybersecurity
measures.
The left panel of the home page hosts the Navigation Menu, providing seamless
access to the following pages:
Home: Returns to the main analysis interface, allowing users to perform new phishing
detection tasks.
Reports: Directs to the Report Page where users can view and download detailed
analysis reports, as discussed in the corresponding section.
About: Provides insights into the system’s purpose, underlying technologies, and project
objectives.
52
The navigation panel ensures that users can effortlessly switch between different
functionalities without losing context or progress.
The home page is integral to the overall phishing detection system, as it facilitates
user interaction and enables rapid analysis of URLs. By providing both analytical and
educational elements in one interface, it supports users in making informed decisions
regarding potential cyber threats. Furthermore, the transparent presentation of results and
practical security tips contribute to enhancing cybersecurity awareness and vigilance.
The comprehensive design of the home page ensures that users are guided through
the detection process in a systematic and informed manner. As depicted in Fig 7.1 and
Fig 7.2, the interface prioritizes both functionality and usability, making it a vital
component of the AI-driven phishing detection system.
53
Fig 7.2: Home Page#2
The report page serves as a comprehensive and detailed summary of the analysis
results generated by the AI-driven phishing detection system. It provides essential
information regarding the legitimacy and security assessment of the analysed URL,
offering a clear and structured overview of the detection outcome. This page is designed
to enable users to systematically review the results of the cybersecurity analysis,
ensuring transparency and accuracy in evaluating potential cyber threats.
The analysed URL is clearly displayed at the top of the report, allowing users to verify
the address that was evaluated.
The report also includes a clickable link to the analysed URL, enabling quick reference
and verification.
Protocol Specification:
The protocol used during analysis (e.g., HTTP, HTTPS) is displayed to provide context
regarding the communication channel.
54
This detail helps in understanding the security level of the connection and potential
vulnerabilities related to insecure protocols.
Legitimacy Status:
The legitimacy status of the URL is prominently displayed, indicating whether the URL
has been classified as "Legitimate" or "Phishing/Malicious" based on the model’s
analysis.
This quick assessment allows users to make informed decisions regarding the safety of
the URL.
In cases where malicious activity is detected, the report specifies the attack type
identified (e.g., Phishing, Malware, Spoofing).
This information aids in understanding the nature and severity of the potential threat.
If no suspicious behaviour is detected, the report indicates "None" as the attack type.
The report page features a graphical visualization of analytical metrics, enhancing the
interpretability of the prediction results. The graph displays relevant metrics or feature
importance scores that contribute to the final prediction. The purpose of this visual
representation is to:
Illustrate the distribution of critical features or risk factors associated with the analysed
URL.
Provide an intuitive and easily understandable means of assessing the threat level.
Facilitate the comparison of various metrics that impact the detection decision.
The left panel of the report page contains navigation options to seamlessly switch
between different sections of the application:
Reports: Directs to the current page to review and download the analysis reports.
About: Provides information about the phishing detection system and its underlying
technologies.
55
Significance of the Report Page
The report page plays a pivotal role in presenting a structured and informative summary
of the system’s detection capabilities. By clearly displaying both textual and graphical
insights as shown in the Fig 7.3 and 7.4, it allows users to evaluate the accuracy and
reliability of the phishing detection results. This page not only aids in monitoring the
system’s performance but also contributes to maintaining transparency in the threat
analysis process.
56
7.2. LEGITIMATE URL
The output screen for a legitimate URL serves as a clear and informative interface
that displays the results of the analysis conducted by the AI-driven phishing detection
system. It is designed to provide users with accurate and reliable feedback regarding the
legitimacy and safety of the entered URL, while also promoting cybersecurity awareness
through practical guidance.
When a user enters a URL into the input field on the home page and selects the
appropriate protocol (e.g., HTTP), the system performs a comprehensive analysis to
determine whether the URL is legitimate or potentially malicious. The analysis leverages
advanced machine learning and deep learning algorithms, such as CNN and RNN
architectures, to accurately detect phishing attempts by evaluating numerous features and
attributes associated with the URL.
Upon completion of the analysis, the system promptly displays the result on the
output screen. The prediction outcome is prominently shown as "Legitimate" within a
visually distinctive green-coloured box, symbolizing safety and authenticity. This clear
visual representation ensures that users can quickly interpret the result and feel confident
about the legitimacy of the analysed URL. The choice of green as the background colour
is intentional, as it universally signifies safety and acceptance, thereby reinforcing the
positive nature of the result.
In addition to the prediction result, the output screen provides an "Attack Type"
field, which in the case of a legitimate URL, clearly states "Attack Type: None". This
indication confirms that the system did not detect any suspicious behaviour or
characteristics typically associated with phishing or malicious activities. By specifying
the absence of attacks, the system enhances transparency and helps users understand that
the URL has passed all security checks without triggering any alerts.
57
emphasizes the importance of using strong and unique passwords, avoiding sharing
sensitive information on unverified websites, and being cautious of unsolicited emails or
messages containing suspicious links.
This inclusion of security best practices serves a dual purpose. First, it educates
users about essential online safety measures, regardless of whether the URL analysed is
legitimate or malicious. Second, it reinforces the idea that even legitimate URLs should
be approached with caution if they are linked to sensitive activities, such as online
banking or personal data submission.
The navigation panel remains consistent across all pages, maintaining a uniform
interface that enhances the overall user experience. The ability to access the Reports
page directly from the legitimate URL output screen enables users to view and download
a comprehensive report of the analysis for record-keeping or further examination.
The design of the legitimate URL output screen is focused on delivering clarity,
accuracy, and ease of use. The text is well-organized and presented in a readable font
size, while the use of colour coding (green for legitimate results) facilitates quick visual
interpretation. Furthermore, the combination of prediction results, attack type
information, and security best practices creates a holistic approach to phishing detection,
addressing both technical analysis and user education.
The output screen also reflects the system’s commitment to fostering a secure digital
environment by not only detecting threats but also promoting awareness of good
cybersecurity practices. As a result, users can feel confident not only in the system's
analytical capabilities but also in their ability to make informed decisions when
interacting with various online resources.
58
By presenting the legitimate URL output in a comprehensive and transparent
manner, the system enhances user trust and demonstrates the robustness of the
underlying algorithms. This structured and informative approach contributes to the
effectiveness of the AI-driven phishing detection system and supports proactive
measures against potential cyber threats.
The output screen for a suspicious URL is a critical component of the AI-driven
phishing detection system, designed to inform users about potential threats associated
with the analysed URL. This screen serves as a comprehensive summary of the detection
results, clearly indicating that the URL poses a risk or exhibits characteristics typical of
phishing or malicious activities.
Once the analysis is complete, the system promptly displays the result on the
59
output screen. The prediction outcome is clearly shown as "Suspicious" within a red-
coloured box, signalling a high alert to the user. This visual indication immediately
draws attention to the potential threat, prompting users to take necessary precautions.
In addition to the prediction result, the output screen specifies the "Attack Type"
detected during the analysis. This field helps users understand the nature of the potential
threat, whether it is a phishing attack, malware distribution, or another form of cyber
exploitation. By providing detailed information about the attack type, the system
enhances transparency and aids in risk assessment.
For instance, if the system detects a "Phishing Attack" as shown in Fig 7.6, the
output screen explicitly states it, highlighting the primary reason for classifying the URL
as suspicious. This precise identification enables users to make informed decisions
regarding the next steps, such as avoiding the website or reporting it to relevant
authorities.
The output screen also incorporates a dedicated section titled "Security Best
Practices", emphasizing recommended actions in the event of encountering suspicious
URLs. Users are advised to avoid interacting with the URL, refrain from providing
personal information, and promptly close the browser window if they have already
accessed the site. Additionally, guidance on reporting phishing attempts to cybersecurity
authorities is provided to ensure collective safety.
60
understand, even for users with minimal technical expertise.
The performance of the AI-driven phishing detection system was evaluated using
various machine learning and deep learning algorithms. The primary objective of this
evaluation was to assess the accuracy, precision, recall, and F1-score of each algorithm
to determine their effectiveness in detecting phishing URLs. The algorithms used in this
61
project include Random Forest Classifier, AdaBoost Classifier, Convolutional Neural
Network (CNN), and Recurrent Neural Network (RNN). Each metric provides a different
perspective on model performance, allowing for a comprehensive evaluation of their
strengths and weaknesses. The performance metrics for each algorithm are discussed in
detail below.
7.3.1 ACCURACY
Random Forest Classifier demonstrated the highest accuracy of 98.5%, indicating that it
correctly identified phishing and legitimate URLs with high precision and recall. This
high accuracy can be attributed to the model's ability to learn from multiple decision
trees and combine their outputs effectively.
AdaBoost Classifier achieved an accuracy of 92%, which, while slightly lower than that
of the Random Forest, still represents a robust performance. The ensemble nature of
AdaBoost, which emphasizes harder-to-classify instances, contributes to its relatively
high accuracy.
CNN Model and RNN Model showed significantly lower accuracy scores of 50.9% and
50.7% respectively, indicating that these deep learning architectures failed to distinguish
between legitimate and phishing URLs effectively. This may be due to the challenges
associated with capturing textual patterns in URLs.
Model Accuracy
62
Fig 7.5: Accuracy comparison
The above Fig 7.5 compares the accuracy of each algorithm in detecting the
legitimacy of the URLs.
7.3.2 PRECISION
Precision is a key performance metric that quantifies the ratio of correctly predicted
positive observations to the total number of predicted positives. It serves as an indicator
of the model's accuracy when identifying a URL as phishing. High precision implies that
when the model predicts a URL as phishing, it is highly likely to be correct, thereby
minimizing the occurrence of false positives. Precision is particularly important in
phishing detection since falsely identifying a legitimate URL as phishing can lead to
disruptions and loss of trust.
The Random Forest Classifier recorded exceptionally high precision values of 0.99
for legitimate URLs and 0.98 for suspicious URLs. These values demonstrate the
model’s robustness in accurately distinguishing phishing URLs from legitimate ones,
with minimal false positives. The high precision achieved by the Random Forest
Classifier is attributed to its ensemble learning approach, which integrates the outputs of
multiple decision trees, thereby enhancing predictive accuracy. As shown the precision
values clearly demonstrate the superior performance of the Random Forest Classifier.
The AdaBoost Classifier displayed precision values of 0.92 for both legitimate and
suspicious URLs, indicating that the model consistently maintains balanced accuracy
when identifying phishing attempts. Although slightly lower than the Random Forest
Classifier, these precision values still represent an efficient detection capability. The
63
adaptive boosting technique employed by AdaBoost contributes to its reliable
classification performance by giving more weight to difficult cases
On the other hand, the CNN Model and RNN Model exhibited significantly lower
precision values of approximately 0.50 for both legitimate and suspicious URLs. These
values indicate that these models are nearly as likely to classify a legitimate URL as
phishing and vice versa. This low precision reflects frequent false positive predictions,
underscoring the challenges faced by deep learning architectures in analysing textual
data such as URLs. As depicted in Fig. 7.6 and Table 7.2, the low precision of these
models suggests that they are not suitable for phishing detection tasks compared to
traditional machine learning methods.
Random Forest
0.99 0.98 0.99 0.99
Classifier
AdaBoost
0.92 0.92 0.92 0.92
Classifier
64
Fig 7.6: Precision Comparison
7.3.3 RECALL
The Random Forest Classifier achieved impressive recall values of 0.98 for
legitimate URLs and 0.99 for suspicious URLs. These results demonstrate the model’s
effectiveness in accurately identifying phishing URLs while minimizing false negatives.
The high recall values can be attributed to the model’s ensemble approach, which
leverages the combined predictions of multiple decision trees to enhance its ability to
capture complex patterns in the data. This performance indicates that the Random Forest
Classifier is highly reliable when it comes to detecting phishing attempts.
The AdaBoost Classifier exhibited consistent recall values of 0.92 for both
legitimate and suspicious URLs. This uniformity highlights the model’s balanced
performance in correctly identifying phishing attempts as well as legitimate URLs, the
recall values indicate that the model is reasonably effective at capturing phishing
activities, though it slightly underperforms compared to the Random Forest Classifier.
The boosting mechanism employed by AdaBoost helps improve recall by iteratively
65
focusing on instances that were previously misclassified.
Conversely, the CNN Model and RNN Model recorded recall values of
approximately 0.51 for both legitimate and suspicious URLs, as shown in Fig. 7.7 and
Table 7.3. These relatively low recall values indicate a significant limitation of these
models in correctly identifying phishing URLs. A recall of around 0.51 suggests that
almost half of the actual phishing attempts were not detected, leading to a high false
negative rate. This performance inadequacy may result from the inability of CNN and
RNN architectures to effectively capture textual patterns and nuances specific to
phishing URLs.
The analysis of recall values clearly demonstrates that traditional machine learning
models, particularly ensemble-based classifiers like Random Forest and AdaBoost,
significantly outperform deep learning models in detecting phishing URLs. High recall
values in these classifiers ensure that phishing threats are identified accurately, thereby
enhancing the overall security of the system.
66
Fig 7.7: Recall Comparison
The F1-score is an essential performance metric that combines both precision and
recall into a single value, providing a comprehensive assessment of the model’s
accuracy. It is calculated as the harmonic mean of precision and recall, ensuring that both
false positives and false negatives are considered when evaluating the model’s
performance. The F1-score is particularly useful when dealing with imbalanced data, as
it balances the impact of both precision and recall to provide a more reliable measure of
the model's effectiveness.
The Random Forest Classifier achieved exceptional F1-scores of 0.98 for legitimate
URLs and 0.99 for suspicious URLs, as shown in Fig. 7.8 and Table 7.4. These high F1-
scores indicate that the model successfully maintains a robust balance between precision
and recall, minimizing the occurrence of both false positives and false negatives. The
strong performance of this classifier can be attributed to its ability to aggregate the
decisions of multiple trees, which enhances both generalization and accuracy. This
makes the Random Forest Classifier highly reliable for phishing detection, especially
when maintaining high accuracy is critical.
The AdaBoost Classifier produced consistent F1-scores of 0.92 for both legitimate
and suspicious URLs, indicating that it also achieves a balanced combination of
precision and recall. The uniformity of the F1-scores highlights the model’s stable
performance in identifying both types of URLs. The boosting approach of the AdaBoost
algorithm significantly contributes to maintaining this balance by iteratively emphasizing
difficult-to-classify instances. Although its performance is slightly lower compared to the
Random Forest Classifier, it still demonstrates considerable reliability in phishing
detection.
On the other hand, the CNN Model and RNN Model recorded relatively poor F1-
scores of approximately 0.50 for both legitimate and suspicious URLs. This indicates
that these deep learning models are not effective at balancing precision and recall,
resulting in a high rate of misclassification. Such low F1-scores highlight the models’
inability to capture meaningful patterns from the textual and structural features of URLs.
The lack of discriminative power in these models suggests that deep learning
67
architectures may not be well-suited for phishing detection when trained solely on URL-
based data.
AdaBoost
0.92 0.92 0.92 0.92
Classifier
68
Numerous approaches have been proposed to enhance detection accuracy and robustness,
employing machine learning and deep learning techniques to address the challenges
posed by increasingly sophisticated phishing attacks.
In this study, the performance of three phishing detection systems is compared: the
Hybrid Approach [13], the Transformer-based (BERT) Approach [16], and the Proposed
System. The comparative analysis is based on four key performance metrics: Accuracy,
Precision, Recall, and F1 Score. These metrics are chosen as they comprehensively
evaluate the model’s ability to correctly classify phishing and legitimate URLs while
minimizing false positives and false negatives.
To facilitate an objective comparison, the results are presented in the form of tables
and figures, providing a clear and structured representation of the performance of each
approach. The comparative analysis highlights the effectiveness and superiority of the
proposed system over existing methods, demonstrating its potential to enhance
cybersecurity applications.
The Hybrid Approach [13] exhibits an accuracy of 95.70%, indicating its reasonable
capacity to correctly classify URLs. This approach typically leverages a combination of
feature extraction and machine learning techniques, which results in moderately high
accuracy. However, the reliance on traditional feature-based methods may limit its
ability to adapt to more dynamic and sophisticated phishing techniques.
69
accuracy compared to the Hybrid Approach is attributed to the superior representation
learning and contextual analysis provided by the transformer architecture.
System Accuracy
70
Fig 7.9: Comparison of Accuracy
The Hybrid Approach [13] demonstrates a precision of 95.20%, indicating that it can
correctly classify phishing URLs with reasonable accuracy. However, the precision rate
is somewhat limited by the method's dependence on traditional feature extraction
techniques, which may not capture complex patterns with evolving phishing tactics.
71
In contrast, the Proposed System exhibits a substantially higher precision of 99.00%.
This remarkable improvement is attributed to the integration of hybrid feature
engineering techniques and robust deep learning architectures that allow for more
accurate differentiation between phishing and legitimate URLs. The proposed system
effectively leverages a combination of semantic analysis and real-time pattern
recognition to minimize false positives, thus achieving superior precision.
System Precision
Hybrid Approach [13] 95.20%
7.4.3 Recall
The chart (Fig. 7.11) illustrates the recall comparison between three phishing
detection approaches: the Hybrid Approach [13], the Transformer-based (BERT)
72
Approach [16], and the Proposed System. Recall is a crucial performance metric that
quantifies the proportion of correctly identified phishing URLs among all actual phishing
cases. It reflects the system’s ability to detect phishing attempts accurately, especially in
situations where the primary objective is to capture every potential threat. High recall
ensures that a minimal number of phishing URLs are missed, thereby reducing the risk of
undetected cyber threats.
73
Table 7.7: Comparison of Recall
System Recall
74
learning techniques and handcrafted features results in limited adaptability to complex
phishing patterns. Consequently, the model may underperform when encountering novel
or ambiguous phishing URLs.
System F1 Score
75
Fig 7.12: F-1 Score Comparison
76
CHAPTER - 8
CONCLUSION AND FUTURE SCOPE
77
CHAPTER – 8
CONCLUSION AND FUTURE SCOPE
8.1 CONCLUSION
In an era of increasing digital connectivity, phishing attacks have emerged as one of
the most prevalent and damaging cyber threats. These attacks exploit human
vulnerabilities, often leading to significant financial losses and compromised personal
data. As phishing techniques evolve in complexity, the need for advanced detection
systems becomes paramount. This project aimed to develop an AI-driven phishing
detection system using cutting-edge machine learning and deep learning techniques to
accurately and efficiently detect phishing URLs.
The Hybrid Approach [13], despite its moderate accuracy of 95.70%, struggles with
detecting sophisticated phishing patterns due to its reliance on traditional feature
extraction techniques. On the other hand, the Transformer-based (BERT) Approach [16]
improves accuracy to 96.50% by leveraging contextual analysis, but it still faces
challenges in handling adversarial and obfuscated URLs. The proposed system’s ability
78
to outperform both approaches highlights the effectiveness of incorporating deep
learning techniques, hybrid feature extraction, and real-time processing.
79
beyond just URL analysis. Enhancing it to support multi-modal phishing detection by
analysing email content, social media links, and malicious file attachments will ensure
comprehensive protection. Techniques like natural language processing (NLP) for text-
based content and image recognition for visual phishing attempts will add depth to the
system's threat detection capabilities. Moreover, addressing adversarial robustness is
vital to counter sophisticated attacks that manipulate phishing URLs to bypass detection.
Implementing adversarial training and robust optimization techniques can strengthen the
system's resistance against such malicious tactics.
80
OUTLINE OF THE PROJECT
Fig 8.1
81
REFERENCES
82
enabled phishing attacks detection techniques. Telecommunication Systems, 76,
123–145.
[14] Ahmed, Z., Khan, M., & Ali, R. (2023). AI-driven phishing detection systems.
ResearchGate.
[15] Divakaran, D. M., & Oest, A. (2022). Machine Learning and Deep Learning
Models for Phishing Detection: A Comparative Study. arXiv preprint
arXiv:2205.12345.
[16] Zhang, Y., Li, H., & Wang, J. (2022). A systematic review of deep learning
techniques for phishing detection. Electronics, 13(19), 3823.
[17] Chen, J., & Lee, S. (2022). Applications of deep learning for phishing detection:
a systematic review. Knowledge and Information Systems, 64(3), 567–593.
[18] Abuzuraiq, A. S., Alqatawna, J., & Faris, H. (2020). Intelligent methods for
accurately detecting phishing websites. arXiv preprint arXiv:2006.00591.
[19] Sharma, P., Gupta, N., & Rao, V. (2021). Walkthrough phishing detection
techniques. Computers & Electrical Engineering, 93, 107277.
[20] Aleroud, A., & Zhou, L. (2017). Phishing environments, techniques, and
countermeasures: A survey. Computers & Security, 68, 160-196.
83