Group 17 Blackbook Final Report
BACHELOR OF ENGINEERING IN
INFORMATION TECHNOLOGY
BY
Sr.No. 18, Plot No. 5/3, CTS No.205, VadarVasti Rd, behind Vandevi Temple, Karve Nagar, Pune, Maharashtra 411052
Submitted by
is a bonafide work carried out by them under the supervision of Ms. Shital Kakad
and it is approved for the partial fulfillment of the requirement of Savitribai Phule
Pune University for the award of the Degree of Bachelor of Engineering (Information
Technology)
This project report has not been submitted to any other Institute or University for the
award of any degree or diploma.
It is our proud privilege and duty to acknowledge the kind help and guidance
received from several people in the preparation of this report. It would not have been
possible to prepare this report in this form without their valuable help,
co-operation and guidance.
We express our sincere gratitude to our guide Ms. Shital Kakad for guiding us in the
investigation of this project and in carrying out experimental work. We hold her in
esteem for the guidance, encouragement and inspiration received from her.
Last but not least we wish to thank our parents for financing our studies and helping us
throughout our life for achieving perfection and excellence. Their personal help in
making this report and project presentation is gratefully acknowledged.
For business purposes, email is the preferred form of formal communication. Despite the
existence of other forms of communication, email usage keeps growing. In the modern
environment, where email volume is increasing daily, automated email management is
crucial. More than 55 percent of all emails are deemed to be spam, which shows how spam
consumes users' time and resources while producing little of value. Because spammers
employ sophisticated and inventive methods to carry out illicit activities via spam emails,
it is crucial to understand the various spam email classification approaches and how they
work. The machine learning algorithms used for spam classification are the main topic of
this paper. The paper also offers a thorough analysis and assessment of previous research
on various machine learning methodologies, email properties, and techniques, and it
outlines potential paths for future study as well as difficulties encountered in the field of
spam classification.
Email spam continues to be a persistent and widespread issue, inundating users with
unwanted and potentially harmful messages. This project aims to develop an effective
email spam detection system using machine learning techniques. The system analyzes the
content of incoming emails and classifies them as spam or non-spam (ham) based on
learned patterns and features. The project involves data collection, preprocessing, feature
extraction, model training, and evaluation stages.
A large dataset of labeled emails is collected and preprocessed to remove noise and
irrelevant information. Relevant features, such as keywords and text patterns, are extracted
from the preprocessed emails. Machine learning models are trained using the extracted
features to learn the distinguishing characteristics of spam and ham emails.
The project aims to deploy the trained model in an email client or spam filter system to
detect and filter spam emails effectively. The email spam detection system will help users
avoid unwanted emails, protect against phishing attempts, and improve overall email
security and user experience.
Keywords:
Email-Spam, Machine Learning, Data Collection, Preprocessing, Feature Extraction,
Dataset, Spam Classification, Model Training, Support Vector Machine, Random Forest.
Name Page No
Certificate 2
Acknowledgement 3
Abstract 5
Index 7
List of Figures 10
SR NO. CHAPTERS PAGE NO.
1 INTRODUCTION 10
1.1 MOTIVATION 11
2 LITERATURE SURVEY 12
3 SYSTEM REQUIREMENTS 17
4 SYSTEM DESIGN 18
DFD LEVEL 0 19
DFD LEVEL 1 20
5.1 DATASET 24
7 CONCLUSION 37
9 REFERENCES 39
10 PLAGIARISM REPORT 40
7 Activity Diagram 22
8 Class Diagram 23
9 User Interface 31
14 Email Prediction 34
15 Email Prediction Output 35
INTRODUCTION
Real-world datasets often have a large number of dimensions and contain superfluous or
worthless data, which makes them difficult to analyse. Feature selection (FS) is a
preprocessing procedure in machine learning that eliminates irrelevant and redundant
features from a dataset and identifies the final subset of crucial features that will
improve the performance of machine learning algorithms. [1–3] FS is widely used in data
mining and machine learning to reduce dimensionality and so increase the efficiency and
effectiveness of classification algorithms. Finding the ideal feature subset, however, poses
a complicated optimization problem that cannot be resolved using standard techniques. The
objective of FS is to identify a subset of the full feature set that enhances the
performance of learning algorithms in terms of classification accuracy or learning time.
Two frameworks, the wrapper-based and filter-based techniques, have so far been
suggested for effectively resolving the FS problem. [6,7] The two essential elements of
wrapper-based techniques are the search strategy and the evaluation criterion. [8] The
search strategy specifies the procedure for producing a candidate feature subset. A specific
criterion is used to evaluate each candidate. Subset creation and evaluation are repeated
until a stopping criterion is fulfilled, with each round aiming to improve on the subsets
found in earlier rounds. In contrast to wrapper-based approaches, filter-based approaches
use abundance, relationship, and connection between features to detect superfluous and
unnecessary features. In short, solutions to the FS problem can be divided into two groups:
wrapper-based approaches and filter-based approaches.
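The wrapper-based loop described above (search strategy, evaluation criterion, stopping criterion) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a simple greedy forward search rather than the metaheuristic searches surveyed later in this report, and `toy_score` is a hypothetical stand-in for "train a classifier on this subset and measure its accuracy".

```python
from typing import Callable, List, Tuple


def forward_selection(
    n_features: int,
    evaluate: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Greedy wrapper-style search over feature indices 0..n_features-1."""
    selected: List[int] = []
    best_score = evaluate(selected)
    remaining = list(range(n_features))

    while remaining:
        # Search strategy: generate candidate subsets by adding one feature.
        best_candidate = None
        best_candidate_score = best_score
        for feature in remaining:
            score = evaluate(selected + [feature])  # evaluation criterion
            if score > best_candidate_score:
                best_candidate, best_candidate_score = feature, score

        # Stopping criterion: no candidate improves on the current subset.
        if best_candidate is None:
            break
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_score = best_candidate_score

    return selected, best_score


# Hypothetical scoring function: features 0 and 2 are informative,
# and every selected feature costs 5 points.
INFORMATIVE = {0, 2}


def toy_score(subset: List[int]) -> int:
    return 40 + 30 * len(INFORMATIVE & set(subset)) - 5 * len(subset)


subset, score = forward_selection(4, toy_score)
print(subset, score)  # prints: [0, 2] 90
```

A filter-based method would instead rank features using statistics of the data alone, without running the classifier inside the loop.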
DEPARTMENT OF INFORMATION TECHNOLOGY MMCOE, PUNE 10
1.1 MOTIVATION:
Email spam has become a major problem: as the number of internet users grows rapidly,
so does the volume of spam. Spam is used for illegal and unethical conduct such as
phishing and fraud. Malicious links sent through spam emails can harm a system and can
also be used to sneak into it.
Spam emails are increasing day by day and creating problems for users. A spam detector
identifies whether a mail is spam or not, which improves the efficiency of users.
Creating a fake profile and email account is easy for spammers. They pose as genuine
people in their spam emails and target people who are not aware of these frauds.
It is therefore necessary to identify fraudulent spam mails. This project identifies them
using machine learning techniques and bio-inspired algorithms.
The goal is a more accurate and efficient spam filtering model that reduces the number
of false positives and false negatives and provides a more robust solution to the problem
of spam filtering.
Email spam remains a significant concern, as it inundates users with unsolicited and
potentially harmful messages. Traditional methods for email spam detection often rely on
single machine learning algorithms, which may struggle to achieve optimal performance.
This project aims to address the issue by proposing a hybridization approach using Support
Vector Machine (SVM) and Random Forest algorithms for more accurate and efficient email
spam detection.
CHAPTER 2
LITERATURE SURVEY
Each entry lists: Sr. No., Paper Title, Author, Year, Problem solved in this paper (existing problem / problem statement), Technique used to solve the existing problem, and Future scope.

1. Paper Title: Binary Grasshopper Optimisation Algorithm Approaches for Feature Selection Problems
   Author: Majdi Mafarja, Ibrahim Aljarah, Hossam Faris, Abdelaziz I. Hammouri, Ala’ M. Al-Zoubi, Seyedali Mirjalili
   Year: 2018
   Problem solved: the number of features while retaining an acceptable degree of classification accuracy by reducing unnecessary, redundant, and noisy data.
   Technique: Binary Grasshopper Optimization Algorithm
   Future scope: Investigating the use of the suggested solutions to more real-world difficulties, such as real-world commercial issues and medical applications

2. Paper Title: A comprehensive survey: Whale Optimization Algorithm and its applications
   Author: Farhad Soleimanian Gharehchopogh, Hojjat Gholizadeh
   Year: 2019
   Problem solved: Determined to minimise or maximise the factors involved in the difficulties in order to make anything as useful and effective as possible.
   Technique: Support Vector Machine (SVM)
   Future scope: In order to further benefit from using the WOA to solve issues involving continuous optimization, we advise combining it with other meta-heuristic methods.

3. Paper Title: A wrapper-based feature selection for improving performance of intrusion detection systems
   Author: Maryam Samadi Bonab, Ali Ghaffari, Farhad Soleimanian Gharehchopogh, Payam Alemi
   Year: 2020
   Problem solved: A model of typical behaviour is developed during the anomaly-based detection process. Methods such as clustering and machine learning techniques are used to develop a model of normal behaviour.
   Technique: Ant lion algorithm, Support Vector Machine (SVM)
   Future scope: It has been used to remove less significant features from IDS datasets and perform wrapper-based feature selection.

4. Paper Title: Virtual machine placement in cloud data centers using a hybrid multi-verse optimization algorithm
   Author: Sasan Gharehpasha, Mohammad Masdari, Ahmad Jafarian
   Year: 2020
   Problem solved: Cloud computing offers the option of focusing solely on organisational objectives as opposed to increasing user hardware resources.
   Technique: meta-heuristic optimization algorithm
   Future scope: One of the important things on which you should concentrate is security.

5. Paper Title: Service Selection Using Multi-criteria Decision Making: A Comprehensive Overview
   Author: Mehdi Hosseinzadeh, Hawkar Kamaran Hama, Marwan Yassin Ghafour, Mohammad Masdari, Omed Hassan Ahmed, Hemn Khezri
   Year: 2020
   Problem solved: a thorough analysis of the cutting-edge MCDM-based service selection methods put forward in the literature.
   Technique: a Genetic Algorithm
   Future scope: future research directions and aid in creating new service choices through the use of MCDM techniques.

6. Paper Title: Binary butterfly optimization approaches for feature selection
   Author: Sankalap Arora, Priyanka Anand
   Year: 2018
   Problem solved: A particular fitness function is minimised, BOA can effectively search the feature space for an optimal or nearly optimal feature subset.
   Technique: meta-heuristic algorithm
   Future scope: To broaden the scope of the existing method, binary BOA can be used with additional classifiers such as artificial neural networks, k-nearest neighbours, and support vector machines in addition to other public datasets and real-world issues.

7. Paper Title: A new hybrid ant colony optimization algorithm for feature selection
   Author: Md. Monirul Kabir, Md. Shahjahan, Kazuyuki Murase
   Year: 2012
   Problem solved: By choosing important features, FS aims to improve a dataset's quality while making it simpler. Normally, FS removes erroneous features from the source dataset without degrading generalisation performance
   Technique: hybrid ant colony optimization (ACO) algorithm
   Future scope: The amount of parameters in ACOFS may be decreased in the future or made adaptable.

8. Paper Title: An Efficient Binary Salp Swarm Algorithm with Crossover Scheme for Feature Selection Problems
   Author: Hossam Faris, Majdi M. Mafarja, Ali Asghar Heidari, Ibrahim Aljarah, Ala’ M. Al-Zoubi, Seyedali Mirjalili, Hamido Fujita
   Year: 2018
   Problem solved: As a result, two novel wrapper FS methods are suggested, using SSA as the search method. Eight transfer functions are used in the first method to translate the continuous version of SSA into binary.
   Technique: Wrapper Feature Selection, Salp Swarm Algorithm, Optimization
   Future scope: examine how new S-shaped and V-shaped TF families may affect BSSA or other binary algorithms that have been examined.

9. Paper Title: Binary multi-verse optimization algorithm for global optimization and discrete problems
   Author: Nailah Al-Madi, Hossam Faris, Seyedali Mirjalili
   Year: 2019
   Problem solved: The V-shaped transfer function in the Binary Multi-verse Optimizer converts continuous variables to binary and updates the solutions during the optimization process.
   Technique: Multi-verse optimization algorithm · Global optimization
   Future scope: Our research includes testing our suggested strategy on many real-world, high-dimensional situations.

10. Paper Title: Symbiotic organisms search algorithm for optimal evolutionary controller tuning of fractional fuzzy controllers
    Author: Yongquan Zhou, Fahui Miao, Qifang Luo
    Year: 2019
    Problem solved: A fractional fuzzy controller can expand the integral and differential order of a fuzzy controller to any real number, in contrast to a standard integer-order fuzzy controller.
    Technique: SOS Algorithm
    Future scope: It might be taken into account in additional industries including bioscience, machinery, energy, and electronic power.
CHAPTER 3
SOFTWARE REQUIREMENTS
CHAPTER 4
SYSTEM DESIGN
1. SPAM DETECTION
In this email spam detection project, a dataset of labeled emails is collected to train a spam
detection model. The collected emails undergo preprocessing, including cleaning and
transforming the data.
The transformed emails are then subjected to feature selection, where relevant characteristics
indicative of spam or ham are identified. Using the selected features, a classification model is
trained on the dataset.
This model learns to distinguish between spam and non-spam emails. In the spam detection
phase, new emails are processed, features are extracted, and the trained model is applied to
classify the emails as spam or ham.
The project aims to develop an effective system that can accurately detect and filter out spam
emails, thereby improving email security and user experience.
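As a rough illustration of the feature extraction step, the sketch below turns email text into word-count vectors (a bag-of-words representation). This is an assumption about one common way to extract features, not a record of the project's exact code; the function names and the sample emails are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase word to a column index (sorted for stability)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(document: str, vocabulary: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Invented training emails, for illustration only.
training_emails = ["win a free prize now", "meeting at noon", "free free offer"]
vocab = build_vocabulary(training_emails)

# A new email is mapped onto the vocabulary learned from the training data.
features = to_count_vector("free prize free", vocab)
print(features)  # prints: [0, 0, 2, 0, 0, 0, 0, 1, 0]
```

The vocabulary is built once from the training emails and reused for every new email, so that the trained model always sees vectors with the same columns.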
4.2 DFD
The diagram represents the flow of data through the entire system and how each module
communicates with the others. It starts with the user providing the email to be checked as
input. The system checks the email and produces a response accordingly. This response is the
system's output, and the same output is made available to the user.
4.2.2 DFD LEVEL 1
The system receives input data, which includes the email message to be classified as spam or
not spam. The data is then collected and passed through steps such as data preprocessing and
feature extraction. Finally, the system determines whether the message is spam or ham and
shows the output.
4.3 UML DIAGRAMS
[Figure: UML diagram; labels: SYSTEM, USER]
4.3.2 ACTIVITY DIAGRAM:
4.3.4 CLASS DIAGRAM:
A class diagram is a type of structural diagram in the Unified Modeling Language (UML) that
represents the static structure and relationships of the classes within a system. It provides a visual
representation of the classes, their attributes, methods, and associations with other classes.
CHAPTER 5
IMPLEMENTATION
5.1 DATASET:
Let’s start with our spam detection data. We use an open-source dataset from Kaggle that
contains 5574 emails, including both ham and spam emails.
Email-Collection: All the spam and ham emails are collected and then separated into two
groups, all spam emails together and all ham emails together, for further processing.
Grouping them this way makes the data easier to understand.
Data Preprocessing: Data typically comes from many different sources and is frequently
in a number of formats. This is why it's crucial to transform your raw data. However,
because text data frequently contains redundant and repeating terms, this transformation
is not a straightforward procedure. The first stage in our solution is therefore to
process the text data.
Cleaning the Raw Data: During this stage, words or characters that don't contribute to
the text's meaning are removed. Typical cleaning procedures include removing stop
words, hyperlinks, numbers, special characters and extra whitespace.
Tokenization: Tokenization is the division of text into discrete units known as tokens.
Each token serves as both an input and a feature for the machine learning algorithm.
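The cleaning and tokenization steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual preprocessing code: the stop-word list is a tiny assumed sample, and the regular expressions are one reasonable choice among many.

```python
import re
from typing import List

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "to", "is", "of", "for", "you", "your", "at"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def clean_and_tokenize(text: str) -> List[str]:
    """Lowercase the text, strip noise, and return the remaining word tokens."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)         # remove hyperlinks
    text = NON_LETTER_PATTERN.sub(" ", text)  # remove numbers and special characters
    tokens = text.split()                     # split on whitespace, dropping extra spaces
    return [token for token in tokens if token not in STOP_WORDS]


tokens = clean_and_tokenize("Claim your FREE prize at http://example.com now!!! Call 12345")
print(tokens)  # prints: ['claim', 'free', 'prize', 'now', 'call']
```

Hyperlinks are removed before the other special characters, since stripping punctuation first would break a URL into ordinary-looking words.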
SVM:
Support vector machines (SVMs) are a set of supervised learning methods used for
classification, regression and outlier detection.
Random Forest:
Random Forest is a popular supervised machine learning algorithm. It can be used for both
classification and regression problems. It is based on the concept of ensemble learning,
which is the process of combining multiple classifiers to solve a complex problem and to
improve the performance of the model.
Hybrid Algorithm:
This algorithm uses two popular machine learning algorithms, Random Forest Classifier
and SVM, to create a hybrid model that can classify a synthetic dataset. The code splits the
dataset into training and testing sets, then trains the hybrid model on the training set. Once
the hybrid model is trained, it makes predictions on the testing set and computes the
accuracy of the predictions. Finally, the code prints the accuracy of the hybrid model. The
accuracy of the hybrid model is a measure of how well it can predict the classification of the
data.
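The split-and-score procedure described above can be sketched without any machine learning library. The helper names below are hypothetical, and the labels are made up for illustration; the false positive and false negative counts are included because reducing them is a stated goal of the project.

```python
import random
from typing import Sequence, Tuple


def split_train_test(items: Sequence, test_fraction: float = 0.25, seed: int = 0) -> Tuple[list, list]:
    """Shuffle a copy of the data and hold out roughly `test_fraction` of it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


def error_counts(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[int, int]:
    """Return (false positives, false negatives), treating "spam" as positive."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == "ham" and p == "spam")
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == "spam" and p == "ham")
    return false_pos, false_neg


# Made-up labelled messages, for illustration only.
labelled = [("msg%d" % i, "spam" if i % 2 else "ham") for i in range(8)]
train, test = split_train_test(labelled)

# Made-up true labels and predictions for four test messages.
y_true = ["spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "spam"]
print(accuracy(y_true, y_pred), error_counts(y_true, y_pred))  # prints: 0.75 (1, 0)
```

Accuracy alone can hide the kind of error being made: here the single mistake is a ham message flagged as spam, which for a mail filter is usually the more costly error.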
5. Initialize a voting classifier that combines the two base classifiers using the soft voting method:
- voting_clf = VotingClassifier(estimators=[('rf', rf_clf), ('svm', svm_clf)], voting='soft')
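With soft voting, the ensemble averages the class probabilities predicted by each base classifier and picks the class with the highest average. The pure-Python sketch below shows only that arithmetic; `soft_vote` and the probability values are illustrative assumptions, not output from the project's trained models.

```python
from typing import Dict, List

Probabilities = Dict[str, float]


def soft_vote(per_model_probabilities: List[Probabilities]) -> str:
    """Average class probabilities across models and return the top class."""
    classes = per_model_probabilities[0].keys()
    n_models = len(per_model_probabilities)
    averaged = {
        label: sum(probs[label] for probs in per_model_probabilities) / n_models
        for label in classes
    }
    return max(averaged, key=averaged.get)


# Invented probabilities for one email from the two base classifiers.
rf_probs = {"ham": 0.45, "spam": 0.55}   # Random Forest leans slightly towards spam
svm_probs = {"ham": 0.80, "spam": 0.20}  # SVM is confident the email is ham

label = soft_vote([rf_probs, svm_probs])
print(label)  # prints: ham
```

In this example the two models disagree, so a plain majority vote would be tied; soft voting resolves it in favour of the more confident model. When using scikit-learn, note that `SVC` only provides probability estimates if it is created with `probability=True`, which soft voting requires.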
from flask import Flask, render_template, request, redirect, url_for, session
from flask_mysqldb import MySQL  # assumed: the Flask-MySQLdb extension provides MySQL(app)
import MySQLdb.cursors
import pickle

app = Flask(__name__)
mysql = MySQL(app)

# Load the trained classifier and the saved transformer from their pickle files.
filename = 'pickle.pkl'
with open(filename, 'rb') as f:
    clf = pickle.load(f)
with open('tranform.pkl', 'rb') as f:
    cv = pickle.load(f)

# App routing maps a specific URL to the function that serves that page.
@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    ...  # body not shown in this report

@app.route('/logout')
def logout():
    # Clear the login details from the session and return to the login page.
    session.pop('loggedin', None)
    session.pop('userid', None)
    session.pop('email', None)
    return redirect(url_for('login'))

# Registration page
@app.route('/register', methods=['GET', 'POST'])
def register():
    mesage = ''
    if (request.method == 'POST' and 'name' in request.form
            and 'password' in request.form and 'email' in request.form):
        userName = request.form['name']  # name submitted in the registration form
        ...  # remainder not shown in this report

if __name__ == "__main__":
    app.run()
Flask is the web framework used here; in code, Flask is also the name of a Python class.
Each web application is an instance of this class, so once we import Flask, we create an
instance of the Flask class for our web app.
We also import MySQLdb.cursors, which is used to execute the statements that
communicate with the MySQL database.
To save a trained model in Python, you can use the “pickle” or “joblib” module. Both
modules provide functions for serializing and deserializing Python objects, including
trained machine learning models.
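A minimal sketch of saving and reloading a model with pickle follows. The dictionary below is a hypothetical stand-in for a trained model, and the file name `model.pkl` is chosen only for this example; a temporary directory is used so the sketch leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model: any picklable Python object works.
model = {"spam_words": ["free", "prize"], "min_hits": 1}


def predict(trained: dict, text: str) -> str:
    """Label the text as spam if it contains enough of the stored spam words."""
    hits = sum(1 for word in text.lower().split() if word in trained["spam_words"])
    return "spam" if hits >= trained["min_hits"] else "ham"


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "model.pkl"

    # Serialize the model to disk (binary mode is required).
    with open(path, "wb") as handle:
        pickle.dump(model, handle)

    # Deserialize it again, as the web app does at start-up.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

prediction = predict(restored, "Claim your free gift")
print(prediction)  # prints: spam
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.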
app.secret_key is used by Flask to sign session data so that it cannot be tampered with.
app.config holds the application's configuration, which makes it easy to manage and
quickly deploy configuration settings.
After this come the individual pages of the website. From the home page the user goes to the
login page and enters their credentials. Users who have already registered and log in
correctly proceed to the next page; otherwise an invalid/incorrect email or password
message is shown.
A user has to register in order to reach the prediction page and check whether an email is
spam or not.
Here we have used a XAMPP server and imported the SQL data into it.
RESULT
1. Home Page:
The above image is the home page of our Email Spam Detection System. It offers the user two
options: a login button and a register button. A user needs to create an account by
registering with an email id and creating a password.
The user registration feature allows new users to create an account by providing necessary
information such as a unique username, email address, and password.
After registering, the user's credentials are saved, and the user needs to log in with the
registered email id and password. After clicking the sign in button, the user is taken to
the next window, where any message can be entered to be classified as spam or ham.
Once registered, users can use the user login feature to securely access their accounts. This
typically involves entering their username/email and password combination. The login process
verifies the provided credentials and grants access to the user's personalized dashboard or
account page.
Overall, the web page with user registration and login elements plays a crucial role in
facilitating user interaction with our project, enabling personalized access, and ensuring
the security and privacy of user accounts.
The image represents a user interface for the email prediction system. It includes a section where
users can input the email that they want to be predicted.
In the image, there is likely a text box or a designated area where users can enter the content of
the email they want to analyze. This can be in the form of a text input field where users can type
or paste the email's text. Alternatively, it could be a file upload button that allows users to upload
an email file in a specific format.
The purpose of this input area is to gather the text or data of the email that needs to be predicted
for spam detection or any other classification task. Users would typically enter the email content
they want to analyze in this field before initiating the prediction process.
By providing this input section, the system allows users to easily interact with the email
prediction functionality. It provides a straightforward and intuitive way for users to input the
email they want to analyze, ensuring that the system can process and predict the appropriate
classification for the given email data.
5. Result:
By showing the result, users can quickly assess the outcome of the prediction process and make
informed decisions or take appropriate actions based on the classification.
The result section is essential for user engagement and understanding of the prediction system's
performance. It provides transparency and allows users to evaluate the accuracy and
effectiveness of the prediction model. Users can rely on the result to determine the credibility
and potential actions to be taken regarding the email they inputted.
Overall, the result display in the image serves as a crucial component of the user interface,
providing clear and concise feedback on the predicted classification of the email, empowering
users to make informed decisions based on the prediction outcome.
CHAPTER 7
CONCLUSION
Spam email is one of the most demanding and troublesome internet issues in today’s world of
communication and technology. It is almost impossible to think about e-mail without
considering the issue of spam. By generating spam mails, spammers misuse this
communication facility and thus affect organisations and many email users, a problem made
worse by the surge in cybercrime. In network security and machine learning, spam
filtering in e-mail is a critical problem. The Naive Bayes (NB) classifier is critical for
detecting e-mail spam. To make the project a success, models were combined with
bio-inspired algorithms. With its probability distribution property, NB provides the likely
class for email content, spam or non-spam, based on keywords included in the email text.
According to studies, spam messages have evolved over time to evade filtering. The
fundamental design of an email spam filter, as well as spam email filtering approaches,
was explored. The research looked at a variety of publicly available data and performance
measures that can be used to assess the effectiveness of spam filters.
CHAPTER 8
-It provides sensitivity to the client and adapts well to future spam techniques.
CHAPTER 9
REFERENCES
10.Y.-Q. Zhou, Hybrid symbiotic organisms search algorithm for solving 0-1 knapsack
CHAPTER 10
PLAGIARISM REPORT