
Group 17 Blackbook Final Report

The project report focuses on developing an email spam detection system using a hybrid approach that combines Support Vector Machine (SVM) and Random Forest algorithms. It addresses the growing issue of spam emails, which constitute over 55% of total emails, by utilizing machine learning techniques for effective classification of emails as spam or non-spam. The report includes data collection, preprocessing, feature extraction, model training, and evaluation to enhance email security and user experience.


A PROJECT REPORT ON

“Email Spam Detection Using Hybridization Of


SVM and Random Forest”

SUBMITTED TO SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE IN


THE PARTIAL FULFILLMENT FOR THE AWARD OF THE DEGREE
OF

BACHELOR OF ENGINEERING IN
INFORMATION TECHNOLOGY

BY

SNEHA BOBDE B190458512

LOKESH KHADKE B190458539

SHARVARI ROLE B190458568

TEJAS SHIRUDE B190458572

UNDER THE GUIDANCE OF

MS. SHITAL KAKAD

DEPARTMENT OF INFORMATION TECHNOLOGY


Marathwada Mitra Mandal College of Engineering

Sr.No. 18, Plot No. 5/3, CTS No.205, VadarVasti Rd, behind Vandevi Temple, Karve Nagar, Pune, Maharashtra 411052

Academic Year 2022-23


CERTIFICATE

This is to certify that the Project Report entitled

“Email Spam Detection Using Hybridization of SVM and Random Forest”

Submitted by

Sneha Bobde B190458512

Lokesh Khadke B190458539

Sharvari Role B190458568

Tejas Shirude B190458572

is a bonafide work carried out by them under the supervision of Ms. Shital Kakad
and it is approved for the partial fulfillment of the requirement of Savitribai Phule
Pune University for the award of the Degree of Bachelor of Engineering (Information
Technology).
This project report has not been submitted to any other Institute or University for the
award of any degree or diploma.

Ms. Shital Kakad Dr. Rupali Chopade


Internal Guide HOD
Department of Information Technology Information Technology

External Guide Dr. V.N.Gohokar


Date: 03/06/23 Principal
Place: MMCOE,Pune Marathwada Mitra Mandal's
College of Engineering

DEPARTMENT OF INFORMATION TECHNOLOGY MMCOE, PUNE 2


ACKNOWLEDGEMENT

It is our proud privilege and duty to acknowledge the kind help and guidance
received from several people in the preparation of this report. It would not have been
possible to prepare this report in this form without their valuable help,
co-operation and guidance.

Our sincere thanks to Dr. Rupali Chopade, Head, Department of Information
Technology, for her valuable suggestions and guidance throughout the preparation of this
report.

We express our sincere gratitude to our guide Ms. Shital Kakad for guiding us in the
investigation of this project and in carrying out experimental work. We hold her in
esteem for the guidance, encouragement and inspiration received from her.

Last but not least we wish to thank our parents for financing our studies and helping us
throughout our life for achieving perfection and excellence. Their personal help in
making this report and project presentation is gratefully acknowledged.

Sneha Bobde B190458512

Lokesh Khadke B190458539

Sharvari Role B190458568

Tejas Shirude B190458572

(B.E. Information Technology)



PROJECT COMPLETION CERTIFICATE



ABSTRACT

For business purposes, email is the preferred form of formal communication. Despite there
being other forms of communication, email usage keeps growing. In the modern
environment, where email volume is increasing daily, automated email management is
crucial. More than 55 percent of all emails are deemed to be spam. This shows
how spam consumes user time and resources while producing little of value. It is
crucial to comprehend the various spam email classification approaches and their workings
because spammers employ sophisticated and inventive methods to carry out their illicit
actions via spam emails. The machine learning algorithms used for spam classification are
the main topic of this paper. Additionally, this paper offers a thorough analysis and
assessment of previous research on various machine learning methodologies, email
properties, and techniques. Additionally, it outlines potential paths for future study as well
as difficulties encountered in the field of spam classification.

Email spam continues to be a persistent and widespread issue, inundating users with
unwanted and potentially harmful messages. This project aims to develop an effective
email spam detection system using machine learning techniques. The system analyzes the
content of incoming emails and classifies them as spam or non-spam (ham) based on
learned patterns and features. The project involves data collection, preprocessing, feature
extraction, model training, and evaluation stages.

A large dataset of labeled emails is collected and preprocessed to remove noise and
irrelevant information. Relevant features, such as keywords and text patterns, are extracted
from the preprocessed emails. Machine learning models are trained using the extracted
features to learn the distinguishing characteristics of spam and ham emails.

The project aims to deploy the trained model in an email client or spam filter system to
detect and filter spam emails effectively. The email spam detection system will help users
avoid unwanted emails, protect against phishing attempts, and improve overall email
security and user experience.

Keywords:
Email-Spam, Machine Learning, Data Collection, Preprocessing, Feature Extraction,
Dataset, Spam Classification, Model Training, Support Vector Machine, Random Forest.



CONTENTS

Name

Certificate

Acknowledgement

Abstract

Index

List of Figures


INDEX

SR NO. CHAPTERS

1 INTRODUCTION

1.1 MOTIVATION

1.2 AIM AND OBJECTIVE

1.3 PROBLEM STATEMENT

2 LITERATURE SURVEY

3 SYSTEM REQUIREMENTS

3.1 SOFTWARE REQUIREMENTS

3.2 HARDWARE REQUIREMENTS

4 SYSTEM DESIGN

4.1 SYSTEM ARCHITECTURES

4.2 DATA FLOW DIAGRAMS

DFD LEVEL 0

DFD LEVEL 1

4.3 ACTIVITY DIAGRAM

5 IMPLEMENTATION

5.1 DATASET

5.2 SVMRF Algorithm

5.3 Code Implementation

6 RESULTS AND EVALUATION

7 CONCLUSION

8 FUTURE SCOPE AND APPLICATIONS

9 REFERENCES

10 PLAGIARISM REPORT


LIST OF FIGURES

Fig No. Figure Name

1 System Architecture URL-1
2 Data Flow Diagram 0
3 Data Flow Diagram 1
4 Data Flow Diagram 2
5 UML Diagram
6 Use Case Diagram
7 Activity Diagram
8 Class Diagram
9 User Interface
14 Email Prediction
15 Email Prediction Output


CHAPTER 1

INTRODUCTION

Real-world situations often have datasets with large dimensions and superfluous, worthless
data, which makes it difficult to analyse this data. Feature selection (FS) is one of the
preprocessing procedures in machine learning that can eliminate pointless and unnecessary
characteristics from a dataset and identify the final subset of crucial features that will
improve the performance of machine learning algorithms. 1– 3 In fact, FS is a crucial and
frequently used technique in data mining and machine learning for the reduction of
dimensions through the removal of unnecessary and redundant data from the dataset in
order to achieve the optimal feature subset that increases the efficiency and effectiveness
of classification algorithms. Finding the ideal feature subset, however, poses a complicated
optimization problem that cannot be resolved using standard techniques. In actuality, the
objective of FS is to identify a subset of features from the total collection of features in
order to enhance the efficiency of learning algorithms in terms of classification accuracy or
learning time.

Two frameworks, the wrapper-based and filter-based techniques, have so far been
suggested for effectively resolving the FS problem. 6,7 The two essential elements of the
former techniques are the search strategy and the evaluation criterion. 8 The search strategy
specifies the procedure for producing a candidate solution, that is, a feature subset. A specific
criterion is used to evaluate each created solution. The subset creation and evaluation
procedure is repeated, with the search refined in subsequent rounds, until a stopping
criterion is fulfilled. In contrast, filter-based approaches use statistical properties of the
data, such as the relationships and correlations between features, to detect superfluous
and unnecessary features without training a classifier. FS approaches can therefore be
divided into two groups: wrapper-based approaches and filter-based approaches.
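
The wrapper-based loop described above (search strategy, evaluation criterion, stopping criterion) can be sketched in a few lines of Python. This is only an illustrative sketch and not the method used in this report: the greedy forward search, the nearest-centroid evaluation, the function names and the toy data are all assumptions chosen to keep the example self-contained.

```python
def centroid_accuracy(rows, labels, features):
    """Evaluation criterion (assumed for illustration): training accuracy of a
    nearest-centroid classifier restricted to the given feature indices."""
    if not features:
        return 0.0
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, l in zip(rows, labels) if l == c]
        centroids[c] = [sum(r[f] for r in members) / len(members) for f in features]
    correct = 0
    for r, l in zip(rows, labels):
        dists = {c: sum((r[f] - m) ** 2 for f, m in zip(features, centroids[c]))
                 for c in classes}
        if min(dists, key=dists.get) == l:
            correct += 1
    return correct / len(rows)


def forward_select(rows, labels, n_features):
    """Search strategy: greedily add the feature that improves the score most."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        scored = [(centroid_accuracy(rows, labels, selected + [f]), f)
                  for f in candidates]
        # Prefer the higher score; break ties in favour of the lower feature index.
        score, feat = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:  # stopping criterion: no candidate improves the score
            break
        selected.append(feat)
        best = score
    return selected, best


# Toy data: feature 0 separates the classes, feature 1 is noise, feature 2 is constant.
rows = [[5, 1, 0], [4, 0, 0], [5, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
selected, best = forward_select(rows, labels, n_features=3)
print(selected, best)
```

On this toy data the search keeps only the informative feature and stops, because adding either of the other two does not improve the evaluation score.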

1.1 MOTIVATION:

Email spam has become a major problem: as the number of internet users grows rapidly,
the volume of spam grows with it. Spam emails are used for illegal and unethical conduct
such as phishing and fraud. Malicious links sent through spam emails can harm the
recipient's system and can also give attackers access to it.

Spam emails are increasing day by day and creating problems for users. A spam detector
that identifies whether a mail is spam or not will save users time and increase their
efficiency.

Creating a fake profile and email account is easy for spammers. They pose as genuine
people in their spam emails and target people who are not aware of these frauds.

It is therefore necessary to identify fraudulent spam mails. This project identifies
such spam using machine learning techniques and bio-inspired algorithms.

1.2 AIM AND OBJECTIVE:

• To improve the performance of existing spam filtering techniques.

• To achieve a more accurate and efficient spam filtering model that can reduce the number
of false positives and negatives and provide a more robust solution to the problem of spam
filtering.

1.3 PROBLEM STATEMENT:

Email spam remains a significant concern, as it inundates users with unsolicited and
potentially harmful messages. Traditional methods for email spam detection often rely on
single machine learning algorithms, which may struggle to achieve optimal performance.
This project aims to address the issue by proposing a hybridization approach using Support
Vector Machine (SVM) and Random Forest algorithms for more accurate and efficient email
spam detection.

CHAPTER 2

LITERATURE SURVEY

Each entry lists: Sr. No and Paper Title; Author; Year; Problem solved in this paper (Existing Problem Statement); Technique used to solve problem (Existing Problem Solution); What will be future work (Future Scope).

1. Binary Grasshopper Optimisation Algorithm Approaches for Feature Selection Problems
Author: Majdi Mafarja, Ibrahim Aljarah, Hossam Faris, Abdelaziz I. Hammouri, Ala’ M. Al-Zoubi, Seyedali Mirjalili
Year: 2018
Problem solved: The number of features while retaining an acceptable degree of classification accuracy by reducing unnecessary, redundant, and noisy data.
Technique used: Binary Grasshopper Optimization Algorithm
Future scope: Investigating the use of the suggested solutions to more real-world difficulties, such as real-world commercial issues and medical applications.

2. A comprehensive survey: Whale Optimization Algorithm and its applications
Author: Farhad Soleimanian Gharehchopogh, Hojjat Gholizadeh
Year: 2019
Problem solved: Determined to minimise or maximise the factors involved in the difficulties in order to make anything as useful and effective as possible.
Technique used: Support Vector Machine (SVM)
Future scope: In order to further benefit from using the WOA to solve issues involving continuous optimization, we advise combining it with other meta-heuristic methods.

3. A wrapper-based feature selection for improving performance of intrusion detection systems
Author: Maryam Samadi Bonab, Ali Ghaffari, Farhad Soleimanian Gharehchopogh, Payam Alemi
Year: 2020
Problem solved: A model of typical behaviour is developed during the anomaly-based detection process. Methods such as clustering and machine learning techniques are used to develop a model of normal behaviour.
Technique used: Ant lion algorithm, Support Vector Machine (SVM)
Future scope: It has been used to remove less significant features from IDS datasets and perform wrapper-based feature selection.

4. Virtual machine placement in cloud data centers using a hybrid multi-verse optimization algorithm
Author: Sasan Gharehpasha, Mohammad Masdari, Ahmad Jafarian
Year: 2020
Problem solved: Cloud computing offers the option of focusing solely on organisational objectives as opposed to increasing user hardware resources.
Technique used: Meta-heuristic optimization algorithm
Future scope: One of the important things on which you should concentrate is security.

5. Service Selection Using Multi-criteria Decision Making: A Comprehensive Overview
Author: Mehdi Hosseinzadeh, Hawkar Kamaran Hama, Marwan Yassin Ghafour, Mohammad Masdari, Omed Hassan Ahmed, Hemn Khezri
Year: 2020
Problem solved: A thorough analysis of the cutting-edge MCDM-based service selection methods put forward in the literature.
Technique used: A Genetic Algorithm
Future scope: Future research directions and aid in creating new service choices through the use of MCDM techniques.

6. Binary butterfly optimization approaches for feature selection
Author: Sankalap Arora, Priyanka Anand
Year: 2018
Problem solved: A particular fitness function is minimised; BOA can effectively search the feature space for an optimal or nearly optimal feature subset.
Technique used: Meta-heuristic algorithm
Future scope: To broaden the scope of the existing method, binary BOA can be used with additional classifiers such as artificial neural networks, k-nearest neighbours, and support vector machines, in addition to other public datasets and real-world issues.

7. A new hybrid ant colony optimization algorithm for feature selection
Author: Md. Monirul Kabir, Md. Shahjahan, Kazuyuki Murase
Year: 2012
Problem solved: By choosing important features, FS aims to improve a dataset's quality while making it simpler. Normally, FS removes erroneous features from the source dataset without degrading generalisation performance.
Technique used: Hybrid ant colony optimization (ACO) algorithm
Future scope: The amount of parameters in ACOFS may be decreased in the future or made adaptable.

8. An Efficient Binary Salp Swarm Algorithm with Crossover Scheme for Feature Selection Problems
Author: Hossam Faris, Majdi M. Mafarja, Ali Asghar Heidari, Ibrahim Aljarah, Ala’ M. Al-Zoubi, Seyedali Mirjalili, Hamido Fujita
Year: 2018
Problem solved: As a result, two novel wrapper FS methods are suggested, using SSA as the search method. Eight transfer functions are used in the first method to translate the continuous version of SSA into binary.
Technique used: Wrapper Feature Selection, Salp Swarm Algorithm, Optimization
Future scope: Examine how new S-shaped and V-shaped TF families may affect BSSA or other binary algorithms that have been examined.

9. Binary multi-verse optimization algorithm for global optimization and discrete problems
Author: Nailah Al-Madi, Hossam Faris, Seyedali Mirjalili
Year: 2019
Problem solved: The V-shaped transfer function in the Binary Multi-verse Optimizer converts continuous variables to binary and updates the solutions during the optimization process.
Technique used: Multi-verse optimization algorithm, Global optimization
Future scope: Our research includes testing our suggested strategy on many real-world, high-dimensional situations.

10. Symbiotic organisms search algorithm for optimal evolutionary controller tuning of fractional fuzzy controllers
Author: Yongquan Zhou, Fahui Miao, Qifang Luo
Year: 2019
Problem solved: A fractional fuzzy controller can expand the integral and differential order of a fuzzy controller to any real number, in contrast to a standard integer-order fuzzy controller.
Technique used: SOS Algorithm
Future scope: It might be taken into account in additional industries including bioscience, machinery, energy, and electronic power.

CHAPTER 3

SYSTEM REQUIREMENTS

3.1 Software Requirements:


1. Operating System - Windows 7/8/10
2. Front End - HTML, CSS, Bootstrap
3. Scripts - JavaScript.
4. Languages - Python
5. IDE - Visual Studio Code

3.2 Hardware Requirements:

1. Speed - 1.1 GHz


2. Processor - Intel i3/i5/i7
3. RAM - 8 GB(min)
4. Hard Disk - 40 GB
5. Key Board - Standard Windows Keyboard
6. Mouse - Two or Three Button Mouse
7. Monitor - SVGA

CHAPTER 4

SYSTEM DESIGN

4.1 System Architecture

1. SPAM DETECTION

Fig1 : System Architecture of Email Spam Detection System

In this email spam detection project, a dataset of labeled emails is collected to train a spam
detection model. The collected emails undergo preprocessing, including cleaning and
transforming the data.

The transformed emails are then subjected to feature selection, where relevant characteristics
indicative of spam or ham are identified. Using the selected features, a classification model is
trained on the dataset.

This model learns to distinguish between spam and non-spam emails. In the spam detection
phase, new emails are processed, features are extracted, and the trained model is applied to
classify the emails as spam or ham.

The project aims to develop an effective system that can accurately detect and filter out spam
emails, thereby improving email security and user experience.

4.2 DFD

The diagram represents the flow of data through the entire system and how each module
communicates with the others. It starts with the user providing the email that is to be
checked. The system processes this input and classifies it, and the result becomes the
system's response. This response is the output, which is made available to the user.

4.2.1 DFD LEVEL 0

Fig 2: A Basic DFD of Email Spam Detection System

4.2.2 DFD LEVEL 1

Fig3: DFD Level 1 of the System

The system receives input data, which includes the email message to be classified as spam or not
spam. The data is collected and then passed through steps such as data preprocessing and
feature extraction. The system then determines whether the message is spam or ham and
shows the output.

4.2.3 DFD LEVEL 2

Fig4: DFD Level 2

4.3 UML DIAGRAMS

A UML (Unified Modeling Language) diagram is a graphical representation used in software
engineering to visualize, design, and communicate the structure, behavior, and interactions of a
system. UML provides a standardized notation and set of diagrams that allow software developers,
architects, and stakeholders to express system concepts and relationships in a clear and consistent
manner.

4.3.1 USE CASE DIAGRAM:

SYSTEM

USER

Fig5: Use Case Diagram

UML diagrams provide a standardized and visual representation of software systems,
facilitating communication, design, and analysis among stakeholders. They serve as a common
language for software professionals, allowing them to understand, document, and refine system
requirements, architecture, and behavior.

4.3.2 ACTIVITY DIAGRAM:

An activity diagram is a behavioral diagram, i.e. it depicts the behavior of a
system. An activity diagram portrays the control flow, showing the various
decision paths.

Fig6: Activity Diagram

4.3.3 CLASS DIAGRAM:

Fig7: Class Diagram

A class diagram is a type of structural diagram in the Unified Modeling Language (UML) that
represents the static structure and relationships of the classes within a system. It provides a visual
representation of the classes, their attributes, methods, and associations with other classes.

CHAPTER 5

IMPLEMENTATION

5.1 DATASET:
The spam detection data comes from an open-source dataset on Kaggle that contains
5574 emails, both ham and spam.

The image below shows the email dataset.

• Email Collection: All the spam and ham emails are collected and then separated into two
groups, spam emails together and ham emails together, so that they can be processed
further. Grouping them this way makes the data easier to understand.
• Data Preprocessing: Data typically comes from many different sources and is frequently
in a number of formats. This is why it's crucial to transform the raw data. However,
because text data frequently contains redundant and repeating terms, this transformation
is not a straightforward procedure. The first stage in our solution is therefore to
process the text data.



Some basic steps in data preprocessing are:

Cleaning the Raw Data: During this stage, words or characters that don't contribute to
the text's meaning are removed. Typical cleaning procedures include removing stop
words, hyperlinks, numbers, special characters and extra whitespace.

Tokenization: Tokenization is the division of text into discrete units known as tokens.
Each token serves as both an input and a feature for the machine learning algorithm.

Fig8: Visual Representation of Tokenization

• Transformation: Humans are capable of effortlessly interpreting text data. However,
reading and analysing text is very difficult for machines. We need to translate our text
into a machine-readable format in order to complete this operation. The process of
embedding involves turning written data into numerical values or vectors that
a computer can understand.
• Feature Selection: Relevant features are extracted from the pre-processed emails. These
features can include word frequencies, presence of specific keywords, email headers, or
other characteristics that can differentiate between spam and ham emails. A feature
matrix is constructed, where each row represents an email and each column represents a
specific feature.
• Classification: The classification step involves training a machine learning model using
the selected features and then using this model to predict the class labels for new, unseen
emails. Incoming emails are classified as spam or non-spam, which identifies unwanted or
malicious emails for the user.
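
The cleaning, tokenization and transformation steps listed above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the code used in this project: the stop-word list is a tiny illustrative one, the function names are hypothetical, and a simple bag-of-words count stands in for whatever transformer the project actually fits.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the project)
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "and", "of", "in", "at"}


def clean(text):
    """Lower-case the text and strip hyperlinks, numbers and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"[^a-z\s]", " ", text)               # numbers, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()            # collapse extra whitespace


def tokenize(text):
    """Split cleaned text into tokens and drop stop words."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Build a feature matrix: one row per email, one column per vocabulary word."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


emails = [
    "WIN a FREE prize!!! Visit http://spam.example now",
    "Are you free for the meeting at 10?",
]
vocab, X = bag_of_words(emails)
print(vocab)
print(X)
```

Each row of `X` is the numerical vector for one email, which is the form a classifier can be trained on.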



5.2 SVMRF ALGORITHM:

SVM:
Support vector machines (SVMs) are a set of supervised learning methods used for
classification, regression and outlier detection.

The advantages of support vector machines are:

• Effective in high dimensional spaces.
• Still effective in cases where the number of dimensions is greater than the number of samples.
• Uses a subset of training points in the decision function (called support vectors), so it is
also memory efficient.
• Versatile: different kernel functions can be specified for the decision function. Common
kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

• If the number of features is much greater than the number of samples, avoiding over-fitting
through the choice of kernel function and regularization term is crucial.
• SVMs do not directly provide probability estimates; these are calculated using an expensive
five-fold cross-validation.
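
To make the idea of a maximum-margin classifier concrete, the following is a minimal sketch of a linear soft-margin SVM trained by stochastic subgradient descent on the regularised hinge loss. It is an illustration only and not the library implementation used in this project: the function names, the hyperparameters (`lam`, `lr`, `epochs`) and the two-feature toy data are all assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + hinge loss with per-sample subgradient steps.
    Labels must be +1 (spam) or -1 (ham)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample is inside the margin: move the boundary away from it.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely classified: only apply the regularisation shrink.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy features (assumed): [count of spam-like words, count of ham-like words]
X = [[3, 0], [4, 1], [0, 3], [1, 4]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
train_preds = [predict(w, b, x) for x in X]
print(train_preds)
```

The decision depends only on which side of the learned hyperplane a message falls, which is why a plain SVM yields a class label rather than a probability.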

Random Forest:

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It is
based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest
takes the prediction from each tree and predicts the final output based on the majority vote
of those predictions. A greater number of trees in the forest generally gives more stable
predictions and reduces the risk of overfitting compared with a single decision tree.
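
The majority-vote step can be illustrated with a short sketch. The three "trees" below are hand-written rules standing in for trained decision trees, and the feature names are hypothetical; a real random forest learns its trees from bootstrap samples of the training data.

```python
from collections import Counter


# Hand-written stand-ins for trained decision trees (hypothetical rules)
def tree_free_word(features):
    return "spam" if features["free_count"] > 0 else "ham"


def tree_many_links(features):
    return "spam" if features["link_count"] > 1 else "ham"


def tree_exclamations(features):
    return "spam" if features["exclamation_count"] > 2 else "ham"


TREES = [tree_free_word, tree_many_links, tree_exclamations]


def forest_predict(features):
    """Collect one vote per tree and return the majority label with the votes."""
    votes = [tree(features) for tree in TREES]
    label = Counter(votes).most_common(1)[0][0]
    return label, votes


label, votes = forest_predict(
    {"free_count": 1, "link_count": 0, "exclamation_count": 3}
)
print(label, votes)
```

Here two of the three trees vote spam, so the forest outputs spam even though one tree disagrees.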

Hybrid Algorithm:

This algorithm uses two popular machine learning algorithms, Random Forest Classifier
and SVM, to create a hybrid model that can classify a synthetic dataset. The code splits the
dataset into training and testing sets, then trains the hybrid model on the training set. Once
the hybrid model is trained, it makes predictions on the testing set and computes the
accuracy of the predictions. Finally, the code prints the accuracy of the hybrid model. The
accuracy of the hybrid model is a measure of how well it can predict the classification of the
data.



- Algorithm:

1. Import the necessary libraries and modules: `RandomForestClassifier`, `VotingClassifier`, `SVC`, `make_classification`, `train_test_split`, and `accuracy_score`.

2. Generate a synthetic dataset for classification:
- X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

3. Split the dataset into training and testing sets:
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Initialize the two base classifiers:
Random Forest Classifier:
- rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
SVM:
- svm_clf = SVC(kernel='linear', probability=True, random_state=42)

5. Initialize a voting classifier that combines the two base classifiers using the soft voting method:
- voting_clf = VotingClassifier(estimators=[('rf', rf_clf), ('svm', svm_clf)], voting='soft')

6. Train the voting classifier on the training set:
- voting_clf.fit(X_train, y_train)

7. Make predictions on the testing set:
- y_pred = voting_clf.predict(X_test)

8. Compute the accuracy of the hybrid model:
- accuracy = accuracy_score(y_test, y_pred)

9. Print the accuracy of the hybrid model:
- print("Accuracy:", accuracy)
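
Step 5 combines the two classifiers by soft voting, which averages the class probabilities predicted by each model and picks the class with the highest average. The arithmetic can be sketched without any library; the probability values below are made up for illustration and the function name is hypothetical.

```python
def soft_vote(probas, classes):
    """Average per-class probabilities across models and pick the best class."""
    n_models = len(probas)
    avg = [sum(p[i] for p in probas) / n_models for i in range(len(classes))]
    best = max(range(len(classes)), key=avg.__getitem__)
    return classes[best], avg


classes = ["ham", "spam"]
rf_proba = [0.40, 0.60]   # made-up Random Forest output: leans spam
svm_proba = [0.70, 0.30]  # made-up SVM output: more confidently ham
label, avg = soft_vote([rf_proba, svm_proba], classes)
print(label, avg)
```

With hard (majority) voting these two models would tie at one vote each. Soft voting resolves the disagreement in favour of the more confident model, which is one reason to enable probability estimates on the SVM.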



5.3 CODE IMPLEMENTATION:

from flask import Flask, render_template, request, redirect, url_for, session
from flask_mysqldb import MySQL
import MySQLdb.cursors  # cursor classes used to run statements against the MySQL database
import re  # regular expressions, used to validate the email address

import pickle  # serializes Python objects to a byte stream so they can be saved and reloaded
import joblib  # persistence and parallel-computing helpers, efficient for large arrays

app = Flask(__name__)  # an instance of the Flask application

app.secret_key = 'xyzsdfg'  # used by Flask and extensions to keep session data safe

# Application configuration: database connection settings read by flask_mysqldb
app.config['MYSQL_HOST'] = 'localhost'
app.config['MYSQL_USER'] = 'root'
app.config['MYSQL_PASSWORD'] = ''
app.config['MYSQL_DB'] = 'user-system'

mysql = MySQL(app)

# Load the trained classifier and the fitted text transformer from disk
filename = 'pickle.pkl'
with open(filename, 'rb') as model_file:
    clf = pickle.load(model_file)
with open('tranform.pkl', 'rb') as transform_file:
    cv = pickle.load(transform_file)

# App routing maps a specific URL to the function that serves that page
@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    return render_template('result.html')

@app.route('/login', methods=['GET', 'POST'])
def login():
    mesage = ''
    if request.method == 'POST' and 'email' in request.form and 'password' in request.form:
        email = request.form['email']
        password = request.form['password']
        cursor = mysql.connection.cursor(MySQLdb.cursors.DictCursor)
        cursor.execute('SELECT * FROM user WHERE email = %s AND password = %s',
                       (email, password))
        user = cursor.fetchone()
        if user:
            # The person has already registered: start the session
            session['loggedin'] = True
            session['userid'] = user['userid']
            session['name'] = user['name']
            session['email'] = user['email']
            mesage = 'Logged in successfully !'
            return render_template('predict.html', mesage=mesage)
        else:
            # Entered credentials are incorrect
            mesage = 'Please enter correct email / password !'
    return render_template('login.html', mesage=mesage)

@app.route('/logout')
def logout():
    # Clear every key that login() stored in the session
    session.pop('loggedin', None)
    session.pop('userid', None)
    session.pop('name', None)
    session.pop('email', None)
    return redirect(url_for('login'))

# Registration page
@app.route('/register', methods=['GET', 'POST'])
def register():
    mesage = ''
    if (request.method == 'POST' and 'name' in request.form
            and 'password' in request.form and 'email' in request.form):
        # Values submitted through the registration form
        userName = request.form['name']
        password = request.form['password']
        email = request.form['email']
        # DictCursor returns rows as dictionaries and stores the result set in the client
        cursor = mysql.connection.cursor(MySQLdb.cursors.DictCursor)
        cursor.execute('SELECT * FROM user WHERE email = %s', (email,))
        account = cursor.fetchone()  # returns a single row, or None if nothing matches
        if account:
            mesage = 'Account already exists !'
        elif not re.match(r'[^@]+@[^@]+\.[^@]+', email):
            mesage = 'Invalid email address !'  # email is not in a valid format
        elif not userName or not password or not email:
            mesage = 'Please fill out the form !'  # a required field is empty
        else:
            cursor.execute('INSERT INTO user VALUES (NULL, %s, %s, %s)',
                           (userName, email, password))
            mysql.connection.commit()
            mesage = 'You have successfully registered !'
    elif request.method == 'POST':
        mesage = 'Please fill out the form !'
    return render_template('register.html', mesage=mesage)

if __name__ == "__main__":
    app.run()

- Flask is the framework used here, and Flask is also the name of the Python class that the
framework provides. In other words, the Flask class is the prototype used to create instances
of web applications. So, once we import Flask, we need to create an instance of the Flask class
for our web app.
- First we import MySQLdb.cursors, which is used to execute all the statements that
communicate with the MySQL database.
- To save a trained model in Python, you can use the “pickle” or “joblib” module. Both
modules provide functions for serializing and deserializing Python objects, including
trained machine learning models.
- app.secret_key is used by Flask to sign the session data and keep it safe from tampering.
- app.config lets you manage the application's configuration settings in one place.

- App routing maps a specific URL to the associated function that serves the corresponding
page.
- After this, each page of the website is reached in turn. From the home page the user goes to
the login page and enters their credentials. If the user has already registered and the
credentials are correct, they are taken to the next page; otherwise an invalid/incorrect email
or password message is shown.
- The user has to register before going to the prediction page to check whether an email is spam
or not.
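
The model-saving step mentioned in the list above can be sketched with the standard pickle module. This is a minimal illustration under assumptions: a plain dictionary stands in for the trained classifier, and the file name spam_model.pkl is hypothetical, not taken from the project.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained classifier; a real project would pass
# the fitted model object here instead of a plain dictionary.
trained_model = {'algorithm': 'svm+random_forest', 'threshold': 0.5}

# Write to a temporary directory so the sketch does not touch project files.
model_path = os.path.join(tempfile.mkdtemp(), 'spam_model.pkl')

# Serialize the object to disk once, after training.
with open(model_path, 'wb') as model_file:
    pickle.dump(trained_model, model_file)

# Deserialize it later, for example when the web application starts.
with open(model_path, 'rb') as model_file:
    restored_model = pickle.load(model_file)

# Check that the restored object equals the original one.
print(restored_model == trained_model)
```

joblib offers the equivalent joblib.dump and joblib.load functions. In either case, only files from a trusted source should be loaded, because unpickling can execute arbitrary code.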

Here we have used the XAMPP server and imported the SQL data into it.

Fig9: XAMPP Control Panel

CHAPTER 6

RESULT

1. Home Page:

Fig10: Home Page

The above image shows the home page of our Email Spam Detection System. It offers the user two
options: a login button and a register button. A new user first needs to create an account by
registering with his/her email id and choosing a password for it.

2. Login Page:

Fig11: Login Page

After registration, the user’s credentials are saved, and the user needs to log in with the email
id and password used during registration. After clicking the sign-in button, the user is taken
to the next window, where he/she can submit any message to be classified as spam or ham.

Once registered, users can use the user login feature to securely access their accounts. This
typically involves entering their username/email and password combination. The login process
verifies the provided credentials and grants access to the user's personalized dashboard or
account page.
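
The credential check described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function names hash_password and verify_password, the salt size, and the iteration count are all hypothetical. It shows one common way to verify a password without storing it in readable form, using only the Python standard library.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen only for this sketch


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each user
    digest = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# At registration: keep the salt and digest instead of the raw password.
stored_salt, stored_digest = hash_password('s3cret!')

# At login: check the submitted password against the stored values.
print(verify_password('s3cret!', stored_salt, stored_digest))
print(verify_password('wrong', stored_salt, stored_digest))
```

In a design like this, the user table would hold the salt and digest columns, and the login route would call verify_password with the values fetched for the submitted email.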

3. Registration Page:

Fig12: Registration Page

The user registration feature allows new users to create an account by providing necessary
information such as a unique username, email address, and password. Overall, the web page with
user registration and login elements plays a crucial role in facilitating user interaction with our
project, enabling personalized access, and ensuring the security and privacy of user accounts.

4. Prediction Page:

Fig13: Prediction Page

The image shows the user interface of the email prediction system. It includes a section where
users can enter the email that they want to have classified.

In the image, there is likely a text box or a designated area where users can enter the content of
the email they want to analyze. This can be in the form of a text input field where users can type
or paste the email's text. Alternatively, it could be a file upload button that allows users to upload
an email file in a specific format.

The purpose of this input area is to collect the text of the email that is to be classified, for
spam detection or any other classification task. Users enter the email content they want to
analyze in this field before starting the prediction process.

By providing this input section, the system allows users to easily interact with the email
prediction functionality. It provides a straightforward and intuitive way for users to input the
email they want to analyze, ensuring that the system can process and predict the appropriate
classification for the given email data.
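
Before a classifier can work on the submitted text, the text usually has to be turned into features. The sketch below shows one simple way to do that with the standard library. How the project actually processes the input is not described on this page, so the function names preprocess and bag_of_words and the tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter


def preprocess(email_text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r'[a-z]+', email_text.lower())


def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


# Example input, as a user might paste it into the text box.
sample = 'WIN a FREE prize! Click now to win.'
tokens = preprocess(sample)
features = bag_of_words(tokens)

print(tokens)
print(features['win'])
```

Word counts like these are a typical starting point for the feature vector that is passed to the trained model.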

5. Result:

Fig14: Result Page

By showing the result, the page lets users quickly assess the outcome of the prediction process
and make informed decisions or take appropriate actions based on the classification.

The result section is essential for user engagement and for understanding the prediction system's
performance. It provides transparency and allows users to evaluate the accuracy and
effectiveness of the prediction model. Users can rely on the result to judge the credibility of
the email they submitted and to decide what action to take.

Overall, the result display in the image serves as a crucial component of the user interface,
providing clear and concise feedback on the predicted classification of the email, empowering
users to make informed decisions based on the prediction outcome.

CHAPTER 7

CONCLUSION
Spam email is one of the most demanding and troublesome internet issues in today’s world of
communication and technology. It is almost impossible to think about e-mail without
considering the issue of spam. By generating spam mails, spammers misuse this
communication facility and thereby affect organisations and many email users, a problem made
worse by the surge in cybercrime. In network security and machine learning, spam
filtering in e-mail is a critical problem. The Naive Bayes classifier is critical for detecting
e-mail spam. To make the project a success, models were combined with bio-inspired
algorithms. Using its probability distribution property, NB gives the likely class of an email,
spam or non-spam, based on the keywords in the email’s text. According to studies, spam
messages have evolved over time to evade filtering. The fundamental design of an email spam
filter, as well as spam email filtering approaches, was explored. The research also looked at a
variety of publicly available datasets and performance measures that can be used to assess the
effectiveness of spam filters.

CHAPTER 8

FUTURE SCOPE AND APPLICATIONS

- It provides sensitivity to the client and adapts well to future spam techniques.

- It considers a complete message, with respect to its organization, instead of single words.

- It increases Security and Control.

- It reduces IT Administration Costs.

- It also reduces Network Resource Costs.

CHAPTER 10

PLAGIARISM REPORT
