
DOI: https://doi.org/10.56452/7-12-46

ISSN: 0974-5823 Vol. 7 No. 12 December, 2022

International Journal of Mechanical Engineering

Email Spam Detection using Machine Learning

Dr. Nilesh Jain
Associate Professor, Mandsaur University nileshjainmca@gmail.com

Dr. B. K. Sharma
Professor, Mandsaur University, Mandsaur, e-mail: bksharma7426@gmail.com

ABSTRACT- Spam emails are unsolicited commercial or deceptive emails sent to a specific person or company [5]. Spam can be detected through natural language processing and machine learning methodologies. Machine learning methods are commonly used in spam filtering: classifiers label each email as either ham (valid messages) or spam (unwanted messages). The proposed work showcases differentiating features of the content of documents [4]. A lot of work has been performed in the area of spam filtering, but it is limited to some domains. Research on spam email detection focuses either on natural language processing methodologies [25] with a single machine learning algorithm, or on one natural language processing technique [22] with multiple machine learning algorithms [2]. In this project, a modeling pipeline is developed to review the machine learning methodologies.

Keywords: Email Spam Detection, Spam Detection, Machine Learning, Neural Networks, Naive Bayes, Support Vector Classifier, Logistic Regression, Spam, Social Media, Email.

1. INTRODUCTION

Technology has become a vital part of life today. With each passing day, the use of the internet increases, and with it the use of email for exchanging information and communicating; it has become second nature to most people. While e-mails are necessary for everyone, they also come with unnecessary, undesirable bulk mails, which are also called Spam Mails [29]. Anyone with access to the internet can receive spam on their devices. Most spam emails divert people’s attention away from genuine and important emails and direct them towards detrimental situations. Spam emails can fill up inboxes or storage capacities and degrade the speed of the internet to a great extent. These emails can corrupt one’s system by smuggling viruses into it, steal useful information, and scam gullible people. Identifying spam emails is a very tedious task and can get frustrating.

While spam detection can be done manually, filtering out a large number of spam emails takes very long and wastes a lot of time. Hence, spam detection software has become the need of the hour. To solve this problem, various spam detection techniques are now in use. The most common technique for spam detection is the Naive Bayesian [5] method with feature sets that assess the presence of spam keywords. One objective of this research is to demonstrate an alternative scheme: a Neural Network (NN) [4] classification system that utilises a collection of emails sent by several users. Another purpose is the development of spam detection with the help of Artificial Neural Networks, resulting in almost 98.8% accuracy.

2. LITERATURE SURVEY

Email :
Electronic mail (email) is a messaging system that electronically transmits messages across computer networks. Anyone is free to use email services through Gmail or Yahoo, or can register with an Internet Service Provider (ISP) and be provided with an email account. Only an internet connection is required; otherwise it is a free service.
Copyrights @Kalahari Journals Vol.7 No.12 (December, 2022)

Spam :
Bulk mails that are unnecessary and undesirable can be classified as Spam Mails. These spam emails can harm a user's system by filling up inboxes and degrading the speed of the internet connection.

Spam Detection :

Many spam detection techniques are in use nowadays. The methods use filters which can prevent emails from causing any harm to the user. The contributions and their weaknesses have been identified.

Several signals can be used to identify spam, for example the location of the sender, the message contents, and checks on the IP address or domain names [26]. Spammers use refined variations to avoid spam identification. A few measures connected with spam identification are: blacklists and whitelists, machine learning approaches, Naïve Bayes, Support Vector Machine, and Neural Network classification [27].

A mobile system was proposed by Mahmoud et al. [28] with the motive of identifying and blocking spam SMS. In their work, they attempted to protect smartphones by filtering SMS spam that contains abbreviations and idioms. The system was based on the Artificial Immune System (AIS) and the Naïve Bayesian (NB) algorithm. With the Naive Bayes algorithm, the messages are classified based on their features. It used an SMS dataset with 1324 messages. Results from this system gave a detection rate of 82%, a 6% positive rate and 91% accuracy.

Table 1 : Spam Categories

Categories             Descriptions
Health                 The spam of fake medications
Promotional products   The spam of fake fashion items like clothes, bags and watches
Adult content          The spam of adult content of pornography and prostitution
Finance & marketing    The spam of stock kiting, tax solutions, and loan packages
Phishing               The spam of phishing or fraud

An approach using the random forest algorithm is proposed by Akinyelu and Adewumi [1] in order to identify phishing or spam emails. It used 200 emails. The main aim of the research was to reduce features and increase efficiency/accuracy. The proposed algorithm achieves an accuracy of up to 99.7% with only 0.06% false positives. The research only covered the classification aspect, without considering vital information which can affect the results, especially in the case of limited text in the email.

Yüksel et al. [3] aimed to resolve the problem of spam by inhibiting spam emails from being spread within the email systems. To achieve this, they propose a cloud-based system, which involves the identification of spam emails using analytics and machine learning algorithms like support vector machines and decision trees. The results of the tests show that the SVM leads to a higher accuracy of up to 97.6% and a false-positive rate of 2.33%. The decision tree attains a lower accuracy of 82.6% and a false-positive rate of 17.3%. Results reveal that the increase in spam emails is affected by the number of received emails. Lee et al. [28] proposed an optimal technique for spam detection.

2.1. EXISTING SYSTEMS

Due to the increase in the number of email users, the amount of spam email has also risen in the past years. It has now become even more challenging to handle a wide range of emails for data mining and machine learning. Therefore, many researchers have executed comparative studies to see how various classification algorithms perform in classifying emails accurately, with the help of a number of performance metrics. Hence, it is important to find an algorithm that gives the best possible outcome for any particular metric for correct classification of emails as spam or ham.
The present systems of spam detection are reliant on three major methods:

A. Linguistic Based Methods
Unlike humans, who can grasp linguistic constructs along with their exposition, machines cannot, and hence it is necessary to teach machines some languages to help them understand these constructs. This is the technique that is used in

places like search engines in order to ascertain the next terms to suggest to users while they are typing their search. Sentences are divided into unigrams (words taken one at a time) and bigrams (words taken two at a time). Since this technique requires that every expression be remembered, this method is time-intensive and not feasible. [29]

B. Behavior-Based Methods

This technique is metadata-based. This approach requires that users generate a set of rules, and the users must have a thorough understanding of these rules. Since the attributes of spam change over time, the rules also need to be reformed from time to time. As a result, it still requires a human to scrutinise the details and is majorly user-dependent. [29]

C. Graph-Based Methods

This technique uses a single graphical representation by incorporating numerous, heterogeneous particulars. Graph-based anomaly recognition algorithms are executed which detect abnormal forms in the data showing behaviours of spammers. This method is not dependable, so it is taxing to recognise faulty opinions. [29]

Feature Engineering mostly depends on the commercial appeal of terms and is absolutely content-oriented, and does not depend on statistics. All these attributes lead to a noteworthy decline of this structure.

3. PROPOSED METHOD

Many techniques are available to detect spam e-mails. Broadly, there are 5 different techniques by which algorithms decide whether any mail is spam or not.

1. Content-Based Filtering Technique

Algorithms analyze words, the occurrence of words, and the distribution of words and phrases inside the content of e-mails and segregate them into spam and non-spam categories.

2. Case Base Spam Filtering Method

Algorithms trained on well-annotated emails marked as spam or non-spam try to classify the incoming mails into these two categories.

3. Heuristic or Rule-Based Spam Filtering Technique

Algorithms use pre-defined rules in the form of regular expressions to give a score to the messages present in the e-mails. Based on the scores generated, they segregate emails into spam and non-spam categories.

4. The Previous Likeness Based Spam Filtering Technique

Algorithms extract the incoming mails' features, create a multi-dimensional space vector and draw points for every new instance. Based on the KNN algorithm, these new points get assigned to the closest class, spam or non-spam.

5. Adaptive Spam Filtering Technique

Algorithms classify the incoming mails into various groups and, based on the comparison scores of every group with the defined set of groups, spam and non-spam emails are segregated.

This article will give an idea for implementing content-based filtering using one of the most famous algorithms for spam detection, which is K-Nearest Neighbour (KNN).

k-NN based algorithms are widely used for classification tasks. Let’s quickly look at the entire architecture of this implementation first and then explore every step. Executing these 5 steps, one after the other, will help us implement our spam classifier smoothly.

[Figure: Training and Testing Phase; New Email Classification]

Step 1: E-mail Data Collection


The dataset contained in a corpus plays a crucial role in assessing the performance of any spam filter. Many open-source datasets are freely available in the public domain.

Train/Test Split: Split the dataset into train and test datasets, but make sure that both sets have a balanced share of ham and spam emails (ham is simply a name for non-spam emails).

Below are a few of the well-known repositories where you can easily get thousands of kinds of data sets for free:
UC Irvine Machine Learning Repository
Kaggle datasets
AWS datasets

The email spam data set used here is distributed by SpamAssassin. There are a few categories of data; you can read the readme.html to get more background information on the data.

In short, there are two types of data present in this repository: ham (non-spam) and spam. Furthermore, the ham data is divided into easy and hard, which means that some non-spam data has a very high similarity with spam data. This might make it difficult for our system to make a decision.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis is a very important process in data science. It helps the data scientist to understand the data at hand and relate it to the business context.

The open-source tool that I will be using to visualize and analyze my data is Word Cloud.

Word Cloud is a data visualization tool used for representing text data. The size of each text in the image represents the frequency or importance of the word in the training data.

Visualization

Wordcloud

Wordcloud is a useful visualization tool that gives you a rough estimate of the words that have the highest frequency in your data.

[Figure: Visualization for spam email]

[Figure: Visualization for non spam email]

From this visualization, you can notice something interesting about the spam emails. A lot of them have a high number of “spammy” words such as: free, money, product etc. Having this awareness might help us make better decisions when it comes to designing the spam detection system.

One important thing to note is that a word cloud only displays the frequency of the words, not necessarily the importance of the words. Hence it is necessary to do some data cleaning, such as removing stopwords, punctuation and so on, before visualizing the data.

N-grams model visualization

Another visualization technique is to use a bar chart and display the frequency of the words that appear the most. N-gram refers to how many words you consider as a single unit when you are calculating the frequency of words.

The following are examples of the 1-gram and 2-gram models.
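As a rough illustration of what an n-gram frequency count involves, the sketch below counts 1-grams and 2-grams with the Python standard library. The function name `ngram_counts` and the three sample emails are made up for illustration and are not taken from the SpamAssassin data; splitting on whitespace is an assumed simplification of real tokenization.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count every run of n consecutive words across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample emails, for illustration only.
sample_emails = [
    "free money now",
    "claim your free money today",
    "meeting notes for today",
]

unigrams = ngram_counts(sample_emails, 1)  # single words
bigrams = ngram_counts(sample_emails, 2)   # pairs of adjacent words

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The most frequent entries of each counter are what a bar chart of the 1-gram or 2-gram model would plot.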


[Figure: Bar chart visualization of 1-gram model]

[Figure: Bar chart visualization of 2-gram model]

Train Test Split

It is important to split your data set into a training set and a test set, so that you can evaluate the performance of your model using the test set before deploying it in a production environment.

One important thing to note when doing the train test split is to make sure the distribution of the data in the training set and the testing set is similar. What it means in this context is that the percentage of spam email in the training set and the test set should be similar.

[Figure: Target Count For Train Data]

[Figure: Train Data Distribution]

[Figure: Count For Test Data]

[Figure: Test Data Distribution]

The share of spam in the train data and in the test data is quite similar, at around 20–21%, so we are good to go and can start to process our data!

Data Preprocessing

Text Cleaning

Text cleaning is a very important step in machine learning because your data may contain a lot of noise and unwanted characters such as punctuation, white space, numbers, hyperlinks, etc.

Some standard procedures generally used are:

1. converting all letters to lower/upper case
2. removing numbers
3. removing punctuation
4. removing white spaces
5. removing hyperlinks
6. removing stop words such as a, about, above, down,

doing and the list goes on…

Word stemming and word lemmatization are two techniques that both try to reduce words to their most basic form, but with different approaches.

Word stemming: Stemming algorithms work by removing the end or the beginning of the words, using a list of common prefixes and suffixes that can be found in that language.

[Figure: examples of word stemming for English words]

Word lemmatization: Lemmatization uses the dictionary of a particular language and tries to convert words back to their base form. It tries to take the meaning of the verbs into account and converts them back to the most suitable base form.

Implementing these two algorithms means dealing with different edge cases.

Import the library and start designing some functions to help us understand the basic working of these two algorithms.

    # Import the NLTK stemmer and lemmatizer.
    # Note: WordNetLemmatizer needs the WordNet corpus, e.g. nltk.download("wordnet").
    from nltk.stem import PorterStemmer
    from nltk.stem import WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    dirty_text = "He studies in the house yesterday, unluckily, the fans breaks down"

    def word_stemmer(words):
        # Stem every word, then join the results back into one string.
        stem_words = [stemmer.stem(o) for o in words]
        return " ".join(stem_words)

    def word_lemmatizer(words):
        # Lemmatize every word, then join the results back into one string.
        lemma_words = [lemmatizer.lemmatize(o) for o in words]
        return " ".join(lemma_words)

The output of the word stemmer is very obvious: some of the endings of the words have been chopped off.

    clean_text = word_stemmer(dirty_text.split(" "))
    clean_text
    # Output
    # 'He studi in the hous yesterday, unluckily, the fan break down'

The lemmatization has converted studies -> study, breaks -> break.

    clean_text = word_lemmatizer(dirty_text.split(" "))
    clean_text
    # Output
    # 'He study in the house yesterday, unluckily, the fan break down'

Feature Extraction

Our algorithm always expects the input to be integers/floats, so we need a feature extraction layer in the middle to convert the words to integers/floats.

There are a couple of ways of doing this:

1. CountVectorizer

2. TfidfVectorizer

3. Word Embedding

CountVectorizer

First we need to input all the training data into CountVectorizer. The CountVectorizer will keep a dictionary of every word and its respective id, and this id will relate to the word count of this word inside the whole training set.

For example, take a sentence like ‘I like to eat apple and drink apple juice’:

    from sklearn.feature_extraction.text import CountVectorizer
    # list of text documents
    text = ["I like to eat apple and drink apple juice"]
    # create the transform
    vectorizer = CountVectorizer()
    # tokenize and build vocab
    vectorizer.fit(text)
    # summarize
    print(vectorizer.vocabulary_)
    # encode document
    vector = vectorizer.transform(text)
    # summarize encoded vector


    print(vector.shape)
    print(type(vector))
    print(vector.toarray())
    # Output
    # The number following each word is the index of the word
    # {'like': 5, 'to': 6, 'eat': 3, 'apple': 1, 'and': 0, 'drink': 2, 'juice': 4}
    # The index relates to the position of the word count in the array below
    # "I like to eat apple and drink apple juice" -> [1 2 1 1 1 1 1]
    # apple, which has the index 1, corresponds to the word count of 2 in the array

TfidfVectorizer

Word counts are good, but can we do better? One issue with a simple word count is that some words like ‘the’ and ‘and’ will appear many times and they don’t really add much meaningful information.

Another popular alternative is TfidfVectorizer. Besides taking the word count of every word, the vectorizer will try to downscale words that often appear across multiple documents or sentences.

For more info about CountVectorizer and TfidfVectorizer, please read this great article, which is also where I gained most of my understanding.

Word Embedding

Word embedding tries to convert a word to a vectorized format, and this vector represents the position of this word in a higher dimensional space.

For words that have similar meaning, the cosine distance between the two word vectors will be shorter and they will be closer to each other.

And in fact, these words are vectors, so you can even perform math operations on them! The end result of these operations will be another vector that maps to a word. Unexpectedly, those operations produce some amazing results!

Example 1: King - Man + Woman = Queen

Example 2: Madrid - Spain + France = Paris

Example 3: Walking - Swimming + Swam = Walked

Simply put, word embedding is a very powerful representation of words, and one of the well-known techniques for generating this embedding is Word2Vec.

Algorithm Implementation

TfidfVectorizer + Naive Bayes Algorithm

The first approach is to use the TfidfVectorizer as a feature extraction tool and the Naive Bayes algorithm to do the prediction. Naive Bayes is a simple, traditional probabilistic machine learning algorithm.

It has long been very popular for solving problems like spam detection. Using the Naive Bayes implementation provided by the sklearn library saves us a lot of hassle in implementing this algorithm. It can be done in a few lines of code:

    from sklearn.naive_bayes import GaussianNB

    # x_train_features / x_test_features: TF-IDF feature matrices (sparse),
    # y_train / y_test: the spam/ham labels.
    clf = GaussianNB()
    clf.fit(x_train_features.toarray(), y_train)
    # The score is the accuracy of the prediction
    # Accuracy: 0.995
    clf.score(x_train_features.toarray(), y_train)
    # Accuracy: 0.932
    clf.score(x_test_features.toarray(), y_test)

We achieve an accuracy of 93.2% on the test set. But accuracy is not the sole metric for evaluating the performance of an algorithm. Other scoring metrics may help us understand more thoroughly how well this model is doing.

Scoring & Metrics

When it comes to evaluating a data science model’s performance, sometimes accuracy may not be the best indicator.

Some problems that we are solving in real life might have very imbalanced classes, and using accuracy might not give us enough confidence to understand the algorithm’s performance.

In the email spam problem, the spam data is approximately 20% of our data. If our algorithm predicts all the emails as non-spam, it will achieve an accuracy of 80%.

And for a problem that has only 1% positive data, predicting all the samples as negative will give an accuracy of 99%, but we all know this kind of model is useless in a real-life scenario.
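To make the arithmetic above concrete, here is a minimal sketch under stated assumptions: a hypothetical set of 100 labels with 20% spam (1 = spam, 0 = ham) and a "model" that labels every email as ham. The labels and the helper name `accuracy` are illustrative only and are not taken from the dataset used in this paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 20 spam emails (1) and 80 ham emails (0).
y_true = [1] * 20 + [0] * 80

# A trivial "model" that predicts ham for every email.
y_pred_all_ham = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred_all_ham)

# Number of spam emails the trivial model actually catches.
spam_caught = sum(1 for t, p in zip(y_true, y_pred_all_ham) if t == 1 and p == 1)

print(f"accuracy: {baseline_accuracy:.2f}, spam caught: {spam_caught} of 20")
```

The trivial model reaches 80% accuracy while catching no spam at all, which is why accuracy alone says little on imbalanced data.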

Precision & Recall

Precision and recall are the common evaluation metrics that people use when they are evaluating a class-imbalanced classification model.

Precision evaluates how accurate a model is when it predicts something as positive. Recall, on the other hand, evaluates how well a model finds all the positive samples.

The mathematical equations for precision and recall are, respectively:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]

TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative

Confusion Matrix
A confusion matrix is a very good way to understand results like true positives, false positives, true negatives and so on.

The sklearn documentation provides sample code showing how to plot a nice-looking confusion matrix to visualize your result.

[Figure: Confusion Matrix of the result]

Precision: 87.82%

Recall: 81.01%

The recall of this model is rather low; it might not be doing a good enough job in discovering the spam emails.

Summary

I have shown you all the steps needed in designing a spam detection algorithm. Just a brief recap:

Explore and understand your data

Visualize the data at hand to gain a better intuition: Wordcloud, N-gram Bar Chart

Text Cleaning: Word Stemmer and Word Lemmatization

Feature Extraction: Count Vectorizer, Tfidf Vectorizer, Word Embedding

Algorithm: Naive Bayes

Scoring & Metrics: Accuracy, Precision, Recall

Here concludes the first part of the demonstration of designing a spam detection algorithm.

4. CONCLUSION

As shown in Figure 4, all the models based on feature set 2 (most-frequent-word-count) have higher accuracy and F1 score than those based on feature set 1 (stop words + n-gram + tf-IDF).

If the use case is to introduce a beta version of an email spam detector that keeps spam out of the inbox, the Neural Network model with the tanh activation function and feature set 1 (stop words + n-gram + tf-IDF) serves this purpose.

According to the graphs in Figure 4, if the use case is to introduce an email spam detector that reduces the bad user experience of searching for important emails in junk mailboxes and filtering spam from the inbox, the Neural Network with feature set 2 (‘most frequent word count’) gives a better user experience in general.

The future work includes testing the model with various standard datasets. This research proposes that the outcome obtained should be compared with additional spam datasets from various sources. Also, more classification and

feature algorithms should be analyzed with email spam datasets.

5. REFERENCES

[1] AKINYELU, A. A., & ADEWUMI, A. O. (2014). “Classification of phishing email using random forest machine learning technique”. Journal of Applied Mathematics.

[2] Vinodhini. M, Prithvi. D, Balaji. S “Spam Detection Framework using ML Algorithm” in IJRTE ISSN: 2277-3878, Vol.8 Issue.6, March 2020.

[3] YÜKSEL, A. S., ÇANKAYA, Ş. F., & ÜNCÜ, İ. S. (2017). “Design of a Machine Learning Based Predictive Analytics System for Spam Problem.” Acta Physica Polonica, A., 132(3).

[4] GOODMAN, J. (2004, July). “IP Addresses in Email Clients.” In CEAS.

[5] Deepika Mallampati, Nagaratna P. Hegde “A Machine Learning Based Email Spam Classification Framework Model” in IJITEE, ISSN: 2278-3075, Vol.9 Issue.4, February 2020.

[6] Javatpoint, “Machine Learning Tutorial” 2017 https://www.javatpoint.com/machine-learning

[7] SpamAssassin, “Spam and Ham Dataset”, Kaggle, 2018. https://www.kaggle.com/veleon/ham-and-spam-dataset

[8] Apache, “open-source Apache SpamAssassin Dataset”, 2019 https://spamassassin.apache.org/old/publiccorpus/

[9] SpamAssassin, “Spam Classification Kernel”, 2018 https://www.kaggle.com/veleon/spam-classification

[10] SpamAssassin, “REVISION HISTORY OF THIS CORPUS”, 2016 https://spamassassin.apache.org/old/publiccorpus/readme.html

[11] Jason Brownlee, “Naive Bayes for Machine Learning” The Machine Learning Mastery, April 11, 2015. https://machinelearningmastery.com/naive-bayes-for-machine-learning/

[12] Wikipedia, “History of email spam,” Internet Free Encyclopedia, 2001. https://en.wikipedia.org/wiki/History_of_email_spam

[13] Rohith Gandhi, “Support Vector Machine” The Machine Learning Mastery, June 7, 2018. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47

[14] Jason Brownlee, “Logistic Regression for Machine Learning” The Machine Learning Mastery, April 1, 2016. https://machinelearningmastery.com/logistic-regression-for-machine-learning/

[15] Jason Brownlee, “How to Encode Text Data for Machine Learning with scikit-learn” The Machine Learning Mastery, September 29, 2017. https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

[16] I. Androutsopoulos, J. Koutsias, K. Chandrinos and C. D. Spyropoulos, "An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal email messages," Computation and Language, pp. 160-167, 2000.

[17] G. V. Cormack, "Email Spam Filtering: A Systematic Review," Foundations and Trends® in Information Retrieval, vol. 1, no. 4, pp. 335-455, 2006.

[18] M. Siponen and C. Stucke, "Effective Anti-Spam Strategies in Companies: An International Study," Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS'06), 2006.

[19] Guzella, T. S. and Caminhas, W. M. “A review of machine learning approaches to Spam filtering.” Expert Syst. Appl., 2009.

[20] Jianying Zhou, Wee-Yung Chin, Rodrigo Roman, and Javier Lopez, (2007) "An Effective MultiLayered Defense Framework against Spam", Information Security Technical Report 01/2007.

[21] Xiao Mang Li, Ung Mo Kim, (2012) "A hierarchical framework for content-based image spam filtering", 8th International Conference on Information Science and Digital Content Technology (ICIDT), Jeju, June, pp. 149-155.

[22] Linda Huang, Julia Jia, Emma Ingram, Wuxu Peng, “Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection”, 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.

[23] W.A. Awad, S.M. Elseuofi, Machine learning methods for spam E-mail classification, Int. J. Comput. Sci. Inf. Technol. 3 (1) (2011) 173–184.

[24] K.R. Dhanaraj, V. Palaniswami, Firefly and Bayes classifier for email spam classification in a distributed environment, Aust. J. Basic Appl. Sci. 8 (17) (2014) 118–130.

[25] M. Zavvar, M. Rezaei, S. Garavand, Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine, Int. J Mod Educ. Comput. Sci. (2016) 68-74.

[26] Deepika Mallampati, “An Efficient Spam Filtering using Supervised Machine Learning Techniques” in IJSRCSE, Vol.6, Issue.2, pp.33-37, April (2018).

[27] Deepika Mallampati, K.Chandra Shekar and K.Ravikanth “Supervised Machine Learning Classifier for Email Spam Filtering”, © Springer Nature Singapore Pte Ltd. 2019 and Engineering, https://doi.org/10.1007/978-981-13-7082-341.

[28] GUPTA, H., JAMAL, M. S., MADISETTY, S., & DESARKAR, M. S. (2018, January). “A framework for real-time spam detection in Twitter.” In Communication Systems & Networks (COMSNETS), 2018 10th International Conference on (pp. 380-383).

[29] MAHMOUD, T. M., & MAHFOUZ, A. M. (2012). “SMS spam filtering technique based on artificial immune system.” International Journal of Computer Science Issues (IJCSI), 9(2), 589.

[30] AN ANTI-SPAM DETECTION MODEL FOR EMAILS OF MULTI-NATURAL LANGUAGE Mazin Abed Mohammed a,*, Salama A. Mostafa b,*, Omar Ibrahim Obaid
Copyrights @Kalahari Journals Vol.7 No.12 (December, 2022)
International Journal of Mechanical Engineering
499



