Email Spam Filtering Techniques: A Review
3 authors, including:
Pravin Kshirsagar
Raisoni Group of Institutions
All content following this page was uploaded by Pravin Kshirsagar on 20 December 2021.
1. Introduction
Spam, also known as unwanted bulk email, has become a persistent part of daily life. The
growing volume of spam has seriously reduced the efficiency of email, which is now used not
only for discussion but also as a task manager, a document-sharing system and an archive.
Some case studies have highlighted the alarming finding that spam of all types can account
for as much as 88% to 92% of all emails sent every day [1]. Spam content may include illegal
products and services, intimidation and fraud, and spam emails often carry threats such as
information theft by means of fast-spreading malware. Several solutions have therefore been
proposed to keep the situation from worsening, and many spam detection techniques have
emerged and been marketed over the years. However, every email user still receives a large
amount of spam every year, which indicates the need for, and urgency of, improving spam
detection. Since spam can be detected at each step of the email delivery process, spam
filtering often combines multiple methods that work together, for example
whitelists/blacklists, challenge-response, rule-based filtering, keyword-based filtering and
content-based filtering. Spam filtering can be regarded as a specialized binary text
classification task that labels an email as spam or ham [2].
ISSN: 0011-9342 | Year 2021
Design Engineering Issue: 9 | Pages: 8327 - 8338
applied to tackle the spam issue. Figure 1 shows the generalized email spam prediction
model [4].
[Figure 1: Generalized email spam prediction model. Labelled components: Email Spam Dataset, Original Data, Data Pre-processing, Tokenization, Semantic Feature Selection, Feature Reduction, Training Data, Data Classification, Classification model.]
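The pre-processing stages named in Figure 1 (tokenization, feature selection and reduction) can be illustrated with a minimal sketch. The stop-word list, the document-frequency threshold and all function names below are assumptions made for illustration; the source does not specify how each stage is implemented.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for", "you"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text):
    """Tokenize and drop stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def select_features(documents, min_doc_freq=2):
    """Feature reduction: keep tokens found in at least min_doc_freq documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(preprocess(doc)))
    return sorted(t for t, n in doc_freq.items() if n >= min_doc_freq)

def vectorize(text, vocabulary):
    """Bag-of-words count vector over the selected vocabulary."""
    counts = Counter(preprocess(text))
    return [counts[t] for t in vocabulary]

emails = [
    "Win a free prize now",
    "Free prize for you, claim now",
    "Meeting agenda for the project",
    "Project meeting moved to Monday",
]
vocabulary = select_features(emails)
vectors = [vectorize(e, vocabulary) for e in emails]
print(vocabulary)
print(vectors)
```

The resulting count vectors are the kind of input a classifier in the final stage of the pipeline would consume.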
classification. This model has low variance and is efficient. A logistic model is also easy
to update with new data using stochastic gradient descent.
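The incremental update mentioned above can be sketched as follows. This is a minimal illustration, assuming a binary label (1 for spam, 0 for ham), a fixed learning rate and toy two-feature inputs; the class name and the data are hypothetical and do not come from the source.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class OnlineLogisticRegression:
    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        """Estimated probability that x belongs to class 1 (spam)."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return sigmoid(z)

    def update(self, x, y):
        """One stochastic gradient step on the log-loss for one example."""
        error = self.predict_proba(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

# Toy data: features are hypothetical counts of the words "free" and "meeting".
data = [([2, 0], 1), ([1, 0], 1), ([0, 2], 0), ([0, 1], 0)]
model = OnlineLogisticRegression(n_features=2)
for _ in range(200):
    for x, y in data:
        model.update(x, y)

# A newly labelled message is folded in with one more update() call,
# without retraining from scratch.
model.update([3, 0], 1)
print(model.predict_proba([2, 0]), model.predict_proba([0, 2]))
```

Because each step touches only one example, the model can keep learning as newly labelled messages arrive.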
ii. Naïve Bayes: The family of naive Bayes (NB) classifiers relies on Bayes' theorem, which
relates marginal and conditional probabilities. In machine learning and spam detection [11],
these probabilities can be estimated from the relative frequencies of words in messages. The
second idea is the so-called naive assumption that all features are independent of one
another given the output (i.e., the class of the message). Although this independence
assumption is rarely true, naive Bayes classifiers can classify well even when only a few
training examples are available. In addition, classifiers of the NB family are considered
fast and simple.
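A minimal sketch of the idea, assuming a multinomial variant with Laplace smoothing and whitespace tokenization (neither is specified in the source); the class name and the training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSpamFilter:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        self.class_counts = Counter()             # class -> number of messages
        self.vocabulary = set()

    def fit(self, messages, labels):
        for text, label in zip(messages, labels):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def log_score(self, text, label):
        """log P(label) + sum of log P(word | label), words treated as independent."""
        total_msgs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_msgs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocabulary)
        for word in text.lower().split():
            if word in self.vocabulary:  # words never seen in training are skipped
                count = self.word_counts[label][word]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_score(text, c))

nb = NaiveBayesSpamFilter()
nb.fit(
    ["win free prize now", "free offer click now",
     "meeting agenda attached", "lunch meeting tomorrow"],
    ["spam", "spam", "ham", "ham"],
)
print(nb.predict("free prize now"), nb.predict("meeting tomorrow"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied.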
iii. Support vector machines (SVM): SVMs are among the most widely used classification
algorithms, although they are also applied to other tasks (for example, outlier detection).
Given a labelled dataset, an SVM finds a separating hyperplane by maximizing its distance to
the nearest data points (vectors representing samples) of the different classes. Two types of
SVM models exist: hard-margin (every point must be classified correctly) and soft-margin
(some misclassification is tolerated) [12]. Unlike k-NN classifiers, SVMs benefit from
working in higher dimensions, since the data points can be separated more easily as the
number of features increases. The points nearest to the classification hyperplane are known
as support vectors. The hyperplane is also known as the decision boundary and divides
elements that belong to different groups. The gap between the two parallel hyperplanes that
pass through the support vectors is called the margin. The larger the margin, the better.
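The soft-margin idea can be sketched with a linear SVM trained by subgradient descent on the hinge loss. This is an illustrative stand-in for the quadratic-programming solvers used in practice; the function names, the hyperparameters and the toy data are assumptions and are not taken from the source.

```python
import math

def train_linear_svm(samples, labels, C=1.0, learning_rate=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
    by per-sample subgradient descent. Labels must be +1 or -1."""
    n = len(samples)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point lies inside the margin or is misclassified:
                # the hinge term contributes to the subgradient.
                w = [wi - learning_rate * (wi / n - C * y * xi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * C * y
            else:
                # Only the regularizer (which widens the margin) is active.
                w = [wi - learning_rate * (wi / n) for wi in w]
    return w, b

def predict(w, b, x):
    """Side of the decision boundary on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

samples = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(w, b, margin_width)
```

A smaller `C` tolerates more margin violations in exchange for a wider margin; a very large `C` approaches hard-margin behaviour.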
interpret natural language text. At the heart of OST are stores of world and linguistic
knowledge acquired by experts, primarily an ontology and a lexicon, which are also regarded
as static knowledge resources. The repository contains language-independent concepts along
with the relations between them and a Proper Name Dictionary (PND), and it is used to
represent the various meanings of words and sentences. Today, OST is widely used in
human-robot interaction.
c. Baseline semantic spam filtering: Hempelman and Mehra built baseline semantic spam
filtering on OST and presented OST as a way to move spam filtering to the semantic level.
This approach adds to the information assurance and security applications of OST,
specifically aiming to take the steganographic contest to a new level, away from statistical
pattern matching and toward text comprehension [15]. To deal with spam content, most existing
spam filters use statistics, usually Bayesian, to detect rare words ("Viagra," "re-entrant,"
"replication"), combined with some hard-coded heuristics. OST, in contrast, can raise the
threshold for the adversary, in effect side-stepping the noise-based, hash-busting
steganography currently in use, although somebody with more resources and more time could
still develop an improved noise model and separate the signal from the rest. Baseline
semantic spam filtering does this at an initial, low level, again using statistical features,
but now features of the meaning of possibly suspicious texts rather than of the surface text
that is the epiphenomenon of language. That is, instead of considering the meaning of the
text, it considers only the level of meaningfulness of the text. Other applications of OST to
information assurance and security involve the content of the meaning itself.
2. Literature Review
Nadjate Saidani et al. (2020) proposed a technique based on a two-level semantic analysis
[16]. First, emails were categorized by domain so that a distinct conceptual view of spam was
obtained for each domain. Next, spam was detected by combining a set of manually specified
attributes with automatically extracted semantic features. These attributes helped summarize
the email content into compact topics, which allowed spam emails to be distinguished from
non-spam emails effectively. The technique detected spam more effectively than traditional
techniques based on BoW (bag-of-words).
Wuxu Peng et al. (2018) developed a new algorithm to improve the accuracy of the NB (Naive
Bayes) spam filter by detecting text modifications and classifying each email as spam or ham
[17]. The results showed that the approach consistently reduced the number of spam emails
misclassified as ham.
E. Elakkiya et al. (2019) introduced an FFNN (feed-forward neural network) trained with BP
(back propagation) for spam detection [18]. The initial weights of the FFNN were tuned to
improve the quality of the learning process; for this purpose, the FA (firefly algorithm) was
used to reduce the time needed to find optimal weights during learning. A Twitter dataset was
used for the experiments. The results showed that the approach was effective in terms of
accuracy and detection rate and achieved the lowest FPR (false positive rate).
Thayakorn Dangkesee et al. (2017) presented an adaptive data classification scheme for spam
detection that used spam word lists and a commercial URL-based security tool [19].
The NB (Naïve Bayes) algorithm was used to analyze the data. The scheme proved more effective
than traditional techniques, and the experimental results confirmed that it was suitable for
detecting spam.
Xiaoxu Liu et al. (2021) explored the potential of the Transformer model for detecting spam
SMS (Short Message Service) messages and proposed an enhanced Transformer model for this
purpose [20][35][36][37]. Two datasets were used to evaluate the model. The experimental
results were promising: the model achieved an accuracy of up to 98.92%, a recall of 94.51%
and an F1-score of about 96.13%.
Maria Habib et al. (2018) proposed a framework for detecting spam email based on a hybrid of
GP (Genetic Programming) and SMOTE (Synthetic Minority Over-sampling Technique) [21]. Two
email corpora were used to test the framework with respect to accuracy, recall, precision and
G-mean. The experimental results confirmed the effectiveness of the framework for classifying
spam emails compared with traditional techniques.
Wuxain Zhang et al. (2017) developed an approach that used a feature-based technique and a
supervised learning method to detect spam posts on Instagram [22][32][33][34]. User profiles
and media posts were collected from Instagram. The media posts were labelled quickly by using
MinHash and K-medoids clustering to group near-duplicate posts into the same clusters. The
approach classified these posts as spam or ham with an accuracy of about 96.27%.
Zhijie Zhang et al. (2020) designed a spam detection algorithm based on a regularized ELM
(extreme learning machine), known as I2FELM (Improved Incremental Fuzzy-kernel-regularized
Extreme Learning Machine), to detect spam on Twitter accurately [23]. The experimental
results showed that the algorithm was suitable for detecting spam.
Aso Khaleel Ameen et al. (2018) designed a DL (deep learning) technique to detect spam on
Twitter [24]. The technique first trained a Word2Vec-based representation. The tweets were
then classified as spam or normal using binary classifiers. Finally, the MLP (Multilayer
Perceptron) was adopted to classify spam tweets. The results showed that the technique
outperformed existing ones and performed well in terms of precision, recall and F-measure.
Girija Chetty et al. (2019) presented a DL (deep learning) based mechanism for spam detection
[25]. In this mechanism, the word embedding method was combined with an NN (Neural Network)
algorithm. Word embedding helped capture the meaning and analogy of words. The attributes of
text documents in the embedding space were learned using a DNN (deep neural network) and then
used to classify the text documents. The mechanism showed potential for detecting spam in
different text documents.
Nattanan Watcharenwong et al. (2017) proposed an approach for detecting spam in closed groups
that combined text attributes with social attributes [26][27][28]. The RF (Random Forest)
algorithm was used to classify spam from 1,200 labelled posts. The results showed that the
approach achieved an efficacy of up to 98% in detecting spam [29][30][31].
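The studies above report accuracy, precision, recall, F1-score (F-measure), G-mean and FPR. A minimal sketch of how these metrics follow from a confusion matrix is given below. It assumes spam is the positive class and takes G-mean to be the geometric mean of recall and specificity, a common definition that the reviewed papers may not share exactly; the labels and predictions are invented for illustration.

```python
import math

def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, tn, fn) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def evaluate(y_true, y_pred, positive="spam"):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,   # ham wrongly flagged as spam
        "g_mean": math.sqrt(recall * specificity),   # assumed definition
    }

# Invented labels: 3 spam and 5 ham messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In spam filtering a low FPR matters in particular, because a false positive means a legitimate email is withheld from the user.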
| Author | Year | Technique | Dataset / tool | Results | Limitations |
|---|---|---|---|---|---|
| Xiaoxu Liu et al. | 2021 | Enhanced Transformer model | v.1 dataset and UtkMl's Twitter dataset | The results were promising: the model achieved an accuracy of up to 98.92%, a recall of 94.51% and an F1-score of about 96.13%. | The approach offered the lowest accuracy on very large datasets containing many messages or other types of content. |
| Maria Habib et al. | 2018 | Genetic Programming (GP) combined with Synthetic Minority Over-sampling Technique (SMOTE) | CSDMC2010 dataset | The framework was effective for classifying spam emails compared with traditional techniques. | The relative significance of the attributes was not analyzed with this framework. |
| Wuxain Zhang et al. | 2017 | Feature-based method and supervised learning technique | Instagram dataset | The approach classified posts as spam or ham with an accuracy of about 96.27%. | The approach did not take user preferences into account to customize the spam classification algorithms. |
| Zhijie Zhang et al. | 2020 | Improved Incremental Fuzzy-kernel-regularized Extreme Learning Machine (I2FELM) | Matlab2012b | The algorithm was suitable for detecting spam. | The technique was unable to detect spam on Twitter because of inadequate labelled data in the social network. |
| Aso Khaleel Ameen et al. | 2018 | Deep Learning method | Twitter's Streaming API | The technique outperformed existing ones for detecting spam and performed well in terms of precision, recall and F-measure. | The technique was ineffective on very large amounts of data. |
| Girija Chetty et al. | 2019 | Word Embedding technique with Neural Network algorithm | UCI machine learning repository | The mechanism showed potential for detecting spam in different text documents. | The technique was not robust enough to exploit the modelling power of DL when detecting spam. |
| Nattanan Watcharenwong et al. | 2017 | Random Forest | Facebook Graph APIs | The approach achieved an efficacy of up to 98% in detecting spam. | The approach did not perform well on spam posts that contain only images and no text. |
Conclusion
A surge in the number of spammers and spam emails has been observed in recent years, as the
investment required for the spamming business is minimal. This has led to a situation in
which every email is treated as suspicious, requiring substantial investment in defence
mechanisms. The most commonly used mail filtering approaches are Knowledge Engineering (KE)
and Machine Learning (ML). KE-based approaches generate a set of rules that classify messages
as spam or genuine mail. Email spam detection involves several phases, such as feature
extraction and classification. This paper has analyzed various schemes for email spam
detection. The analysis indicates that machine learning algorithms perform better than
content filtering techniques.
References
[1] S. Venkatraman, B. Surendiran and P. Arun Raj Kumar, "Spam e-mail classification for the
Internet of Things environment using semantic similarity approach", 2020, The Journal of
Supercomputing, vol. 76, pp. 756–776
[2] M. Qi and R. Mousoli, "Semantic analysis for spam filtering", 2010 Seventh International
Conference on Fuzzy Systems and Knowledge Discovery, 2010, pp. 2914-2917
[3] Q. Zhang, H. Yang, Z. Yuan and J. Sun, "Studies on the Semantic Body-Based Spam
Filtering", 2010, International Conference of Information Science and Management
Engineering, pp. 233-236
[4] A. Han, H. Kim, I. Ha and G. Jo, "Semantic Analysis of User Behaviors for Detecting Spam
Mail", 2008 IEEE International Workshop on Semantic Computing and Applications, 2008,
pp. 91-95
[5] G. Vijayasekaran, S. Ros, "Spam and Email Detection in Big data Platform using Naives
Bayesian classifier", 2018, International Journal of Computer Science and Mobile
Computing, vol. 7, issue 4, pp. 53-58
[6] Priti Sharma, Uma Bhardwaj, "Machine Learning based Spam E-Mail Detection", 2018,
International Journal of Intelligent Engineering and Systems, vol. 11, no. 3
[7] M. Deepika, Shilpa Rani, "Performance of Machine Learning Techniques for Email Spam
Filtering", 2017, IJRTER
[8] Esha Bansal, Pradeep Kumar Bhatia, "A Survey of Various Machine Learning Algorithms on
Email Spamming", 2017, International Journal of Advances in Electronics and Computer
Science
[9] Swapna Borde, Utkarsh M. Agrawal, Viraj S. Bilay, Nilesh M. Dogra, "Supervised Machine
Learning techniques for Spam Email Detection", 2017, IJSART, vol. 3, issue 3
[10] Deepika Mallampati, Nagaratna P. Hegde, "A Machine Learning Based Email Spam
Classification Framework Model: Related Challenges and Issues", 2020, International
Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 9, issue 4
[11] A. Lakshmanarao, K. Chandra Sekhar, Y. Swath, "An Efficient Spam Classification System
Using Ensemble Machine Learning Algorithm", 2018, Journal of Applied Science and
Computations, vol. 5, issue 9
[12] Apurva Taunk, Srishty Bharti, Sipra Sahoo, "An Ensemble Method for Spam Classification",
2020, International Journal of Scientific & Technology Research, vol. 9, issue 02
[13] Megha Rathi, Vikas Pareek, "Spam Mail Detection through Data Mining – A Comparative
Performance Analysis", 2013, International Journal of Modern Education and Computer
Science, vol. 12, pp. 31-39
[14] Hanif Bhuiyan, Akm Ashiquzzaman, Tamanna Islam Juthi, Suzit Biswas and Jinat Ara, "A
Survey of Existing E-Mail Spam Filtering Methods Considering Machine Learning
Techniques", 2018, Global Journal of Computer Science and Technology, vol. 18, issue 2
[15] Harjot Kaur, Prince Verma, "Survey on E-mail Spam Detection using Supervised approach
with Feature selection", 2017, International Journal of Engineering Sciences & Research
Technology
[32] Jude, A.B., Singh, D., Islam, S. et al. An Artificial Intelligence Based Predictive
Approach for Smart Waste Management. Wireless Pers Commun (2021).
https://doi.org/10.1007/s11277-021-08803-7
[33] Padmaja, M., Shitharth, S., Prasuna, K. et al. Grow of Artificial Intelligence to
Challenge Security in IoT Application. Wireless Pers Commun (2021).
https://doi.org/10.1007/s11277-021-08725-4
[34] S. Shitharth, Pratiksha Meshram, Pravin R. Kshirsagar, Hariprasath Manoharan, Vineet
Tirth, Venkatesa Prabhu Sundramurthy, "Impact of Big Data Analysis on Nanosensors for
Applied Sciences Using Neural Networks", Journal of Nanomaterials, vol. 2021, Article ID
4927607, 9 pages, 2021. https://doi.org/10.1155/2021/4927607
[35] Kshirsgar P., More V., Hendre V., Chippalkatti P., Paliwal K. (2020) IOT Based Baby
Incubator for Clinic. In: Kumar A., Mozar S. (eds) ICCCE 2019. Lecture Notes in Electrical
Engineering, vol 570. Springer, Singapore
[36] Oza S. et al. (2020) IoT: The Future for Quality of Services. In: Kumar A., Mozar S.
(eds) ICCCE 2019. Lecture Notes in Electrical Engineering, vol 570. Springer, Singapore
[37] Kshirsgar P., Pote A., Paliwal K.K., Hendre V., Chippalkatti P., Dhabekar N. (2020) A
Review on IOT Based Health Care Monitoring System. In: Kumar A., Mozar S. (eds) ICCCE
2019. Lecture Notes in Electrical Engineering, vol 570. Springer, Singapore