46 - Ijme... Mech Engg..Research Paper-1
56452/7-12-46
Dr. B. K. Sharma
Professor, Mandsaur University, Mandsaur, e-mail: bksharma7426@gmail.com
Spam :
Bulk mails that are unnecessary and undesirable can be classified as spam mails. These spam emails can harm a user's system by filling up inboxes and degrading the speed of the internet connection.

Spam Detection :
Many spam detection techniques are in use nowadays. These methods use filters that can prevent emails from causing any harm to the user. The contributions of these techniques and their weaknesses have been identified.

Several methods are available for identifying spam, for example checking the location of the sender, the message contents, the IP address, or domain names [26]. Spammers use refined variations to avoid spam identification. Measures used for spam identification include blacklists and whitelists, machine learning approaches, Naïve Bayes, Support Vector Machines, and Neural Network classification [27].

A mobile system was proposed by Mahmoud et al. [28] with the aim of identifying and blocking spam SMS. In their work, they attempted to protect smartphones by filtering SMS spam that contains abbreviations and idioms. The system was based on the Artificial Immune System (AIS) and the Naïve Bayesian (NB) algorithm. Using the Naïve Bayes algorithm, messages are classified based on their features. The study used an SMS dataset with 1324 messages. The system achieved a detection rate of 82%, a 6% positive rate, and 91% accuracy.

Table 1 : Spam Categories

Categories              Descriptions
Health                  The spam of fake medications
Promotional products    The spam of fake fashion items like clothes, bags and watches
Adult content           The spam of adult content of pornography and prostitution
Finance & marketing     The spam of stock kiting, tax solutions, and loan packages
Phishing                The spam of phishing or fraud

An approach using the random forest algorithm was proposed by Akinyelu and Adewumi [1] to identify phishing or spam emails. It used 200 emails. The main aim of the research was to reduce the number of features and increase efficiency and accuracy. The proposed algorithm achieved an accuracy of up to 99.7% with only 0.06% false positives. The research only covered the classification aspect, without considering vital information that can affect the results, especially when the email contains limited text.

Yüksel et al. [3] aimed to resolve the problem of spam by inhibiting spam emails from being spread within email systems. To achieve this, they proposed a cloud-based system that identifies spam emails using analytics and machine learning algorithms such as support vector machines and decision trees. The test results show that the SVM achieves a higher accuracy of up to 97.6% with a false-positive rate of 2.33%. The decision tree attains a lower accuracy of 82.6% with a false-positive rate of 17.3%. The results reveal that the increase in spam emails is affected by the number of received emails. Lee et al. [28] proposed an optimal technique for spam detection.

2.1. EXISTING SYSTEMS

Due to the increase in the number of email users, the amount of spam email has also risen in recent years. It has become even more challenging to handle a wide range of emails for data mining and machine learning. Therefore, many researchers have carried out comparative studies of how well various classification algorithms classify emails accurately, using a number of performance metrics. Hence, it is important to find an algorithm that gives the best possible outcome on a given metric for the correct classification of emails as spam or ham.
The present systems of spam detection are reliant on three major methods:

A. Linguistic Based Methods
Unlike humans, who can grasp linguistic constructs along with their exposition, machines cannot, and hence it is necessary to teach machines some languages to help them understand these constructs. This is the technique that is used in
Copyrights @Kalahari Journals Vol.7 No.12 (December, 2022)
International Journal of Mechanical Engineering
491
DOI : https://doi.org/10.56452/7-12-46
places like search engines in order to suggest the next terms to the user while they are typing their search. Sentences are divided into unigrams (words taken one at a time) and bigrams (words taken two at a time). Since this technique requires that every expression be remembered, this method is not feasible and is also time-intensive. [29]

3. Heuristic or Rule-Based Spam Filtering Technique

Algorithms use pre-defined rules in the form of regular expressions to give a score to the messages present in the e-mails. Based on the scores generated, they segregate e-mails into spam and non-spam categories.
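The rule-based scoring described above can be illustrated with a minimal sketch. The rules, weights, and threshold below are hypothetical values chosen only for illustration; they are not taken from the paper or from any particular filter.

```python
import re

# Hypothetical rules: each pairs a regular expression with a score.
# Neither the patterns nor the weights come from the paper.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 2.5),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
    (re.compile(r"\$\d+"), 1.0),
]
THRESHOLD = 3.0  # assumed cut-off between spam and non-spam


def spam_score(message):
    """Add up the scores of every rule whose pattern occurs in the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def classify(message):
    """Label a message as spam when its total score reaches the threshold."""
    return "spam" if spam_score(message) >= THRESHOLD else "non-spam"


print(classify("You are a WINNER! Click here to claim $500 free"))  # spam
print(classify("Meeting moved to 3 pm, see agenda attached"))       # non-spam
```

In a filter of this kind the threshold trades false positives against missed spam, so in practice it would be tuned on labelled mail rather than fixed by hand.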
Many techniques are available to detect spam e-mails. Broadly classified, there are 5 different techniques based on which algorithms decide whether a mail is spam or not.

Training Testing Phase
Visualization
Wordcloud
Target Count For Train Data

removing stop words such as a, about, above, down,
Word stemming: Stemming algorithms work by removing the end or the beginning of the words, using a list of common prefixes and suffixes that can be found in that language. Examples of Word Stemming for English words are as below:

The lemmatization has converted studies -> study, breaks -> break

clean_text = word_lemmatizer(dirty_text.split(" "))
clean_text
#Output
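The stop-word removal and stemming steps discussed above can be sketched in plain Python. This is a toy illustration only: the stop-word set, the suffix list, and the helper names `remove_stop_words` and `naive_stem` are assumptions made for this example and are not the functions used in the paper. A real pipeline would more likely rely on a library stemmer or lemmatizer.

```python
# Hypothetical stop-word set and suffix list, chosen only for illustration.
STOP_WORDS = {"a", "about", "above", "down", "the", "is", "to"}
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # longer suffixes are tried first


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = remove_stop_words("the price is about to go down".split(" "))
print(tokens)                                             # ['price', 'go']
print([naive_stem(w) for w in ["studies", "breaks", "walking"]])
# ['stud', 'break', 'walk']
```

The stem "stud" for "studies" shows the difference from lemmatization, which maps the same word to the dictionary form "study".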
Implementing these two algorithms to deal with different edge cases.

Feature Extraction
2. TfidfVectorizer
3. Word Embedding

CountVectorizer
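To make the count-based and TF-IDF features named above concrete, the following is a small pure-Python sketch of what such vectorizers compute. It is not the scikit-learn implementation: the whitespace tokenizer and the function names `count_vectors` and `tfidf_vectors` are assumptions for illustration. The smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization is intended to mirror scikit-learn's documented defaults.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a simplification)."""
    return text.lower().split()


def count_vectors(docs):
    """Return the sorted vocabulary and one term-count row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def tfidf_vectors(docs):
    """Weight counts by smoothed idf, then L2-normalize each row."""
    vocab, counts = count_vectors(docs)
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return vocab, rows


docs = ["free prize free", "meeting at noon", "free meeting"]
vocab, counts = count_vectors(docs)
print(vocab)      # ['at', 'free', 'meeting', 'noon', 'prize']
print(counts[0])  # [0, 2, 0, 0, 1]
_, tfidf = tfidf_vectors(docs)
print([round(x, 3) for x in tfidf[1]])
```

Terms that occur in fewer documents, such as "at" in the second message, receive a larger weight than terms shared across messages, such as "meeting".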
Precision & Recall

Precision and recall are common evaluation metrics used when evaluating a class-imbalanced classification model. Precision measures, when a model predicts something as positive, how often that prediction is correct. Recall, on the other hand, measures how well a model finds all the positive samples.

The mathematical equations for precision and recall are, respectively:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where TP, FP and FN are the numbers of true positives, false positives and false negatives.

The recall of this model is rather low; it might not be doing a good enough job of discovering the spam emails.

Summary

I have shown you all the necessary steps needed in designing a spam detection algorithm. Just a brief recap:

Explore and understand your data
Visualize the data at hand to gain a better intuition: Wordcloud, N-gram Bar Chart
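As a minimal sketch of the counting behind an N-gram bar chart, the snippet below tallies unigrams and bigrams over a few messages. The sample messages and the helper names `ngrams` and `top_ngrams` are hypothetical; plotting is left out, since the counts are what the chart would display.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(messages, n, k):
    """Count n-grams across all messages and return the k most frequent."""
    counts = Counter()
    for message in messages:
        counts.update(ngrams(message.lower().split(), n))
    return counts.most_common(k)


# Hypothetical sample messages, for illustration only.
messages = ["win a free prize now", "claim your free prize", "free prize inside"]
print(top_ngrams(messages, 1, 2))  # [(('free',), 3), (('prize',), 3)]
print(top_ngrams(messages, 2, 1))  # [(('free', 'prize'), 3)]
```

Feeding the returned pairs to any bar-chart routine gives the kind of visualization the recap refers to.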
[5] Deepika Mallampati, Nagaratna P. Hegde, "A Machine Learning Based Email Spam Classification Framework Model", IJITEE, ISSN: 2278-3075, Vol. 9, Issue 4, February 2020.

[6] Javatpoint, "Machine Learning Tutorial", 2017. https://www.javatpoint.com/machine-learning

[7] SpamAssassin, "Spam and Ham Dataset", Kaggle, 2018. https://www.kaggle.com/veleon/ham-and-spam-dataset

[8] Apache, "open-source Apache SpamAssassin Dataset", 2019. https://spamassassin.apache.org/old/publiccorpus/

[9] SpamAssassin, "Spam Classification Kernel", 2018. https://www.kaggle.com/veleon/spam-classification

[10] SpamAssassin, "REVISION HISTORY OF THIS CORPUS", 2016. https://spamassassin.apache.org/old/publiccorpus/readme.html

[15] Jason Brownlee, "How to Encode Text Data for Machine Learning with scikit-learn", Machine Learning Mastery, September 29, 2017. https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

[16] I. Androutsopoulos, J. Koutsias, K. Chandrinos and C. D. Spyropoulos, "An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal email messages," Computation and Language, pp. 160-167, 2000.

[17] G. V. Cormack, "Email Spam Filtering: A Systematic Review," Foundations and Trends® in Information Retrieval, vol. 1, no. 4, pp. 335-455, 2006.

[18] M. Siponen and C. Stucke, "Effective Anti-Spam Strategies in Companies: An International Study," Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS'06), 2006.

[19] T. S. Guzella and W. M. Caminhas, "A review of machine learning approaches to Spam filtering," Expert Syst. Appl., 2009.
[20] Jianying Zhou, Wee-Yung Chin, Rodrigo Roman, and Javier Lopez, (2007) "An Effective MultiLayered Defense Framework against Spam", Information Security Technical Report 01/2007.

[21] Xiao Mang Li, Ung Mo Kim, (2012) "A hierarchical framework for content-based image spam filtering", 8th International Conference on Information Science and Digital Content Technology (ICIDT), Jeju, June, pp. 149-155.

[22] Linda Huang, Julia Jia, Emma Ingram, Wuxu Peng, "Enhancing the Naive Bayes Spam Filter through Intelligent Text Modification Detection", 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.

M. S. (2018, January). "A framework for real-time spam detection in Twitter." In Communication Systems & Networks (COMSNETS), 2018 10th International Conference on (pp. 380-383).

[29] T. M. Mahmoud and A. M. Mahfouz, (2012) "SMS spam filtering technique based on artificial immune system", International Journal of Computer Science Issues (IJCSI), 9(2), 589.

[30] Mazin Abed Mohammed, Salama A. Mostafa, Omar Ibrahim Obaid, "An Anti-Spam Detection Model for Emails of Multi-Natural Language"