
Phishing Detection Using Neural Network

Ningxia Zhang, Yongqing Yuan


Department of Computer Science, Department of Statistics, Stanford University

Abstract

The goal of this project is to apply multilayer feedforward neural networks to phishing email detection and evaluate the effectiveness of
this approach. We design the feature set, process the phishing dataset, and implement the neural network (NN) systems. We then use
cross validation to evaluate the performance of NNs with different numbers of hidden units and activation functions. We also compare
the performance of NNs with other major machine learning algorithms. From the statistical analysis, we conclude that NNs with an
appropriate number of hidden units can achieve satisfactory accuracy even when the training examples are scarce. Moreover, our feature
selection is effective in capturing the characteristics of phishing emails, as most machine learning algorithms can yield reasonable results
with it.

1 Introduction

Recently, a phishing email has been circulating in the Stanford community, aiming to collect SUNetIDs and passwords. As the majority of phishing emails are formatted to appear to come from a legitimate source, a large percentage of email users are unable to recognize phishing attacks. Moreover, traditional spam filters are inclined to fail to identify phishing emails, since most phishing attacks use more sophisticated techniques and tend to be directed at a more targeted audience. With the increasing severity of this issue, many efforts have been devoted to applying machine learning methods to phishing detection.

One of the most common machine learning techniques for phishing classification is to use a list of key features to represent an email and apply a learning algorithm to classify the email as phishing or ham based on the selected features. Chandrasekaran et al. [4] proposed a novel technique to classify phishing emails based on distinct structural characteristics, such as the structure of the email subject line and some functional words. They used an SVM to test their features on 400 emails and obtained a 95% accuracy rate. However, they did not perform different splits between training and test data due to the small sample size. Fette et al. [6] used ten different features specific to the deceptive methods of phishing and obtained an F1-measure of more than 90% using a support vector machine classifier. However, they used significantly more ham emails (7000) than phishing emails (860) in their simulation.

In this project, we use approximately 8762 emails, of which 4560 are phishing and the rest are ham. We notice that few studies have been done on applications of neural networks (NNs) to phishing email filtering. Although NNs normally require considerable time for parameter training, they usually yield more accurate results than other classifiers [5]. In our project, we try to detect phishing attacks with a feedforward neural network, incorporating some basic features pertaining to the email structure and external links.

2 Methods

2.1 Features

After referring to the available literature, we have selected and defined a set of features that capture the characteristics of phishing emails [1, 3, 6].

2.1.1 Structural Features

1. Total number of body parts
According to the MIME standard, the "Content-Type" attribute of an email can be multipart, meaning that the email has multiple body parts. Phishers are likely to exploit this fact to construct phishing emails with sophisticated structures. By counting the number of boundary variables, we obtain the number of body parts in a multipart email. If the "Content-Type" of the email is not multipart, this feature is set to 0, to differentiate it from multipart emails with only one body part. If one part can be further divided into multiple parts, the number of sub-parts is added to the number of parts of the entire email. For example, if an email has 2 body parts, one of which has 2 sub-parts, the number of body parts is set to 4. However, only 3 parts of the content are scanned in the feature extraction process.

2. Total number of alternative parts
The multipart/alternative subtype indicates that each part is an "alternative" version of the same or similar content, each in a different format denoted by its "Content-Type" header [7]. As it is not strictly enforced that each part of the message be the same or similar, phishers often take advantage of this fact to create fraudulent emails.
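The two structural features above can be sketched in a few lines. The authors' actual extractor was written in Perl with MIME::Entity and MIME::Parser; the illustrative version below uses Python's standard email package instead, and the function names are ours, not the paper's. It implements the counting rule stated above: a non-multipart email scores 0, and sub-parts of a nested multipart are added to the part count of the enclosing email.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def count_body_parts(msg):
    """Total number of body parts: 0 for non-multipart emails;
    for multipart emails, each part counts 1 plus its own sub-parts."""
    if not msg.is_multipart():
        return 0
    total = 0
    for part in msg.get_payload():
        total += 1                       # the part itself
        total += count_body_parts(part)  # plus any nested sub-parts
    return total

def count_alternative_parts(msg):
    """Total number of parts whose container is multipart/alternative."""
    if not msg.is_multipart():
        return 0
    count = len(msg.get_payload()) if msg.get_content_type() == "multipart/alternative" else 0
    for part in msg.get_payload():
        count += count_alternative_parts(part)
    return count

# The paper's example: 2 body parts, one of which has 2 sub-parts -> 4.
outer = MIMEMultipart()  # multipart/mixed
outer.attach(MIMEText("see the notice below", "plain"))
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("hello", "plain"))
inner.attach(MIMEText("<b>hello</b>", "html"))
outer.attach(inner)

print(count_body_parts(outer))         # 4
print(count_alternative_parts(outer))  # 2
```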
2.1.2 Link Features

1. Total number of links
Phishing emails usually contain multiple links to fake websites for readers to sign in to.

2. Number of IP-based links
A legitimate website usually has a domain name for identification, while phishers typically use multiple zombie systems to host phishing sites. Besides, the use of an IP address makes it difficult for readers to know exactly which site they are being directed to when they click on the link. Therefore, the presence of IP-based links can be a good indicator of phishing emails.

3. Number of deceptive links
Deceptive links are those whose visible URLs differ from the URLs to which they actually point. Some phishers use this technique to fool email readers into clicking on the links.

4. Number of links behind an image
In order to make the emails look authentic, phishers often place in the emails images or banners linking to a legitimate website. Thus, if URL-based images appear in an email, it is likely to be a phishing email.

5. Maximum number of dots in a link
Using sub-domains is another technique phishers often exploit to make links appear legitimate, resulting in an inordinately large number of dots in the URL [3].

6. A Boolean indicator of whether there is a link that contains one of the following words: click, here, login, update
To acquire usernames, passwords, or credit card information from the readers, phishing emails often invite readers to log in to the fake websites for reasons such as updating personal information. Therefore, those words appearing in the link text would be a good indicator.

2.1.3 Element Features

1. A Boolean indicator of whether it is in HTML format
Phishing emails are mostly in HTML format, as plain text does not provide the opportunity to play the tricks of phishing.

2. A Boolean indicator of whether it contains JavaScript
JavaScript enables phishers to perform many actions behind the scenes, such as creating popup windows and changing the status bar of a web browser [6]. If the email contains the strings "javascript" or "onclick", this feature is set to one.

3. A Boolean indicator of whether it contains a <Form> tag
HTML forms are one of the techniques used to gather information from readers [3].

2.1.4 Word List Features

1. Boolean indicators of whether the words or stems listed below appear in the email body: account, update, confirm, verify, secur, notif, log, click, inconvenien
In typical phishing email examples, these words frequently appear as phishers fabricate stories luring readers to enter their personal information.

2.2 Neural Networks

An artificial neural network, or neural network, is a mathematical model inspired by biological neural networks. In most cases it is an adaptive system that changes its structure during learning [10]. There are many different types of NNs. For the purpose of phishing detection, which is basically a classification problem, we choose a multilayer feedforward NN. In a feedforward NN, the connections between neurons do not form a directed cycle. Contrasted with recurrent NNs, which are often used for pattern recognition, feedforward NNs are better at modeling relationships between inputs and outputs. In our experiments, we use the most common structure of multilayer feedforward NN, which consists of one input layer, one hidden layer and one output layer. The numbers of computational units in the input and output layers correspond to the numbers of inputs and outputs. Different numbers of units in the hidden layer are attempted in the following experiments. To fit our dataset, the hyperbolic tangent and sigmoid functions are used as activation functions; a comparison of the two is also conducted. With regard to the training method, we choose resilient propagation training (RPROP), as it is usually the most efficient training algorithm for supervised feedforward NNs [9].

2.3 Other Machine Learning Techniques

To further evaluate the performance of NNs in phishing detection, we compare their performance against that of other major machine learning classifiers: decision tree (DT), K-nearest neighbors, naive Bayes (NB), support vector machine (SVM) and unsupervised K-means clustering. The same dataset and feature set are used in the comparison.

2.4 Cross Validation

Given a training dataset and a proposed classifier, we assess the performance of the classifier by using hold-out cross validation, also known as simple cross validation [8]. The dataset is randomly divided into S_train and S_cv. The proposed classifier is trained on S_train to get parameter estimates and tested on S_cv. We then obtain the output, which indicates whether each email in S_cv is ham or phishing. This procedure is repeated 20 times for different sizes of S_train and S_cv. The proportions of the dataset used as S_train are as follows: 0.1%, 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90%.

2.5 Evaluation Metrics

By comparing the classification predictions with the actual categories of the emails, we are able to compute the numbers of true negatives (TN, correctly classified ham emails), false negatives (FN, phishing emails mistakenly classified as ham), true positives (TP, correctly classified phishing emails) and false positives (FP, ham emails mistakenly classified as phishing). To evaluate classifier performance, we compute the accuracy (Accu) and the weighted accuracy (Wacc) by the following formulas:

    Accu = (TN + TP) / (TN + FP + TP + FN)                          (1)

    Wacc(λ) = (λ·TN + TP) / (λ·(TN + FP) + FN + TP)                 (2)

In phishing email filtering, errors are not of equal importance. A false positive is much more costly than a false negative in the real world [1]. It is thus desirable to have a classifier with a low false positive rate. The "weighted accuracy" measure was proposed by Androutsopoulos et al. [2] to address this issue. Different values of λ can be applied to formula (2). Notice that when λ is one, FP and FN are weighed equally. In our simulations, we pick λ = 9 so that FP are penalized nine times more than FN. In addition, we compute the precision, recall and F1-score of each classifier as follows:

    Precision = TP / (TP + FP)        Recall = TP / (TP + FN)       (3)

    F1 = 2 · Precision · Recall / (Precision + Recall)              (4)

3 Dataset

The dataset comprises a large number of real-world examples of ham and phishing emails, all in standard MIME format. There are a total of 4202 ham emails and 4560 phishing emails, separated into 7 folders, 3 of which hold ham emails and 4 of which hold phishing emails. Each text file contains a single MIME email.

4 Implementation and Experiments

4.1 Preprocessing

4.1.1 Feature Extraction

We write a Perl script to extract features from one email example. It reads in the email file and does structural analysis with the help of the MIME::Entity and MIME::Parser modules. It summarises link features using the HTML::SimpleLinkExtor and HTML::LinkExtractor modules. Other features are obtained by taking advantage of Perl's powerful regular expression manipulation. Ultimately, the script outputs a feature vector together with the ideal value (1 for phishing and 0 for ham). To process the entire dataset, another Perl script is written to call the feature extraction script and write the obtained feature vectors line by line into one text file.

4.1.2 Normalization

In order to ensure that each feature has an equal impact in the classification process, the vectors should be normalized before applying machine learning algorithms. For each feature, we find the maximum and minimum values, and for each value of this feature, we compute:

    normalized_value = (current_value − minimum) / (maximum − minimum)

After normalization, the values of all features fall into the range of 0 to 1 and each feature contributes equally in determining the classification output. The normalized vectors of the whole dataset are stored in another text file.

4.1.3 Training and Test Set Preparation

To conduct the cross validations described above, we divide the dataset into training and test sets with different proportions. For each proportion, we generate 20 different training and test sets. This is done in Matlab.

4.2 Machine Learning Implementation

The multilayer feedforward NN is implemented in Java with the Encog Java Core package, which provides a powerful framework to conveniently construct NNs and perform training and testing. When implementing the other machine learning algorithms, we use the corresponding off-the-shelf Matlab packages.

4.3 Data Analysis

Once we obtain the classification predictions, we compute TN, FN, TP, FP, Accu, Wacc, Precision, Recall and F1-score as described in the methods section. We compare different neural networks by varying the number of units in the hidden layer as well as the activation function. We also compare the performance of neural networks with that of other machine learning techniques.

5 Results

As mentioned in the previous section, to evaluate each neural network classifier, we calculate the average Accu and Wacc (λ = 9) over the 20 cross validation runs for each training size. As shown in Figure 1 and Figure 2, when the training size is small, more hidden units tend to overfit the data while fewer hidden units tend to underfit. However, when the training set is large enough, the number of hidden units does not greatly affect performance.
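The preprocessing and scoring steps behind these results reduce to a few lines of code. Below is a minimal Python sketch of the min-max normalization (Section 4.1.2) and of formulas (1)-(4) with λ = 9 as in our simulations; the function names and the confusion counts in the example are ours, chosen for illustration only.

```python
def normalize(values):
    """Min-max normalization (Section 4.1.2): map one feature's values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def accuracy(tn, fp, fn, tp):
    """Formula (1): plain accuracy over all four confusion counts."""
    return (tn + tp) / (tn + fp + tp + fn)

def weighted_accuracy(tn, fp, fn, tp, lam=9):
    """Formula (2): false positives weigh lam times more than false
    negatives; lam = 1 recovers plain accuracy."""
    return (lam * tn + tp) / (lam * (tn + fp) + fn + tp)

def precision_recall_f1(tn, fp, fn, tp):
    """Formulas (3) and (4)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical confusion counts, for illustration only: a classifier
# with few false positives is rewarded by the weighted measure.
tn, fp, fn, tp = 95, 5, 10, 90
print(accuracy(tn, fp, fn, tp))           # 0.925
print(weighted_accuracy(tn, fp, fn, tp))  # 0.945
```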
Figure 1: Each curve shows the average Accu for an NN classifier with a specific hidden layer size.

Figure 2: Each curve shows the average Wacc for an NN classifier with a specific hidden layer size.

To further demonstrate the overfitting of the dataset with a small training size, we examine the Accu and Wacc for the 0.1% training set in Figure 3 and Figure 4. We observe that the two curves both peak at 8 hidden units and start to decline as more hidden units are used. It is also worth noting that Wacc generally drops after penalizing FP more than FN.

Figure 3: Accu for NN with 0.1% training size

Figure 4: Wacc for NN with 0.1% training size

We compare the NN performance using two activation functions: the hyperbolic tangent (HT) function and the sigmoid function. The results are shown in Figure 5 and Figure 6. It is noticeable that the sigmoid function performs slightly better than the hyperbolic tangent function.

Figure 5: Accu of two NN (8 hidden units) activation functions

Figure 6: Wacc of two NN (8 hidden units) activation functions

We also compare the NN performance with other machine learning techniques. The results are shown in Figure 7 and Figure 8. The decision tree has the best overall performance, although it falls short on small training sets compared to NN and K-nearest neighbors. Generally, most algorithms can reach an accuracy of 95%, which suggests that the selected feature set has captured the essential characteristics of phishing emails. When we perform unsupervised 2-means clustering on the entire dataset, we are able to achieve 87% accuracy, which further supports the validity of our feature set.
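The averages reported above come from 20 independent hold-out splits per training-set proportion (Sections 2.4 and 4.1.3). Our experiments generated these splits in Matlab; a Python sketch of the same protocol is shown below, with an illustrative function name and a fixed seed of our own choosing for reproducibility.

```python
import random

# Training-set proportions from Section 2.4.
PROPORTIONS = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def holdout_splits(n_examples, train_fraction, repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated hold-out cross
    validation: each repeat is an independent random split."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    n_train = max(1, int(round(n_examples * train_fraction)))
    for _ in range(repeats):
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]

# 10% of the 8762-email dataset as S_train, repeated 20 times.
splits = list(holdout_splits(8762, 0.1))
print(len(splits))         # 20
print(len(splits[0][0]))   # 876 training examples per split
```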
Figure 7: Accu of NN (8 hidden units) and other machine learning techniques

Figure 8: Wacc of NN (8 hidden units) and other machine learning techniques

Table 1: Evaluations of NNs with two activation functions

    Activation   Accu     Wacc     Precision   Recall   F1
    HT           0.9551   0.9494   0.9525      0.9618   0.9571
    sigmoid      0.9516   0.9417   0.9450      0.9630   0.9539

Table 2: Evaluations of NNs and other machine learning techniques

    Method       Accu     Wacc     Precision   Recall   F1
    DT           0.9658   0.9742   0.9778      0.9561   0.9668
    SVM1         0.9218   0.8929   0.9022      0.9555   0.9275
    SVM2         0.9579   0.9654   0.9693      0.9491   0.9591
    NB           0.9278   0.9370   0.9460      0.9173   0.9367
    K-nearest    0.9558   0.9536   0.9585      0.9583   0.9579
    NN           0.9551   0.9494   0.9525      0.9618   0.9571

Table 1 summarizes the performance measures for NNs with the two activation functions in detail. As seen in the table, the HT function performs slightly better in terms of all measures except recall. Notice that the largest difference among all the measures comes from Wacc, which suggests that the HT function is better at avoiding misclassifying ham emails as phishing.

Table 2 summarizes the performance measures for NNs and the other machine learning techniques. As shown in the table, DT gives the best overall performance. NNs give the highest recall while still maintaining a >95% precision, suggesting that NNs are excellent at detecting phishing emails while misclassifying only a small portion of ham emails.

References

[1] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Group eCrime Researchers Summit, pages 649–656, 2007.

[2] Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras, and Constantine D. Spyropoulos. An evaluation of naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, 2002.

[3] Ram Basnet, Srinivas Mukkamala, and Andrew H. Sung. Detection of phishing attacks: A machine learning approach. Studies in Fuzziness and Soft Computing, 226:373–383, 2008.

[4] Madhusudhanan Chandrasekaran, Krishnan Narayanan, and Shambhu Upadhyaya. Phishing e-mail detection based on structural properties. In Proceedings of the NYS Cyber Security Conference, 2006.

[5] James Clark, Irena Koprinska, and Josiah Poon. A neural network based approach to automated e-mail classification. In Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI), pages 702–705, 2003.

[6] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), 2007.

[7] Network Working Group. Multipurpose Internet Mail Extensions (MIME) part two: Media types. http://tools.ietf.org/html/rfc2046#section-5.1.4, 1996.

[8] Andrew Ng. CS229 lecture notes. http://cs229.stanford.edu/notes/cs229-notes5.pdf, 2012.

[9] Martin Riedmiller and Heinrich Braun. A direct adaptive method for fast backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, volume 5, pages 586–591, 1993.

[10] Wikipedia. Artificial neural network — Wikipedia, the free encyclopedia.
