Phishing Detection Using Neural Network: Ningxia Zhang, Yongqing Yuan
Abstract
The goal of this project is to apply multilayer feedforward neural networks to phishing email detection and evaluate the effectiveness of
this approach. We design the feature set, process the phishing dataset, and implement the neural network (NN) systems. We then use
cross validation to evaluate the performance of NNs with different numbers of hidden units and activation functions. We also compare
the performance of NNs with other major machine learning algorithms. From the statistical analysis, we conclude that NNs with an
appropriate number of hidden units can achieve satisfactory accuracy even when the training examples are scarce. Moreover, our feature
selection is effective in capturing the characteristics of phishing emails, as most machine learning algorithms can yield reasonable results
with it.
2.1.2 Link Features

1. Total number of links
Phishing emails usually contain multiple links to fake websites for readers to sign in.

2. Number of IP-based links
A legitimate website usually has a domain name for identification, while phishers typically use multiple zombie systems to host phishing sites. Besides, the use of an IP address makes it difficult for readers to know exactly which site they are being directed to when they click on the link. Therefore, the presence of IP-based links can be a good indicator of phishing emails.

3. Number of deceptive links
Deceptive links are ones whose visible URLs differ from the URLs to which they actually point. Some phishers use this technique to fool email readers into clicking on the links.

4. Number of links behind an image
In order to make the emails look authentic, phishers often place in the emails images or banners linking to a legitimate website. Thus, if URL-based images appear in an email, it is likely to be a phishing email.

5. Maximum number of dots in a link
Using sub-domains is another technique phishers often exploit to make links appear legitimate, resulting in an inordinately large number of dots in the URL [3].

6. A Boolean indicator of whether there is a link that contains one of the following words: click, here, login, update
To realize the goal of acquiring usernames, passwords, or credit card information from the readers, phishing emails often invite readers to log in to the fake websites for reasons such as updating personal information. Therefore, those words appearing in the link text would be a good indicator.

2.1.3 Element Features

1. A Boolean indicator of whether it is in HTML format
Phishing emails are mostly in HTML format, as plain text does not provide the opportunity to play the tricks of phishing.

2. A Boolean indicator of whether it contains JavaScript
JavaScript enables phishers to perform many actions behind the scene, such as creating popup windows and changing the status bar of a web browser [6]. If the email contains the strings "javascript" or "onclick", this feature is set to one.

3. A Boolean indicator of whether it contains a <Form> tag
HTML forms are one of the techniques used to gather information from readers [3].

2.1.4 Word List Features

... appear as phishers fabricate stories luring readers to enter their personal information.

2.2 Neural Networks

An artificial neural network, or neural network, is a mathematical model inspired by biological neural networks. In most cases it is an adaptive system that changes its structure during learning [10]. There are many different types of NNs. For the purpose of phishing detection, which is basically a classification problem, we choose the multilayer feedforward NN. In a feedforward NN, the connections between neurons do not form a directed cycle. Contrasted with recurrent NNs, which are often used for pattern recognition, feedforward NNs are better at modeling relationships between inputs and outputs. In our experiments, we use the most common structure of multilayer feedforward NN, which consists of one input layer, one hidden layer and one output layer. The numbers of computational units in the input and output layers correspond to the numbers of inputs and outputs. Different numbers of units in the hidden layer are attempted in the following experiments. To fit our dataset, hyperbolic tangent and sigmoid are used as activation functions; a comparison of the two is also conducted. With regard to the training method, we choose resilient propagation training (RPROP), as it is usually the most efficient training algorithm for supervised feedforward NNs [9].

2.3 Other Machine Learning Techniques

To further evaluate the performance of NNs in phishing detection, we compare their performance against that of other major machine learning classifiers: decision tree (DT), K-nearest neighbors, naive Bayes (NB), support vector machine (SVM) and unsupervised K-means clustering. The same dataset and feature set are used in the comparison.

2.4 Cross Validation

Given a training dataset and a proposed classifier, we assess the performance of the classifier by using hold-out cross validation, also known as simple cross validation [8]. The dataset is randomly divided into S_train and S_cv. The proposed classifier is trained on S_train to get parameter estimates and tested on S_cv. We then obtain the output, which indicates whether each email in S_cv is ham or phishing. This procedure is repeated 20 times for each of several sizes of S_train and S_cv. The proportions of the dataset used as S_train are: 0.1%, 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90%.
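The repeated hold-out procedure can be sketched as follows for a single random split at one training proportion. The class and method names here are illustrative, not taken from the paper's implementation (the actual splits were generated in Matlab):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of one hold-out split: shuffle the example indices and divide them
// into S_train and S_cv according to the requested training proportion.
// Repeating this 20 times per proportion reproduces the protocol above.
public class HoldOut {
    public static List<List<Integer>> split(int nExamples, double trainFraction, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < nExamples; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // random division of the dataset
        int nTrain = (int) Math.round(nExamples * trainFraction);
        List<Integer> train = new ArrayList<>(idx.subList(0, nTrain));
        List<Integer> cv = new ArrayList<>(idx.subList(nTrain, nExamples));
        return List.of(train, cv);
    }
}
```

With the 8762 emails of the dataset and a 10% training proportion, for example, this yields 876 training and 7886 hold-out examples per repetition.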
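As an illustration of how features such as those in Sections 2.1.2 and 2.1.3 can be computed, the following sketch extracts three of the link features with regular expressions. All names and patterns are hypothetical simplifications, not the paper's actual extraction code:

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative extraction of three link features from an email's links:
// the hrefs (target URLs) and the visible anchor texts are assumed to have
// been pulled out of the MIME body beforehand.
public class LinkFeatures {
    private static final Pattern IP_URL =
        Pattern.compile("https?://\\d{1,3}(?:\\.\\d{1,3}){3}");
    private static final Pattern KEYWORD =
        Pattern.compile("\\b(click|here|login|update)\\b", Pattern.CASE_INSENSITIVE);

    // Feature 2: number of links whose URL uses a raw IP address.
    public static int countIpLinks(List<String> hrefs) {
        int n = 0;
        for (String h : hrefs) if (IP_URL.matcher(h).lookingAt()) n++;
        return n;
    }

    // Feature 5: maximum number of dots appearing in any link.
    public static int maxDots(List<String> hrefs) {
        int max = 0;
        for (String h : hrefs) {
            int dots = h.length() - h.replace(".", "").length();
            max = Math.max(max, dots);
        }
        return max;
    }

    // Feature 6: whether any anchor text contains click/here/login/update.
    public static boolean hasKeywordLink(List<String> anchorTexts) {
        for (String t : anchorTexts) if (KEYWORD.matcher(t).find()) return true;
        return false;
    }
}
```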
We compute the numbers of true negatives (TN, correctly classified ham email), false negatives (FN, phishing email mistakenly classified as ham), true positives (TP, correctly classified phishing email) and false positives (FP, ham email mistakenly classified as phishing). To evaluate the classifier performance, we compute the accuracy (Accu) and the weighted accuracy (Wacc) by the following formulas:

Accu = (TN + TP) / (TN + FP + TP + FN)    (1)

Wacc(λ) = (λ·TN + TP) / (λ·(TN + FP) + FN + TP)    (2)

In phishing email filtering, errors are not of equal importance. A false positive is much more costly than a false negative in the real world [1]. It is thus desirable to have a classifier with a low false positive rate. The "weighted accuracy" measure was proposed by Androutsopoulos et al. [2] to address this issue. Different values of λ can be applied in formula (2); notice that when λ is one, FP and FN are weighed equally and Wacc reduces to Accu. In our simulations, we pick λ = 9 so that FP are penalized nine times more than FN. In addition, we compute the precision, recall and F1-score of each classifier as follows:

Precision = TP / (TP + FP)    Recall = TP / (TP + FN)    (3)

F1 = 2 · Precision · Recall / (Precision + Recall)    (4)

3 Dataset

The dataset comprises a large number of real-world examples of ham and phishing emails, all in standard MIME format. There are a total of 4202 ham emails and 4560 phishing emails, separated into 7 folders, 3 of which hold ham emails and 4 of which hold phishing emails. Each text file contains a single MIME email.

4 Implementation and Experiments

4.1 Preprocessing

4.1.1 Feature Extraction

4.1.2 Normalization

In order to ensure that each feature has an equal impact in the classification process, the feature vectors should be normalized before applying machine learning algorithms. For each feature, we find the maximum and minimum values, and for each value of this feature, we compute:

normalized_value = (current_value − minimum) / (maximum − minimum)

After normalization, the values of all features fall into the range of 0 to 1 and each feature contributes equally in determining the classification output. The normalized vectors of the whole dataset are stored in another text file.

4.1.3 Training and Test Sets Preparation

To conduct the cross validations described above, we divide the dataset into training and test sets with different proportions. For each proportion, we generate 20 different training and test sets. This is done in Matlab.

4.2 Machine Learning Implementation

The multilayer feedforward NN is implemented in Java with the Encog Java Core package, which provides a powerful framework to conveniently construct NNs and perform training and testing. When implementing the other machine learning algorithms, we use the corresponding off-the-shelf Matlab packages.

4.3 Data Analysis

Once we obtain the classification predictions, we compute TN, FN, TP, FP, Accu, Wacc, Precision, Recall and F1-score as described in the method section. We compare different neural networks by varying the number of units in the hidden layer as well as the activation function. We also compare the performance of neural networks with that of other machine learning techniques.
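The evaluation measures of formulas (1)–(4) are straightforward to compute from the four counts. The following sketch restates them in code; the paper's own analysis was done in Matlab, so this Java version is only illustrative:

```java
// The evaluation measures of formulas (1)-(4), computed from the counts of
// true/false positives and negatives described in the text.
public class Metrics {
    public static double accu(int tn, int fp, int tp, int fn) {
        return (double) (tn + tp) / (tn + fp + tp + fn);            // formula (1)
    }
    public static double wacc(int tn, int fp, int tp, int fn, double lambda) {
        return (lambda * tn + tp) / (lambda * (tn + fp) + fn + tp); // formula (2)
    }
    public static double precision(int tp, int fp) { return (double) tp / (tp + fp); } // formula (3)
    public static double recall(int tp, int fn)    { return (double) tp / (tp + fn); } // formula (3)
    public static double f1(double p, double r)    { return 2 * p * r / (p + r); }     // formula (4)
}
```

With λ = 9, as in the simulations, each false positive costs nine times as much as a false negative; with λ = 1, wacc reduces to accu.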
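The min-max normalization of Section 4.1.2 can be sketched as below. How a constant feature (maximum equal to minimum) is handled is not specified in the text, so mapping it to 0 here is an assumption:

```java
import java.util.Arrays;

// Per-feature min-max normalization from Section 4.1.2: each feature column
// is rescaled to [0, 1] via (x - min) / (max - min).
public class Normalize {
    public static void minMax(double[] column) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : column) { min = Math.min(min, v); max = Math.max(max, v); }
        if (max == min) { Arrays.fill(column, 0.0); return; } // constant feature: assumed 0
        for (int i = 0; i < column.length; i++) column[i] = (column[i] - min) / (max - min);
    }
}
```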
Figure 1: Each curve shows the average Accu for an NN classifier with a specific hidden layer size.

Figure 2: Each curve shows the average Wacc for an NN classifier with a specific hidden layer size.

Figure 4: Wacc for NN with 0.1% training size.

We compare the NN performance using two activation functions: the hyperbolic tangent (HT) function and the sigmoid function. The results are shown in Figure 5 and Figure 6. It is noticeable that the sigmoid function performs slightly better than the hyperbolic tangent function.

Figure 5: Accu of two NN (8 hidden units) activation functions.
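For reference, the two activation functions being compared can be written as follows. The hyperbolic tangent maps a unit's net input to (−1, 1) and the sigmoid to (0, 1); this sketch only illustrates the functions themselves, not the Encog network configuration:

```java
// The two hidden-unit activation functions compared in the experiments.
// They are related by the identity tanh(x) = 2*sigmoid(2x) - 1, so they
// differ only in the range of their outputs.
public class Activations {
    public static double tanh(double x)    { return Math.tanh(x); }
    public static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }
}
```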
Figure 7: Accu of NN (8 hidden units) and other machine learning techniques.

Table 1 summarizes the performance measures for NNs with the two activation functions in detail. As seen in the table, the HT function performs slightly better in terms of all measures except recall. Notice that the largest difference out of all the measures comes from Wacc, which suggests that the HT function is better at avoiding misclassifying ham emails as phishing emails.

Table 2 summarizes the performance measures for NNs and the other machine learning techniques. As shown in the table, DT gives the best overall performance. NNs give the highest recall while still maintaining a >95% precision, suggesting that NNs are excellent at detecting phishing emails while misclassifying only a small portion of ham emails.

Table 2: Evaluations of NNs and other machine learning techniques

Method     Accu     Wacc     Precision  Recall   F1
DT         0.9658   0.9742   0.9778     0.9561   0.9668
SVM1       0.9218   0.8929   0.9022     0.9555   0.9275
SVM2       0.9579   0.9654   0.9693     0.9491   0.9591
NB         0.9278   0.9370   0.9460     0.9173   0.9367
K-nearest  0.9558   0.9536   0.9585     0.9583   0.9579
NN         0.9551   0.9494   0.9525     0.9618   0.9571

References

[1] Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, and Suku Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Group eCrime Researchers Summit, pages 649–656, 2007.

[2] Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras, and Constantine D. Spyropoulos. An evaluation of naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, 2000.

[6] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), 2007.

[7] Network Working Group. Multipurpose Internet Mail Extensions (MIME) part two: Media types. http://tools.ietf.org/html/rfc2046#section-5.1.4, 1996.

[8] Andrew Ng. CS229 lecture notes. http://cs229.stanford.edu/notes/cs229-notes5.pdf, 2012.

[9] Martin Riedmiller and Heinrich Braun. A direct adaptive method for fast backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, volume 5, pages 586–591, 1993.

[10] Wikipedia. Artificial neural network — Wikipedia, The Free Encyclopedia.