Machine Learning For HTTP Botnet Detection Using Classifier Algorithms
Abstract: Recently, the HTTP-based Botnet threat has become a serious problem for computer security experts, as bots can infect a victim’s computer quickly and stealthily. By using the HTTP protocol, bots are able to hide their communication flow within normal HTTP communications. In addition, since the HTTP protocol is widely used by internet applications, it is not easy to block this service as a precautionary approach. Thus, experts need to find ways to detect HTTP Botnets in network traffic effectively. In this paper, we propose to implement machine learning classifiers to detect HTTP Botnets. The network traffic dataset used in this research is extracted based on TCP packet features. We are also able to identify the best machine learning classifier in our experiment. The proposed method is able to classify HTTP Botnets in network traffic using the best classifier in the experiment with an average accuracy of 92.93%.

Index Terms: Botnet Detection; Classification; Classifier; HTTP Botnet; Machine Learning; Malware.

I. INTRODUCTION

Nowadays, cybercriminals use bot malware tirelessly to infect victims’ computers and make them part of their bot armies (zombie PCs), known as Botnets. The infected machines are controlled by a botmaster to commit crimes and achieve malicious intentions. A botnet can be defined as a collection of computers or devices that have been infected by malware, allowing the attacker to perform malicious activities by sending instructions through a command and control (C&C) server. There are various types of Botnet communication channels, and the earliest Botnets used the centralized network architecture of Internet Relay Chat (IRC) for the C&C server to communicate with bot zombies. To date, Botnets have adapted to several attack patterns and use various types of network protocols to commit malicious activity. One example is the peer-to-peer (P2P) Botnet, which uses P2P applications to carry out C&C server commands. However, the P2P Botnet has the drawback of complexity in managing bots in a decentralized network architecture, so the Hypertext Transfer Protocol (HTTP) Botnet was introduced to overcome the issue. The HTTP Botnet operates in a centralized network architecture, similar to the IRC Botnet, with detection evasion features such as DNS fast-flux and the use of the HTTP protocol, which make detection difficult. HTTP Botnets are responsible for several kinds of attacks, most notably distributed denial-of-service (DDoS) attacks, information theft, spamming, fraud and malware spreading. According to Ref. [1], HTTP-DDoS was a common attack by Botnets. The MyCERT Incident Statistics Report 2017 [2] stated that the number of malware infections caused by Botnets increased from January 2017 to April 2017. The threat to the common HTTP protocol (port 80), which is used by normal users to access web pages, motivates us to study the detection of HTTP Botnets and minimize their threat in the future.

An intrusion detection system (IDS) is a device or software that is used to monitor a computer network or system for malicious behavior and violations of security policy [3]. There are two main categories of IDS: network IDS (NIDS) and host IDS (HIDS) [4]. A NIDS is located at a certain point in a computer network to monitor network traffic to and from all devices connected to the network. Meanwhile, a HIDS is set up on an individual node in the network, usually on mission-critical devices such as servers. An IDS has two main detection methods, namely signature-based and anomaly-based. A signature-based IDS detects attacks based on specific known attack signatures. The drawback of a signature-based IDS is that it is not able to detect a new attack, as no attack signature is available in the IDS knowledge database. An anomaly-based IDS aims to detect malicious activities based on malicious behavior as set in the IDS rule sets. An anomaly-based IDS typically implements a machine learning approach to create a detection model (a normal and malicious detection model) for detecting new, unknown malicious behavior. An anomaly-based IDS may produce a false alarm if there is unknown legitimate behavior in the system. Hence, in this research, we use an anomaly-based IDS that implements machine learning classification to classify normal and malicious behavior in network traffic.

Machine learning is an area of artificial intelligence (AI) in which a computer is programmed with the ability to learn by studying data patterns and to make predictions on new data. Machine learning has two main categories, namely clustering and classification. In clustering, the input data are grouped by their similarity to each other without a learning model. This type of learning is known as unsupervised learning. Examples of clustering techniques are k-means [5] and power spectral density [6]. Meanwhile, classification has two phases: a training phase and a testing phase. The data are labeled by assigning a class to each data input. The machine then learns the data pattern using a classifier algorithm in the training phase and produces a learning model. In the testing phase, new data are used and the machine classifies them using the classifier algorithm together with the learning model. This type of learning is known as supervised learning. Examples of classification techniques are decision tree and Naïve Bayes. Thus, in this experiment, we use classification as our data is labeled with
malicious or normal classes for each network packet.

In this paper, the purpose of the study is to implement machine learning classifiers to detect HTTP Botnets in network traffic. The rest of the paper is organized as follows. Section II discusses related work by previous researchers. Section III discusses the methodology of the experiment. Section IV describes the results obtained. Finally, the conclusion is stated in Section V.

II. RELATED WORK

This section discusses previous work on the detection of HTTP Botnets. Various techniques have been used to detect Botnets. C. Livadas et al. [7], in 2006, were among the earliest to study the detection of Botnets using machine learning.

Reference [8] uses various types of classification algorithms: Sequential Minimal Optimization (SMO), Bayes Theorem-based algorithms, J48 Decision Tree, Random Forest, Voted Perceptron, K-Nearest Neighbour and Multilayer Perceptron. The authors highlighted that the ratio of packets corresponding to benign traffic to those of malicious traffic ranged from 4:1 to 80:1. The highest accuracy achieved is 82.48% using the random forest classifier. Interestingly, the packet ratio is also discussed by C. Chen and H. Lin [9], who use a 5:5 ratio of individual malicious traffic to normal traffic. Thus, our experiment uses a 1:1 traffic ratio, as the previous study showed good results for this ratio of individual malicious traffic to normal traffic.

Another study that achieved a high detection rate using the C4.5 Decision Tree classifier is [10], which employs C4.5 Decision Tree and Naïve Bayes learning algorithms to detect HTTP-based Botnets. The study uses flow-based network traffic (NetFlow) with HTTP filters. The highest detection rate achieved is 97%, with a very low false positive rate of 3%, using C4.5 Decision Tree as the classification algorithm.

Meanwhile, Venkatesh, G.K. and Nadarajan, R.A. [11] identify anomalies in network flows by extracting TCP packet features. The TCP packets are extracted from web-based botnet communication in specific time intervals. The researchers compared a multilayer Feed-Forward neural network model with C4.5 Decision Tree, Random Forest and Radial Basis Function (RBF). The study found that the neural network classifier has a better average detection accuracy of 99.025% on the SpyEye and Zeus Botnets. The accuracy of the experiment is then compared with [12] and [13].

Although the detection accuracy is promising, our experiment does not implement the two neural network algorithms, namely the multilayer Feed-Forward neural network and RBF, for several reasons. Firstly, a neural network requires substantial computational resources, and Graphics Processing Units (GPUs) are used to decrease the training duration. However, our experiment PCs have low specifications in terms of GPU and processing resources, which limits the capabilities of a neural network. Secondly, a neural network also requires a large set of features during the training phase compared to decision trees. Our detection features have been reduced during the data pre-processing phase and are not suitable for a neural network classifier, as it may give a false detection accuracy. These justifications are based on the work by Tabarez-Paz et al. [14] and supported by Pouliakis et al. [15], who highlighted the same issues. Hence, in this paper, classifier algorithms are used to detect HTTP Botnets in network traffic based on TCP packet features.

III. METHODOLOGY

This paper implements machine learning classifier algorithms to classify normal and bot-infected communication in network traffic. In this section, we discuss the methodology of the research using machine learning to detect HTTP Botnets. The methodology that has been carried out is depicted in Figure 1.

Figure 1: Research methodology for HTTP Botnet detection

A. Data Collection

First, a test bed environment is set up to generate the data set for HTTP Botnet detection analysis. This test bed environment aims to obtain real malicious traffic. The design of the network test bed is depicted in Figure 2. The test bed consists of five desktop PCs running the Windows 7 operating system that become Botnet zombies by executing bot binaries on the PCs. Five types of HTTP bots are used in this study, namely Dorkbot, Zeus, Citadel, SpyEye, and Cutwail. A sniffer server is also connected to the same network to capture the incoming and outgoing network traffic log at the default gateway.

Figure 2: Network design for HTTP Botnet test bed environment

Any network communication between the bot and the C&C server is collected using the tcpdump tool. The tcpdump data for the different types of HTTP Botnet are collected to analyze the
network activity of the HTTP Botnet. The five types of HTTP bots are released for seven days. After seven days, the collected tcpdump data proceed to the next phase. Malicious traffic is combined with non-malicious (normal) traffic. Normal traffic is obtained using the same test bed design without executing bot binaries on the PCs, and by performing web browsing to simulate normal user activities.

B. Data Pre-Processing

In data pre-processing, both the malicious and the normal network traffic are extracted into a parsed TCP traffic log (.csv file) using the TCPTRACE tool. The malicious and normal parsed logs are combined and labeled with “0” for normal traffic and “1” for bot traffic. The aggregated traffic then undergoes a data cleaning process, which is carried out manually to reduce errors and meaningless noise in the obtained result and to avoid miscalculation of detection accuracy in classification. Data cleaning also includes ignoring the source and destination IP addresses and port numbers, because they are inefficient for general Botnet detection and less effective against a Botnet’s IP-flux attack [16].

C. Machine Learning Classification

Then, the labeled data are run through the classifier algorithms using a data modeling tool, RapidMiner Studio [17]. The classifier algorithms used in this research are summarized in Table 1.

Table 1
Classifier description

Classification Algorithm: Description

Decision Tree: A decision tree is a machine learning model that consists of internal and leaf nodes. An internal node contains an attribute or feature of the data, while the leaf nodes show the class labels. The branches connect internal nodes to leaf nodes to create a model in the form of a learning tree.

k-Nearest Neighbour (KNN): KNN classifies unknown input data based on the class of the closest samples in the training dataset. The KNN algorithm measures the distance between the training data and the unknown data in order to classify it.

Naïve Bayes (NB): The Naïve Bayes classifier is derived from Bayes’ theorem. It uses all attributes contained in the data and analyzes each attribute independently, assuming that all attributes are equally important.

Random Forest (RF): The RF classifier is an ensemble machine learning method that works by constructing many individual decision trees on various sub-samples of the data and decides the output by selecting the class that appears most often, or by the mean prediction of the classes in the decision trees.

k-fold cross-validation (x-validation) is used in this experiment to validate the performance of the learned model. The number of folds is set to 10. In 10-fold x-validation, the input data are divided into 10 sets. Nine sets are used for learning in the training phase, and the remaining set is used as the test set in the testing phase. The validation is repeated 10 times, once per fold, and the classifier performance is evaluated using performance metrics.

For the performance metrics, the labeled data classified by the classification algorithm yield results for True Positive Rate (TPR), False Positive Rate (FPR), Accuracy and Precision [18].

IV. RESULT AND DISCUSSION

In order to evaluate the proposed approach, several Botnet datasets were used, as shown in Table 2. The Botnet datasets consist of seven large datasets for each HTTP bot executed in the test bed, with each dataset representing one day.

Table 2
The result of HTTP Botnet detection using our approach

HTTP Bot Family  Classifier     Accuracy (%)  Precision (%)  Recall/TPR (%)  FPR (%)
Dorkbot          Decision Tree  87.75         86.86          99.99           66.14
Dorkbot          KNN            90.07         93.69          94.07           26.88
Dorkbot          Naïve Bayes    70.10         91.54          69.46           27.22
Dorkbot          Random Forest  81.47         81.37          99.99           97.12
Zeus             Decision Tree  83.61         82.16          99.82           65.07
Zeus             KNN            86.96         91.21          91.42           26.44
Zeus             Naïve Bayes    51.84         84.85          43.58           23.35
Zeus             Random Forest  78.08         77.42          99.93           87.51
SpyEye           Decision Tree  90.41         96.33          90.86           11.03
SpyEye           KNN            95.26         96.84          96.94           10.09
SpyEye           Naïve Bayes    65.51         98.31          55.66           3.06
SpyEye           Random Forest  76.84         76.68          99.98           96.95
Cutwail          Decision Tree  93.73         94.91          97.94           31.52
Cutwail          KNN            97.88         98.66          98.87           8.06
Cutwail          Naïve Bayes    87.76         95.42          90.04           25.95
Cutwail          Random Forest  86.38         86.33          99.93           94.93
Citadel          Decision Tree  91.70         90.39          98.41           23.10
Citadel          KNN            94.46         95.92          96.04           9.01
Citadel          Naïve Bayes    73.23         85.51          73.55           27.47
Citadel          Random Forest  71.88         71.02          88.89           89.85

Table 2 shows the performance of the four types of classifier algorithms. The random forest classifier shows a promising TPR, with Dorkbot, Zeus, SpyEye and Cutwail detection achieving an average above 90%. However, the FPR for the random forest classifier is also high, which shows that detection using the random forest classifier may produce false alarms during HTTP Botnet detection.

Surprisingly, the accuracy produced by the KNN classifier is the highest for every bot family, with a good FPR. In other words, KNN is able to separate bot traffic from normal traffic with high detection accuracy while producing few false alarms. Hence, we conclude that the best classifier for detecting HTTP bots in this experiment is the KNN classifier algorithm. KNN performs well in terms of high accuracy, a good bot detection rate (TPR) and a low false alarm rate compared to the other classifiers.

Interestingly, although the results of the experiment show that our approach is able to detect HTTP Botnet activities in network traffic, in some circumstances the approach may falsely detect normal behavior as malicious activity in real traffic. For example, when a user keeps reloading the same web page, repeated HTTP request packets are sent to the web server. This activity resembles the
behavior pattern of HTTP Botnet attacks [19]. Thus, to ensure that our approach is able to detect Botnets effectively, we will look into selecting proper network features in the future to increase detection effectiveness.

V. CONCLUSION

The number of HTTP Botnet threats has increased year by year. Hence, there is a need to find solutions to overcome this threat. This paper aims to implement machine learning classifiers to detect HTTP Botnets. The detection is carried out on network traffic based on TCP traffic features. The proposed methodology is evaluated based on true positive rate, false positive rate and detection accuracy on five different HTTP Botnets. Four classifiers are used in the experiment, namely Decision Tree, Naïve Bayes, K-Nearest Neighbour and Random Forest. We achieve our objective of detecting HTTP Botnets using machine learning classifier algorithms. Moreover, the results showed significant readings for the classification of malicious HTTP Botnet activities in network traffic. The best classifier in this experiment is the K-Nearest Neighbour classifier, which achieves an average detection accuracy of 92.93% with a TPR of 95.47%. The results show that KNN is able to detect HTTP Botnets in network traffic with a low false alarm rate compared to the other machine learning classifiers. The results achieved in the experiment may contribute to the body of knowledge in the computer network security field by showing that machine learning classifiers are capable of detecting HTTP Botnets. In the future, we will perform a selection of network attributes. The purpose of attribute selection is to reduce the number of features while obtaining similar or better results than without attribute selection.

ACKNOWLEDGMENT

This work has been supported under Universiti Teknikal Malaysia Melaka (UTeM) research grant GLUAR/CSM/2016/FTMK-CACT/I00013 and the KPT MyBrain15 postgraduate scholarship. The authors would like to thank the reviewers and the members of the INSFORNET research group for their support and guidance in the making of this paper.

REFERENCES

[1] Kaspersky Lab, “Statistics on Botnet-Assisted DDoS Attacks in Q1 2015,” 2015. [Online]. Available: https://securelist.com/statistics-on-botnet-assisted-ddos-attacks-in-q1-2015/70071/. [Accessed: 12-Jul-2015].
[2] MyCERT, “MyCERT Incident Statistics 2017,” 2017. [Online]. Available: https://www.mycert.org.my/statistics/2017.php. [Accessed: 20-May-2017].
[3] J. Jabez and B. Muthukumar, “Intrusion Detection System (IDS): Anomaly Detection Using Outlier Detection Approach,” Procedia Comput. Sci., vol. 48, pp. 338–346, 2015.
[4] M. A. Khan, “A survey of security issues for cloud computing,” J. Netw. Comput. Appl., vol. 71, pp. 11–29, 2016.
[5] C. Fachkha, E. Bou-Harb, and M. Debbabi, “Inferring distributed reflection denial of service attacks from darknet,” Comput. Commun., vol. 62, pp. 59–71, 2015.
[6] J. Kwon, J. Lee, H. Lee, and A. Perrig, “PsyBoG: A scalable botnet detection method for large-scale DNS traffic,” Comput. Networks, vol. 97, pp. 48–73, 2016.
[7] C. Livadas, R. Walsh, D. Lapsley, and W. T. Strayer, “Using machine learning techniques to identify botnet traffic,” in Proceedings - Conference on Local Computer Networks, LCN, 2006, pp. 967–974.
[8] F. Brezo, D. Puerta, X. Ugarte-pedrero, I. Santos, P. G. Bringas, and D. Barroso, “A Supervised Classification Approach for Detecting Packets Originated in a HTTP-based Botnet,” vol. 16, no. 3, pp. 1–13, 2013.
[9] C.-M. Chen, Y.-H. Ou, and Y.-C. Tsai, “Web botnet detection based on flow information,” 2010 Int. Comput. Symp., pp. 381–384, 2010.
[10] F. Haddadi, J. Morgan, E. G. Filho, and A. N. Zincir-Heywood, “Botnet behaviour analysis using IP flows: With http filters using classifiers,” Proc. - 2014 IEEE 28th Int. Conf. Adv. Inf. Netw. Appl. Work. IEEE WAINA 2014, pp. 7–12, 2014.
[11] G. Kirubavathi Venkatesh and R. Anitha Nadarajan, “HTTP Botnet Detection Using Adaptive Learning Rate Multilayer Feed-Forward Neural Network,” Inf. Secur. Theory Pract. Secur. Priv. Trust Comput. Syst. Ambient Intell. Ecosyst. SE - 5, vol. 7322, pp. 38–48, 2012.
[12] Nogueira, P. Salvador, and F. Blessa, “A Botnet Detection System Based on Neural Networks,” Digit. Telecommun. (ICDT), 2010 Fifth Int. Conf., pp. 57–62, 2010.
[13] G. Gu, R. Perdisci, J. Zhang, and W. Lee, “BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection,” Proc. 17th Conf. Secur. Symp., pp. 139–154, 2008.
[14] Tabarez-paz, N. Hernández-Gress, and M. G. Mendoza, “Improving of Artificial Neural Networks Performance by Using GPU’s: A Survey,” in Third International Conference on Advances in Computing & Information Technology, 2013, no. 1943, pp. 39–48.
[15] Pouliakis, E. Karakitsou, N. Margari, P. Bountris, M. Haritou, J. Panayiotides, D. Koutsouris, and P. Karakitsos, “Artificial Neural Networks as Decision Support Tools in Cytopathology: Past, Present, and Future,” Biomed. Eng. Comput. Biol., no. 7, pp. 7–1, 2016.
[16] E. B. Beigi, H. H. Jazi, N. Stakhanova, and A. A. Ghorbani, “Towards Effective Feature Selection in Machine Learning-Based Botnet Detection Approaches,” pp. 247–255, 2014.
[17] RapidMiner inc, “RapidMiner: Data Science Platform,” 2016. [Online]. Available: https://rapidminer.com/. [Accessed: 23-Sep-2016].
[18] M. Z. Mas’Ud, S. Sahib, M. F. Abdollah, S. R. Selamat, and R. Yusof, “Analysis of features selection and machine learning classifier in android malware detection,” ICISA 2014 - 2014 5th Int. Conf. Inf. Sci. Appl., 2014.
[19] M. Eslahi, H. Hashim, and N. M. Tahir, “An efficient false alarm reduction approach in HTTP-based botnet detection,” IEEE Symp. Comput. Informatics, Isc. 2013, pp. 201–205, 2013.
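
As an illustration of the labeling and cleaning step described in Section III-B, the following is a minimal Python sketch rather than the tooling used in the experiment, where cleaning was carried out manually. It assumes the TCPTRACE output has already been parsed into rows of named fields. The field names (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `total_packets`), the sample values and the helper `label_and_clean` are hypothetical; only the "0"/"1" labels and the removal of address and port fields come from the text.

```python
# Hypothetical field names: the paper does not list the TCPTRACE columns.
IDENTIFIER_FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port")

def label_and_clean(normal_rows, bot_rows):
    """Merge both traffic logs, attach class labels, drop address/port fields."""
    combined = []
    for label, rows in ((0, normal_rows), (1, bot_rows)):  # 0 = normal, 1 = bot
        for row in rows:
            cleaned = {k: v for k, v in row.items() if k not in IDENTIFIER_FIELDS}
            cleaned["label"] = label
            combined.append(cleaned)
    return combined

# Made-up example records, one per class.
normal = [{"src_ip": "10.0.0.5", "dst_ip": "93.184.216.34",
           "src_port": 50112, "dst_port": 80, "total_packets": 12}]
bot = [{"src_ip": "10.0.0.7", "dst_ip": "203.0.113.9",
        "src_port": 49200, "dst_port": 80, "total_packets": 4}]

dataset = label_and_clean(normal, bot)
print(dataset)
```

Dropping the identifier fields before training keeps the classifier from memorizing specific C&C addresses, which is the concern raised with regard to IP-flux in [16].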
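
The 10-fold cross-validation and the performance metrics described in Section III-C can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the RapidMiner Studio process used in the experiment: the data are synthetic, the number of neighbours (`n_neighbors=5`) and the Euclidean distance are assumptions because the paper does not state them, and the confusion-matrix counts are pooled over the folds, which may differ from how RapidMiner aggregates fold results. All function names are hypothetical.

```python
import math
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def knn_predict(train_x, train_y, query, n_neighbors=5):
    """Majority vote of the closest training samples (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def metrics(tp, fp, tn, fn):
    """Accuracy, precision, TPR and FPR in percent from confusion-matrix counts."""
    def pct(num, den):
        return 100.0 * num / den if den else 0.0
    return {
        "accuracy": pct(tp + tn, tp + fp + tn + fn),
        "precision": pct(tp, tp + fp),
        "tpr": pct(tp, tp + fn),   # bot traffic correctly flagged
        "fpr": pct(fp, fp + tn),   # normal traffic wrongly flagged
    }

def cross_validate(xs, ys, k=10, n_neighbors=5):
    """Train on k-1 folds, test on the held-out fold, and pool the counts."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in test_idx:
            predicted = knn_predict(train_x, train_y, xs[i], n_neighbors)
            if predicted == 1:
                counts["tp" if ys[i] == 1 else "fp"] += 1
            else:
                counts["fn" if ys[i] == 1 else "tn"] += 1
    return metrics(**counts)

# Synthetic two-feature data with a 1:1 class ratio: 0 = normal, 1 = bot.
rng = random.Random(42)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
xs += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(100)]
ys = [0] * 100 + [1] * 100

result = cross_validate(xs, ys)
print(result)
```

Each sample is used for testing exactly once, so the reported metrics cover the whole dataset while every prediction is made by a model that never saw the sample during training.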
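
The average KNN figures quoted in the abstract and in Section V can be checked against the KNN rows of Table 2. The short Python sketch below assumes that the reported averages are unweighted means over the five bot families, an assumption that reproduces the reported 92.93% accuracy and 95.47% TPR. The names `KNN_RESULTS` and `mean_column` are hypothetical.

```python
# KNN rows of Table 2: (accuracy, precision, TPR, FPR) in percent per bot family.
KNN_RESULTS = {
    "Dorkbot": (90.07, 93.69, 94.07, 26.88),
    "Zeus": (86.96, 91.21, 91.42, 26.44),
    "SpyEye": (95.26, 96.84, 96.94, 10.09),
    "Cutwail": (97.88, 98.66, 98.87, 8.06),
    "Citadel": (94.46, 95.92, 96.04, 9.01),
}

def mean_column(results, column):
    """Unweighted mean of one metric column, rounded to two decimals."""
    values = [row[column] for row in results.values()]
    return round(sum(values) / len(values), 2)

avg_accuracy = mean_column(KNN_RESULTS, 0)  # 464.63 / 5
avg_tpr = mean_column(KNN_RESULTS, 2)       # 477.34 / 5
print(avg_accuracy, avg_tpr)  # 92.93 95.47
```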