An empirical comparison of botnet detection methods
Article history:
Received 21 October 2013
Received in revised form 29 April 2014
Accepted 27 May 2014
Available online 5 June 2014

Keywords:
Botnet detection
Malware detection
Methods comparison
Botnet dataset
Anomaly detection
Network traffic

The results of botnet detection methods are usually presented without any comparison. Although it is generally accepted that more comparisons with third-party methods may help to improve the area, few papers could do it. Among the factors that prevent a comparison are the difficulties to share a dataset, the lack of a good dataset, the absence of a proper description of the methods and the lack of a comparison methodology. This paper compares the output of three different botnet detection methods by executing them over a new, real, labeled and large botnet dataset. This dataset includes botnet, normal and background traffic. The results of our two methods (BClus and CAMNEP) and BotHunter were compared using a methodology and a novel error metric designed for botnet detection methods. We conclude that comparing methods indeed helps to better estimate how good the methods are, to improve the algorithms, to build better datasets and to build a comparison methodology.

© 2014 Elsevier Ltd. All rights reserved.
* Corresponding author. ISISTAN Research Institute-CONICET, Faculty of Sciences, UNICEN University, Argentina.
E-mail addresses: sebastian.garcia@isistan.unicen.edu.ar, eldraco@gmail.com (S. García), grill@agents.fel.cvut.cz (M. Grill), jan.stiborek@agents.felk.cvut.cz (J. Stiborek), alejandro.zunino@isistan.unicen.edu.ar (A. Zunino).
http://dx.doi.org/10.1016/j.cose.2014.05.011
0167-4048/© 2014 Elsevier Ltd. All rights reserved.
method is usually focused on different properties of the dataset. The problem is to find a good, common and public dataset that can be read by all methods and satisfies all the constraints.

The difficulty of comparing detection methods goes beyond the dataset. The lack of good descriptions of the methods and error metrics contributes to the problem. As stated by Rossow et al. (2012), the error metrics used in most papers are usually non-homogeneous. They tend to use different error metrics and different definitions of error. Moreover, the most common error metrics, e.g. FPR, seem to be not enough to compare botnet detection methods. The classic error metrics were defined from a statistical point of view and they fail to address the detection needs of a network administrator.

The goal of this paper is to compare three botnet detection methods using a simple and reproducible methodology, a good dataset and a new error metric. The contributions of our paper are:

- A deep comparison of three detection methods: our own algorithms, CAMNEP and BClus, and the third-party algorithm BotHunter (Gu et al., 2007).
- A simple methodology for comparing botnet detection methods, along with the corresponding public tool for reproducing the methodology.
- A new error metric designed for comparing botnet detection methods.
- A new, large, labeled and real botnet dataset that includes botnet, normal and background data.

We conclude that the comparison of different botnet detection methods with other proposals is highly beneficial for the botnet research community, because it helps to objectively assess the methods and improve the techniques. Also, the use of a good botnet dataset is paramount for the comparison.

The rest of the paper is organized as follows. Section 2 shows previous work in the area. Section 3 describes the CAMNEP detection method. Section 4 shows the BClus botnet detection method. Section 5 describes the BotHunter method. Section 6 describes the dataset and its features. Section 7 describes the comparison methodology, the public tool and the new error metric. Section 8 shows the results and compares the methods, and Section 9 presents our conclusions.

2. Previous work

The comparison of detection methods is usually considered a difficult task. In the case of botnets it is also related to the creation of a new dataset. The next Subsections describe the previous work in the area of comparison of methods and the area of creation of datasets.

2.1. Comparison of methods

The comparison of a new detection method with a third-party method is difficult. In the survey presented by García et al. (2013), where there is a deep analysis of fourteen network-based botnet detection methods, the authors found only one paper that made such a comparison. The survey compared the motivations, datasets and results of the fourteen proposals. It concludes that it is difficult to compare the results with another proposal because the datasets tend to be private and the descriptions of the methods tend to be incomplete.

Another analysis of the difficulty of reproducing a method was described by Tavallaee et al. (2010), where they state that there is an absence of proper documentation of the methods and experiments in most detection proposals.

One of the detection proposals that actually made a comparison with a third-party method was presented by Wurzinger et al. (2010). The purpose of the paper is to identify single infected machines using previously generated detection models. It first extracts the character strings from the network to find the commands sent by the C&C and then it finds the bot responses to those commands. The authors downloaded and executed the BotHunter program of Gu et al. (2007) on their dataset and made a comparison. However, the paper only compares the results of both proposals using the TPR error metric and the FP values.

The other paper that made a comparison with a third-party method was presented by Zhao et al. (2013). This proposal selects a set of attributes from the network flows and then applies a Bayes Network algorithm and a Decision Tree algorithm to classify malicious and non-malicious traffic. The third-party method used for comparison was again BotHunter. There is a description of how BotHunter was executed, but unfortunately the only error metric reported was a zero False Positive. No other numerical values were presented.

The last proposal that also compared its results with a third-party method was made by Li et al. (2010). This paper analyzes the probable bias that the selection of ground-truth labels might have on the accuracy reported for malware clustering techniques. It states that common methods for determining the ground truth of labels may bias the dataset toward easy-to-cluster instances. This work is important because it successfully compared its results with the work of Bayer et al. (2009). The comparison was done with the help of Bayer et al., who ran the algorithms described in Li et al. (2010) on their private dataset.

Regarding the creation of datasets for malware-related research, Rossow et al. (2012) presented a good paper about the prudent practices for designing malware experiments. They defined a prudent experiment as one that is correct, realistic, transparent and does not harm others. After analyzing 36 papers they concluded that most of them had shortcomings in one or more of these areas. Most importantly, they concluded that only a minority of papers included real-world traffic in their evaluations.

2.2. Datasets available

Regarding botnet datasets that are available for download, a deep study about the generation of datasets was presented in Shiravi et al. (2012). It describes the properties that a dataset should have in order to be used for comparison purposes. The dataset used in the paper includes an IRC-based Botnet attack,1 but the bot used for the attack was developed by the authors and therefore it may not represent a real botnet behavior. This dataset may be downloaded with authorization.

1 http://www.iscx.ca/datasets.
The Protected Repository for the Defense of Infrastructure Against Cyber Threats (PREDICT) indexed three Botnet datasets2 until May 16th, 2013. The first one is the Kraken Botnet Sinkhole Connection Data dataset, the second one is the Flashback Botnet Sinkhole Connection Data dataset and the third one is the Conficker Botnet Sinkhole Connection Data dataset. They were published as CSV text files, where each line is a one minute aggregation of the number of attempted connections of one IP address. Unfortunately, the aggregation method may not be suitable for comparisons with other proposals.

The CAIDA organization published a paper about the Sality botnet in Dainotti et al. (2012) along with its corresponding dataset.3 Unfortunately, the CSV text format of the dataset may not be suitable for every detection algorithm, because the content of the dataset only includes enough information to reproduce the techniques in the paper. CAIDA also published a dataset about the Witty Botnet in pcap format4 and several datasets with responses to spoofed DoS traffic5 and anomalous packets.6 None of them are labeled.

A custom botnet dataset was created to verify five P2P botnet detection algorithms in Saad et al. (2011). Fortunately, this dataset was made public and can be downloaded.7 The dataset is a mixture of two existing and publicly available malicious datasets and one non-malicious pcap dataset. They were merged to generate a new file. This was, at that time, the best dataset that could be downloaded for comparison purposes. Unfortunately, there is only one infected machine for each type of botnet, therefore no synchronization analysis can be done.

The Traffic Laboratory at Ericsson Research created a normal dataset that was used in Saad et al. (2011) and described in Szabo et al. (2008). This normal dataset is publicly available.8 It is composed of pcap traffic files that were labeled by means of one IP header option field. This is the only normal dataset that is labeled inside the pcap file.

A considerable amount of malware traffic in pcap format was published in the Contagio blog.9 It contains thirty-one APT pcap captures and sixty-one crimeware pcaps. Each file contains the traffic of one malware without background traffic. Unfortunately, the captures are really short (mostly between 1 min and 1 h) and the traffic is not labeled. But since each scenario includes only one infected computer, it should be possible to label them.

Another dataset with malware logs and benign logs was collected in NexGinRC (2013). The malware logs are both real and simulated. The benign logs consist of 12 months of traffic. Unfortunately, the dataset is in CSV format, which may not be suitable for some detection algorithms because it does not have the same information as a NetFlow file or pcap file. Access to this dataset may be granted upon request.10

The last dataset analyzed is currently created by the MAWI project described in Cho et al. (2000). It includes an ongoing effort to publish one of the most recent and updated background datasets to date. Its goal is to promote the research on traffic analysis and the creation of free analysis tools. However, the pcap files are not labeled, and therefore it is more difficult to use them for training or verification. There was an effort to label this dataset using anomaly detectors in Fontugne et al. (2010). The labels are not ground-truth, but may be useful to compare other methods.

A summary of the described datasets is presented in Table 1. This table shows that, so far, no dataset includes Background, Botnet and Normal labeled data.

2 https://www.predict.org/Default.aspx?tabid=104.
3 http://imdc.datcat.org/collection/1-06Y5-B=UCSD+Network+Telescope+Dataset+on+the+Sipscan.
4 http://www.caida.org/data/passive/witty_worm_dataset.xml.
5 http://www.caida.org/data/passive/backscatter_2008_dataset.xml.
6 http://www.caida.org/data/passive/telescope-2days-2008_dataset.xml.
7 http://www.isot.ece.uvic.ca/dataset/ISOT_Botnet_DataSet_2010.tar.gz.
8 http://www.crysys.hu/~szabog/measurement.tar.
9 http://contagiodump.blogspot.co.uk/2013/04/collection-of-pcap-files-from-malware.html.
10 http://nexginrc.org/Datasets/DatasetDetail.aspx?pageID=24.

3. The CAMNEP detection method

The Cooperative Adaptive Mechanism for NEtwork Protection (CAMNEP) (Rehak et al., 2009) is a Network Behavior Analysis system (Scarfone and Mell, 2007) that consists of various state-of-the-art anomaly detection methods. The system models the normal behavior of the network and/or individual users' behaviors and labels deviations from normal behaviors as anomalous.

3.1. System architecture

CAMNEP processes NetFlow data provided by routers or other network equipment to identify anomalous traffic by means of several collaborative anomaly detection algorithms. It uses
a multi-algorithm and multi-stage approach to reduce the amount of false positives generated by the individual anomaly detectors without compromising the performance of the system. The self-monitoring and self-adaptation techniques, described in Section 3.4, are very important to this purpose. They help improve the error rate of the system with a minimal and controllable impact on its efficiency.

CAMNEP consists of three principal layers that evaluate the traffic: anomaly detectors, trust models and anomaly aggregators.

The anomaly detectors layer (identified as Anomaly Detectors A and B in Fig. 1) analyzes the NetFlows using various anomaly detection algorithms. We are currently using 8 different anomaly detection approaches. Each of them uses a different set of features, thus looking for an anomaly from slightly different perspectives. The output of these algorithms is aggregated into events using several statistical functions and the results are sent to the trust models. All the detection algorithms used in the system are described in detail in Section 3.2.

The trust models layer maps the NetFlows into traffic clusters. These clusters group together the NetFlows that have a similar behavioral pattern. They also contain the anomaly value of the type of the event that they represent. These clusters persist over time and the anomaly value is updated by the trust model. The updated anomaly value of a cluster is used to determine the anomaly of new NetFlows. Therefore, the trust models act as a persistent memory and reduce the amount of false positives by means of the spatial aggregation of the anomalies.

The aggregators layer creates one composite output that integrates the individual opinions of several anomaly detectors as they were provided by the trust models. The result of the aggregation is presented to the user of the system as the final anomaly score of the NetFlows. Each aggregator can use two different averaging operators: an order-weighted averaging (Yager, 1988) or simple weighted averaging. CAMNEP uses a simulation process to determine the best aggregation operator for the current type and state of the network. This process is described in Section 3.4.

3.2. Anomaly detectors

The anomaly detectors in the CAMNEP system are based on already published anomaly detection methods. They work in two stages: (i) they extract meaningful features associated with each NetFlow (or group of NetFlows), and (ii) they use the values of these features to assign an anomaly score to each NetFlow. This anomaly score is a value in the [0,1] range. A value of 1 represents an anomaly and a value of 0 represents normal behavior. The anomaly detector's model is a fuzzy classifier that provides the anomaly value for each NetFlow. This value depends on the NetFlow itself, on other NetFlows in the current context, and on the internal traffic model, which is based on the past traffic observed on the network.

The following subsections describe each of the anomaly detectors used in the CAMNEP system.

3.2.1. MINDS
The MINDS algorithm (Ertoz et al., 2004) builds context information for each evaluated NetFlow using the following features: the number of NetFlows from the same source IP address as the evaluated NetFlow, the number of NetFlows toward the same destination host, the number of NetFlows towards the same destination host from the same source port, and the number of NetFlows from the same source host towards the same destination port. This is a simplified version of the original MINDS system, which also uses a secondary window defined by the number of connections in order to address slow attacks. The anomaly value for a NetFlow is based on its distance to the normal sample. The metric defined in this four-dimensional context space uses a logarithmic scale on each context dimension, and these marginal distances are combined into the global distance as the sum of their squares. In the CAMNEP implementation of this algorithm, the variance-adjusted difference between the floating average of past values and the evaluated NetFlow on each of the four context dimensions is used to know if the evaluated NetFlow is anomalous. The original work is based on the combination of computationally-intensive clustering and human intervention.

3.2.2. Xu
In the algorithm proposed by Xu et al. (2005), the context of each NetFlow to be evaluated is created with all the NetFlows coming from the same source IP address. In the CAMNEP implementation, for each context group of NetFlows, a 3-dimensional model is built with the normalized entropy of the source ports, the normalized entropy of the destination ports, and the normalized entropy of the destination IP addresses. The anomalies are determined by some classification rules that divide the traffic into normal and anomalous. The distance between the contexts of two NetFlows is computed as the difference between the three normalized entropies, combined as the sum of their squares. Our implementation of the algorithm is close to the original publication, which was further expanded by Xu and Zhang (2005), except for the introduction of new rules that define a horizontal port scan as anomalous.

3.2.3. Lakhina volume
The volume prediction algorithm presented in Lakhina et al. (2004) uses the Principal Components Analysis (PCA) algorithm to build a model of traffic volumes from individual sources. The observed traffic for each source IP address with non-negligible volumes of traffic is defined as a three-dimensional vector: the number of NetFlows, number of bytes and number of packets from the source IP address. The traffic model is defined as a dynamic and data-defined transformation matrix that is applied to the current traffic vector. The transformation splits the traffic into normal (i.e. modeled) and residual (i.e. anomalous). The transformation returns the residual amount of NetFlows, packets and bytes for each source IP address. These values define the context (identical for all the flows from the given source). An anomaly is determined by transforming the 3D context into a single value in the [0,1] interval.

Notice that the original work was designed to handle a different problem, that is, the detection of anomalies on a backbone. Also, the original work modeled networks instead of source IP addresses. However, we modified it to obtain a classifier that can successfully contribute to the joint opinion when combined with others.

3.2.4. Lakhina entropy
The entropy prediction algorithm presented by Lakhina et al. (2005) is based on a PCA-based traffic model similar to that of Section 3.2.3, but it uses different features. It aggregates the traffic from the individual source IP addresses, but instead of traffic volumes, it predicts the entropies of destination IP addresses, destination ports and source ports over the set of context NetFlows for each source. The context space is therefore three-dimensional. An anomaly is determined as the normalized sum of residual entropy over all three dimensions. The metric is simple: a function measures the difference of residual entropies between the NetFlows and aggregates their squares. Also, the original anomaly detection method was significantly modified along the same lines as the volume prediction algorithm.

3.2.5. TAPS
The TAPS method (Sridharan et al., 2006) is different from the previous approaches because it targets horizontal and vertical port scans. The algorithm only considers the traffic sources that created at least one single-packet NetFlow during a particular observation period. These preselected sources are then classified using the following three features: number of destination IP addresses, number of destination ports and the entropy of the NetFlow size measured in number of packets. The anomaly value of the source IP address is based on the ratio between the number of unique destination IP addresses and destination ports. When this ratio exceeds a predetermined threshold, the source IP address is considered a scan origin. Using the original method, we encountered an unusually high number of false positives. Therefore, we extended the method with the NetFlow size entropy to achieve better results.
3.2.6. KGB
The KGB anomaly detector presented by Pevný et al. (2012) is also based on Lakhina's work. It uses the same features as the Lakhina entropy detector described above. Similar to Lakhina's work, it performs a PCA analysis of the feature vectors for each source IP address in the dataset. The final anomaly is determined from the deviations of averaging the principal components. There are two versions of the KGB detector:

3.3. Trust models

Each trust model determines the trustfulness of each NetFlow by finding all the centroids in the NetFlow's vicinity. It sets the trustfulness using the distance-based weighted average of the values preserved by the centroids. All the models provide their trustfulness assessment (conceptually a reputation opinion) to the anomaly aggregators.

3.4. Adaptation

Fig. 2 - Distribution of challenges. The anomalies distribution of the malicious challenges (from one class) is on the left side of the graph, while the legitimate events are on the right.
The robustness achieved by the use of multiple algorithms makes an evasion attempt a much more difficult task than simply avoiding a single intrusion detection method (Rubinstein et al., 2009).

Furthermore, the system is able to find the optimal thresholds for the anomaly score when using the results from the adaptation process. The system is continuously modeling the normal distribution of malicious and legitimate challenges. The threshold is set to minimize the Bayes risk (posteriori expected loss) computed from the modeled legitimate and malicious behavior distributions. Thus, the final result of the system is a list of NetFlows with labels anomalous or normal.

3.5. Training of the CAMNEP method

Since the system needs the inner models of the anomaly detectors and trust models to have optimal detection results, it is necessary to train them. Typically, the system needs 25 min of traffic to create its inner models and to adapt itself to the current type of network and its state. Therefore, the training data for the CAMNEP algorithm was created by trimming off some minutes at the start of each of the scenarios in the dataset described in Section 6.

4. The BClus detection method

The BClus method is a behavioral-based botnet detection approach. It creates models of known botnet behavior and uses them to detect similar traffic on the network. It is not an anomaly detection method.

The purpose of the method is to cluster the traffic sent by each IP address and to recognize which clusters have a behavior similar to the botnet traffic. A basic schema of the BClus method is:

1. Separate the NetFlows in time windows.
2. Aggregate the NetFlows by source IP address.
3. Cluster the aggregated NetFlows.
4. Only Training: Assign ground-truth labels to the botnet clusters.
5. Only Training: Train a classification model on the botnet clusters.
6. Only Testing: Use the classification model to recognize the botnet clusters.

At the end, the BClus method outputs a predicted label for each NetFlow analyzed. The following Subsections describe each of these steps.

4.1. Separate the NetFlows in time windows

The main reason to separate the NetFlows in time windows is the huge amount of data that had to be processed. Some of our botnet scenarios produced up to 395,000 packets per minute. A short time window allows us to better process this information.

The second reason to use time windows is that botnets tend to have a temporal locality behavior (Hegna, 2010), meaning that most actions remain unchanged for several minutes. In our dataset the locality behavior ranges between 1
and 30 min. This temporal locality helps to capture all the important behaviors in the time windows.

The third reason for using time windows is the need to deliver a result to the network administrator in a timely manner. After each time window, the BClus method outputs some results and the administrator can have an input about the traffic.

An important decision on the time window separation criteria is the window width. A short time window does not contain enough NetFlows and therefore would not allow a correct analysis of the botnet behavior. On the other hand, a large time window would have a high computational cost. The time window used by the BClus method is of two minutes, since it is enough to capture all the botnet behaviors and it does not contain too many NetFlows.

The next step in the BClus method is to aggregate the NetFlows.

4.2. Aggregate the NetFlows by source IP address

The purpose of aggregating the NetFlows is to analyze the problem from a new, high-level perspective. The aggregated data may show new patterns. We hypothesize that these new patterns could help recognize the behaviors of botnets. From the botnet detection perspective, the main motivations for aggregating NetFlows are the following:

- Each bot communicates with the C&C server periodically (AsSadhan et al., 2009).
- Several bots may communicate at the same time with the same C&C servers (Gu et al., 2008).
- Several bots attack the same target at the same time (Lee et al., 2008).

Inside each time window, the NetFlows are aggregated during one aggregation window. The width of the aggregation window should be less than the width of the time window, which was of two minutes. After some experimentation, a one minute aggregation window width was selected, which is enough to capture the botnet synchronization patterns and short enough not to capture too much traffic (García et al., 2012). Therefore, on each time window, two aggregation windows are used.

The NetFlows are aggregated by unique source IP address. The resulting features on each aggregation window are:

1. Source IP address.
2. Amount of unique source ports used by this source IP address.
3. Amount of unique destination IP addresses contacted by this source IP address.
4. Amount of unique destination ports contacted by this source IP address.
5. Amount of NetFlows used by this source IP address.
6. Amount of bytes transferred by this source IP address.
7. Amount of packets transferred by this source IP address.

We call this group of seven aggregated features an instance to simplify the references. The aggregation step ends with a list of instances for each aggregation window. The next Subsection describes the clustering process of these instances.

4.3. Cluster the aggregated NetFlows

The continuous evolution of botnets suggests that a good detection method should be as independent of the network characteristics as possible. The BClus method, then, uses an unsupervised approach to cluster the instances described in the previous section. These natural groups of behaviors depend on the time window being analyzed and on the characteristics of the network where the algorithm is running.

The technique used for this task is WEKA's implementation of the Expectation-Maximization (EM) algorithm (Moon, 1996). EM is an iterative procedure that attempts to find the parameters of the model that maximize the probability of the observed data. Our dataset has many different network behaviors generated by normal, botnet and attack actions. We hypothesize that these behaviors are generated from different probabilistic models and that the parameters of these models can be found using the EM algorithm. The instances are assigned to the probability distribution that they most likely belong to, therefore building clusters.

After generating the clusters, the task of the BClus method is to find which of them belong to botnets. The features of a cluster are the average and standard deviation of the seven instance features described in Section 4.2. The following 15 cluster features are obtained for each cluster:

1. Total amount of instances in the cluster.
2. Total amount of NetFlows in the cluster.
3. Amount of source IP addresses.
4. Average amount of unique source ports.
5. Standard deviation of the amount of unique source ports.
6. Average amount of unique destination IP addresses.
7. Standard deviation of the amount of unique destination IP addresses.
8. Average amount of unique destination ports.
9. Standard deviation of the amount of unique destination ports.
10. Average amount of NetFlows.
11. Standard deviation of the amount of NetFlows.
12. Average amount of bytes transferred.
13. Standard deviation of the amount of bytes transferred.
14. Average amount of packets transferred.
15. Standard deviation of the amount of packets transferred.

Once the features are extracted, they are used in the next Subsection to assign the ground-truth labels to the clusters.

4.4. Train a classification model on the botnet clusters

The classification algorithm used to find the botnet clusters is JRIP. It is WEKA's implementation of "( … ) a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which was proposed by William W. Cohen as an optimized version of IREP" (Hall et al., 2009).
The JRIP algorithm receives a labeled group of clusters and outputs a group of rules to detect them. That means that the JRIP algorithm needs to be trained on how to recognize a botnet cluster. This training is done with the following leave-one-out algorithm:

1. Training phase.
(a) Use a leave-one-out algorithm with the training and cross-validation datasets. For each round do:
i. Separate the NetFlows in time windows (Section 4.1).
ii. Aggregate the NetFlows by source IP address (Section 4.2).
iii. Cluster the aggregated NetFlows (Section 4.3).
iv. Assign ground-truth labels to the clusters based on the ground-truth labels of the NetFlows (Section 4.4.1).
v. Train a JRIP classification model to recognize the botnet clusters.
vi. Apply the JRIP model in the cross-validation dataset of this round.
vii. Store the error metrics of this round.
2. Select the best JRIP model based on the results of the leave-one-out.
3. Testing phase (Section 4.5).
(a) Read the testing dataset.
(b) Use the best JRIP classification model to recognize the botnet clusters.
(c) Assign the labels to the NetFlows based on the labels of the clusters.

The rest of the Subsections describe each of the steps.

4.4.1. Assign ground-truth labels to the botnet clusters
Ground-truth labels should be assigned to the clusters because we need to train the JRIP classification algorithm with them. Once the JRIP algorithm knows which are the botnet clusters, it can create a model to recognize them.

To assign a ground-truth label to a cluster, we should first assign a ground-truth label to all of its instances (aggregated NetFlows). However, to assign a ground-truth label to an instance, we should first assign a ground-truth label to all of its NetFlows.

The ground-truth label of each NetFlow is known from the original NetFlow files that are part of the dataset. Therefore, the ground-truth label of each instance is known, since an instance is composed of all the NetFlows from the same IP address. However, a cluster is composed of different instances coming from different IP addresses, and then it is not straightforward to know which ground-truth label should be assigned to a cluster. This Subsection describes how we assigned a ground-truth label to each cluster.

To help us decide which label should be assigned to each cluster, a new feature was computed for each cluster: the percentage of botnet NetFlows in the cluster. This value is expected to be bigger on botnet clusters and smaller on background clusters, and it was used to select the ground-truth label for the cluster. Notice that this feature is only used to assign the ground-truth label in the training phase and is not stored nor used in the testing phase.

As this new feature is a percentage, a correct threshold had to be found. This threshold decision is very important, because different percentages correspond to different botnet behaviors. If it is above 0%, it means that every cluster with at least one botnet NetFlow is considered a representative of a botnet behavior. If it is above 1%, it means that only clusters with more than 1% of botnet NetFlows are considered a representative of a botnet behavior. A manual analysis of the dataset determined that most of the real botnet clusters had between 0% and 1% of botnet NetFlows. To find out which threshold between 0% and 1% was the best, we implemented the leave-one-out algorithm described in Section 4.4 to try the following ten candidate thresholds: 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9% and 1%.

After running the leave-one-out technique we found that the group of clusters that has the best error metrics for the BClus algorithm was generated with a threshold of 0.4%. The set of JRIP rules generated by the 0.4% percentage became the best detection model, applied in the next Subsection.

4.5. Testing phase. Use the classification model to recognize the botnet clusters

Once the best detection model was found in the previous Subsection, we applied it on the testing dataset to know the real performance of the BClus algorithm.

The testing dataset was processed in the same way as the training dataset. That is, it was separated in two minutes time windows, the NetFlows in each time window were aggregated by source IP address every one minute, and those aggregated instances were clustered. Then, the best JRIP model was applied to detect the botnet clusters.

If a cluster was classified as botnet, then all of its instances were labeled as botnet, and in turn all of the NetFlows in those instances were labeled as botnet. Finally, the BClus method outputs a list of NetFlows with the predicted label assigned. This list of labeled NetFlows for each testing scenario is the output that will be compared to the CAMNEP and BotHunter methods in Section 8.

5. The BotHunter method

The BotHunter method was proposed by Gu et al. (2007) to detect the infection and coordination dialog of botnets by matching a state-based infection sequence model. It consists of a correlation engine that aims at detecting specific stages of the malware infection process, such as inbound scanning, exploit usage, egg downloading, outbound bot coordination dialog and outbound attack propagation.

It uses an adapted version of the Snort IDS12 with two proprietary plug-ins, called Statistical Scan Anomaly Detection Engine (SCADE) and Statistical Payload Anomaly Detection Engine (SLADE). SLADE implements a lossy n-gram payload analysis of incoming traffic flows to detect divergences in some protocols. SCADE performs port scan analyses.

12 http://www.snort.org.

An infection is reported when one of two conditions is satisfied: first, when an evidence of local host infection is
found and evidence of outward bot coordination or attack propagation is found; and second, when at least two distinct signs of outward bot coordination or attack propagation are found. The BotHunter warnings are tracked over a temporal window and contribute to the infection score of each host.

The BotHunter proposal is compared to the BClus and CAMNEP methods to have the reference of an accepted detection method in the community. The version of BotHunter used in the comparison is 1.7.2.

Section 8.1 describes how the results of the BotHunter proposal were adapted and incorporated into the comparison.

6. Creation of the dataset

In order to compare the methods, a good dataset is needed. According to Sperotto et al. (2009) and Shiravi et al. (2012), a good dataset should be representative of the network where the algorithms are going to be used. This means that it should have botnet, normal and background labeled data, that the balance of the dataset should be like in a real network (usually the percentage of botnet data is small), and that it should be representative of the type of behaviors seen on the network. The difficulties of obtaining such a dataset are discussed in Shiravi et al. (2012) and the importance of these characteristics is discussed in Rossow et al. (2012).

Due to the absence of a public botnet dataset with the characteristics needed, we created a new public dataset that complies with the following design goals:

- Must have real botnet attacks and not simulations.
- Must have unknown traffic from a large network.
- Must have ground-truth labels for training and evaluating the methods.
- Must include different types of botnets.
- Must have several bots infected at the same time to capture synchronization patterns.
- Must have NetFlow files to protect the privacy of the users.

The topology used to create the dataset consisted of a set of virtualized computers running the Microsoft Windows XP SP2 operating system on top of a Linux Debian host. At the time of designing the topology, Windows XP SP2 was the operating system most used by the malware. Each virtual machine was bridged into the University network.

Fig. 3 shows a diagram of the testbed. The traffic was captured both on the Linux host and on one of the University routers. The traffic from the Linux host was exclusively composed of botnet traffic and was used for labeling purposes. The traffic from the University router was used to create the final dataset. The tool used to capture the traffic was tcpdump (Jacobson et al., 1997).

The next Subsections describe each of the captures, their design principles, the preprocessing of the dataset, the assignment of labels, the separation in training and testing and the publication of the dataset.

6.1. Design of the botnet scenarios

A botnet scenario, in the context of this paper, is a particular infection of the virtual machines using a specific malware. Thirteen of these scenarios were created, and each of them was designed to be representative of some malware behavior.

The main characteristics of the scenarios and their behaviors are shown in Table 2. It describes if they used IRC, P2P or HTTP protocols, if they sent SPAM, did Click-Fraud, port scanned, did DDoS attacks, used Fast-Flux techniques or if they were custom compiled.

Table 2 - Characteristics of the botnet scenarios. (CF: Click-Fraud, PS: Port Scan, FF: Fast-Flux, US: Compiled and controlled by us.)

Id | IRC | SPAM | CF | PS | DDoS | FF | P2P | US | HTTP | Note
1  | √   | √    | √  |    |      |    |     |    |      |
2  | √   | √    | √  |    |      |    |     |    |      |
3  | √   |      |    | √  |      |    |     | √  |      |
4  | √   |      |    |    | √    |    |     | √  |      | UDP and ICMP DDoS.
5  |     | √    |    | √  |      |    |     |    | √    | Scan web proxies.
6  |     |      |    | √  |      |    |     |    |      | Proprietary C&C. RDP.
7  |     |      |    |    |      |    |     |    | √    | Chinese hosts.
8  |     |      |    | √  |      |    |     |    |      | Proprietary C&C. Net-BIOS, STUN.
9  | √   | √    | √  | √  |      |    |     |    |      |
10 | √   |      |    |    | √    |    |     | √  |      | UDP DDoS.
11 | √   |      |    |    | √    |    |     | √  |      | ICMP DDoS.
12 |     |      |    |    |      |    | √   |    |      | Synchronization.
13 |     | √    |    | √  |      |    |     |    | √    | Captcha. Web mail.

The features related with the network traffic of each scenario are shown in Table 3. It presents the size, duration, number of packets, number of flows, number of bots and bot family.

The network topology used to make the captures had a bandwidth control mechanism. However, the traffic going out to the Internet was not filtered. This decision may seem controversial (Rossow et al., 2012), but it was taken with the explicit determination of capturing real attacks. We believe that the best way to study and model an attack is to capture real attacks.

The next Subsection describes how these scenarios were preprocessed to obtain a more usable dataset.

6.2. Dataset preprocessing

After capturing the packets, the dataset was preprocessed and converted to a common format for the detection methods. The format selected was the NetFlow file standard (Claise, 2008), which is considered the ad-hoc standard for network data representation. The conversion from pcap files to NetFlow files was done in two steps using the Argus software suite (Argus, 2013). First, the argus tool was used to convert each pcap file into a bidirectional Argus binary storage file. The exact configuration of argus is published with each scenario. Second, the ra Argus client tool was used to convert each Argus binary storage file into a NetFlow file. This can be done by specifying the output fields in the ra configuration. The ra configuration is also published with each scenario. These final NetFlow files were composed of the following fields: Start Time, End Time, Duration, Source IP address, Source Port, Direction, Destination IP address, Destination Port, State, SToS, Total Packets and Total Bytes.
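A minimal sketch of this two-step conversion follows, assuming the argus and ra tools from the Argus suite are installed. The flags and the field list are illustrative; the per-scenario configurations published with the dataset are authoritative.

import subprocess

def pcap_to_netflow(pcap_path, argus_path, netflow_path):
    # Step 1: pcap -> bidirectional Argus binary storage file.
    subprocess.run(['argus', '-r', pcap_path, '-w', argus_path], check=True)
    # Step 2: Argus binary -> text NetFlow file, selecting output fields.
    # Field names below approximate the columns listed above; check the
    # published ra configuration for the exact set.
    fields = ['stime', 'dur', 'saddr', 'sport', 'dir', 'daddr',
              'dport', 'state', 'stos', 'pkts', 'bytes']
    with open(netflow_path, 'w') as out:
        subprocess.run(['ra', '-r', argus_path, '-s'] + fields,
                       stdout=out, check=True)

pcap_to_netflow('scenario1.pcap', 'scenario1.argus', 'scenario1.netflow')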
6.2.1. Ground-truth labels assignment
The assignment of ground-truth labels is a very important part of the dataset creation process (Fontugne et al., 2010). However, it can be complex and difficult to do (Davis and Clark, 2011). For example, a wrongly assigned label might produce unreliable results (Maloof, 2006).

Our labeling strategy assigns three different labels: background, botnet and normal. The priority to assign the labels is the following:

1. Assign the Background label to the whole traffic.
2. Assign the Normal label to the traffic that matches certain filters.
3. Assign the Botnet label to all the traffic that comes from or to any of the known infected IP addresses.

The filters used to assign normal labels were created from the known and controlled computers in the network, such as routers, proxies, switches, our own computers in the laboratory, etc.

The distribution of labels on each experiment is shown in Table 4. It can be seen that most of the traffic was labeled as Background. This majority class may add a natural bias to the dataset; however, one of the ways to avoid this is to capture a large dataset, as stated by Kotsiantis et al. (2006).

Table 4 - Distribution of labels for each scenario in the dataset.

Id | Background | Botnet | Normal
1  | 10,124,854 (95.40%) | 94,972 (0.89%)  | 392,433 (3.69%)
2  | 6,071,419 (95.59%)  | 54,433 (0.85%)  | 225,336 (3.54%)
3  | 14,381,899 (94.60%) | 75,891 (0.49%)  | 744,270 (4.89%)
4  | 3,895,469 (91.91%)  | 6466 (0.15%)    | 336,103 (7.93%)
5  | 416,267 (91.37%)    | 2129 (0.46%)    | 37,144 (8.15%)
6  | 2,031,967 (94.12%)  | 4927 (0.22%)    | 121,854 (5.64%)
7  | 425,611 (93.71%)    | 293 (0.06%)     | 28,270 (6.22%)
8  | 11,451,205 (95.47%) | 12,063 (0.10%)  | 530,666 (4.42%)
9  | 6,881,228 (90.22%)  | 383,215 (5.02%) | 362,594 (4.75%)
10 | 4,535,493 (87.54%)  | 323,441 (6.24%) | 321,917 (6.21%)
11 | 119,933 (29.33%)    | 277,892 (67.97%)| 11,010 (2.69%)
12 | 119,933 (29.33%)    | 277,892 (67.97%)| 11,010 (2.69%)
13 | 1,218,140 (93.76%)  | 21,760 (1.67%)  | 59,190 (4.55%)

6.2.2. Dataset separation into training, testing and cross-validation
To correctly create the classification models used in the BClus and CAMNEP methods, we need to first separate the dataset. For the CAMNEP method the training consisted of the first 25 min of each scenario, so it was not necessary to further separate them.

For the BClus method, it was necessary to separate the dataset into training and cross-validation, and testing. The separation criteria were carefully evaluated, because the following constraints must be met:

- The training and cross-validation datasets should be approximately 80% of the dataset.
- The testing dataset should be approximately 20% of the dataset.
- None of the botnet families used in the training and cross-validation dataset should be used in the testing dataset. This ensures that the methods can generalize and detect new behaviors.

However, it is not clear which feature should be used for the 80%-20% separation criteria. It is not the same to have 80% of the amount of packets as 80% of the amount of bytes. We made the separation by carefully selecting the scenarios so that 80% of the following features are considered: the Duration in minutes, the Number of clusters, the Number of NetFlows and the Number of aggregated NetFlows of the scenarios.

The final separation of the scenarios for the datasets is shown in Table 5. The problem of the imbalanced amount of labels on each dataset was reduced, as stated in Kotsiantis et al. (2006), by carefully selecting the training and testing datasets. Also, as the majority label is Background, the bias toward the majority class reported in Li et al. (2010) is avoided.

All the three methods compared used only the testing scenarios for obtaining results.

Table 5 - Dataset separation into training, testing and cross-validation.

Scenario | Dataset
1, 2, 6, 8, 9 | Testing
3, 4, 5, 7, 10, 11, 12, 13 | Training and cross-validation
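The three-level labeling priority of Section 6.2.1 can be illustrated with a short sketch. The filter predicates and the addresses are placeholders, standing in for the per-scenario filters and infected-host lists published with the dataset.

def label_netflow(flow, normal_filters, infected_ips):
    # Assign a ground-truth label following the priority
    # Background < Normal < Botnet. 'flow' is a dict with
    # 'src' and 'dst' keys (hypothetical field names).
    label = 'Background'                      # 1. default for all traffic
    if any(f(flow) for f in normal_filters):  # 2. known, controlled hosts
        label = 'Normal'
    if flow['src'] in infected_ips or flow['dst'] in infected_ips:
        label = 'Botnet'                      # 3. highest priority
    return label

# Example filter matching a controlled router (invented address).
filters = [lambda f: '10.0.0.1' in (f['src'], f['dst'])]
print(label_netflow({'src': '10.0.5.7', 'dst': '1.2.3.4'},
                    filters, infected_ips={'10.0.5.7'}))  # -> Botnet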
6.3. Dataset publication

The thirteen scenarios of our dataset were published on the web site https://mcfp.felk.cvut.cz/ (García, 2013). Each scenario includes the botnet pcap file, the labeled NetFlow file, a README file with the capture time line and the original malware executable binary. It was not possible to publish the complete pcap file with the background and normal packets because they contain private information. However, both of our methods use only the NetFlow files. The correspondence between the number of the scenario and the name of the capture on the web page is:

- Scenario Id 1 is CTU-Malware-Capture-Botnet-42.
- Scenario Id 2 is CTU-Malware-Capture-Botnet-43.
- Scenario Id 3 is CTU-Malware-Capture-Botnet-44.
- Scenario Id 4 is CTU-Malware-Capture-Botnet-45.
- Scenario Id 5 is CTU-Malware-Capture-Botnet-46.
- Scenario Id 6 is CTU-Malware-Capture-Botnet-47.
- Scenario Id 7 is CTU-Malware-Capture-Botnet-48.
- Scenario Id 8 is CTU-Malware-Capture-Botnet-49.
- Scenario Id 9 is CTU-Malware-Capture-Botnet-50.
- Scenario Id 10 is CTU-Malware-Capture-Botnet-51.
- Scenario Id 11 is CTU-Malware-Capture-Botnet-52.
- Scenario Id 12 is CTU-Malware-Capture-Botnet-53.
- Scenario Id 13 is CTU-Malware-Capture-Botnet-54.

7. Comparison methodology and new error metric

To compare several detection methods it is necessary to have a methodology, so the comparisons can be repeated and extended. For this purpose we created a simple methodology and a new error metric. The methodology may be used by other researchers to add the results of their methods and obtain new comparisons. Section 7.1 presents the methodology and Section 7.2 presents the error metric.

7.1. Comparison methodology

When a new botnet detection method using a new dataset needs to be compared with a third-party method, the most usual approach is to try to run the third-party method on the new dataset. However, obtaining the original implementation of a third-party method may be difficult or even impossible due to copyright issues.

The comparison methodology used in this paper is simpler. Instead of trying to implement a third-party method on our dataset, we propose that researchers first download a common dataset with labels, execute their methods on this common dataset, add their results to the common dataset and then publish the common dataset back.

A dataset made of NetFlow lines with ground-truth labels can be easily modified to add a new column with the method's predictions for each NetFlow. In this way, more and more methods will publish their results and more comparisons can be made. The main advantage of this approach is that the details of the methods remain private.

To implement this methodology we created and published a new tool called Botnet Detectors Comparer (García, 2014) that is publicly available for download.13

This tool reads the dataset NetFlow file and implements the following steps:

- Separates the NetFlow file in comparison time windows.
- Compares the ground-truth NetFlow labels with the predicted labels of each method and computes the TP, TN, FP and FN values.
- After the comparison time window ends, it computes the error metrics FPR, TPR, TNR, FNR, Precision, Accuracy, ErrorRate and FMeasure1 for that time window.
- When the dataset ends, it computes the final error metrics.
- The error metrics are stored in a text file and plotted in an eps image.

Also, this tool computes the new error metric that we propose in Section 7.2.

The comparison time window is the time window used for computing the error metrics and it is not related with the methods. It is the time that the network administrator may wait to have a decision about the traffic. In our methodology the width of the comparison time windows is five minutes.

Using this methodology, researchers can now add their own predictions to the NetFlow files of our dataset and use this tool to compute the error metrics. To tell the tool which labels the new method uses for its predictions, they should be added to the header of the NetFlow file as a new column with the format "NameOfNewMethod(NormalLabelUsed:BotnetLabelUsed:BackgroundLabelUsed)".

13 http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.tgz.

The next Subsection describes the new error metric proposed to compare botnet detection methods.

7.2. New error metric

The error metrics usually used by researchers to analyze their results (e.g. FPR, FMeasure) were historically designed from a statistical point of view, and they are really good to measure differences and to compare most methods. But the needs of a network administrator that is going to use a detection method are slightly different. These error metrics should have a meaning that can be translated to the network. This has been called the semantic gap by Rossow et al. (2012). It is possible that the common error metrics are not enough for a network administrator (García et al., 2013).

For example, according to the classic definition, a False Positive should be counted every time that a normal NetFlow is detected as botnet. However, a network administrator might want to detect a small amount of infected IP addresses instead of hundreds of NetFlows. Furthermore, she may need to detect them as soon as possible. These needs are not satisfied by the classic error metrics.
The type of error metric that may be useful for a network administrator may also be useful for comparing the methods that she is going to use.

Therefore, we have created a new set of error metrics, in an attempt to solve this issue, that adheres to the following principles:

- Errors should account for IP addresses instead of NetFlows.
- To detect a botnet IP address (TP) early is better than later.
- To miss a botnet IP address (FN) early is worse than later.
- The value of detecting a normal IP address (TN) is not affected by time.
- The value of missing a normal IP address (FP) is not affected by time.

The first step is to incorporate time into the metrics by computing the errors in comparison time frames. These time frames are only used to compute the errors and are independent of the detection methods.

The second step was to migrate from a NetFlow-based detection to an IP-based detection. The classical error values (TP, FP, TN, FN) were redefined as follows:

tFN = (cFN x correcting function) / (number of unique botnet IP addresses in the comparison time frame)   (3)

tFP = cFP / (number of unique normal IP addresses in the comparison time frame)   (4)

tTN = cTN / (number of unique normal IP addresses in the comparison time frame)   (5)

These time-based error metrics allow for a more realistic comparison between detection algorithms. An algorithm weights better if it can detect sooner all the infected IP addresses without error. To miss an infected IP address at the
Fig. 5 - Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 1.
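The redefined, IP-based values can be turned into per-frame rates as in the sketch below. The correcting function that weights early detections and misses is not reproduced in the extracted text, so it is passed in as a parameter; the exponential default is only an illustrative assumption.

import math

def time_based_rates(cTP, cFN, cFP, cTN, n_botnet_ips, n_normal_ips,
                     frame_index, correct=lambda t: math.exp(-0.5 * t) + 1):
    # Compute tTP/tFN/tFP/tTN for one comparison time frame. 'correct'
    # stands in for the paper's correcting function; the default gives
    # more weight to early frames (t = frame_index), so early detections
    # score higher and early misses are penalized more, per the stated
    # principles. The treatment of tTP mirrors Eq. (3) by assumption.
    cf = correct(frame_index)
    tTP = cTP * cf / max(1, n_botnet_ips)
    tFN = cFN * cf / max(1, n_botnet_ips)   # Eq. (3)
    tFP = cFP / max(1, n_normal_ips)        # Eq. (4): not time-weighted
    tTN = cTN / max(1, n_normal_ips)        # Eq. (5): not time-weighted
    return tTP, tFN, tFP, tTN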
8.1. Adaptation of the BotHunter results

The comparison of the results is done by reading the ground-truth label of each NetFlow and comparing it to the predicted label of each NetFlow. However, BotHunter does not read NetFlow files and does not output a prediction label for each NetFlow, making the comparison more difficult.

To solve this issue we ran BotHunter on the original pcap files and we obtained, for each pcap, a list of alerts. These alerts include the date, the name of the alert, the protocol, the source IP address, the source port, the destination IP address and the destination port. Then, we searched for the NetFlow corresponding to each alert and we assigned it the label Botnet. The rest of the NetFlows were labeled as Normal.

With this label assignment procedure, sketched below, it was possible to add the BotHunter method to the comparison. The next Subsections compare the results on each scenario.

8.2. Comparison of results in Scenario 1

This scenario corresponds to an IRC-based botnet that sent spam for almost six and a half hours.
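The alert-to-NetFlow adaptation of Section 8.1 is essentially a join on connection endpoints within a time tolerance. A minimal sketch, with invented field names and an assumed matching tolerance:

def label_from_alerts(netflows, alerts, tolerance=60):
    # Mark a NetFlow as Botnet if a BotHunter alert matches its
    # endpoints close in time; everything else is Normal. Field names
    # and the 60-second tolerance are illustrative assumptions.
    for nf in netflows:
        nf['pred'] = 'Normal'                 # default predicted label
        for a in alerts:
            if (nf['src'] == a['src'] and nf['dst'] == a['dst'] and
                    nf['dport'] == a['dport'] and
                    abs(nf['start'] - a['time']) <= tolerance):
                nf['pred'] = 'Botnet'
                break
    return netflows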
Fig. 7 - Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 2.
The error metrics for this scenario are shown in Table 7, FMeasure of more than 60%. The BClus algorithm had between
which is ordered by FMeasure1. This Table shows that the 40% and 60% for the TPR, FPR, TNR and FNR metrics and nearly
AllPositive algorithm had the best FMeasure, although it had a 50% for the FMeasure. The CAMNEP algorithm had a value near
100% FPR. The BClus algorithm had a FMeasure of 0.48 and an 0% for the TPR, near 1% for the TNR and near 0% for the FMeasure.
FPR of 40%. The BotHunter algorithm had a FMeasure of 0.02 The apparently good results of the AllPositive algorithm
and an FNR of 98%. The CAMNEP (CA1) algorithm had a low may have an explanation. This algorithm predicts always
FMeasure and low FPR. The bold text in Tables 7,8,9,10 and 11 Botnet, which gives a TPR of 100%, a precision of 50% and a
identifies the main algorithms compared in this work, i.e. FMeasure of 66%. However, these metrics were computed
BClus, CAMNEP and BotHunter. The rest are the internal using only the Botnet and Normal labels and omitting the
CAMNEP algorithms and the All Positive algorithm. Background labels. The Background labels were not used for
A simplified comparison between the BClus, CAMNEP and computing the error metrics because they were neither
AllPossitive algorithms is shown in Fig. 5. Although the All- Normal nor Botnet. Therefore, the only traffic that this algo-
Positive algorithm had a 100% TPR and FPR, it can be seen that it rithm can mis-classify is the Normal traffic. However, the
had a Precision, Accuracy and ErrorRate around 50% and a amount of Normal traffic in the dataset is considerably
smaller than the rest. This imbalance made the AllPositive BotHunter algorithm had an FMeasure of 0.04 and an FNR of
have better results than it should. This algorithm is useful as a 97%. The CAMNEP algorithm had a FMeasure of 0.01 and a very
baseline for evaluating detection methods and datasets, but it small FPR.
is useless in a real network. The simplified comparison for this scenario is shown in
To better appreciate the inner workings of the detection methods during the analysis of this scenario, we plotted the accumulated and running error metrics for each comparison time frame in Fig. 6. This Figure shows that the FPR of the BClus method was high on the first time frames, but after that it kept going down until the final 40%. On the sixth time frame the BClus method started to detect botnets with a 100% TPR, until near the twelfth time frame. While still having a huge amount of FP, the BClus method managed to have a final FMeasure of 48%. The CAMNEP and BotHunter algorithms had low values during the whole scenario.
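As an illustration of how such running metrics can be accumulated, the following minimal Python sketch computes the running TPR, FPR and FMeasure after each comparison time frame. It is a simplified illustration under our own naming conventions: it operates on plain (ground truth, prediction) pairs per frame rather than on the time- and address-based counts (tTP, tFP, tFN, tTN) used by our error metric.

    # Minimal sketch: accumulate confusion counts frame by frame and report
    # the running TPR, FPR and FMeasure after each comparison time frame.
    def running_metrics(frames):
        """frames: list of time frames, each a list of (truth, pred) pairs,
        where both values are either "Botnet" or "Normal"."""
        tp = fp = tn = fn = 0
        history = []
        for frame in frames:
            for truth, pred in frame:
                if truth == "Botnet" and pred == "Botnet":
                    tp += 1
                elif truth == "Botnet":
                    fn += 1
                elif pred == "Botnet":
                    fp += 1
                else:
                    tn += 1
            tpr = tp / (tp + fn) if tp + fn else 0.0
            fpr = fp / (fp + tn) if fp + tn else 0.0
            prec = tp / (tp + fp) if tp + fp else 0.0
            fm = 2 * prec * tpr / (prec + tpr) if prec + tpr else 0.0
            history.append({"TPR": tpr, "FPR": fpr, "FMeasure": fm})
        return history

Plotting each entry of the returned history against its frame index yields curves analogous to those of Fig. 6.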
8.3. Comparison of results in Scenario 2

In this scenario, the same IRC-based botnet as in scenario 1 sent SPAM for 4.21 h.

The error metrics for this scenario are shown in Table 8. The AllPositive algorithm had the best FMeasure. The BClus algorithm had a FMeasure of 0.41 and an FPR of 20%. The BotHunter algorithm had an FMeasure of 0.04 and an FNR of 97%. The CAMNEP algorithm had a FMeasure of 0.01 and a very small FPR.

Fig. 7 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 2.

The simplified comparison for this scenario is shown in Fig. 7. Although it was the same bot as in scenario 1 and it performed almost the same actions, all the algorithms gave different results. The CAMNEP method still had a large amount of tFN, but despite the low 1% FMeasure, it was 55 times better than itself on scenario 1. Its Precision was high because there were almost no tFP, independently of the amount of tTP. Regarding the BClus method, it had a lower TPR than on scenario 1, but also a lower FPR, which led to a comparatively better FMeasure value. This scenario is a good example of the variability in the network due to the presence of Background traffic. The same bot, generating the same type and amount of traffic, obtained different error metrics.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 8. The BClus method started with a large FPR, but after the fifth time frame it started to detect botnets correctly and its FMeasure value improved. The TPR, FPR and FMeasure values for the BClus method decreased until the end of the scenario, suggesting that the final values could be even lower. The BotHunter algorithm had an FMeasure1 close to 20% on the first time frames, but then it quickly dropped to 0.04. The CAMNEP error metrics remained low during the whole scenario.
8.4. Comparison of results in Scenario 6

The botnet in this scenario scanned SMTP (Simple Mail Transfer Protocol) servers for two hours and connected to several RDP (Remote Desktop Protocol) services. However, it did not send any SPAM and did not attack. The C&C server used a proprietary protocol that connected every 33 s and sent an average of 5500 bytes on each connection.

The error metrics for this scenario can be seen in Table 9. The AllPositive algorithm had a FMeasure of 0.73, far better than the FMeasures of BClus and CAMNEP, which were both 0.04. The BotHunter algorithm had a better FMeasure than the BClus and CAMNEP algorithms because some of the IP addresses used in the SMTP connections were blacklisted in its static detection rules as part of the RBN (Russian Business Network). It should be noted that the scenarios were captured in August 2011 and the BotHunter rules are from January 2013, so it is possible that these IP addresses were blacklisted after the capture.
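The effect of such static rules can be illustrated with a minimal blacklist-matching sketch. This is our own illustration, not BotHunter's actual rule engine, and the addresses are hypothetical documentation-range values:

    # Illustrative only: flag a flow when its destination address appears in
    # a static, previously distributed blacklist (the mechanism by which a
    # rule set newer than a capture can still match old traffic).
    BLACKLIST = {"198.51.100.7", "203.0.113.42"}  # hypothetical entries

    def flag_flow(flow):
        """Return True if the flow's destination IP is blacklisted."""
        return flow["dst_ip"] in BLACKLIST

    flows = [{"dst_ip": "198.51.100.7", "dst_port": 25},
             {"dst_ip": "192.0.2.10", "dst_port": 25}]
    alerts = [f for f in flows if flag_flow(f)]  # only the first flow matches

A detector of this kind flags traffic solely by the contacted address, which is why a rule set compiled in 2013 can still match connections captured in 2011.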
Fig. 9 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 6.

The simplified comparison for this scenario is shown in Fig. 9. The CAMNEP and the BClus methods behaved almost identically. The only difference is that the BClus method had ten times the FPR, and therefore the CAMNEP method had a better Precision and a slightly better FMeasure.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 10. This Figure shows that both the BClus and CAMNEP methods detected tFP values until half of the scenario, and then they had some tTP. However, the tTPs were not enough to improve the FMeasure1 significantly.

8.5. Comparison of results in Scenario 8

In this scenario, the botnet contacted many different Chinese C&C hosts and received large amounts of encrypted data. It also scanned and cracked the passwords of machines using the DCERPC protocol, both on the Internet and on the local network, for 19 h. There were more attacks over a longer time span than in the previous scenarios.

The error metrics for this scenario can be seen in Table 10. The AllPositive algorithm had the best FMeasure. Six algorithms were better than BClus, which had a FMeasure of 0.14 and an FPR of 30%. The CAMNEP algorithm had a FMeasure of 0.08 and a very low FPR. The BotHunter algorithm could not detect a single TP, so it was not possible to compute its FMeasure.
Fig. 11 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 8.

The simplified comparison for this scenario is shown in Fig. 11. The BClus method had a low TPR of about 10%; however, it was more than twice CAMNEP's TPR value. The TNR value was near 60% for BClus and near 90% for CAMNEP. The FPR of BClus was high, near 30%; however, it had a FMeasure value twice that of CAMNEP.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 12. This Figure shows that none of the error metrics exceeded 40%, which means that this scenario was very difficult for the methods. Until the fiftieth time frame, both the BClus and CAMNEP TPR values grew at almost the same rate. However, after that, the BClus method grew a little faster. The FPR of the BClus method was very high almost from the start of the scenario. The BotHunter algorithm had very low measurements during the whole scenario.

8.6. Comparison of results in Scenario 9

In this scenario, ten hosts were infected using the same Neris botnet as in scenarios 1 and 2. For five hours, more than 600 SPAM mails were successfully sent.
Fig. 13 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 9.
The error metrics for this scenario can be seen in Table 11. The AllPositive algorithm had the best FMeasure. The BClus algorithm had a FMeasure of 0.25, an FPR of 20% and a TPR of 10%. The CAMNEP algorithm had a FMeasure of 0.17, a very low FPR and a very low TPR. The BotHunter algorithm had a low FMeasure of 0.03 and a high FNR of 98%.

The simplified comparison for this scenario is shown in Fig. 13. It can be seen that the TPR value for the BClus method was almost twice the value for CAMNEP. Also, the FPR value of BClus was 40 times larger than the CAMNEP value. However, the FMeasure value of CAMNEP was almost 70% of the value of BClus.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 14. Almost from the start of the scenario, the TPR and FMeasure of the BClus and CAMNEP methods grew fast. However, after the twentieth time frame, both FMeasure values started to decrease. The FPR value of BClus was relatively low compared to the previous scenarios. The BotHunter algorithm presented very low values during the whole scenario, despite the fact that ten bots were being executed.

9. Conclusions

We conclude that our comparison of detection methods using a real dataset greatly helped to improve our research. It showed us how and why the methods were not optimal, which botnet behaviors were not being detected and how the dataset should be improved. It also showed us the need for a comparison methodology and a proper error metric.

We also conclude, as recommended by Aviv and Haeberlen (2011), that a joint effort to create a comparison platform of detection methods could greatly enhance the results achieved in the area. We believe that such a platform could take advantage of our comparison methodology.

The usage of a large and real dataset, despite not including a great number of different botnets, showed us which phases of the botnet behavior were easier for the methods to detect, and the difficulties of working with unknown background data.

Regarding our detection methods, BClus showed large FPR values on most scenarios but also a large TPR; we are already working on improving it. The CAMNEP method had a low FPR during most of the scenarios, but at the expense of a low TPR. Each of them seems best suited for a different botnet behavior. The comparison against the BotHunter method showed that in real environments it could still be useful to have blacklists of known malicious IP addresses.

Despite being biased by the really small amount of labeled normal traffic, the AllPositive baseline algorithm was useful to visualize how the error metrics should always be carefully considered.

We also conclude that, although useful and sufficient for our purposes, the comparison methodology could be improved to show how many of the infected IP addresses were detected by the algorithms. The new error metric proposed, which takes into consideration the IP addresses and the time, allowed us to easily compare the algorithms from the perspective of a network administrator.
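A minimal sketch of that possible extension, under our own simplifying assumptions (plain sets of addresses, ignoring time), would report which of the known infected IP addresses ever appear among the alerts of an algorithm:

    # Hypothetical sketch: fraction of the ground-truth infected IP addresses
    # that an algorithm alerted on at least once during a scenario.
    def infected_coverage(infected_ips, alerted_ips):
        detected = set(infected_ips) & set(alerted_ips)
        coverage = len(detected) / len(infected_ips) if infected_ips else 0.0
        return detected, coverage

    infected = {"10.0.2.103", "10.0.2.104"}  # hypothetical ground truth
    alerted = {"10.0.2.103", "10.0.2.200"}   # hypothetical alerts
    detected, coverage = infected_coverage(infected, alerted)  # coverage = 0.5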
The dataset created, despite being paramount for the comparison, should be improved. We are already working on adding more botnets, more diverse attacks and more normal labels. A better and larger dataset is already being built.
Sridharan A, Ye T, Bhattacharyya S. Connectionless port scan detection on the backbone. In: IPCCC 2006: Performance, Computing, and Communications Conference; 2006. p. 576.
Szabo G, Orincsay D, Malomsoky S, Szabo I. On the validation of traffic classification algorithms. In: PAM 2008: 9th International Conference, Passive and Active Network Measurement; 2008. pp. 72–81.
Tavallaee M, Stakhanova N, Ghorbani A. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions Syst Man Cybern Part C Appl Rev 2010;40:516–24.
Wurzinger P, Bilge L, Holz T, Goebel J, Kruegel C, Kirda E. Automatically generating models for botnet detection; 2010. pp. 232–49.
Xu K, Zhang ZL. Profiling internet backbone traffic: behavior models and applications. In: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications; 2005. pp. 169–80.
Xu K, Zhang Z, Bhattacharyya S. Reducing unwanted traffic in a backbone network. In: SRUTI 05: steps to reducing unwanted traffic on the internet workshop; 2005. pp. 9–15.
Yager RR. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. Syst Man Cybern IEEE Transactions 1988:183–90.
Zhao D, Traore I, Sayed B, Lu W, Saad S, Ghorbani A, et al. Botnet detection based on traffic behavior analysis and flow intervals. J Comput Secur 2013;39:2–16.

Sebastián García is a PhD student at UNICEN University (Argentina) and a researcher in the ATG group at the Czech Technical University. He is also a research fellow at the National Scientific and Technical Research Council of Argentina (CONICET) and a teacher at the UFASTA University. His research interests include network-based botnet behavior detection, anomaly detection, penetration testing, honeypots, malware detection, keystroke dynamics and machine learning. His recent projects focus on using unsupervised and semi-supervised machine learning techniques to detect botnets on large networks based on their behavioral models.

Martin Grill holds a master's degree in Software Development from the Faculty of Nuclear Sciences and Physical Engineering of the Czech Technical University in Prague. At present he is a member of the Agent Technology Center, a researcher at CESNet, and a PhD student at the Department of Cybernetics of the Czech Technical University in Prague.

Jan Stiborek holds a master's degree in Software Development from the Faculty of Nuclear Sciences and Physical Engineering of the Czech Technical University in Prague. At present, he is pursuing a PhD degree in Artificial Intelligence and Biocybernetics at the Department of Cybernetics, FEE CTU. His current professional interests focus on network security, network simulation and autonomous adaptation of intrusion detection systems.

Alejandro Zunino (http://www.exa.unicen.edu.ar/~azunino) received a Ph.D. degree in Computer Science from the National University of the Center of Buenos Aires (UNICEN) in 2003, and his M.Sc. in Systems Engineering in 2000. He is a full Adjunct Professor at UNICEN, a member of the ISISTAN Research Institute and an Independent Researcher of the National Scientific and Technical Research Council (CONICET). His research areas are Distributed Computing and Software Engineering. Contact him at azunino@conicet.gov.ar.