
c o m p u t e r s & s e c u r i t y 4 5 ( 2 0 1 4 ) 1 0 0 e1 2 3

Available online at www.sciencedirect.com

ScienceDirect

journal homepage: www.elsevier.com/locate/cose

An empirical comparison of botnet detection methods

S. García a,b,*, M. Grill b, J. Stiborek b, A. Zunino a

a ISISTAN Research Institute – CONICET, Faculty of Sciences, UNICEN University, Argentina
b Agents Technology Group, Department of Computer Science and Engineering, Czech Technical University in Prague, Czech Republic

article info

Article history:
Received 21 October 2013
Received in revised form 29 April 2014
Accepted 27 May 2014
Available online 5 June 2014

Keywords:
Botnet detection
Malware detection
Methods comparison
Botnet dataset
Anomaly detection
Network traffic

abstract

The results of botnet detection methods are usually presented without any comparison. Although it is generally accepted that more comparisons with third-party methods may help to improve the area, few papers could do it. Among the factors that prevent a comparison are the difficulties to share a dataset, the lack of a good dataset, the absence of a proper description of the methods and the lack of a comparison methodology. This paper compares the output of three different botnet detection methods by executing them over a new, real, labeled and large botnet dataset. This dataset includes botnet, normal and background traffic. The results of our two methods (BClus and CAMNEP) and BotHunter were compared using a methodology and a novel error metric designed for botnet detection methods. We conclude that comparing methods indeed helps to better estimate how good the methods are, to improve the algorithms, to build better datasets and to build a comparison methodology.
© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

It is difficult to estimate how much a new botnet detection method improves the current results in the area. It may be done by comparing the new results with other methods, but this has already been proven hard to accomplish (Aviv and Haeberlen, 2011). Among the factors that prevent these comparisons are: the absence of proper documentation of the methods (Tavallaee et al., 2010), the lack of a common, labeled and good botnet dataset (Rossow et al., 2012), the lack of a comparison methodology (Aviv and Haeberlen, 2011) and the lack of a suitable error metric (Salgarelli et al., 2007).
Although the comparison of methods can greatly help to improve the botnet detection area, few proposals made such a comparison (García et al., 2013). As far as we know, only three papers (Wurzinger et al., 2010; Zhao et al., 2013; Li et al., 2010) made the effort so far.
Obtaining a good dataset for comparisons is difficult. Currently, most detection proposals tend to create their own botnet datasets to evaluate their methods. However, these datasets are difficult to create (Lu et al., 2009) and usually end up being suboptimal (Shiravi et al., 2012), i.e. they lack some important features, such as ground-truth labels, heterogeneity or real-world traffic. These custom datasets are often difficult to use for comparison with other methods. This is because each method is usually focused on different properties of the dataset. The problem is to find a good, common and public dataset that can be read by all methods and satisfy all the constraints.

* Corresponding author. ISISTAN Research Institute – CONICET, Faculty of Sciences, UNICEN University, Argentina.
E-mail addresses: sebastian.garcia@isistan.unicen.edu.ar, eldraco@gmail.com (S. García), grill@agents.fel.cvut.cz (M. Grill), jan.stiborek@agents.felk.cvut.cz (J. Stiborek), alejandro.zunino@isistan.unicen.edu.ar (A. Zunino).
http://dx.doi.org/10.1016/j.cose.2014.05.011
0167-4048/© 2014 Elsevier Ltd. All rights reserved.

The difficulty of comparing detection methods goes beyond the dataset. The lack of good descriptions of the methods and error metrics contributes to the problem. As stated by Rossow et al. (2012), the error metrics used in most papers are usually non-homogeneous. They tend to use different error metrics and different definitions of error. Moreover, the most common error metrics, e.g. FPR, seem not to be enough to compare botnet detection methods. The classic error metrics were defined from a statistical point of view and they fail to address the detection needs of a network administrator.
The goal of this paper is to compare three botnet detection methods using a simple and reproducible methodology, a good dataset and a new error metric. The contributions of our paper are:

- A deep comparison of three detection methods: our own algorithms, CAMNEP and BClus, and the third-party algorithm BotHunter (Gu et al., 2007).
- A simple methodology for comparing botnet detection methods, along with the corresponding public tool for reproducing the methodology.
- A new error metric designed for comparing botnet detection methods.
- A new, large, labeled and real botnet dataset that includes botnet, normal and background data.

We conclude that the comparison of different botnet detection methods with other proposals is highly beneficial for the botnet research community because it helps to objectively assess the methods and improve the techniques. Also, the use of a good botnet dataset is paramount for the comparison.
The rest of the paper is organized as follows. Section 2 shows previous work in the area. Section 3 describes the CAMNEP detection method. Section 4 shows the BClus botnet detection method. Section 5 describes the BotHunter method. Section 6 describes the dataset and its features. Section 7 describes the comparison methodology, the public tool and the new error metric. Section 8 shows the results and compares the methods, and Section 9 presents our conclusions.

2. Previous work

The comparison of detection methods is usually considered a difficult task. In the case of botnets it is also related to the creation of a new dataset. The next Subsections describe the previous work in the area of comparison of methods and the area of creation of datasets.

2.1. Comparison of methods

The comparison of a new detection method with a third-party method is difficult. In the survey presented by García et al. (2013), where there is a deep analysis of fourteen network-based botnet detection methods, the authors found only one paper that made such a comparison. The survey compared the motivations, datasets and results of the fourteen proposals. It concludes that it is difficult to compare the results with another proposal because the datasets tend to be private and the descriptions of the methods tend to be incomplete.
Another analysis of the difficulty of reproducing a method was described by Tavallaee et al. (2010), where they state that there is an absence of proper documentation of the methods and experiments in most detection proposals.
One of the detection proposals that actually made a comparison with a third-party method was presented by Wurzinger et al. (2010). The purpose of the paper is to identify single infected machines using previously generated detection models. It first extracts the character strings from the network to find the commands sent by the C&C and then it finds the bot responses to those commands. The authors downloaded and executed the BotHunter program of Gu et al. (2007) on their dataset and made a comparison. However, the paper only compares the results of both proposals using the TPR error metric and the FP values.
The other paper that made a comparison with a third-party method was presented by Zhao et al. (2013). This proposal selects a set of attributes from the network flows and then applies a Bayes Network algorithm and a Decision Tree algorithm to classify malicious and non-malicious traffic. The third-party method used for comparison was again BotHunter. There is a description of how BotHunter was executed, but unfortunately the only error metric reported was a zero False Positive. No other numerical values were presented.
The last proposal that also compared its results with a third-party method was made by Li et al. (2010). This paper analyzes the probable bias that the selection of ground-truth labels might have on the accuracy reported for malware clustering techniques. It states that common methods for determining the ground truth of labels may bias the dataset toward easy-to-cluster instances. This work is important because it successfully compared its results with the work of Bayer et al. (2009). The comparison was done with the help of Bayer et al., who ran the algorithms described in Li et al. (2010) on their private dataset.
Regarding the creation of datasets for malware-related research, Rossow et al. (2012) presented a good paper about the prudent practices for designing malware experiments. They defined a prudent experiment as one that is correct, realistic, transparent and does not harm others. After analyzing 36 papers they conclude that most of them had shortcomings in one or more of these areas. Most importantly, they conclude that only a minority of papers included real-world traffic in their evaluations.

2.2. Datasets available

Regarding botnet datasets that are available for download, a deep study about the generation of datasets was presented in Shiravi et al. (2012). It describes the properties that a dataset should have in order to be used for comparison purposes. The dataset used in the paper includes an IRC-based Botnet attack,1 but the bot used for the attack was developed by the authors and therefore it may not represent a real botnet behavior. This dataset may be downloaded with authorization.

1 http://www.iscx.ca/datasets.

Table 1 – Summary of available datasets.

Name                    Available   Format      Background   Botnet   Normal   Labels
Shiravi et al. (2012)   ?           ?           √            √        –        ?
PREDICT                 √           CSV         –            √        –        –
CAIDA                   √           CSV, pcap   –            √        –        –
Saad et al. (2011)      √           pcap        √            √        –        √
Szabó et al. (2008)     √           pcap        –            –        √        √
Contagio                √           pcap        –            √        –        √
NexGinRC (2013)         ?           CSV         –            √        √        ?
Cho et al. (2000)       √           pcap        √            –        –        –

The Protected Repository for the Defense of Infrastructure Against Cyber Threats (PREDICT) indexed three Botnet datasets2 until May 16th, 2013. The first one is the Kraken Botnet Sinkhole Connection Data dataset, the second one is the Flashback Botnet Sinkhole Connection Data dataset and the third one is the Conficker Botnet Sinkhole Connection Data dataset. They were published as CSV text files, where each line is a one-minute aggregation of the number of attempted connections of one IP address. Unfortunately, the aggregation method may not be suitable for comparisons with other proposals.
The CAIDA organization published a paper about the Sality botnet in Dainotti et al. (2012) along with its corresponding dataset.3 Unfortunately, the CSV text format of the dataset may not be suitable for every detection algorithm because the content of the dataset only includes enough information to reproduce the techniques in the paper. CAIDA also published a dataset about the Witty Botnet in pcap format4 and several datasets with responses to spoofed DoS traffic5 and anomalous packets.6 None of them are labeled.
A custom botnet dataset was created to verify five P2P botnet detection algorithms in Saad et al. (2011). Fortunately, this dataset was made public and can be downloaded.7 The dataset is a mixture of two existing and publicly available malicious datasets and one non-malicious pcap dataset. They were merged to generate a new file. This was, at that time, the best dataset that could be downloaded for comparison purposes. Unfortunately, there is only one infected machine for each type of botnet, therefore no synchronization analysis can be done.
The Traffic Laboratory at Ericsson Research created a normal dataset that was used in Saad et al. (2011) and described in Szabó et al. (2008). This normal dataset is publicly available.8 It is composed of pcap traffic files that were labeled by means of one IP header option field. This is the only normal dataset that is labeled inside the pcap file.
A considerable amount of malware traffic in pcap format was published in the Contagio blog.9 It contains thirty-one APT pcap captures and sixty-one crimeware pcaps. Each file contains the traffic of one malware without background traffic. Unfortunately, the captures are really short (mostly between 1 min and 1 h) and the traffic is not labeled. But since each scenario includes only one infected computer, it should be possible to label them.
Another dataset with malware logs and benign logs was collected in NexGinRC (2013). The malware logs are both real and simulated. The benign logs consist of 12 months of traffic. Unfortunately, the dataset is in CSV format, which may not be suitable for some detection algorithms because it does not have the same information as a NetFlow file or pcap file. Access to this dataset may be granted upon request.10
The last dataset analyzed is currently created by the MAWI project described in Cho et al. (2000). It includes an ongoing effort to publish one of the most recent and updated background datasets to date. Its goal is to promote the research on traffic analysis and the creation of free analysis tools. However, the pcap files are not labeled, and therefore it is more difficult to use them for training or verification. There was an effort to label this dataset using anomaly detectors in Fontugne et al. (2010). The labels are not ground-truth, but may be useful to compare other methods.
A summary of the described datasets is presented in Table 1. This table shows that, so far, no dataset includes Background, Botnet and Normal labeled data.

3. The CAMNEP detection method

The Cooperative Adaptive Mechanism for NEtwork Protection (CAMNEP) (Rehak et al., 2009) is a Network Behavior Analysis system (Scarfone and Mell, 2007) that consists of various state-of-the-art anomaly detection methods. The system models the normal behavior of the network and/or of individual users and labels deviations from normal behavior as anomalous.

3.1. System architecture

CAMNEP processes NetFlow data provided by routers or other network equipment to identify anomalous traffic by means of several collaborative anomaly detection algorithms. It uses a multi-algorithm and multi-stage approach to reduce the amount of false positives generated by the individual anomaly detectors without compromising the performance of the system.

2 https://www.predict.org/Default.aspx?tabid=104.
3 http://imdc.datcat.org/collection/1-06Y5-B=UCSD+Network+Telescope+Dataset+on+the+Sipscan.
4 http://www.caida.org/data/passive/witty_worm_dataset.xml.
5 http://www.caida.org/data/passive/backscatter_2008_dataset.xml.
6 http://www.caida.org/data/passive/telescope-2days-2008_dataset.xml.
7 http://www.isot.ece.uvic.ca/dataset/ISOT_Botnet_DataSet_2010.tar.gz.
8 http://www.crysys.hu/~szabog/measurement.tar.
9 http://contagiodump.blogspot.co.uk/2013/04/collection-of-pcap-files-from-malware.html.
10 http://nexginrc.org/Datasets/DatasetDetail.aspx?pageID=24.

The self-monitoring and self-adaptation techniques, described in Section 3.4, are very important for this purpose. They help improve the error rate of the system with a minimal and controllable impact on its efficiency.
CAMNEP consists of three principal layers that evaluate the traffic: anomaly detectors, trust models and anomaly aggregators.
The anomaly detectors layer (identified as Anomaly Detectors A and B in Fig. 1) analyzes the NetFlows using various anomaly detection algorithms. We are currently using 8 different anomaly detection approaches. Each of them uses a different set of features, thus looking for an anomaly from slightly different perspectives. The output of these algorithms is aggregated into events using several statistical functions and the results are sent to the trust models. All the detection algorithms used in the system are described in detail in Section 3.2.
The trust models layer maps the NetFlows into traffic clusters. These clusters group together the NetFlows that have a similar behavioral pattern. They also contain the anomaly value of the type of event that they represent. These clusters persist over time and the anomaly value is updated by the trust model. The updated anomaly value of a cluster is used to determine the anomaly of new NetFlows. Therefore, the trust models act as a persistent memory and reduce the amount of false positives by means of the spatial aggregation of the anomalies.
The aggregators layer creates one composite output that integrates the individual opinions of several anomaly detectors as provided by the trust models. The result of the aggregation is presented to the user of the system as the final anomaly score of the NetFlows. Each aggregator can use two different averaging operators: an order-weighted averaging (Yager, 1988) or simple weighted averaging.

Fig. 1 – Adaptation process in the CAMNEP system.
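To make the difference between the two operators concrete, here is a minimal sketch (not CAMNEP code; the detector scores and weights are made-up values) of order-weighted averaging versus simple weighted averaging over the per-detector anomaly scores of one NetFlow.

```python
# Minimal illustration of the two aggregation operators mentioned above.
# The detector scores and weights are illustrative values, not CAMNEP output.

def owa(scores, weights):
    """Order-weighted averaging (Yager, 1988): weights are applied to the
    scores after sorting them, so they select 'the highest', 'the second
    highest', etc., regardless of which detector produced them."""
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))

def weighted_average(scores, weights):
    """Simple weighted averaging: each weight stays attached to its detector."""
    return sum(w * s for w, s in zip(weights, scores))

# Anomaly scores in [0, 1] from three hypothetical detectors for one NetFlow.
scores = [0.9, 0.1, 0.2]
weights = [0.6, 0.3, 0.1]                 # weights sum to 1 in both operators

print(owa(scores, weights))               # 0.6*0.9 + 0.3*0.2 + 0.1*0.1 = 0.61
print(weighted_average(scores, weights))  # 0.6*0.9 + 0.3*0.1 + 0.1*0.2 = 0.59
```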



CAMNEP uses a simulation process to determine the best aggregation operator for the current type and state of the network. This process is described in Section 3.4.

3.2. Anomaly detectors

The anomaly detectors in the CAMNEP system are based on already published anomaly detection methods. They work in two stages: (i) they extract meaningful features associated with each NetFlow (or group of NetFlows), and (ii) they use the values of these features to assign an anomaly score to each NetFlow. This anomaly score is a value in the [0,1] range. A value of 1 represents an anomaly and a value of 0 represents normal behavior. The anomaly detector's model is a fuzzy classifier that provides the anomaly value for each NetFlow. This value depends on the NetFlow itself, on other NetFlows in the current context, and on the internal traffic model, which is based on the past traffic observed on the network.
The following subsections describe each of the anomaly detectors used in the CAMNEP system.

3.2.1. MINDS
The MINDS algorithm (Ertoz et al., 2004) builds context information for each evaluated NetFlow using the following features: the number of NetFlows from the same source IP address as the evaluated NetFlow, the number of NetFlows toward the same destination host, the number of NetFlows toward the same destination host from the same source port, and the number of NetFlows from the same source host toward the same destination port. This is a simplified version of the original MINDS system, which also uses a secondary window defined by the number of connections in order to address slow attacks. The anomaly value for a NetFlow is based on its distance to the normal sample. The metric defined in this four-dimensional context space uses a logarithmic scale on each context dimension, and these marginal distances are combined into the global distance as the sum of their squares. In the CAMNEP implementation of this algorithm, the variance-adjusted difference between the floating average of past values and the evaluated NetFlow on each of the four context dimensions is used to decide whether the evaluated NetFlow is anomalous. The original work is based on the combination of computationally-intensive clustering and human intervention.

3.2.2. Xu
In the algorithm proposed by Xu et al. (2005), the context of each NetFlow to be evaluated is created with all the NetFlows coming from the same source IP address. In the CAMNEP implementation, for each context group of NetFlows, a 3-dimensional model is built with the normalized entropy of the source ports, the normalized entropy of the destination ports, and the normalized entropy of the destination IP addresses. The anomalies are determined by classification rules that divide the traffic into normal and anomalous. The distance between the contexts of two NetFlows is computed as the difference between the three normalized entropies, combined as the sum of their squares. Our implementation of the algorithm is close to the original publication, which was further expanded by Xu and Zhang (2005), except for the introduction of new rules that define a horizontal port scan as anomalous.

3.2.3. Lakhina volume
The volume prediction algorithm presented in Lakhina et al. (2004) uses the Principal Components Analysis (PCA) algorithm to build a model of traffic volumes from individual sources. The observed traffic for each source IP address with non-negligible volumes of traffic is defined as a three-dimensional vector: the number of NetFlows, the number of bytes and the number of packets from the source IP address. The traffic model is defined as a dynamic and data-defined transformation matrix that is applied to the current traffic vector. The transformation splits the traffic into a normal (i.e. modeled) and a residual (i.e. anomalous) part. The transformation returns the residual amount of NetFlows, packets and bytes for each source IP address. These values define the context (identical for all the flows from the given source). An anomaly is determined by transforming the 3D context into a single value in the [0,1] interval.
Notice that the original work was designed to handle a different problem, that is, the detection of anomalies on a backbone. Also, the original work modeled networks instead of source IP addresses. However, we modified it to obtain a classifier that can successfully contribute to the joint opinion when combined with others.

3.2.4. Lakhina Entropy
The entropy prediction algorithm presented by Lakhina et al. (2005) is based on a PCA traffic model similar to that of Section 3.2.3, but it uses different features. It aggregates the traffic from the individual source IP addresses, but instead of traffic volumes, it predicts the entropies of destination IP addresses, destination ports and source ports over the set of context NetFlows for each source. The context space is therefore three-dimensional. An anomaly is determined as the normalized sum of residual entropy over all three dimensions. The metric is simple: a function measures the difference of residual entropies between the NetFlows and aggregates their squares. Also, the original anomaly detection method was significantly modified along the same lines as the volume prediction algorithm.

3.2.5. TAPS
The TAPS method (Sridharan et al., 2006) is different from the previous approaches because it targets horizontal and vertical port scans. The algorithm only considers the traffic sources that created at least one single-packet NetFlow during a particular observation period. These preselected sources are then classified using the following three features: the number of destination IP addresses, the number of destination ports and the entropy of the NetFlow size measured in number of packets. The anomaly value of the source IP address is based on the ratio between the number of unique destination IP addresses and destination ports. When this ratio exceeds a predetermined threshold, the source IP address is considered a scan origin. Using the original method, we encountered an unusually high number of false positives. Therefore, we extended the method with the NetFlow size entropy to achieve better results.

3.2.6. KGB
The KGB anomaly detector presented by Pevný et al. (2012) is also based on Lakhina's work. It uses the same features as the Lakhina Entropy detector described above. Similar to Lakhina's work, it performs a PCA analysis of the feature vectors for each source IP address in the dataset. The final anomaly is determined from the deviations of averaging the principal components.
There are two versions of the KGB detector:

- KGBf – examines principal components with high variances.
- KGBfog – examines principal components with low variances.

3.2.7. Flags
The Flags detector uses the same detection method as the KGB detector (Pevný et al., 2012). The only difference is in the input feature vector. The feature vector of the Flags detector is determined by the histogram of the TCP flags of all the NetFlows with the same IP address. This detector is looking for a sequence or a combination of anomalous TCP flags.

3.3. Trust modeling

The trust models are specialized knowledge structures studied in multi-agent research (Ramchurn et al., 2004; Sabater and Sierra, 2005). The features of trust models include fast learning, robustness in response to false reputation information and robustness with respect to environmental noise.
Recent trust models, inspired by machine learning methods (Rettinger et al., 2007) and pattern recognition approaches (Rehak et al., 2007), make the trust reasoning more relevant for network security, as they are able to:

- include the context of the trusting situation into the reasoning, making the trust model situational;
- use the similarities between trustees to reason about short-lived or one-shot trustees, e.g. NetFlows.

A feature vector includes the identity of a NetFlow and the context of the NetFlow by each trust model in the feature space. We use the term centroid to denote the permanent feature vectors that are positioned in the feature spaces of the trust models. The centroids act as trustees of the model, and the trustfulness value of each centroid is updated. Each centroid is used to deduce the trustfulness of the feature vectors in its vicinity.
The anomaly detectors integrate the anomaly values of individual NetFlows into their trust models. Reasoning about the trustfulness of each individual NetFlow is both computationally unfeasible and impractical (the NetFlows are single-shot events by definition), and thus the centroids of the clusters hold the trustfulness of significant NetFlow samples. The anomaly value of each NetFlow is used to update the trustfulness of the centroids in its vicinity. The weight used for the update of the trustfulness of the centroids decreases with the distance. Therefore, as each model uses a distinct distance function, they all have a different insight into the problem.
Each trust model determines the trustfulness of each NetFlow by finding all the centroids in the NetFlow's vicinity. It sets the trustfulness using the distance-based weighted average of the values preserved by the centroids. All the models provide their trustfulness assessment (conceptually a reputation opinion) to the anomaly aggregators.

3.4. Adaptation

The adaptation layer of CAMNEP identifies the optimal trustfulness aggregation function that achieves the best separation between the legitimate and malicious NetFlows. This layer is based on the insertion of challenges into the NetFlow data observed by the system. The challenges are NetFlows of past classified incidents. They are generated by short-lived, challenge-specific challenge agents and are mixed with the input traffic. They cannot be distinguished from the rest of the input traffic by the detectors/aggregators. They are processed and evaluated with the rest of the traffic. Also, they are used to update the anomaly detection mechanisms and the trust models. Once the process is completed, the challenges are re-identified by their respective challenge agents and removed from the output. The anomaly value given to these NetFlows by the individual anomaly aggregators is used to evaluate those aggregations and to select the optimal output for the current network conditions.
There are two broad types of challenges: the malicious challenges correspond to known attacks, whereas the legitimate challenges represent known instances of legitimate events that tend to be misclassified as anomalous. Malicious challenges are further divided into broad attack classes, such as fingerprinting/vertical scan, horizontal scan, password brute forcing, etc. For each attack class, each aggregator has a probability distribution that is empirically estimated from the continuous anomaly values attributed to the challenges in that class. This characterization can be seen in Fig. 2. All the legitimate challenges are also defined by a distribution.

Fig. 2 – Distribution of challenges. The anomaly distribution of the malicious challenges (from one class) is on the left side of the graph, while the legitimate events are on the right.

We assume that the anomaly values of both the legitimate and malicious challenges are defined by normal distributions.11 The distance between the estimated mean normalized values of both distributions represents the quality of the aggregator with respect to a given attack class. The effectiveness of the aggregator, defined as its ability to distinguish between the legitimate events and the attacks, is computed as a weighted average of the effectiveness with respect to the individual classes.

11 Normality of both distributions is not difficult to achieve provided that the attack classes are properly defined and that the challenge samples in these classes are well selected, i.e. comparable in terms of size and other parameters.

As the network traffic is highly dynamic, it is very difficult to predict which aggregation function will be chosen, especially given the fact that the challenges are selected from a challenge database using a stochastic process with a pseudo-random generator unknown to a potential attacker. The attacker therefore faces a dynamic detection system that unpredictably switches its detection profiles. Each profile has a utility value (i.e. detection performance) close to the optimum. This unpredictability, together with the additional robustness achieved by the use of multiple algorithms, makes an evasion attempt a much more difficult task than simply avoiding a single intrusion detection method (Rubinstein et al., 2009).
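The centroid mechanism of Section 3.3 — the anomaly of each new NetFlow nudges the values stored in nearby centroids, and a NetFlow is scored by a distance-weighted average of the surrounding centroids — can be pictured with the toy sketch below. The feature space, Gaussian weighting and update rate are illustrative assumptions, and the sketch keeps a single anomaly estimate per centroid rather than CAMNEP's trustfulness; it is not the CAMNEP code.

```python
import math

# Toy feature space: each centroid is (feature_vector, stored_anomaly_estimate).
centroids = [((0.1, 0.2), 0.1), ((0.8, 0.9), 0.7), ((0.5, 0.5), 0.3)]

def _weight(a, b, bandwidth=0.3):
    """Weight decreasing with distance (assumed Gaussian kernel)."""
    d = math.dist(a, b)
    return math.exp(-(d / bandwidth) ** 2)

def update(centroids, netflow_features, anomaly, rate=0.2):
    """Blend the anomaly of a new NetFlow into the centroids;
    closer centroids move more (distance-weighted update)."""
    return [(c, v + rate * _weight(c, netflow_features) * (anomaly - v))
            for c, v in centroids]

def score(centroids, netflow_features):
    """Distance-weighted average of the values preserved by the centroids."""
    weighted = [(_weight(c, netflow_features), v) for c, v in centroids]
    total = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total

centroids = update(centroids, (0.75, 0.85), anomaly=1.0)   # an anomalous flow seen
print(round(score(centroids, (0.78, 0.88)), 3))  # high score near that region
print(round(score(centroids, (0.12, 0.18)), 3))  # low score in the normal region
```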

Furthermore, the system is able to find the optimal thresholds for the anomaly score using the results from the adaptation process. The system continuously models the normal distribution of malicious and legitimate challenges. The threshold is set to minimize the Bayes risk (posterior expected loss) computed from the modeled legitimate and malicious behavior distributions. Thus, the final result of the system is a list of NetFlows labeled as anomalous or normal.

3.5. Training of the CAMNEP method

Since the system needs the inner models of the anomaly detectors and trust models to achieve optimal detection results, it is necessary to train them. Typically, the system needs 25 min of traffic to create its inner models and to adapt itself to the current type of network and its state. Therefore, the training data for the CAMNEP algorithm was created by trimming off some minutes at the start of each of the scenarios in the dataset described in Section 6.

4. The BClus detection method

The BClus method is a behavioral-based botnet detection approach. It creates models of known botnet behavior and uses them to detect similar traffic on the network. It is not an anomaly detection method.
The purpose of the method is to cluster the traffic sent by each IP address and to recognize which clusters have a behavior similar to the botnet traffic. A basic schema of the BClus method is:

1. Separate the NetFlows in time windows.
2. Aggregate the NetFlows by source IP address.
3. Cluster the aggregated NetFlows.
4. Only Training: Assign ground-truth labels to the botnet clusters.
5. Only Training: Train a classification model on the botnet clusters.
6. Only Testing: Use the classification model to recognize the botnet clusters.

At the end, the BClus method outputs a predicted label for each NetFlow analyzed.
The following Subsections describe each of these steps.

4.1. Separate the NetFlows in time windows

The main reason to separate the NetFlows in time windows is the huge amount of data that had to be processed. Some of our botnet scenarios produced up to 395,000 packets per minute. A short time window allows us to better process this information.
The second reason to use time windows is that botnets tend to have a temporal locality behavior (Hegna, 2010), meaning that most actions remain unchanged for several minutes. In our dataset the locality behavior ranges between 1 and 30 min. This temporal locality helps to capture all the important behaviors in the time windows.
The third reason for using time windows is the need to deliver a result to the network administrator in a timely manner. After each time window, the BClus method outputs some results and the administrator can get feedback about the traffic.
An important decision in the time window separation is the window width. A short time window does not contain enough NetFlows and therefore would not allow a correct analysis of the botnet behavior. On the other hand, a large time window would have a high computational cost. The time window used by the BClus method is two minutes, since it is enough to capture all the botnet behaviors and does not contain too many NetFlows.
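As a concrete picture of this two-minute bucketing, the following sketch groups NetFlows into fixed time windows (the timestamps, IP addresses and tuple layout are illustrative assumptions, not the BClus code).

```python
from datetime import datetime, timedelta
from collections import defaultdict

WINDOW = timedelta(minutes=2)   # time-window width used by BClus

def window_id(start_time, origin):
    """Index of the 2-minute window a NetFlow belongs to."""
    return int((start_time - origin) // WINDOW)

# Hypothetical NetFlows: (start time, source IP); real records carry more fields.
flows = [
    (datetime(2011, 8, 10, 9, 0, 5),  "147.32.84.165"),
    (datetime(2011, 8, 10, 9, 1, 50), "147.32.84.191"),
    (datetime(2011, 8, 10, 9, 2, 10), "147.32.84.165"),
]

origin = min(t for t, _ in flows)
windows = defaultdict(list)
for flow in flows:
    windows[window_id(flow[0], origin)].append(flow)

for wid, items in sorted(windows.items()):
    print(wid, [ip for _, ip in items])   # window 0 holds the first two flows
```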

The next step in the BClus method is to aggregate the NetFlows.

4.2. Aggregate the NetFlows by source IP address

The purpose of aggregating the NetFlows is to analyze the problem from a new, high-level perspective. The aggregated data may show new patterns. We hypothesize that these new patterns could help recognize the behaviors of botnets. From the botnet detection perspective, the main motivations for aggregating NetFlows are the following:

- Each bot communicates with the C&C server periodically (AsSadhan et al., 2009).
- Several bots may communicate at the same time with the same C&C servers (Gu et al., 2008).
- Several bots may attack the same target at the same time (Lee et al., 2008).

Inside each time window, the NetFlows are aggregated during one aggregation window. The width of the aggregation window should be less than the width of the time window, which is two minutes. After some experimentation, a one-minute aggregation window width was selected, which is enough to capture the botnet synchronization patterns and short enough not to capture too much traffic (García et al., 2012). Therefore, on each time window, two aggregation windows are used.
The NetFlows are aggregated by unique source IP address. The resulting features on each aggregation window are:

1. Source IP address.
2. Amount of unique source ports used by this source IP address.
3. Amount of unique destination IP addresses contacted by this source IP address.
4. Amount of unique destination ports contacted by this source IP address.
5. Amount of NetFlows used by this source IP address.
6. Amount of bytes transferred by this source IP address.
7. Amount of packets transferred by this source IP address.

We call this group of seven aggregated features an instance to simplify the references. The aggregation step ends with a list of instances for each aggregation window. The next Subsection describes the clustering process of these instances.

4.3. Cluster the aggregated NetFlows

The continuous evolution of botnets suggests that a good detection method should be as independent of the network characteristics as possible. The BClus method, then, uses an unsupervised approach to cluster the instances described in the previous section. These natural groups of behaviors depend on the time window being analyzed and on the characteristics of the network where the algorithm is running.
The technique used for this task is WEKA's implementation of the Expectation-Maximization (EM) algorithm (Moon, 1996). EM is an iterative procedure that attempts to find the parameters of the model that maximize the probability of the observed data. Our dataset has many different network behaviors generated by normal, botnet and attack actions. We hypothesize that these behaviors are generated from different probabilistic models and that the parameters of these models can be found using the EM algorithm. The instances are assigned to the probability distribution that they most likely belong to, therefore building clusters.
After generating the clusters, the task of the BClus method is to find which of them belong to botnets. The features of a cluster are the average and standard deviation of the seven instance features described in Section 4.2. The following 15 cluster features are obtained for each cluster:

1. Total amount of instances in the cluster.
2. Total amount of NetFlows in the cluster.
3. Amount of source IP addresses.
4. Average amount of unique source ports.
5. Standard deviation of the amount of unique source ports.
6. Average amount of unique destination IP addresses.
7. Standard deviation of the amount of unique destination IP addresses.
8. Average amount of unique destination ports.
9. Standard deviation of the amount of unique destination ports.
10. Average amount of NetFlows.
11. Standard deviation of the amount of NetFlows.
12. Average amount of bytes transferred.
13. Standard deviation of the amount of bytes transferred.
14. Average amount of packets transferred.
15. Standard deviation of the amount of packets transferred.

Once the features are extracted, they are used in the next Subsection to assign the ground-truth labels to the clusters.
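The aggregation and clustering steps of Sections 4.2 and 4.3 can be sketched as below. This is not the BClus implementation, which uses WEKA's EM: the sketch assumes a pandas DataFrame of NetFlows with hypothetical column names and uses scikit-learn's GaussianMixture (a Gaussian mixture fitted by EM) as a stand-in clusterer.

```python
import pandas as pd
from sklearn.mixture import GaussianMixture

def aggregate_by_source(flows: pd.DataFrame) -> pd.DataFrame:
    """One instance per source IP address and 1-minute aggregation window,
    with the seven features of Section 4.2 (column names are assumptions)."""
    flows = flows.assign(agg_window=flows["start_time"].dt.floor("1min"))
    return (flows.groupby(["agg_window", "src_ip"])
                 .agg(n_src_ports=("src_port", "nunique"),
                      n_dst_ips=("dst_ip", "nunique"),
                      n_dst_ports=("dst_port", "nunique"),
                      n_flows=("src_port", "size"),
                      n_bytes=("bytes", "sum"),
                      n_packets=("packets", "sum"))
                 .reset_index())

def cluster_instances(instances: pd.DataFrame, n_clusters: int = 5):
    """Cluster the instances with an EM-fitted Gaussian mixture and derive
    per-cluster summary features in the spirit of Section 4.3."""
    feature_cols = [c for c in instances.columns if c not in ("agg_window", "src_ip")]
    gm = GaussianMixture(n_components=n_clusters, random_state=0)
    labels = gm.fit_predict(instances[feature_cols])
    instances = instances.assign(cluster=labels)
    grouped = instances.groupby("cluster")
    summary = pd.concat([grouped[feature_cols].mean().add_suffix("_mean"),
                         grouped[feature_cols].std().add_suffix("_std"),
                         grouped.size().rename("n_instances")], axis=1)
    return instances, summary
```

The fixed number of mixture components is only for the example; WEKA's EM can select the number of clusters itself (e.g. by cross-validation), so the clusters found per time window need not be five.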

4.4. Train a classification model on the botnet clusters

The classification algorithm used to find the botnet clusters is JRIP, WEKA's implementation of “a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which was proposed by William W. Cohen as an optimized version of IREP” (Hall et al., 2009).
The JRIP algorithm receives a labeled group of clusters and outputs a group of rules to detect them. That means that the JRIP algorithm needs to be trained on how to recognize a botnet cluster. This training is done with the following leave-one-out algorithm:

1. Training phase.
   (a) Use a leave-one-out algorithm with the training and cross-validation datasets. For each round do:
       i. Separate the NetFlows in time windows (Section 4.1).
       ii. Aggregate the NetFlows by source IP address (Section 4.2).
       iii. Cluster the aggregated NetFlows (Section 4.3).
       iv. Assign ground-truth labels to the clusters based on the ground-truth labels of the NetFlows (Section 4.4.1).
       v. Train a JRIP classification model to recognize the botnet clusters.
       vi. Apply the JRIP model to the cross-validation dataset of this round.
       vii. Store the error metrics of this round.
2. Select the best JRIP model based on the results of the leave-one-out.
3. Testing phase (Section 4.5).
   (a) Read the testing dataset.
   (b) Use the best JRIP classification model to recognize the botnet clusters.
   (c) Assign the labels to the NetFlows based on the labels of the clusters.

The rest of the Subsections describe each of the steps.

4.4.1. Assign ground-truth labels to the botnet clusters
Ground-truth labels should be assigned to the clusters because we need to train the JRIP classification algorithm with them. Once the JRIP algorithm knows which are the botnet clusters, it can create a model to recognize them.
To assign a ground-truth label to a cluster, we should first assign a ground-truth label to all of its instances (aggregated NetFlows). However, to assign a ground-truth label to an instance, we should first assign a ground-truth label to all of its NetFlows.
The ground-truth label of each NetFlow is known from the original NetFlow files that are part of the dataset. Therefore, the ground-truth label of each instance is known, since an instance is composed of all the NetFlows from the same IP address. However, a cluster is composed of different instances coming from different IP addresses, and then it is not straightforward to know which ground-truth label should be assigned to a cluster. This Subsection describes how we assigned a ground-truth label to each cluster.
To help us decide which label should be assigned to each cluster, a new feature was computed for each cluster: the percentage of botnet NetFlows in the cluster. This value is expected to be bigger in botnet clusters and smaller in background clusters, and it was used to select the ground-truth label for the cluster. Notice that this feature is only used to assign the ground-truth label in the training phase and is not stored nor used in the testing phase.
As this new feature is a percentage, a correct threshold had to be found. This threshold decision is very important, because different percentages correspond to different botnet behaviors. If it is above 0%, it means that every cluster with at least one botnet NetFlow is considered a representative of a botnet behavior. If it is above 1%, it means that only clusters with more than 1% of botnet NetFlows are considered a representative of a botnet behavior. A manual analysis of the dataset determined that most of the real botnet clusters had between 0% and 1% of botnet NetFlows. To find out which threshold between 0% and 1% was the best, we implemented the leave-one-out algorithm described in Section 4.4 to try the following ten candidate thresholds: 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9% and 1%.
After running the leave-one-out technique we found that the group of clusters with the best error metrics for the BClus algorithm was generated with a threshold of 0.4%. The set of JRIP rules generated by the 0.4% threshold became the best detection model, applied in the next Subsection.

4.5. Testing phase: use the classification model to recognize the botnet clusters

Once the best detection model was found in the previous Subsection, we applied it to the testing dataset to know the real performance of the BClus algorithm.
The testing dataset was processed in the same way as the training dataset. That is, it was separated into two-minute time windows, the NetFlows in each time window were aggregated by source IP address every minute, and those aggregated instances were clustered. Then, the best JRIP model was applied to detect the botnet clusters.
If a cluster was classified as botnet, then all of its instances were labeled as botnet, and in turn all of the NetFlows in those instances were labeled as botnet. Finally, the BClus method output a list of NetFlows with the predicted label assigned. This list of labeled NetFlows for each testing scenario is the output that will be compared to the CAMNEP and BotHunter methods in Section 8.

5. The BotHunter Method

The BotHunter method was proposed by Gu et al. (2007) to detect the infection and coordination dialog of botnets by matching a state-based infection sequence model. It consists of a correlation engine that aims at detecting specific stages of the malware infection process, such as inbound scanning, exploit usage, egg downloading, outbound bot coordination dialog and outbound attack propagation.
It uses an adapted version of the Snort IDS12 with two proprietary plug-ins, called Statistical Scan Anomaly Detection Engine (SCADE) and Statistical Payload Anomaly Detection Engine (SLADE). SLADE implements a lossy n-gram payload analysis of incoming traffic flows to detect divergences in some protocols. SCADE performs port scan analysis.

12 http://www.snort.org.

An infection is reported when one of two conditions is satisfied: first, when evidence of a local host infection is found and evidence of outward bot coordination or attack propagation is found; and second, when at least two distinct signs of outward bot coordination or attack propagation are found. The BotHunter warnings are tracked over a temporal window and contribute to the infection score of each host.

The BotHunter proposal is compared to the BClus and CAMNEP methods to have the reference of an accepted detection method in the community. The version of BotHunter used in the comparison is 1.7.2.
Section 8.1 describes how the results of the BotHunter proposal were adapted and incorporated into the comparison.

6. Creation of the dataset

In order to compare the methods, a good dataset is needed. According to Sperotto et al. (2009) and Shiravi et al. (2012), a good dataset should be representative of the network where the algorithms are going to be used. This means that it should have botnet, normal and background labeled data, that the balance of the dataset should be like in a real network (usually the percentage of botnet data is small), and that it should be representative of the type of behaviors seen on the network. The difficulties of obtaining such a dataset are discussed in Shiravi et al. (2012) and the importance of these characteristics is discussed in Rossow et al. (2012).
Due to the absence of a public botnet dataset with the characteristics needed, we created a new public dataset that complies with the following design goals:

- Must have real botnet attacks and not simulations.
- Must have unknown traffic from a large network.
- Must have ground-truth labels for training and evaluating the methods.
- Must include different types of botnets.
- Must have several bots infected at the same time to capture synchronization patterns.
- Must have NetFlow files to protect the privacy of the users.

The topology used to create the dataset consisted of a set of virtualized computers running the Microsoft Windows XP SP2 operating system on top of a Linux Debian host. At the time of designing the topology, Windows XP SP2 was the operating system most used by the malware. Each virtual machine was bridged into the University network.
Fig. 3 shows a diagram of the testbed. The traffic was captured both on the Linux host and on one of the University routers. The traffic from the Linux host was exclusively composed of botnet traffic and was used for labeling purposes. The traffic from the University router was used to create the final dataset. The tool used to capture the traffic was tcpdump (Jacobson et al., 1997).

Fig. 3 – Testbed network topology.

The next Subsections describe each of the captures, its design principles, the preprocessing of the dataset, the assignment of labels, the separation into training and testing, and the publication of the dataset.

6.1. Design of the botnet scenarios

A botnet scenario, in the context of this paper, is a particular infection of the virtual machines using a specific malware. Thirteen of these scenarios were created, and each of them was designed to be representative of some malware behavior.
The main characteristics of the scenarios and their behaviors are shown in Table 2. It describes if they used IRC, P2P or HTTP protocols, if they sent SPAM, did Click-Fraud, port scanned, did DDoS attacks, used Fast-Flux techniques or if they were custom compiled.
The features related to the network traffic of each scenario are shown in Table 3. It presents the size, duration, number of packets, number of flows, number of bots and bot family.
The network topology used to make the captures had a bandwidth control mechanism. However, the traffic going out to the Internet was not filtered. This decision may seem controversial (Rossow et al., 2012), but it was taken with the explicit determination of capturing real attacks. We believe that the best way to study and model an attack is to capture real attacks.



Fig. 4 – Correcting Function applied to 60 time windows.

The next Subsection describes how these scenarios were preprocessed to obtain a more usable dataset.

6.2. Dataset preprocessing

After capturing the packets, the dataset was preprocessed and converted to a common format for the detection methods. The format selected was the NetFlow file standard (Claise, 2008), which is considered the de facto standard for network data representation. The conversion from pcap files to NetFlow files was done in two steps using the Argus software suite (Argus, 2013). First, the argus tool was used to convert each pcap file into a bidirectional Argus binary storage file. The exact configuration of argus is published with each scenario. Second, the ra Argus client tool was used to convert each Argus binary storage file into a NetFlow file. This can be done by specifying the output fields in the ra configuration. The ra configuration is also published with each scenario. These final NetFlow files were composed of the following fields: Start Time, End Time, Duration, Source IP address, Source Port, Direction, Destination IP address, Destination Port, State, SToS, Total Packets and Total Bytes.
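A minimal way to load one of the resulting labeled NetFlow files is sketched below. The comma separator and the StartTime and Label column names are assumptions about the published files; the authoritative field layout is the ra configuration distributed with each scenario.

```python
import pandas as pd

# Hypothetical file name; the real captures are published per scenario.
NETFLOW_FILE = "capture20110810.binetflow"

flows = pd.read_csv(
    NETFLOW_FILE,
    parse_dates=["StartTime"],   # assumed column name for "Start Time"
)

def label_class(label: str) -> str:
    """Collapse the ground-truth label into the three classes of the dataset."""
    label = str(label).lower()
    if "botnet" in label:
        return "Botnet"
    if "normal" in label:
        return "Normal"
    return "Background"

flows["class"] = flows["Label"].map(label_class)   # "Label" column name assumed
print(flows["class"].value_counts())
```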

Table 2 – Characteristics of the botnet scenarios. (CF: ClickFraud, PS: Port Scan, FF: FastFlux, US: Compiled and controlled by us.)

Id   IRC   SPAM   CF   PS   DDoS   FF   P2P   US   HTTP   Note
1    √     √      √    –    –      –    –     –    –
2    √     √      √    –    –      –    –     –    –
3    √     –      –    √    –      –    –     √    –
4    √     –      –    –    √      –    –     √    –      UDP and ICMP DDoS.
5    –     √      –    √    –      –    –     –    √      Scan web proxies.
6    –     –      –    √    –      –    –     –    –      Proprietary C&C. RDP.
7    –     –      –    –    –      –    –     –    √      Chinese hosts.
8    –     –      –    √    –      –    –     –    –      Proprietary C&C. Net-BIOS, STUN.
9    √     √      √    √    –      –    –     –    –
10   √     –      –    –    √      –    –     √    –      UDP DDoS.
11   √     –      –    –    √      –    –     √    –      ICMP DDoS.
12   –     –      –    –    –      –    √     –    –      Synchronization.
13   –     √      –    √    –      –    –     –    √      Captcha. Web mail.

Table 3 – Amount of data on each botnet scenario.

Id   Duration (hrs)   # Packets   # NetFlows   Size   Bot   # Bots
1 6.15 71,971,482 11,231,035 52 GB Neris 1
2 4.21 71,851,300 7,037,972 60 GB Neris 1
3 66.85 167,730,395 15,202,061 121 GB Rbot 1
4 4.21 62,089,135 4,238,045 53 GB Rbot 1
5 11.63 4,481,167 7,710,910 37.6 GB Virut 1
6 2.18 38,764,357 2,579,105 30 GB Menti 1
7 0.38 7,467,139 454,175 5.8 GB Sogou 1
8 19.5 155,207,799 11,993,935 123 GB Murlo 1
9 5.18 115,415,321 8,087,513 94 GB Neris 10
10 4.75 90,389,782 5,180,852 73 GB Rbot 10
11 0.26 6,337,202 40,836 5.2 GB Rbot 3
12 1.21 13,212,268 1,262,790 8.3 GB NSIS.ay 3
13 16.36 50,888,256 6,425,345 34 GB Virut 1

6.2.1. Ground-truth labels assignment
The assignment of ground-truth labels is a very important part of the dataset creation process (Fontugne et al., 2010). However, it can be complex and difficult to do (Davis and Clark, 2011). For example, a wrongly assigned label might produce unreliable results (Maloof, 2006).
Our labeling strategy assigns three different labels: background, botnet and normal. The priority to assign the labels is the following:

1. Assign the Background label to the whole traffic.
2. Assign the Normal label to the traffic that matches certain filters.
3. Assign the Botnet label to all the traffic that comes from or to any of the known infected IP addresses.

The filters used to assign normal labels were created from the known and controlled computers in the network, such as routers, proxies, switches, our own computers in the laboratory, etc.
The distribution of labels on each experiment is shown in Table 4. It can be seen that most of the traffic was labeled as Background. This majority class may add a natural bias to the dataset; however, one of the ways to avoid this is to capture a large dataset, as stated by Kotsiantis et al. (2006).

Table 4 – Distribution of labels for each scenario in the dataset.

Id   Background            Botnet             Normal
1    10,124,854 (95.40%)   94,972 (0.89%)     392,433 (3.69%)
2    6,071,419 (95.59%)    54,433 (0.85%)     225,336 (3.54%)
3    14,381,899 (94.60%)   75,891 (0.49%)     744,270 (4.89%)
4    3,895,469 (91.91%)    6466 (0.15%)       336,103 (7.93%)
5    416,267 (91.37%)      2129 (0.46%)       37,144 (8.15%)
6    2,031,967 (94.12%)    4927 (0.22%)       121,854 (5.64%)
7    425,611 (93.71%)      293 (0.06%)        28,270 (6.22%)
8    11,451,205 (95.47%)   12,063 (0.10%)     530,666 (4.42%)
9    6,881,228 (90.22%)    383,215 (5.02%)    362,594 (4.75%)
10   4,535,493 (87.54%)    323,441 (6.24%)    321,917 (6.21%)
11   119,933 (29.33%)      277,892 (67.97%)   11,010 (2.69%)
12   119,933 (29.33%)      277,892 (67.97%)   11,010 (2.69%)
13   1,218,140 (93.76%)    21,760 (1.67%)     59,190 (4.55%)

6.2.2. Dataset separation into training, testing and cross-validation
To correctly create the classification models used in the BClus and CAMNEP methods, we need to first separate the dataset. For the CAMNEP method the training consisted of the first 25 min of each scenario, so it was not necessary to further separate them.
For the BClus method, it was necessary to separate the dataset into training and cross-validation, and testing. The separation criteria were carefully evaluated, because the following constraints must be met:

- The training and cross-validation datasets should be approximately 80% of the dataset.
- The testing dataset should be approximately 20% of the dataset.
- None of the botnet families used in the training and cross-validation dataset should be used in the testing dataset. This ensures that the methods can generalize and detect new behaviors.

However, it is not clear which feature should be used for the 80%-20% separation criteria. It is not the same to take 80% of the amount of packets as 80% of the amount of bytes. We made the separation by carefully selecting the scenarios so that 80% of the following features are considered: the Duration in minutes, the Number of clusters, the Number of NetFlows and the Number of aggregated NetFlows of the scenarios.
The final separation of the scenarios for the datasets is shown in Table 5. The problem of the imbalanced amount of labels on each dataset was reduced, as stated in Kotsiantis et al. (2006), by carefully selecting the training and testing datasets. Also, as the majority label is Background, the bias toward the majority class reported in Li et al. (2010) is avoided.
All three methods compared used only the testing scenarios for obtaining results.

Table 5 – Dataset separation into training, testing and cross-validation.

Scenario              Dataset
1,2,6,8,9             Testing
3,4,5,7,10,11,12,13   Training and cross-validation
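The three-step labeling priority of Section 6.2.1 amounts to a simple precedence rule per NetFlow, sketched here with illustrative IP addresses and filters (the real filters and infected hosts are those published with each scenario).

```python
# Illustrative values; the real lists come from the controlled testbed machines.
INFECTED_IPS = {"147.32.84.165"}                 # known botnet hosts (example)
NORMAL_FILTERS = [
    lambda f: f["src_ip"] == "147.32.84.170",    # e.g. a known lab workstation
    lambda f: f["dst_ip"] == "147.32.80.9",      # e.g. a known department server
]

def assign_label(flow: dict) -> str:
    """Priority: Background by default, then Normal filters, then Botnet,
    so the Botnet label always wins for traffic of infected hosts."""
    label = "Background"
    if any(match(flow) for match in NORMAL_FILTERS):
        label = "Normal"
    if flow["src_ip"] in INFECTED_IPS or flow["dst_ip"] in INFECTED_IPS:
        label = "Botnet"
    return label

print(assign_label({"src_ip": "147.32.84.165", "dst_ip": "8.8.8.8"}))      # Botnet
print(assign_label({"src_ip": "147.32.84.170", "dst_ip": "147.32.80.9"}))  # Normal
```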

6.3. Dataset publication

The thirteen scenarios of our dataset were published on the web site https://mcfp.felk.cvut.cz/ (García, 2013). Each scenario includes the botnet pcap file, the labeled NetFlow file, a README file with the capture time line and the original malware executable binary. It was not possible to publish the complete pcap file with the background and normal packets because they contain private information. However, both of our methods use only the NetFlow files. The correspondence between the number of the scenario and the name of the capture in the web page is:

- Scenario Id 1 is CTU-Malware-Capture-Botnet-42.
- Scenario Id 2 is CTU-Malware-Capture-Botnet-43.
- Scenario Id 3 is CTU-Malware-Capture-Botnet-44.
- Scenario Id 4 is CTU-Malware-Capture-Botnet-45.
- Scenario Id 5 is CTU-Malware-Capture-Botnet-46.
- Scenario Id 6 is CTU-Malware-Capture-Botnet-47.
- Scenario Id 7 is CTU-Malware-Capture-Botnet-48.
- Scenario Id 8 is CTU-Malware-Capture-Botnet-49.
- Scenario Id 9 is CTU-Malware-Capture-Botnet-50.
- Scenario Id 10 is CTU-Malware-Capture-Botnet-51.
- Scenario Id 11 is CTU-Malware-Capture-Botnet-52.
- Scenario Id 12 is CTU-Malware-Capture-Botnet-53.
- Scenario Id 13 is CTU-Malware-Capture-Botnet-54.

7. Comparison methodology and new error metric

To compare several detection methods it is necessary to have a methodology, so the comparisons can be repeated and extended. For this purpose we created a simple methodology and a new error metric. The methodology may be used by other researchers to add the results of their methods and obtain new comparisons. Section 7.1 presents the methodology and Section 7.2 presents the error metric.

7.1. Comparison methodology

When a new botnet detection method using a new dataset needs to be compared with a third-party method, the most usual approach is to try to run the third-party method on the new dataset. However, obtaining the original implementation of a third-party method may be difficult or even impossible due to copyright issues.
The comparison methodology used in this paper is simpler. Instead of trying to implement a third-party method on our dataset, we propose that researchers first download a common dataset with labels, execute their methods on this common dataset, add their results to the common dataset and then publish the common dataset back.
A dataset made of NetFlow lines with ground-truth labels can be easily modified to add a new column with the method's predictions for each NetFlow. In this way, more and more methods will publish their results and more comparisons can be made. The main advantage of this approach is that the details of the methods remain private.
To implement this methodology we created and published a new tool called Botnet Detectors Comparer (García, 2014) that is publicly available for download.13
This tool reads the dataset NetFlow file and implements the following steps:

- Separates the NetFlow file into comparison time windows.
- Compares the ground-truth NetFlow labels with the predicted labels of each method and computes the TP, TN, FP and FN values.
- After the comparison time window has ended, it computes the error metrics FPR, TPR, TNR, FNR, Precision, Accuracy, ErrorRate and FMeasure1 for that time window.
- When the dataset ends, it computes the final error metrics.
- The error metrics are stored in a text file and plotted in an eps image.

Also, this tool computes the new error metric that we propose in Section 7.2.
The comparison time window is the time window used for computing the error metrics and it is not related to the methods. It is the time that the network administrator may wait to have a decision about the traffic. In our methodology the width of the comparison time window is five minutes.
Using this methodology, researchers can now add their own predictions to the NetFlow files of our dataset and use this tool to compute the error metrics. To tell the tool which labels the new method uses for its predictions, they should be added to the header of the NetFlow file as a new column with the format "NameOfNewMethod(NormalLabelUsed:BotnetLabelUsed:BackgroundLabelUsed)".
The next Subsection describes the new error metric proposed to compare botnet detection methods.

7.2. New error metric

The error metrics usually used by researchers to analyze their results (e.g. FPR, FMeasure) were historically designed from a statistical point of view, and they are really good to measure differences and to compare most methods. But the needs of a network administrator that is going to use a detection method are slightly different. These error metrics should have a meaning that can be translated to the network. This has been called the semantic gap by Rossow et al. (2012). It is possible that the common error metrics are not enough for a network administrator (García et al., 2013).
For example, according to the classic definition, a False Positive should be counted every time that a normal NetFlow is detected as botnet. However, a network administrator might want to detect a small amount of infected IP addresses instead of hundreds of NetFlows. Furthermore, she may need to detect them as soon as possible. These needs are not satisfied by the classic error metrics.

13 http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.tgz.
The type of error metric that may be useful for a network administrator may also be useful for comparing the methods that she is going to use.

Therefore, we have created a new set of error metrics that attempt to solve this issue and that adhere to the following principles:

- Errors should account for IP addresses instead of NetFlows.
- To detect a botnet IP address (TP) early is better than later.
- To miss a botnet IP address (FN) early is worse than later.
- The value of detecting a normal IP address (TN) is not affected by time.
- The value of missing a normal IP address (FP) is not affected by time.

The first step is to incorporate time into the metrics by computing the errors in comparison time frames. These time frames are only used to compute the errors and are independent of the detection methods.

The second step was to migrate from a NetFlow-based detection to an IP-based detection. The classical error values (TP, FP, TN, FN) were redefined as follows:

- c_TP: A True Positive is accounted when a Botnet IP address is detected as Botnet at least once during the comparison time frame.
- c_TN: A True Negative is accounted when a Normal IP address is detected as Non-Botnet during the whole comparison time frame.
- c_FP: A False Positive is accounted when a Normal IP address is detected as Botnet at least once during the comparison time frame.
- c_FN: A False Negative is accounted when a Botnet IP address is detected as Non-Botnet during the whole comparison time frame.

The third step was to modify the values of the error metrics by adding a time-based correcting_function that is defined as follows:

correcting_function = e^(−α · N) + 1    (1)

This function depends on the ordinal number N of the comparison time frame where it is computed and on the α value. Fig. 4 shows this monotonically decreasing function that weights the values according to the time frame number. The main idea is to weight the values more on the first time frames and to weight them less on the last ones. The α value is used to manually fit the function according to the capture length. The α value used for our comparison was 0.01.

Using this correcting_function, four time-dependent error metrics were created, called tTP, tTN, tFP and tFN. Note that they use the previously defined IP-based error metrics. They are computed as follows:

tTP = (c_TP · correcting_function) / (number of unique botnet IP addresses in the comparison time frame)    (2)

tFN = (c_FN · correcting_function) / (number of unique botnet IP addresses in the comparison time frame)    (3)

tFP = c_FP / (number of unique normal IP addresses in the comparison time frame)    (4)

tTN = c_TN / (number of unique normal IP addresses in the comparison time frame)    (5)

These time-based error metrics allow for a more realistic comparison between detection algorithms. An algorithm scores better if it can detect all the infected IP addresses sooner and without error. To miss an infected IP address at the beginning is more costly than to falsely detect an infected IP address at the beginning. After some time frames, all the error values are weighted the same.

Table 6 – References for algorithm names.

Algorithm name                   Reference
Flags-FOG-srcIP.src.fog-1.00     Fs1
Flags-FOG-srcIP.src.fog-1.50     Fs1.5
Flags-FOG-dstIP.dst.fog-1.00     Fd1
Flags-FOG-srcIP.src.fog-2.00     Fs2
Flags-FOG-dstIP.dst.fog-1.50     Fd1.5
Flags-FOG-dstIP.dst.fog-2.00     Fd2
Minds-1.00                       Mi1
Xu-1.00                          X1
Xu-1.50                          X1.5
Minds-1.50                       Mi1.5
Minds-2.00                       Mi2
LakhinaEntropyGS-1.00            Le1
KGBFog-sIP.src.fog-1.00          Ko1
KGBFog-sIP.src.fog-1.50          Ko1.5
MasterAggregator-1.00            CA1
TAPS3D-1.50                      T1.5
TAPS3D-1.00                      T1
KGBF-sIP.src.f-1.00              K1
KGBF-sIP.src.f-2.00              K2
AllNegative                      AllNeg
AllPositive                      AllPo
BClus                            BClus
XuDstIP-1.50                     Xd1.5
XuDstIP-2.00                     Xd2
TAPS3D-2.00                      T2
LakhinaVolumeGS-1.50             Lv1.5
MasterAggregator-1.50            CA1.5
LakhinaVolumeGS-2.00             Lv2
MasterAggregator-2.00            CA2
Xu-2.00                          X2
KGBF-sIP.src.f-1.50              K1.5
XuDstIP-1.00                     Xd1
LakhinaEntropyGS-1.50            Le1.5
LakhinaEntropyGS-2.00            Le2
LakhinaVolumeGS-1.00             Lv1
KGBFog-sIP.src.fog-2.00          Ko2
AllBackground                    AllBac
BotHunter                        BH
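As a concrete illustration of equations (1)–(5), the following minimal sketch shows how the time-based error values of a single comparison time frame could be computed. It is only an illustrative sketch written for this description, not the code of the Botnet Detectors Comparer tool; the per-flow tuple layout (IP address, ground-truth label, predicted label) is a hypothetical simplification.

```python
import math

ALPHA = 0.01  # the alpha value used for the comparison in this paper

def correcting_function(frame_number, alpha=ALPHA):
    # Equation (1): weights the early time frames more and tends to 1 later.
    return math.exp(-alpha * frame_number) + 1

def time_based_errors(flows, frame_number):
    """Compute tTP, tTN, tFP and tFN for one comparison time frame.

    `flows` is an iterable of (ip, ground_truth, prediction) tuples, where
    ground_truth is 'Botnet', 'Normal' or 'Background' (Background flows
    are not evaluated) and prediction is 'Botnet' or 'Normal'.
    """
    botnet_ips, normal_ips, predicted_botnet = set(), set(), set()
    for ip, truth, prediction in flows:
        if truth == 'Botnet':
            botnet_ips.add(ip)
        elif truth == 'Normal':
            normal_ips.add(ip)
        else:
            continue  # Background traffic is ignored by the metric
        if prediction == 'Botnet':
            predicted_botnet.add(ip)

    # IP-based error values: each IP address counts once per time frame.
    c_tp = len(botnet_ips & predicted_botnet)
    c_fn = len(botnet_ips - predicted_botnet)
    c_fp = len(normal_ips & predicted_botnet)
    c_tn = len(normal_ips - predicted_botnet)

    cf = correcting_function(frame_number)
    n_botnet = len(botnet_ips) or 1   # avoid division by zero
    n_normal = len(normal_ips) or 1

    t_tp = c_tp * cf / n_botnet       # Equation (2)
    t_fn = c_fn * cf / n_botnet       # Equation (3)
    t_fp = c_fp / n_normal            # Equation (4)
    t_tn = c_tn / n_normal            # Equation (5)
    return t_tp, t_tn, t_fp, t_fn
```

In the methodology of Section 7.1 this computation would be repeated for every five-minute comparison time window, and the resulting values accumulated before computing the rates defined in the remainder of this Subsection.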
Table 7 – Comparison of error metrics for the methods in Scenario 1.
Name tTP tTN tFP tFN TPR TNR FPR FNR Prec Acc ErrR FM1
AllPo 65.5 0 69 0 1 0 1 0 0.4 0.4 0.5 0.65
BClus 30.2 41.3 27.6 35.3 0.4 0.5 0.4 0.5 0.5 0.5 0.4 0.48
Fs1 7.8 66.4 2.5 57.5 0.1 0.9 <0.0 0.8 0.7 0.5 0.4 0.20
Fs1.5 6.3 67.2 1.7 59.1 <0.0 0.9 <0.0 0.9 0.7 0.5 0.4 0.17
Fd1 6.8 54.2 14.6 58.6 0.1 0.7 0.2 0.8 0.3 0.4 0.5 0.15
Fs2 4 67.6 1.3 61.4 <0.0 0.9 <0.0 0.9 0.7 0.5 0.4 0.11
Fd1.5 4.6 57.5 11.4 60.8 <0.0 0.8 0.1 0.9 0.2 0.4 0.5 0.11
Fd2 2.2 59.8 9.1 63.2 <0.0 0.8 0.1 0.9 0.1 0.4 0.5 0.05
Mi1 2.3 52.3 16.6 63.1 <0.0 0.7 0.2 0.9 0. 0.4 0.5 0.05
X1 1.7 68.6 0.3 63.6 <0.0 0.9 <0.0 0.9 0.8 0.5 0.4 0.05
X1.5 1.5 68.6 0.3 63.9 <0.0 0.9 <0.0 0.9 0.8 0.5 0.4 0.04
BH 1.59 73.8 0.18 109 0.01 0.9 <0.0 0.9 0.8 0.4 0.5 0.02
Mi1.5 1 56.9 12 64.4 <0.0 0.8 0.1 0.9 <0.0 0.4 0.5 0.02
Mi2 0.6 63.1 5.8 64.8 <0.0 0.9 <0.0 0.9 <0.0 0.4 0.5 0.01
Le1 0.2 68.1 0.8 65.2 <0.0 0.9 0.01 0.9 0.2 0.5 0.4 0.007
Ko1 0.1 68.7 0.1 65.3 <0.0 0.9 <0.0 0.9 0.4 0.5 0.4 0.004
Ko1.5 0.08 68.9 0.02 65.3 <0.0 1 0 0.9 0.7 0.5 0.4 0.002
CA1 0.005 68.7 0.2 65.4 0 0.9 <0.0 1 <0.0 0.5 0.4 <0.00
T1.5 0.005 68.9 0 65.4 0 1 0 1 1 0.5 0.4 <0.00
T1 0.005 68.9 0 65.4 0 1 0 1 1 0.5 0.4 <0.00
With these time-based error metrics we can now compute new corresponding rates. They are like the classic ones, but are redefined to use the time-based values:

- FPR = tFP / (tTN + tFP)
- TPR = tTP / (tTP + tFN)
- TNR = tTN / (tTN + tFP)
- FNR = tFN / (tTP + tFN)
- Precision = tTP / (tTP + tFP)
- Accuracy = (tTP + tTN) / (tTP + tTN + tFP + tFN)
- ErrorRate = (tFN + tFP) / (tTP + tTN + tFP + tFN)
- FMeasure1 = 2 · (Precision · TPR) / (Precision + TPR)   (FMeasure with beta = 1)

This error metric is implemented in the Botnet Detectors Comparer tool, which is publicly available for download and described in the previous Subsection.

8. Comparison of the results of the detection methods

The three detection methods were executed on each of the five testing datasets described in Section 6.2.2. Each method added its flow predictions to each dataset file, so there are five files to compare using the methodology described in Section 7.

To better understand the implications of comparing these results, the following baseline algorithms were added: the AllPositive algorithm, which always predicts Botnet; the AllNegative algorithm, which always predicts Normal; and the AllBackground algorithm, which always predicts Background. They are analyzed alongside the BClus, CAMNEP and BotHunter algorithms. The names of all the algorithms compared are encoded in Table 6.

The next Subsections compare the results on each of the five testing scenarios of the dataset.

Fig. 5 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 1.
Fig. 6 – Comparison of the running error metrics for Scenario 1.
8.1. Adaptation of the BotHunter results

The comparison of the results is done by reading the ground-truth label of each NetFlow and comparing it to the predicted label of each NetFlow. However, BotHunter does not read NetFlow files and does not output a prediction label for each NetFlow, making the comparison more difficult.

To solve this issue we ran BotHunter on the original pcap files and obtained, for each pcap, a list of alerts. These alerts include the date, the name of the alert, the protocol, the source IP address, the source port, the destination IP address and the destination port. Then, we searched for the NetFlow corresponding to each alert and assigned it the label Botnet. The rest of the NetFlows were labeled as Normal.

With this label assignment procedure it was possible to add the BotHunter method to the comparison; a sketch of this matching step is shown below.

The next Subsections compare the results on each scenario.

8.2. Comparison of results in Scenario 1

This scenario corresponds to an IRC-based botnet that sent spam for almost six and a half hours.
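The alert-to-NetFlow labeling step described in Section 8.1 could be sketched as follows. This is only an illustration: the alert fields mirror the ones listed above, but the exact input formats, field names and the one-second time tolerance are assumptions, not the actual scripts used for this paper.

```python
from datetime import timedelta

def label_flows_with_alerts(flows, alerts, tolerance=timedelta(seconds=1)):
    """Label as 'Botnet' every NetFlow matching a BotHunter alert.

    A flow matches an alert when the protocol, source/destination IP
    addresses and ports are equal and the flow start time is within
    `tolerance` of the alert date. Every other flow is labeled 'Normal'.

    flows  : dicts with start_time, proto, src_ip, src_port, dst_ip, dst_port
    alerts : dicts with date, proto, src_ip, src_port, dst_ip, dst_port
    """
    labeled = []
    for flow in flows:
        label = 'Normal'
        for alert in alerts:
            same_endpoints = (flow['proto'] == alert['proto'] and
                              flow['src_ip'] == alert['src_ip'] and
                              flow['src_port'] == alert['src_port'] and
                              flow['dst_ip'] == alert['dst_ip'] and
                              flow['dst_port'] == alert['dst_port'])
            close_in_time = abs(flow['start_time'] - alert['date']) <= tolerance
            if same_endpoints and close_in_time:
                label = 'Botnet'
                break
        labeled.append((flow, label))
    return labeled
```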

Table 8 – Comparison of error metrics for the methods in Scenario 2.
Name tTP tTN tFP tFN TPR TNR FPR FNR Prec Acc ErrR FM1
AllPo 49.9 0 47 0 1 0 1 0 0.5 0.5 0.4 0.68
BClus 15.6 37.1 9.8 34.2 0.3 0.7 0.2 0.6 0.6 0.5 0.4 0.41
Fd1 14.4 36.5 10.4 35.5 0.2 0.7 0. 0.7 0.5 0.5 0.4 0.38
Fd1.5 9.3 39.1 7.8 40.5 0.1 0.8 0.1 0.8 0.5 0.5 0.5 0.27
Fd2 7.9 40.7 6.2 42 0.1 0.8 0.1 0.8 0.5 0.5 0.4 0.24
Fs1 6.8 45.9 1 43 0.1 0.9 <0.0 0.8 0.8 0.5 0.4 0.23
Fs1.5 6 46.3 0.6 43.8 0.1 0.9 <0.0 0.8 0.8 0.5 0.4 0.21
X1 5.3 46.7 0.2 44.5 0.1 0.9 <0.0 0.8 0.9 0.5 0.4 0.19
X1.5 4.3 46.8 0.1 45.6 <0.0 0.9 <0.0 0.9 0.9 0.5 0.4 0.15
Fs2 4.2 46.5 0.4 45.7 <0.0 0.9 <0.0 0.9 0.9 0.5 0.4 0.15
BH 1.65 46.9 0.05 75 0.02 0.99 <0.0 0.9 0.9 0.3 0.6 0.04
Mi1 1.1 35.8 11.1 48.8 <0.0 0.7 0.2 0.9 <0.0 0.3 0.6 0.03
Mi1.5 0.6 39.1 7.8 49.2 <0.0 0.8 0.1 0.9 <0.0 0.4 0.5 0.02
X2 0.5 46.9 0.02 49.4 <0.0 1 0 0.9 0.9 0.4 0.5 0.02
CA1 0.2 46.9 0.04 49.6 <0.0 0.9 <0.0 0.9 0.8 0.4 0.5 0.01
Ko1 0.1 46.9 0.08 49.8 <0.0 0.9 <0.0 0.9 0.6 0.4 0.5 0.005
Mi2 0.1 46.3 0.6 49.8 <0.0 0.9 <0.0 0.9 0.1 0.4 0.5 0.005
Le1 0.1 46.1 0.8 49.8 <0.0 0.9 <0.0 0.9 0.1 0.4 0.5 0.005
Lv1 0.1 46.6 0.3 49.8 <0.0 0.9 <0.0 0.9 0.2 0.4 0.5 0.004
Xd1 0.07 44 2. 49.8 <0.0 0.9 <0.0 0.9 <0.0 0.4 0.5 0.003
T1 0.03 47 0 49.9 <0.0 1 0 0.9 1 0.4 0.5 0.001
Fig. 7 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 2.
The error metrics for this scenario are shown in Table 7, which is ordered by FMeasure1. This Table shows that the AllPositive algorithm had the best FMeasure, although it had a 100% FPR. The BClus algorithm had a FMeasure of 0.48 and an FPR of 40%. The BotHunter algorithm had a FMeasure of 0.02 and an FNR of 98%. The CAMNEP (CA1) algorithm had a low FMeasure and a low FPR. The bold text in Tables 7, 8, 9, 10 and 11 identifies the main algorithms compared in this work, i.e. BClus, CAMNEP and BotHunter; the rest are the internal CAMNEP algorithms and the AllPositive algorithm.

A simplified comparison between the BClus, CAMNEP and AllPositive algorithms is shown in Fig. 5. Although the AllPositive algorithm had a 100% TPR and FPR, it can be seen that it had a Precision, Accuracy and ErrorRate around 50% and a FMeasure of more than 60%. The BClus algorithm had between 40% and 60% for the TPR, FPR, TNR and FNR metrics and nearly 50% for the FMeasure. The CAMNEP algorithm had a value near 0% for the TPR, near 100% for the TNR and near 0% for the FMeasure.

The apparently good results of the AllPositive algorithm may have an explanation. This algorithm always predicts Botnet, which gives a TPR of 100%, a precision of 50% and a FMeasure of 66%. However, these metrics were computed using only the Botnet and Normal labels and omitting the Background labels. The Background labels were not used for computing the error metrics because they were neither Normal nor Botnet. Therefore, the only traffic that this algorithm can mis-classify is the Normal traffic. However, the amount of Normal traffic in the dataset is considerably smaller than the rest.
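As a quick check of these numbers against the time-based values reported for the AllPositive algorithm in Table 7 (tTP = 65.5, tFP = 69, tFN = 0), and allowing for the rounding used in the table:

Precision = tTP / (tTP + tFP) = 65.5 / (65.5 + 69) ≈ 0.49
TPR = tTP / (tTP + tFN) = 65.5 / (65.5 + 0) = 1
FMeasure1 = 2 · (Precision · TPR) / (Precision + TPR) ≈ (2 · 0.49) / 1.49 ≈ 0.65

which agrees with the FM1 value of 0.65 listed for AllPo in Table 7 and with the roughly 66% mentioned above.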

Fig. 8 – Comparison of the running error metrics for Scenario 2.
Table 9 – Comparison of error metrics for the methods in Scenario 6.
Name tTP tTN tFP tFN TPR TNR FPR FNR Prec Acc ErrR FM1
Fs1 22.5 20.7 0.2 6.8 0.7 0.9 <0.0 0.2 0.9 0.8 0.1 0.86
Fd1 23.2 17.6 3.3 6 0.7 0.8 0.1 0.2 0.8 0.8 0.1 0.83
Fs1.5 20.8 20.8 0.1 8.5 0.7 0.9 <0.0 0.2 0.9 0.8 0.1 0.82
Fd1.5 19.9 18.6 2.3 9.3 0.6 0.8 0.1 0.3 0.8 0.7 0.2 0.77
Fd2 19.1 19.3 1.6 10.1 0.6 0.9 <0.0 0.3 0.9 0.7 0.2 0.76
Fs2 17.7 20.8 0.1 11.5 0.6 0.9 <0.0 0.3 0.9 0.7 0.2 0.75
AllPo 29.3 0 21 0 1 0 1 0 0.5 0.5 0.4 0.73
X1 5.7 20.9 0.04 23.6 0.1 0.9 <0.0 0.8 0.9 0.5 0.4 0.32
Xd1.5 3.6 17.3 3.6 25.7 0.1 0.8 0.1 0.8 0.5 0.4 0.5 0.19
BH 2.53 20.9 0.02 37.3 0.06 0.99 <0.0 0.93 0.98 0.38 0.61 0.11
Xd2 3.6 17.3 3.6 25.7 0.1 0.8 0.1 0.8 0.5 0.4 0.5 0.19
Xd1 3.6 17.3 3.6 25.7 0.1 0.8 0.1 0.8 0.4 0.4 0.5 0.19
X1.5 1.4 21 0 27.8 <0.0 1 0 0.9 1 0.4 0.5 0.09
CA1 0.6 20.9 0.07 28.6 <0.0 0.9 <0.0 0.9 0.9 0.4 0.5 0.04
BClus 0.6 20.2 0.7 28.6 <0.0 0.9 <0.0 0.9 0.4 0.4 0.5 0.04

This imbalance made the AllPositive have better results than it should. This algorithm is useful as a baseline for evaluating detection methods and datasets, but it is useless in a real network.

To better appreciate the inner workings of the detection methods during the analysis of this scenario, we plotted the accumulated and running error metrics for each comparison time frame in Fig. 6. This Figure shows that the FPR of the BClus method was high on the first time frames, but after that it kept going down until the final 40%. On the sixth time frame the BClus method started to detect botnets with a 100% TPR until near the twelfth time frame. While still having a huge amount of FP, the BClus method managed to have a final FMeasure of 48%. The CAMNEP and BotHunter algorithms had low values during the whole scenario.

8.3. Comparison of results in Scenario 2

In this scenario, the same IRC-based botnet as in scenario 1 sent SPAM for 4.21 h.

The error metrics for this scenario are shown in Table 8. The AllPositive algorithm had the best FMeasure. The BClus algorithm had a FMeasure of 0.41 and an FPR of 20%. The BotHunter algorithm had an FMeasure of 0.04 and an FNR of 97%. The CAMNEP algorithm had a FMeasure of 0.01 and a very small FPR.

The simplified comparison for this scenario is shown in Fig. 7. Although it was the same bot as in scenario 1 and it performed almost the same actions, all the algorithms gave different results. The CAMNEP method still had a large amount of tFN, but despite the low 1% FMeasure, it was 55 times better than itself on scenario 1. Its Precision was high because there were almost no tFP, independently of the amount of tTP. Regarding the BClus method, it had a lower TPR than on scenario 1, but also a lower FPR, which led to a comparatively better FMeasure value. This scenario is a good example of the variability in the network due to the presence of Background traffic. The same bot, generating the same type and amount of traffic, obtained different error metrics.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 8. The BClus method started with a large FPR, but after the fifth time frame it started to detect botnets correctly and its FMeasure value improved. The TPR, FPR and FMeasure values for the BClus method decreased until the end of the scenario, suggesting that the final values could be even lower. The BotHunter algorithm had an FM1 close to 20% on the first time frames but then it quickly dropped to 0.04. The CAMNEP error metrics remained low during the whole scenario.
Fig. 9 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 6.
Fig. 10 – Comparison of the running error metrics for Scenario 6.
8.4. Comparison of results in Scenario 6

The botnet in this scenario scanned SMTP (Simple Mail Transfer Protocol) servers for two hours and connected to several RDP (Remote Desktop Protocol) services. However, it did not send any SPAM and did not attack. The C&C server used a proprietary protocol that connected every 33 s and sent an average of 5500 bytes on each connection.

The error metrics for this scenario can be seen in Table 9. The AllPositive algorithm had a FMeasure of 0.73, far better than the FMeasures of BClus and CAMNEP, which were both 0.04. The BotHunter algorithm had a better FMeasure than the BClus and CAMNEP algorithms because some of the IP addresses used in the SMTP connections were blacklisted in its static detection rules as part of the RBN (Russian Business Network).

Table 10 – Comparison of error metrics for the methods in Scenario 8.
Name tTP tTN tFP tFN TPR TNR FPR FNR Prec Acc ErrR FM1
AllPo 233.4 0 230 0 1 0 1 0 0.5 0.5 0.4 0.67
Fs1 74.4 220.7 9.2 159 0.3 0.9 <0.0 0.6 0.8 0.6 0.3 0.46
Ko1 70.7 228.9 1.08 162.7 0.3 0.9 <0.0 0.6 0.9 0.6 0.3 0.46
Ko1.5 68.4 229.3 0.6 164.9 0.2 0.9 <0.0 0.7 0.9 0.6 0.3 0.45
Ko2 53.4 229.5 0.4 180 0.2 0.9 <0.0 0.7 0.9 0.6 0.3 0.37
Fs1.5 54.6 222.7 7.2 178.8 0.2 0.9 <0.0 0.7 0.8 0.5 0.4 0.36
Fs2 28 224.9 5 205.3 0.1 0.9 0.02 0.8 0.8 0.5 0.4 0.21
BClus 23.5 152.8 77.1 209.9 0.1 0.6 0.3 0.8 0.2 0.3 0.6 0.14
Fd1 15.6 196.1 33.8 217.7 <0.0 0.8 0.14 0.9 0.3 0.4 0.5 0.1
Fd1.5 14.1 203.9 26 219.2 <0.0 0.8 0.1 0.9 0.3 0.4 0.5 0.10
CA1 10.8 229.8 0.1 222.6 <0.0 0.9 <0.0 0.9 0.9 0.5 0.4 0.08
Fd2 11.6 209.8 20.1 221.8 <0.0 0.9 <0.0 0.9 0.3 0.4 0.5 0.08
X1 8.5 229.7 0.2 224.9 <0.0 0.9 <0.0 0.9 0.9 0.5 0.4 0.07
Mi1 4.6 213.1 16.8 228.7 <0.0 0.9 <0.0 0.9 0.2 0.4 0.5 0.03
Xd1.5 3.9 194.1 35.8 229.5 <0.0 0.8 0.1 0.9 <0.0 0.4 0.5 0.02
Xd2 3.9 194.4 35.5 229.5 <0.0 0.8 0.1 0.9 <0.0 0.4 0.5 0.02
Xd1 3.9 193.5 36.4 229.5 <0.0 0.8 0.1 0.9 <0.0 0.4 0.5 0.02
Mi1.5 1 219.2 10.7 232.4 <0.0 0.9 <0.0 0.9 <0.0 0.4 0.5 0.008
X1.5 0.5 230 0 232.8 <0.0 1 0 0.9 1 0.4 0.5 0.005
BH 0 229 0.11 309 0 0.99 <0.0 1 0 0.42 0.57 –
Fig. 11 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 8.
It should be noted that the scenarios were captured in August 2011 and the BotHunter rules are from January 2013, so it is possible that these IP addresses were blacklisted after the capture.

The simplified comparison for this scenario is shown in Fig. 9. The CAMNEP and BClus methods behaved almost identically. The only difference is that the BClus method had ten times more FPR, and therefore the CAMNEP method had a better Precision and a slightly better FMeasure.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 10. This Figure shows that both the BClus and CAMNEP methods detected tFP values until half of the scenario, and then they had some tTP. However, the tTPs were not enough to improve the FMeasure1 significantly.

8.5. Comparison of results in Scenario 8

In this scenario, the botnet contacted a lot of different Chinese C&C hosts and received large amounts of encrypted data. It also scanned and cracked the passwords of machines using the DCERPC protocol, both on the Internet and on the local network, for 19 h. There were more attacks over a longer period of time than in the previous scenarios.

The error metrics for this scenario can be seen in Table 10. The AllPositive algorithm had the best FMeasure. Six algorithms were better than BClus, which had a FMeasure of 0.14 and an FPR of 30%.

Fig. 12 – Comparison of the running error metrics for Scenario 8.
Table 11 – Comparison of error metrics for the methods in Scenario 9.
Name tTP tTN tFP tFN TPR TNR FPR FNR Prec Acc ErrR FM1
AllPo 58.3 0 58 0 1 0 1 0 0.5 0.5 0.4 0.66
Fd1 21.7 47.1 10.8 36.6 0.3 0.8 0.1 0.6 0.6 0.5 0.4 0.47
Fd1.5 17.7 49.2 8.7 40.6 0.3 0.8 0.1 0.6 0.6 0.5 0.4 0.41
Fd2 13.9 50.9 7.01 44.4 0.2 0.8 0.1 0.7 0.6 0.5 0.4 0.35
Xd1 10.3 48.5 9.4 48.0 0.1 0.8 0.1 0.8 0.5 0.5 0.4 0.26
BClus 10.1 46.4 11.5 48.2 0.1 0.8 0.2 0.8 0.4 0.4 0.5 0.25
Fs1 8.3 55.2 2.7 50 0.1 0.95 <0.0 0.8 0.7 0.5 0.4 0.23
Fs1.5 7.8 55.7 2.2 50.4 0.1 0.9 <0.0 0.8 0.7 0.5 0.4 0.23
Xd1.5 7.04 53.3 4.6 51.3 0.1 0.9 <0.0 0.8 0.6 0.5 0.4 0.2
CA1 5.5 57.7 0.2 52.8 <0.0 0.9 <0.0 0.9 0.9 0.5 0.4 0.17
Fs2 4.3 56.6 1.3 54 <0.0 0.9 <0.0 0.9 0.7 0.5 0.4 0.13
X1 2.8 56.9 1.09 55.5 <0.0 0.9 <0.0 0.9 0.7 0.5 0.4 0.09
Mi1 2.9 47.3 10.6 55.4 <0.0 0.8 0.1 0.9 0.2 0.4 0.5 0.08
CA1.5 2.3 57.8 0.1 55.9 <0.0 0.9 <0.0 0.9 0.9 0.5 0.4 0.07
Mi1.5 2.2 51.3 6.6 56.1 <0.0 0.8 0.1 0.9 0.2 0.4 0.5 0.06
BH 1.76 57.9 0.06 86.9 0.02 0.9 <0.0 0.9 0.9 0.4 0.5 0.03
X1.5 0.9 57.3 0.6 57.4 <0.0 0.9 <0.0 0.9 0.6 0.5 0.4 0.03
Mi2 0.4 57.2 0.7 57.9 <0.0 0.9 <0.0 0.9 0.3 0.4 0.5 0.01
Ko1 0.2 57.8 0.1 58.1 <0.0 0.9 <0.0 0.9 0.6 0.4 0.5 0.008
Le1 0.1 57.2 0.7 58.1 <0.0 0.9 <0.0 0.9 0.1 0.4 0.5 0.006
X2 0.1 57.9 0.05 58.2 <0.0 0.9 <0.0 0.9 0.6 0.4 0.5 0.004
Ko1.5 0.06 57.9 0.03 58.3 <0.0 0.9 <0.0 0.9 0.6 0.4 0.5 0.002
Lv1 0.04 57.6 0.3 58.3 <0.0 0.9 <0.0 0.9 0.1 0.4 0.5 0.001
T1 0.04 58 0 58.3 <0.0 1 0 0.9 1 0.4 0.5 <0.00
CA2 0.04 58 0 58.3 <0.0 1 0 0.9 1 0.4 0.5 <0.00
Le1.5 0.03 57.7 0.2 58.3 <0.0 0.9 <0.0 0.9 0.1 0.4 0.5 <0.00
T1.5 0.02 58 0 58.3 0 1 0 1 1 0.4 0.5 <0.00
Ko2 0.02 57.9 0.01 58.3 0 1 0 1 0.5 0.4 0.5 <0.00

The CAMNEP algorithm had a FMeasure of 0.08 and a very low FPR. The BotHunter algorithm could not detect a single TP, so it was not possible to compute its FMeasure.

The simplified comparison for this scenario is shown in Fig. 11. The BClus method had a low TPR of about 10%; however, it was more than twice CAMNEP's TPR value. The TNR value was near 60% for BClus and near 90% for CAMNEP. The FPR of BClus was high, near 30%; however, BClus had a FMeasure value that is twice the value for CAMNEP.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 12. This Figure shows that none of the error metrics exceeded 40%, which means that this scenario was very difficult for the methods. Until the fiftieth time frame both the BClus and CAMNEP TPR values grew at almost the same rate. However, after that, the BClus method grew a little faster. The FPR of the BClus method was very high almost from the start of the scenario. The BotHunter algorithm had very low measurements during the whole scenario.

8.6. Comparison of results in Scenario 9

In this scenario, ten hosts were infected using the same Neris botnet as in scenarios 1 and 2. For five hours, more than 600 SPAM mails were successfully sent.

Fig. 13 – Simplified comparison of error metrics for the BClus, CAMNEP and AllPositive algorithms on Scenario 9.
Fig. 14 – Comparison of the running error metrics for Scenario 9.
The error metrics for this scenario can be seen in Table 11. The AllPositive algorithm had the best FMeasure. The BClus algorithm had a FMeasure of 0.25, an FPR of 20% and a TPR of 10%. The CAMNEP algorithm had a FMeasure of 0.17, a very low FPR and a very low TPR. The BotHunter algorithm had a low FMeasure of 0.03 and a high FNR of 98%.

The simplified comparison for this scenario is shown in Fig. 13. It can be seen that the TPR value for the BClus method was almost twice the value for CAMNEP. Also, the FPR value of BClus was 40 times larger than the CAMNEP value. However, the FMeasure value of CAMNEP was almost 70% of the value of BClus.

The inner workings of the algorithms can be seen in the running metrics shown in Fig. 14. Almost from the start of the scenario, the TPR and FMeasure of the BClus and CAMNEP methods grew fast. However, after the twentieth time frame, both FMeasure values started to decrease. The FPR value of BClus was relatively low compared to the previous scenarios. The BotHunter algorithm presented very low values during the whole scenario despite the fact that there were ten bots being executed.

9. Conclusions

We conclude that our comparison of detection methods using a real dataset greatly helped to improve our research. It showed us how and why the methods were not optimal, which botnet behaviors were not being detected and how the dataset should be improved. It also showed us the need for a comparison methodology and a proper error metric.

We also conclude, as was recommended by Aviv and Haeberlen (2011), that a joint effort to create a comparison platform of detection methods could greatly enhance the results achieved in the area. We believe that such a platform could take advantage of our comparison methodology.

The usage of a large and real dataset, despite not having a great amount of different botnets, showed us which phases of the botnet behavior were easier to detect by the methods, and the difficulties of working with unknown background data.

Regarding our detection methods, BClus showed large FPR values on most scenarios but also a large TPR; we are already working on improving it. The CAMNEP method had a low FPR during most of the scenarios, but at the expense of a low TPR. Each of them seems best for a different botnet behavior. The comparison against the BotHunter method showed that in real environments it could still be useful to have blacklists of known malicious IP addresses.

Despite being biased by the really small amount of labeled normal traffic, the AllPositive baseline algorithm was useful to visualize how the error metrics should always be carefully considered.

We also conclude that, although useful and enough for our purposes, the comparison methodology can be improved to show how many of the infected IP addresses were detected by the algorithms. The new error metric proposed, which takes into consideration the IP addresses and time, allowed us to easily compare the algorithms from the perspective of a network administrator.

The dataset created, despite being paramount for the comparison, should be improved. We are already working on adding more botnets, more diverse attacks and more normal labels. A better and larger dataset is already being built.
Acknowledgment

This work was supported by the project of the Czech Ministry of Interior No. VG20122014079.

References

Argus. Auditing network activity. http://qosient.com/argus/; 2013.
AsSadhan B, Moura JMF, Lapsley D. Periodic behavior in botnet command and control channels traffic. In: GLOBECOM: Global Telecommunications Conference IEEE; 2009. pp. 1–6.
Aviv A, Haeberlen A. Challenges in experimenting with botnet detection systems. In: USENIX 4th CSET Workshop, San Francisco, CA; 2011. pp. 6–6.
Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E. Scalable, behavior-based malware clustering. In: NDSS: Network and Distributed System Security Symposium; 2009.
Cho K, Mitsuya K, Kato A. Traffic data repository at the WIDE project. In: ATEC '00: Annual conference on USENIX Annual Technical Conference; 2000. pp. 51–2.
Claise B. Specification of the IP Flow Information Export (IPFIX) protocol for the exchange of IP traffic flow information. http://tools.ietf.org/html/rfc5101; 2008.
Dainotti A, King A, Papale F, Pescapé A. Analysis of a /0 stealth scan from a botnet. In: IMC '12: ACM conference on Internet measurement conference; 2012. pp. 1–4.
Davis JJ, Clark JA. Data preprocessing for anomaly based network intrusion detection: a review. J Comput Secur 2011;30:353–75.
Ertoz L, Eilertson E, Lazarevic A, Tan PN, Kumar V, Srivastava J, et al. MINDS – Minnesota intrusion detection system. In: Next generation data mining. MIT Press; 2004. pp. 199–218.
Fontugne R, Borgnat P, Abry P, Fukuda K. MAWILab: combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. In: CoNEXT 2010: ACM Conference on Emerging Networking Experiments and Technology; 2010. pp. 8–8.
García S. Malware capture facility project. https://mcfp.felk.cvut.cz/; 2013 [accessed 06.03.14].
García S. Botnet detectors comparer. http://sourceforge.net/projects/botnetdetectorscomparer/; 2014 [accessed 06.03.14].
García S, Zunino A, Campo M. Botnet behavior detection using network synchronism. In: Privacy, Intrusion Detection and Response: Technologies for Protecting Networks. IGI Global; 2012. pp. 122–44.
García S, Zunino A, Campo M. Survey on network-based botnet detection methods. J Secur Commun Networks 2013;7:878–903. John Wiley & Sons.
Gu G, Porras P, Yegneswaran V, Fong M, Lee W. BotHunter: detecting malware infection through IDS-driven dialog correlation. In: Proceedings of 16th USENIX Security Symposium; 2007. pp. 1–16.
Gu G, Zhang J, Lee W. BotSniffer: detecting botnet command and control channels in network traffic. In: NDSS: Proc. 15th Network and Distributed System Security Symposium; 2008.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 2009;11.
Hegna A. Visualizing spatial and temporal dynamics of a class of IRC-based botnets [PhD thesis]; 2010.
Jacobson V, Leres C, McCanne S. tcpdump/libpcap. www.tcpdump.org; Lawrence Berkeley Laboratory; 1997.
Kotsiantis S, Kanellopoulos D, Pintelas P. Handling imbalanced datasets: a review. J GESTS Int Transactions Comput Sci Eng 2006;30:25–36.
Lakhina A, Crovella M, Diot C. Diagnosing network-wide traffic anomalies. ACM SIGCOMM Comput Commun Rev 2004;34:357–74.
Lakhina A, Crovella M, Diot C. Mining anomalies using traffic feature distributions. ACM SIGCOMM Comput Commun Rev 2005;35:217–28.
Lee K, Kim J, Kwon K, Ha Yn, Kim S. DDoS attack detection method using cluster analysis. Expert Syst Appl 2008;34:1659–65.
Li P, Liu L, Gao D, Reiter MK. On challenges in evaluating malware clustering. In: Proceedings of the 13th international conference on recent advances in intrusion detection; 2010. pp. 238–55.
Lu W, Tavallaee M, Rammidi G, Ghorbani A. BotCop: an online botnet traffic classifier. In: 2009 Seventh Annual Communication Networks and Services Research Conference; 2009. pp. 70–7.
Maloof MA. Some basic concepts of machine learning and data mining. In: Machine learning and data mining for computer security. London: Springer; 2006. pp. 23–43.
Moon TK. The expectation-maximization algorithm. Signal Process Mag IEEE 1996:47–60.
NexGinRC. Endpoint worm scan dataset [accessed 06.03.14]. http://nexginrc.org/Datasets/DatasetDetail.aspx?pageID=24; 2013.
Pevný T, Rehák M, Grill M. Identifying suspicious users in corporate networks. In: Proceedings of workshop on information forensics and security; 2012. pp. 1–6.
Ramchurn SD, Jennings NR, Sierra C, Godo L. Devising a trust model for multi-agent interactions using confidence and reputation. Int J Appl Artif Intell 2004;18:833–52.
Rehak M, Pechoucek M, Gregor M. Trust modeling with context representation and generalized identities. In: Proceedings of the 11th international workshop on Cooperative Information Agents XI; 2007. pp. 298–312.
Rehak M, Pechoucek M, Grill M, Stiborek J, Bartos K, Celeda P. Adaptive multiagent system for network traffic monitoring. Intell Syst IEEE 2009;24:16–25.
Rettinger V, Nickles M, Tresp V. Learning initial trust among interacting agents. In: Proceedings of the 11th international workshop on Cooperative Information Agents XI; 2007. pp. 313–27.
Rossow C, Dietrich CJ, Grier C, Kreibich C, Paxson V, Pohlmann N, et al. Prudent practices for designing malware experiments: status quo and outlook. In: IEEE Symposium on Security and Privacy; 2012. pp. 65–79.
Rubinstein BIP, Nelson B, Huang L, Joseph AD, Lau S, Rao S, et al. Stealthy poisoning attacks on PCA-based anomaly detectors. SIGMETRICS Perform Eval Rev 2009;37:73–4.
Saad S, Traore I, Ghorbani A, Sayed B, Zhao D, Lu W, et al. Detecting P2P botnets through network behavior analysis and machine learning. In: Privacy, Security and Trust (PST), Ninth Annual International Conference on; 2011. pp. 174–80.
Sabater J, Sierra C. Review on computational trust and reputation models. J Artif Intell Rev 2005;24:33–60.
Salgarelli L, Gringoli F, Karagiannis T. Comparing traffic classifiers. In: ACM SIGCOMM Computer Communication Review; 2007. pp. 65–8.
Scarfone K, Mell P. Guide to Intrusion Detection and Prevention Systems (IDPS). Technical report. NIST; 2007.
Shiravi A, Shiravi H, Tavallaee M, Ghorbani A. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. J Comput Secur 2012;31:357–74.
Sperotto A, Sadre R, Vliet FV, Pras A. A labeled data set for flow-based intrusion detection. In: IPOM '09: Proceedings of the 9th IEEE International Workshop on IP Operations and Management; 2009. pp. 39–50.
Sridharan A, Ye T, Bhattacharyya S. Connectionless port scan detection on the backbone. In: IPCCC 2006: Performance, Computing, and Communications Conference; 2006. p. 576.
Szabó G, Orincsay D, Malomsoky S, Szabó I. On the validation of traffic classification algorithms. In: PAM 2008: 9th International Conference, Passive and Active Network Measurement; 2008. pp. 72–81.
Tavallaee M, Stakhanova N, Ghorbani A. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions Syst Man Cybern Part C Appl Rev 2010;40:516–24.
Wurzinger P, Bilge L, Holz T, Goebel J, Kruegel C, Kirda E. Automatically generating models for botnet detection; 2010. pp. 232–49.
Xu K, Zhang ZL. Profiling internet backbone traffic: behavior models and applications. In: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications; 2005. pp. 169–80.
Xu K, Zhang Z, Bhattacharyya S. Reducing unwanted traffic in a backbone network. In: SRUTI 05: Steps to Reducing Unwanted Traffic on the Internet workshop; 2005. pp. 9–15.
Yager RR. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. Syst Man Cybern IEEE Transactions 1988:183–90.
Zhao D, Traore I, Sayed B, Lu W, Saad S, Ghorbani A, et al. Botnet detection based on traffic behavior analysis and flow intervals. J Comput Secur 2013;39:2–16.

Sebastián García is a PhD student at UNICEN University (Argentina) and a researcher in the ATG group at the Czech Technical University. He is also a research fellow at the National Scientific and Technical Research Council of Argentina (CONICET) and a teacher at the UFASTA University. His research interests include network-based botnet behavior detection, anomaly detection, penetration testing, honeypots, malware detection, keystroke dynamics and machine learning. His recent projects focus on using unsupervised and semi-supervised machine learning techniques to detect botnets on large networks based on their behavioral models.

Martin Grill holds a master degree in Software Development from the Faculty of Nuclear Sciences and Physical Engineering of the Czech Technical University in Prague. At the present time he is a member of the Agent Technology Center, a researcher at CESNET, and a PhD student at the Department of Cybernetics of the Czech Technical University in Prague.

Jan Stiborek holds a master degree in Software Development from the Faculty of Nuclear Sciences and Physical Engineering of the Czech Technical University in Prague. At the present time he is pursuing a PhD degree in Artificial Intelligence and Biocybernetics at the Department of Cybernetics, FEE CTU. His current professional interests focus on network security, network simulation and autonomous adaptation of intrusion detection systems.

Alejandro Zunino (http://www.exa.unicen.edu.ar/~azunino) received a Ph.D. degree in Computer Science from the National University of the Center of Buenos Aires (UNICEN) in 2003, and his M.Sc. in Systems Engineering in 2000. He is a full Adjunct Professor at UNICEN, member of the ISISTAN Research Institute and Independent Researcher of the National Scientific and Technical Research Council (CONICET). His research areas are Distributed Computing and Software Engineering. Contact him at azunino@conicet.gov.ar.