Abstract
Network Anomaly Detection Systems (NADSs) are gaining an increasingly important role in most network defence systems for detecting and preventing potential threats. This paper discusses various aspects of anomaly-based Network Intrusion Detection Systems (NIDSs). It explains cyber kill chain models and the cyber-attacks that compromise network systems, and describes various Decision Engine (DE) approaches, including recent ensemble learning and deep learning approaches. The paper also provides details about benchmark datasets for training and validating DE approaches. The main applications of NADSs, such as Data Centers, the Internet of Things (IoT), and Fog and Cloud Computing, are also discussed. Finally, we present several experimental explanations, followed by various promising research directions.
Keywords: Intrusion Detection System (IDS), Network Anomaly Detection System (NADS), data pre-processing, Decision Engine (DE)
1. Introduction
∗ Corresponding author
Email addresses: nour.moustafa@unsw.edu.au (Nour Moustafa), J.Hu@adfa.edu.au (Jiankun Hu), J.Slay@latrobe.edu.au (Jill Slay)
• The packet decoder acquires portions of raw network traffic using audit data collection tools, such as Tcpdump and Libpcap, which transfer each portion into the pre-processor for handling.
• The pre-processor captures a set of features from the raw audit data which are used later in the DE sensor. A typical pre-processor is the TCP handler which analyses TCP protocols in session flows; for example, the Netflow, Bro-IDS and Argus tools examine different protocols, such as HTTP, DNS, SMTP and UDP.
• The DE sensor receives the extracted features from the pre-processor and builds a model that distinguishes attack observations from normal ones. If an attack is detected, it requests the defence response to raise an alert. A minimal sketch of this pipeline follows.
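To make the three components concrete, the following is a minimal, illustrative Python sketch of the decoder, pre-processor and DE pipeline. The class names, the toy byte-count rule and the sample records are our own assumptions, not components of any specific tool; a real deployment would wrap Tcpdump/Libpcap (decoding), Bro-IDS or Argus (pre-processing) and a trained model (DE).

```python
# Illustrative sketch of the NIDS pipeline: decoder -> pre-processor -> DE.
from dataclasses import dataclass

@dataclass
class Flow:
    src_ip: str
    dst_ip: str
    proto: str
    bytes_sent: int

class PacketDecoder:
    def decode(self, raw_capture):
        # In practice, raw pcap records would be parsed here.
        return [Flow(**record) for record in raw_capture]

class PreProcessor:
    def extract_features(self, flows):
        # Reduce each flow to a numeric feature vector for the DE sensor.
        return [[hash(f.proto) % 3, f.bytes_sent] for f in flows]

class DecisionEngine:
    def __init__(self, byte_threshold=10_000):
        self.byte_threshold = byte_threshold  # toy anomaly rule, assumed

    def detect(self, feature_vectors):
        # Flag a flow as an attack if its volume exceeds the threshold.
        return ["attack" if v[1] > self.byte_threshold else "normal"
                for v in feature_vectors]

raw = [{"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "TCP", "bytes_sent": 420},
       {"src_ip": "10.0.0.9", "dst_ip": "10.0.0.2", "proto": "UDP", "bytes_sent": 90_000}]
flows = PacketDecoder().decode(raw)
alerts = DecisionEngine().detect(PreProcessor().extract_features(flows))
print(alerts)  # ['normal', 'attack'] -> the defence response would raise an alert
```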
Over the last decade, many surveys have been conducted to review IDS technology. Chandola et al. [13] discussed the foundations of anomaly detection approaches and their applicability in different domains. Garcia-Teodoro et al. [11] reviewed statistical, knowledge-based and machine learning anomaly detection methods, as well as their issues. Ahmed et al. [14] described the methods of anomaly detection systems and some challenges of IDS datasets.
In [15], hybrid IDSs were discussed that integrate feature selection and detection methods to improve detection accuracy, but they have the drawback of demanding high computational resources. Peng et al. [16] discussed intrusion detection and prevention techniques that design user profiles and discover variations from them as anomalies. Recently, researchers have surveyed the deployment of IDSs in different applications, such as Internet of Things (IoT)-based IDSs [17] and Cloud-based IDSs [18]. For example, Zarpelao et al. [10] presented a review of IDSs in IoT networks. The authors described detection approaches, IDS deployments and security threats. Sharma et al. [19] explained methodologies for deploying IDSs in VANETs and the VANET Cloud. Recently, Resende and Drummond [20] presented a comprehensive discussion of using Random Forest methods for developing a reliable IDS. Although the existing surveys discussed various aspects of IDSs, our survey provides a holistic review that gives a better understanding of designing anomaly detection in different domains.
The main contributions of this survey include the following.
Table 1: Attacks against computer and network systems that can be identified by NIDSs

Attack type | Properties | Examples
Information Gathering and Probing | Scan computer and network systems to find vulnerabilities; provide lists of vulnerabilities, such as SMBv1 and open ports, to an attacker for exploiting victims | IPsweep, portsweep, SYN scan, FIN scan
User to Root (U2R) | Can breach vulnerabilities to gain the privileges of a system's superuser while starting as a legitimate user | Rootkit, loadmodule
Remote to Local (R2L) | Can transmit packets to a remote system over a network without having an account on that system and gain access to harm the system's operations | Warezclient, warezmaster, spy
Malware | Includes any executable malicious scripts, such as worms and viruses | SQL Slammer worm, Tuareg virus
Flooding attacks | Contain malicious events that massively transmit superfluous requests to disrupt computer resources, such as DoS and Distributed DoS (DDoS) | Buffer overflow, TCP SYN, teardrop, smurf
chain's life cycle assists in designing an effective and reliable NADS that can efficiently discover existing and future malicious activities [22].
An attacker's philosophy almost invariably comprises two phases [22]. The first, the so-called exploitation phase, is a method for controlling the execution flow in the targeted program. At its abstract level, this can be a stack/heap-based buffer overflow in which an intrusively long text overwrites the instruction pointers of the targeted program, but it also includes a full suite of methods which more sophisticated adversaries can use to gain control of a system while its code is running. The second phase is known as the payload phase. After successfully exploiting the execution flow to the payload, this phase performs the aim of the attacker, such as stealing information and/or disrupting computer resources. The payload process is executed through a shellcode terminal which establishes a command prompt on the hacker's computer to execute post-exploitation events. Existing IDSs can identify the attack types listed in Table 1 if their DE approaches are well-designed [23, 24]. Based on the Australian Cyber Security Centre (ACSC) [25] and McAfee threat reports [26], Figure 2 depicts the current variants of attacks which still threaten computer networks and require further research to be discovered using NADSs, as detailed in the following.
attacker could also access private keys, confidential information and secure content which could help other cyber adversaries. Moreover, these vulnerabilities allowed attackers to continually access the private information in systems by sending a wide variety of malicious commands to susceptible servers [25, 26].
[35]. The advantages and disadvantages of both environments are listed in Table 2.
ASG with attack detection. The former does not use any attack detection method prior to generating signatures (e.g., the Polygraph and Honeycomb systems), whilst the latter identifies an attack vector and then creates its signatures (e.g., the Honeycyber and Eudaemon systems).
An ADS creates a normal profile and identifies any variation from it as a suspicious event. It can identify known and zero-day attacks with less effort to construct its profile than an MDS, but it still faces some challenges, presented in Section 8. An SPA examines protocol states, specifically a pair of request-response protocols, such as the HTTP protocol. Although an SPA is roughly similar to an ADS, it relies on vendor-developed profiles of certain protocols and requires information on the relevant network's protocol standard from international standards organisations [24]. As an SPA consumes many computer resources to inspect protocol states and is incompatible with different dedicated operating systems, an ADS is a better defence solution if its DE approach is properly designed [3, 41]. Finally, an HDS applies integrated methods to improve detection accuracy. For example, MDS and ADS methods are combined to identify certain known attack types and zero-day attacks, respectively [6, 23].
• Access point-based IDSs - are installed over access networks that link subscribers to a specific service provider and, across the carrier network, to other network systems (e.g., the Intranet and Internet). An access point IDS should identify abnormal activities through network systems that are connected by LANs and/or wireless LANs. Wireless Intrusion Detection Systems (WIDSs) have been proposed to monitor the radio spectrum of LANs and/or wireless LANs to identify unauthorised access [45]. WIDSs are used to monitor and inspect the traffic of sensors, servers and consoles. However, the heterogeneous sensors of antennas and radios that examine the wireless spectrum demand handling data dimensionality and developing self-adaptive NIDSs to define malicious activities effectively.
• Since many virtual machines are established and destroyed, detecting attacks is a difficult task, as normal users and attackers must be monitored and tracked across data centers.
New IDSs for the above applications should be capable of discovering the known and zero-day attacks discussed in Section 2. Such systems should effectively and efficiently monitor high-speed networks that can exchange data at 10 Gbps or higher. Moreover, they should be scalable and self-adaptive for analysing diverse networks across wide areas in real time.
4. Components of NADS
Figure 3: The generic architecture of a NADS: a data source feeds a data pre-processing module (feature creation, feature reduction, feature conversion and feature normalisation); the extracted features are passed to the training phase (establishing the normal profile) and the validation and testing phase, which trigger the defence response.
With the high speeds and large sizes of current network environments, network data has the characteristics of big data, which is typically defined in terms of volume (i.e., the amount of data), velocity (i.e., the speed of data processing) and variety (i.e., the complexity of the data and to what extent they are of diverse types and dimensions) [55]. As traditional database systems generally cannot process the big data contained in real-world problems, it is vital to use, for example, the Hadoop [56] or MySQL Cluster CGE [57] tools to store and handle a network's big data as a data management unit for NIDS technologies [2, 3].
In real-time processing, network traffic is collected to monitor and detect abnormal activities. Bidirectional or unidirectional network flows are aggregated at choke-points, for example, ingress router and switch devices, to reduce the network's overheads. These devices have limited buffers and simple mechanisms for collecting flows, which can accumulate using only one attribute at a given time, such as source/destination IP addresses or protocols. To address this limitation, the simple random sampling technique is typically applied to select data portions each time. The technique randomly chooses a sample of a given size such that no observation is included more than once, with all subsets of the observations given an equal probability of selection [58]. A minimal sketch is given below.
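The following is a minimal sketch of simple random sampling without replacement over collected flow records; the flow list and the sample size are illustrative assumptions.

```python
# Simple random sampling without replacement over collected flow records.
import random

# Illustrative flow records standing in for aggregated network flows.
flows = [{"id": i, "src_ip": f"10.0.0.{i % 250}"} for i in range(100_000)]

# random.sample never selects the same observation twice, so every subset
# of the chosen size has an equal probability of selection.
sample = random.sample(flows, k=50_000)
print(len(sample), len({f["id"] for f in sample}))  # 50000 50000
```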
To give an example of extracting network features, tools such as tcpdump, Bro-IDS and MySQL Cluster CGE are utilised, as shown in Figure 4. The tcpdump tool is applied to sniff network packets in the format of pcap files. After that, Bro-IDS is used for extracting the flow-based features and general information about different protocol types from the pcap files. The extracted features are stored in a MySQL database to make it easier to label the data.
Old datasets
• The KDD99 and NSL-KDD datasets - the IST group at the Lincoln Laboratory at MIT performed a simulation involving both normal and abnormal traffic in the military network of a U.S. Air Force LAN environment to generate the DARPA 98 dataset using nine weeks of raw tcpdump files [50]. The NSL-KDD dataset [59] is an enhanced version of the KDD99 dataset that tackles some of its drawbacks. Firstly, it does not contain duplicated observations in either the training or testing set. Secondly, the numbers of observations in the training and testing sets are adopted from different portions of the original KDD99 dataset without any duplication. Nevertheless, the KDD99 and NSL-KDD datasets cannot represent contemporary network traffic, as their legitimate and attack behaviours are extremely different from those of current network traffic.
• The CAIDA datasets [60] are collections of different data types for analysing malicious events to validate attack detection approaches, but they are limited to particular types of attacks, such as DDoS, with their traces anonymised to the backbone and restricted to packet headers without payloads. The most common CAIDA dataset is the CAIDA DDoS 2007 anomaly dataset, which includes an hour of anonymised network traffic for DDoS attacks. These datasets do not have a ground truth about the attack activities involved and, moreover, their pcap files were not inspected precisely to elicit features that could discriminate attack activities from normal ones.
• The UNIBS dataset [62] was gathered from the network router of the University of Brescia, Italy, over three days. Its traffic was collected from 20 workstations running the GT client daemon using the tcpdump tool. The raw packets were captured and logged on the disk of a workstation linked to the router across an ATA controller.
• The LBNL dataset [63] was designed at the Lawrence Berkeley National Laboratory (LBNL) and includes header network traces without payloads. The dataset was anonymised to exclude any sensitive information which could identify individual IP addresses. Its network packets were collected from two routers at the LBNL network, which includes about a thousand host systems, for nearly a hundred hours.
• The CDX dataset [66] was synthetically developed by the Cyber Research Center at the US Military Academy. It associates IP addresses found in pcap files with the IP addresses of clients on the internal USMA network. It was created during a network warfare competition for the design of a tagged dataset. It comprises ASNM features generated from the tcpdump capture of malicious and normal TCP communications on network services which are vulnerable to DoS attacks.
• The CTU-13 dataset [67], which was developed at CTU University, consists of a collection of a large amount of botnet and normal traffic involving 13 captures of different botnet scenarios. In each scenario, a particular malware, which used many protocols and executed different actions, was implemented.
New datasets
• The ISCX dataset [68, 69] was designed using the concept of profiles which contain descriptions of attacks and distribution models for a network architecture. Its records were captured from a real-time simulation conducted over seven days of normal network traffic and synthetic attack simulators. Several multi-stage attack scenarios were included to help in evaluating NIDS methods. However, the dataset does not provide the ground truth about attacks to reflect the credibility of labelling and, secondly, the profile concept used to build the dataset could be impossible to apply in a real complex network because of the difficulty of analysing and logging.
• The TUIDS dataset [70] was collected at the Network Security Lab at Tezpur University, India, based on different attack scenarios. Its network packets were captured using the nfdump and gulp tools to capture representative features. The features are categorised into basic, content, time, window and connectionless, with each record labelled as either normal or attack.
• The ADFA dataset [71] was developed at the University of New South Wales to evaluate Linux and Windows HIDSs. It contains host logs that were manually designed using different simulation configurations. The Linux data collection includes system call traces generated by the Linux auditd program and then filtered by size: for the training set, traces outside the range of 300 bytes to 6 kB and, for the validation set, those outside the range of 300 bytes to 10 kB were neglected. Windows XP was used to generate a set of DLL calls of 1828 normal and 5773 attack traces.
Popular techniques for reducing network features. The Association Rule Mining (ARM) [81], Principal Component Analysis (PCA) [82] and Independent Component Analysis (ICA) [83] techniques are widely used for selecting important network features, as described in the following.
• ARM - is a data mining technique used to compute the correlation between two or more variables in a dataset by determining the strongest rules that occur between their values.
• PCA - sorts a set of attributes based on the highest variance of each attribute and generates a new dimensional space of uncorrelated attributes by omitting those with low variances (a short sketch follows this list).
• ICA - is a generative model which generalises the PCA technique. It mines unidentified hidden components from multivariate data, that is, linear mixtures of some hidden variables, using only the assumption that the unknown components are mutually independent and non-normally distributed.
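As a concrete illustration of variance-based feature reduction, the following is a hedged scikit-learn sketch; the synthetic 40-feature matrix and the choice of eight components are assumptions for illustration only.

```python
# PCA-based feature reduction: keep the highest-variance components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))             # 1000 flows x 40 raw features (synthetic)

X_std = StandardScaler().fit_transform(X)   # PCA is variance-based, so standardise first
pca = PCA(n_components=8)                   # retain the 8 highest-variance components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (1000, 8)
print(pca.explained_variance_ratio_.sum())  # fraction of total variance retained
```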
Many studies [77, 78, 84, 85] have used the ARM technique in a NADS to detect abnormal instances. Luo et al. [86] used ARM to construct a set of rules from audit data to establish a normal profile and detect any variation from it as an attack. Yanyan and Yuan [87] developed a partition-based ARM technique that scans the training set twice: in the first scan, the data is divided into many partitions that fit easily in memory while, in the second, the itemsets of the training set are created. A small illustration of rule mining over flow attributes is sketched below.
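The sketch below illustrates the general ARM idea on one-hot encoded flow attributes using the mlxtend library (an assumed dependency; none of the cited studies specifies this implementation). Strong rules describe the normal profile, and flows violating them become candidate anomalies.

```python
# Association rule mining over categorical flow attributes with mlxtend.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative flow attributes; real features would come from the pre-processor.
flows = pd.DataFrame({
    "proto":   ["TCP", "TCP", "UDP", "TCP", "UDP"],
    "service": ["HTTP", "HTTP", "DNS", "HTTP", "DNS"],
    "state":   ["FIN", "FIN", "CON", "FIN", "CON"],
})
onehot = pd.get_dummies(flows).astype(bool)   # apriori expects boolean columns

itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.9)

# Strong rules (e.g., proto_TCP -> service_HTTP) can describe the normal
# profile; flows that violate them are candidate anomalies.
print(rules[["antecedents", "consequents", "support", "confidence"]])
```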
As several research studies have been undertaken using the ICA and PCA techniques to analyse the potential properties of network traffic and eliminate inappropriate or noisy features, these mechanisms are usually utilised in the data pre-processing module to address the variety problem of big data discussed in [88, 89]. In [90], a NADS technique using the ICA mechanism was developed to detect stealthy attacks with a high detection accuracy. It was assumed that the hacker has no information about the system, and malicious activities were detected based on a measurement matrix. De la Hoz et al. [91] suggested an adaptive IDS based on a hybrid statistical technique using PCA, the Fisher discrimination ratio and probabilistic self-organising maps (SOMs).
An example of converting symbolic features into numeric values:

Feature | Symbolic values | Mapped values
Proto | TCP, UDP, ICMP | 1, 2, 3
Service | HTTP, FTP, SMTP | 1, 2, 3
State | INT, FIN, CON | 1, 2, 3

Proto: low-level protocols (network and transport layers), such as TCP and UDP; Service: application protocols, such as HTTP and SMTP; State: states of dependent protocols, such as ACC and FIN.
Z = (X − µ)/σ (2)
where X denotes the feature values, µ is the mean of the feature values and
σ is the standard deviation.
For example, Table 5 lists an example of feature normalisation, where three features with five rows from the UNSW-NB15 data were normalised using equation (1). A small sketch of the z-score form in equation (2) follows.
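The following is a small numeric sketch of the z-score normalisation in equation (2); the five rows below are made-up values, not those of Table 5.

```python
# z-score normalisation (equation (2)) applied per feature column.
import numpy as np

# Illustrative values; these are not the rows of Table 5.
X = np.array([[120.0, 3.0], [80.0, 1.0], [200.0, 2.0],
              [95.0, 4.0], [150.0, 5.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation
Z = (X - mu) / sigma      # each column now has zero mean and unit variance

print(Z.round(2))
```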
Taxonomy of DE approaches: clustering-based (regular clustering and co-clustering), deep learning-based (generative and discriminative architectures) and combination-based (ensemble-based and fusion-based) methods.
As shown in Figure 10 (b), the data points of O1 and O2 are outliers while those of N1 and N2 are normal clusters [99].
Although there are different clustering techniques, the most popular types applied for NADSs are regular clustering and co-clustering, which differ in their strategies for processing the observations and features of a network dataset [13, 14, 23]. Specifically, regular clustering, such as k-means clustering, assembles data points from the observations of a dataset while co-clustering simultaneously considers both the observations and features of a dataset to provide clusters.
When using clustering to identify anomalies, three key assumptions are usually made. The first is that, as legitimate data instances often fall into a cluster whereas attacks do not, in a NADS methodology, clustering identifies any data instance that does not fall into a legitimate cluster as an attack, with noise data also considered anomalous, as in [100]. A drawback of this assumption is that clustering techniques cannot be optimised to identify anomalies, as the major goal of a clustering algorithm is to define clusters. Secondly, legitimate data instances are usually located near the closest cluster centroid while anomalous ones are often far away from it [13].
Techniques using this assumption consider the points farthest from the cluster centre as anomalies, and many of them have been suggested for designing NADSs [13] whereas, if anomalies are located in normal clusters, they cannot be correctly identified. To tackle this challenge, the third assumption is that legitimate data instances fall into vast and dense clusters and anomalies into small or sparse ones. Mechanisms using this assumption identify data observations belonging to clusters whose sizes and/or densities fall below a baseline as anomalies. A sketch of the second assumption is given below.
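The following is a hedged sketch of the second assumption: fit k-means to legitimate traffic only, then flag test points whose distance to the nearest centroid exceeds a threshold. The synthetic data, the number of clusters and the 95th-percentile threshold are illustrative choices.

```python
# Centroid-distance anomaly detection (second clustering assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # normal profile
X_test = np.vstack([rng.normal(size=(10, 2)),
                    rng.normal(loc=8.0, size=(5, 2))])     # last 5 are anomalies

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_normal)

def centroid_distance(model, X):
    # Distance from each point to its closest cluster centre.
    d = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None], axis=2)
    return d.min(axis=1)

# Threshold chosen from the normal profile (assumed heuristic).
threshold = np.percentile(centroid_distance(km, X_normal), 95)
labels = np.where(centroid_distance(km, X_test) > threshold, "attack", "normal")
print(labels)
```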
Bhuyan et al. [101] designed an outlier-based NADS in which legitimate data were clustered using a k-means technique and then a reference point was computed for each cluster, with points classified as attacks if they were below a certain threshold value. Also, in [102], a NADS for large network datasets using tree-based clustering and ensemble-based techniques for improving accuracy in a real network environment was proposed. Nadiammai et al. [103] analysed and evaluated k-means, hierarchical and fuzzy c-means clustering techniques for building a NADS. However, this system could not work effectively on an unbalanced data problem in which the network instances of the normal class far outnumber the instances of the abnormal class.
Clustering-based NADS techniques have several advantages. Firstly, they group data points in an unsupervised manner, which means that they do not need class labels for observations, a very difficult process to get right when labelling data as either normal or attack. Secondly, they are effective for clustering large datasets into similar groups to detect network anomalies, which decreases computational complexity, and they perform better than classification methods. In contrast, one drawback of clustering-based NADSs is that detection relies heavily on the efficacy of profiling normal instances, while another is that dynamically updating a profile for legitimate network data is time-consuming. Finally, the dependency on one of the three above assumptions is occasionally problematic for effectively recognising abnormal behaviours, as it produces a high false alarm rate and, in particular, attack instances can conceal themselves in a normal cluster.
5.3 Deep learning-based approaches
• RNN - uses discriminative power for a classification task; this occurs when the output of the model is a labelled sequence aligned with the input.
Multiple research studies [105, 106, 107, 108, 109, 110] have recently applied deep learning techniques to NADSs. Alom et al. [107] used a DBN-based NADS, configuring a greedy layer-by-layer learning algorithm to learn each stack of RBMs at a time to discover intrusion events. In [108], a deep auto-encoder technique was developed to reduce data dimensions as a pre-stage for classifying network observations. A shallow ANN algorithm was applied as a classifier to assess the effectiveness of the auto-encoder technique compared with the PCA and factor analysis algorithms. Yin et al. [109] proposed an RNN-based NADS for recognising malicious network instances. The experiments were conducted with different numbers of hidden nodes and learning rate values. In [110], the author proposed an ensemble method-based NADS that involves DFN architectures containing a shallow auto-encoder together with DFN, DBN and DNN architectures. The method was assessed using the NSL-KDD dataset, and the experimental results showed a reasonable performance for discovering abnormal network activities. It is observed that deep learning algorithms can considerably enhance NADSs' performance, with high detection accuracy and low false alarm rates. However, they usually take a long time to process network data to determine the best neural weights that minimise classification errors. A sketch of the auto-encoder idea is given below.
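The following is a hedged Keras sketch of the auto-encoder idea: compress network features to a low-dimensional code that a downstream classifier can use. The layer sizes, epoch count and synthetic data are illustrative assumptions, not the settings of [108].

```python
# Auto-encoder for dimensionality reduction of network features.
import numpy as np
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Input

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40)).astype("float32")   # 40 raw features (synthetic)

inp = Input(shape=(40,))
code = Dense(8, activation="relu")(Dense(20, activation="relu")(inp))
out = Dense(40, activation="linear")(Dense(20, activation="relu")(code))

autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)  # reconstruct the input

encoder = Model(inp, code)           # the 8-dimensional reduced features
print(encoder.predict(X[:3]).shape)  # (3, 8)
```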
Fragment of Table 6: ensemble/hybrid-based methods

Method: Boosting
Pros and cons: reduces variations and enhances accuracy; flexible enough to be used with any loss function; however, it is not robust to outliers or noisy network data.
Hybrid methods for NADS and evaluation using the NSL-KDD dataset: GP, DT, NB, SVM and J48 DT techniques were utilised for identifying attack observations. They achieved the best detection rate compared with other hybrid algorithms, but they need more samples to recognise different attack types.
modelled from the ingress network data, from which their parameters should be dynamically adjusted, instead of there being a static setting, to build a flexible model which distinguishes anomalies from normal observations [3, 128].
The methodologies of the most commonly used parametric methods are discussed in the following.
• Particle filter
A particle filter is an inference mechanism which estimates the unknown state from a set of observations over time, with the posterior distribution established by a set of weighted particles [134, 135]. For example, Xu et al. [136] proposed a Continuous Time BN (CTBN) model for detecting attacks that penetrate both host and network activities.
• Bayesian network (BN)
A BN is a graphical probability distribution for making decisions regarding uncertain data [52]. For instance, Altwaijry [137] developed a naive BN NADS using PCA, which computed the highest-ranked features within the PCA and used the selected features and their components as weights to improve the traditional naive Bayesian technique. The experimental results reflected that it could effectively decrease the data dimensions and improve detection accuracy. Han et al. [138] designed a NADS using a combination of a naive BN classifier, Linear Discriminant Analysis (LDA) and chi-square feature selection.
• Finite mixture models
As a finite mixture model can be defined as a convex combination of two or more PDFs, the joint properties of which can approximate any arbitrary distribution, it is a powerful and flexible probabilistic modelling tool for univariate and multivariate data [2, 3, 5, 41, 139]. Network data are typically considered multivariate as they have d dimensions for differentiating between attack and normal instances [2, 3, 75]. The GMM is the mixture model most often applied for NADSs. It estimates the PDF of the target class (i.e., the normal class) given by a training set and is typically based on a set of kernels rather than rules in the training phase [18, 41]. Mixture models require a large number of normal instances to correctly estimate their parameters, and it is difficult to select a suitable threshold (δ), as in equation (3), which differentiates attack instances from the normal training class with a certain score:
δ ≥ score =⇒ normal instance, otherwise =⇒ anomalous instance  (3)
This score can be defined using the unconditional probability distribution (w(X) = p(x)), and a typical approach sets the threshold as δ = p(x) [140]. For example, Fan et al. [141] developed an unsupervised statistical technique for identifying network intrusions in which legitimate and anomalous patterns were learned through finite generalised Dirichlet mixture models based on Bayesian inference, with the parameters of the mixture model and the feature saliency estimated simultaneously. A minimal GMM-based sketch of this thresholding is given below.
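The following is a minimal GMM sketch of this thresholding with scikit-learn, assuming the common convention that instances whose likelihood score falls below the threshold δ are anomalous; the synthetic data and the percentile-based choice of δ are illustrative assumptions.

```python
# GMM-based anomaly scoring: estimate p(x) on normal data, threshold it.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X_train = rng.normal(size=(1000, 4))               # normal instances only
X_test = np.vstack([rng.normal(size=(5, 4)),
                    rng.normal(loc=6.0, size=(5, 4))])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

scores = gmm.score_samples(X_train)                # log p(x) per instance
delta = np.percentile(scores, 1)                   # threshold from the training set

labels = np.where(gmm.score_samples(X_test) >= delta, "normal", "anomalous")
print(labels)
```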
Greggio [142] designed a NADS based on the unsupervised fitting of network data using a GMM which selected the number of mixture components and fit the parameters of each component in a real environment. The component with the highest covariance matrix identified legitimate network activities, with the smaller components treated as anomalies. Christian et al. [143] proposed a NADS based on combining parametric and non-parametric density modelling mechanisms in two steps. Firstly, malicious samples were recognised using the GMM and then clustered by a non-parametric measure in the second step. When a cluster grew to an adequate size, it was identified, transformed into a parametric measure and added to the established GMM. These techniques were evaluated using the KDD99 dataset, and their results reflected a high detection accuracy and low FPR. However, they would require Bayesian inference to be adjusted for their efficient application in real networking.
A brief comparison of the advantages and disadvantages of the existing DE techniques is presented in Table 6.
The confusion matrix is defined as follows:

 | Predicted Negative | Predicted Positive
Actual Negative | TN | FP
Actual Positive | FN | TP
• The Detection Rate (DR), also called the True Positive Rate (TPR) or sensitivity, is the proportion of correctly classified malicious instances out of the total number of malicious vectors and is computed as
DR = TP/(FN + TP)  (5)
• The True Negative Rate (TNR), also called the specificity, is the percentage of correctly classified normal instances out of the total number of normal vectors and is computed as
TNR = TN/(TN + FP)  (6)
• The False Positive Rate (FPR) is the proportion of normal instances incorrectly classified as malicious:
FPR = FP/(FP + TN)  (7)
• The False Negative Rate (FNR) is the proportion of malicious instances incorrectly classified as normal:
FNR = FN/(FN + TP)  (8)
A small sketch computing these metrics follows.
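The following is a small sketch computing equations (5)-(8) from confusion-matrix counts; the counts themselves are made up for illustration.

```python
# Evaluation metrics from confusion-matrix counts (equations (5)-(8)).
TP, TN, FP, FN = 940, 880, 120, 60   # illustrative counts

DR  = TP / (FN + TP)   # detection rate / TPR / sensitivity, eq. (5)
TNR = TN / (TN + FP)   # true negative rate / specificity, eq. (6)
FPR = FP / (FP + TN)   # false positive rate, eq. (7)
FNR = FN / (FN + TP)   # false negative rate, eq. (8)

print(f"DR={DR:.3f} TNR={TNR:.3f} FPR={FPR:.3f} FNR={FNR:.3f}")
```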
Figure: ROC curves plot the TPR against the FPR; perfect detection corresponds to 100% TPR at 0% FPR, while curves A, B and C illustrate non-perfect detection.
where precision is the fraction of the predicted positive values which are actually positive and recall is the fraction of actual positives correctly detected, as given in equations (7) and (8), respectively.
KDD99 dataset
dst_host_srv_count - Number of connections to the same service in the past 100 connections
count - Number of connections to the same host as the current connection in the past two seconds
dst_host_count - Number of connections to the same host in the past 100 connections
hot - Hot indicators, e.g., access to system directories, creation and execution of programs

UNSW-NB15 dataset
ct_dst_sport_ltm - Number of connections containing the same destination address and source port in the last 100 connections
service - Service types, e.g., HTTP, FTP, SMTP, SSH, DNS and IRC
FS method that depends on labels, while the PCA and ICA techniques were utilised as filter FS methods without labels. Moreover, the three techniques can effectively deal with the potential characteristics of network data, such as non-linear and non-normal distributions [2, 3, 41, 78].
The techniques were developed using the R programming language on Linux Ubuntu 14.04 with 16 GB RAM and an i7 CPU. To conduct the experiments on each dataset, we select random samples with different sizes of between 50,000 and 250,000 records. For each sample size used to establish the normal profile (i.e., the training phase), each normal sample is almost 65-75% of the total size while the others are used in the testing phase, which establishes the principle of NADS on which we focus in this paper. The performances of the DE techniques are evaluated using 10-fold cross-validations of the sample sizes to determine their effects, with all samples included in the learning and validation processes. A sketch of this protocol is given below.
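The following is a hedged scikit-learn sketch of this protocol: one random sample, roughly 70% of its normal records reserved for establishing the normal profile, and a 10-fold split; the synthetic data and the exact split ratio are illustrative assumptions.

```python
# Sampling and 10-fold cross-validation protocol (illustrative).
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(50_000, 8))                    # one 50,000-record sample
y = (rng.random(50_000) > 0.7).astype(int)          # ~70% normal (0), 30% attack (1)

# Training phase: reserve ~70% of the normal records for the normal profile.
X_profile = X[y == 0][: int(0.7 * (y == 0).sum())]

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold would fit a DE technique on train_idx and score test_idx here.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```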
The most important features are selected from the rules of the ARM technique which have higher levels of importance, and from the components of the PCA and ICA with higher variances. The eight features for each dataset listed in Table 8 are selected to reduce the processing time while applying DE as, for fewer than this number, DE evaluations provide lower accuracies and higher FARs [3, 77, 78].
Figure 14: ROC curves of three ML algorithms using ARM, PCA and ICA FS techniques (each panel plots the Detection Rate % against the False Positive Rate % for the EM, LR and NB algorithms on the NSL-KDD and UNSW-NB15 datasets).
To assess performance using the features selected from the datasets, three ML algorithms, namely, EM clustering, Logistic Regression (LR) and Naive Bayes (NB), are applied. The EM clustering technique is used as an example of unsupervised learning that can identify attacks without using labels in the training phase, while the LR and NB techniques are utilised as examples of statistical and supervised learning approaches that demand labels to classify attacks and their types. The evaluation criteria are estimated in terms of accuracy, FAR and ROC curves to assess the effects of these features and how they could improve performance at a lower computational cost, with the results obtained provided in Table 9.
There are two reasons for the ML algorithms performing better on the KDD99/NSL-KDD datasets than on the UNSW-NB15 dataset. Firstly, the latter has many normal and suspicious instances with almost the same values while the former does not. Secondly, the data distributions of the NSL-KDD dataset's training and testing sets are different due to the insertion of new attacks into the testing set, which clearly distinguishes between its normal and abnormal instances when executing ML algorithms. However, these distributions are approximately the same in the UNSW-NB15 dataset because its normal and abnormal instances were created from the same network. Comparing the results obtained from the three FS methods, we observe that the last two often provide better evaluation results than the ARM using ML algorithms, as shown in Figure 14.
This is because the ARM technique deals directly with the values of features while the others transform the feature space into another space based on the highest variances between features, which can greatly help DE techniques find differences between normal and suspicious instances. However, the ARM method can provide promising results when selecting relevant observations. Regarding the PCA and ICA techniques, there are only small differences in the evaluation performances of the ML algorithms, as their internal methodologies appear to be similarly based on variances. Consequently, we suggest using the PCA in the feature reduction module due to its simplicity of execution and better performance with ML algorithms [2, 3, 140].
In order to provide fair comparisons between the datasets in terms of the FS and DE approaches discussed above, Table 10 presents some recently published techniques. It is observed that FS methods can significantly improve the performance of a NADS by excluding irrelevant attributes from datasets. NADSs using different DE approaches have their own merits and demerits, as shown in Table 6. As statistical and ML techniques constantly try to enhance the process of detecting abnormal activities from network and host systems, their complexity becomes one of the essential criteria that should be considered in the design of a lightweight and reliable NADS. When learning and validating ML mechanisms on new datasets, combination and statistical techniques can effectively detect existing and zero-day attacks while knowledge, classification and clustering techniques can efficiently detect known ones.
The DE approaches used to identify recent network threats are explained in Section 5. Classification, statistical and clustering algorithms can generally discover DoS, DDoS and botnet attacks because they can learn from the massive amounts of data hackers send to victims' systems. They can also discriminate between DDoS attacks and flash crowds based on their different characteristics [166]. Knowledge and classification techniques can recognise brute force and shellshock malicious events as they can detect attempts to penetrate users' credentials and/or remotely exploit systems [22]. Clustering algorithms can detect browser-based attacks because they can group legitimate rules generated from websites and identify outliers as attacks [42, 75]. Combination and classification mechanisms can effectively identify SSL anomalous behaviours because they can deal properly with features extracted from TLS/SSL protocols and achieve promising detection rates [22]. Finally, statistical and classification techniques can recognise backdoor attacks by effectively identifying abnormal patterns of the IRC protocol.
detection methods. FPR and FNR errors occur when a normal behaviour falls in an attack region and a malicious one falls in a normal region, respectively.
• Real-time detection is also very challenging for several reasons. Firstly, the features created for network traffic may contain noisy or irrelevant ones. Secondly, lightweight detection methods need to be carefully adopted with respect to the above problems. These issues increase the processing time and false alarm rate if not properly addressed. Therefore, feature reduction and lightweight DE approaches should be developed. Feature reduction will assist in removing irrelevant attributes, and DE approaches will improve the detection accuracy if they can discriminate between the low variations of normal and abnormal patterns.
• Designing effective ADSs that can efficiently identify future cyber adversaries in IoT, Cloud/Fog computing paradigms, industrial control systems, and Software Defined Networks. New ADSs should be able to monitor high-speed networks that exchange high data rates in real time. Moreover, such systems should be scalable and self-adaptive for protecting different nodes of wide area networks. In IoT networks, there is a large amount of network traffic and telemetry data from IoT sensors, as well as Cloud/Fog services, that should be examined [41, 43, 46]. Moreover, this requires building collaborative NADSs to analyse different network nodes and aggregate their data for recognising suspicious events.
9. Concluding remarks
This study discussed the background and literature related to IDSs, specifically NADSs, with different applications in backbone networks, IoT, data centers, and Cloud and Fog Computing paradigms. Due to rapid advances in technologies, computer network systems need a solid layer of defence against vulnerabilities and severe threats. Although an IDS is a significant cyber security application which integrates a defence layer to achieve secure networking, it still faces challenges in being built in an online and adaptable manner. Anomaly detection methodologies which can efficiently identify known and zero-day attacks were investigated. Applying a NADS instead of an MDS methodology in the computer industry has been a very challenging issue, which could be overcome by framing its architecture with a data source, a pre-processing method and a DE mechanism.
A NADS is usually evaluated on a data source/dataset which involves a wide variety of contemporary normal and attack patterns that reflect the performances of DE approaches. The network dataset used consists of a set of features and observations that may include irrelevant ones which could negatively affect the performance and accuracy of DE approaches. Consequently, data pre-processing methods for creating, generating, reducing, converting and normalising features were discussed so that filtered information can be passed to a DE approach, which distinguishes between anomalous and legitimate observations and has been applied based on the classification, clustering, knowledge, combination and statistical approaches discussed, to demonstrate their merits and demerits in terms of building an effective NADS.
References
[1] A. Shameli-Sendi, M. Cheriet, A. Hamou-Lhadj, Taxonomy of intrusion risk assessment and response system, Computers & Security 45 (2014) 1–16.
[2] N. Moustafa, G. Creech, J. Slay, Big data analytics for intrusion detection system: statistical decision-making using finite Dirichlet mixture models, in: Data Analytics and Decision Support for Cybersecurity, Springer, 2017, pp. 127–156.
[7] L. Wang, R. Jones, Big data analytics for network intrusion detection: A survey, International Journal of Networks and Communications 7 (1) (2017) 24–31.
[24] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, K.-Y. Tung, Intrusion detection system: A comprehensive review, Journal of Network and Computer Applications 36 (1) (2013) 16–24.
[30] Y. Ji, X. Zhang, T. Wang, Backdoor attacks against learning systems, in: Communications and Network Security (CNS), 2017 IEEE Conference on, IEEE, 2017, pp. 1–9.
[52] S. Dua, X. Du, Data mining and machine learning in cybersecurity, 1st Edition, Vol. 1, CRC Press, 2016.
[65] The DARPA-2009 dataset. DARPA Scalable Network Monitoring (SNM) program traffic. Packet Clearing House. 11/3/2009 to 11/12/2009. URL https://www.predict.org/
[78] N. Moustafa, J. Slay, The significant features of the UNSW-NB15 and the KDD99 data sets for network intrusion detection systems, in: Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), 2015 4th International Workshop on, IEEE, 2015, pp. 25–31.
[79] H. Liu, H. Motoda, Feature selection for knowledge discovery and data mining, Vol. 454, Springer Science & Business Media, 2012.
[80] Y. Chen, Y. Li, X.-Q. Cheng, L. Guo, Survey and taxonomy of feature selection algorithms in intrusion detection system, in: International Conference on Information Security and Cryptology, Springer, 2006, pp. 153–167.
[84] W. Lee, S. J. Stolfo, et al., Data mining approaches for intrusion detection, in: Usenix Security, 1998.
[86] J. Luo, S. M. Bridges, Mining fuzzy association rules and fuzzy frequency episodes for intrusion detection, International Journal of Intelligent Systems 15 (8) (2000) 687–703.
[88] C. Wagner, J. François, T. Engel, et al., Machine learning approach for IP-flow record anomaly detection, in: International Conference on Research in Networking, Springer, 2011, pp. 28–39.
[96] S.-J. Horng, M.-Y. Su, Y.-H. Chen, T.-W. Kao, R.-J. Chen, J.-L. Lai, C. D. Perkasa, A novel intrusion detection system based on hierarchical clustering and support vector machines, Expert Systems with Applications 38 (1) (2011) 306–313.
[105] W. Huang, G. Song, H. Hong, K. Xie, Deep architecture for traffic flow prediction: deep belief networks with multitask learning, IEEE Transactions on Intelligent Transportation Systems 15 (5) (2014) 2191–2201.
[109] C. Yin, Y. Zhu, J. Fei, X. He, A deep learning approach for intrusion detection using recurrent neural networks, IEEE Access 5 (2017) 21954–21961.
[112] K. Chadha, S. Jain, Hybrid genetic fuzzy rule based inference engine to detect intrusion in networks, in: Intelligent Distributed Computing, Springer, 2015, pp. 185–198.
[119] P. Naldurg, K. Sen, P. Thati, A temporal logic based framework for intrusion detection, in: International Conference on Formal Techniques for Networked and Distributed Systems, Springer, 2004, pp. 359–376.
[120] S.-S. Hung, D. S.-M. Liu, A user-oriented ontology-based approach for network intrusion detection, Computer Standards & Interfaces 30 (1) (2008) 78–88.
[134] C.-C. Lin, M.-S. Wang, Particle Filter for Depth Evaluation of Networking Intrusion Detection Using Coloured Petri Nets, INTECH Open Access Publisher, 2010.
[150] P. Pudil, J. Novovičová, Novel methods for feature subset selection with respect to problem knowledge, in: Feature Extraction, Construction and Selection, Springer, 1998, pp. 101–116.
[151] S. Dubey, J. Dubey, KBB: A hybrid method for intrusion detection, in: Computer, Communication and Control (IC4), 2015 International Conference on, IEEE, 2015, pp. 1–6.
[155] P. A. Porras, A. Valdes, Live traffic analysis of TCP/IP gateways, in: NDSS, 1998.
[158] W.-C. Lin, S.-W. Ke, C.-F. Tsai, CANN: An intrusion detection system based on combining cluster centers and nearest neighbors, Knowledge-Based Systems 78 (2015) 13–21.
[165] B. Wang, Y. Zheng, W. Lou, Y. T. Hou, DDoS attack protection in the era of cloud computing and software-defined networking, Computer Networks 81 (2015) 308–319.