Highlights

Diffusion-powered Data Augmentation and Explainable Boosting-based Parallel Ensemble Learning for Intrusion Detection in ICS

Tuyen T. Nguyen, Phong H. Nguyen, Hoa N. Nguyen

• Using SHAP-based, sophisticated feature selection methods to pinpoint important features.

• Proposing a diffusion-powered advanced dataset augmentation technique to improve the quality of the training set.

• Designing a weighted voting ensemble learning strategy that combines model predictions to increase accuracy, dependability, and adversarial attack resistance.

• Conducting extensive experiments on the augmented datasets and achieving significant advancements in overall detection efficacy and speed.

Diffusion-powered Data Augmentation and Explainable Boosting-based Parallel Ensemble Learning for Intrusion Detection in ICS

Tuyen T. Nguyen, Phong H. Nguyen, Hoa N. Nguyen*

Department of Information Systems, VNU University of Engineering and Technology, Vietnam National University, Hanoi, 100000, Vietnam

* Corresponding author
Email addresses: tuyennt@cybersecurity.vn (Tuyen T. Nguyen), 20020065@vnu.edu.vn (Phong H. Nguyen), hoa.nguyen@vnu.edu.vn (Hoa N. Nguyen)

Abstract

In the Industry 4.0 era, detecting cyber attacks on Industrial Control Systems (ICS) has become increasingly important and challenging. This research introduces DAELID, a method designed to enhance the detection of cyber assaults on the IEC 60870-5-104 protocol. DAELID begins by utilizing SHAP (SHapley Additive exPlanations) for feature selection to ensure the model focuses on the most relevant data. It then employs various generative models, including a GAN (TabGAN) and diffusion models (FDM and CFM), to generate synthetic samples, improving the robustness and diversity of the training data. Additionally, DAELID uses a weighted voting ensemble learning strategy with multiple powerful AI models running concurrently, enhancing the model's performance and resilience against adversarial attacks. Comprehensive experiments conducted on the IEC 60870-5-104 Intrusion Detection Dataset demonstrate that DAELID achieves an accuracy of 86.83% and an AUC ROC score of 98.92%, significantly surpassing other state-of-the-art methods.

Keywords: Explainable Artificial Intelligence, Data Augmentation, Intrusion Detection, Industrial Control System

1. Introduction

An Industrial Control System (ICS) is utilized to manage and control industrial processes and machinery in manufacturing, production, and other industrial environments. Its purpose is to automate and optimize operations, monitor conditions, and ensure both safety and efficiency. Nowadays, ICS are often designed with internet connectivity, which necessitates heightened security measures. However, in practice, enhancing security protocols and defenses against cyber attacks has not been sufficiently prioritized. Consequently, numerous cyberattacks on ICS have recently had serious repercussions.

Various distinct communication protocols tailored for supervisory control and data acquisition (SCADA) systems are commonly employed within ICS. Among these protocols is the IEC 60870-5-104 protocol, which finds significant utilization in Industrial Internet of Things (IIoT) applications, particularly within the energy and healthcare domains [1]. However, despite its widespread adoption, this protocol faces numerous security challenges. For instance, it suffers from deficiencies in access control and authentication, rendering it susceptible to threats such as Man-in-the-Middle (MITM) attacks and distributed denial-of-service (DDoS) attacks.

Given the current landscape, the implementation of protective measures in ICS is imperative to safeguard network security. Presently, cybersecurity experts advocate for the integration of machine learning (ML) into Intrusion Detection Systems (IDS) within ICS. This approach aims to fortify the reliability of IDS against cyber threats while enhancing their ability to identify and address network attack issues. ML-driven IDSs have proven adept at identifying system anomalies effectively, as highlighted in [2]. Consequently, ML-powered IDSs, as cited in [3], offer numerous advantages, including increased accuracy and the ability to swiftly detect emerging threats. However, the focus of AI models deployed in IDS has often overlooked the development of defenses against attacks on specialized ICS protocols, such as the IEC 60870-5-104 protocol, as noted in [4]. Recently, Radoglou-Grammatikis et al. have introduced the IEC 60870-5-104 Intrusion Detection Dataset for training AI models to detect intrusions in ICS. Nonetheless, the dataset's quality and the accuracy of intrusion detection remain suboptimal, achieving only 83.14% accuracy [1].

Research Challenges: In summary, there exist "barriers" to the design of effective intrusion detection methods, with the following key challenges:

• The scarcity of high-quality, comprehensive datasets poses a major obstacle to effective intrusion detection. Existing datasets often lack representativeness and may be outdated or limited in scope. To address this challenge, generating synthetic data to enrich existing datasets is essential for better aligning them with real-world cyber threats.

• ICS encounter various complex cyber dangers, including malware, ransomware, and advanced persistent threats (APTs). IDS models must detect diverse threats across system layers, necessitating advanced techniques to learn from known and new attack patterns.

• Ensuring IDS models perform consistently across various network environments is a critical challenge. Disparities in traffic patterns, attack methods, and network configurations can significantly impact model performance. Models trained in one environment may struggle in others, necessitating robust generalization techniques for reliable threat detection across diverse settings.

• Efficiently processing vast amounts of network data is crucial for real-time intrusion detection. IDS models must maintain a delicate balance between speed and accuracy to ensure effective threat detection without compromising quality. Therefore, enhancing IDS scalability and processing capabilities is essential for strong security in dynamic environments.
classify instances into their respective categories, includ-
Highlight Contribution: From the challenges
ing both normal and attack states. The task at hand
mentioned above, we advocate an AI-powered intrusion
involves enhancing the model’s effectiveness with respect
detection method for ICS in this study, with the following
to minority classes (representing rare attacks) and ensure
main results:
the model’s decisions are interpretable and robust across
1. We employ the SHAP (SHapley Additive different operational conditions in various environments.
exPlanations) method for feature selection,
enhancing AI model interpretability by identifying 2.2. AI-powered intrusion detection
influential features and reducing training data noise, Deep learning (DL) techniques are highly effective in
thereby boosting overall model performance. identifying network attack patterns from raw data, en-
2. We employ several advanced dataset augmentation compassing tasks like feature extraction and classification.
technique including GANs, and Diffusion Model to Recently, a variety of DL methods [5, 6, 7] have been
generate synthetic samples that augment the train- incorporated into IDS, facilitating the real-time detection
ing set. This approach helps balance the dataset of network intrusions. An example includes the develop-
and improves he strength and applicability of the ment by Bontemps et al. [8] of a neural network-based
machine learning models across various scenarios. model for real-time anomaly detection, specifically LSTM
RNNs, trained on standard time series data. Similarly, Li
3. We develop a weighted voting ensemble that et al. [9] introduced an approach centered around hier-
combines predictions from AI models including archical and dynamic feature extraction, which interprets
XGBoost, RandomForest, and ExtraTrees, boosting network activities as sequences of packets and dynamically
the accuracy and reliability of intrusion detection adjusts feature representations using an attention mecha-
and increasing resilience against adversarial attacks. nism. These approaches have demonstrated high accuracy,
Additionally, parallelizing predictor execution achieving up to 99.05% precision on the CSE-CIC-IDS2018
enhances reliability and reduces prediction times. dataset.
4. We performed extensive experiments using the IEC Aldarwbi and colleagues (2022) [10] developed
60870-5-104 Intrusion Detection Dataset to assess a framework that transforms network traffic flow
our proposed techniques. Our findings indicate sub- characteristics into waveforms and utilizes sophisticated
stantial enhancements in both detection accuracy deep learning techniques like LSTM, DBNs, and CNN
and resilience when contrasted with current leading for detecting intrusions. Their approach yielded high
methods, affirming the practical viability of our ap- accuracy rates, achieving 84.82% accuracy on the
proach. NSL-KDD dataset and 99.41% on the CIC-IDS2017

2
Furthermore, they use Firefly Optimization for intrusion detection and a Probabilistic Neural Network for classification on the NSL-KDD dataset [11], attaining an accuracy of 98.99%.

Recent studies have also shown that XGBoost, ExtraTrees, and RandomForest are among the most effective machine learning models for predicting intrusion attacks [12, 13, 14]. Consequently, our research focuses on these methods to enhance intrusion detection by conducting an in-depth analysis of network traffic. This is corroborated by our findings, as these three models perform the best on our dataset in our proposed method.

Extreme Gradient Boosting (XGB): XGBoost is founded on a series of machine learning principles centered around decision trees. It employs reinforcement through error minimization by incorporating a gradient term [15]. XGBoost operates through an ensemble of K classification and regression trees, with each tree $i$ containing $K_E^i \mid i \in 1..K$ nodes. The combined prediction scores from all trees provide the final prediction:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \qquad (1)$$

In this context, $y_i$ represents the class labels, $x_i$ denotes the individual training set instances, $f_k$ represents the score at a leaf node of the k-th tree, while $\mathcal{F}$ denotes the collection of all K scores derived from the ensemble of classification and regression trees.

ExtraTrees (ET): ExtraTrees, short for Extremely Randomized Trees, is a technique in ensemble learning that constructs a multitude of decision trees [16]. Unlike traditional decision tree algorithms, ExtraTrees introduces randomness in two primary ways: by selecting a random subset of features for each split and by choosing the split point randomly from the selected features.

The core idea of ExtraTrees is to reduce variance and prevent overfitting by averaging the results of multiple trees, each trained with different random subsets of data. This method enhances the model's robustness and generalization capability.

Mathematically, the ExtraTrees algorithm can be described as follows: given a training set S, the algorithm generates M unpruned decision trees. Each tree is trained on the entire dataset, but at each node, the algorithm selects K random features and chooses a random split point for each feature. The best split among these random splits is chosen to partition the data at that node.

The prediction for a new instance x is achieved by calculating the average of predictions made by each individual tree in the ensemble. For classification tasks, this typically involves majority voting, while for regression tasks, the mean prediction is used. The general formula for the prediction ŷ in a regression task is as follows:

$$\hat{y} = \frac{1}{M} \sum_{m=1}^{M} T_m(x) \qquad (2)$$

where $T_m(x)$ denotes the output made by the m-th tree in the ensemble.

In summary, ExtraTrees leverages randomization to build a diverse set of trees, which are then aggregated to produce more accurate and stable predictions. This method has been shown to be effective in various fields, providing a balance between bias and variance, and offering improved performance over single decision tree models.

RandomForest (RF): RandomForest, introduced by Breiman in 2001, is a robust ensemble learning method serving purposes in both classification and regression tasks [17]. The core idea is to construct multiple decision trees and aggregate their predictions to improve overall performance and reduce overfitting.

Mathematically, a RandomForest consists of M decision trees, each trained on a different bootstrap sample from the original dataset. For a given input x, the prediction of the forest is the majority vote of the predictions from the individual trees in the case of classification, or the average prediction for regression. This can be expressed as follows:

$$\hat{f}_M(x) = \frac{1}{M} \sum_{j=1}^{M} T_j(x) \qquad (3)$$

In this formula, $\hat{f}_M(x)$ represents the aggregated prediction of the RandomForest for the input x, M represents the aggregate count of trees present within the forest, while $T_j(x)$ denotes the prediction made by the j-th tree for the input x. For classification, the final output is decided based on the majority agreement of all the trees' predictions, while for regression, it is the average of all tree predictions.

RandomForest employs two main techniques: bootstrap aggregating (bagging) and random feature selection. Bagging involves training individual decision trees on randomly selected portions of the dataset, while random feature selection entails selecting a random subset of features for each split within the tree. These techniques help in reducing variance and preventing overfitting, making RandomForests highly effective for various machine learning tasks.
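To make the roles of these three base learners concrete, the minimal sketch below fits each of them on a small synthetic tabular classification problem using the scikit-learn and xgboost APIs. It is an illustration only; the dataset shape and hyperparameters are placeholders, not the paper's experimental setup.

```python
# Minimal sketch: training the three base learners on synthetic tabular data.
# Dataset and hyperparameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=15,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "XGB": XGBClassifier(n_estimators=300, learning_rate=0.1),
    "ET": ExtraTreesClassifier(n_estimators=100, random_state=0),   # random splits, Eq. (2)
    "RF": RandomForestClassifier(n_estimators=100, random_state=0), # bagging, Eq. (3)
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", model.score(X_te, y_te))
```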
Ensemble Learning: Emerging machine learning techniques, including Gradient Boosted Tree (GBT) algorithms like XGBoost, CatBoost, and LightGBM, alongside deep learning methods such as Convolutional Neural Networks (CNN), have shown exceptional effectiveness in general classification tasks and particularly in analyzing intrusion datasets [18]. However, the evaluation metrics for the quality of modern classification methods based on GBT and DL still need improvement and enhancement. Therefore, ensemble learning methods have been proposed to leverage the strengths of multiple individual classifiers, coordinating them to enhance the classifier's performance [19].

In our method, we perform soft voting ensemble learning for the above models by following the formula: $P_{voting} = \frac{1}{n} \sum_{i=1}^{n} P_i(f)$, where $P_1(f), P_2(f), \ldots, P_n(f)$ are the probability outputs of the n AI models for a feature vector f. We parallelize any processes capable of parallel execution to minimize the time needed for intrusion analysis.
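As an illustration of this soft-voting rule, the following sketch averages the class-probability outputs of several fitted models and takes the arg-max; the function and variable names are assumptions for the example, not code from the paper.

```python
# Minimal sketch of unweighted soft voting: average the probability outputs
# of n fitted models and pick the most probable class (P_voting above).
import numpy as np

def soft_vote(models, X):
    """Average predict_proba over models, then arg-max per sample."""
    probas = [m.predict_proba(X) for m in models]  # each: (n_samples, n_classes)
    p_voting = np.mean(probas, axis=0)             # (1/n) * sum_i P_i(f)
    return p_voting.argmax(axis=1)

# Usage with the models fitted in the previous sketch:
# y_pred = soft_vote(list(models.values()), X_te)
```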
2.3. Intrusion Detection in Industrial Control Systems

P. Radoglou-Grammatikis and colleagues [20] discuss a threat model for IEC 60870-5-104, employing a colored Petri Net to pinpoint four cyber threats: unauthorized access, MITM attacks, traffic analysis attacks, and DDoS. IEC 60870-5-104, a prevalent industrial communication protocol in ICS/SCADA settings, functions through TCP's default port 2404. However, the absence of authentication and authorization mechanisms makes it susceptible to these threats.

In a related study [2], the researchers present an anomaly-based IDS designed to detect network irregularities in IEC 60870-5-104. This IDS is comprised of five components: monitoring network traffic, controlling access to network packets, extracting data flows, detecting anomalies, and implementing responses. The Flow Extraction Module employs the CICFlowMeter to produce bidirectional flow statistics for IEC 60870-5-104 network traffic, while the Anomaly Detection Module detects irregularities using three different outlier detection algorithms. Furthermore, Radoglou et al. [1] further investigate particular cyber threats targeting IEC 60870-5-104, such as Man-in-the-Middle (MITM) attacks, Denial of Service (DoS) attacks, traffic sniffing, and application-layer attacks. They utilize a decision tree method and Software-Defined Networking (SDN) for attack mitigation, resulting in the creation of the IEC 60870-5-104 Intrusion Detection Dataset.

Additionally, D. Asimopoulos et al. [21] focus on developing an AI-based IDS to defend against cyber threats targeting the IEC 60870-5-104 protocol. They employ various AI techniques, including machine learning and deep learning models such as XGBoost, RandomForest, and Multi-Layer Perceptron. The best results were obtained with the following performance metrics: Accuracy = 82.44%, True Positive Rate (TPR) = 82.44%, False Positive Rate (FPR) = 1.59%, and F1-score = 80.98%.

2.4. Generative Models

To tackle the imbalanced dataset challenge, generative models have been widely utilized to generate synthetic samples from small, imbalanced datasets. Prominent generative models include Generative Adversarial Networks (GANs), which use a generator and discriminator to create realistic synthetic data, and Diffusion Models, which gradually add and remove noise to produce high-quality samples, enabling efficient data generation and likelihood estimation. These models help balance datasets and enhance the robustness and generalization capabilities of machine learning models.

2.4.1. Generative Adversarial Networks

Generative Adversarial Networks (GANs) [22] are a class of generative models that aim to uncover the hidden distribution of a specific data generation process. This is achieved through an adversarial interaction between a generator and a discriminator. The architecture of GANs involves two competing neural network models:

• Generator: This network takes random noise as input and transforms it into data that resemble the authentic data in the training set. Its goal is to fool the discriminator by producing data indistinguishable from real data.

• Discriminator: Opposite to the generator, the discriminator evaluates whether the incoming data are genuine or produced by the generator. It learns to distinguish real data from counterfeits effectively.

Model Optimization: Let θ and ϕ represent the learnable parameters in G and D, respectively. GAN training is framed as a minimax problem:

$$\min_{\theta} \max_{\phi} V(\theta, \phi), \qquad (4)$$

where V is the utility function.

Training GANs can be challenging, with frequent problems such as mode collapse and mode dropping. Mode collapse occurs when the generator repeatedly produces a narrow range of samples. Mode dropping happens when the generator fails to cover parts of the target distribution. Other problems include checkerboard and waterdrop artifacts. This section covers the fundamentals of GAN training and methods to enhance stability.

1. Learning Objective: The primary aim of GAN training is to minimize the disparity between the actual data distribution p(x) and the distribution of generated data p(G(z; θ)). This difference can be evaluated using several metrics, including the Jensen-Shannon divergence, Kullback-Leibler divergence, and integral probability metrics. Consequently, there are several GAN loss functions, including the original GAN loss [23], the Wasserstein GAN loss [24], the least-squares GAN loss [25], and the hinge GAN loss [26].

Here is a broad GAN learning objective that encompasses various well-known loss functions. To update the discriminator, the objective is:

$$\max_{\phi} \; \mathbb{E}_{x \sim D}[f_D(D(x; \phi))] + \mathbb{E}_{z \sim Z}[f_G(D(G(z; \theta); \phi))], \qquad (5)$$

The output layers $f_D$ and $f_G$ are responsible for converting the discriminator D's calculations into classification scores for real and fake images. In the generator's update step, the goal is:
$$\min_{\theta} \; \mathbb{E}_{z \sim Z}[g_G(D(G(z; \theta); \phi))], \qquad (6)$$

where $g_G$ represents the output layer, converting the discriminator's output into a classification score for the generated (fake) image.

2. Training: Two widely used variations of stochastic gradient descent/ascent (SGD) for training GANs include the simultaneous update and alternating update methods. Denote the objective functions in Equations 5 and 6 as $V_D(\theta, \phi)$ and $V_G(\theta, \phi)$, respectively. In the simultaneous update method, every training iteration involves both refining the discriminator and improving the generator in update steps:

$$\phi^{(t+1)} = \phi^{(t)} + \alpha_D \frac{\partial V_D(\theta^{(t)}, \phi^{(t)})}{\partial \phi}, \qquad (7)$$

$$\theta^{(t+1)} = \theta^{(t)} - \alpha_G \frac{\partial V_G(\theta^{(t)}, \phi^{(t)})}{\partial \theta}, \qquad (8)$$

in which $\alpha_D$ and $\alpha_G$ represent the learning rates of the discriminator and the generator. In the alternating update method, each training iteration updates the discriminator first, followed by the generator:

$$\phi^{(t+1)} = \phi^{(t)} + \alpha_D \frac{\partial V_D(\theta^{(t)}, \phi^{(t)})}{\partial \phi}, \qquad (9)$$

$$\theta^{(t+1)} = \theta^{(t)} - \alpha_G \frac{\partial V_G(\theta^{(t)}, \phi^{(t+1)})}{\partial \theta}. \qquad (10)$$

In the alternating update method, the generator's update utilizes the discriminator's freshly adjusted parameters, enhancing stability, whereas simultaneous updating can achieve computational efficiency gains.

Among the numerous Stochastic Gradient Descent (SGD) algorithms, ADAM [27] is widely used for training Generative Adversarial Networks (GANs). ADAM includes several adjustable parameters. Usually, the first momentum is set to 0, while the second momentum is set to 0.999. The discriminator's learning rate is often adjusted to be 2 to 4 times greater than the generator's learning rate (usually 0.0001), a practice referred to as the two time-scale update rule (TTUR) [28].
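A compact PyTorch sketch of the alternating update with TTUR-style learning rates is given below. The architectures, dimensions, and data are illustrative assumptions, and it uses the common non-saturating GAN loss rather than any specific loss from [23-26].

```python
# Illustrative PyTorch sketch: alternating GAN updates with TTUR-style
# learning rates (discriminator lr > generator lr). All sizes are placeholders.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 30
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

# TTUR: discriminator learning rate set 4x higher than the generator's;
# ADAM with first momentum 0 and second momentum 0.999, as discussed above.
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.999))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.999))
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(512, data_dim)  # stand-in for real tabular samples

for step in range(200):
    x = real_data[torch.randint(0, 512, (64,))]
    z = torch.randn(64, latent_dim)

    # 1) Discriminator ascent step (Eq. 5): push real -> 1, fake -> 0.
    loss_D = bce(D(x), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Generator descent step (Eq. 6) using the freshly updated discriminator.
    loss_G = bce(D(G(z)), torch.ones(64, 1))  # non-saturating generator loss
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```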
2.4.2. Diffusion Models

Diffusion models (DM) [29] are a type of probabilistic generative model where the generation process involves gradually refining noisy samples into clean ones. Training these models uses a forward (diffusion) process that sequentially adds noise to examples until they are completely corrupted. A deep neural network is then trained to reverse this process. Essentially, the model learns to transform between a simple noise distribution (usually Gaussian) and the target data distribution, progressively converting noise into data. Both the forward and reverse processes in DMs can be described using stochastic differential equations (SDEs) [30].

We begin with the forward process and then explain how to reverse it. Starting with data samples $x_0 \sim p_0$, the forward SDE gradually changes $x_0$ into $x_1$, where $p_1$ is a simple prior distribution like a Gaussian. We consider $\tau = 1$ as the endpoint of the forward noising process, and the transition from data to noise over time $t \in [0, 1]$ is described by the following SDE:

$$dx = f(x, t)\,dt + g(t)\,dB \qquad (11)$$

This equation describes how x changes over time as it is continuously perturbed with noise. In SDE terminology, the added noise comes from the diffusion term $g(t)\,dB$, where $g(t)$ is a time-dependent scalar (diffusion coefficient) and $dB$ corresponds to a d-dimensional Brownian motion, representing a Gaussian perturbation with mean 0 and variance dt. If $g(t) = 0$, then Equation (11) becomes an ordinary differential equation (ODE), making the forward and reverse processes deterministic. The drift term $f(x, t)\,dt$ specifies the deterministic path for x in the absence of noise. In the forward process, $f(x, t)$ corrupts the data so that at $t = 1$, $x_1$ is completely noise.

By freezing time t, we can model the distribution $p_t(x)$, describing the values of x from the forward process. Deriving a closed-form solution for $p_t(x)$ requires solving the Fokker-Planck equation of Equation (11) and is not always feasible depending on the choices of drift and diffusion. We will show DMs where we do not need to be explicitly concerned about $p_t(x)$.

The seminal work by Anderson et al. [31] showed that Equation (11) can be reversed by another SDE that starts from noise and reconstructs the same $p_t(x)$ at all times, including $t = 0$, corresponding to the desired data distribution. Specifically, this reverse SDE is given by

$$dx = \left[ f(x, t) - g(t)^2 \nabla \log p_t(x) \right] dt + g(t)\,d\bar{B} \qquad (12)$$

In statistics, $\nabla \log p_t(x)$ is known as the Stein score of the marginal density $p_t$. We will use this term since it is also common in DMs, which are sometimes called score-based generative models. The score term guides x to higher probability regions. Directly calculating the score is impractical, as $p_t(x)$ is not always available, but it can be estimated by minimizing the denoising score matching objective [32, 33]:

$$\mathcal{L} = \mathbb{E}_{x_0 \sim p_0,\, x_t \sim p_{t|0}} \left[ \left\| s_\theta(x, t) - \nabla \log p_{t|0}(x \mid x_0) \right\|^2 \right]. \qquad (13)$$

Here, $s_\theta(x, t)$ is our estimate of the score, usually parameterized as a neural network with trainable weights θ. By minimizing $\mathcal{L}$ accurately, we approximate the true score: $s_\theta(x, t) \approx \nabla \log p_t(x)$. The key component is the data conditional distribution $p_{t|0}$, for which an analytical closed form is often available (unlike $p_t(x)$) when conditioned on a data point $x_0$ from the training set.
The key components for learning Diffusion Models include: (i) data samples from the target distribution $p_0$, (ii) a data conditional $p_{t|0}$ with an analytically computable score, and (iii) a neural network $s_\theta$ to estimate the score. The requirement for an analytical data conditional can be restrictive and limits the modification of the diffusion process for new applications. For example, if the prior is non-Gaussian, the previous construction must be rederived.

Alternatively, flow matching [34] (FM) is a newer method that offers competitive performance with DMs without needing an analytical data conditional. Instead of a Stochastic Differential Equation (SDE), FMs rely on an Ordinary Differential Equation (ODE) where stochasticity is removed. With FMs, a non-Gaussian prior can be chosen without additional complexities. Finally, stochastic interpolants [35, 36] provide a unified framework for DMs and FMs, which helps understand the trade-offs between modeling ODEs and SDEs.
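To make the denoising score matching objective (Eq. 13) concrete, the sketch below trains a score network on toy 2-D data under a simple Gaussian perturbation kernel $p_{t|0}(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 I)$, whose conditional score is available in closed form. The noise schedule and architecture are illustrative assumptions, not the construction used by FDM or CFM.

```python
# Illustrative sketch of denoising score matching (Eq. 13) on toy 2-D data.
# Perturbation kernel: x_t = alpha_t * x_0 + sigma_t * eps, eps ~ N(0, I), so
# grad log p_{t|0}(x_t | x_0) = -(x_t - alpha_t * x_0) / sigma_t^2 = -eps / sigma_t.
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 2))  # input: (x_t, t)
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
x0_data = torch.randn(4096, 2) * 0.5 + 1.0  # stand-in for samples from p_0

for step in range(500):
    x0 = x0_data[torch.randint(0, 4096, (256,))]
    t = torch.rand(256, 1)                            # t ~ U[0, 1]
    alpha_t, sigma_t = 1.0 - t, 0.05 + 0.95 * t       # simple illustrative schedule
    eps = torch.randn_like(x0)
    xt = alpha_t * x0 + sigma_t * eps                 # sample from p_{t|0}
    target = -eps / sigma_t                           # closed-form conditional score
    pred = score_net(torch.cat([xt, t], dim=1))       # s_theta(x, t)
    loss = ((pred - target) ** 2).sum(dim=1).mean()   # Eq. 13
    opt.zero_grad(); loss.backward(); opt.step()
```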
2.5. Feature Selection

In our proposed method, we employ SHAP, an advanced explainable AI technique, for feature selection to enhance the quality of the training dataset. Using SHAP, we compute the Shapley value for each feature using machine learning models as explainers. Given a machine learning model $m_i$ to predict the output y from a data sample x, the Shapley value $\phi_j^{(i)}$ of the j-th feature in the data sample x is computed by taking the difference between the model prediction when feature $f_j$ is present and when it is absent:

$$\phi_j^{(i)} = \sum_{S \subseteq \{1,2,\ldots,N\} \setminus \{j\}} \frac{|S|!\,(N - |S| - 1)!}{N!} \left( m_i(x_{S \cup \{j\}}) - m_i(x_S) \right) \qquad (14)$$

where N is the number of features; S is a subset of the set of features not containing the j-th feature; $x_S$ is the data sample x containing only the features in set S; $x_{S \cup \{j\}}$ is the data sample x also containing the j-th feature; $m_i(x_S)$ is the prediction value of the model when considering only the features in set S; and $m_i(x_{S \cup \{j\}})$ is the prediction value of the model when including the j-th feature.

After computing the Shapley value for each data sample, we use the following formula to calculate the importance value of feature j over the entire dataset:

$$I_j = \frac{1}{n} \sum_{i=1}^{n} \left| \phi_j^{(i)} \right| \qquad (15)$$

where n is the number of data samples.

Finally, we select a threshold τ that determines whether an importance value is significant enough to consider a feature important, choosing features with importance values greater than or equal to the threshold, i.e., $I_j \geq \tau$. We use k machine learning models $m_1, m_2, m_3, \ldots, m_k$ and, based on Formula (15), compute an importance value $I_j^l$ for each feature under each model $m_l$, where l ranges from 1 to k. We then identify all values satisfying $I_j^l \geq \tau$. Next, we select the features that all models chose. These features are considered essential for training the machine learning models. Finally, we rebuild the dataset with the selected features, reconstructing the data by keeping only those features. This method creates a new dataset with a reduced number of features, which we use to train the AI models and make predictions.

3. DAELID: Proposed Method for Intrusion Detection in ICS

Recent research has extensively examined intrusion detection systems that integrate adversarial machine learning (AML) techniques. A thorough review of current literature reveals various strategies and advancements aimed at enhancing the resilience and efficacy of these systems. In the following section, we detail our methodology and outline the components that make up our proposed approach.

3.1. Approach Direction

To enhance intrusion detection performance in ICS, our approach is based on the following ideas:

1. Apply various data engineering techniques like data pre-processing and data normalization to improve the quality of the training dataset for the AI model.

2. Evaluate several AI models on that processed dataset to select the best-performing models for intrusion detection.

3. Explain the best-performing model on the processed dataset with SHAP to assess the influence of each attribute. Subsequently, save the attribute set with the best influence as a new dataset.

4. Employ different generative methods on the new dataset generated by SHAP.

5. Reevaluate AI models on those augmented datasets.

6. Employ parallel ensemble learning methods to enhance the accuracy, speed, and resilience of the intrusion detection system.

As a result, our proposed method is named DAELID, short for "Diffusion-powered Data Augmentation and Explainable Boosting-based Parallel Ensemble Learning for Intrusion Detection on ICS". Firstly, we utilize SHAP for feature selection (as described in §3.2). Additionally, we employ TabGAN, FDM, and CFM for different dataset augmentation scenarios (detailed in §3.3) and, finally, an effective weighted voting ensemble approach, as outlined in §3.4. We perform these processes in parallel to speed up the machine-learning pipeline. Consequently, DAELID represents a comprehensive research approach aimed at improving real-time intrusion detection capabilities. The following sections will delve into a thorough elucidation of our proposed methodology. To summarize, our proposed method's workflow is illustrated in Fig. 1.
Figure 1: Architecture of our proposed Intrusion Detection System - DAELID

3.2. SHAP-based Feature Selection

Feature selection is a critical aspect of ML, as it enables us to reduce the dimensionality of the feature space, improve model performance, increase model interpretability, and enhance adaptability. Traditional methods for feature selection include filter-based, wrapper-based, and embedded methods. However, recent studies [37, 38, 39] have demonstrated notable advancements in feature selection using a technique known as SHAP. In this study, we leverage the SHAP technique to elucidate the classification model and identify the most influential features based on the explanatory insights it provides.

SHAP is a method used to understand ML models by analyzing the influence of individual features on the model's output and offering insights into its behavior. The goal of SHAP is to clarify why a particular prediction for an instance x is made by assessing the impact of each feature on that prediction. A unique aspect of SHAP is its presentation of Shapley values as an additive feature attribution technique, akin to a linear model. SHAP defines the explanation as follows:

$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$$

where g is the explanation model, $z' \in \{0, 1\}^M$ is the coalition vector (i.e., simplified features), M is the maximum coalition size, and $\phi_j \in \mathbb{R}$ is the feature attribution for feature j, the Shapley value.

In our suggested method, we utilize SHAP to interpret and compute Shapley values for a tree ensemble model, particularly ExtraTrees, on the unrefined dataset. Following this, we arrange the features based on their decreasing significance and visualize them. Fig. 2 displays the SHAP feature importance for the IEC 60870-5-104 Intrusion Detection Dataset.

Figure 2: SHAP feature importance on the IEC 60870-5-104 Intrusion Detection Dataset

As evident from the ranking, the primary features significantly influencing the model's ultimate prediction include "flow IAT min", "type id process information in control direction", "flow packet APDU length max", and "type id system information in control direction". These attributes hold notable importance due to the following rationale:

• "flow packet APDU length max" and "flow IAT min" provide information about the characteristics of network traffic flows, such as the maximum length of application layer data units (APDUs) and the minimum inter-arrival time between packets within a flow.

• "type id process information in control direction" represents the type of information being exchanged in the control direction, such as commands or requests sent from a master station to control devices.

• "type id system information in control direction" relates to the exchange of system-related information in the control direction, such as status updates or configuration changes.

By leveraging these features as key indicators of potential threats, our method, DAELID, can enhance its ability to detect and respond to security incidents in real time, safeguarding critical infrastructure and ensuring the integrity and reliability of industrial control processes.

After analyzing these findings, we set the optimal threshold τ to 0.0001.
Subsequently, we include all features with Shapley values equal to or exceeding this threshold. This method leads to the identification of a subset comprising 90 features from the original dataset, which originally comprised 111 features. The steps of this procedure are depicted in Alg. 1.

Algorithm 1 SFS: SHAP-powered Feature Selection
Input: Da - augmented dataset; τ - threshold; ET - ExtraTrees model.
1: Xtrain, ytrain, Xtest, ytest ← Da ▷ Split the augmented dataset Da into training and testing sets.
2: ET ← ET.fit(Xtrain, ytrain) ▷ Perform training using ExtraTrees.
3: R ← ET.predict(Xtest) ▷ Perform prediction using ExtraTrees.
4: Explainer ← SHAP.explainer(ET) ▷ Explain the model.
5: SHAPfeature ← Explainer.shap_value(Xtest) ▷ Get Shapley values of the features.
6: FS ← Select_Features(SHAPfeature, τ) ▷ Select features that have importance value ≥ τ.
Output: FS - Feature Set.
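A runnable Python sketch of Alg. 1, using the shap package's TreeExplainer together with the mean-|SHAP| importance of Eq. 15, might look as follows. The threshold, data, and class-aggregation convention are assumptions; the shape of SHAP outputs also varies between shap versions, so the class-axis handling below is one common convention rather than the paper's exact implementation.

```python
# Sketch of the SFS procedure (Alg. 1) with the shap package; threshold,
# data, and class-aggregation details are illustrative assumptions.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

def sfs(X, y, tau=1e-4):
    """Return indices of features whose mean |SHAP| value (Eq. 15) is >= tau."""
    X_tr, X_te, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=0)
    et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    explainer = shap.TreeExplainer(et)          # explain the trained model
    vals = explainer(X_te).values               # (n_samples, n_features[, n_classes])
    if vals.ndim == 3:                          # multi-class: average over classes
        vals = np.abs(vals).mean(axis=2)
    importance = np.abs(vals).mean(axis=0)      # I_j = (1/n) * sum_i |phi_j^(i)|
    return np.where(importance >= tau)[0]

X, y = make_classification(n_samples=1000, n_features=40, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
print("selected features:", sfs(X, y))
```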
3.3. Synthetic Data Generation

To augment limited datasets and improve the robustness and generalization of ML models trained on them, we use various generative models, including Tabular GAN (TabGAN), the Forest Diffusion Model (FDM), and Conditional Flow Matching (CFM), to generate synthetic samples of the IEC 60870-5-104 Intrusion Detection Dataset. These techniques are specifically designed for tabular data, allowing for the creation of synthetic data points that closely resemble the original dataset.

3.3.1. TabGAN-powered Dataset Augmentation

We opt for TabGAN for data augmentation due to its specialized design for tabular data, which allows it to maintain the distribution and attributes of the original dataset more efficiently compared to conventional image-based GANs. Research studies [40, 41] have illustrated TabGAN's efficacy in preserving data distribution and improving model robustness and generalization. These findings establish TabGAN as a preferred solution for generating synthetic data across diverse ML applications.

To summarize, our TabGAN-powered data augmentation phase contains five main steps:

Step 1 - Dataset Pre-processing: This involves handling missing values, eliminating duplicated samples, normalizing numerical features, and encoding categorical variables. After that, the dataset is divided into two subsets, training and testing, allocated at a ratio of 7:3. Here, τ serves as a constant used to determine the maximum number of samples per label class within the training set. The testing set plays a role in evaluating our method's TabGAN models.

Step 2 - Training the TabGAN Model: The subsequent phase entails training the TabGAN model using the dataset acquired in the initial step. During training, the TabGAN model learns the fundamental data distribution and produces artificial samples that closely mimic the original dataset.

Step 3 - Generating Synthetic Samples: After training the TabGAN model, it becomes capable of producing artificial instances based on the training dataset. By sampling from the generator model of TabGAN, new data points are created that exhibit similar statistical properties and distributions as the original dataset. These synthetic samples augment the original dataset and increase its size for further analysis.

Step 4 - Evaluating the Synthetic Samples: After generating synthetic samples, we evaluate their quality and assess how well they preserve the characteristics of the original dataset.

Step 5 - Utilizing Synthetic Samples for IDS: Finally, the generated samples can be integrated with the original dataset for intrusion detection tasks. By training intrusion detection models on a combination of the actual and generated data, the model's robustness and generalization capabilities can be enhanced, leading to improved detection performance and resilience against adversarial attacks.
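As a rough illustration of this pipeline, the sketch below uses the open-source tabgan package's GANGenerator; the call signature follows that package's documented example, but the column names, sizes, and the assumption that it post-filters generated rows against the test frame are ours, not the paper's implementation.

```python
# Illustrative sketch of TabGAN-style augmentation with the `tabgan` package.
# Column names, sizes, and generator settings are placeholders.
import numpy as np
import pandas as pd
from tabgan.sampler import GANGenerator

rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(700, 4)), columns=["f1", "f2", "f3", "f4"])
target = pd.DataFrame(rng.integers(0, 2, size=(700, 1)), columns=["label"])
test = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])

# Steps 2-3: fit the GAN on (train, target) and sample synthetic rows.
new_train, new_target = GANGenerator().generate_data_pipe(train, target, test)

# Step 5: concatenate real and synthetic samples for model training.
aug_X = pd.concat([train, new_train], ignore_index=True)
aug_y = np.concatenate([np.ravel(target), np.ravel(new_target)])
print(aug_X.shape, aug_y.shape)
```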
3.3.2. Diffusion-powered Dataset Augmentation

In our diffusion-powered data augmentation method, we use both the Forest Diffusion Model (FDM) and Conditional Flow Matching (CFM) techniques to generate synthetic samples. Diffusion models are trained using a forward process that adds noise to data and a reverse process that denoises it. Studies [42] have shown that diffusion models can preserve data distribution and improve model robustness and generalization. These results make diffusion a preferred choice for generating synthetic data in various ML applications.

Here, we outline the steps involved in using diffusion models for data augmentation:

Step 1 - Dataset Preprocessing: Similar to the GAN-based approach, we start by handling missing values, removing duplicate samples, normalizing numerical features, and encoding categorical variables. The data are divided into training and testing sets in a 7:3 ratio.

Step 2 - Training the Model: We train both the Forest Diffusion Model and the Conditional Flow Matching model on the preprocessed dataset. The FDM learns the data distribution through forward and reverse SDE processes, while the CFM captures the conditional dependencies within the data.

Step 3 - Generating Synthetic Samples: Once the models are trained, they generate synthetic samples by transforming noise samples into realistic data. FDM progressively refines noisy data into clean data, while CFM generates samples based on learned conditional distributions. These newly created data points augment the original dataset.

Step 4 - Evaluating the Synthetic Samples: The quality of the synthetic samples is assessed to ensure they maintain the characteristics of the original dataset. The use of both FDM and CFM techniques enhances the fidelity and diversity of the generated samples.

Step 5 - Utilizing Synthetic Samples for IDS: The synthetic samples produced by the FDM and CFM are integrated with the original dataset for intrusion detection tasks. This integration improves the robustness and generalization capabilities of the intrusion detection models, leading to enhanced detection performance.
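A sketch of this step with the open-source ForestDiffusion package, which implements both a diffusion and a flow-matching variant for tabular data, might look as follows. The constructor arguments are assumptions based on that package's documentation and may differ across versions; the data are placeholders.

```python
# Illustrative sketch of FDM/CFM-style tabular augmentation with the
# `ForestDiffusion` package; argument names follow its documented API but
# are assumptions here and may vary by version.
import numpy as np
from ForestDiffusion import ForestDiffusionModel

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))        # preprocessed numeric features
y = rng.integers(0, 3, size=500)     # class labels

# diffusion_type="vp" gives a diffusion (FDM-like) model,
# diffusion_type="flow" a flow-matching (CFM-like) model.
model = ForestDiffusionModel(X, label_y=y, n_t=50, duplicate_K=50,
                             diffusion_type="flow", n_jobs=-1)
fake = model.generate(batch_size=500)  # synthetic samples from the learned distribution
print(fake.shape)
```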
3.4. Parallel Boosting-powered Ensemble Learning for Intrusion Detection

The process of utilizing ensemble learning involves critically analyzing, evaluating, and selecting individual AI models to integrate into the ensemble model. This task is of paramount importance and can be effectively accomplished through practical experimentation employing AutoML frameworks. In our research, AutoGluon was utilized to assess and choose the most efficient AI models to incorporate into DAELID. After conducting experiments with training data from the IEC 60870-5-104 Intrusion Detection Dataset, we identified three models, XGBoost (XGB), RandomForest (RF), and ExtraTrees (ET), as the most effective for intrusion detection. Consequently, these models were selected as individual components to be trained in parallel and utilized within our ensemble model, DAELID.

In DAELID, multiple AI models are combined using a weighted voting technique. However, the influence of each individual model is controlled by a weighted score ($\omega_i \in [0, 1]$) within the overarching model, DAELID. Generally, with n AI models, their combined scores must sum up to 1: $\sum_{i=1}^{n} \omega_i = 1$. These scores are established through experimentation to determine the best blend of various AI models. As a result, Alg. 2 illustrates our DAELID approach.

Algorithm 2 Parallel Ensemble Learning-based Intrusion Detection
Model: XGB, RF, ET - pretrained models, and their weights $\omega_i$, where $\sum_{i=1}^{3} \omega_i = 1$.
Input: f - traffic flow.
1: F ← FeatureExtractor(f) ▷ Extract features of traffic flow f.
2: Fin ← SFSFeatureEngineering(F) ▷ Perform the feature engineering and keep only the feature set selected by SFS.
3: Perform in parallel three processes P1, P2, P3:
4: P1: pXGB ← XGB.predict(Fin) ▷ Predict using XGB.
5: P2: pRF ← RF.predict(Fin) ▷ Predict using RF.
6: P3: pET ← ET.predict(Fin) ▷ Predict using ET.
7: Wait until P1, P2, and P3 have finished.
8: scores ← pXGB · ω1 + pRF · ω2 + pET · ω3
9: L ← scores.argmax(axis = 1) ▷ Get the final predicted label.
Output: L - predicted label.

In Alg. 2, network traffic flows undergo initial capture, extraction, and modeling into a feature vector denoted as F. The Custom IEC 60870-5-104 Python Parser tool is employed to generate a vector containing 111 features for each flow. Subsequently, F undergoes a cleansing process to remove redundant samples and standardize the remaining data points. SHAP is then utilized to meticulously select the top 90 features for integration into the final DAELID model. These procedures are crucial for preparing the training set for each AI model prior to their individual training processes. Additionally, all AI models require training on the augmented datasets before integration into the DAELID framework.

Returning to the DAELID algorithm, the normalized vectors are subsequently fed into the artificial intelligence models for prediction. Predictions are conducted for all three foundational models in parallel. Finally, Equation 16 is employed to ascertain whether the activity is benign or constitutes an attack:

$$ELPred(f) = \frac{1}{n} \sum_{i=1}^{n} P_i(f) \cdot \omega_i \qquad (16)$$

where $P_1(f), P_2(f), \ldots, P_n(f)$ are the probability outputs of the n AI models for a feature vector f, and $\omega_1, \omega_2, \ldots, \omega_n$ are weight ratios representing the importance of each model, where each $\omega_i \in (0, 1)$ and $\sum_{i=1}^{n} \omega_i = 1$.
Through this approach, our DAELID method combines the strengths of multiple AI models to enhance intrusion detection performance, ensuring resilience against adversarial attacks and accurate identification of network threats.
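A minimal Python sketch of Alg. 2's prediction path, running the three predictors concurrently and combining them as in Eq. 16, is shown below; the thread-based parallelism and the weight values are illustrative assumptions.

```python
# Illustrative sketch of Alg. 2: parallel weighted soft voting (Eq. 16).
# Weights and the use of threads (vs. processes) are assumptions.
from concurrent.futures import ThreadPoolExecutor

def daelid_predict(models, weights, X):
    """models: fitted classifiers with predict_proba; weights: sum to 1."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # Steps 3-7: run the base predictors concurrently and wait for all.
        probas = list(pool.map(lambda m: m.predict_proba(X), models))
    # Steps 8-9: weighted combination of probability outputs, then arg-max.
    scores = sum(w * p for w, p in zip(weights, probas))
    return scores.argmax(axis=1)

# Usage with previously fitted XGB/RF/ET models (weights found experimentally):
# labels = daelid_predict([xgb, rf, et], [0.4, 0.3, 0.3], X_test)
```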
3.5. Model Parameter Optimization using Optuna

To optimize the machine learning models in our approach, such as when training individual models, we use Optuna [43] to find the best parameters for each model, ensuring their performance is maximized. This method is employed through Alg. 3.

Algorithm 3 OBHO: Optuna-Based Hyperparameters Optimization
Input: model - AI model; Dtrain = {(Xtrain, ytrain)} - training set; Dtest = {(Xtest, ytest)} - testing set; N - number of trials; T - optimization timeout; param - parameter definitions.
1: Initialize an empty list opt_params = ∅
2: def objective(trial):
3:   param ← {p1, p2, p3, ..., pn} ▷ Define the parameters of the machine learning model
4:   for p ∈ param do ▷ Use Optuna to suggest hyperparameters
5:     param[p] ← trial.suggest_⟨parameter type⟩("p", ⟨min value⟩, ⟨max value⟩)
6:   end for
7:   clf ← model(**param) ▷ Define the model with the suggested parameters
8:   clf.fit(Xtrain, ytrain) ▷ Train the model on the training data
9:   preds ← clf.predict(Xtest) ▷ Make predictions on the testing data
10:  metric ← performance_metric(ytest, preds) ▷ Calculate the performance metric
11:  return metric ▷ End of objective function
12: Optimize the objective function using Optuna:
13: study ← optuna.create_study(direction = "maximize")
14: study.optimize(objective, n_trials=N, timeout=T)
15: trial ← study.best_trial
16: opt_params ← trial.params ▷ Get optimized model parameters from the best trial
Output: opt_params - optimized model parameters.

Our method initializes an empty list to store the optimized model parameters (opt_params). Subsequently, it defines the objective function objective(trial), which assesses the model's performance with a specified set of hyperparameters. We define the parameters of the machine learning model (param), representing the hyperparameters for optimization. The algorithm suggests hyperparameters for each parameter in param; the model is then instantiated with the suggested hyperparameters and trained on the training data (Xtrain, ytrain). Predictions are made on the testing data (Xtest), and a performance metric is calculated based on the actual labels (ytest) and the predicted labels (preds). The algorithm returns the performance metric as the objective value. We create a study object (study), set the optimization direction to "maximize", and optimize the objective function (objective) by invoking study.optimize with the specified number of trials (N) and timeout (T), yielding the best trial from the study. We then retrieve the optimized model parameters (opt_params) from the best trial.

Finally, the algorithm returns the optimized model parameters as output. To sum up, the algorithm finds the best hyperparameters for a machine learning model by repeatedly letting Optuna suggest candidate values and checking how well the resulting model performs on a validation set. It subsequently provides the set of hyperparameters that yield the best performance metric.
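A concrete instance of Alg. 3 for XGBoost, written against Optuna's public API, might look like the sketch below; the data, search ranges, trial budget, and metric are placeholders.

```python
# Sketch of OBHO (Alg. 3) for XGBoost with Optuna; data, search ranges,
# trial count, and timeout are illustrative placeholders.
import optuna
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 3000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
    }
    clf = XGBClassifier(**params).fit(X_tr, y_tr)    # train with suggested params
    return accuracy_score(y_te, clf.predict(X_te))   # metric to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, timeout=600)  # N trials, T-second timeout
opt_params = study.best_trial.params
print("best parameters:", opt_params)
```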
3.6. DAELID-based Intrusion Detection

The proposed method, DAELID, detects network intrusions through the following main steps:

Step 1: We apply data engineering techniques to extract, clean, and vectorize features from the IEC 60870-5-104 Intrusion Detection Dataset. Using the SFS algorithm detailed in Section 3.2 for feature selection, we analyze and identify critical features to streamline data dimensionality for training and testing. This process results in refined training sets that exclude unnecessary features, optimizing preparation for the subsequent steps.

Step 2: Using the dataset from Step 1, we employ the different generative models outlined in Section 3.3 to create realistic samples. This step helps in addressing the imbalance in the dataset and improving the robustness and generalization of the ML models.

Step 3: We evaluate various AI models on each dataset generated by TabGAN, FDM, and CFM. This step aims to evaluate the synthetic samples generated by the various techniques to ensure they maintain the characteristics of the original dataset.

Step 4: To enhance our methodology, we employ parallel ensemble learning, which combines various AI models through weighted voting (detailed in Section 3.4). This approach harnesses individual model strengths, ensuring a resilient and precise intrusion detection system. Moreover, aggregating diverse decision boundaries enhances system robustness, making it more difficult for adversarial attacks to exploit vulnerabilities.

4. Experiments & Evaluation

In this section, we perform experiments to assess and illustrate the efficacy of the DAELID approach. The experiments aim to address the following research questions:
RQ1: Does employing feature selection techniques enhance the effectiveness of the training dataset for machine learning?

RQ2: Do GAN and diffusion models generate a higher-quality dataset compared to the original dataset?

RQ3: Do ensemble learning methods improve accuracy in intrusion detection while reducing the rates of overlooked intrusions and false positives when compared to other cutting-edge approaches?

RQ4: How does the DAELID method improve the resilience and robustness of AI models against adversarial attacks?

RQ5: Does executing models in parallel using data augmentation methods improve the time for detecting network intrusion anomalies?

The upcoming sections will present and discuss our experimental findings and assessments.

4.1. Experiment Environment

Our computational environment is deployed on an NVIDIA Tesla T4 (16GB) GPU and dual Intel Xeon Platinum 8160 processors, each with 24 cores, supported by 384GB of DDR4 RAM and 6.0TB of SSD storage. The software stack comprises the following libraries and frameworks: Pandas v1.5.3, Numpy v1.23.5, Scikit-learn v1.2.2, AutoGluon v0.7.0, XGB v1.7.5, CatB v1.1.1, and GBM v3.3.5. We also use Optuna [43] for hyperparameter optimization across our AI models.

4.2. Dataset Preparation

The IEC 60870-5-104 Intrusion Detection Dataset [1] is utilized to evaluate the effectiveness of AI-powered intrusion detection systems against attacks such as the Fast Gradient Sign Method (FGSM) and Conditional Tabular Generative Adversarial Networks (CTGAN). Specifically, we use the Balanced IEC104 Train Test CSV files.zip, containing balanced CSV files sourced from both CICFlowMeter and the IEC-104 Flow Extraction Module. These files are essential for training machine learning and deep learning methods within AI-powered IDS. The zip file contains several folders, each corresponding to different flow timeout values for various generators. In this study, we use the test custom 120 test and train files, which provide flow statistics related to multiple IEC 60870-5-104 attacks, including MITM, traffic sniffing, C RD NA 1, C CI NA 1, C RP NA 1, C SC NA 1, C SE NA 1, M SP NA 1 DOS, C CI NA 1 DOS, C SE NA 1 DOS, C RD NA 1 DOS, and C RP NA 1 DOS.

The dataset contains 12 unique labels and 6828 samples, which are divided into training and testing sets in a 7:3 ratio. Specifically, the training set contains 4800 samples, while the testing set includes 2028 samples.

SHAP-based Feature Selection: In the proposed method, we used SHAP to explain and calculate Shapley values for the ExtraTrees model - the best-performing model on the original, unoptimized dataset. We apply Alg. 1 using a set of thresholds to compute the optimal number of features for the classification task. We utilize eight thresholds: 0.005, 0.0025, 0.001, 0.00075, 0.0005, 0.0001, 0.000075, and 0.00005. For each threshold, features with SHAP values greater than the chosen threshold are selected. We analyze the results for each threshold, as shown in Fig. 3. Evidently, the threshold of 0.0001 gives the best result. This threshold yields 90 important features out of the total 111 features in the original dataset. We then save this set of features as a new training set, ready for use in subsequent stages.

Figure 3: ExtraTrees Performance-based Threshold

4.3. Evaluation Metrics

To evaluate the effectiveness of our network intrusion detection approach, we employ common metrics obtained from the confusion matrix. These metrics encompass Accuracy (Acc), Precision (Prec), F1-score (F1), True Positive Rate (TPR), False Positive Rate (FPR), and Area Under The Curve (AUC).

All ML models in this study are multi-label classifiers [44, 45]. For this type of evaluation, macro-average metrics are preferred over micro-average. Unlike micro-average metrics, which can skew results towards more frequent classes, macro-averages treat each label equally. This ensures a balanced assessment across all classes, effectively addressing class imbalances. Thus, we use macro-average metrics to ensure robust detection of all types of intrusions, including rare but critical ones. Here are the formulas to compute them for n labels, with $TP_i$ for True Positives, $TN_i$ for True Negatives, $FP_i$ for False Positives, and $FN_i$ for False Negatives of each label i:

Accuracy:

$$Acc = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \qquad (17)$$

Macro-averaged Precision:

$$Prec_{macro} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i} \qquad (18)$$
Macro-averaged F1-Score:

$$F1_{macro} = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \cdot TP_i}{2 \cdot TP_i + FP_i + FN_i} \qquad (19)$$

Macro-averaged FPR:

$$FPR_{macro} = \frac{1}{n} \sum_{i=1}^{n} \frac{FP_i}{FP_i + TN_i} \qquad (20)$$

Macro-averaged TPR (also known as Recall):

$$TPR_{macro} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FN_i} \qquad (21)$$

These formulas allow us to assess the efficacy of a model across multiple labels, providing insights into how well the model handles each label individually and in aggregate.
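These macro-averaged quantities map directly onto standard library calls; a small sketch is given below, with the per-label FPR of Eq. 20 computed from the multilabel confusion matrix since scikit-learn has no direct FPR helper. The label vectors are placeholders.

```python
# Sketch: computing the macro-averaged metrics (Eqs. 18-21) with scikit-learn.
# y_true / y_pred are placeholder label vectors.
import numpy as np
from sklearn.metrics import (f1_score, multilabel_confusion_matrix,
                             precision_score, recall_score)

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

prec = precision_score(y_true, y_pred, average="macro")  # Eq. 18
tpr  = recall_score(y_true, y_pred, average="macro")     # Eq. 21
f1   = f1_score(y_true, y_pred, average="macro")         # Eq. 19

# Eq. 20: per-label FPR = FP / (FP + TN), averaged over labels.
mcm = multilabel_confusion_matrix(y_true, y_pred)        # one [[TN, FP], [FN, TP]] per label
fpr = np.mean(mcm[:, 0, 1] / (mcm[:, 0, 1] + mcm[:, 0, 0]))

print(f"Prec={prec:.3f} TPR={tpr:.3f} F1={f1:.3f} FPR={fpr:.3f}")
```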
viding a baseline for comparison against the perfor-
4.4. Model Selection & Hyperparameter Optimization
mance of the weighted voting and other enhance-
We employ AutoGluon - an AutoML tool, to run ments.
through the dataset obtained in the previous step. This
helps in identifying the best model and its optimal • Scenario S2: Evaluate all base models on a new
parameters. We pass the dataset through 10 models to dataset generated using SHAP. This scenario aims
compute and rank their effectiveness. Figure 4 displays to demonstrate how SHAP-based feature selection
the performance of each model. can optimize the original dataset by focusing on im-
portant features, thereby reducing dimensionality.
• Scenario S3: Apply parallel ensemble learning on
the generated data using three different generative
models: GAN (TabGAN), Forest Diffusion Model
(FDM), and Conditional Flow Matching (CFM) on
the reduced dataset created by SHAP. This scenario
compares and identifies the most efficient generative
model for the intrusion detection task.

In Scenario S3, we created a weighted voting


Figure 4: Autogluon Leaderboard ensemble that incorporated the three models. Each
model contributed predictions weighted according to its
As indicated by the leaderboard, we chose XGBoost, performance. To find the optimal weights for each model,
ExtraTrees, and RandomForest as the three primary mod- we conducted experiments and cross-validation, aiming to
els for our experimental process. maximize the overall predictive accuracy of the ensemble.
To determine the best hyperparameters for each model, we use Optuna, as depicted in Alg. 3. After the automation process, the optimal hyperparameters for each model are as follows: for XGBoost, we set the number of estimators ("n_estimators") to 3000 and the learning rate ("learning_rate") to 0.1. For ExtraTrees and RandomForest, the number of estimators was configured to 20 and 52, respectively, with the maximum number of leaf nodes ("max_leaf_nodes") set to 15000 for both models.
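As an illustration of this tuning step, the short Optuna sketch below searches the two ExtraTrees hyperparameters named above; the search bounds and the synthetic placeholder data are our assumptions, since Alg. 3 itself is not reproduced here:

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the SHAP-reduced training set
X_train, y_train = make_classification(n_samples=600, n_features=90,
                                       n_informative=20, n_classes=12,
                                       n_clusters_per_class=1, random_state=0)

def objective(trial):
    model = ExtraTreesClassifier(
        n_estimators=trial.suggest_int("n_estimators", 10, 100),
        max_leaf_nodes=trial.suggest_int("max_leaf_nodes", 1000, 20000),
        n_jobs=-1, random_state=42)
    # Mean cross-validated accuracy is the objective being maximized
    return cross_val_score(model, X_train, y_train, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # the paper reports n_estimators=20, max_leaf_nodes=15000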
4.5. Results and Evaluation

In this section, we explore different scenarios to illustrate the effectiveness of our augmented dataset when applying feature selection techniques and generative models. Our objective is to showcase the model's performance under different strategies for feature selection and augmentation. We undertake three scenarios as detailed below:

• Scenario S1: Utilize the original IEC 60870-5-104 Intrusion Detection Dataset to evaluate the three base models. This scenario shows how the models perform on a standard, non-augmented dataset, providing a baseline for comparison against the performance of the weighted voting and other enhancements.

• Scenario S2: Evaluate all base models on a new dataset generated using SHAP. This scenario aims to demonstrate how SHAP-based feature selection can optimize the original dataset by focusing on important features, thereby reducing dimensionality.

• Scenario S3: Apply parallel ensemble learning on the generated data using three different generative models: GAN (TabGAN), Forest Diffusion Model (FDM), and Conditional Flow Matching (CFM) on the reduced dataset created by SHAP. This scenario compares and identifies the most efficient generative model for the intrusion detection task.

In Scenario S3, we created a weighted voting ensemble that incorporated the three models. Each model contributed predictions weighted according to its performance. To find the optimal weights for each model, we conducted experiments and cross-validation, aiming to maximize the overall predictive accuracy of the ensemble.
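The sketch below shows one way to realize such a weighted soft-voting ensemble; the weight values are illustrative placeholders (the actual weights were tuned by cross-validation), while the base-model hyperparameters are those reported in Section 4.4:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from xgboost import XGBClassifier

base_models = [
    (XGBClassifier(n_estimators=3000, learning_rate=0.1), 0.4),   # weights are
    (ExtraTreesClassifier(n_estimators=20, max_leaf_nodes=15000), 0.3),  # placeholders,
    (RandomForestClassifier(n_estimators=52, max_leaf_nodes=15000), 0.3),  # not the tuned values
]

def fit_ensemble(X, y):
    for model, _ in base_models:
        model.fit(X, y)

def predict_ensemble(X):
    # Weighted sum of the class-probability matrices, then argmax per sample
    proba = sum(weight * model.predict_proba(X) for model, weight in base_models)
    return np.argmax(proba, axis=1)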
Table 1: S1 & S2 Performance Evaluation (%)

Method   Acc     Prec    F1      AUC     TPR     FPR
Original
XGB      83.88   83.88   83.13   98.63   83.11   1.45
ET       84.17   84.17   83.71   98.07   83.89   1.43
RF       83.09   83.09   82.34   98.00   82.60   1.52
Original + SFS
XGB      84.32   84.32   83.65   98.62   83.67   1.41
ET       84.62   84.62   84.09   98.09   84.43   1.38
RF       83.97   83.97   83.27   98.23   83.34   1.44

4.5.1. S1 Evaluation:

In this scenario, we focus on evaluating three models: XGB, ET, and RF on the original IEC 60870-5-104 Intrusion Detection Dataset without using any data engineering techniques. Each model was fine-tuned to optimize performance.

The results from this experiment, referred to as Scenario 1 (S1), are presented in Table 1, which reports essential performance measures: Accuracy, Precision (macro), F1-score (macro), TPR, FPR, and AUC for each model. By examining these measurements, we can evaluate the fundamental effectiveness of each model on the raw dataset, setting a foundation for comparison with future experiments involving data engineering and feature selection.
Table 2: S3 Performance Evaluation (%)

Method   Acc     Prec    F1      AUC     TPR     FPR
TGAN-based Dataset Augmentation
XGB      83.93   83.93   83.31   98.61   83.28   1.45
ET       85.84   85.84   85.44   98.38   85.72   1.28
RF       85.45   85.45   84.83   98.03   85.07   1.31
DAELID   85.90   85.90   85.47   98.75   85.77   1.27
FDM-based Dataset Augmentation
XGB      84.27   84.27   83.74   98.68   83.75   1.42
ET       85.90   85.90   85.43   98.47   85.72   1.27
RF       85.90   85.90   85.53   98.57   85.69   1.27
DAELID   86.00   86.00   85.56   98.83   85.84   1.26
CFM-based Dataset Augmentation
XGB      85.31   85.31   84.91   98.76   85.01   1.33
ET       86.19   86.19   85.78   98.38   86.08   1.24
RF       86.79   86.79   86.31   98.74   86.77   1.19
DAELID   86.83   86.83   86.39   98.92   86.80   1.18

Figure 5: Confusion matrix of TGAN-based DAELID
4.5.2. S2 Evaluation:

In this scenario, we evaluate three models: XGB, ET, and RF on the dataset generated using SHAP. As discussed in Section 3.2, a threshold of 0.0001 yields an optimal subset of 90 features out of the original 111 features. This process reduces dimensionality, thereby decreasing processing time and enhancing the accuracy of all models by focusing solely on the most important features; a minimal sketch of this selection step follows at the end of this subsection.

The results of this scenario are presented in Table 1. As shown, all metrics surpass those of the original dataset in Scenario S1. Specifically, accuracy, F1-score, and TPR increase while FPR decreases. This demonstrates that SHAP-based feature selection significantly improves the quality of the training dataset, thereby addressing RQ1.
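The sketch below illustrates the selection step; it assumes a fitted ExtraTrees classifier (et_model) and a NumPy feature matrix, and it aggregates SHAP values by mean absolute magnitude, which is one common convention (the paper's exact aggregation, and older SHAP's list-of-arrays multiclass output, are assumptions here):

import numpy as np
import shap

explainer = shap.TreeExplainer(et_model)       # et_model: fitted ExtraTreesClassifier
shap_values = explainer.shap_values(X_train)   # assumed: one array per class (multiclass)
# Mean |SHAP| per feature, averaged over samples and classes
importance = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
keep = np.where(importance > 1e-4)[0]          # the 0.0001 threshold from Section 3.2
X_reduced = X_train[:, keep]                   # 90 of the 111 features survive here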
Figure 6: Confusion matrix of FDM-based DAELID

4.5.3. S3 Evaluation:

To validate the effectiveness of the weighted voting ensemble on the augmented dataset, we first evaluate the performance of all base models concurrently on datasets generated by TabGAN, FDM, and CFM. Metrics across all models in each generated dataset surpass those in Scenarios S1 and S2, showing that our data augmentation using GAN and Diffusion Models indeed improves the dataset's quality, addressing RQ2. Detailed performance metrics for these models and datasets are presented in Table 2, with confusion matrices depicted in Figs. 5, 6 and 7.

The CFM approach consistently outperforms the other augmentation strategies across all models. For instance, in the RandomForest (RF) model, the CFM-based dataset achieves an accuracy of 86.79%, compared to 85.90% for FDM and 85.45% for TabGAN. Similarly, the CFM method achieves an F1-score of 86.31%, while FDM scores 85.53% and TabGAN scores 84.83%. Additionally, employing a weighted voting ensemble further enhances performance, yielding the highest Accuracy of 86.83%, F1-score of 86.39%, and an AUC of 98.92%. The high AUC indicates that the refined data captures crucial insights from the original dataset, improving model generalization and prediction accuracy. As a result, this addresses RQ3: DAELID enhances intrusion detection accuracy while reducing missed intrusions and false positives.

Figure 7: Confusion matrix of CFM-based DAELID

Ultimately, while the weighted voting ensemble approach showed only marginal performance gains compared to individual models (improving accuracy by approximately 0.1% over the best base model across all generative scenarios), its true strength lies in its resilience against adversarial attacks. Unlike standalone models, which exhibited vulnerabilities to such attacks, the ensemble demonstrated stability by mitigating the impact of adversarial perturbations. This collective decision-making approach enhances the overall robustness of the intrusion detection system, protecting critical systems from sophisticated cyber-attacks. This addresses RQ4, confirming that the DAELID method not only bolsters stability but also ensures the safety of the intrusion detection system, thereby proving highly effective in defending against adversarial attacks.
We evaluated our approach's computational efficiency by measuring execution time across all scenarios. Our analysis highlights the impact of SHAP-based feature selection and parallel ensemble learning. We specifically compare processing times between Scenarios S1 and S2, and assess the execution times of the top generative model in Scenario S3 (CFM). Performance comparisons for each scenario are shown in Fig. 8.

Figure 8: Speed Evaluation (ms) - S1: XGB 36.59, ET 24.03, RF 23.11; S2: XGB 24.79, ET 22.28, RF 22.13; S3: XGB 52.87, ET 36.17, RF 54.37, SeqEns 143.41, ParaEns 81.79

In Scenario S2, we observe improved execution times compared to S1. Specifically, for S2, XGBoost (XGB) runs in 24.79 ms, ExtraTrees (ET) in 22.28 ms, and RandomForest (RF) in 22.13 ms, whereas in Scenario S1, XGB takes 36.59 ms, ET 24.03 ms, and RF 23.11 ms. These results highlight the effectiveness of data engineering techniques, including SHAP-based feature selection, which significantly enhances performance and speed in S2.

In S3, we evaluate the runtime of the best generative model and compare the time required for sequential and parallel ensemble techniques. Among the base models, ExtraTrees consistently had the fastest execution time at 36.17 ms, followed by XGBoost and RandomForest with execution times of 52.87 ms and 54.37 ms, respectively.

The results also show that the time needed for weighted voting DAELID to process 2,028 flows in the testing set is about 81.79 ms, resulting in an average time of 40.33 µs per flow. Consequently, the DAELID-based intrusion detection system can handle 1/40.33 µs ≃ 24,795 flows per second. Using the concepts of "mouse" and "elephant" flows [46], DAELID can analyze a network throughput of 24,795 × 10 KB ≃ 242.14 MB/s ≃ 1.94 Gbps for mouse flows (less than 10 KB/flow). For elephant flows (more than 10 MB), DAELID can reach up to 24,795 × 10 MB ≃ 247,950 MB/s ≃ 1,983 Gbps. This demonstrates that DAELID-based intrusion detection is capable of efficiently handling large-scale networks, performing real-time analysis of network traffic flow.
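The short check below reproduces this arithmetic, following the same unit conventions as the figures above (1 MB = 1024 KB for sizes, 1 Gbps = 1000 Mbps for rates):

flows, total_ms = 2028, 81.79
per_flow_us = total_ms * 1000 / flows      # ~40.33 µs per flow
flows_per_s = 1e6 / per_flow_us            # ~24,795 flows per second
mouse_mb_s = flows_per_s * 10 / 1024       # 10 KB per mouse flow -> ~242.14 MB/s
print(mouse_mb_s * 8 / 1000)               # ~1.94 Gbps for mouse flows
print(flows_per_s * 10 * 8 / 1000)         # 10 MB per elephant flow -> ~1,983 Gbps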
Finally, parallel processing significantly reduces execution time compared to sequential methods in ensemble techniques. For CFM, parallel processing reduced the time from 143.41 ms to 81.79 ms, a decrease of 61.62 ms, making it nearly twice as fast. This addresses RQ5, demonstrating that parallel ensemble techniques significantly enhance processing time for intrusion detection systems.
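One simple way to obtain this parallelism is to score the base models in separate threads, as in the sketch below; the helper is our illustration rather than the DAELID implementation (tree inference in scikit-learn and XGBoost spends most of its time outside the Python interpreter lock, so threads can overlap usefully):

from concurrent.futures import ThreadPoolExecutor

def parallel_predict_proba(models, X):
    # Submit one prediction task per base model and collect results in order
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(model.predict_proba, X) for model in models]
        return [f.result() for f in futures]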
4.6. Comparisons with SOTAs

We conducted a comparison between our proposed method, DAELID, and other existing methods using the identical dataset. The network intrusion detection rates for DAELID and the compared methods are presented in Table 3. With DAELID, the highest accuracy reaches 86.83%, surpassing the accuracy of the other methods in the comparison. Additionally, in the other scenarios, the accuracy also exceeds that of the compared methodologies. For example, Radoglou-Grammatikis et al. [1] proposed a method using a reinforcement learning approach that yields an accuracy of 83.14% for the decision tree classifier model, and D. Asimopoulos et al. [48] reach an accuracy of 82.44% using the RandomForest model. Consequently, all the measurements indicate that our suggested approach is presently the most effective AI-powered intrusion detection method on this dataset.

Table 3: Comparison with SOTA methods (%)

Method          Acc     Prec    F1      AUC     TPR     FPR
CFM-DAELID      86.83   86.83   86.39   98.92   86.80   1.18
TDAELID [47]    85.44   85.49   84.88   98.82   85.44   1.31
DT [1]          83.14   -       82.58   -       83.14   1.53
RF [48]         82.44   -       80.98   -       82.44   1.59

5. Conclusions

In this study, we introduce DAELID, an AI-driven approach designed to enhance the effectiveness and robustness of real-time intrusion detection systems (IDS) aimed at detecting network attacks on ICS via the IEC 60870-5-104 protocol. We propose enriching the training set by applying the SHAP algorithm for feature selection and several generative models that produce synthetic samples of the IEC 60870-5-104 Intrusion Detection Dataset, in order to obtain a high-quality dataset for the AI models. To enhance the performance and speed of intrusion detection, DAELID uses a parallel weighted voting ensemble algorithm, which
aims to combine multiple AI models, including XGBoost, RandomForest, and ExtraTrees. To minimize the delay in inspections, we suggest an effective ensemble learning strategy for sensing network flows. This approach significantly contributes to the overall resilience of intrusion detection systems, particularly in scenarios governed by the IEC 60870-5-104 protocol.

Through rigorous experiments with established datasets, we have demonstrated DAELID's superiority over other methods. Performance metrics and speed evaluation results confirm that DAELID is suitable for integration into inline IDPS for real-time intrusion detection in ICS. Given that typical network traffic in ICS does not exceed 1 Gbps, DAELID's low-latency deep flow analysis, combined with appropriate sampling, supports real-time deployment. However, deploying AI models for intrusion detection faces challenges like bottlenecks and latency, particularly in high-bandwidth networks. Deploying DAELID in intranet systems with extensive traffic requires periodic analysis sampling to manage these challenges, ensuring effective, AI-powered intrusion detection in high-volume environments.

In conclusion, our findings demonstrate significant improvements in detection accuracy and speed, highlighting the effectiveness of our approach in strengthening intrusion detection systems for ICS against evolving cyber threats targeting the IEC 60870-5-104 protocol. As we navigate the complex realm of cybersecurity, the insights gleaned from this study provide valuable tools and methodologies to enhance the reliability and adaptability of intrusion detection systems for ICS utilizing the IEC 60870-5-104 protocol. Future studies will seek to broaden our intrusion detection approach to address unidentified adversarial threats targeting both network and host systems.
CRediT authorship contribution statement

Tuyen T. Nguyen: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Investigation, Formal analysis.
Phong H. Nguyen: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Investigation, Formal analysis.
Hoa N. Nguyen: Writing – review & editing, Methodology, Conceptualization, Project administration, Supervision.

Declaration of competing interest

The authors assert that they do not possess any identifiable financial interests or personal relationships that could have been perceived to impact the findings presented in this paper.

Data and Code Availability

The DAELID datasets, codes, and experiment results used in this manuscript are available at https://github.com/.
References

[1] P. Radoglou-Grammatikis, K. Rompolos, P. Sarigiannidis, V. Argyriou, T. Lagkas, A. Sarigiannidis, S. Goudos, S. Wan, Modeling, detecting, and mitigating threats against industrial healthcare systems: A combined software defined networking and reinforcement learning approach, IEEE Transactions on Industrial Informatics 18 (3) (2022) 2041-2052. doi:10.1109/TII.2021.3093905.
[2] A. Aldweesh, A. Derhab, A. Z. Emam, Deep learning approaches for anomaly-based intrusion detection systems: A survey, taxonomy, and open issues, Knowledge-Based Systems 189 (2020). doi:10.1016/j.knosys.2019.105124.
[3] H. Liu, B. Lang, Machine learning and deep learning methods for intrusion detection systems: A survey, Applied Sciences 9 (20) (2019). doi:10.3390/app9204396.
[4] F. Qin-cui, L. Zi-ying, F. Ke-jia, Implementation of IEC60870-5-104 protocol based on finite state machines, in: 2009 International Conference on Sustainable Power Generation and Supply, 2009, pp. 1-5. doi:10.1109/SUPERGEN.2009.5348268.
[5] S. T. Ikram, A. K. Cherukuri, B. Poorva, P. S. Ushasree, Y. Zhang, X. Liu, G. Li, Anomaly detection using XGBoost ensemble of deep neural network models, Cybernetics and Information Technologies 21 (3) (2021) 175-188. doi:10.2478/cait-2021-0037.
[6] P. Mishra, V. Varadharajan, U. Tupakula, E. S. Pilli, A detailed investigation and analysis of using machine learning techniques for intrusion detection, IEEE Communications Surveys & Tutorials 21 (1) (2019) 686-728. doi:10.1109/COMST.2018.2847722.
[7] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, H. Janicke, Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study, Journal of Information Security and Applications 50 (2019). doi:10.1016/j.jisa.2019.102419.
[8] L. Bontemps, V. L. Cao, J. McDermott, N.-A. Le-Khac, Collective anomaly detection based on long short-term memory recurrent neural networks, 2016, pp. 141-152. doi:10.1007/978-3-319-48057-2_9.
[9] Y. Li, T. Qin, Y. Huang, J. Lan, Z. Liang, T. Geng, HDFEF: A hierarchical and dynamic feature extraction framework for intrusion detection systems, Computers & Security 121 (2022) 102842. doi:10.1016/j.cose.2022.102842.
[10] M. Aldarwbi, A. Habibi Lashkari, A. Ghorbani, The sound of intrusion: A novel network intrusion detection system, Computers and Electrical Engineering 104 (2022). doi:10.1016/j.compeleceng.2022.108455.
[11] N. Omer, A. H. Samak, A. I. Taloba, R. M. Abd El-Aziz, A novel optimized probabilistic neural network approach for intrusion detection and categorization, Alexandria Engineering Journal 72 (2023) 351-361. doi:10.1016/j.aej.2023.03.093.
[12] R. A. Disha, S. Waheed, Performance analysis of machine learning models for intrusion detection system using Gini impurity-based weighted random forest (GIWRF) feature selection technique, Cybersecurity 5 (1) (2022) 1.
[13] A. R. Kharwar, D. V. Thakor, An ensemble approach for feature selection and classification in intrusion detection using extra-tree algorithm, International Journal of Information Security and Privacy (IJISP) 16 (1) (2022) 1-21.
[14] T.-T.-H. Le, Y. E. Oktian, H. Kim, XGBoost for imbalanced multiclass classification-based industrial internet of things intrusion detection systems, Sustainability 14 (14) (2022) 8707.
[15] K.-C. Li, B. B. Gupta, D. P. Agrawal, Recent advances in security, privacy, and trust for internet of things (IoT) and cyber-physical systems (CPS) (2020).
[16] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (2006) 3-42.
[17] G. Biau, E. Scornet, A random forest guided tour, Test 25 (2016) 197-227.
[18] H. V. Vo, H. P. Du, H. N. Nguyen, AI-powered intrusion detection in large-scale traffic networks based on flow sensing strategy and parallel deep analysis, Journal of Network and Computer Applications 220 (2023) 103735. doi:10.1016/j.jnca.2023.103735.
[19] H. V. Vo, H. P. Du, H. N. Nguyen, APELID: Enhancing real-time intrusion detection with augmented WGAN and parallel ensemble learning, Computers & Security 136 (2024) 103567. doi:10.1016/j.cose.2023.103567.
[20] P. Radoglou-Grammatikis, P. Sarigiannidis, I. Giannoulakis, E. Kafetzakis, E. Panaousis, Attacking IEC-60870-5-104 SCADA systems, in: 2019 IEEE World Congress on Services (SERVICES), Vol. 2642-939X, 2019, pp. 41-46. doi:10.1109/SERVICES.2019.00022.
[21] D. C. Asimopoulos, P. Radoglou-Grammatikis, I. Makris, V. Mladenov, K. E. Psannis, S. Goudos, P. Sarigiannidis, Breaching the defense: Investigating FGSM and CTGAN adversarial attacks on IEC 60870-5-104 AI-enabled intrusion detection systems, in: Proceedings of the 18th International Conference on Availability, Reliability and Security, ARES '23, Association for Computing Machinery, New York, NY, USA, 2023. doi:10.1145/3600160.3605163.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (11) (2020) 139-144.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems 27 (2014).
[24] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214-223.
[25] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794-2802.
[26] J. H. Lim, J. C. Ye, Geometric GAN, arXiv preprint arXiv:1705.02894 (2017).
[27] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[28] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Advances in Neural Information Processing Systems 30 (2017).
[29] J. Yim, H. Stärk, G. Corso, B. Jing, R. Barzilay, T. S. Jaakkola, Diffusion models in protein structure and docking, Wiley Interdisciplinary Reviews: Computational Molecular Science 14 (2) (2024) e1711.
[30] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, arXiv preprint arXiv:2011.13456 (2020).
[31] B. D. Anderson, Reverse-time diffusion equation models, Stochastic Processes and their Applications 12 (3) (1982) 313-326.
[32] P. Vincent, A connection between score matching and denoising autoencoders, Neural Computation 23 (7) (2011) 1661-1674.
[33] A. Hyvärinen, P. Dayan, Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research 6 (4) (2005).
[34] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, M. Le, Flow matching for generative modeling, arXiv preprint arXiv:2210.02747 (2022).
[35] M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, Stochastic interpolants: A unifying framework for flows and diffusions, arXiv preprint arXiv:2303.08797 (2023).
[36] M. S. Albergo, E. Vanden-Eijnden, Building normalizing flows with stochastic interpolants, arXiv preprint arXiv:2209.15571 (2022).
[37] W. Estécio Marcílio Júnior, D. Eler, From explanations to feature selection: assessing SHAP values as feature selection mechanism, 2020. doi:10.1109/SIBGRAPI51738.2020.00053.
[38] A. Gramegna, P. Giudici, Shapley feature selection, FinTech 1 (2022) 72-80. doi:10.3390/fintech1010006.
[39] F. Hassan, J. Yu, Z. Syed, A. H. Magsi, N. Ahmed, Developing transparent IDS for VANETs using LIME and SHAP: An empirical study, Computers, Materials & Continua 77 (2023) 1-10. doi:10.32604/cmc.2023.044650.
[40] L. Xu, K. Veeramachaneni, Synthesizing tabular data using generative adversarial networks (2018).
[41] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional GAN, Curran Associates Inc., Red Hook, NY, USA, 2019.
[42] B. Trabucco, K. Doherty, M. Gurinas, R. Salakhutdinov, Effective data augmentation with diffusion models (2023). arXiv:2302.07944.
[43] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2623-2631. doi:10.1145/3292500.3330701.
[44] G. V. Le, T. H. Nguyen, P. D. Pham, O. V. Phung, H. N. Nguyen, GuruWS: A hybrid platform for detecting malicious web shells and web application vulnerabilities, Transactions on Computational Collective Intelligence 11370 (2019) 184-208. doi:10.1007/978-3-662-58611-2_5.
[45] H. V. Vo, H. N. Nguyen, T. N. Nguyen, H. P. Du, SDAID: Towards a hybrid signature and deep analysis-based intrusion detection method, in: GLOBECOM 2022 - 2022 IEEE Global Communications Conference, 2022, pp. 2615-2620. doi:10.1109/GLOBECOM48099.2022.10001582.
[46] J. Alvarez-Horcajo, D. Lopez-Pajares, J. M. Arco, J. A. Carral, I. Martinez-Yelmo, TCP-path: Improving load balance by network exploration, in: 6th IEEE International Conference on Cloud Networking, CloudNet 2017, Prague, Czech Republic, September 25-27, 2017, IEEE, 2017, pp. 65-70. doi:10.1109/CloudNet.2017.8071533.
[47] T. T. Nguyen, P. H. Nguyen, M. Q. Nguyen, H. N. Nguyen, TabGAN-powered data augmentation and explainable boosting-based ensemble learning for intrusion detection in industrial control systems, in: International Conference on Computer Communication and the Internet, ICCCI, 2024.
[48] D. Asimopoulos, P. Radoglou-Grammatikis, I. Makris, V. Mladenov, K. Psannis, S. Goudos, P. Sarigiannidis, Breaching the defense: Investigating FGSM and CTGAN adversarial attacks on IEC 60870-5-104 AI-enabled intrusion detection systems, 2023, pp. 1-8. doi:10.1145/3600160.3605163.