
Predict And Prevent DDOS Attacks Using Machine Learning and

Statistical Algorithms
Azadeh Golduzian
Department of Mathematics and Statistics, University of New Mexico, NM 87106 USA
Corresponding author: Azadeh Golduzian (agolduzian96@unm.edu).
This work was supported in part by the University of New Mexico, Albuquerque.

A malicious attempt to exhaust a victim's resources in order to crash it or halt its services is known as a distributed denial-of-service (DDoS) attack. DDoS attacks stop authorized users from accessing specific services available on the Internet. They target varying components of a network layer, and it is best to stop them at layer 4 (the transport layer) before they reach higher layers. This study uses several machine learning and statistical models to detect DDoS attacks from traces of traffic flow and suggests a method to prevent DDoS attacks. For this purpose, we used logistic regression, CNN, XGBoost, naive Bayes, AdaBoostClassifier, KNN, and random forest ML algorithms. In addition, data preprocessing was performed using three methods to identify the most relevant features. This paper explores the issue of improving DDoS attack detection accuracy using the latest dataset, CICDDoS2019, which has over 50 million records. Because we employed an extensive dataset for this investigation, our findings are trustworthy and practical. Our target class (attack class) was imbalanced; therefore, we used two techniques to deal with imbalanced data in machine learning. The XGBoost machine learning model provided the best detection accuracy (99.9999%) after applying the SMOTE approach to the target class, outperforming recently developed DDoS detection systems. To the best of our knowledge, no other research has worked on the most recent dataset with over 50 million records, addressed the statistical technique to select the most significant features, achieved this high accuracy, and suggested ways to avoid DDoS attacks.

I. INTRODUCTION

The phenomenon known as a "Distributed Denial-of-Service (DDoS) Attack" constitutes a formidable form of cybercrime that unleashes havoc by inundating a server with an overwhelming barrage of requests, effectively rendering online sites and services inaccessible to legitimate users. The insidious intent behind a DDoS attack lies in its capacity to disrupt the seamless provision of both internal and external services offered by a website [1]. What sets DDoS attacks apart in their potency is their utilization of a multitude of compromised computer systems as sources for the deluge of attack traffic. This assortment of exploited machines extends to encompass a gamut of devices, including computers and networked resources such as Internet of Things (IoT) devices [1]. The advent of the IoT heralds an interconnected realm wherein objects interlink to glean and exchange information autonomously, erasing the need for manual intervention [2]. However, as the IoT's omnipresence burgeons, the proliferation of remote employees and the surge in Internet-connected devices come with a caveat. IoT devices, while prolific, may not consistently uphold robust security measures, leaving the networks they permeate susceptible to infiltration and malicious attacks.

Hence, the imperativeness of DDoS prediction and fortification looms large. An intriguing dichotomy emerges when inspecting DDoS attack trends: the frequency of such attacks witnessed a discernible dip from the onset of 2021 until its culmination, while maintaining relative consistency over the preceding biennium. Remarkably, 2021 exhibited a mere 3% reduction compared to the prior year. Paradoxically, as the attack frequency wanes, the magnitude of these attacks experiences exponential growth [3]. Envision a malevolent hacker striving to incapacitate a service. Here, the hacker's modus operandi veers from employing a solitary computer and static IP address; instead, an arsenal of diverse computers, each equipped with distinct IP addresses, is wielded to elude security measures. A pivotal hallmark of DDoS attacks is their orchestration from a multitude of hosts, encompassing even the manipulation of your server to assail another. Often orchestrated by botnets, these attacks unfold through networks of automated robots or computers, each programmed to execute specific tasks, a landscape where the term "zombies" finds relevance. Historically, the roots of DDoS trace back to 1998, though the full impact remained obscure until July 1999, when influential organizations and corporate entities endured the brunt of these assaults [4]. The repercussions of such attacks are far-reaching, with organizations and communications infrastructures susceptible to debilitating disruptions that may extend to minutes or even hours if proactive protection mechanisms are not in place. Thus, the urgency to fortify digital bastions against these evolving threats has never been more critical.



In addition, there are capacity issues for businesses that provide defense systems to stop this attack. Fig. 1 presents some common DDoS attacks: 1- UDP Flood, 2- ICMP (Ping) Flood, 3- SYN Flood, 4- Ping of Death, and 5- HTTP Flood. A brief description of each attack follows:

FIGURE 1. Types of DDoS attacks

● UDP Flood: In this attack, the attacker floods random ports on the target network with IP packets containing UDP datagrams. The victim system repeatedly attempts to match each incoming datagram to a listening application, but it is unable to do so and will eventually wear out and fail.
● HTTP Flood: This attack uses numerous, seemingly genuine HTTP GET or POST requests to target an application or web server. These requests are frequently crafted to help criminals avoid detection, after learning crucial details about their intended victims before an attack.
● Ping Flood and Ping of Death: Another typical flood attack exploits many ICMP echo requests. The target system attempts to react to numerous requests, eventually exhausting its network bandwidth because each ping received requires an equal number of packets to be returned. Another variation of this attack, known as "ping of death," causes the operating system to crash by having the victim receive ping packets with an incorrect format and size.
● SYN Flood: Three-way communication between two systems is necessary for every TCP session. Using a SYN flood, the attacker rapidly overwhelms the victim with connection requests, so numerous that it can no longer handle them, causing network saturation. This occurs when the host sends a large number of TCP/SYN packets with a forged sender address. Each of these packets functions as a connection request, causing the server to maintain many half-open connections. The server returns TCP/SYN-ACK packets and waits for response packets from the sender's address; however, no response is returned because the sender's address is fake.

This paper aims to find the best algorithms to detect DDoS attacks, separate them from regular traffic, and identify the most relevant features so that attacks can be prevented in advance. The significant contributions of this study are as follows:
● Analysis of the latest dataset, CICDDoS2019, with over 50 million records and 88 features.
● Use of statistical and machine learning algorithms to determine the most relevant features (feature selection).
● Training of various machine learning models to improve the effectiveness of DDoS detection.
● Testing of the machine-learning models and selection of the model with the best accuracy score.

II. RELATED WORKS

The closest competitors to this study, and similar models that used the CICDDoS2019 dataset, are briefly described in this section. The research in [5] surveys recently created machine-learning-based DDoS detection methods. The authors of [6, 7, 8] suggested the naive Bayes model as a DDoS detection method, whereas [9, 10, 11] applied a support vector machine model to identify the presence of DDoS attacks. In addition, as shown in [12, 13], the decision tree algorithm has been used to detect DDoS attacks. The authors of [14] used a deep neural network (DNN) as a deep learning technique to identify DDoS attacks in a sample of packets from network traffic. Because the DNN model includes feature extraction and classification techniques, it can operate quickly and with a high degree of detection accuracy even with small samples. The CICDDoS2019 dataset, which contains multiple DDoS attack types and was released in 2019, was used by the authors to conduct the tests. The proposed system achieved an accuracy rate of 94.57 percent using a deep learning model.



Zeeshan Ahmad et al. [15] proposed a scientific classification approach that depends on well-known ML and DL processes in planning a network-based IDS (NIDS) framework. To the best of our knowledge, no other research has worked on the most recent dataset with over 50 million records, addresses the statistical technique to select the most significant features, has this high accuracy, suggests ways to avoid DDoS attacks, and tackles these issues.

III. PROPOSED MODEL

This study aims to create an accurate DDoS attack detection algorithm with a low false-positive rate. Here, a model based on a collection of seven classifiers for machine learning is provided. The chosen classifiers were naive Bayes, KNN, logistic regression, CNN, XGBoost, AdaBoost, and random forest. All seven algorithms in our model operate independently and produce a unique data model. The outputs of the seven classifiers were compared using accuracy, precision, and recall to arrive at the model's outcome. To train the model, the CICDDoS2019 dataset, which contains 88 features, was used. The feature set of the training dataset should be condensed, which is accomplished using the feature selection techniques ANOVA, ExtraTreeClassifier, and logistic regression.

A. Methodology

In the context of this study, we employed the CICDDoS2019 dataset as the foundation for our DDoS attack detection endeavors. The systematic process employed for DDoS detection is portrayed in Figure 2, offering a visual representation of our investigative methodology. Initiating this process, the foremost step involved meticulous examination of a sufficiently comprehensive DDoS dataset. Specifically, all 11 CSV files were amalgamated to create a unified and holistic dataset that would serve as the bedrock of our analysis. Subsequent to dataset compilation, our focus turned to refining and curating the data. A critical aspect of this stage entailed the selection of essential characteristics and features that would play a pivotal role in our analysis. This was accomplished through the application of three distinct feature-selection techniques, each tailored to extract key attributes that would contribute to the precision of our DDoS detection framework. The culmination of these preparatory steps led to the development of a robust machine learning system, meticulously designed to detect DDoS attacks within the dataset. To gauge the efficacy of this system, we subjected it to comprehensive testing using an array of machine learning models. The testing phase encompassed both training and validation components, collectively serving to quantify the accuracy and reliability of the model's performance. This multistep approach, spanning dataset curation, feature extraction, and model training and testing, constitutes a comprehensive framework designed to empower accurate DDoS attack detection. By iteratively refining our methods and harnessing the potential of machine learning, we endeavor to fortify the efficacy of our approach and contribute to the domain of cybersecurity research.

B. CICDDoS2019 Dataset

CICDDoS2019 represents the most up-to-date version of the dataset currently accessible for analysis. Comprising a comprehensive collection of 88 distinct features and over 50 million records, this repository includes data on both benign and denial-of-service (DoS) flows. The dataset's uniqueness lies in its incorporation of network traffic analysis outcomes, which have been meticulously annotated based on a variety of attributes such as source ports, inbound and outbound IP addresses, destination ports, and protocols. A focal point of our investigation centers on the attribute denoted as "Label." This particular attribute serves as our target of interest, dichotomizing data into two distinct classes: Y=1, denoting the attack class, and Y=0, characterizing the benign class. Notably, the observations attributed to the attack class surpass those of the benign class, thus creating an inherent imbalance within the dataset. To address the challenge posed by imbalanced classes, a combination of oversampling and undersampling techniques was deployed. Among these methodologies, the Synthetic Minority Over-sampling Technique (SMOTE) emerged as the most effective strategy for our data. By employing SMOTE, a balanced representation was achieved, resulting in approximately 50% of the data constituting attack instances, while the remaining 50% corresponded to benign instances. This strategic application of oversampling is depicted in Figure 3, underscoring the equilibrium achieved within the dataset after employing the SMOTE technique. This step not only addresses the imbalanced nature of the dataset but also lays the foundation for more robust and accurate analytical outcomes.
FIGURE 3. After Over Sampling
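To make the workflow above concrete, the following is a minimal sketch of how the dataset assembly, train/test split, and SMOTE balancing could be implemented with pandas, scikit-learn, and imbalanced-learn. The file paths, column names, and parameters are illustrative assumptions, not the exact code used in this study.

```python
# Illustrative sketch only: paths, column names, and parameters are assumptions.
import glob

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# 1) Merge the 11 CICDDoS2019 CSV files into a single dataframe (path is hypothetical).
frames = [pd.read_csv(path) for path in glob.glob("CICDDoS2019/*.csv")]
df = pd.concat(frames, ignore_index=True)

# 2) Basic cleaning so SMOTE and the classifiers receive finite numeric values
#    (see the preprocessing sketch in the data preprocessing section for details).
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# 3) Binary target: 1 = attack, 0 = benign (assumes a textual "Label" column).
y = (df["Label"].astype(str).str.upper() != "BENIGN").astype(int)
X = df.drop(columns=["Label"]).select_dtypes("number")

# 4) Hold out a test set, then rebalance only the training split with SMOTE
#    so the test data keeps its original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(pd.Series(y_train_bal).value_counts())  # roughly 50/50 after oversampling
```

Rebalancing only the training split is one common convention; the paper itself simply reports that SMOTE produced an approximately 50/50 class split.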



FIGURE 2. ML Data Process

There are 11 CSV files in the CICDDoS2019 dataset, corresponding to 11 attacks, namely TFTP, Syn, DrDoS_UDP, DrDoS_DNS, DrDoS_LDAP, DrDoS_SSDP, DrDoS_MSSQL, DrDoS_NetBIOS, UDP-lag, DrDoS_SNMP, and DrDoS_NTP. To build a machine-learning model, I split the data into training and test datasets, first training the model with the training dataset and then testing the model. Testing the model is key to understanding the performance of the suggested model.

C. Data preprocessing

Data preparation is a practical and effective data-mining technique used to format the raw data. The steps involved in data preprocessing were 1- data cleaning, 2- data transformation, and 3- data reduction. Most machine learning models only work with numeric values, so in this step we converted non-numeric values into numbers. Feature set selection is performed using a heatmap matrix, a tree classifier, and an additional logistic regression approach, which together reduce the features used to train the model. Using more features can lead to low accuracy, whereas using fewer features can lead to a high false-positive rate; the number of features should therefore be balanced to obtain a model with high accuracy and a low false-positive rate. In the dataset, attributes (columns) that contained mostly zero values were removed because they negatively affected the models.

Furthermore, in our pursuit of refining data quality, it was imperative to address the presence of rows harboring either missing values or infinite values. Recognizing that certain algorithms are ill-equipped to handle such anomalies, and that these aberrations can impede machine learning efficiency, we omitted these rows from our analysis.

As emphasized earlier, the CICDDoS2019 dataset spans 88 distinct features. The sheer magnitude of these features introduces complexities during both training and prediction. Navigating this intricate landscape therefore necessitates a judicious curation of features, choosing the ones most germane to the model's objectives. Opting to engage with pertinent features, as opposed to all features irrespective of their relevance, furnishes several invaluable advantages: it speeds up model training, leaves more time for the prediction task, amplifies prediction accuracy, and decisively mitigates the risk of overfitting.

The ensuing section elucidates the strategies employed to distill indispensable features from the CICDDoS2019 dataset. A panoramic view of these strategies is encapsulated within Figure 4, delineating four categories:
● Filter Methods: techniques that derive feature relevance independently of the chosen machine learning model.



● Wrapper Methods: methodologies that evaluate feature subsets based on model performance, often employing a predictive model as a yardstick.
● Embedded Methods: methods where feature selection transpires as an inherent part of the model-building process.
● Hybrid Methods: a fusion of the aforementioned strategies to harness their collective strengths.
In a bid to orchestrate this feature curation, our approach gravitated towards a composite of techniques. Filtering, embedded, and elimination methods were harnessed in tandem, buttressed by logistic regression, a formidable statistical methodology renowned for its robust analytical capabilities. The subsequent phases of our analysis were choreographed not only to isolate essential features but also to combine these features in a manner that underpins the veracity and robustness of our predictive model.

FIGURE 4. Feature Selection
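As a rough illustration of the cleaning steps listed above (handling missing and infinite values, encoding non-numeric columns, and dropping mostly-zero columns), one possible pandas implementation is sketched below; the zero-value threshold and helper name are assumptions, not values taken from the paper.

```python
# Preprocessing sketch; the 95% zero-value threshold is illustrative.
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, zero_threshold: float = 0.95) -> pd.DataFrame:
    # Replace infinite values with NaN and drop rows containing NaN,
    # since several of the classifiers cannot handle such rows.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()

    # Convert non-numeric columns (e.g. protocol or label strings) to integer codes,
    # because most machine learning models only accept numeric values.
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = df[col].astype("category").cat.codes

    # Drop columns that are almost entirely zero; they add little signal.
    mostly_zero = [c for c in df.columns if (df[c] == 0).mean() > zero_threshold]
    return df.drop(columns=mostly_zero)
```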

1. Filtering

In contrast to emphasizing cross-validation performance, filter approaches delve into the inherent traits of features, gauged through univariate statistics. Unlike wrapper methods, these techniques function with swiftness and computational efficiency, making them a pragmatic choice. Particularly when dealing with data with a high number of dimensions, filter methods prove to be more economically viable in terms of computation. Figure 5 shows the heat map matrix, a quintessential embodiment of the filter approach. Within this matrix, the interplay of the 30 features is vividly portrayed, painted with shades of correlation. This correlation, a measure of linear connection between multiple variables, can unravel the ability of one variable to forecast another. This is akin to deciphering the symphony within data, where one note holds predictive potential for another. The tapestry of desirable features is woven with threads of correlation that intricately connect them to the target: a strong correlation between a variable and the target makes it an attractive candidate for selection. However, this isn't a simple weave; the variables should remain uncorrelated amongst themselves while maintaining a meaningful correlation with the ultimate goal. Just like skilled musicians in an orchestra, variables must strike a balance, each contributing in a harmonious fashion to create the symphony of predictive excellence.

FIGURE 5. Heat Map
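A minimal sketch of this filter step is shown below: it computes the pairwise correlations behind a heat map like Figure 5 and ranks the features by their correlation with the target. The use of seaborn and the X_train/y_train variables carried over from the earlier data-loading sketch are assumptions.

```python
# Filter-method sketch: correlation heat map and correlation with the target.
import matplotlib.pyplot as plt
import seaborn as sns

corr = X_train.assign(Label=y_train).corr()  # Pearson correlation matrix

# Heat map of feature/feature and feature/target correlations (cf. Figure 5).
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0.0)
plt.tight_layout()
plt.show()

# Rank candidates: a strong (absolute) correlation with the target is desirable,
# while features strongly correlated with each other are largely redundant.
target_corr = corr["Label"].drop("Label").abs().sort_values(ascending=False)
print(target_corr.head(30))
```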



2. Embedded

The methods we used in this study combined the strengths of the wrapper and filter approaches by considering how different features work together while also keeping the calculations manageable. Embedded methods, which work step by step while training the model, pick out the most helpful features at each stage. We used a tree-based classifier (the ExtraTreeClassifier) to determine which features were the most important. In Figure 6, we highlight the top 30 features from the dataset of 50 million records. When it came to picking the right features, we did not just guess: we tried different combinations of the 30, 20, and 10 best features to see what gave us the best results. Figure 6 also shows that six features, "Inbound," "URG Flag Count," "Destination Port," "Avg Bwd Segment Size," "min-seg-size-forward," and "Source Port," stood out as the most useful. But not all features are created equal. As shown in Figure 6, only 10 of the 30 features turned out to be strong predictors; the rest were less useful for our goal. This discovery helped us focus on what really matters in making accurate predictions.

3. Feature elimination with logistic regression

As shown in Fig. 7, the p-values for most of the variables are smaller than 0.05, except for four variables. Therefore, when we wish to test our model with 20 and 10 features, these are the candidate characteristics that we will eliminate. Here, I must point out that we cannot make a decision based only on the logistic regression; rather, we must also take into account the heat map and the extra tree classifier, because statisticians believe that p-values that are too low (0.0000) or too high (1.000) are questionable. Again, we see that "Inbound" has the highest effect and "Fwd Packet Length Mean" is the one that needs to be removed from our model.

D. DDOS ML Model

Using individual classifiers, the model classifies the data individually. These classifiers operate in parallel and generate multiple models from a training dataset. Although many other classifiers are available for machine learning, we used seven classifiers.

1. Naive Bayes

Indeed, the probabilistic learning model known as Naive Bayes (NB) finds its niche in the realm of machine learning,
where its prowess lies in data classification. By following a
sequence of calculations rooted in probability principles, Naive
Bayes takes in data and efficiently sorts it into predefined
categories. The underlying foundation is Bayes' theorem, a
pivotal theorem in probability theory. One of the charming
traits of Naive Bayes is its simplicity and speed of
implementation. However, this efficiency comes with a
prerequisite—it assumes that the predictors are independent of
each other, which is a bit of an idealized assumption. So, how
does this classification wonder work? It starts with feeding a
chunk of training data into the system. This training data is a
compilation of samples, each tagged with a specific class. The
essence of NB lies in learning from these examples. When it
comes to testing, the system processes the test data using
methods learned from the training data. It's like following a
well-practiced recipe—each ingredient (feature) contributes its
flavor to the final dish (class prediction). The more elaborate
your training dataset, the more adept the system becomes at
dishing out accurate predictions for the test data's class. It's as
if the system becomes a seasoned chef, confidently picking out
the main ingredients from a jumble of flavors. So, in this
culinary analogy of machine learning, Naive Bayes serves as
the master chef—quick, efficient, and reliable, as long as the
ingredients play along with its independence assumption.
FIGURE 6. The best 30 relevant features



FIGURE 7. P-value of Selected Features
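The embedded and p-value-based selections described above could be sketched as follows, assuming scikit-learn's ExtraTreesClassifier for the importances behind a chart like Figure 6 and statsmodels for the logistic-regression p-values summarized in Figure 7; hyperparameters and variable names are illustrative.

```python
# Sketch of the embedded and statistical feature-selection steps.
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import ExtraTreesClassifier

# Embedded method: rank features by impurity-based importance (cf. Figure 6).
forest = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
importances = pd.Series(forest.feature_importances_, index=X_train.columns)
top30 = importances.sort_values(ascending=False).head(30)

# Statistical method: logistic-regression p-values for the candidate features
# (cf. Figure 7); variables with p >= 0.05 become candidates for elimination.
logit = sm.Logit(y_train, sm.add_constant(X_train[top30.index])).fit(disp=0)
print(logit.pvalues.sort_values())
```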

2. KNN
KNN is the abbreviation for k-nearest neighbors. It is a nonparametric algorithm based on a supervised learning technique that can be used to solve classification and regression problems. It stores all the existing data and classifies a new data point based on its similarity. When new data are introduced, it determines the class of the new data by looking at its k nearest neighbors. The Manhattan, Minkowski, and Euclidean distance functions can be used to determine the distance between two data points; the Euclidean distance function was used in this study. Similarity between the samples to be classified and the samples in the known classes was measured: when faced with new data, the distance of this data to each training sample was calculated using the Euclidean function, and the classification set was then created by selecting the k samples with the smallest distance. The classification outcome depends on the chosen number of neighbors, k.

3. Random Forest

A random forest consists of several decision trees that are independently trained on random subsets of the labeled data. Random forest works well because many relatively independent trees perform better than any individual model.

4. XGBOOST

Let's delve further into the world of the XGBoost learning model, a powerhouse that operates on the foundation of trees but brings a turbocharged performance that's almost too good to believe. In fact, its speed is so impressive that it can be up to 100 times faster than other models in its league. Imagine a racing car on a machine learning track; XGBoost would be the one leaving all others in the dust. The allure of XGBoost lies not just in its breakneck pace, but in its harmonious blend of attributes. It's not just a speed demon; it's also scalable, accommodating a vast volume of data without breaking a sweat. Efficiency dances hand in hand with simplicity, making it accessible to a wide range of users, from newcomers to seasoned data scientists. One of the remarkable traits of XGBoost is its reliability when it comes to handling large datasets. It's like having a reliable workhorse that can manage massive loads with ease, ensuring that data analysis doesn't buckle under the weight of sheer volume. When you're navigating through an ocean of data, XGBoost is the trusty ship that keeps you sailing smoothly. At the core of its operation lies probability, a crucial element in its decision-making process.



Each step XGBoost takes, each branching decision it makes, is rooted in probability. It's like a well-tailored suit, where each stitch is carefully calculated to bring out the best fit, except in this case it's the best fit for predicting outcomes with an astute understanding of likelihood. So, in the realm of machine learning, where models are akin to skilled artisans crafting masterpieces, XGBoost stands out as the sprinter among the runners, the reliable partner amidst the data deluge, and the probability-savvy maestro composing predictions with precision [15].

5. Logistic Regression (Statistical Method)

The probability of a response variable was predicted using a supervised learning classification algorithm known as logistic regression. Because the nature of the dependent variable is binary, only two possible classes exist. Here, if the traffic flow is considered an attack, its value is 1. In other words, P(Y=1) was predicted by a regression model as a function of X.

6. Convolutional Neural Network (CNN)

A convolutional neural network (CNN/ConvNet) is a deep learning technique that takes an input image and assigns importance (learnable weights and biases) to the objects/aspects in the image while also distinguishing which ones are related to one another. Comparatively speaking, the ConvNet algorithm requires less "pre-processing" than other classification techniques. ConvNets can learn their filters/features with sufficient training, whereas the filters of earlier approaches were manually engineered. The main advantage of CNN over its predecessors is that it automatically detects significant features without human supervision, making it the most widely used [16].

7. AdaBoost

AdaBoost is an ensemble learning technique (statistical classification) that was initially developed to boost the performance of binary classifiers (sometimes referred to as "meta-learning"). AdaBoost uses an iterative process to improve weak classifiers by learning from their errors.

IV. RESULTS AND DISCUSSION

In this section, we summarize the findings from various tests conducted to evaluate the effectiveness of the machine learning models. Moreover, we introduce an analysis to determine the differences between the ML models.

A. Performance Evaluation

To evaluate the effectiveness of the deployed DDoS detection system, several factors were considered, including the following.

Confusion matrix: A confusion matrix table is used to describe the performance of a classification system. The effectiveness of the classification was represented and summarized using a confusion matrix.
● TP: You predicted positive, and it's true.
● FP (Type 1 Error): You predicted positive, and it's false.
● TN: You predicted negative, and it's true.
● FN (Type 2 Error): You predicted negative, and it's false.

Accuracy: The overall proportion of correct predictions a model makes on a dataset: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Recall: Recall is determined as the proportion of positive samples correctly identified as positive among all positive samples: Recall = TP / (TP + FN).

F1-Score: The harmonic mean of recall and precision is known as the F1-score. It combines recall and precision into a single measure: F1 = 2 × (Precision × Recall) / (Precision + Recall).

Precision: Metrics such as precision and recall allow us to assess how well a classification model predicts outcomes for a given class of interest, or "positive class." While recall measures the degree of the error caused by false negatives (FNs), precision measures the degree of the error caused by false positives (FPs): Precision = TP / (TP + FP).
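These metrics map directly onto scikit-learn, as in the hedged sketch below; XGBoost stands in for the full set of seven classifiers, the hyperparameters are illustrative, and the balanced training split is assumed to come from the earlier SMOTE sketch.

```python
# Evaluation sketch: one classifier (XGBoost) scored with the metrics defined above.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=6, n_jobs=-1)  # illustrative settings
model.fit(X_train_bal, y_train_bal)           # trained on the SMOTE-balanced split
y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_test, y_pred))          # harmonic mean of the two
```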



B. Evaluation of ML models

As mentioned in the preceding part, we evaluated seven different machine learning models, and each model was assessed using 30, 20, and 10 features. Among the models with 30 features, XGBoost provided the best accuracy (99.99996%). With 20 features, the RF model offered the best accuracy (99.99999%) with a precision of 1, and KNN reached an accuracy of 99.98%. With 10 features, XGBoost and RF had the same accuracy of 99.99%. The CNN model with 30 features produced an accuracy of 84.75%. XGBoost and RF were the best ML models in our study. The detection accuracies for the top six machine-learning models using 30 features are displayed in Table 1. We can save time and money by reducing the number of features as much as possible, for example, to only five. As shown in Fig. 8, if the firewall can identify attacks with only five features at a rate of over 80%, we can buy some time and prevent the server from going down. It takes a long time, and it is not reasonable, for the firewall to check 30 or 20 features to determine whether traffic is an attack or benign.

Next, we discuss the precision, which was evaluated for each machine-learning model. Precision is the frequency at which the machine-learning model predicts the correct response. The XGBoost model with a 30-feature set provided the highest level of precision (100%). However, the RF model obtained the highest precision with a 20-feature set (100%), whereas the KNN model achieved a precision of 99.99% and the CNN model a precision of 98.99% with 20-feature sets. Consequently, the Random Forest model with a 20-feature set had the best precision results.

ML Algorithms (30 features) | Accuracy | Precision | Recall | FN
XGBoost | 99.99996 | 1.00 | 1.00 | 3
AdaBoost | 99.97 | 99.99 | 99.97 | 1227
KNN | 99.73 | 99.97 | 99.75 | 119190
Logistic Regression | 80.63 | 99.97 | 80.63 | 1172949
RF | 99.9998 | 1.00 | 99.99 | 4
Naive Bayes | 90.08 | 99.96 | 90.11 | 598920

Table 1. Results of six ML algorithms with 30 features

C. Visual Explanation

I will begin by using an approach known as individual conditional expectation (ICE) plots. They are straightforward to use and demonstrate how the forecast varies as the feature values change. They are comparable to partial dependence graphs, but because ICE plots show one line per instance, they go one step further and illustrate heterogeneous effects. Figure 9 shows that there is a positive relationship between the 'Inbound' feature and our target. The thick red line is the partial dependence plot, which shows the change in the average prediction as we vary the 'Inbound' feature. Inbound is the first and most significant feature that has a positive effect on predicting attacks. This demonstrates that whenever attacks occur, they are highly predictable as attacks if Inbound <= 0.5 (Fig. 10); however, to be more accurate, we need to build a decision tree. To simplify, I train an exemplary decision tree and make predictions based on our random forest regressor. This tree is based on 30 million records, and we see that the first split is at the feature 'Inbound,' followed by 'Source Port' and 'Destination Port.' If you recall, these were the three most essential features picked by the random forest, heat map, and extra tree classifier. In addition, the p-value table calculated zero values for these three features.

V. CONCLUSION AND FUTURE WORK

Within the landscape of network security, DDoS attacks have emerged as a prominent adversary, especially within expansive networks. Astonishingly, a staggering 64 percent of these attacks have been found to impact the operational security of the very servers that are fundamental to the functioning of our most widely used systems [17]. The ubiquity of DDoS attacks in large networks has propelled us on a mission to harness the power of machine learning to predict and subsequently thwart these malevolent activities. By delving into the intricacies of the ML model, we've been able to gain predictive insights into potential DDoS attacks. Armed with the knowledge of pertinent features, we're embarking on a journey to not just predict but actively prevent these attacks from wreaking havoc. As the modern world continues to pivot towards a digital existence, our reliance on the internet for everyday tasks becomes increasingly pronounced. This heightened reliance necessitates robust defenses, most notably fortified firewall systems, to protect our networks against these relentless attacks. And herein lies the significance of our work: through the utilization of seven distinct ML models, we found that XGBoost emerges as the apex algorithm for this task. And it's not just about identifying the algorithm but also zeroing in on the key features that are indispensable for prediction.



Our findings reveal that a mere five features—namely
“Inbound,” “Destination Port,” “URG Flag Count,” “Source
Port,” and “Avg Bwd Segment Size”—hold the potential to
unlock the prevention of DDoS attacks. These features act as
beacons, guiding our efforts towards effectively safeguarding
our networks. However, we acknowledge that the landscape of
cyber threats is perpetually evolving. As we strive to identify
common traits of these attacks, the perpetrators themselves are
relentless in their quest for innovation. They continually refine
their techniques, such as exploiting arbitrary source ports,
which serves as a reminder that our battle against these threats
is an ongoing one. In this dynamic arena, the onus is on us to
keep refining and advancing our machine learning algorithms,
equipping them to learn and adapt in tandem with emerging
trends and tactics. Ultimately, our journey is fueled by a shared
commitment—to protect our networks and digital spaces from
the persistent threat of DDoS attacks. Through predictive
insights, fortified algorithms, and an unwavering pursuit of
knowledge, we are forging a path towards a safer and more
secure digital landscape.

FIGURE 8. Firewall and Log Request



FIGURE 9. Positive relationship between Inbound and Target

FIGURE 10. Surrogate model (in this case: decision tree)
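Plots like Figures 9 and 10 could be produced with standard tooling; the sketch below uses scikit-learn's PartialDependenceDisplay for the ICE and partial-dependence curves of the 'Inbound' feature and a shallow decision tree as the surrogate model. It assumes a random forest has already been fitted as rf on the training data from the earlier sketches.

```python
# Visual-explanation sketch: ICE/PDP for "Inbound" and a surrogate decision tree.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
from sklearn.tree import DecisionTreeClassifier, plot_tree

# ICE curves plus the averaged partial-dependence line (cf. Figure 9).
PartialDependenceDisplay.from_estimator(
    rf, X_train, features=["Inbound"], kind="both", subsample=200
)
plt.show()

# Surrogate model: a shallow tree fitted to the forest's own predictions,
# approximating its decision logic with a few interpretable splits (cf. Figure 10).
surrogate = DecisionTreeClassifier(max_depth=3, random_state=42)
surrogate.fit(X_train, rf.predict(X_train))
plot_tree(surrogate, feature_names=list(X_train.columns), filled=True)
plt.show()
```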

ACKNOWLEDGMENT

The implementation code for the suggested technique will be made available for study purposes upon email request to the primary author. Additionally, a portion of the code will be posted on the author's GitHub page for anyone who is interested.



REFERENCES

[1] J. Mirkovic and P. Reiher, "A taxonomy of DDoS attack and DDoS defense mechanisms," ACM SIGCOMM Computer Communication Review, vol. 34, no. 2, pp. 39-53, 2004.

[2] T. Su, H. Sun, J. Zhu, S. Wang, and Y. Li, "BAT: Deep learning methods on network intrusion detection using NSL-KDD dataset," IEEE Access, vol. 8, pp. 29575-29585, 2020.

[3] F5 Labs, "2022 Application Protection Report: DDoS Attack Trends," https://www.f5.com/labs/articles/threat-intelligence/2022-application-protection-report-ddos-attack-trends

[4] A. Srivastava, B. B. Gupta, A. Tyagi, A. Sharma, and A. Mishra, in Advances in Parallel Distributed Computing, vol. 203, 2011, ISBN 978-3-642-24036-2.

[5] M. Arshi, M. D. Nasreen, and K. Madhavi, "A survey of DDOS attacks using machine learning techniques," in E3S Web of Conferences, vol. 184, p. 01052, EDP Sciences, 2020.

[6] A. Bivens, C. Palagiri, R. Smith, B. Szymanski, M. Embrechts, et al., "Network-based intrusion detection using neural networks," Intelligent Engineering Systems through Artificial Neural Networks, vol. 12, no. 1, pp. 579-584, 2002.

[7] J. K. Bains, K. K. Kaki, and K. Sharma, "Intrusion detection system with multi-layer using Bayesian networks," International Journal of Computer Applications, vol. 67, no. 5, April 2013.

[8] M. Alkasassbeh, G. Al-Naymat, et al., "Detecting distributed denial of service attacks using data mining techniques," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 7, pp. 436-445, 2016.

[9] M. Salunke, R. Kabra, and A. Kumar, "Layered architecture for DoS attack detection system by combined approach of naive Bayes and improved K-means clustering algorithm," International Research Journal of Engineering and Technology (IRJET), vol. 2, no. 3, June 2015.

[10] T. Subbulakshmi et al., "A unified approach for detection and prevention of DDoS attacks using enhanced support vector machine and filtering mechanisms," ICTACT Journal on Communication Technology, June 2013.

[11] B. Yogeswara Reddy, J. Srinivas Rao, T. Suresh Kumar, and A. Nagarjuna, International Journal of Innovative Technology and Exploring Engineering, vol. 8, no. 11, pp. 1194-1198, 2019.

[12] H. Waguih, "A data mining approach for the detection of denial of service attack," International Journal of Artificial Intelligence, vol. 2, pp. 99-106, 2013.

[13] D. Md. Farid, N. Harbi, E. Bahri, M. Zahid ur Rahman, and C. Mofizur Rahman, "Attacks classification in adaptive intrusion detection using decision tree," International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 4, no. 3, 2010.

[14] A. E. Cil, K. Yildiz, and A. Buldu, "Detection of DDoS attacks with feed forward based deep neural network model," Expert Systems with Applications, vol. 169, p. 114520, 2021.

[15] Z. Ahmad, A. S. Khan, C. W. Shiang, J. Abdullah, and F. Ahmad, "Network intrusion detection system: A systematic study of machine learning and deep learning approaches," Transactions on Emerging Telecommunications Technologies, vol. 32, no. 1, p. e4150, Jan. 2021.

[16] Ismail et al., "A machine learning-based classification and prediction technique for DDoS attacks," IEEE Access, vol. 10, pp. 21443-21454, 2022, doi: 10.1109/ACCESS.2022.3152577.

[17] L. Alzubaidi, J. Zhang, A. J. Humaidi, et al., "Review of deep learning: concepts, CNN architectures, challenges, applications, future directions," Journal of Big Data, vol. 8, art. 53, 2021. https://doi.org/10.1186/s40537-021-00444-8

[18] Arbor Networks, "Worldwide ISP Security Report," Sept. 2005, pp. 1-23.



