Predict and Prevent DDoS Attacks Using Machine Learning and Statistical Algorithms
Azadeh Golduzian1
1 Department of Mathematics and Statistics, University of New Mexico, NM 87106 USA
Corresponding author: Azadeh Golduzian (agolduzian96@unm.edu).
This work was supported in part by the University of New Mexico, Albuquerque.
A malicious attempt to exhaust a victim's resources so that it crashes or halts its services is known as a distributed denial-of-service (DDoS) attack. DDoS attacks stop authorized users from accessing specific services available on the Internet. They target various layers of a network, and it is preferable to stop them at layer 4 (the transport layer) before they reach higher layers. This study uses several machine learning and statistical models to detect DDoS attacks from traces of traffic flow and suggests a method to prevent DDoS attacks. For this purpose, we used the logistic regression, CNN, XGBoost, naive Bayes, AdaBoostClassifier, KNN, and random forest ML algorithms. In addition, data preprocessing was performed using three methods to identify the most relevant features. This paper addresses the problem of improving DDoS attack detection accuracy using the latest dataset, CICDDoS2019, which contains over 50 million records. Because we employed an extensive dataset for this investigation, our findings are trustworthy and practical. Our target class (the attack class) was imbalanced; therefore, we used two techniques to deal with imbalanced data in machine learning. The XGBoost model provided the best detection accuracy (99.9999%) after applying the SMOTE approach to the target class, outperforming recently developed DDoS detection systems. To the best of our knowledge, no other study has worked on this most recent dataset with over 50 million records, applied a statistical technique to select the most significant features, achieved such high accuracy, and suggested ways to prevent DDoS attacks.
There are 11 CSV files in the CICDDoS2019 dataset and 11 attack types, namely TFTP, Syn, DrDoS_UDP, DrDoS_DNS, DrDoS_LDAP, DrDoS_SSDP, DrDoS_MSSQL, DrDoS_NetBIOS, UDP-lag, DrDoS_SNMP, and DrDoS_NTP.
To build a machine-learning model, we split the data into training and test datasets, first training the model on the training dataset and then testing it on the test dataset. Testing the model is key to understanding the performance of the suggested model.
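A minimal sketch of this step is shown below; the file name, the 'Label' column, and the 80/20 split ratio are illustrative assumptions rather than the exact setup used in the study.

```python
# Sketch: load one of the CICDDoS2019 CSV files and split it into training
# and test sets. The column name 'Label' and the 80/20 ratio are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("DrDoS_DNS.csv")            # one of the 11 dataset files
X = df.drop(columns=["Label"])               # feature columns
y = (df["Label"] != "BENIGN").astype(int)    # 1 = attack, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```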
C. Data preprocessing
Data preparation is a practical and effective data-mining technique used to format the raw data. The steps involved in data preprocessing were (1) data cleaning, (2) data transformation, and (3) data reduction.
Most machine learning models work only with numeric values, so in this step we converted non-numeric values into numbers. Feature selection was performed using a heatmap matrix, a tree classifier, and an additional logistic approach, which reduce the number of features used to train the model. Using more features can lead to low accuracy, whereas using fewer features can lead to a high false-positive rate; the number of features should therefore be balanced to obtain a model with high accuracy and a low false-positive rate. Attributes (columns) in the dataset that contained mostly zero values were removed because they negatively affected the models.
Furthermore, to refine data quality, we had to address rows containing missing or infinite values. Because some algorithms cannot handle such anomalies, and because they can degrade machine learning performance, we omitted these rows from our analysis.
As emphasized earlier, the CICDDoS2019 dataset contains 88 distinct features. Such a large number of features introduces complexity during both training and prediction, so it is necessary to select the features most relevant to the model's objectives. Working with the pertinent features rather than all features, irrespective of their relevance, offers several advantages: it speeds up model training, leaves more time for the prediction task, improves prediction accuracy, and reduces the risk of overfitting.
The following subsection describes the strategies employed to extract the essential features from the CICDDoS2019 dataset. An overview of these strategies is shown in Figure 4, which groups them into four categories. Filter methods are techniques that derive feature relevance independently of the chosen machine learning model.
1. Filtering
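The sketch below illustrates one way such a filter step could be combined with the tree-based ranking mentioned above; the correlation threshold, the use of ExtraTreesClassifier, and the variable names carried over from the previous sketch are assumptions for illustration.

```python
# Sketch: filter-style feature selection on the cleaned flow features.
# Assumes X_train, y_train from the split above; 0.9 is an illustrative threshold.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Data cleaning: keep numeric columns, drop rows with missing or infinite values.
X_clean = (X_train.select_dtypes(include=[np.number])
                  .replace([np.inf, -np.inf], np.nan)
                  .dropna())
y_clean = y_train.loc[X_clean.index]

# Filter: drop one feature from every highly correlated pair (heatmap criterion).
corr = X_clean.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X_clean.drop(columns=to_drop)

# Rank the remaining features with an extra-trees classifier.
forest = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=42)
forest.fit(X_filtered, y_clean)
ranking = sorted(zip(X_filtered.columns, forest.feature_importances_),
                 key=lambda t: t[1], reverse=True)
print(ranking[:10])  # ten most relevant features
```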
2. KNN
KNN is the abbreviation for k-nearest neighbors. It is a nonparametric algorithm based on a supervised learning technique that can be used to solve classification and regression problems. It stores all the existing data and classifies a new data point based on its similarity to those data. When new data are introduced, the algorithm determines their class by looking at the K nearest neighbors. The Manhattan, Minkowski, and Euclidean distance functions can be used to determine the distance between two data points; the Euclidean distance function was used in this study. The similarity between the samples to be classified and the samples in each class is measured as follows: when new data arrive, their distance to each point in the training set is calculated with the Euclidean function, and the classification set is then created by selecting the k training points with the smallest distances. The class assigned to the new data point is determined by these k nearest neighbors.
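A minimal sketch of this classifier, assuming the scikit-learn implementation and the training/test splits from the earlier sketches; k = 5 is an illustrative choice.

```python
# Sketch: k-nearest neighbors with Euclidean distance, as used in the study.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", n_jobs=-1)
knn.fit(X_train, y_train)
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```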
3. Random Forest
A random forest consists of several decision trees that are independently trained on random subsets of the labeled data. Random forest works well because many relatively independent trees perform better than any individual model.

4. XGBoost
XGBoost is a tree-based learning model with exceptional performance: it can be up to 100 times faster than comparable models. Its appeal lies not only in its speed but also in its combination of attributes. It is scalable, accommodating large volumes of data, and its efficiency is matched by a simplicity that makes it accessible to a wide range of users, from newcomers to seasoned data scientists. XGBoost is also reliable when handling large datasets, so the analysis does not degrade under sheer data volume. At the core of its operation lies probability, a crucial element in its decision-making process.
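Since the abstract reports that SMOTE was applied to the imbalanced attack class before training XGBoost, the sketch below shows one way the two could be combined; the imbalanced-learn API and the hyperparameter values are assumptions, not the study's exact configuration.

```python
# Sketch: oversample the minority class with SMOTE, then train XGBoost.
# Assumes X_train, X_test, y_train, y_test from the preprocessing step;
# hyperparameters are illustrative.
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)  # balance attack/benign classes

model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, n_jobs=-1)
model.fit(X_res, y_res)

y_pred = model.predict(X_test)
print("XGBoost accuracy:", accuracy_score(y_test, y_pred))
```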
7. AdaBoost
AdaBoost is an ensemble learning technique (statistical classification) that was initially developed to boost the performance of binary classifiers, an approach sometimes referred to as "meta-learning." AdaBoost uses an iterative process to improve weak classifiers by learning from their errors.
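A minimal sketch of this classifier using scikit-learn's AdaBoostClassifier; the number of estimators is an illustrative choice, and the training data are assumed to come from the earlier sketches.

```python
# Sketch: AdaBoost iteratively reweights training samples so that later weak
# learners focus on the errors made by earlier ones.
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```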
Precision and recall: metrics such as precision and recall allow us to assess how well a classification model predicts outcomes for a given class of interest, the "positive class." While recall measures the degree of error caused by false negatives (FNs), precision measures the degree of error caused by false positives (FPs).
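In terms of true positives (TP), false positives, and false negatives, the two metrics take the standard textbook form (not specific to this paper):

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
```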
A. Performance Evaluation
As mentioned in the preceding part, we evaluated seven different machine learning models, and each model was assessed using 30, 20, and ten features. Among the models with 30 features, XGBoost provided the best accuracy (99.99996%). With 20 features, the RF model offered the best accuracy (99.99999%) with a precision of 1, followed by KNN with an accuracy of 99.98%. With ten features, XGBoost and RF had the same accuracy of 99.99%. The CNN model with 30 features produced an accuracy of 84.75%. XGBoost and RF were the best ML models in our study. The detection accuracies for the top six machine-learning models using 30 features are displayed in Table 1. We can save time and money by reducing the number of features as much as possible, for example, to only five. As shown in Fig. 8, if the firewall can identify attacks with only five features at a rate above 80%, we can buy some time and prevent the server from going down. It is time-consuming and unreasonable for the firewall to check 30 or 20 features to determine whether traffic is an attack or benign.
Next, we discuss the precision, which was evaluated for each machine-learning model. Precision is the frequency at which the machine-learning model predicts the correct response. The XGBoost model achieved its highest precision (100%) with a 30-feature set, the RF model achieved its highest precision (100%) with a 20-feature set, the KNN model achieved its best precision (99.99%) with a 20-feature set, and the CNN model achieved its best precision (98.99%) with a 20-feature set. Consequently, the random forest model with a 20-feature set had the best precision results.

TABLE 1. Detection results for the top six machine-learning models using 30 features.

ML Algorithm (30 features)   Accuracy (%)   Precision   Recall   FN
XGBoost                      99.99996       1.00        1.00     3
AdaBoost                     99.97          99.99       99.97    1227
KNN                          99.73          99.97       99.75    119190
Logistic Regression          80.63          99.97       80.63    1172949
RF                           99.9998        1.00        99.99    4
Naive Bayes                  90.08          99.96       90.11    598920
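As an illustration of how the values in Table 1 could be computed, the sketch below derives accuracy, precision, recall, and the false-negative count from a fitted model's predictions; `model`, X_test, and y_test are assumed from the earlier sketches.

```python
# Sketch: compute the metrics reported in Table 1 for one fitted model.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("FN       :", fn)  # attacks missed by the model
```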
We begin by using an approach known as individual conditional expectation (ICE) plots. They are straightforward to use and demonstrate how the prediction varies as the feature values change. They are comparable to partial dependence plots, but because ICE plots show one line per instance, they go one step further and illustrate heterogeneous effects. Figure 9 shows that there is a positive relationship between the 'Inbound' feature and our target. The thick red line is the partial dependence plot, which shows the change in the average prediction as we vary the 'Inbound' feature. Inbound is the first and most significant feature affecting attack prediction. This demonstrates that attacks are highly predictable as attacks if Inbound <= 0.5 (Fig. 10); however, to be more accurate, we need to build a decision tree. To simplify, we train an exemplary decision tree and make predictions based on our random forest regressor. This tree is based on 30 million records, and we see that the first split is at the feature 'Inbound,' followed by 'Source Port' and 'Destination Port.' These were the three most essential features picked by the random forest, heat map, and extra tree classifier. In addition, the p-value table yielded zero values for these three features.
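A minimal sketch of such an ICE/partial-dependence plot using scikit-learn's inspection module; the fitted random forest `rf` and the 'Inbound' column in X_test are assumed from the earlier steps.

```python
# Sketch: ICE curves plus the averaged partial-dependence line for 'Inbound'.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    rf, X_test, features=["Inbound"],
    kind="both",        # one ICE line per instance plus the averaged PDP
    subsample=200,      # plot a subsample of instances for readability
    random_state=42)
plt.show()
```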
V. CONCLUSION AND FUTURE WORK
Within the landscape of network security, DDoS attacks have emerged as a prominent adversary, especially in large networks. A staggering 64 percent of these attacks have been found to affect the operational security of the very servers on which our most widely used systems depend [17]. The ubiquity of DDoS attacks in large networks motivated us to harness machine learning to predict and subsequently thwart these malicious activities. By examining the ML models in detail, we gained predictive insight into potential DDoS attacks, and armed with knowledge of the pertinent features, we can move from merely predicting these attacks to actively preventing them. As the modern world continues to shift toward a digital existence, our reliance on the Internet for everyday tasks grows increasingly pronounced, and this reliance demands robust defenses, most notably well-configured firewall systems, to protect our networks against these relentless attacks. Herein lies the significance of our work: using seven distinct ML models, we found that XGBoost is the best-performing algorithm for this task, and we identified the key features that are indispensable for prediction.
ACKNOWLEDGMENT