TelematiqueVol21Issue1 616
Abstract
IoT attacks have become very common in recent years, especially during the pandemic, when
most activity takes place online. These attacks involve theft of data and complete or partial
blocking of access to various devices, creating emergencies at various locations. Such attacks
and attackers can be found in various forms on the internet. The aim of this study is to
identify ‘IoT attacks’ and ‘DDoS attacks’ using three different datasets, namely BoT-IoT,
IoT-23, and the Canadian Institute for Cybersecurity Distributed Denial of Service 2019
(CIC-DDoS2019) dataset. The BoT-IoT and IoT-23 datasets are used in Experiments I and II to
identify IoT attacks. In Experiment I, the BoT-IoT dataset is used for training and the IoT-23
dataset for testing. Experiment II uses the two datasets in the reverse order. Experiment III
was conducted to identify DDoS attacks in the CIC-DDoS2019 dataset on two different days.
Training and testing were done in all experiments using two gradient boosting techniques,
namely Extreme Gradient Boosting (XGB) and Light Gradient Boosting Machine (LGBM), and their
performance was compared with that of the Cascaded Deep Forest (CDF). Feature extraction and
selection (FES) is done using two established methods: principal component analysis (PCA) and
analysis of variance (ANOVA). The accuracy achieved with the boosting methods is at least 16%
higher than that achieved with CDF. Boosting algorithms are at least 240 times faster than CDF.
Of the two boosting algorithms, LGBM has the lowest execution time; it runs in 54 seconds or
less and has an accuracy of up to 94.79%.
Index Terms: IoT Attacks, Cross dataset, Machine Learning (ML), LGBM, XGB, CDF
1. INTRODUCTION
The increase in online activity during the pandemic period has opened up enormous
opportunities for attackers [1]. Hackers capitalise on the vulnerabilities and security flaws
in IoT devices [2]. According to a report by the International Data Corporation (IDC), there
will be 55.7 billion connected devices, and 75% of them will be connected to IoT devices [3].
The number of cyberattacks will increase to 15.4 million by 2023, doubling the 7.9 million
attacks in 2018 [4].
Artificial intelligence (AI) is a buzzword nowadays and a widely accepted solution for
detecting attacks on the internet [5–6]. Machine learning (ML) classification algorithms [7–9]
have evolved into autonomous analytical tools that produce accurate scores based on extracted
features. Approaches to cyber-attack identification include non-AI-based methods, ML and deep
learning (DL).
TELEMATIQUE Volume 21 Issue 1, 2022
ISSN: 1856-4194 6982 – 7012
In this section, the authors review the literature and background information on ML
approaches to cyber-attack intrusion and anomaly detection. Various ML algorithms [10] were
applied to the BoT-IoT dataset to identify the traffic associated with attacks and anomalies
in an IoT network based on 44 selected features. Five ML algorithms, namely naive Bayes (NB),
Bayes net (BN), the C4.5 decision tree (DT), Random Tree and Random Forest (RF), were used for
accurate identification of malicious BoT-IoT traffic. The C4.5, Random Tree and RF algorithms
achieved 99.99% accuracy, while NB and BN achieved 99.79% and 99.77% accuracy, respectively.
To improve efficiency, they used the bijective soft set approach, which has been shown to be
very effective in decision-making and selection problems. They therefore concluded that the
NB algorithm is the most effective in detecting intrusions and anomalies in IoT networks. A
deep learning (DL) model [11] has been proposed to detect DDoS attacks on network traffic.
The proposed architecture was able to detect changes quickly and accurately, even with a
smaller sample size. This is due to the classification process, the feature extraction
technique and the layers that are updated during training. They used the CIC-DDoS2019 dataset
and converted it into two different formats to make it more effective for classifying and
detecting DDoS attacks with the deep neural network (DNN). The first dataset was labelled with
two types of traffic, indicating the presence or absence of DDoS attacks. The second dataset
covers the entire spectrum of DDoS attack types. It was found that the attacks were detected
with 99.97% and 99.99% accuracy and precision, and the attack types were classified with
80.49% and 94.57% precision and accuracy, respectively.
Pokhrel et al. [12] proposed a technique using an ML approach to detect and mitigate botnet
DDoS attacks on IoT networks in order to solve the security problems caused by bots. They used
the BoT-IoT dataset, which has 999,610 records. Of these, 994,828 are botnet traffic and the
rest are normal. As the dataset was not balanced, another dataset with an equal amount of
normal and botnet traffic was created using the SMOTE technique. Various ML models were trained
on the BoT-IoT dataset, namely K-nearest neighbour (KNN), NB and artificial neural network
(ANN). The KNN algorithm achieved 92.1% accuracy and a ROC_AUC of 92.2% for the dataset
created using SMOTE, and 99.6% accuracy and 99.2% ROC_AUC for the real-time dataset. It was
found that KNN was the best algorithm for detecting cyberattacks in the BoT-IoT dataset. Hasan
et al. [13] designed a deep convolutional (DC) architecture for detecting DDoS attacks on
optical burst switching (OBS) networks. The DC neural network (DCNN) approach proves to be
very promising when the dataset is small, as general ML algorithms cannot effectively perform
traffic analysis in that case. The DCNN was compared with the ML algorithms support vector
machine (SVM), NB and KNN. It was found that the DCNN model achieved 99% accuracy, while NB,
SVM and KNN performed worse, with 79%, 88% and 93% accuracy, respectively. Their study
therefore concluded that the DCNN model is more promising than the traditional ML algorithms.
Priyadarshini et al. [14] proposed a long short-term memory (LSTM) model to detect the
anomalous characteristics of DDoS attacks at the transport/network level. The model secures
cloud computing and fog computing environments. The LSTM model is most efficient on time-based
sequential data and is therefore suitable for training on samples of network traffic packets
recorded at specific time intervals. LSTM has the ability to retain past and future knowledge
to influence the current packet. They used the IDS CTU-13 botnet and ISCX 2012 datasets. The
experiment was conducted with different numbers of hidden layers, units and dropouts for the
LSTM model. It was found that the model with 128 units and three hidden layers was the best
model, with 98.88% accuracy.
The behaviour of attacks is variable by nature. New attacks that take place in the future may
differ from previous ones. Therefore, it is not enough to train and test a model on the same
attack data. To address this problem, a cross-dataset test of the trained models was performed
in the current work. Two gradient boosting algorithms are used to identify attacks, and they
are trained and tested with different datasets. The entire workflow of the current work is
shown in Fig. 2.
2. EXPERIMENTS
2.1 Data description
In this paper, three datasets, BoT-IoT [10, 27, 40, 41], IoT-23 [27-28, 60] and
CIC-DDoS-2019 [30, 31, 42], are used for attack detection. The first two datasets are based on
attacks on IoT devices, and the third dataset consists of 12 different DDoS attacks triggered
on two different days. The three datasets are described below.
2.1.1 BoT-IoT
A realistic network environment was created by designing a cyber range lab at UNSW
Canberra for the preparation of the BoT-IoT dataset [43]. The reliability of the BoT-IoT dataset
was evaluated [10, 27, 40, 41] through different machine learning and statistical methods for
various applications and was also compared with existing datasets. The dataset provides a
combination of normal and botnet traffic across an IoT-specific network. It consists of files
in different formats: the pcap files hold more than 72,000,000 records with a size of 69.3 GB,
while the flow traffic csv files have a size of 16.7 GB. The dataset includes different attacks,
such as DDoS, DoS, OS and Service Scan, Keylogging and Data exfiltration. To handle such a large
volume of data, 5% of the original data, comprising 3 million records with a size of 1.07 GB,
was extracted through MySQL queries.
2.1.2 IoT-23
The IoT-23 dataset was captured in the Stratosphere Laboratory by the AIC Group, FEL, CTU
University in the Czech Republic. It is a large dataset [28, 44] with real and labelled
network traffic from IoT devices. The aim of preparing this dataset is to provide a new
repository of IoT malware for the application of ML algorithms. The IoT-23 dataset consists of
20 malware captures running on IoT devices and three captures of benign IoT device traffic.
Many researchers [45-47] have used the dataset in their experiments on the identification of
IoT attacks.
2.1.3 CIC-DDoS-2019
CIC-DDoS-2019 [30-32, 42] is a new real-world dataset consisting of various types of
attacks with 50,063,112 records [48]. Of these, 56,863 records represent benign traffic and
50,006,249 records represent DDoS attacks. Each of these rows contains 86 features to identify
the respective attacks. In preparing the dataset, realistic background traffic was generated
using the B-profile system [49] to mimic the abstract behaviour of human interactions in the
proposed testbed. The attacks included in the CIC-DDoS-2019 dataset consist of DDoS attacks
carried out on two days. The first day is called the ‘test day’, and the second day is called
the ‘training day’. There was no connection between the experiments on the two days. The
different attacks on the two days are shown in Fig. 3.
(a) Attack types present on ‘test day’ with their number of instances in each CSV:
LDAP (1905191), MSSQL (5763061), NetBIOS (3454578), PortMap (186960), SYN Flood (4284751),
UDP-Flood (3754680), UDP-Lag (1873)
(b) Attack types present on ‘training day’ with their number of instances in each CSV:
DNS (5074413), LDAP (2181542), MSSQL (1844905), NetBIOS (4094986), NTP (1217007),
SNMP (5161377), SSDP (2611374), SYN Flood (1582681), TFTP (1048574), UDP Flood (1470742),
UDP-Lag (370607)
2.2.1 PCA
2.2.2 ANOVA
The term ‘analysis of variance’ (ANOVA) refers to the process of comparing two or more
variables [54-55]. As the name suggests, it compares several independent groups using variance
as a metric. There are two types of ANOVA: one-way and two-way. When there are three or more
independent groups of a variable, a one-way ANOVA is used [36-37]. Fig. 5 shows the two
distributions and their behaviour.
From Fig. 5 it can be inferred that if the distributions are close to each other or overlap, the
overall mean and the individual means are comparable. However, when the distributions are
far apart, the overall mean and the individual means differ by a greater distance. Since the
values in each group differ, this indicates differences between them. Therefore, ANOVA was
used in the current study to examine variability between different groups and within the
same group. The significance of the difference between groups is measured by the F-ratio, i.e.
the ratio of the between-group variance to the within-group variance. The F-ratio is close to
one if there is no significant difference between the groups and all variances are identical.
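The F-ratio described above can be illustrated with a minimal sketch. It assumes that each
feature is scored by a one-way ANOVA across the class labels and that features are then ranked
by that score; the function names one_way_f_ratio and rank_features_by_f and the toy data are
hypothetical and are not taken from the study.

```python
from statistics import mean

def one_way_f_ratio(groups):
    """One-way ANOVA F-ratio for a list of groups of numeric values."""
    k = len(groups)                               # number of groups (classes)
    n = sum(len(g) for g in groups)               # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of the values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:                            # perfectly separated or constant data
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_features_by_f(columns, labels):
    """Rank feature names by F-ratio, highest (most discriminative) first."""
    scores = {}
    for name, values in columns.items():
        groups = {}
        for value, label in zip(values, labels):
            groups.setdefault(label, []).append(value)
        scores[name] = one_way_f_ratio(list(groups.values()))
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (assumption): "dur" separates the two classes, "noise" barely does.
columns = {
    "dur": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "noise": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
}
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features_by_f(columns, labels)
```

Keeping the first k names of the ranking would correspond to selecting k features, as in the
5 to 35 feature settings reported in the results tables.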
2.3 Classification
The notion of manual work is changing in a world where virtually all manual processes have
been automated. ML algorithms enable computers to play chess, perform operations and become
more intelligent and more personable. Many security techniques have been used to prevent
cyberattacks [56] and to identify intruders through intrusion detection systems (IDS). In this
study, two boosting algorithms, XGB and LGBM, and one non-boosting technique, the CDF
algorithm, are used to identify intrusions in cyberattacks [20, 57, 58]. Boosting is an
ensemble strategy used to improve the predictive performance of any learning algorithm. The
goal of boosting is to train weak learners sequentially, with each learner correcting the
errors of the previous one. Extensions aimed at computational efficiency have recently made
boosting approaches fast enough for widespread use. Boosting methods have become the preferred
and often the best-performing strategy in ML contests for classification and regression on
tabular data.
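The idea that each boosting round corrects the previous one can be sketched as follows. This
is a minimal illustration for squared-error loss on a single numeric feature, with regression
stumps as weak learners; it is not the XGB or LGBM implementation, and the names fit_stump,
boost, n_rounds and learning_rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump (assumes at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:        # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive model: start from the mean, then repeatedly fit the residuals."""
    base = sum(ys) / len(ys)
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # Residuals are what the ensemble built so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        # Each new weak learner nudges the predictions towards the targets.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in learners)

    return predict

# Toy data (assumption): label 0 for small x, label 1 for large x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```

The small learning rate means that no single weak learner dominates; the quality of the
ensemble comes from the accumulation of many small corrections.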
previous research suggests that it should only be used for data with 10,000 rows or more [79].
LightGBM uses histogram-based methods that divide continuous feature values (attributes)
into discrete bins [61-62]. This reduces memory requirements and speeds up training.
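The effect of histogram binning can be illustrated with a short sketch. It assumes equal-width
bins for simplicity, which is not necessarily how LightGBM chooses its bin boundaries; the
function names and the toy values are hypothetical.

```python
from bisect import bisect_right

def make_bin_edges(values, n_bins):
    """Interior boundaries of n_bins equal-width bins (a simplifying assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def to_bins(values, edges):
    """Replace each continuous value by the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

def gradient_histogram(bins, gradients, n_bins):
    """Accumulate per-sample gradients into one total per bin."""
    hist = [0.0] * n_bins
    for b, g in zip(bins, gradients):
        hist[b] += g
    return hist

values = [0.0, 0.1, 0.4, 0.5, 0.9, 1.0]          # toy feature column
edges = make_bin_edges(values, n_bins=4)
bins = to_bins(values, edges)
hist = gradient_histogram(bins, [1.0] * len(values), n_bins=4)
```

After binning, a split search only has to scan the boundaries between bins instead of every
distinct feature value, and the bin indices can be stored as small integers, which is where the
savings in time and memory come from.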
3.1 Pre-processing
All attack files present in the various datasets are pre-processed in the following steps.
1. Removal of attributes with object instances (e.g. pkSeqID, daddr, subcategory, etc.) (see
supplementary material, Tables S1–S3, for details).
2. Encoding of the labels: in Experiments I and II, normal traffic is encoded as 0 and all
attacks as 1. In Experiment III, benign traffic is encoded as 0 and the five attacks LDAP,
MSSQL, NetBIOS, SYN-Flood and UDP-Flood are encoded as 1–5, respectively (see supplementary
material, Tables S4–S6, for details).
3. Removal of columns with standard deviation = 0 and correlation = 0.
4. Normalisation of the independent variables x using equation 1:
$$x' = \frac{x - \mu(x)}{\sigma(x)} \tag{1}$$
5. Verification of standard deviation and correlation.
6. Creation of the train and test sets. The instances of the train and test sets are taken
from different datasets, or from different days in the case of CIC-DDoS-2019 (see
supplementary material, Tables S4–S6, for details).
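Steps 1 to 4 of the pre-processing list can be sketched as follows. This is a simplified
illustration rather than the authors' pipeline: it assumes the data is held as a dictionary of
columns, uses the population standard deviation for equation 1, and leaves out the correlation
check of step 3. The helper names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def is_numeric(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def preprocess(columns):
    """columns: dict mapping attribute name -> list of raw values."""
    cleaned = {}
    for name, values in columns.items():
        # Step 1: drop attributes that hold object (non-numeric) instances.
        if not all(is_numeric(v) for v in values):
            continue
        sigma = pstdev(values)
        # Step 3 (partly): drop constant attributes, i.e. standard deviation = 0.
        if sigma == 0:
            continue
        mu = mean(values)
        # Step 4: z-score normalisation as in equation 1.
        cleaned[name] = [(v - mu) / sigma for v in values]
    return cleaned

def encode_binary(labels, normal_label):
    """Step 2 for Experiments I and II: normal traffic -> 0, any attack -> 1."""
    return [0 if label == normal_label else 1 for label in labels]

# Toy data (assumption): one object column, one constant column, one useful column.
columns = {
    "daddr": ["192.168.0.1", "192.168.0.2", "192.168.0.3"],
    "const": [1, 1, 1],
    "dur": [1.0, 2.0, 3.0],
}
features = preprocess(columns)
labels = encode_binary(["Normal", "DDoS", "DoS"], normal_label="Normal")
```

In a cross-dataset setting it matters which statistics are used for the test set; the text does
not say whether the mean and standard deviation of the training data were reused, so this sketch
normalises each table on its own.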
The results obtained in the three experiments are evaluated in terms of time, accuracy and
number of features required to achieve the highest accuracy. The formula used to calculate
accuracy is given in equation 2.
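$$\text{Accuracy} = \frac{TP + TN}{\text{Total instances}} \tag{2}$$
where TP and TN are the numbers of true positives and true negatives, respectively.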
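As a small worked example of equation 2 for the binary case, the following sketch counts true
positives and true negatives directly. The function name binary_accuracy and the label vectors
are hypothetical; label 1 stands for malicious and 0 for non-malicious traffic.

```python
def binary_accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / total instances, for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
acc = binary_accuracy(y_true, y_pred)   # TP = 2, TN = 1, total = 5
```

For the multiclass case of Experiment III, the numerator would generalise to the number of
instances whose predicted class equals the true class.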
In this experiment, the BoT-IoT dataset is taken as training data with 718045 instances and
the IoT-23 dataset is taken as the test set with 147662 instances. The BoT-IoT instances were
retrieved from All features - Files - CloudStor (aarnet.edu.au), which contains 5% of the
entire dataset, labelled as normal or by type of attack. All types of attacks are considered
malicious data, and all normal instances are considered non-malicious data. Four traffic
classes, Benign, Okiru, PartOfAHorizontalPortScan and DDoS, are taken from the IoT-23 dataset
in pcap format as test samples. The pcap files were converted to csv using the CICFlowMeter
software to make the attributes compatible in both datasets. In the converted csv files, benign
instances are treated as non-malicious samples, and all attack instances are treated as
malicious. Hence, binary classification is done here. The results obtained in this experiment
are shown in Table 1.
Table 1 shows that the highest accuracy is achieved with five features, in the shortest time,
for both the boosting and the non-boosting algorithms. However, PCA feature selection works
better for the boosting algorithms, while ANOVA achieves higher accuracy for the non-boosting
algorithm. Comparing the results by time, the non-boosting algorithm takes the most time. The
same holds for accuracy: the accuracy of the non-boosting algorithm was lower than that of the
boosting algorithms.
Table 1: Results of boosting and non-boosting models for BoT-IoT vs IoT-23 in terms of
accuracy and time
Feature   No. of     XGB (boosting)        LGBM (boosting)       CDF (non-boosting)
method    features   Accuracy   Time       Accuracy   Time       Accuracy   Time
                     (%)        (min)      (%)        (min)      (%)        (min)
PCA       5          88.54      2.39       90.54      0.05       71.96      54.47
          10         83.74      2.51       83.89      0.08       66.87      55.45
          15         83.14      3.14       82.98      0.11       66.93      55.39
          20         81.51      3.27       82.71      0.14       66.87      55.58
          25         81.64      3.39       82.96      0.15       66.12      56.07
          30         82.14      3.47       83.11      0.18       66.21      56.37
          35         81.64      3.59       83.37      0.21       65.74      56.49
ANOVA     5          84.67      1.42       86.87      0.09       79.84      56.37
          10         77.44      1.55       78.74      0.10       70.77      58.59
          15         77.87      1.57       78.76      0.11       70.58      57.57
          20         77.69      2.01       78.64      0.12       71.27      57.32
          25         77.74      2.27       78.77      0.12       71.34      57.49
Table 2: Results of boosting and non-boosting models for IoT-23 vs BoT-IoT in terms of
accuracy and time
Feature   No. of     XGB (boosting)        LGBM (boosting)       CDF (non-boosting)
method    features   Accuracy   Time       Accuracy   Time       Accuracy   Time
                     (%)        (min)      (%)        (min)      (%)        (min)
PCA       5          85.25      2.39       87.79      0.05       70.37      56.28
          10         79.88      1.24       81.09      0.07       62.69      57.42
          15         79.74      1.39       80.96      0.08       62.43      58.04
          20         79.63      1.42       80.71      0.11       62.17      58.47
          25         79.47      1.51       80.53      0.12       62.09      58.52
          30         80.03      1.57       81.24      0.15       63.66      59.31
          35         79.21      2.06       80.46      0.18       62.35      59.77
ANOVA     5          81.96      1.42       83.45      0.03       79.84      56.37
          10         74.67      1.44       75.24      0.04       69.45      56.57
          15         74.43      1.52       75.18      0.05       69.38      57.12
          20         74.37      2.03       75.49      0.08       68.23      57.44
          25         74.19      2.11       75.32      0.11       68.18      58.29
          30         75.02      2.15       76.01      0.12       69.17      58.32
          35         74.18      2.23       75.67      0.15       68.32      59.01
Table 2 shows that the results obtained in Experiment II are consistent with those of
Experiment I. PCA feature selection works better with the boosting algorithms, while ANOVA
achieves higher accuracy with the non-boosting algorithm. Comparing the results by time, the
non-boosting algorithm takes the most time. The same holds for accuracy: the accuracy of the
non-boosting algorithm was lower than that of the boosting algorithms.
From Table 4, the highest accuracy was obtained for 35 features. The feature selection
method PCA gives higher accuracy with the boosting algorithms, while ANOVA gives higher
accuracy with the non-boosting method. The accuracy obtained with the non-boosting method
was at least 9% lower than the accuracy obtained with the boosting algorithms. Additionally,
the execution times of the boosting algorithms were at least 46 minutes lower than those of
the non-boosting method.
Table 4: Results of boosting and non-boosting models for day2 vs day1 in terms of
accuracy and time
Feature   No. of     XGB (boosting)        LGBM (boosting)       CDF (non-boosting)
method    features   Accuracy   Time       Accuracy   Time       Accuracy   Time
                     (%)        (min)      (%)        (min)      (%)        (min)
PCA       5          90.86      2.37       91.79      0.29       80.92      48.27
          10         91.67      2.54       92.45      0.36       81.77      48.44
          15         92.09      3.10       92.84      0.39       82.16      49.07
          20         92.74      3.24       93.55      0.41       82.89      49.36
          25         93.13      3.36       93.97      0.45       83.03      49.41
          30         93.78      3.49       94.33      0.47       83.14      49.45
          35         94.49      4.05       94.79      0.51       83.76      50.02
ANOVA     5          90.61      2.30       91.84      0.34       80.11      48.27
          10         91.47      2.45       92.47      0.37       81.26      48.41
          15         92.22      3.07       92.51      0.41       82.71      49.09
          20         92.82      3.34       93.13      0.45       83.84      49.32
          25         93.07      3.46       93.42      0.47       84.05      49.46
          30         93.18      3.51       93.97      0.49       84.48      50.07
          35         93.86      4.08       94.29      0.54       85.41      50.39
Several researchers have applied different ML algorithms to identify different types of
attacks [4, 67]. Attacks on the internet can take different forms depending on the attack
target, the severity of the attack, the network type and legal aspects [68]. Of the different
types of attacks, ‘IoT attacks’ and ‘DDoS attacks’ are identified in this paper. Two gradient
boosting methods and one non-boosting ensemble ML method were used to identify both types of
attacks. The results from the three experiments show that the gradient boosting methods
perform better than the non-boosting method. The boosting methods are also faster than the
non-boosting method in both binary and multiclass classification. Figs. 6 and 7 show the
average accuracies and execution times achieved in the three experiments conducted here.
[Fig. 6: Average accuracies (%) of the models for 5 to 35 features selected with PCA and ANOVA]
[Fig. 7: Average execution time of the models for 5 to 35 features selected with PCA and ANOVA]
Fig. 6 shows that the average accuracy of the boosting algorithms is between 84% and 90%,
while it is less than 80% for the non-boosting algorithm. However, the average accuracy of both
methods is higher for five features, which is evident from the peaks at the beginning and middle
of the graph. Comparing the average accuracy of the two boosting methods, the values obtained
by LGBM were higher than those of XGB.
Fig. 7 shows that the average execution time of the two boosting algorithms is significantly
lower (by about 50 minutes) than that of the non-boosting algorithm. The maximum execution
time of CDF is about 53–56 minutes for 5, 10, 15, ..., 35 features. The average execution time
of LGBM is between 0.13 and 0.30 minutes for the different numbers of features selected. The
average execution time of XGB lies between those of CDF and LGBM.
As in the current work, various ML algorithms have been used to identify attacks; these were
summarised by Hazi and Ameen in 2021 [69]. ML has been applied in a collaborative and
decentralised manner called federated learning, as described in [70]. IoT botnet attacks were
identified in [71–72] using ML and rule-based fuzzy learning. Shafiq et al. (2020) proposed a
wrapper-based feature selection method for malicious IoT traffic [73].
Traditional ML was applied to the IoT-23 dataset in [74–75]. In [76], 1D, 2D, and 2D
deep learning methods were applied to the BoT-IoT and IoT-23 datasets. A hybrid approach,
CyDDoS [77], was proposed for intrusion detection systems; it combines an ensemble of
feature engineering algorithms with DL. The proposed method was applied to the CIC-DDoS-2019
dataset and its performance was tested in CPU and GPU environments.
XGB [78] was used as a feature selection tool by Poornima et al. (2022) along with LSTM.
Of the two boosting methods used in the present work, LGBM was found to be an efficient and
suitable method for intrusion detection [79–83]. However, none of these researchers used LGBM
for cross-dataset intrusion detection. A cross-dataset attack identification study was
presented in [84] for the IoTID20 and BoT-IoT datasets, but only the basic ML algorithms DT,
NB, kNN, logistic regression (LR) and RF were used for binary classification. No time
comparison analysis was done there, which is important in the present study, as boosting
techniques are 240 times faster than non-boosting techniques (from Fig. 7).
As an advance in research on intruder detection, ML was used in [85], which confirms the
validity of the current research in the present scenario. As every research study has some
limitations, the current research is not validated on real data, as the infrastructure is not
available. Designing and testing adversarial cases could be an extension of the current
research in the future.
The above discussion shows that identifying IoT attacks with boosting methods is an important
research topic. To achieve this objective, two types of attacks are identified: IoT attacks
and DDoS attacks, as binary-class and multiclass output, respectively. Three widely used
datasets are used to identify these attacks, namely BoT-IoT, IoT-23 and CIC-DDoS-2019. A
boosting and a non-boosting approach were used to identify the attacks. The boosting approach
was found to be suitable for identifying attacks. Of the two boosting methods, LGBM is the
most efficient, with an accuracy of 94.79% in 0.51 minutes (Table 4), with PCA as the FES
method for DDoS attacks.
3. REFERENCE
[1] A. Arampatzis, L. O'Hagan, Cybersecurity and Privacy in the Age of the Pandemic, In
Handbook of Research on Cyberchondria, Health Literacy, and the Role of Media in So-
ciety’s Perception of Medical Information (2022) :pp. 35-53, IGI Global.
[2] A. E. Omolara, A. Alabdulatif, O. I. Abiodun, M. Alawida, A. Alabdulatif, and H. Ar-
shad, The internet of things security: A survey encompassing unexplored areas and new
insights, Computers & Security 112 (2022): 102494.
[3] https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
[4] Y. Miao, C. Chen, L. Pan, Q.-L. Han, J. Zhang, and Y. Xiang, Machine Learning–based
Cyber Attacks Targeting on Controlled Information: A Survey, ACM Computing Sur-
veys (CSUR) 54, no. 7 (2021): 1-36.
[5] C. Iwendi, et al. "Sustainable security for the internet of things using artificial intelli-
gence architectures." ACM Transactions on Internet Technology (TOIT) 21.3 (2021): 1-
22.
[6] S. Dilek, H. Çakır, and M. Aydın, Applications of artificial intelligence techniques to
combating cyber crimes: A review, arXiv preprint arXiv:1502.03552 (2015).
[7] A. Delplace, S. Hermoso, and K. Anandita, Cyber Attack Detection thanks to
Machine Learning Algorithms, arXiv preprint arXiv:2001.06309 (2020).
[8] A. A. AlZubi, M. Al-Maitah, and A. Alarifi, Cyber-attack detection in healthcare using
cyber-physical system and machine learning techniques, Soft Computing 25, no. 18
(2021): 12319-12332.
[9] A. Handa, A. Sharma, and S. K. Shukla, Machine learning in cybersecurity: A review,
Data Mining and Knowledge Discovery 9, no. 4 (2019): e1306.
[10] M. Shafiq, Z. Tian, Y. Sun, X. Du, and M. Guizani, Selection of effective machine learning
algorithm and Bot-IoT attacks traffic identification for internet of things in smart city,
Future Generation Computer Systems 107 (2020): 433-442.
[11] A. E. Cil, K. Yildiz, and A. Buldu, Detection of DDoS attacks with feed forward based
deep neural network model, Expert Systems with Applications 169 (2021): 114520.
[12] S. Pokhrel, R. Abbas, and B. Aryal, IoT Security: Botnet detection in IoT using Machine
learning, arXiv preprint arXiv:2104.02231 (2021).
[13] M. Z. Hasan, K. Z. Hasan, and A. Sattar, Burst header packet flood detection in optical
burst switching network using deep learning model, Procedia computer science 143
(2018): 970-977.
[14] R. Priyadarshini and R. K. Barik, A deep learning based intelligent framework
to mitigate DDoS attack in fog environment, Journal of King Saud University-Computer
and Information Sciences (2019).
[15]N. Islam et al., "Towards machine learning based intrusion detection in IoT networks," vol.
69, pp. 1801-1821, 2021.
[16] E. Papageorgiou, A Predictive Model for Customer Satisfaction, (2021).
[17] M. H. L. Louk, and B. A. Tama, Exploring Ensemble-Based Class Imbalance Learners for
Intrusion Detection in Industrial Control Networks, Big Data and Cognitive Computing 5,
no. 4 (2021): 72.
[18] S. Das, S. Bose, G. K. Nayak, S. C. Satapathy, and S. Saxena, Brain tumor segmentation
and overall survival period prediction in glioblastoma multiforme using radiomic features,
Concurrency and Computation: Practice and Experience (2021): e6501.
[19] F. Alzamzami, M. Hoda, and A. Saddik, Light gradient boosting machine for general sen-
timent classification on short texts: a comparative evaluation, IEEE Access 8 (2020):
101840-101858.
[20] M. Massaoudi, H. Abu-Rub, S. S. Refaat, I. Chihi, and F. S. Oueslati, An effective ensem-
ble learning approach-based grid stability assessment and classification, in 2021 IEEE
Kansas Power and Energy Conference (KPEC), 2021, pp. 1-6: IEEE.
[21] N. Koroniotis, N. Moustafa, E. Sitnikova, and B. Turnbull, Towards the development of
realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot
dataset, Future Generation Computer Systems 100 (2019): 779-796
Supplementary Material
Correlation between attributes in different datasets pre-processing step 5:
7. ltime ltime
8. seq seq
9. dur dur
10. mean mean
11. stddev stddev
12. sum sum
13. min min
14. max max
15. spkts spkts
16. dpkts dpkts
17. sbytes sbytes
18. dbytes dbytes
19. rate rate
20. srate srate
21. drate drate
22. TnBPSrcIP TnBPSrcIP
23. TnBPDstIP TnBPDstIP
24. TnP_PSrcIP TnP_PSrcIP
25. TnP_PDstIP TnP_PDstIP
26. TnP_PerProto TnP_PerProto
27. TnP_Per_Dport TnP_Per_Dport
28. AR_P_Proto_P_SrcIP AR_P_Proto_P_SrcIP
29. AR_P_Proto_P_DstIP AR_P_Proto_P_DstIP
30. N_IN_Conn_P_DstIP N_IN_Conn_P_DstIP
31. N_IN_Conn_P_SrcIP N_IN_Conn_P_SrcIP
32. AR_P_Proto_P_Sport AR_P_Proto_P_Sport
33. AR_P_Proto_P_Dport AR_P_Proto_P_Dport
34. Pkts_P_State_P_Protocol_P_DestIP Pkts_P_State_P_Protocol_P_DestIP
35. Pkts_P_State_P_Protocol_P_SrcIP Pkts_P_State_P_Protocol_P_SrcIP
36. attack attack
37. pkSeqID pkSeqID
38. proto proto
39. saddr saddr
40. sport sport
41. daddr daddr
42. dport dport
43. state state
44. subcategory subcategory
45. category category
46. flgs flgs
Table S2 : IoT-23
Table S3 : CIC-DDoS-2019
S.No.  Attributes present in         Attributes selected after     Attributes deleted
       dataset                       pre-processing
1.     Flow ID                       Flow Duration                 Flow Packets/s
2.     Source IP                     Total Fwd Packets             Flow Bytes/s
3.     Source Port                   Total Backward Packets        Flow ID
4.     Destination IP                Total Length of Fwd Packets   Source IP
5.     Destination Port              Total Length of Bwd Packets   Destination Port
6.     Protocol                      Fwd Packet Length Max         Source Port
7.     Timestamp                     Fwd Packet Length Min         Destination IP
8.     Flow Duration                 Fwd Packet Length Mean        Timestamp
9.     Total Fwd Packets             Fwd Packet Length Std         Protocol
10.    Total Backward Packets        Bwd Packet Length Max         Bwd PSH Flags
11.    Total Length of Fwd Packets   Bwd Packet Length Min         Fwd URG Flags
12.    Total Length of Bwd Packets   Bwd Packet Length Mean        Bwd URG Flags
13.    Fwd Packet Length Max         Bwd Packet Length Std         FIN Flag Count
14.    Fwd Packet Length Min         Flow IAT Mean                 PSH Flag Count
15.    Fwd Packet Length Mean        Flow IAT Std                  ECE Flag Count
16.    Fwd Packet Length Std         Flow IAT Max                  Fwd Avg Bytes/Bulk
17.    Bwd Packet Length Max         Flow IAT Min                  Fwd Avg Packets/Bulk
18.    Bwd Packet Length Min         Fwd IAT Total                 Fwd Avg Bulk Rate
19.    Bwd Packet Length Mean        Fwd IAT Mean                  Bwd Avg Bytes/Bulk
20.    Bwd Packet Length Std         Fwd IAT Std                   Bwd Avg Packets/Bulk
21.    Flow Bytes/s                  Fwd IAT Max                   Bwd Avg Bulk Rate
22.    Flow Packets/s                Fwd IAT Min                   SimillarHTTP
23.    Flow IAT Mean                 Bwd IAT Total
24.    Flow IAT Std                  Bwd IAT Mean
25.    Flow IAT Max                  Bwd IAT Std
26.    Flow IAT Min                  Bwd IAT Max
27.    Fwd IAT Total                 Bwd IAT Min
28.    Fwd IAT Mean                  Fwd PSH Flags
29.    Fwd IAT Std                   Fwd Header Length
30.    Fwd IAT Max                   Bwd Header Length
31.    Fwd IAT Min                   Fwd Packets/s
32.    Bwd IAT Total                 Bwd Packets/s
33.    Bwd IAT Mean                  Min Packet Length
34.    Bwd IAT Std                   Max Packet Length
35.    Bwd IAT Max                   Packet Length Mean
36.    Bwd IAT Min                   Packet Length Std
37.    Fwd PSH Flags                 Packet Length Variance
38.    Bwd PSH Flags                 SYN Flag Count
39.    Fwd URG Flags                 RST Flag Count
40.    Bwd URG Flags                 ACK Flag Count
41.    Fwd Header Length             URG Flag Count
42.    Bwd Header Length             CWE Flag Count
43.    Fwd Packets/s                 Down/Up Ratio
44.    Bwd Packets/s                 Average Packet Size
45.    Min Packet Length             Avg Fwd Segment Size
46.    Max Packet Length             Avg Bwd Segment Size
47.    Packet Length Mean            Fwd Header Length.1
48.    Packet Length Std             Subflow Fwd Packets
49.    Packet Length Variance        Subflow Fwd Bytes
50.    FIN Flag Count                Subflow Bwd Packets
51.    SYN Flag Count                Subflow Bwd Bytes
52.    RST Flag Count                Init_Win_bytes_forward
53.    PSH Flag Count                Init_Win_bytes_backward
85. SimillarHTTP
86. Inbound
87. Label
S2. Pre-processing Step 6: Divide in train test set
Table S6: Experiment III, CIC-DDoS-2019 train vs test (train:test ratio 70:30)
Set        Attack (label)   Instances before   Instances after   Method of resampling
                            resampling         resampling
Train day  LDAP (1)         2181542            217993            Every 10th instance was taken
           MSSQL (2)        1844905            205568            Every 11th instance was taken
           NetBIOS (3)      4094986            204664            Every 20th instance was taken
           SYN Flood (4)    1582681            226042            Every 14th instance was taken
           UDP Flood (5)    1470742            208977            Every 14th instance was taken
Test day   LDAP (1)         1905191            63507             Every 30th instance was taken
           MSSQL (2)        5763061            57631             Every 100th instance was taken
           NetBIOS (3)      3454578            57577             Every 60th instance was taken
           SYN Flood (4)    4284751            71413             Every 60th instance was taken
           UDP Flood (5)                       62578
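The resampling rule in Table S6 ("every k-th instance was taken") can be sketched with a plain
slice. This is an assumption about how the rule was applied, starting from the first instance;
the function name every_kth is hypothetical. Under this assumption the slice reproduces the
test-day counts listed in the table, whereas the training-day counts do not follow the rule
exactly, so the authors' procedure may have differed there.

```python
def every_kth(rows, k):
    """Systematic sampling: keep the 1st, (k+1)-th, (2k+1)-th, ... instance."""
    return rows[::k]

# range() stands in for the rows of a CSV file so that nothing has to be loaded.
test_day_sizes = {
    "LDAP": (1905191, 30),
    "MSSQL": (5763061, 100),
    "NetBIOS": (3454578, 60),
    "SYN Flood": (4284751, 60),
}
sampled_counts = {
    attack: len(every_kth(range(n_rows), k))
    for attack, (n_rows, k) in test_day_sizes.items()
}
```

Systematic sampling of this kind keeps the temporal spread of each attack file while reducing
the class sizes to a comparable order of magnitude.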