BDCC 08 00116 v2
cognitive computing
Article
Attention-Driven Transfer Learning Model for Improved IoT
Intrusion Detection
Salma Abdelhamid 1, * , Islam Hegazy 2 , Mostafa Aref 2 and Mohamed Roushdy 2
1 Computer Science Department, Faculty of Computers and Information Technology, Future University in
Egypt, Cairo 11835, Egypt
2 Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University,
Cairo 11566, Egypt; islheg@cis.asu.edu.eg (I.H.); mostafa.aref@cis.asu.edu.eg (M.A.);
mroushdy@cis.asu.edu.eg (M.R.)
* Correspondence: salma.abdelhamed@fue.edu.eg
Abstract: The proliferation of Internet of Things (IoT) devices has become inevitable in contemporary
life, significantly affecting myriad applications. Nevertheless, the pervasive use of heterogeneous IoT
gadgets introduces vulnerabilities to malicious cyber-attacks, resulting in data breaches that jeopar-
dize the network’s integrity and resilience. This study proposes an Intrusion Detection System (IDS)
for IoT environments that leverages Transfer Learning (TL) and the Convolutional Block Attention
Module (CBAM). We extensively evaluate four prominent pre-trained models, each integrated with an
independent CBAM at the uppermost layer. Our methodology is validated using the BoT-IoT dataset,
which undergoes preprocessing to rectify the imbalanced data distribution, eliminate redundancy,
and reduce dimensionality. Subsequently, the tabular dataset is transformed into RGB images to
enhance the interpretation of complex patterns. Our evaluation results demonstrate that integrating
TL models with the CBAM significantly improves classification accuracy and reduces false-positive
rates. Additionally, to further enhance the system performance, we employ an Ensemble Learning
(EL) technique to aggregate predictions from the two best-performing models. The final findings
prove that our TL-CBAM-EL model achieves superior performance, attaining an accuracy of 99.93%
as well as high recall, precision, and F1-score. Hence, the proposed IDS is a robust and efficient
solution for securing IoT networks.
Keywords: attention mechanism; deep learning; ensemble learning; Internet of Things; intrusion
detection; transfer learning
Citation: Abdelhamid, S.; Hegazy, I.; Aref, M.; Roushdy, M. Attention-Driven Transfer Learning Model
for Improved IoT Intrusion Detection. Big Data Cogn. Comput. 2024, 8, 116.
https://doi.org/10.3390/bdcc8090116
a dynamic configuration [2] and increases the threat of unknown attacks for newly intro-
duced devices. Furthermore, IoT networks are large-scale networks in which devices are
embedded in open vicinities and transfer a sheer amount of data, making them vulnerable
to different attacks [3]. Consequently, numerous studies have started deploying Machine
Learning (ML) and Deep Learning (DL) techniques to conquer IoT barriers and strengthen
the performance of IDSs in IoT networks [4,5]. The intelligence of these techniques and their
learning potential are used to analyze the system and distinguish any suspicious behavior.
Nonetheless, both ML and DL approaches assume similar data distributions for labeled and
unlabeled data [6]. In addition, the training process is time-consuming, which limits their
use in real-time applications. Therefore, to improve ML and DL techniques, scientists have
proposed Transfer Learning (TL). TL uses information developed from previous training
to perform a specific task. It takes advantage of earlier knowledge to avoid the scarcity of
substantial training data and to reduce the long training and computation time required by
ML and DL techniques, making it practical for real-time or time-sensitive applications.
Nevertheless, most TL models are based on Convolutional Neural Networks (CNN)
that are trained using massive amounts of data. CNNs have been effective in an assortment
of applications, such as image classification, Natural Language Processing (NLP), detection
of traffic signals, and self-driving vehicles [4]. Despite its remarkable results, the CNN
architecture performs better when crucial data information is encoded within patterns,
such as in voice applications and imaging [7]. Thus, to make the best out of CNNs, we
convert our tabular data into images. Transforming tabular data into images allows the
representation of features through spatially coherent pixels, making it possible to use
the powerful feature extraction capabilities of CNNs. This approach allows the network
to deduce useful patterns from pixel values. Furthermore, image-based representations
provide better interpretability, allowing researchers to visually analyze the representations
and comprehend the underlying relationships between data. Several studies have demon-
strated the efficacy of this approach. For instance, the DeepInsight method has shown
enhanced performance by converting diverse types of data, including gene expression and
text, into images to leverage the CNNs’ feature extraction capabilities [8]. Similarly, the
IGTD algorithm [7] and the REFINED approach [9] have successfully applied image-based
representations for drug response prediction and feature visualization. Moreover, algo-
rithms for converting tabular data into images are flexible and scalable, requiring minimal
prior knowledge of features, thus making them applicable across various datasets and
applications. This versatile approach shows the potential for enhancing the performance
and generalization of CNN models across different tabular datasets, which are widely used
in diverse fields. Additionally, converting the dataset into images benefits from the power
of transfer learning by utilizing powerful pre-trained models.
This study presents a TL-based IDS for IoT environments. We explore four prevalent
deep transfer learning models: Visual Geometry Group (VGG) [10], Residual Network
(ResNet) [11], MobileNet [12], and EfficientNet [13]. Each of these models represents a
state-of-the-art technique in deep learning, which provides a thorough comparison of pre-
trained leading-edge technologies. Moreover, these models have proven their effectiveness
in image classification tasks, providing a reliable foundation for our study. In our approach,
we integrate a standalone Convolutional Block Attention Module (CBAM) with each model.
CBAM is a neural network module that enhances feature representations by applying
sequential channel attention and spatial attention [14]. It enables models to focus on the
most relevant parts of the input data, improving the performance on various vision tasks.
This integration leads to the creation of eight distinct models for evaluation. Finally, we
aggregate the predictions of the two best models using an average ensemble strategy for
more accurate final predictions. We train the models and validate them using the BoT-IoT
network dataset [15]. This dataset combines four diverse attacks: Denial of Service (DoS),
Distributed Denial of Service (DDoS), Reconnaissance, and Theft attacks. The foremost
contributions of this study are as follows:
• Introducing a novel use of the CBAM as a standalone top layer block. This innovative
approach emphasizes the extraction of discriminative features and informative regions
within both the spatial and channel domains of the generated images. This approach
strengthens the performance of the models without adding computational complexity
or compromising the generalization of the system.
• Leveraging the knowledge of pre-trained models to overcome the long training
time and provide a robust foundation for IoT attack classification with minimal
labeled data.
• Transforming the dataset into RGB images to facilitate the visualization of tabular data,
improve data interpretation, and empower the extraction of meaningful patterns from
complex feature relationships.
• Aggregating the predictions of diverse individual models through Ensemble Learning
(EL) to enhance the results and increase the robustness of the intrusion detection system.
• Proposing a new methodology that explores four widely used, state-of-the-art mod-
els. Our study can be effectively used in image classification tasks across various
domains with diverse data types, offering a versatile approach for future applications
and research.
Following the introduction, the structure of this paper is as follows: Section 2 overviews
the background and related studies in the field of IoT IDSs. Section 3 introduces the pro-
posed methodology, along with TL, CBAM, and EL practices. Section 4 presents the
experimental results. We discuss the results in Section 5. Finally, Section 6 presents the
conclusions of the study.
CNN3D models, respectively. For multiclass classification in the BoT-IoT, the accuracy
percentage was 99.97% for CNN1D, 99.95% for CNN2D, and 99.94% for CNN3D, with a
maximum precision and F1-score of 99.96%.
Yang et al. [18] proposed an approach for distinguishing malicious traffic using a
self-supervised Contrastive Learning (CL) approach. They preprocessed the raw data of
the BoT-IoT dataset and transformed the unlabeled data into vector traffic data. Their
model used a self-attention mechanism and a GELU-LSTM module to learn information
and extract features of malicious data. The experimental work resulted in 99.48% accuracy
and 99.46% F1-Score.
Awajan [19] presented a fully connected four-layer network architecture for malicious
detection in IoT environments. The communication protocol-independent model achieved
high precision in sinkhole attacks, with an average accuracy of 93.74%, precision of 93.172%,
recall of 93.824%, and an F1-score of 93.472%. However, the authors trained and evaluated
the models using an experimental dataset without using a benchmark dataset. Moreover,
each new deployment of the IDS required retraining with a dataset exclusive to that
specific system.
He et al. [20] introduced a lightweight IDS based on feature grouping for securing
IoT. Their optimized model used ML and DL for attack detection. Their method involved
designing semantic-based features, implementing a fast protocol parsing method, and
proposing session merging and feature grouping. Their IDS achieved a classification
accuracy of 99% on three public IoT datasets: MedBIoT, MQTT-IoT-IDS2020, and BoT-IoT.
Their model was suitable for constrained-processing IoT devices; however, further training
for each attack was required.
the Tshark network analyzer. They employed Feed Forward Neural Network (FNN) for
both binary classification (bFNN) and multiclass classification (mFNN). Their optimized
and fine-tuned model delivered accuracies of 99.99% in binary classification and 99.79% in
multi-classification.
Ullah et al. [17] also proposed an IDS based on TL, which reduced the model’s com-
plexity. However, the accuracies of their experimental work degraded to 99.3%, 98.6%, and
98.13% for the 1D, 2D, and 3D CNNs, respectively. The limitation of this study was the
high False Negative Rate (FNR) in the results.
Table 1. Cont.
3. Proposed Methodology
This section discusses the proposed IDS and provides an overview of the implemented
techniques and approaches.
Figure 1. The proposed IDS model for experimental work.
3.2. Deep Transfer Pre-Trained Models
Pre-trained models provide solid ground for additional training and perform effectively in many
areas, such as image recognition, fraud detection, and natural language processing [32].
Researchers have developed these models to address comparable tasks through refinement or
fine-tuning. They are beneficial for minimizing the time and resources required to solve complex
problems. This study investigates the performance of four pre-trained models: ResNet50, VGG16,
MobileNetV1, and EfficientNetB0. Each model has a different approach and unique architecture,
thereby ensuring that our research benefits from a broad spectrum of advanced techniques and
methodologies. The models are pre-trained using the ImageNet dataset [33]. This huge dataset
includes 1000 classes and 14,197,122 images arranged according to the WordNet structure. It
contains 1,281,167 training images, 50,000 validation images, and 100,000 test images. Many tools
and platforms provide
models already trained on the ImageNet dataset, such as TensorFlow [34], PyTorch [35], and
Keras Applications [36]. The availability of such trained models facilitates their implementation,
as we directly load the models with the saved weights, which eliminates the need for a long
pre-training time and provides a starting point to build our system. In our implementation, we
freeze the layers of the pre-trained models. This approach reduces the number of parameters that
need to be stored in memory in forward and backward propagation and eliminates the need to
calculate the gradients of the loss function for the frozen layers. With respect to the number of
parameters, freezing the layers leads to a reduced complexity percentage of 11% for VGG16, 92% for
ResNet50, and 76% for both MobileNetV1 and EfficientNetB0.
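The percentages quoted above can be approximated with simple arithmetic. The sketch below assumes the parameter totals reported by Keras model summaries for each network with and without its classification head; these counts are assumptions introduced here for illustration and are not taken from the paper. It computes the share of all parameters that sits in the frozen convolutional base.

```python
# Parameter counts assumed from Keras model summaries (not from the paper):
# (total with the ImageNet classification head, convolutional base only).
PARAM_COUNTS = {
    "VGG16": (138_357_544, 14_714_688),
    "ResNet50": (25_636_712, 23_587_712),
    "MobileNetV1": (4_253_864, 3_228_864),
    "EfficientNetB0": (5_330_571, 4_049_571),
}


def frozen_share(total: int, base: int) -> float:
    """Return the percentage of all parameters held by the frozen base."""
    return 100.0 * base / total


# Rounded percentage of parameters that no longer need gradient updates.
FROZEN_PERCENT = {
    name: round(frozen_share(total, base))
    for name, (total, base) in PARAM_COUNTS.items()
}
```

Under these assumed counts, VGG16 stands out because most of its parameters live in the fully connected head that is removed, whereas the other three networks keep most of their parameters in the frozen base.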
• VGG: This model classifies images into 1000 categories and is one of the most important
milestones in the object recognition field. VGG has many variants, among which VGG16 is the
most well-known. The number “16” refers to the total number of weight layers. It requires an
input size of 224 × 224. Although the model stacks small convolution filters of size 3 × 3, the
stacked filters yield an effective receptive field comparable to that of larger filters. The hidden
layers of this model use the Rectified Linear Unit (ReLU) activation function. The model ends with
three fully connected layers with 4096, 4096, and 1000 channels, respectively. The final
activation function is the softmax function.
• ResNet: Amongst TL models, ResNet has shown outstanding performance. It is a
CNN with skip connections in its architecture. These skip connections allow the
gradient to flow smoothly and maintain the key features until the last layer of the
model. ResNet has different versions; the premier version is ResNet34, which has
34 weighted layers and a 3 × 3 filter inspired by the VGG model. The most widely
used model is ResNet50. It includes 50 layers: 48 convolutional layers, 1 MaxPool
layer, 1 average pooling layer, and 1 fully connected layer with 1000 nodes. This model
uses the softmax activation function.
• MobileNet: This is a lightweight DNN-based model for low-resource devices. It
applies two convolution techniques. The first one is a 3 × 3 depth-wise convolution to
filter the inputs. The second convolution technique is a 1 × 1 point-wise convolution
layer that combines the filtered inputs. MobileNetV1 is the earliest version of this
family and encompasses 28 convolutional layers followed by a fully connected layer
that is activated by the softmax function.
• EfficientNet: Proposed by Tan et al. [13], this model aims to achieve better results by
balancing the depth, width, and resolution of the network via compound scaling. The
main block of the model is a mobile inverted residual block (also called MBConv). The
authors demonstrated the effectiveness of their model by scaling up the well-known
ResNet and MobileNet Architectures, in addition to scaling up a newly proposed
network, EfficientNetB0, to induce a family of EfficientNet models. The baseline
architecture of this family comprises two convolution layers with sixteen MBConv
modules in between.
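Two of the architectural ideas listed above lend themselves to a short worked example: the receptive field obtained by stacking small 3 × 3 filters (VGG), and the weight savings of a depth-wise plus point-wise convolution pair (MobileNet). The sketch below is illustrative only; it assumes stride-1 convolutions and ignores bias terms, and the function names are hypothetical.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def standard_conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise convolution followed by a 1 x 1 point-wise one."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing the channels
    return depthwise + pointwise
```

For example, three stacked 3 × 3 layers cover the same 7 × 7 region as a single 7 × 7 filter, and a 3 × 3 layer mapping 64 to 128 channels needs 73,728 weights in standard form but only 8,768 in depth-wise separable form.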
Table 2 summarizes the base architectures of these models. In this table, “Model size”
denotes the size of memory required by the model. “Depth” refers to all layers that have
parameters within the model. Lastly, top-1 and top-5 accuracies refer to the accuracies of
the model with the ImageNet dataset [36]. All the models have an input size of 224 × 224
for RGB images.
Figure 2. The overall CBAM structure.
The functionalities of the CAM are expressed as follows:
$$F^{c}_{avg} = \mathrm{AvgPool}^{c}(F) \tag{1}$$
$$F^{c}_{max} = \mathrm{MaxPool}^{c}(F) \tag{2}$$
$$M_{c}(F) = \sigma\left(W_{1}\left(W_{0}\left(F^{c}_{avg}\right)\right) + W_{1}\left(W_{0}\left(F^{c}_{max}\right)\right)\right) \tag{3}$$
where σ is the sigmoid function, and $W_{0} \in \mathbb{R}^{(C/r) \times C}$ and $W_{1} \in \mathbb{R}^{C \times (C/r)}$ are the MLP
weights, respectively [14].
The SAM uses average and maximum pooling as well, but it is associated with a convolution layer
instead of the MLP. It generates two 2D maps: $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$.
It employs a sigmoid activation function and element-wise multiplication to infer a spatial
attention map $M_{s} \in \mathbb{R}^{C \times H \times W}$. The operations of the spatial attention module on an
input feature map (F) are expressed as follows [14]:
$$F^{s}_{avg} = \mathrm{AvgPool}^{s}(F) \tag{4}$$
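To make the channel attention step of Equations (1) to (3) concrete, the following NumPy sketch computes the channel weights for a single feature map stored as C × H × W. It is a minimal illustration under stated assumptions, not the authors' Keras implementation: the weights are passed in explicitly, a ReLU is placed after W0 as described in the original CBAM paper [14], and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feature_map, w0, w1):
    """Channel attention weights M_c for a (C, H, W) feature map.

    w0 has shape (C // r, C) and w1 has shape (C, C // r), where r is the
    reduction ratio.
    """
    f_avg = feature_map.mean(axis=(1, 2))   # Eq. (1): average pool per channel
    f_max = feature_map.max(axis=(1, 2))    # Eq. (2): max pool per channel

    def shared_mlp(v):
        # The same two-layer MLP is applied to both pooled vectors.
        # The ReLU after W0 is an assumption taken from the CBAM paper.
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(shared_mlp(f_avg) + shared_mlp(f_max))   # Eq. (3)


def apply_channel_attention(feature_map, w0, w1):
    """Scale every channel of the feature map by its attention weight."""
    m_c = channel_attention(feature_map, w0, w1)
    return feature_map * m_c[:, None, None]
```

With all-zero MLP weights, every channel receives a weight of 0.5, which is a convenient sanity check before training.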
Figure 3. The architecture of the proposed model: (a) the pre-trained model with customized top
layers; (b) the pre-trained model with customized top layers and plugged CBAM.
3.4. Ensemble Learning
Ensemble learning is the last stage in our proposed system. It aggregates an assortment of base
learning models to reach an enhanced overall performance. The two broad techniques of ensemble
learning are parallel and sequential learning [28]. In the parallel approach, different base
learners are trained independently, and then their predictions are combined via a combiner. The
sequential approach, on the other hand, trains the base models sequentially, such that each model
optimizes the errors of its precedent. The Random Forests algorithm is a widely used parallel
ensemble model, while the Boosting algorithm is an example of a sequential ensemble. Ensemble
techniques could combine homogenous or heterogeneous classifiers. Homogenous ensembles use the
same ML algorithm for the base learners, while heterogeneous ensembles employ diverse ML
algorithms [28]. Combining the predictions of the base models is divided into voting techniques
and meta-learning
techniques. In the voting approach, each base classifier independently predicts the class
label for a given input, and the final prediction is determined by aggregating the individual
predictions using a specific rule, such as maximum voting, average voting, or weighted
average voting. Meta-learning techniques, on the other hand, involve multiple learning
stages in which the output of the base learners is used to train meta-models. These meta-
models aim to enhance the overall performance using the knowledge they acquired from
previous outputs.
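The three voting rules mentioned above can be compared on a small example. The sketch below assumes each base learner outputs class probabilities stacked into an array of shape (learners, samples, classes); the function names and the example weights are hypothetical.

```python
import numpy as np


def max_voting(probabilities):
    """Majority vote over the label predicted by each learner."""
    votes = probabilities.argmax(axis=2)                 # (learners, samples)
    n_samples, n_classes = probabilities.shape[1:]
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for learner_votes in votes:
        counts[np.arange(n_samples), learner_votes] += 1
    return counts.argmax(axis=1)                         # ties go to the lowest label


def average_voting(probabilities):
    """Pick the class with the highest mean probability."""
    return probabilities.mean(axis=0).argmax(axis=1)


def weighted_average_voting(probabilities, weights):
    """Pick the class with the highest weighted mean probability."""
    return np.average(probabilities, axis=0, weights=weights).argmax(axis=1)


# Three learners, two samples, two classes.
EXAMPLE = np.array([
    [[0.90, 0.10], [0.2, 0.8]],
    [[0.40, 0.60], [0.3, 0.7]],
    [[0.45, 0.55], [0.1, 0.9]],
])
```

On the first sample of `EXAMPLE`, two of three learners vote for class 1, yet the single confident learner pulls the averaged probability toward class 0, so maximum voting and average voting disagree.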
In our study, we apply the ensemble averaging approach. The approach fuses the
predictions of the base learners and decides the class based on the highest probability. The
final probability of a class is the arithmetic mean of all base learners’ probabilities for this
class. Given the diverse structures of the network, the inference may take place directly
on the devices. The averaging approach is well-suited to this scenario as it requires less
computational overhead and memory compared to more complex methods like stacking
or boosting. This ensures quicker inference times and lower energy consumption, making
it a practical choice for resource-constrained IoT devices. Our models use the softmax
activation function to obtain the probability of each class, which is calculated as follows:
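$$s(z)_{j} = \frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}} \quad \text{for } j = 1, 2, \ldots, K \tag{9}$$
where z denotes the input vector, K is the total number of classes, $e^{z_{j}}$ is the exponential of
the j-th element of the input vector, and the denominator sums the exponentials $e^{z_{k}}$ over all K
elements, which normalizes the outputs so that they sum to 1.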
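As a minimal sketch of the averaging ensemble built on the softmax of Equation (9), the code below converts the raw outputs (logits) of two base learners into probabilities and takes their arithmetic mean. The max-subtraction trick for numerical stability and the function names are additions made here for illustration; they are not described in the paper.

```python
import numpy as np


def softmax(z):
    """Softmax over the last axis, shifted by the maximum for stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def average_ensemble(prob_a, prob_b):
    """Average two learners' class probabilities and pick the top class."""
    mean_prob = (np.asarray(prob_a) + np.asarray(prob_b)) / 2.0
    return mean_prob, mean_prob.argmax(axis=-1)
```

Because each learner's probabilities sum to 1, their mean is also a valid probability distribution, so no renormalization is needed after averaging.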
4. Experimental Work
4.1. Environmental Setup and Hyperparameters
We conducted our experimental work on Google Colaboratory (Colab) [41], which
is a cloud-based Jupyter notebook environment that provides free-of-charge access to
substantial computing resources. The hardware allocated for a user is Intel(R) Xeon vCPU
@2.3 GHz, 2 cores, and 13 GB RAM. Nevertheless, Colab does not guarantee the assigned
resources over time, which precludes a fair comparison of the training and testing
time taken by each model. We implemented the models in Python 3.10,
along with the Keras 2.12 API running on a TensorFlow 2.13 backend. In addition, we used
NumPy [42], Pandas [43], and Matplotlib [44] libraries. We evaluated the classification
results using the Scikit-Learn (sklearn) library [45].
The number of epochs was set to 20, and the batch size was set to
64. To save resources and reduce the training time, we monitored the validation accuracy,
and an early-stopping callback halted training if there was no improvement for two
consecutive epochs. We applied a dynamic learning rate approach, in which we reduced
the learning rate at higher epochs. Starting at 0.01, the learning rate was reduced every
2 consecutive epochs to 0.005, 0.001, and 0.0005 until it was set at a minimum value of
0.0001. Regarding the CBAM, we set the kernel size to 3 × 3 and the reduction ratio to
1. For the loss function, we used the sparse categorical cross-entropy, and the optimizer
used was the “Adam Optimizer”. In the pre-trained models, and as aforementioned, we
froze the trainable layers of the models and removed their top layers. These top layers
are responsible for the classification process and were originally designed to match the
1000 classes within the ImageNet dataset. To adapt VGG16, we removed the final
Flatten layer, the 2 dense layers, each with 4096 neurons and a ReLU activation, and
the final dense layer, which comprises 1000 neurons representing the categories of the
ImageNet dataset. The top layer becomes a MaxPooling layer that passes an output of size
7 × 7 × 512 to the CBAM, where 512 is the number of channels. For ResNet50, we removed
the global average pooling (GAP) layer and the final dense layer with 1000 output units and
a softmax activation function. The output feature map of its block is passed as an input to
the CBAM, with a size 7 × 7, and 2048 channels. We excluded the top average pooling layer
and the fully connected layer, which had 1000 neurons and softmax activation functions
from MobileNetV1. The output of its top point-wise convolution block is a 7 × 7 feature
map with 1024 channels. For EfficientNetB0, we also excluded the average pooling layer,
along with the 1000-neuron fully connected layer that is activated by a softmax function.
This model passes to the CBAM a feature map of 7 × 7 × 1280, where the latter represents
the number of channels.
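The learning-rate schedule and the early-stopping rule described above can be sketched in plain Python. The step boundaries are an assumption: the text states only that the rate drops every 2 epochs through the listed values, so this sketch assumes zero-based epochs 0 and 1 use 0.01, epochs 2 and 3 use 0.005, and so on. The function names are hypothetical. A function like `learning_rate_for_epoch` could be wrapped for Keras's `LearningRateScheduler` callback, which calls its schedule with the epoch index and the current rate.

```python
LEARNING_RATES = [0.01, 0.005, 0.001, 0.0005, 0.0001]


def learning_rate_for_epoch(epoch: int, step: int = 2) -> float:
    """Learning rate for a zero-based epoch; the last value acts as the floor."""
    index = min(epoch // step, len(LEARNING_RATES) - 1)
    return LEARNING_RATES[index]


def stopping_epoch(val_accuracies, patience: int = 2) -> int:
    """Zero-based epoch at which training stops under the early-stop rule.

    Training halts once the validation accuracy has failed to improve on its
    best value for `patience` consecutive epochs; otherwise the last epoch
    index is returned.
    """
    best = float("-inf")
    stale = 0
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracies) - 1
```

With this assumed schedule, the minimum rate of 0.0001 is reached at the ninth epoch (index 8) and held for the remaining epochs unless early stopping ends training first.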
4.2. Dataset
We validated our intelligent system using the BoT-IoT dataset. It is a real IoT network
representation generated at the University of New South Wales (UNSW), Canberra. The
raw dataset holds around 70 GB of PCAP files that are processed using a network audit
tool, namely Argus. The full dataset has more than 72 million instances and is represented
by 43 independent features. The dataset contains both legitimate and adversarial traffic. The
authors created two subsets of the original BoT-IoT dataset: the 5% subset and the 10-Best
features subset. The first subset includes around 3.6 million records, representing 5% of
the full dataset instances. The authors provided this subset as a manageable and compact
version for easier implementation. On the other hand, the 10-Best subset encompasses only
10 features out of the 43 elementary independent features. The authors determined these
best 10 features by mapping the average correlation coefficient and joint entropy and then
selecting the features with the best scores. We used the 10-Best BoT-IoT subset, which is
available for public use in a CSV file format. In addition to the 10 selected features, this
dataset includes 5 independent flow identifiers and 3 dependent features [15]. The flow
identifiers are “saddr”, “sport”, “daddr”, “dport”, and “proto”. The three dependent features
are “attack”, “category”, and “subcategory”. The authors also included “pkSeqID”,
a numerical feature that serves as a row identifier. The descriptions and datatypes of all
features in this subset are depicted in Table 3.
In ML and DL, data preprocessing is an essential phase as it ensures that the input
data are consistent and properly formatted, both of which have a direct impact on the per-
formance and accuracy of the model. It also helps in handling missing values, outliers, and
feature scaling, making the training process more efficient and the results more reliable [46].
The preprocessing steps are as follows:
• Data Cleaning: In the early data preprocessing stages, we removed null, duplicate, and
infinite values from the dataset. We used the “drop_duplicates()”, “isinf()”, and
“isnull()” methods to carry out this process.
• Dimensionality Reduction: Reducing the dimensions plays a vital role in learning-
based IoT IDSs, as it removes redundant information that negatively impacts the
results of the model. It also decreases the computational and training time costs. We
reduced the dimensionality of the dataset by dropping the flow identifiers features.
These features provide local information that cannot be generalized and can lead to
biased predictions toward the nodes of the networks [47]. The “pkSeqID” feature is
also dropped as it has the same task as the automatically created index. Furthermore,
since this study aims to detect the main attack category, we dropped the “subcategory”
feature along with the “attack” feature, noting that the information provided by the
latter is implicitly expressed by the “category” feature.
• Data Transformation: In this step, we converted any categorical features to numerical
ones for additional processing. The “category” feature is the only remaining categor-
ical feature after dropping the previously mentioned features. It holds five textual
representations of the instance category, which are DoS, DDoS, Reconnaissance, Theft,
and Normal. We used the LabelEncoder class from the sklearn library to encode a
unique number for each of these five classes. Integer labels require less memory than
textual representations, which is beneficial for reducing memory consumption and
managing large datasets. Label encoding is also a computationally efficient method
that matches most deep learning framework requirements and streamlines the prepa-
ration of data. Moreover, most ML models operate on numerical data, making this
conversion a necessary step in data preprocessing [48].
• Data Balancing: Table 4 presents the class distribution of the BoT-IoT dataset. As
shown, over 98% of data instances fall under DoS and DDoS classes. This data imbal-
ance is a major challenge to accurate detection, as it leads to models that are biased
towards the majority classes, reducing the overall predictive performance and gen-
eralizability [48]. Therefore, for enhanced performance, the dataset should enclose a
consistent number of samples in each class [48,49]. To overcome this problem, many
studies have used the oversampling of the minority classes along with the under-sampling
of the majority ones [49]. We used a random under-sampling technique to
reduce the “DDoS” and “DoS” classes, and we used the Synthetic Minority Oversampling
Technique (SMOTE) [50] to augment the minority classes. SMOTE is a k-nearest-neighbors-based
algorithm that creates synthetic data points that differ slightly
from the original data points. The primary benefit of this strategy is that dataset
augmentation depends on newly created data points instead of duplicating existing
data. Rather than randomly oversampling the minority classes, SMOTE creates, for
each minority sample $x_i$, a new sample $x_i'$ that depends on $x_j$, a sample randomly
selected from the k nearest neighbors of $x_i$, and on λ, a random value in the range [0, 1]. The
two classes, “Normal” and “Theft”, are oversampled to reach our target number of
samples. SMOTE is the most used oversampling technique and is compatible with
a wide range of data types [51]. We used the SMOTE class of the Python imbalanced-learn
library to oversample the minority classes of the dataset automatically with k = 2.
Equation (10) defines the generated samples of SMOTE [50].
$$x_i' = x_i + \lambda \left(x_j - x_i\right) \tag{10}$$
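The cleaning, dimensionality reduction, encoding, and SMOTE steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes a pandas frame whose column names match the feature names quoted in the text, and it substitutes `np.unique` for sklearn's LabelEncoder (both assign integer codes in sorted class order). It implements only the interpolation of Equation (10) with a brute-force neighbor search rather than calling the library's SMOTE class. The helper names `clean_and_encode`, `k_nearest`, and `smote_sample` are hypothetical.

```python
import numpy as np
import pandas as pd

# Column names follow the features quoted in the text.
FLOW_IDENTIFIERS = ["saddr", "sport", "daddr", "dport", "proto"]
OTHER_DROPPED = ["pkSeqID", "attack", "subcategory"]


def clean_and_encode(df):
    """Clean the frame, drop non-generalizable columns, and encode 'category'."""
    numeric = df.select_dtypes(include=[np.number])
    finite_rows = ~np.isinf(numeric).any(axis=1)          # rows without +/- inf
    df = df[finite_rows].dropna().drop_duplicates()
    df = df.drop(columns=FLOW_IDENTIFIERS + OTHER_DROPPED, errors="ignore")
    # Sorted class names and integer codes, as a stand-in for LabelEncoder.
    classes, labels = np.unique(
        df["category"].to_numpy(dtype=str), return_inverse=True
    )
    features = df.drop(columns=["category"]).to_numpy(dtype=float)
    return features, labels, classes


def k_nearest(minority, index, k=2):
    """Return the k nearest minority samples to minority[index] (Euclidean)."""
    distances = np.linalg.norm(minority - minority[index], axis=1)
    order = np.argsort(distances)
    # Position 0 is the sample itself, assuming no exact duplicates remain.
    return minority[order[1:k + 1]]


def smote_sample(x_i, neighbors, rng):
    """Eq. (10): interpolate between x_i and one randomly chosen neighbor."""
    x_j = neighbors[rng.integers(len(neighbors))]
    lam = rng.random()                                    # lambda in [0, 1)
    return x_i + lam * (x_j - x_i)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, which is why it resembles the original data without duplicating it.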
Figure 4. Transformation of the tabular data into RGB images, in which each 3 consecutive feature
values represent a color channel of a single pixel.
Figure 5. Samples of the generated 10 × 10 × 3 images for each attack. The RGB color variations in
the images reveal distinct patterns for each attack type, with bright and scattered RGB distributions
for DoS and DDoS, dispersed patterns for Reconnaissance, and block-like RGB clusters for Theft.
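As an illustration of the mapping shown in Figure 4, the sketch below packs a flat run of feature values into a 10 × 10 × 3 image by letting every three consecutive values become the R, G, and B channels of one pixel. It assumes the values were already scaled to [0, 1] and that 300 values are available per image. The function name `values_to_rgb_image` and the scaling to 8-bit intensities are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np


def values_to_rgb_image(values, height=10, width=10):
    """Pack height * width * 3 values in [0, 1] into an (H, W, 3) uint8 image."""
    values = np.asarray(values, dtype=float)
    expected = height * width * 3
    if values.size != expected:
        raise ValueError(f"expected {expected} values, got {values.size}")
    if values.min() < 0.0 or values.max() > 1.0:
        raise ValueError("values must be scaled to the range [0, 1]")
    pixels = np.round(values * 255).astype(np.uint8)
    # Row-major reshape: each run of 3 consecutive values becomes one pixel.
    return pixels.reshape(height, width, 3)
```

The resulting image would still need resizing to 224 × 224 before being passed to the pre-trained models, a step this sketch leaves out.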
Table 6. Performance evaluation of the TL models, the two best models that we aggregate, are
indicated in bold.
The AUC scores for all the models in this study are high, with values
ranging from 0.99987 to 1.0, demonstrating their strong ability to distinguish between
positive and negative instances. The integration of the CBAM yields AUC scores that match
or exceed those of the corresponding base models, highlighting its effectiveness in refining
feature extraction and enhancing model performance. Specifically, EfficientNetB0 + CBAM and
TL-CBAM-EL both achieved an AUC of 1.0, indicating that their predicted scores rank positive
instances above negative ones almost without exception. This score can be attributed to the
effectiveness of EfficientNetB0, combined with CBAM’s ability to focus on critical features.
The TL-CBAM-EL model benefits from ensemble learning, aggregating predictions from
the best-performing models to achieve robust classification results. The consistently high
AUC scores reflect the effectiveness of both the model architectures and the preprocessing
techniques employed.
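To make the meaning of an AUC of 1.0 concrete, the sketch below computes a binary AUC as the fraction of positive-negative pairs in which the positive instance receives the higher score. It is an illustration only: the function name `binary_auc` and the toy scores are hypothetical, and the way the study aggregates AUC across its five classes is not reproduced here.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 1, 0]
# Every positive is scored above every negative, so the ranking is perfect.
perfect = binary_auc(labels, [0.9, 0.8, 0.2, 0.1, 0.7, 0.3])
# One positive (0.4) falls below one negative (0.5), so one of nine pairs is lost.
imperfect = binary_auc(labels, [0.9, 0.4, 0.5, 0.1, 0.7, 0.3])
```

A perfect AUC therefore describes the ordering of scores, not the accuracy at a particular decision threshold, which is why an AUC of 1.0 can coexist with an accuracy slightly below 100%.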
ResNet50 + CBAM 99.53 99.60 99.60 99.60 0.99997
MobileNetV1 99.59 99.59 99.59 99.59 0.99998
MobileNetV1 + CBAM 99.80 99.80 99.80 99.80 0.99998
EfficientNetB0 99.53 99.53 99.53 99.52 0.99999
EfficientNetB0 + CBAM 99.87 99.87 99.87 99.87 1.0
TL-CBAM-EL 99.93 99.93 99.93 99.93 1.0
By studying the data in Tables 5 and 6, we deduce that the most suitable base
learners for ensemble learning are MobileNetV1 + CBAM and EfficientNetB0 + CBAM.
Moreover, opting for these models is advantageous due to their lightweight design, which
ensures efficient resource utilization and swift performance. As previously mentioned, ensemble
learning combines the strengths of multiple models to improve overall performance and
robustness. Therefore, we average the predictions of these two models to mitigate any
existing weaknesses and provide a more accurate and reliable prediction [53]. As illustrated
in Table 6, the aggregation of these two selected models via average ensemble learning
outperforms all the single models and reaches a 99.93% classification accuracy.
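The averaging step can be sketched as follows, assuming each base learner outputs one probability per class for every sample. The class order, the probability values, and the function name `average_ensemble` are illustrative assumptions rather than outputs of the trained models.

```python
def average_ensemble(probs_a, probs_b):
    """Average two models' class-probability vectors and return predicted indices."""
    predictions = []
    for p_a, p_b in zip(probs_a, probs_b):
        # Element-wise mean of the two probability vectors for one sample.
        mean = [(a + b) / 2 for a, b in zip(p_a, p_b)]
        # The predicted class is the index with the highest averaged probability.
        predictions.append(max(range(len(mean)), key=mean.__getitem__))
    return predictions


# Hypothetical class order and made-up probabilities for two samples.
CLASSES = ["DDoS", "DoS", "Normal", "Reconnaissance", "Theft"]
mobilenet_probs = [[0.55, 0.40, 0.02, 0.02, 0.01], [0.10, 0.20, 0.60, 0.05, 0.05]]
efficientnet_probs = [[0.30, 0.65, 0.02, 0.02, 0.01], [0.05, 0.15, 0.70, 0.05, 0.05]]
labels = [CLASSES[i] for i in average_ensemble(mobilenet_probs, efficientnet_probs)]
```

In the first sample the two models disagree between “DDoS” and “DoS”, and the average favors the class on which the more confident model insists, which is how averaging can correct an individual model's weak prediction.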
5.3. Overfitting Inspection
To ensure that both base learners generalize well and do not overfit,
we examine their learning curves. Figure 6 shows the training accuracy, validation
accuracy, training loss, and validation loss curves of the EfficientNetB0 + CBAM
and MobileNetV1 + CBAM models. As depicted, the training loss curves show a good fit,
and neither model suffers from overfitting. In each model, the training loss curve decreases
until it reaches a stable state, indicating that no further learning is required. Moreover, the
training loss and validation loss curves converge and stabilize at similar values, indicating
consistent performance on the training and validation data [54]. The curves illustrate that
early stopping halts training once the validation accuracy stabilizes.
The decision to terminate training after two stagnant epochs balances computational
efficiency with the potential benefits of learning rate adjustments. This approach optimizes
efficiency, reduces training time, and minimizes energy consumption.
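The stopping rule described above can be sketched as a small function that scans a validation-accuracy history and stops after two epochs without improvement. This is a simplified illustration: the function name `early_stopping_epoch`, the `min_delta` parameter, and the accuracy values are hypothetical, and learning rate adjustments are not modeled.

```python
def early_stopping_epoch(val_accuracies, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    stagnant = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best + min_delta:
            # Validation accuracy improved: record it and reset the counter.
            best = acc
            stagnant = 0
        else:
            # No improvement this epoch: stop once the patience is exhausted.
            stagnant += 1
            if stagnant >= patience:
                return epoch
    # The history ended before the patience ran out.
    return len(val_accuracies)


# Made-up validation accuracies: the best value is reached at epoch 4.
history = [0.970, 0.985, 0.992, 0.998, 0.998, 0.997, 0.999]
stop_epoch = early_stopping_epoch(history)
```

With this history, training stops at epoch 6, two epochs after the last improvement, so the later value at epoch 7 is never reached. That is the trade-off the patience setting controls.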
Figure 6. The learning curves of the two best models: (a) the training and the validation curves of
the EfficientNetB0 with the CBAM; (b) the training and the validation curves of MobileNetV1 with
the CBAM.
5.4. Significance of the CBAM
During the training phase, the CBAM reduced the training time by accelerating
the convergence of the models within fewer epochs. The models integrated with
the CBAM required fewer training epochs and reached higher accuracy. The number of training
epochs was reduced from 6 to 4 for the VGG model, from 11 to 9 for the ResNet model, from 8 to
7 for the MobileNetV1 model, and from 9 to 7 for the EfficientNet model. Moreover, to
further emphasize the significance of integrating the CBAM and the ensemble learning,
we compare the per-class precision, recall, and F1-score obtained by each of our implemented
models, as illustrated in Figure 7, Figure 8, and Figure 9, respectively.
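The per-class metrics compared in this section can be computed from true and predicted labels as sketched below, treating each class in turn as the positive class. The function name `per_class_metrics` and the toy labels are hypothetical and do not reproduce the study's results.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)}, computed one class versus the rest."""
    metrics = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


# Toy labels in which one "DoS" sample is misclassified as "DDoS".
y_true = ["DoS", "DoS", "DDoS", "DDoS", "Normal", "Theft"]
y_pred = ["DoS", "DDoS", "DDoS", "DDoS", "Normal", "Theft"]
scores = per_class_metrics(y_true, y_pred, ["DDoS", "DoS", "Normal", "Theft"])
```

In this toy case the single confusion lowers the recall of “DoS” and the precision of “DDoS” while leaving “Normal” and “Theft” untouched, which mirrors the kind of DoS/DDoS confusion discussed for Figures 7 to 9.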
As shown in Figure 7, the MobileNetV1 and EfficientNetB0 models demonstrate exceptional
performance, achieving above 98% precision for “DDoS” and “DoS” attacks and
perfect precision for the “Normal”, “Reconnaissance”, and “Theft” classes. Moreover,
integrating the CBAM enhances the precision of all models, with the TL-CBAM-EL model
standing out by recording near-perfect scores across all classes. The high precision indicates
that the models correctly identify attacks with a low false-positive rate (FPR).
Figure 7. Precision percentage per class for each of the implemented models.
Figure 8 displays the recall percentages per class for our implemented models. All the
models have approximately 100% recall for the “Normal”, “Reconnaissance”, and “Theft”
classes, indicating their strong ability to identify the true positive instances across these
classes. The recall percentages for “DoS” and “DDoS”, while slightly lower than those of the
other classes, remain high and are indicative of robust model performance. They range from
98 to 100%, demonstrating the models’ reliability in detecting these two attacks. This slight
reduction is attributed to the similar features of “DoS” and “DDoS” attacks, as both fundamentally
aim to make a service or system unavailable to its legitimate users. Nevertheless, the incorporation
of the attention mechanism enhances the models’ focus on significant features, resulting in
improved recall rates.
Figure 8. Recall percentage per class for each of the implemented models.
Figure 9 depicts the F1-score per class of our implemented models. The figure
demonstrates that all the models attain a high F1-score, indicating a balanced
performance in terms of both precision and recall. The findings indicate that the models
effectively identify true positives while minimizing false positives and false negatives.
They also indicate that the attention mechanism improves the models’ ability to classify
complex patterns in the data, leading to enhanced scores across all classes. Our tested
models exhibited high F1-scores for the “Normal”, “Reconnaissance”, and “Theft” classes,
ranging from 99.7% to 100%. The “DoS” and “DDoS” classes have slightly lower F1-scores,
but these are still high in the context of IDSs. This variation in the results indicates
that some classes have distinctive features that are easier to learn and classify, which is
affirmed by the visual representation of the data depicted in Figure 5.
Figure 9. F1-score percentage per class for each of the implemented models.
5.5. Comparison to Other Models
To assess the efficiency of our final model, we compared it to different models
previously overviewed in Section 2. In our comparison, we consider diverse approaches
to ensure the novelty and competitiveness of our approach. For a valid comparison, all
selected models use the BoT-IoT dataset. As shown in Figure 10, incorporating the CBAM
with MobileNetV1 and EfficientNetB0 yields superior performance. The TL-CBAM-EL
model surpasses all other models in terms of accuracy. This supports the
conclusion that our TL-CBAM-EL model is effective for IoT intrusion detection
and demonstrates its capability to identify intrusions more accurately than
its predecessors.
Figure 10. Comparison between the accuracies of the proposed TL-CBAM-EL model, the pre-trained
models integrated with the CBAM, and previously presented models [16–18,20,27,29].
architectures [55], coupled with the frozen layers during transfer learning, should keep
resource consumption low. Our findings indicate robust and efficient intrusion detection
capabilities. Nevertheless, to facilitate the practical implementation of the proposed
IDS in IoT networks, various deployment scenarios can be considered based on the specific
network requirements and structure. This flexibility allows the system to adapt to
different IoT environments while optimizing resource usage and energy efficiency. The following
outlines some potential deployment approaches:
• Hybrid Processing: A hybrid system in IoT takes advantage of different technologies
in the network, such as cloud or fog computing, edge computing, and on-device
computing, to provide a tailored learning-based IDS [56]. It distributes the training
and inference of the model according to the network’s specifications and the diverse
needs of the applications. Hybrid approaches aim to balance the trade-off between
response time, computational complexity, and energy consumption.
• AI edge computing: Instead of relying on centralized platforms and cloud databases,
large technology companies are increasingly developing AI chips
that allow intelligent computing to be performed on edge devices [57]. This approach
reduces latency and network congestion [56]; however, the energy levels of the selected
edge devices should be carefully considered.
• Tiny Machine Learning (TinyML): TinyML is the discipline of implementing ML models
on ultra-low-power microcontrollers or embedded devices [58]. It enables cognitive
processing and decision-making directly on edge devices without the need for cloud or
edge connectivity. This field requires further refinement and investigation to develop
benchmark DL models and datasets that can be deployed in a TinyML-based IDS.
• Adaptive security: Given the dynamic nature of mobile IoT environments, security
measures must be continuously adjusted to ensure ongoing protection of data and
devices [59]. Adaptive security represents a promising research avenue for mobile
IoT, as it allows for the implementation of a range of defense mechanisms tailored
to specific contextual environments. This adaptability involves addressing different
types of attacks and vulnerabilities that IoT devices encounter as they move through
various zones and networks.
6. Conclusions
This research proposes a novel approach for intrusion detection systems in IoT networks
based on transfer learning, ensemble learning, and the Convolutional Block Attention
Module (CBAM). The study evaluates four state-of-the-art pre-trained models: VGG16,
ResNet50, MobileNetV1, and EfficientNetB0. Each model is examined with and without the
integrated CBAM, resulting in eight distinct models. Using the 10-Best BoT-IoT dataset, we
preprocessed the data to address the class imbalance and transformed tabular data into RGB
images, enabling spatially coherent feature representation and complex pattern recognition.
Our experimental results show that our approach of integrating a single lightweight
CBAM at the top layer of the models leads to faster convergence, higher accuracy, and
lower false-positive rates. Among the evaluated models, EfficientNetB0 + CBAM and
MobileNetV1 + CBAM exhibited the best performance. By applying ensemble learning
to aggregate the predictions of these two top-performing models, we developed the
TL-CBAM-EL model. The TL-CBAM-EL model achieved an accuracy of 99.93% in
detecting and classifying various intrusion attacks, surpassing the performance of the existing
methods considered in our comparison. This result underscores the effectiveness of combining transfer
learning, CBAM, and ensemble learning techniques for enhancing intrusion detection
systems in IoT networks.
For our future work, we aim to conduct a detailed analysis of the model’s complexity,
including training and inference times. This will involve assessing various optimization
strategies to further enhance computational efficiency. This analysis could not be completed
in the present study due to the instability of the Google Colab resources. Another potential
area for future research involves examining the use of different features and different
datasets to ensure the generalization and robustness of our model. Moreover, we plan to
investigate secure data transformation techniques and incorporate robust security measures
to prevent any vulnerabilities or data leaks, ensuring the integrity and confidentiality of the
transformed dataset.
Author Contributions: Conceptualization, M.A. and M.R.; Methodology, S.A.; Software, S.A.; For-
mal analysis, S.A. and I.H.; Investigation, S.A.; Data curation, S.A.; Writing—original draft, S.A.;
Writing—review and editing, I.H. and M.R.; Visualization, S.A.; Supervision, I.H., M.A. and M.R. All
authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset used in this study is publicly available by UNSW Canberra
at the Australian Defence Force Academy (ADFA) at the following link:
https://research.unsw.edu.au/projects/bot-iot-dataset (accessed on 19 June 2022).
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Fortune Business Insights. Internet of Things (IoT) Market. Available online:
https://www.fortunebusinessinsights.com/industry-reports/internet-of-things-iot-market-100307 (accessed on 23 April 2023).
2. Sarker, I.; Khan, A.; Abushark, Y.; Alsolami, F. Internet of Things (IoT) security intelligence: A comprehensive overview, machine
learning solutions and research Directions. Mobile Netw. Appl. 2023, 28, 296–312. [CrossRef]
3. Almuqren, L.; Alqahtani, H.; Aljameel, S.S.; Salama, A.S.; Yaseen, I.; Alneil, A.A. Hybrid metaheuristics with machine learning
based botnet detection in cloud assisted internet of things environment. IEEE Access 2023, 11, 115668–115676. [CrossRef]
4. Mohammadpour, L.; Ling, T.C.; Liew, C.S.; Aryanfar, A. A Survey of CNN-based network intrusion detection. Appl. Sci. 2022,
12, 8162. [CrossRef]
5. Abdelhamid, S.; Aref, M.; Hegazy, I.; Roushdy, M. A survey on learning-based intrusion detection systems for IoT networks. In
Proceedings of the 2021 Tenth International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt,
5–6 December 2021.
6. Nguyen, C.; Van Huynh, N.; Chu, N.; Saputra, Y.; Hoang, D.; Nguyen, D.; Pham, Q.; Niyato, D.; Dutkiewicz, E.; Hwang, W.
Transfer learning for wireless networks: A comprehensive survey. Proc. IEEE 2022, 110, 1073–1115. [CrossRef]
7. Zhu, Y.; Brettin, T.; Xia, F.; Partin, A.; Shukla, M.; Yoo, H.; Evrard, Y.; Doroshow, J.; Stevens, R. Converting tabular data into
images for deep learning with convolutional neural networks. Sci. Rep. 2021, 11, 11325. [CrossRef]
8. Sharma, A.; Vans, E.; Shigemizu, D.; Boroevich, K.A.; Tsunoda, T. DeepInsight: A methodology to transform a non-image data to
an image for convolution neural network architecture. Sci. Rep. 2019, 9, 11399–11405. [CrossRef]
9. Bazgir, O.; Zhang, R.; Dhruba, S.R.; Rahman, R.; Ghosh, S.; Pal, R. Representation of features as images with neighborhood
dependencies for compatibility with convolutional neural networks. Nat. Commun. 2020, 11, 4391. [CrossRef]
10. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd
International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
12. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
13. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International
Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6105–6114.
14. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on
Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
15. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the internet of
things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796. [CrossRef]
16. Alkadi, O.; Moustafa, N.; Turnbull, B.; Choo, K. A deep blockchain framework-enabled collaborative intrusion detection for
protecting IoT and cloud networks. IEEE Internet Things J. 2021, 8, 9463–9472. [CrossRef]
17. Ullah, I.; Mahmoud, Q. Design and development of a deep learning-based model for anomaly detection in IoT networks. IEEE
Access 2021, 9, 103906–103926. [CrossRef]
18. Yang, J.; Jiang, X.; Liang, G.; Li, S.; Ma, Z. Malicious Traffic Identification with Self-Supervised Contrastive Learning. Sensors 2023,
23, 7215. [CrossRef] [PubMed]
19. Awajan, A. A Novel deep learning-based intrusion detection system for IoT networks. Computers 2023, 12, 34. [CrossRef]
20. He, M.; Huang, Y.; Wang, X.; Wei, P.; Wang, X. A lightweight and efficient IoT intrusion detection method based on feature
grouping. IEEE Internet Things J. 2024, 11, 2935–2949. [CrossRef]
21. Bozinovski, S.; Fulgosi, A. The influence of pattern similarity and transfer of learning upon training of a base perceptron B2. Proc.
Symp. Inform. 1976, 3, 121–126.
22. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [CrossRef]
23. Vu, L.; Nguyen, Q.; Nguyen, D.; Hoang, D.; Dutkiewicz, E. Deep transfer learning for IoT attack detection. IEEE Access 2020,
8, 107335–107344. [CrossRef]
24. Hussain, F.; Abbas, S.; Husnain, M.; Fayyaz, U.; Shahzad, F.; Shah, G. IoT DoS and DDoS attack detection using ResNet. In
Proceedings of the 2020 IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020.
25. Fan, Y.; Li, Y.; Zhan, M.; Cui, H.; Zhang, Y. IoTDefender: A federated transfer learning intrusion detection framework for 5G IoT. In
Proceedings of the 2020 IEEE 14th International Conference on Big Data Science and Engineering (BigDataSE), Guangzhou, China,
31 December–1 January 2020.
26. Guan, J.; Cai, J.; Bai, H.; You, I. Deep transfer learning-based network traffic classification for scarce dataset in 5G IoT systems. Int.
J. Mach. Learn. Cybern. 2021, 12, 3351–3365. [CrossRef]
27. Ge, M.; Syed, N.; Fu, X.; Baig, Z.; Robles-Kelly, A. Towards a Deep Learning-Driven Intrusion Detection Approach for Internet of
Things. Comput. Netw. 2021, 186, 107784. [CrossRef]
28. Mienye, I.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022,
10, 99129–99149. [CrossRef]
29. Thakkar, A.; Lohiya, R. Attack classification of imbalanced intrusion data for iot network using ensemble learning-based deep
neural network. IEEE Internet Things J. 2023, 10, 11888–11895. [CrossRef]
30. Awotunde, J.B.; Folorunso, S.O.; Imoize, A.L.; Odunuga, J.O.; Lee, C.C.; Li, C.T.; Do, D.T. An ensemble tree-based model for
intrusion detection in industrial internet of things networks. Appl. Sci. 2023, 13, 2479. [CrossRef]
31. Alotaibi, Y.; Ilyas, M. Ensemble-Learning Framework for Intrusion Detection to Enhance Internet of Things’ Devices Security.
Sensors 2023, 23, 5568. [CrossRef] [PubMed]
32. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE
2020, 109, 43–76. [CrossRef]
33. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the
2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
34. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow:
Large-scale machine learning on heterogeneous systems. arXiv 2015, arXiv:1603.04467.
35. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing
Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
36. Keras Documentation: Keras Applications. 2020. Available online: https://www.keras.io/api/applications (accessed on 16 May 2022).
37. Soydaner, D. Attention mechanism in neural networks: Where it comes and where it goes. Neural Comput. Appl. 2022,
34, 13371–13385. [CrossRef]
38. Wang, A.; Liang, G.; Wang, X.; Song, Y. Application of the YOLOv6 combining CBAM and CIoU in forest fire and smoke detection.
Forests 2023, 14, 2261. [CrossRef]
39. Agac, S.; Durmaz, O. On the use of a convolutional block attention module in deep learning based human activity recognition
with motion sensors. Diagnostics 2023, 13, 1861. [CrossRef]
40. Wang, Y.; Chen, X.; Li, J.; Lu, Z. Convolutional Block Attention Module–Multimodal Feature-Fusion Action Recognition: Enabling
Miner Unsafe Action Recognition. Sensors 2024, 24, 4557. [CrossRef]
41. Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive
Guide for Beginners; Springer: Cham, Switzerland, 2019; pp. 59–64.
42. Van Der Walt, S.; Colbert, S.; Varoquaux, G. The numpy array: A structure for efficient numerical computation. Comput. Sci. Eng.
2011, 13, 22–30. [CrossRef]
43. McKinney, W. Pandas: A foundational python library for data analysis and statistics. In Proceedings of the Python for High
Performance and Scientific Computing, Austin, TX, USA, 14 November 2011; pp. 1–9.
44. Hunter, J. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95. [CrossRef]
45. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
et al. Scikit-learn: Machine learning in python. JMLR 2011, 12, 2825–2830.
46. Talaei Khoei, T.; Kaabouch, N. Machine Learning: Models, Challenges, and Research Directions. Future Internet 2023, 15, 332.
[CrossRef]
47. Sarhan, M.; Layeghy, S.; Moustafa, N.; Gallagher, M.; Portmann, M. Feature extraction for machine learning-based intrusion
detection in iot networks. Digit. Commun. Netw. 2024, 10, 205–216. [CrossRef]
48. Hossain, M.A.; Islam, M.S. A novel hybrid feature selection and ensemble-based machine learning approach for botnet detection.
Sci. Rep. 2023, 13, 21207. [CrossRef] [PubMed]
49. Yang, C.; Guan, W.; Fang, Z. IoT botnet attack detection model based on DBO-CatBoost. Appl. Sci. 2023, 13, 7169. [CrossRef]
50. Chawla, N.; Bowyer, K.; Hall, L.; Kegelmeyer, W. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002,
16, 321–357. [CrossRef]
51. Zhao, Y.F.; Xie, J.; Sun, L. On the data quality and imbalance in machine learning-based design and manufacturing—A systematic
review. Engineering, 2024; in press.
52. De Amorim, L.B.; Cavalcanti, G.D.; Cruz, R.M. The choice of scaling technique matters for classification performance. Appl. Soft
Comput. 2023, 133, 109924. [CrossRef]
53. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ.
Comput. Inf. Sci. 2023, 35, 757–774. [CrossRef]
54. Ying, X. An overview of overfitting and its solutions. J. Phys. Conf. Ser. 2019, 1168, 22022. [CrossRef]
55. Zhou, Y.; Chen, S.; Wang, Y.; Huan, W. Review of research on lightweight convolutional neural networks. In Proceedings of the
IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020.
56. Tekin, N.; Acar, A.; Aris, A.; Uluagac, A.S.; Gungor, V.C. Energy consumption of on-device machine learning models for IoT
intrusion detection. Internet Things 2023, 21, 100670. [CrossRef]
57. Liu, D.; Kong, H.; Luo, X.; Liu, W.; Subramaniam, R. Bringing AI to edge: From deep learning’s perspective. Neurocomputing 2022,
485, 297–320. [CrossRef]
58. Kallimani, R.; Pai, K.; Raghuwanshi, P.; Iyer, S.; López, O.L. TinyML: Tools, applications, challenges, and future research directions.
Multimed. Tools Appl. 2024, 83, 29015–29045. [CrossRef]
59. Golpayegani, F.; Chen, N.; Afraz, N.; Gyamfi, E.; Malekjafarian, A.; Schäfer, D.; Krupitzer, C. Adaptation in Edge Computing: A
review on design principles and research challenges. ACM Trans. Auton. Adapt. Syst. 2024; just accepted.