Internet of Things 27 (2024) 101300

Contents lists available at ScienceDirect

Internet of Things
journal homepage: www.elsevier.com/locate/iot

Research article

Deep Image: A precious image based deep learning method for online malware detection in IoT environment

Meysam Ghahramani a, Rahim Taheri b,∗, Mohammad Shojafar c, Reza Javidan d, Shaohua Wan e

a Faculty of Basic Sciences - Department of Mathematics and Computer Science, Lorestan University, Lorestan, Iran
b Faculty of Technology, School of Computing, University of Portsmouth, Portsmouth, UK
c 5G/6G Innovation Centre (5G/6GIC), Institute for Communication Systems (ICS), University of Surrey, Guildford, UK
d Computer Engineering and IT Department, Shiraz University of Technology, Shiraz, Iran
e School of Information and Safety Engineering, Zhongnan University of Economics and Law, Hubei, China

ARTICLE INFO

Keywords: Malware detection; Image-based clustering; Deep learning; IoT devices; Visualization analysis

ABSTRACT

In this study, we address the challenge of online malware detection for IoT devices. We propose a method that monitors malware
behavior, extracts dynamic features, and converts them into sparse binary images for analysis. The primary problem is to identify
the most effective approach among clustering, probabilistic, and deep learning methods for analyzing this unique image dataset.
The three approaches are compared and evaluated in terms of various metrics. Our study contributes insights into the performance
of various online malware detection approaches for IoT devices. We demonstrate that deep learning outperforms the other methods,
achieving the best results in seven of the eight evaluated metrics. We found that the lattice-based approach consistently returns
the maximum maliciousness level, which can be instrumental in label flipping scenarios.

1. Introduction

In today’s digital landscape, the proliferation of malware persists unabated, even in the face of ongoing efforts to detect and
combat it. The need for effective malware analysis is paramount in countering the ever-evolving and sophisticated behaviors
exhibited by malicious software [1,2]. Traditional heuristic methods have proven insufficient to combat this rising tide efficiently. In
response to these fundamental challenges, the cybersecurity community has embraced behavior-based malware detection methods,
combined with machine learning approaches, as a way forward. These approaches leverage supervised classifiers to assess their
predictive capabilities in terms of identifying pertinent features from the original dataset while balancing a high detection rate
with low computational overhead. However, although machine learning-based malware detection systems have shown promise in
identifying malware, their relatively shallow learning architecture still struggles with the identification of intricate and complex
malware strains.

∗ Corresponding author.
E-mail addresses: ghahramani.m@lu.ac.ir (M. Ghahramani), rahim.taheri@port.ac.uk (R. Taheri), m.shojafar@surrey.ac.uk (M. Shojafar),
javidan@sutech.ac.ir (R. Javidan), shaohua.wan@ieee.org (Sh. Wan).

https://doi.org/10.1016/j.iot.2024.101300
Received 15 June 2023; Received in revised form 24 October 2023; Accepted 17 July 2024
Available online 19 July 2024
2542-6605/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

The exponential growth in malware can be attributed to the financial incentives associated with malware vulnerabilities, such
as cryptocurrency miners, banking Trojans, and ransomware. Cybersecurity analysts have increasingly relied on machine learning
algorithms to intelligently discover and classify malware in a wide range of environments, including the Internet of Things (IoT).
The efficacy of these automated systems critically depends on the selection and availability of features, encompassing data extracted
from static binary executable files and dynamic malware and software execution behaviors.
The majority of contemporary deep learning-based malware detection algorithms for Windows typically identify test input files
by analyzing static PE file attributes, often generating images from the entire content of the PE file [3]. However, as PE files come in
various sizes, scaling techniques are necessary to unify images of different dimensions into a consistent format. A primary limitation
of static feature-based detection and classification techniques is their vulnerability to disguised malware.
Furthermore, many recent studies in malware analysis assume that a malware sample is readily available, although, in practice,
this is often not the case. Advanced threat actors frequently employ targeted attacks and subsequently erase all traces of the malware
they have deployed, making it inaccessible for analysis.
While most malware detection systems focus on enhancing accuracy and reducing False Positive Rates (FPR), this paper
introduces a novel perspective. It contends that while accuracy and FPR are indeed essential metrics, they alone may not suffice.
Even with high accuracy and low FPR, certain detection systems may still fall short. Therefore, this paper proposes the integration
of a supplementary metric that can enhance the overall effectiveness of malware detection systems. This novel metric is rooted
in a risk-based strategy. For each sample, a risk percentage, a value ranging from zero to one, is considered. Decisions regarding
malware detection are then based on this risk percentage.
In light of these considerations, this paper endeavors to redefine the landscape of malware detection, addressing the limitations
of existing methods. By developing a risk-based metric, our study seeks to improve the efficacy and reliability of malware detection
systems, transcending conventional accuracy and FPR considerations. The research questions, contributions, and specific aims of
this study are delineated in the subsequent sections.

1.1. Motivation and open issues

In high-risk applications such as malware detection, understanding when a machine learning model is uncertain about its
prediction is essential. When an automated malware detection algorithm is unsure about a sample, the estimated uncertainty can be
used to flag the sample for investigation by a more computationally demanding algorithm or human review. Therefore, proposing
a method that can predict the result with more certainty is important, and it is an open problem that this paper has tried to solve
by introducing a risk criterion.
Because data entering computers from networks, including Windows malware, arrives as a stream, time series are commonly used
for Windows malware classification, and many researchers have obtained very good results with this method. However, the current
study is significant because the proposed method uses the fewest samples to train a model, which is useful given that access to
a large number of malware samples for training is not always possible.

1.2. Research questions

In this paper, we are seeking answers to the following questions.

• How effective is the monitoring of target programs in detecting malicious behavior during the defined monitoring time?
• What are the comparative advantages and limitations of the cluster-based, probability-based, and deep learning-based approaches
within the proposed method for the classification and detection of malware?
• How does the time series to image conversion technique impact the accuracy and efficiency of malware detection?

1.3. Problem definition

Malware behavior can be demonstrated using time series 𝛯 shown in Eq. (1). In this representation, 0 ≤ 𝜉 ≤ 1 shows the
maliciousness level. Also, each malware has features 𝜙𝑖 ’s, 1 ≤ 𝑖 ≤ 𝑛, so that the feature 𝜙𝑖 is available at the time 𝜏𝑖 .

𝛯 = [𝜉, 𝜙1 ∶ 𝜏1 , 𝜙2 ∶ 𝜏2 , … , 𝜙𝑛 ∶ 𝜏𝑛 ] (1)

This paper proposes a malware detection method to calculate 𝜉 from [𝜙1 ∶ 𝜏1 , … , 𝜙𝑛 ∶ 𝜏𝑛 ]. Eq. (1) is not the only representation for
𝛯. For example, a two-dimensional diagram with label 𝜉, as in Fig. 1, can be used as another representation.
Table 1 summarizes the symbols and notation used in this paper.
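
As an illustration of Eq. (1), the following minimal Python sketch holds one sample as a label plus a list of feature:time pairs. The class name `Sample`, its field names, and the example values are hypothetical and are not taken from the paper's dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One time series Xi = [xi, phi_1:tau_1, ..., phi_n:tau_n] as in Eq. (1)."""
    xi: float                      # maliciousness level, 0 <= xi <= 1
    events: List[Tuple[int, int]]  # (feature index phi_i, time tau_i) pairs

    def __post_init__(self) -> None:
        if not 0.0 <= self.xi <= 1.0:
            raise ValueError("maliciousness level must lie in [0, 1]")


# Hypothetical sample: feature 3 seen at second 0, feature 7 at second 2.
sample = Sample(xi=0.4, events=[(3, 0), (7, 2)])

# A detector only sees the feature:time pairs and has to estimate xi.
features_only = sample.events
```

In this reading, training samples carry both fields, while a test sample would be scored from `events` alone.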

1.4. Contribution

The research presented in this paper contributes significantly to the field of malware detection and classification in several key
ways:

• Real-Time Monitoring and Analysis: Our approach includes a file monitoring stage, allowing for real-time observation and data
collection during a predefined monitoring period. This feature captures up-to-the-minute information about running applications,
enhancing the overall effectiveness of the malware detection system.

Fig. 1. 2D form of a time series.

Table 1
Notations.
Notation              Description
𝛯                     Time series of Malware
0 ≤ 𝜉 ≤ 1             Maliciousness level
𝜙𝑖, 1 ≤ 𝑖 ≤ 𝑛         availability at the time 𝜏𝑖
𝜉                     label

• Innovative Time Series to Image Conversion: Our proposed method introduces a novel approach to convert time series data,
which captures the behavior of running applications, into structured images. This transformation facilitates more effective analysis,
clustering, and classification of malware behaviors. We offer three distinct conversion scenarios that enrich the understanding of
these behaviors, contributing to a holistic and comprehensive analysis.
• Multi-Faceted Detection Approaches: Our research incorporates a diverse set of detection methods, including the cluster-
based approach, the probability-based approach, and the deep learning-based approach. This multi-faceted approach offers a
comprehensive strategy for identifying and categorizing malware behaviors, addressing the limitations of existing methods and
enhancing the accuracy of detection.

In summary, this paper makes a substantial contribution to the field of malware detection by enabling real-time monitoring,
converting time series data into images, employing diverse detection methods, and offering an in-depth security analysis. These
contributions collectively advance the state of the art in malware detection and significantly enhance the ability to detect and
classify malware behaviors.

1.5. Roadmap

The remainder of this paper is structured as follows: Section 2 provides a summary of related studies on malware analysis and
visualization techniques. Section 3 presents the threat model and security analysis. Section 4 presents the proposed architecture,
which includes clustering, probabilistic, and deep learning approaches. Performance analysis of the proposed methods is presented
in Section 5. Finally, Section 6 summarizes the main achievements of the paper and gives some directions for future work.

2. Related work

In the literature, various approaches have been proposed for malware analysis and detection, many of which utilize machine
learning techniques. In this section, we will discuss relevant research, emphasizing the unique aspects of our proposed method and
its advantages over existing solutions.

2.1. Static and dynamic analysis

Traditional malware detection methods often involve static and dynamic analysis. Static analysis focuses on extracting features
from executable applications without running them, such as APIs, permissions, and hardware components [4]. Machine learning
techniques have been applied to static analysis to detect malware [5]. Dynamic analysis, on the other hand, examines the behavior of
applications during runtime. Some approaches, like DroidCat [6], employ method calls and inter-component communication (ICC)
features to create classifiers for identifying Android malware.
While these static and dynamic analysis techniques have their merits, they may not effectively address certain challenges, such
as detecting applications that become malicious during runtime or identifying malware that conceals its behavior during operation.
Our proposed method aims to overcome these limitations by combining both dynamic and static analysis within a single procedure,
allowing us to leverage the strengths of each approach while mitigating their respective weaknesses.

2.2. Image-based malware classification

A promising direction in malware detection is the use of image-based classification algorithms. These methods convert binary
files into image representations, which can be processed by advanced machine learning algorithms [7–9]. The conversion involves
reading the binary file’s byte sequences as grayscale pixel values, creating a visual representation of the binary content.
Previous research has explored the use of visual features from malware samples in combination with machine learning algorithms.
Notably, Yajamanam et al. [10] utilized a GIST-based byte-plot method with deep learning techniques to achieve high accuracy in
classifying malware samples. Le et al. [11] employed convolutional neural networks (CNN) to classify binary files converted into
byte-plot images, achieving impressive results with low processing times.
While image-based methods have demonstrated their effectiveness in addressing code obfuscation challenges, these approaches
primarily focus on the visual aspects of malware representation. In contrast, our proposed method combines the benefits of
image-based classification with dynamic and static analysis, offering a more comprehensive approach to malware detection and
classification.

2.3. Space-filling curves and hybrid approaches

There has been limited research on using space-filling curves for malware classification. Baptista et al. [12] proposed a method
based on Hilbert curves and a Self-Organizing Incremental Network to classify malware. However, their experiments were conducted
on a relatively small dataset, which raises questions about the method’s effectiveness when applied to a larger and more diverse set
of samples.
Similarly, Vu et al. [13] introduced a hybrid image transformation (HIT) method that combined statistical and syntactic elements
and translated them into image formats using space-filling curves. While their approach achieved accuracy levels of 93%, it primarily
focused on categorizing objects as either ‘‘good’’ or ‘‘harmful’’, lacking the ability to identify malware families or behaviors.
In contrast, our proposed method provides a holistic approach that combines space-filling curves, static analysis, dynamic
analysis, and deep learning to detect and classify malware. By incorporating multiple detection techniques and observation windows,
our approach offers improved detection randomness and overall detection rates, addressing the limitations of existing methods in
the literature.
Overall, our proposed method advances the field of malware detection by providing a comprehensive and versatile approach
that leverages the strengths of various detection techniques to enhance the accuracy and robustness of malware classification.

3. Threat model and security analysis

In this section, the threats to Windows systems and their effect on security are discussed. In the threat model subsection, we
investigate Windows’s most common security threats that cause malicious data to enter the system. Then, under the security analysis
subsection, we discuss the effect of the proposed method on information security architecture.

3.1. Threat model

The threat actors in this model encompass a wide range of attackers, including script kiddies, hackers, cybercriminals, and
advanced persistent threats (APTs). They aim to infiltrate target systems through malware with various objectives, such as data theft,
system compromise, and unauthorized control. The targeted systems are Windows-based operating environments. These systems may
include both personal and enterprise computers. Attacks occur in diverse environments, such as corporate networks, home networks,
or standalone personal computers. Adversaries employ multiple attack vectors to deliver malware, including phishing emails,
malicious downloads, exploiting software vulnerabilities, and social engineering tactics. They may also abuse network weaknesses
or fake network access points to penetrate the target system. Malware is delivered through various means, such as malicious email
attachments, unverified software downloads, drive-by downloads, and infected external media. For example, attackers can use a
variety of tactics and techniques based on the MITRE ATT&CK [14], including:

• Gaining a foothold in the target system, which can include spear-phishing, drive-by exploitation, and exploiting known vulnera-
bilities.
• Running malicious code or scripts on the system. This may involve launching malicious software or scripts within the context of
the target application.
• Ensuring continued access to the compromised system, often by modifying system settings, creating startup processes, or
establishing backdoors.
• Employing tactics to bypass or disable security mechanisms, such as antivirus tools and intrusion detection systems.
• Obtaining credentials to access privileged information or systems, often through techniques like keylogging, credential dumping,
or brute force attacks.
• Gaining knowledge about the target system, including its network, users, and installed software, to identify vulnerable areas.

3.2. Security analysis

The security of computer resources is of paramount importance in the context of malware detection and mitigation. This section
provides an in-depth analysis of how our proposed method addresses the core security principles of authentication, authorization,
and accounting (AAA) within the framework of Windows malware defense.
Authentication: Our method significantly contributes to bolstering authentication security in Windows environments. Authenti-
cation ensures the verification of user identities, and we have taken multiple measures to protect this critical aspect: One of the risks
posed by malware is its ability to tamper with user information, potentially leading to identity theft and impersonation. Our method
detects and mitigates such risks by monitoring changes to user information, ensuring that the authenticity of user identities remains
intact. Windows malware often attempts to change user credentials or steal sensitive data for malicious purposes. Through advanced
anomaly detection techniques, we proactively identify and counteract these attempts, preserving the integrity of the authentication
process.
Authorization: Authorization, which involves defining and enforcing permissions and access control, plays a pivotal role in
Windows malware defense. Our method addresses authorization security by protecting against Windows malware that seeks to
manipulate or escalate system authorizations. Our system continually monitors and verifies permissions, ensuring that
unauthorized changes are detected and rectified promptly.
Accounting: Accounting involves the tracking and recording of user activities, which is crucial for security and forensics. Our
method offers robust accounting capabilities while safeguarding against Windows malware that seeks to disrupt user activities.
Our system maintains comprehensive logs of user activities, including processes executed, files accessed, and system changes.
This ensures that any deviations from normal behavior are promptly identified.
Our proposed method integrates these AAA security principles into a cohesive and interdependent security framework. We
recognize that attacks on computer systems often challenge multiple components simultaneously, and we have designed our system
to respond effectively to such complex threats.

4. Proposed method

The proposed method in this paper is dedicated to monitoring target programs during 𝜏 seconds, which is called monitoring time.
During this time, applications invoke features that the monitoring system has extracted as a time series that is accessible using
available datasets.
The proposed method has three main components. The first and second parts, file monitoring and time series to image conversion,
are common to all approaches. The third part encompasses three different approaches: the cluster-based approach, the
probability-based approach, and the deep learning-based approach. Each of these approaches represents a unique perspective within
the proposed method, contributing to its comprehensive evaluation and analysis. We use Algorithm 1 to map each sample into an
image with dimensions 𝑛𝜙 × 𝜏, where 𝑛𝜙 and 𝜏 represent the number of extracted features and the monitoring time, respectively.
After obtaining the images, the different approaches are used to analyze them and detect malicious samples.

4.1. File monitoring

This part is the initial stage of our proposed method and sets the stage for subsequent analysis. In this stage, our objective is
to observe and capture essential information from target programs during a predefined monitoring time, 𝜏. During this monitoring
time, various applications are executed, and we aim to collect relevant features and behaviors for further analysis. The duration of
𝜏 seconds is carefully chosen to ensure that the monitoring system has sufficient time to capture the necessary data. Throughout
the monitoring period, our system extracts features invoked by the running applications. These features can include system calls,
API calls, resource usage, network activity, and other behaviors. The choice of features depends on the specific requirements of the
malware detection task. The extracted features are recorded as time series data, where each feature invocation is timestamped. This
time series data forms the basis for the subsequent stages of our proposed method. The data collected during the monitoring period
is stored and organized for further analysis. This part essentially serves as a data acquisition phase. It allows us to gather real-time
information about the behavior of running applications, and this information is then transformed into structured representations in
the ‘‘Time series to image conversion’’ stage.

4.2. Time series to image conversion

This part represents a pivotal stage in our proposed method, where the collected time series data, detailing the behaviors
of running applications during the monitoring period, undergoes transformation into structured images. This transformation is
fundamental for enabling subsequent analysis, clustering, and classification of malware based on the observed behaviors. This part
comprises three different scenarios, each with its unique approach to the time series to image conversion.
The process unfolds in several key steps. Within the training dataset, we iterate through each time series entry 𝛯. Each entry is
characterized by a sequence of features, denoted by 𝜙𝑖, observed at specific times, represented by 𝜏𝑖. The transformation starts
by converting these feature-time pairs into structured images. For each feature 𝜙𝑖 at time 𝜏𝑖, we set the corresponding pixel in
the image matrix 𝑍 to 1. This binary representation captures the presence of the feature at the given time. Simultaneously, during
the conversion process, we compute the clustering image 𝐶𝐼. The clustering image is a crucial component in the subsequent
clustering-based approach. It tracks the presence of features across different time windows within predefined clusters. We
identify the target cluster (𝑇𝐶) based on the maliciousness level of the training data (𝜉). This is determined through the
equation 𝑇𝐶 = ⌊𝓁 × 𝜉⌋ + 1. This part is presented in lines 12–23 of Algorithm 1.
For each feature 𝜙𝑖 at time 𝜏𝑖 , we set the corresponding pixel in the clustering image 𝐶𝐼 to 1 if the feature is present. This binary
representation facilitates the clustering of similar behaviors. In addition to the clustering image, we compute the probabilistic image
𝑃 𝐼. The probabilistic image plays a role in the probabilistic-based approach and offers insights into the frequency of malware
behaviors across time. Similar to the clustering image, for each feature 𝜙𝑖 at time 𝜏𝑖 , we update the corresponding pixel in the
probabilistic image 𝑃 𝐼 by incrementing its value by 1. This accumulation of values reflects the number of malware occurrences
associated with that feature-time pair. The conversion process not only serves clustering and probabilistic approaches but also
prepares the data for deep learning-based analysis. For each entry in the training dataset, we save the resulting image matrix 𝑍 in a
designated folder labeled with the target cluster (𝑇 𝐶). These images become input data for deep learning models, allowing for more
dynamic and detailed analysis. The process is then replicated for the test dataset, enabling the application of the trained models for
the detection and classification of malware behaviors. This comprehensive transformation of time series data into images serves as
the foundation for the subsequent analytical methods and ensures a holistic approach to malware detection and classification.
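
The conversion step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, indices are 0-based (the paper's notation is 1-based), and clamping 𝜉 = 1 into the last cluster is an assumption, since 𝑇𝐶 = ⌊𝓁 × 𝜉⌋ + 1 would otherwise yield 𝓁 + 1.

```python
import math
from typing import List, Tuple

Event = Tuple[int, int]  # (feature index, time index), both 0-based here


def to_image(events: List[Event], n_features: int, tau: int) -> List[List[int]]:
    """Build the n_features-by-tau binary matrix Z for one sample."""
    z = [[0] * tau for _ in range(n_features)]
    for feature, t in events:
        z[feature][t] = 1  # pixel is 1 when the feature was invoked at time t
    return z


def target_cluster(xi: float, levels: int) -> int:
    """TC = floor(levels * xi) + 1, with 1-based cluster numbers as in the text.

    Assumption: xi = 1 is clamped into the last cluster.
    """
    return min(math.floor(levels * xi) + 1, levels)


# Hypothetical sample with 3 features monitored for 4 time steps.
z = to_image([(0, 1), (2, 3)], n_features=3, tau=4)
```

With eight levels, a sample labeled 𝜉 = 0.4 would be stored in folder `target_cluster(0.4, 8)`, that is, folder 4.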

4.3. Detection method

The detection method in this research forms the final aspect of our proposed approach for malware detection and classification. It
comprises a multi-faceted strategy, employing various techniques to identify and categorize malware behaviors. These techniques are
organized into three distinct approaches, each offering a unique perspective on the analysis of malware characteristics. The following
subsections delve into the methodology and algorithms underpinning each approach, providing a comprehensive overview of our
malware detection strategy.

4.3.1. Clustering approach


In this approach, images are analyzed, and their labels are extracted using the clustering technique. To calculate labels, a
clustering image containing malware information is needed. This image is an 𝑛𝜙-by-(𝜏 × 𝓁) black and white image 𝐶𝐼. It can be
considered a binary matrix that contains information about 𝓁 clusters. If the element in position (𝜙𝑖, 𝑗) is equal to 1, there
is at least one malware sample with label 𝜉 satisfying Eq. (2) that called feature 𝜙𝑖 at time 𝜏𝑖 = 𝑗 − 𝜏 × ⌊𝑗∕𝜏⌋.

\[ \frac{1}{\ell} \times \left\lfloor \frac{j}{\tau} \right\rfloor \le \xi < \frac{1}{\ell} \times \left( \left\lfloor \frac{j}{\tau} \right\rfloor + 1 \right) \tag{2} \]

Fig. 2 shows a clustering image of the data set used as the training set for the proposed clustering approach (CA), which contains
more than 80,000 malware samples for eight maliciousness levels. In this figure, the white pixels represent 1, and the black pixels
represent 0. To detect malware, it is enough to get a visual representation of the malware, compare it with the clusters in the
clustering image, and report the label of the nearest cluster as a malware label. The proposed clustering approach is summarized
in Algorithm 1.
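
The clustering approach can be sketched as follows, assuming the nearest cluster is chosen by the pixel-wise absolute difference used in line 24 of Algorithm 1. The function names and the toy data are hypothetical, indices are 0-based, and skipping clusters without training samples is an assumption made so that the label average is always defined.

```python
import math
from typing import List, Tuple

Matrix = List[List[int]]
Sample = Tuple[float, List[Tuple[int, int]]]  # (xi, [(feature, time), ...])


def train_clustering(train: List[Sample], n_features: int, tau: int, levels: int):
    """Return (CI, CW, CL): clustering image, cluster weights, label sums."""
    ci = [[0] * (tau * levels) for _ in range(n_features)]
    cw = [0] * levels
    cl = [0.0] * levels
    for xi, events in train:
        tc = min(math.floor(levels * xi), levels - 1)  # 0-based; clamp is an assumption
        cw[tc] += 1
        cl[tc] += xi
        for feature, t in events:
            ci[feature][tc * tau + t] = 1
    return ci, cw, cl


def detect_ca(z: Matrix, ci: Matrix, cw, cl, tau: int) -> float:
    """Return the mean label of the cluster whose block of CI is nearest to Z."""
    best, best_dist = None, None
    for i in range(len(cw)):
        if cw[i] == 0:  # assumption: ignore clusters with no training samples
            continue
        dist = sum(abs(z[r][c] - ci[r][i * tau + c])
                   for r in range(len(z)) for c in range(tau))
        if best_dist is None or dist < best_dist:
            best, best_dist = i, dist
    if best is None:
        raise ValueError("no populated cluster")
    return cl[best] / cw[best]


# Toy data: 2 features, 2 time steps, 2 maliciousness levels.
train = [(0.1, [(0, 0)]), (0.9, [(1, 1)])]
ci, cw, cl = train_clustering(train, n_features=2, tau=2, levels=2)
estimate = detect_ca([[0, 0], [0, 1]], ci, cw, cl, tau=2)
```

Here the test image matches the second cluster exactly, so the estimate equals the average label of that cluster.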

4.3.2. Probabilistic approach


In the previous section, the clustering approach through clustering images was briefly introduced for malware detection. The
clustering image provides useful information for analysts. For example, at the bottom of Fig. 2 there is a black rectangle, which
means that none of the features in this rectangle are called by any of the malware samples, which can be a useful point in
identifying malware behaviors. For this reason, this section proposes another method for improving on the clustering image, called
the probabilistic image. This image is very similar to the clustering image and can be represented by an 𝑛𝜙-by-(𝜏 × 𝓁) matrix 𝑃𝐼
with integer elements. The element $PI_{\phi_i,j}$ in position (𝜙𝑖, 𝑗) indicates that there were $PI_{\phi_i,j}$ malware samples
in the database used, labeled by 𝜉, which called 𝜙𝑖 at time 𝜏𝑖 = 𝑗 − 𝜏 × ⌊𝑗∕𝜏⌋. Algorithm 1 summarizes how to create 𝑃𝐼 and
calculate the malware labels. As in the previous method, 𝜉 satisfies Eq. (2). This method is called the Probabilistic Approach (PA).
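
A possible sketch of the probabilistic approach is given below. It assumes that the score in line 26 of Algorithm 1 compares each pixel of 𝑍 with the fraction of cluster members that invoked the feature at that time, that is, 𝑃𝐼 divided by the cluster weight 𝐶𝑊; because 𝐶𝐼 is 1 wherever 𝑃𝐼 is positive, the factor 𝐶𝐼 is omitted. Function names and toy data are hypothetical, and indices are 0-based.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, List[Tuple[int, int]]]  # (xi, [(feature, time), ...])


def train_probabilistic(train: List[Sample], n_features: int, tau: int, levels: int):
    """Return (PI, CW, CL): per-pixel sample counts, cluster sizes, label sums."""
    pi = [[0] * (tau * levels) for _ in range(n_features)]
    cw = [0] * levels
    cl = [0.0] * levels
    for xi, events in train:
        tc = min(math.floor(levels * xi), levels - 1)  # 0-based; clamp is an assumption
        cw[tc] += 1
        cl[tc] += xi
        for feature, t in set(events):  # assumption: count a sample once per pixel
            pi[feature][tc * tau + t] += 1
    return pi, cw, cl


def detect_pa(z, pi, cw, cl, tau: int) -> float:
    """Pick the cluster minimising sum |Z - PI / CW| and return its mean label."""
    best, best_dist = None, None
    for j in range(len(cw)):
        if cw[j] == 0:  # assumption: ignore clusters with no training samples
            continue
        dist = sum(abs(z[r][c] - pi[r][j * tau + c] / cw[j])
                   for r in range(len(z)) for c in range(tau))
        if best_dist is None or dist < best_dist:
            best, best_dist = j, dist
    if best is None:
        raise ValueError("no populated cluster")
    return cl[best] / cw[best]


# Toy data: 2 features, 2 time steps, 2 maliciousness levels.
train = [(0.1, [(0, 0)]), (0.2, [(0, 0), (1, 1)]), (0.9, [(1, 1)])]
pi, cw, cl = train_probabilistic(train, n_features=2, tau=2, levels=2)
estimate = detect_pa([[1, 0], [0, 0]], pi, cw, cl, tau=2)
```

Compared with the binary clustering image, a pixel that only one of many cluster members set contributes a small value instead of a full 1, so rare behaviors weigh less in the distance.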

4.3.3. Deep learning approach


Although clustering and probabilistic methods can provide valuable information to malware analysts, these methods do not
examine or link the different situations. For example, consider malware that uses feature 𝜙1 after 1 s of monitoring, and this
malware is available in the database used to train the malware detection system. Suppose again that the same malware is present in
the test dataset and uses 𝜙1 after 2 s. In this case, the previous two methods check $CI_{\phi_1,2}$ and $PI_{\phi_1,2}$ to assess
the maliciousness level, while there is no valuable information in such positions. In other words, the previous methods lose their
effectiveness against dynamic feature delays. To solve this problem, deep learning-based methods seem to be a good solution. For
example, the pooling techniques of Convolutional Neural Networks (CNNs) take adjacent pixels into account. Therefore, this method
converts the samples to images with black and white pixels, stores them according to their maliciousness levels in folders with
labels 1, … , 𝓁, and examines them with a CNN. The details of the three methods are integrated in Algorithm 1.
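
The effect of a one-step feature delay can be illustrated with a small sketch. It is not the CNN used in the paper: it only applies max pooling along the time axis to show why pooled representations are less sensitive to small shifts than the pixel-wise comparison. The function names, window size, and toy images are assumptions.

```python
from typing import List

Matrix = List[List[int]]


def max_pool_time(z: Matrix, window: int) -> Matrix:
    """Max-pool each row of Z over non-overlapping windows along the time axis."""
    return [[max(row[s:s + window]) for s in range(0, len(row), window)]
            for row in z]


def l1(a: Matrix, b: Matrix) -> int:
    """Pixel-wise L1 distance, the comparison used by the clustering approach."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


train_img = [[1, 0, 0, 0]]  # feature invoked in the first time slot
test_img = [[0, 1, 0, 0]]   # same feature, one slot later

raw_distance = l1(train_img, test_img)
pooled_distance = l1(max_pool_time(train_img, 2), max_pool_time(test_img, 2))
```

The raw images differ in two pixels, whereas the pooled images are identical. The delay is absorbed only when both positions fall inside the same pooling window; stacking convolution and pooling layers, as CNNs do, widens that tolerance.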

Algorithm 1 Proposed Method


Required matrices for 𝓁 maliciousness levels:
1: Set a zero 𝑛𝜙 by 𝜏 × 𝓁 matrix as 𝐶𝐼, for clustering image.
2: Set a zero 𝑛𝜙 by 𝜏 × 𝓁 matrix as 𝑃 𝐼, for Probabilistic image.
3: Create a zero 1 by 𝓁 matrix as 𝐶𝑊 , for the weight of clusters.
4: Create a zero 1 by 𝓁 matrix as 𝐶𝐿, for the label of clusters.
5: Create 𝓁 empty folders named 1, ⋯ , 𝓁, for storing deep learning images.
6: for 𝛯 ∈ Trainset do
7: Set the label of 𝛯 as 𝜉.
8: Set the Target Cluster TC as: 𝑇 𝐶 = ⌊𝓁 × 𝜉⌋ + 1.
9: 𝐶𝑊1,𝑇 𝐶 ← 𝐶𝑊1,𝑇 𝐶 + 1
10: 𝐶𝐿1,𝑇 𝐶 ← 𝐶𝐿1,𝑇 𝐶 + 𝜉
11: Set a zero 𝑛𝜙 by 𝜏 matrix as 𝑍.
12: for (𝜙𝑖 ∶ 𝜏𝑖 ) ∈ 𝛯 do
Time series to image conversion:
13: 𝑍𝜙𝑖 ,𝜏𝑖 ← 1
Clustering image computation:
14: 𝐶𝐼𝜙𝑖 ,(𝑇 𝐶−1)×𝜏+𝜏𝑖 ← 1
Probabilistic image computation:
15: 𝑃 𝐼𝜙𝑖 ,(𝑇 𝐶−1)×𝜏+𝜏𝑖 ← 𝑃 𝐼𝜙𝑖 ,(𝑇 𝐶−1)×𝜏+𝜏𝑖 + 1
16: end for
Image generation for deep learning detection:
17: Save image form of 𝑍 in folder TC.
18: end for
19: for 𝛯 ∈ Testset do
20: Set a zero 𝑛𝜙 by 𝜏 matrix as 𝑍.
21: for (𝜙𝑖 ∶ 𝜏𝑖 ) ∈ 𝛯 do
22: 𝑍𝜙𝑖 ,𝜏𝑖 ← 1
23: end for
Cluster-based Approach (CA):
∑𝑛𝜙 ∑𝜏
24: Compute 𝑖 that minimises 𝑟=1 𝑐=1 |𝑍𝑟,𝑐 − 𝐶𝐼𝑟,(𝑖−1)×𝜏+𝑐 |
𝐶𝐿1,𝑖
25: Return 𝐶𝑊1,𝑖
as the estimated maliciousness level 𝜉 ′ .
Probability-based Approach (PA):
26: Compute 𝑗 that minimises:
∑𝑛𝜙 ∑𝜏 𝑃 𝐼𝑟,(𝑗−1)×𝜏+𝑐
𝑟=1 𝑐=1
|𝑍𝑟,𝑐 − 𝐶𝐼𝑟,(𝑗−1)×𝜏+𝑐 × 𝐶𝑊1,𝑗
|
𝐶𝐿1,𝑗
27: Return 𝐶𝑊1,𝑗
as the estimated maliciousness level 𝜉 ′ .
Deep learning-based Approach (DA):
28: Find the returned image of Convolution Neural Network (CCN) from folders 1, ⋯ , 𝓁.
29: Return the label of the returned image as the estimated maliciousness level 𝜉 ′ .
30: end for
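
The training loop and the cluster-based (CA) and probability-based (PA) estimators of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation (the paper reports using MATLAB): indices are zero-based, so `tc` is TC − 1, a sample is assumed to be a pair `(xi, events)` with `events` a list of `(phi, t)` index pairs, empty clusters are skipped (an assumption not stated in the algorithm), and the names `train`, `estimate`, and `to_image` are hypothetical. The deep learning branch is omitted.

```python
import math

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def to_image(events, n_phi, tau):
    """Binary n_phi-by-tau image Z: Z[phi][t] = 1 for every (phi, t) pair."""
    Z = zeros(n_phi, tau)
    for phi, t in events:
        Z[phi][t] = 1.0
    return Z

def train(trainset, n_phi, tau, levels):
    """Lines 1-18 of Algorithm 1 without the image folders."""
    CI = zeros(n_phi, tau * levels)   # clustering image
    PI = zeros(n_phi, tau * levels)   # probabilistic image (pixel counts)
    CW = [0] * levels                 # number of samples per cluster
    CL = [0.0] * levels               # sum of labels per cluster
    for xi, events in trainset:
        tc = math.floor(levels * xi)  # zero-based target cluster (TC - 1)
        CW[tc] += 1
        CL[tc] += xi
        for phi, t in events:
            CI[phi][tc * tau + t] = 1.0
            PI[phi][tc * tau + t] += 1.0
    return CI, PI, CW, CL

def estimate(Z, model, tau, probabilistic=False):
    """Lines 24-27: pick the cluster block closest to Z, return CL / CW."""
    CI, PI, CW, CL = model
    best, best_cost = None, math.inf
    for i, weight in enumerate(CW):
        if weight == 0:
            continue                  # assumption: ignore empty clusters
        cost = 0.0
        for r, row in enumerate(Z):
            for c, z in enumerate(row):
                ref = CI[r][i * tau + c]
                if probabilistic:
                    ref *= PI[r][i * tau + c] / weight
                cost += abs(z - ref)
        if cost < best_cost:
            best, best_cost = i, cost
    return CL[best] / CW[best]

# Toy data (made up): two features, two time steps, two levels.
demo_model = train([(0.2, [(0, 0)]), (0.8, [(1, 1)])], n_phi=2, tau=2, levels=2)
demo_Z = to_image([(1, 1)], n_phi=2, tau=2)
demo_ca = estimate(demo_Z, demo_model, tau=2)
demo_pa = estimate(demo_Z, demo_model, tau=2, probabilistic=True)
```

In this toy case both estimators select the second cluster, because the test image matches its block exactly.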

4.4. Space & time complexity

The proposed method stores several matrices and images, and the sum of their dimensions determines the space complexity.
The matrices stored in algorithm 1 are 𝐶𝐼, 𝑃 𝐼, 𝐶𝑊 , 𝐶𝐿 and 𝑍, whose dimensions are 𝑛𝜙 × 𝜏 × 𝓁, 𝑛𝜙 × 𝜏 × 𝓁, 𝓁, 𝓁, and 𝑛𝜙 × 𝜏,
respectively. In addition, the algorithm stores 𝑁𝑇𝑅 images in different folders, each of which has dimensions 𝑛𝜙 × 𝜏. Therefore, the
space complexity of algorithm 1 is 2 × 𝓁 × (𝑛𝜙 × 𝜏 + 1) + (𝑁𝑇𝑅 + 1) × (𝑛𝜙 × 𝜏). Similarly, the time complexity of algorithm 1 can be
calculated. The training part of the algorithm consists of two loops in lines 6 and 12. So the time complexity of this part is equal to
𝑁𝑇 𝐸 × 𝑆𝛯 , where 𝑆𝛯 is the size of 𝛯. At the worst case, 𝛯 includes 𝑛𝜙 × 𝜏 elements. So, 𝑆𝛯 ≤ 𝑛𝜙 × 𝜏. In the test section, there are
two loops in lines 19 and 21, and the complexity of lines 24, 26, and 28 are assigned to each approach. In line 24, the complexity
for calculating 𝑖 is 𝑛𝜙 × 𝜏 × 𝓁, and in line 26, the complexity of calculating 𝑗 is the same. Finally, the time complexity of line 28
depends on the complexity of the used deep learning method, which is assumed to be 𝑇𝐶𝑁𝑁 . As a result, the total space complexity
is as Eq. (3), and Eq. (4) represents the time complexity.
$O\left(\max\{\ell, N_{TR}\} \times n_\phi \times \tau\right)$ (3)
$O\left(N_{TE} \times \left(n_\phi \times \tau \times \ell + T_{CNN}\right)\right)$ (4)
The numerical values of these parameters are 𝓁 = 10, 𝑛𝜙 = 482, 𝜏 = 60, 𝑁𝑇𝑅 = 86,284, and 𝑁𝑇𝐸 = 21,572. Here 𝑁𝑇𝐸 × 𝑇𝐶𝑁𝑁, the time
required for training and testing the dataset using the CNN, is approximately 5318 s per epoch, so training the whole network for
100 epochs takes close to 148 h. The space complexity of this method is 𝑂(𝓁 × 𝑛𝜙 × 𝜏).
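
As a quick check of the figures above, the short sketch below plugs the reported parameter values into the space expression 2 × 𝓁 × (𝑛𝜙 × 𝜏 + 1) + (𝑁𝑇𝑅 + 1) × (𝑛𝜙 × 𝜏) and converts the per-epoch time into hours. The variable names are illustrative, and the result counts stored matrix elements, not bytes.

```python
levels, n_phi, tau = 10, 482, 60
n_train = 86_284
seconds_per_epoch, epochs = 5318, 100

image_cells = n_phi * tau  # elements in one n_phi-by-tau image
# CI and PI (2 * levels * image_cells), CW and CL (2 * levels),
# Z (image_cells), and N_TR stored images (n_train * image_cells).
stored_cells = 2 * levels * (image_cells + 1) + (n_train + 1) * image_cells
training_hours = seconds_per_epoch * epochs / 3600

print(image_cells, stored_cells, round(training_hours, 1))
```

The stored images dominate the total, which is consistent with the max{𝓁, 𝑁𝑇𝑅} factor in Eq. (3).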

5. Experimental evaluation

This section includes five subsections. The first describes the setup required to run the experiments. The next two introduce the
solutions used for comparison and the test metrics. The fourth is dedicated to experimental results, and the last reports the mean
time to detect.


Fig. 2. Clustering image.

Fig. 3. Histogram of maliciousness level.

5.1. Simulation setup

This section compares the proposed method with the state-of-the-art ones using a comprehensive database. The results in this
paper were obtained using the database introduced in [15], which contains more than 100 K malicious files. These malware samples
were obtained by analyzing millions of executable files by 52 antivirus vendors over four years and extracting 486 important features.
These features fall into the following four categories:

(1) Application Programming Interface (API) with 353 features, which are dedicated to accessing the basic functions of the
operating system.
(2) Directory with four features, which are dedicated to the global configuration of the operating system.
(3) File System with 11 features, which are dedicated to the operating system’s data organization.
(4) Miscellaneous with 118 features, which indicate an executable file’s risk level.

The dataset used is available in the UCI Machine Learning Repository [16]. Fig. 3 shows the distribution of the maliciousness level
𝜉. As this figure shows, there are more than 6 K malware samples with 0.74 < 𝜉 ≤ 0.76. Also, the Cumulative Distribution Function of 𝜉 is
depicted in Fig. 4, which shows that 100 − 27.67 = 72.33% of the analyzed malware is very dangerous, with 𝜉 > 0.5. All simulations in
this paper are performed in MATLAB R2017a on a computer with a 2.2 GHz i7-3632QM CPU and 8 GB RAM.

5.2. Comparison of solutions

In the previous section, three approaches to detect the maliciousness level of malware were proposed. This section introduces
two methods for comparison. In 2020, Taheri et al. [17] proposed a similarity-based method for detecting malware that is based on


Fig. 4. CDF of maliciousness level.

Fig. 5. Confusion Matrix. 𝓁𝑖 = True label, and 𝓁𝑗′ = Estimated label.

Hamming distance. They showed that the static features used in malware are so similar to each other that by examining the similarity
of the samples with pre-identified dataset samples, malware can be identified with an accuracy of over 99%. This method has a
better performance compared to fixed-size clustering, automatic clustering, neural networks, SVM, etc. For this reason, the method
proposed in this paper will be compared with First Nearest Neighbor (FNN), which has the least complexity among the methods
in [17]. This method is easy to implement: to calculate the maliciousness level of a test sample, it is enough to calculate its
distance from every member of the trainset. In other words, 𝜉′ = 𝜉𝑘, where 1 ≤ 𝑘 ≤ 𝑁𝑇𝑅 minimizes Eq. (5). Here, 𝜉𝑘 is the
label of the 𝑘th sample in the trainset with 𝑁𝑇𝑅 elements, 𝑍′ is the image form of the test sample, and 𝑍^𝑘 is the image form of the 𝑘th trainset sample.
$\sum_{i=1}^{n_\phi} \sum_{j=1}^{\tau} \left| Z'_{i,j} - Z^{k}_{i,j} \right|$ (5)
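
The first-nearest-neighbor rule built on Eq. (5) can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of [17]: images are plain nested lists of 0/1 values, ties are resolved in favour of the earliest trainset sample, and the function names are hypothetical.

```python
def hamming(a, b):
    """Eq. (5): number of pixels in which two binary images differ."""
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def first_nearest_neighbor(test_image, train_images, train_labels):
    """Return the label of the trainset image closest to the test image."""
    k = min(range(len(train_images)),
            key=lambda idx: hamming(test_image, train_images[idx]))
    return train_labels[k]

# Toy data (made up): three 2-by-3 training images with labels.
fnn_images = [
    [[1, 0, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1]],
]
fnn_labels = [0.15, 0.55, 0.95]
fnn_guess = first_nearest_neighbor([[0, 1, 0], [0, 0, 0]], fnn_images, fnn_labels)
```

Each query scans the whole trainset, so the cost per test sample grows linearly with 𝑁𝑇𝑅.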

In 2021, Ritter and Urcid [18] published a fascinating study on lattice algebra that deals with the applications of lattice algebra
in image processing, pattern recognition, and artificial intelligence. As a second method to compare with the proposed method,
lattice-based associative memory (LAM) is used, which has a stronger mathematical background and much higher efficiency than
other methods such as Hopfield Memory. This method integrates the trainset instances into an 𝑛𝜙-by-𝜏 image LAM as in Eq. (6),
and uses Eq. (7) to estimate the maliciousness level 𝜉′.
$LAM_{i,j} = \max\{\xi_k - Z^{k}_{i,j}\} : 1 \le k \le N_{TR}.$ (6)

$\xi' = \min\{Z'_{i,j} + LAM_{i,j}\} : 1 \le i \le n_\phi,\ 1 \le j \le \tau.$ (7)
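
Eqs. (6) and (7) translate almost directly into code. The sketch below is an illustration only, with hypothetical function names and made-up toy data; it is not taken from [18].

```python
def build_lam(train_images, train_labels):
    """Eq. (6): LAM[i][j] = max over k of (xi_k - Z^k[i][j])."""
    n_phi, tau = len(train_images[0]), len(train_images[0][0])
    return [[max(xi - Z[i][j] for Z, xi in zip(train_images, train_labels))
             for j in range(tau)]
            for i in range(n_phi)]

def lam_estimate(test_image, lam):
    """Eq. (7): xi' = min over (i, j) of (Z'[i][j] + LAM[i][j])."""
    return min(z + m
               for row_z, row_m in zip(test_image, lam)
               for z, m in zip(row_z, row_m))

# Toy data (made up): two 2-by-2 training images.
lam_images = [[[1, 0], [0, 0]], [[0, 1], [0, 0]]]
lam_labels = [0.9, 0.3]
lam_memory = build_lam(lam_images, lam_labels)
lam_guess = lam_estimate([[0, 0], [0, 1]], lam_memory)
```

In this toy memory every pixel except the top-left one stores the largest label, so a query that leaves any of those pixels at zero cannot return more than 0.9.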

5.3. Test metrics

The most well-known evaluation criterion is mean cumulative absolute error (MCAE), which is shown in Eq. (8). This criterion
shows the average difference between the actual level of maliciousness and the estimated level.
$\mathrm{MCAE} = \frac{\sum_{i=1}^{n} \left| \xi_i - \xi'_i \right|}{n}$ (8)
In Eq. (8), 𝑛 represents the size of the test set, 𝜉𝑖 represents the actual maliciousness level of the 𝑖th test, and 𝜉𝑖′ represents the
estimated one.
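
Eq. (8) amounts to a mean of absolute differences. A minimal sketch, with a hypothetical function name and made-up values, is:

```python
def mcae(true_levels, estimated_levels):
    """Eq. (8): average absolute gap between actual and estimated levels."""
    gaps = [abs(t - e) for t, e in zip(true_levels, estimated_levels)]
    return sum(gaps) / len(gaps)

# Toy values (made up): gaps of 0.1, 0.0 and 0.4.
mcae_demo = mcae([0.2, 0.8, 0.5], [0.3, 0.8, 0.1])
```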
The next section shows that the MCAE is not sufficient to demonstrate the performance of the proposed method, and it is easy
to suggest methods that have a low MCAE but are not suitable for practical purposes and provide unreliable results. For this reason,
MCAE and the Confusion Matrix are used together in evaluating the proposed method. This matrix displays valuable information
about the proposed method’s performance, and its structure is shown in Fig. 5. In the 𝑖th row and 𝑗th column of the confusion matrix, there
is 𝜆𝑖,𝑗 , which means that in the evaluation process, there were 𝜆𝑖,𝑗 samples that had the true label 𝓁𝑖 , while the proposed method


Fig. 6. Confusion Matrix important regions for test metrics.

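Table 2
Test metrics for the 𝑖th label.

Metric        Region-based formula                   Confusion matrix-based formula
Accuracy      $\frac{R_1+R_4}{R_1+R_2+R_3+R_4}$      $\frac{\lambda_{i,i} + \sum_{r=1,r\neq i}^{\ell} \sum_{c=1,c\neq i}^{\ell} \lambda_{r,c}}{\sum_{r=1}^{\ell} \sum_{c=1}^{\ell} \lambda_{r,c}}$
Error Rate    $\frac{R_2+R_3}{R_1+R_2+R_3+R_4}$      $\frac{\sum_{j=1,j\neq i}^{\ell} (\lambda_{i,j} + \lambda_{j,i})}{\sum_{r=1}^{\ell} \sum_{c=1}^{\ell} \lambda_{r,c}}$
Precision     $\frac{R_1}{R_1+R_3}$                  $\frac{\lambda_{i,i}}{\lambda_{i,i} + \sum_{c=1}^{\ell} \lambda_{i,c}}$
Recall        $\frac{R_1}{R_1+R_4}$                  $\frac{\lambda_{i,i}}{\lambda_{i,i} + \sum_{r=1,r\neq i}^{\ell} \sum_{c=1,c\neq i}^{\ell} \lambda_{r,c}}$
F1-Score      $\frac{2R_1}{2R_1+R_3+R_4}$            $\frac{2\lambda_{i,i}}{2\lambda_{i,i} + \sum_{r=1,r\neq i}^{\ell} \sum_{c=1}^{\ell} \lambda_{r,c}}$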

Fig. 7. MCAE of clustering based approach.

estimates label 𝓁𝑗′. In evaluating the proposed method, the maliciousness levels are divided into 10 parts, and if (𝑖−1)∕10 ≤ 𝜉𝑖 < 𝑖∕10,
𝓁𝑖 = 𝑖 is considered, where 1 ≤ 𝑖 ≤ 10. The same holds for 𝓁𝑗′.
The performance evaluation criteria in the confusion matrix are based on the four numbers shown in Fig. 6. To calculate these
numbers for a target label, the confusion matrix is divided into four regions, represented by 𝑅1 , 𝑅2 , 𝑅3 , and 𝑅4 , which show True
Positive, True Negative, False Positive, and False Negative, respectively. Using these values, different test metrics can be obtained
that show the performance of the proposed method well. Some of these metrics are summarized in Table 2.
In addition to the metrics above, this section introduces two other metrics. A malware detection method
is called conservative when Eq. (9) holds (𝑅6 ≤ 𝑅5); otherwise, it is called a loose detector. Here, 𝑅5∕𝑅6 is called the
conservativeness ratio. Finally, the total accuracy of an algorithm can be expressed by Eq. (10).
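$\frac{R_5}{R_6} = \frac{\sum_{r=1}^{\ell} \sum_{c=r+1}^{\ell} \lambda_{r,c}}{\sum_{r=1}^{\ell} \sum_{c=1}^{r-1} \lambda_{r,c}} \ge 1$ (9)

$\frac{R_7}{R_5 + R_6 + R_7} = \frac{\sum_{i=1}^{\ell} \lambda_{i,i}}{\sum_{r=1}^{\ell} \sum_{c=1}^{\ell} \lambda_{r,c}}$ (10)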
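
The label binning and Eqs. (9) and (10) can be sketched as follows. Rows hold true labels and columns hold estimated labels, as in Fig. 5. Several details are assumptions made for this illustration: labels are zero-based (bin 𝑖 − 1 stands for 𝓁𝑖 = 𝑖), a level of exactly 1.0 is placed in the last bin, an empty 𝑅6 region is reported as an infinite ratio, and all function names are hypothetical.

```python
import math

def label_of(xi, levels=10):
    """Zero-based bin of a maliciousness level; 1.0 goes to the last bin."""
    return min(math.floor(xi * levels), levels - 1)

def confusion_matrix(true_levels, estimated_levels, levels=10):
    lam = [[0] * levels for _ in range(levels)]
    for t, e in zip(true_levels, estimated_levels):
        lam[label_of(t, levels)][label_of(e, levels)] += 1
    return lam

def conservativeness_ratio(lam):
    """Eq. (9): R5 (estimate above truth) over R6 (estimate below truth)."""
    r5 = sum(v for r, row in enumerate(lam) for c, v in enumerate(row) if c > r)
    r6 = sum(v for r, row in enumerate(lam) for c, v in enumerate(row) if c < r)
    return math.inf if r6 == 0 else r5 / r6

def total_accuracy(lam):
    """Eq. (10): diagonal weight R7 over the weight of the whole matrix."""
    r7 = sum(lam[i][i] for i in range(len(lam)))
    return r7 / sum(sum(row) for row in lam)

# Toy data (made up): two over-estimates, one exact bin, one under-estimate.
cm_demo = confusion_matrix([0.15, 0.25, 0.85, 0.55], [0.35, 0.25, 0.45, 0.95])
cm_ratio = conservativeness_ratio(cm_demo)
cm_accuracy = total_accuracy(cm_demo)
```

A ratio of at least 1 marks the detector as conservative in the sense of Eq. (9).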
In the next part, the methods introduced in this article are evaluated using the metrics, and their pros and cons are examined.

5.4. Experimental results

In this section, the proposed methods are evaluated, and the dataset is split into two parts. The first one is for training, and the
second, which constitutes 20% of the data, is used to evaluate. The result of the first experiment is summarized in Fig. 7, which


Fig. 8. Confusion Matrix of clustering-based approach.

Fig. 9. MCAE of similarity based approach.

Fig. 10. Confusion Matrix of similarity-based approach.

deals with the MCAE of the clustering-based approach. This figure shows that in the first 60 evaluation samples, the average error has
an upward trend. It then gradually decreases, so that after evaluating 400 samples, the average error does not change
much. The average error after evaluating 21,572 samples is 41.59%, which is a large error.
Fig. 8 shows the confusion matrix of this experiment. As this figure shows, the weight of 𝑅6 is much greater than that of 𝑅5, which means that
the clustering approach is a loose method. In this case, the probability that the estimated maliciousness level is less than the true
one is 5/6 = 83.33%, and its total accuracy is only 7.31%.
Fig. 9 shows the evaluation results of the similarity-based approach under the same conditions as before, while the error trend
is the opposite of the previous approach. The average error trend in this figure is downward and eventually tends to 26.36%, which
improves the previous error by 37%.
Checking the confusion matrix gives a similar result. As Fig. 10 shows, the weight of 𝑅5 is much greater than that of 𝑅6, meaning that the
similarity approach is conservative and is more likely to identify low-risk malware as high-risk. In other words, 𝑃(𝜉′ ≥ 𝜉) = 80.67%.


Fig. 11. MCAE of probability based approach.

Fig. 12. Confusion Matrix of probability-based approach.

Examination of the Confusion Matrix of the previous two experiments shows a result that cannot be deduced from their MCAE
comparisons, as expressed by 𝑅5 and 𝑅6 regions. Fig. 10 shows another result that can be used to improve the MCAE of the proposed
methods. As Fig. 10 shows, the weights of columns 2 and 4 are zero, which means that no 𝜉 ′ ∈ [0.1, 0.2) ∪ [0.3, 0.4) is estimated for
a similarity-based approach. Similarly, the weight of 6th column is 4, and the weight of 3rd, 5th, and 10th columns is one. These
results show that the variety of malware with maliciousness levels 𝜉 < 0.1 and 0.6 ≤ 𝜉 < 0.9 is so great that there is always a sample
among them that bears the closest similarity to the samples being evaluated. The total accuracy for the similarity-based approach
is 15.79%, which is much better than the clustering approach. Figs. 11 and 12 show the results of the probabilistic approach
evaluation. The general trend of MCAE in this approach is similar to the clustering method, while its confusion matrix behavior
follows the similarity-based method. The average error of this method fluctuates per 1000 evaluated samples; later, it does not
change significantly and eventually converges to 17.6%, which has less error than the previous two approaches. As Fig. 12 shows,
the confusion matrix of this approach has a large number of zero columns, which was also observed in the similarity-based one. In
this figure, there are only three non-zero columns: 𝐶5 , 𝐶8 , and 𝐶9 , where 𝐶8 has more weight. This figure shows that with 97.67%
probability, the probabilistic approach identifies all evaluated malware as malware with maliciousness level 0.7 ≤ 𝜉 ′ < 0.8. The total
accuracy of this method is 25.51%, which is a much better performance than the previous two approaches.
The number of zero columns in the lattice-based confusion matrix reaches its peak, so that there is only one non-zero column
in this matrix, which means that the output of this method is always a constant maliciousness level 𝑐. By analyzing Eqs. (6) and (7),
the constant output of this method can be calculated. In the database used, $Z'_{i,j}, Z^{k}_{i,j} \in \{0, 1\}$ and 0 ≤ 𝜉 < 1. Assume that in Eq. (7),
𝐿𝐴𝑀𝑖,𝑗 is a fixed number 𝑐. In this case, since $Z'_{i,j} \in \{0, 1\}$ and 𝐿𝐴𝑀𝑖,𝑗 = 𝑐, 𝜉′ = min{0 + 𝑐, 1 + 𝑐} = 𝑐, which means that the output
of this method is always equal to 𝑐. In evaluating the lattice-based approach, the method returns 𝑐 = 0.9828. Now the question is:
if a particular method, such as the lattice-based approach, always provides a fixed output 𝑐, is this output always the best possible
output that keeps the error to a minimum?
Fig. 13 shows the MCAE for 10000 fixed maliciousness levels. As this figure shows, for 𝑐 = 0, the lattice-based method is a loose
approach that identifies all dangerous malware as harmless. The MCAE is over 60%, which is very high. As 𝑐 increases, the mean
error decreases so that at 0.6613 ≤ 𝑐 ≤ 0.6697 the mean error reaches the lowest possible value of 17.1%. As shown in Fig. 4, 72.33%
of the evaluated malware is high-risk, and to reduce the average error in the presence of a constant maliciousness level, the samples
should be estimated at 𝑐 > 0.5, as Fig. 13 confirms.
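
The scan over constant outputs can be reproduced on a small scale. The sketch below evaluates the MCAE of a constant estimate 𝑐 on a grid and compares the best 𝑐 with the median of the labels, which is the known minimizer of the mean absolute error for a constant predictor. The five labels are made up for illustration and are not taken from the dataset, and the function name is hypothetical.

```python
import statistics

def mcae_constant(levels, c):
    """MCAE when every sample is estimated as the same level c."""
    return sum(abs(xi - c) for xi in levels) / len(levels)

# Toy labels (made up), skewed towards high maliciousness levels.
const_levels = [0.1, 0.3, 0.7, 0.75, 0.9]
const_grid = [k / 100 for k in range(101)]
const_best = min(const_grid, key=lambda c: mcae_constant(const_levels, c))
const_median = statistics.median(const_levels)
```

With an odd number of distinct labels the minimizer is unique; with an even number, every 𝑐 between the two middle labels gives the same error, which matches the flat interval of optimal 𝑐 values reported above.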
The results of the last experiment are provided in Figs. 14 and 15. As Fig. 14 shows, in the first 100 tests, the average error
has an upward trend, followed by a downward trend. The downtrend eventually converges to an average error of 11.1%, which is


Fig. 13. MCAE of single level-based approach.

Fig. 14. MCAE of deep learning based approach.

Fig. 15. Confusion Matrix of deep learning based approach.

better than all previous methods. Confusion matrix analysis in the deep learning method also indicates the high accuracy of this
method compared to other methods. As Fig. 15 shows, the weight of region 𝑅7 is higher than other methods, which means that
this method has the highest total accuracy of 45.1%. All methods in this section have been compared using different test metrics in
Table 3, except for the error rate, which is equal to (1 - accuracy).
The bold values in Table 3 represent the best results for each class, which are assigned to the estimated maliciousness levels 𝜉 ′ ’s,
and ‘‘-’’ represents the undefined test metric. For example, in the similarity-based method 𝑅1 , 𝑅3 = 0, which makes the precision
at 0.1 ≤ 𝜉 ′ < 0.2 equal to (0 + 0)∕0. As discussed earlier, the presence of zero column 𝐶𝑖 in the confusion matrix of an algorithm
indicates that this algorithm is unable to generate output 𝜉𝑖′ , and this appears ‘‘-’’ in the precision column corresponding to the 𝑖th
estimated label.

Table 3
Accuracy, Precision, Recall, F1-Score, MCAE, Conservativeness Ratio, and Total Accuracy values for evaluated dataset (Acc = accuracy; Pre = precision; Rec = recall; F1-S = F1-Score).

Estimated label         Clustering                    Similarity-based [17]         Probabilistic                 Lattice-based [18]            Deep Learning
                        Acc    Pre    Rec    F1-S     Acc    Pre    Rec    F1-S     Acc    Pre    Rec    F1-S     Acc    Pre    Rec    F1-S     Acc    Pre    Rec    F1-S
0.0 ≤ 𝜉′ < 0.1          0.518  0.046  0.044  0.044    0.855  0.04   0.006  0.01     0.974  -      0      0        0.974  -      0      0        0.974  0.495  0.013  0.026
0.1 ≤ 𝜉′ < 0.2          0.864  0.047  0.007  0.012    0.974  -      0      0        0.974  -      0      0        0.974  -      0      0        0.903  0.345  0.015  0.028
0.2 ≤ 𝜉′ < 0.3          0.871  0.056  0.006  0.011    0.954  0      0      0        0.954  -      0      0        0.954  -      0      0        0.954  0.33   0.008  0.016
0.3 ≤ 𝜉′ < 0.4          0.858  0.077  0.007  0.012    0.921  -      0      0        0.921  -      0      0        0.921  -      0      0        0.902  0.4    0.037  0.068
0.4 ≤ 𝜉′ < 0.5          0.81   0.038  0.005  0.009    0.91   0      0      0        0.91   -      0      0        0.91   -      0      0        0.904  0.416  0.029  0.055
0.5 ≤ 𝜉′ < 0.6          0.839  0.112  0.006  0.012    0.874  0.75   0      0        0.874  0.435  0.004  0.008    0.875  -      0      0        0.851  0.377  0.059  0.102
0.6 ≤ 𝜉′ < 0.7          0.824  0.238  0.006  0.011    0.752  0.153  0.024  0.042    0.835  -      0      0        0.835  -      0      0        0.819  0.457  0.105  0.170
0.7 ≤ 𝜉′ < 0.8          0.758  0.464  0.016  0.03     0.754  0.317  0.006  0.012    0.263  0.246  0.912  0.388    0.76   -      0      0        0.796  0.508  0.147  0.228
0.8 ≤ 𝜉′ < 0.9          0.841  0.533  0.009  0.018    0.362  0.176  0.36   0.236    0.847  0.742  0.013  0.026    0.84   -      0      0        0.834  0.488  0.11   0.18
0.9 ≤ 𝜉′ ≤ 1.0          0.964  0.4    0      0        0.964  1      0      0        0.964  -      0      0        0.063  0.036  1      0.07     0.964  0.492  0.013  0.025
Per class metric score  1      0      1      1        5      2      1      1        8      1      1      1        7      0      1      1        4      7      6      6
MCAE                    41.59%                        26.36%                        17.6%                         36.93%                        11.1%
Conservativeness Ratio  0.203                         3.3572                        3.012                         ∞                             1.2814
Total Accuracy          7.31%                         15.79%                        25.51%                        3.63%                         45.1%



In Table 3, a per class metric score is assigned to each approach, which shows the number of classes in which the approach obtains
the best result for each test metric. This score corresponds to the number of bold values in each column. For example, the
probabilistic approach has the best accuracy in every class except 𝜉′ ∈ [0.5, 0.6) ∪ [0.7, 0.8), which gives it a score of 8, the
best rank among all approaches. Note that the total accuracy of this method is 25.51%, which is in second rank. The results show that
except for the accuracy per class metric, deep learning is in the best position for all test metrics.

5.5. Mean time to detect

In this paper, we have also incorporated another factor during testing, the ‘‘mean time to detect’’ (MTTD). To measure
it, we randomly selected 100 samples from the test dataset and, for each one, determined the time it took the trained model to
detect the sample. We then calculated the average of these detection times. For the models trained in
this paper, the proposed methods give the following times:
MTTD (Clustering): 3.29 s
MTTD (Probabilistic): 1.14 s
MTTD (Deep Learning): 4.36 s
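
The measurement procedure described above can be sketched as follows. This is only an illustration of the averaging step: `detect` stands for any already trained detector and is a placeholder, the fixed seed is an assumption added for repeatability, and the toy detector below does nothing meaningful.

```python
import random
import time

def mean_time_to_detect(detect, test_samples, n_samples=100, seed=0):
    """Average wall-clock time of detect() over randomly chosen test samples."""
    rng = random.Random(seed)
    chosen = rng.sample(test_samples, min(n_samples, len(test_samples)))
    elapsed = []
    for sample in chosen:
        start = time.perf_counter()
        detect(sample)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Placeholder detector and data (made up), used only to exercise the timer.
mttd_samples = [[i % 2] * 50 for i in range(500)]
mttd_value = mean_time_to_detect(lambda sample: sum(sample), mttd_samples)
```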

6. Discussions

In the previous section, several methods for malware detection were analyzed, and the proposed method was compared with
similar ones. This section discusses other ideas.
In the previous section, the proposed solution was compared with the similarity-based approach, presented in [17]. In this
comparison, the samples were first converted into images, and then the method was applied to the obtained images. This approach
is for detecting malware using static features, while the proposed method is based on dynamic ones. For this reason, instead of
calculating the images using the proposed method, a vector corresponding to the extracted features can be used, and then the
similarity-based method can be applied. In this case, the feature call times are removed, and the resulting vector contains the
maliciousness level with the features shown in Eq. (11).

𝑋𝑖 = [𝜉, 𝜙1 ∶ 𝜏1 , … , 𝜙𝑛 ∶ 𝜏𝑛 ] → [𝜉, 𝜙1 , … , 𝜙𝑛 ] (11)


After this conversion, the similarity-based method can be applied to the calculated vector. Unfortunately, our implementation shows
that this method suffers more errors than the image-based method and therefore was not described in detail in the previous section.
The next point is about the clustering image shown in Fig. 2. As mentioned earlier, this image gives us much information about
malware. For example, the black lines in this image indicate that there are features in the dataset that do not affect any of the
clusters. Black columns, on the other hand, indicate that there are times when no feature is called. By removing these two black
areas, the dimensions of the images can be reduced. Our implementation shows that this makes the CNN 4 times faster. Unfortunately,
it also reduced the accuracy of the deep learning method by 4%, which is why it was not covered in detail in the previous section.
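
The dimension reduction described above can be sketched as follows. The sketch assumes that a row or column counts as black when it is zero in every training image, which is one possible reading of the black areas of the clustering image; images are nested lists of 0/1 values, and the function names are hypothetical.

```python
def used_rows_and_columns(images):
    """Indices of rows (features) and columns (times) set in at least one image."""
    n_phi, tau = len(images[0]), len(images[0][0])
    rows = [r for r in range(n_phi)
            if any(img[r][c] for img in images for c in range(tau))]
    cols = [c for c in range(tau)
            if any(img[r][c] for img in images for r in range(n_phi))]
    return rows, cols

def crop(image, rows, cols):
    """Keep only the given rows and columns of one image."""
    return [[image[r][c] for c in cols] for r in rows]

# Toy data (made up): the middle row and middle column are never used.
crop_images = [
    [[1, 0, 0], [0, 0, 0], [0, 0, 1]],
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],
]
crop_rows, crop_cols = used_rows_and_columns(crop_images)
crop_first = crop(crop_images[0], crop_rows, crop_cols)
```

The same `rows` and `cols` would have to be applied to test images so that training and test inputs keep matching dimensions.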
The next point is to analyze the results of the lattice-based method. As stated in the previous section, the output of this method
is always 𝑐 = 0.9828. The reader may ask what this number is and why labels are always estimated with it. It is the
largest label in our train set. To prove this, suppose our dataset has 𝑛𝑚𝑎𝑥 samples with the largest maliciousness level 𝑐. The image
form of the first instance can be represented as a binary matrix $Z^{1}$. Using Eqs. (6) and (7), we can deduce the result in Eq. (12),
which shows that a large number of zero elements in $Z^{1}$ increases the probability of convergence to 𝑐.
$Z^{1}_{i,j} = 0 \rightarrow LAM_{i,j} = c$ (12)

Now consider the second train instance $Z^{2}$ with label 𝑐. Among the positions in which $Z^{1}$ contains 1, $Z^{2}$ may be zero. Similar to
the previous case, $Z^{2}_{i',j'} = 0$ makes $LAM_{i',j'} = c$. This continues until Eq. (13) is established.

$\exists\, 1 \le i \le n_\phi,\ 1 \le j \le \tau,\ 1 \le k \le n_{max} \ \text{s.t.}\ Z^{k}_{i,j} = 0$ (13)

Note that the binary matrices are sparse, with many zeros. For example, the size of $Z^{1}$ corresponding to the maliciousness level
0.9828 in our dataset is 482 × 60 = 28920, while it has only 17 non-zero elements, meaning that at least 99.94% of 𝐿𝐴𝑀’s elements
are 𝑐. Because 𝑐 − 1 is a negative number, the remaining 17 elements will probably converge to the second largest number in the
dataset, 0.9815. This analysis shows that, at best, the evaluated samples are compared with a maximum of 18 maliciousness levels.
Although this observation reduces the performance of the lattice-based method in detecting malware in this paper, it also has some
interesting applications, which are discussed in [18]. Again, see the clustering image corresponding to the 8 clusters in Fig. 2. In this
image, there are pixels that are common to all clusters. These points will probably impose additional overhead on the system. Our
focus is on pixels in only one cluster, which can lead to the emergence of interesting ideas. These points are referred to as kernels
in the lattice-based approach. Fig. 16 shows the number of kernels in each cluster. In the kernel-based method, instead of using all
trainset samples, only the samples containing kernels can be used, which increases efficiency. Implementing this method improves
the accuracy of the lattice-based approach from 3.6% to 7.65% while still not being competitive with other compared methods.
Also, the conservativeness ratio of the lattice-based method was ∞, while in the kernel-based approach, it is 0.0178, meaning that
the kernel-based approach is a loose method that treats high-risk samples as low-risk ones.

7. Conclusion and future directions

This paper introduces a method to detect online malware. The proposed method monitors the behavior of malware for a period
of time and extracts the dynamic features during this period in the form of a vector. Due to the fact that we do not have much


Fig. 16. Number of kernels in different clusters.

storage and processing space in IoT devices, the proposed method transforms the extracted vector into a sparse binary
image, and clustering, similarity-based, probabilistic, lattice-based, and deep learning approaches are introduced for analyzing the
generated image dataset. The results showed that the mean error of the clustering method is more than 40%, which puts this method
in the worst rank, while deep learning, with an error of 11.1%, is in the best position. The evaluated methods were
compared using 8 different metrics, among which the probabilistic method was in the best position in terms of the accuracy per class
metric, and deep learning, with a total accuracy of 45.1%, surpassed the others in the remaining 7 metrics.
This paper provides valuable results for future research. For example, we have shown that MCAE is not the only effective
parameter in the performance of methods, and if regions 𝑅1 and 𝑅3 in the Confusion Matrix of a target class are equal to zero,
there will be a zero column in this matrix that makes the precision metric calculation impossible. This result can be used to suggest a
multi-objective optimization-based learning method.
The results of this article can be used in other areas, such as label flipping in adversarial machine learning. For example, we
showed that the lattice-based approach always returns the maximum maliciousness level 𝑐 as output for the used dataset. In this
case, changing the label 𝑐 ′ < 𝑐 will not change the performance of this method. In addition, the conservativeness ratio can be
used to identify vulnerable methods against adversarial targets. If this rate is too high, the probability of identifying dangerous
samples increases, while if the rate is less than 1, this method returns smaller maliciousness levels as output, and this can encourage
adversaries to produce malware with high maliciousness levels.

Declaration of competing interest

We declare that there are no conflicts of interest related to this submission.

Data availability

Data will be made available on request.

References

[1] R. Yumlembam, B. Issac, S.M. Jacob, L. Yang, Iot-based android malware detection using graph neural network with adversarial defense, IEEE Internet
Things J. (2022).
[2] Y. Wu, J. Shi, P. Wang, D. Zeng, C. Sun, DeepCatra: Learning flow-and graph-based behaviours for Android malware detection, IET Inf. Secur. 17 (1)
(2023) 118–130.
[3] X. Hu, C. Zhu, G. Cheng, R. Li, H. Wu, J. Gong, A deep subdomain adaptation network with attention mechanism for malware variant traffic identification
at an IoT edge gateway, IEEE Internet Things J. (2022).
[4] M. Mimura, Y. Tajiri, Static detection of malicious PowerShell based on word embeddings, Internet Things 15 (2021) 100404.
[5] S.K.J. Rizvi, W. Aslam, M. Shahzad, S. Saleem, M.M. Fraz, PROUD-MAL: Static analysis-based progressive framework for deep unsupervised malware
classification of windows portable executable, Complex Intell. Syst. 8 (1) (2022) 673–685.
[6] H. Cai, N. Meng, B. Ryder, D. Yao, Droidcat: Effective Android malware detection and categorization via app-level profiling, IEEE Trans. Inf. Forensics
Secur. 14 (6) (2018) 1455–1470.
[7] F. Abdullayeva, Malware detection in cloud computing using an image visualization technique, in: 2019 IEEE 13th International Conference on Application
of Information and Communication Technologies, AICT, IEEE, 2019, pp. 1–5.
[8] R.U. Khan, X. Zhang, R. Kumar, Analysis of ResNet and GoogleNet models for malware detection, J. Comput. Virol. Hacking Tech. 15 (1) (2019) 29–37.
[9] E.M. Karanja, S. Masupe, M.G. Jeffrey, Analysis of internet of things malware using image texture features and machine learning techniques, Internet
Things 9 (2020) 100153.
[10] S. Yajamanam, V.R.S. Selvin, F. Di Troia, M. Stamp, Deep learning versus gist descriptors for image-based malware classification, in: Icissp, 2018, pp.
553–561.
[11] Q. Le, O. Boydell, B. Mac Namee, M. Scanlon, Deep learning at the shallow end: Malware classification for non-domain experts, Digit. Investig. 26 (2018)
S118–S126.
[12] I. Baptista, Binary Visualisation for Malware Detection, University of Plymouth, 2018.
[13] D.-L. Vu, T.-K. Nguyen, T.V. Nguyen, T.N. Nguyen, F. Massacci, P.H. Phung, HIT4Mal: Hybrid image transformation for malware classification, Trans.
Emerg. Telecommun. Technol. 31 (11) (2020) e3789.


[14] S. Roy, E. Panaousis, C. Noakes, A. Laszka, S. Panda, G. Loukas, Sok: The MITRE att&ck framework in research and practice, 2023, arXiv preprint
arXiv:2304.07411.
[15] N.A. Huynh, W.K. Ng, K. Ariyapala, A new adaptive learning algorithm and its application to online malware detection, in: International Conference on
Discovery Science, Springer, 2017, pp. 18–32.
[16] D. Dua, C. Graff, UCI machine learning repository, 2017, URL http://archive.ics.uci.edu/ml.
[17] R. Taheri, M. Ghahramani, R. Javidan, M. Shojafar, Z. Pooranian, M. Conti, Similarity-based android malware detection using hamming distance of static
binary features, Future Gener. Comput. Syst. 105 (2020) 230–247.
[18] G.X. Ritter, G. Urcid, Introduction to lattice algebra: With applications in AI, pattern recognition, image analysis, and biomimetic neural networks, 2021.
