Scalable_malware_detection_system_using_big_data_a
https://doi.org/10.1007/s00500-021-06492-9
Abstract
Computers, the Internet, and smartphones have changed our lives as never before. Today, we cannot even imagine
life without these technologies. Almost everything around us is connected to and controlled by systems and
software, and software and mobile applications have become the nerve center of daily life. Our dependency on
this software and these systems is so great that it is scary to imagine what would happen if they failed at any
point in time. There is always a threat from various types of cyber-attacks, and cybercriminals evolve their
attack strategies every day. Cyber-attacks using ever-more sophisticated malware are a major cause of concern
for all types of users. The cyber-world has witnessed rapid changes in malware attack strategies in the recent
past. The volume, velocity, and complexity of malware pose new challenges for malware detection systems. A
scalable malware detection system capable of detecting complex attacks is the need of the hour. In this paper,
we propose a scalable malware detection system using big data and a machine learning approach. The machine
learning model proposed in the system is implemented using Apache Spark, which supports distributed learning.
Locality-sensitive hashing is used for malware detection, which significantly reduces the malware detection
time. A five-stage iterative process has been used to carry out the implementation and experimental analysis.
The proposed model achieved 99.8% accuracy. It also significantly reduced the learning and malware detection
time compared to models proposed by other researchers.
Keywords Malware · Big data · Machine learning · Static analysis · Dynamic analysis · Locality-sensitive hashing
scalability issues are some of the major challenges which need to be addressed.
High-speed networks and the increased number of users generate large numbers of suspect files to be scanned by
the malware detection system. As the number of sample files and the size of the signature database increase,
scalability and performance issues arise. Another major disadvantage of the traditional malware detection
system is that it needs continuous updates to detect a novel malware attack (Kolosnjaji et al. 2018; Masabo
et al. 2018; Poudyal et al. 2019).
One of the objectives of the research work proposed in the paper is to develop a system that can handle a large
number of files efficiently. The system should also evolve with time without manual intervention and detect
novel attacks (Wu et al. 2021). In such a situation, a big data system seems to be a feasible solution that can
deal with the storage, ingestion, and extraction activities and support the machine learning system to learn
and grow.
Various classical machine learning algorithms can be used for the malware detection system. However, not all
algorithms have the same performance. We need to identify the machine learning algorithm which supports fast
learning and provides high accuracy on the big-data-based platform. Hence, many classical algorithms (i.e.,
Decision Tree, Logistic Model, Random Forest, SVM, Naïve Bayes, Gradient Boosting, and KNN) have been
experimented with, and based on the best performance, the Random Forest machine learning algorithm has been
selected for the implementation in the proposed malware detection system. To reduce the malware detection
time, the Trend Micro Locality-Sensitive Hashing (TLSH) algorithm has been implemented along with the Random
Forest algorithm.
Many researchers have already experimented with machine learning algorithms for the malware detection system.
However, the implementation of a machine learning algorithm on a distributed parallel system and the
application of locality-sensitive hashing for malware detection are novel contributions of this paper.
At the beginning of the paper, Sect. 2 gives a short history of malware followed by malware analysis
techniques and challenges. Section 3 highlights the recent works done by other researchers on a similar topic.
Section 4 discusses the research objectives and methodologies, followed by Sects. 5 and 6, which briefly
discuss big data, machine learning, and its applications in the malware detection system. Section 7 discusses
in detail locality-sensitive hashing, which is used for malware detection in the proposed model. The
experimental setup is discussed in Sect. 8, followed by Sect. 9, which briefly discusses the experimental
results. Section 10 of the paper shows a comparative analysis of the proposed work with some of the most
recent and relevant works, followed by a conclusion in Sect. 11.

2 Malware

The history of malware is quite interesting. According to Scientific American, the idea of a computer virus
was born in 1949, when the well-known computer scientist John von Neumann published a paper, "Theory and
Organization of Complicated Automata." In this paper, von Neumann hypothesized how a computer program could
reproduce itself. In the 1950s, Bell Labs employees implemented von Neumann's vision and developed a game
called "Core Wars." In this game, programmers would unleash software "organisms" that competed for control of
the computer.
Computer viruses began to appear in the early 1970s. Historians often credit the "Creeper Worm," an
experimental self-replicating program written by Bob Thomas at BBN Technologies, with being the first virus.
However, the term virus was not introduced into computer science jargon until the mid-eighties. Fred Cohen
coined the term virus in his Ph.D. thesis in 1986. He defined a "virus" in a single sentence as: "A program
that can infect other programs by modifying them to include a, possibly evolved, version of itself."
There are various definitions of malware. Put simply, malware is any piece of code or software that is
intentionally developed with malicious intent. Malware is generally used as an umbrella term for many
different types of threats. It includes viruses, trojans, adware, riskware, etc. The objectives of malware may
be multifaceted, such as stealing data, damaging the device, etc. (Higuera 2020; Vinayakumar and Soman 2018;
Rathore et al. 2019).

2.1 Malware analysis techniques

Malware is evolving continuously with complex and stealthy approaches. New types of malware keep appearing
with more damaging characteristics than ever before. The evolving complexity of malware always poses a
detection challenge for the analyst. Attackers always find new and advanced techniques to escape detection.
This is where malware analysis plays a crucial role. The core objectives of malware analysis are to understand
the malware's approach to infection, the threat, the risk, and the associated damaging factors. It requires a
series of methods and techniques to analyze the malware. Better analysis can help to develop better defensive
techniques for the organization. There are various approaches (Fig. 1) for malware analysis, of which Static
Analysis and Dynamic Analysis are the most
common and well-known approaches (Agarkar and Ghosh 2020; Higuera 2020).

2.2 Static analysis

Static analysis is the process of analyzing the malware binary code without executing it. It is performed by
determining the signature of the binary file or by calculating the hash of the binary file. The binary file
can be reverse-engineered by disassembling the code, i.e., converting the machine-executable code into
assembly language code. The analyst can better understand the assembly language code and what it is
programmed to do.

2.3 Dynamic analysis

Unlike static analysis, dynamic analysis is a behavior-based analysis: it is performed by executing the
malware code and observing its behavior. Dynamic analysis is performed in a closed and isolated virtual
environment so that the malware can be analyzed thoroughly without affecting the system. Dynamic analysis is
performed to determine the functionality of the malware, which is generally difficult through any other
approach.

2.4 Malware analysis challenges

Malware analysis is a challenging task, and some of the common challenges are as follows:
(1) Most malware detection software is based on a signature-based approach. The detection software matches
the hash value of the suspected binary file against its signature database. The signature-based approach is
easy; however, it cannot detect a novel malware attack (Gupta 2019).
(2) Malware writers use code obfuscation to change the structure and pattern of the malware program. This
makes it difficult for the malware analyst to understand the code or reverse engineer it using any available
tools (Anderson et al. 2017).
(3) There is always a need to monitor live network data. However, live network monitoring has never been an
easy task. Monitoring petabytes and exabytes of live network data raises challenging scalability and
performance issues. The traditional malware detection system is not capable of processing big data (AlAhmadi
et al. 2018).
(4) Different types of devices connected to the network communicate data in different formats. The malware
detection system must be able to interpret heterogeneous data formats, which is not a simple job. Many
researchers have worked on this issue. However, it remains a challenging task (Burnap et al. 2018; Gupta and
Rani August 2018; Higuera 2020).
(5) Some static malware analysis and detection systems show promising results, but time complexity and delay
in detection are among the major challenges (Vinayakumar and Soman 2018).
A malware detection system using big data and a machine learning approach effectively handles the
above-mentioned challenges. Implementation of big data makes the malware detection system scalable. The
machine
learning approach makes the system evolve by learning from new attack patterns.

3 Related works

Malware analysis is one of the active research fields. We have gone through many research papers while doing
this research. Table 1 summarizes some of the work which is related to our proposed work.

4 Research objectives and methodology

4.1 Research objectives

The main research objectives of the work presented in this paper are as follows:
1. Fast Learning Model with High Accuracy: Machine learning is a time-consuming and resource-intensive
process. We wanted to experiment with a parallel machine learning process to reduce the learning time as well
as to achieve higher accuracy.
2. Fast and Scalable Malware Detection System: The objective behind making a scalable system is to satisfy
on-demand performance. A scalable system can be easily scaled up for higher performance based on
requirements. We have used locality-sensitive hashing for malware detection. Locality-sensitive hashing is
highly scalable and time-efficient.

4.2 Research methodology

In a machine learning experiment, the workflow is an iterative process that can be divided into five stages.
The experiment begins with data gathering, followed by cleaning and preprocessing the data, building the
machine learning model, validating the model, and finally deploying the model. The entire process is repeated
(Fig. 2) until the expected outcome is achieved.

5 Big data perspective

Malware collection and processing is always a critical and challenging job for most organizations in the
cybersecurity domain. The proliferation of the internet and exponential growth in the number of users generate
a flood of malware samples. Major anti-malware software development companies receive millions of samples a
day. It is difficult to collect, store, and analyze such a huge number of samples in a scalable manner.
Effectively combating and triaging malware is now a big data problem (Anderson et al. 2017; Paranthaman and
Thuraisingham 2017; Wassermann and Casas 2018).
The traditional malware detection system has several technical difficulties, which have to be addressed to
make the system effective against next-generation malware detection challenges. The contemporary malware
system is not good at handling large numbers of files or large files. It is also practically difficult to add
storage and processing capacity dynamically based on demand (Oliveira 2019; Paola et al. 2020; Ullah and Babar
2019). The system is not flexible enough for dynamic changes in programming scripts or database schemas. Lack
of fault-tolerant capabilities is also a major cause of concern.
The traditional approach of static malware analysis is not scalable and efficient enough. High-speed networks
and the increased number of users pose scalability challenges, as the delay in analysis and detection can
severely affect the overall performance. Big data and a machine learning approach can greatly improve the
efficiency of the malware detection system. They make the overall system highly scalable and improve
performance (Kolosnjaji et al. 2018; Masabo et al. 2018; Poudyal et al. 2019).
Big data deals with storage, ingestion, and extraction activities. Big data, as the name suggests, helps the
system to cope with the large volume of data for analysis, discovery of hidden patterns, and information
extraction. In short, big data supports the analysis of a large volume of data.
Machine learning helps the system to learn and take predictive decisions without human intervention. With the
help of machine learning, a software application can learn, which improves the system's accuracy in predicting
outcomes. Machine learning teaches computers how to take input, process or interpret it according to the
machine learning model, and produce the desired outputs. The overall architecture of the proposed big-data and
machine-learning-based malware detection system is shown in Fig. 3.
There are two major phases in the proposed system, i.e., the training phase and the detection phase. The
training phase of the system is based on a machine learning model. The machine learning models are executed on
Apache Spark, which provides a scalable and parallel machine learning platform. A parallel machine learning
platform significantly reduces the training time compared to a stand-alone system, and it is highly scalable.
For training the machine learning model, malware sample files are collected and stored in the repository. The
malware files from the repository are processed for feature extraction. The extracted malware features are fed
into the machine learning model to train it. Script files are used for various pre-processing and processing
steps on the malware samples. The script files are also used for passing
Authors | Techniques Used | Summary/Result
In Cho et al. (2016) | Sequence Alignment Method (widely used in the bioinformatics field) | The experimental results show that the pairwise sequence alignment algorithm is useful. However, the algorithm has high computing overheads. Because of high computing overhead, a framework with a pairwise sequence alignment algorithm may not be suitable for high-speed malware classification
Anderson et al. (2017) | Reinforcement Learning Approach | The author has implemented a black-box attack using a reinforcement learning model which consists of an agent and an environment. The agent learns incrementally through the environment. The authors in the paper have concluded that, although the reinforcement learning models and manipulations are relatively rudimentary, the modest evasion rate demonstrates that black-box machine learning models for malware detection can be evaded
Ramkumar et al. (2017) | Random Forest, SVM | The paper has shown the efficiency of Apache Spark over Apache Hadoop in terms of performance. The Apache Spark performance is approximately two times better than the Apache Hadoop
Al Ahmadi et al. (2018) | K-Nearest Neighbour (KNN) and Random Forest (RF) | The authors have concluded that malware classification systems that apply a supervised machine learning approach require continuous training of new malware variants to adapt to behavioral changes. However, applying a fuzzy similarity measure allows a degree of flexibility in malware behavioral change and thus only needs training on samples of a new malware family
Cui et al. (2018) | Deep Learning, CNN, Bat Algorithm | The author has proposed the application of deep learning techniques to improve the detection rate of malware variants. The method discussed in the paper transforms the malicious code into a grayscale image. The grayscale images were identified and classified by CNN. It could automatically extract the malware features. Since CNN is very effective and efficient for identifying malware images, the proposed model using CNN was significantly faster than other approaches
Li et al. (2018) | SVM, Association Rules | The authors in the paper have shown that it is possible to reduce the number of permissions to be analyzed for mobile malware detection without compromising the high accuracy and effectiveness. The experimental results show that with less permission, the runtime performance has been improved by 85.6%, and over 90% detection accuracy has been achieved. The author has also concluded that a smaller feature set can also reduce memory consumption. Based on the tested 67 machine learning algorithms, it was shown that the machine learning methods based on tree structure can produce better results
Burnap et al. (2018) | The Self Organizing Feature Map (SOFM) | The result is promising when SOFM is used with Logistic Regression Model. The proposed method is good for the detection of APTs and polymorphic malware
Yuxin and Siyi (2019b) | Deep Belief Network (DBN) | Experimental results show that a DBN can learn from unlabeled data. The proposed model in the paper produces better classification results than classical ML algorithms
Azmoodeh et al. (2019) | Deep Learning | The author has achieved a malware detection accuracy of 98.37% and a precision rate of 98.59%. The authors have also discussed in detail the mitigation techniques for junk code insertion attacks
Table 1 (continued)
Authors | Techniques Used | Summary/Result
Vinayakumar et al. (2019) | Deep Neural Network, Classical Machine Learning Algorithms | The experiment was conducted with various classical machine learning classifiers such as Random Forest (RF), Logistic regression (LR), Decision Tree (DT), k-Nearest Neighbor (KNN), Naive Bayes (NB), SVM with linear (Sl) and rbf kernel (Sr). The ROC for the Deep Neural Network model (DNN) showed the highest AUC of 0.9983
Choi (2020) | kNN Classification and Hierarchical Similarity Hash | The author of the paper has addressed the detection time of the AI based Malware Detection System. To reduce the detection time, author in the paper has proposed k-nearest-neighbor (kNN) classification for malware detection with a vantage-point (VP) tree using a similarity hash. The author has reduced the detection time by 67% and has increased the detection rate by 25%
Chen et al. (2021) | K-Nearest Neighbors (KNN), Decision tree (DT), Gradient Boosting Decision Tree (GBDT), and Extreme Gradient Boosting (XGBoost) | The experimental results in the paper show the detection accuracy of 99.55% and 99.55% f1-score using the XGBoost algorithm. The author has claimed higher accuracy; however, the detection process of the proposed system is time-consuming
Serpanos et al. (2021) | Random Forest | Sisyfos is a modular and extensible platform for malware analysis. It addresses multiple operating systems. Sisyfos has been developed based on open software for feature extraction and is available as a stand-alone tool with a web interface
computer networks, cloud-based infrastructure, etc. These data can be collected in two different ways:
• Pre-Execution Phase: In this phase, static information about the files is collected without executing
them. This may be the file header information, file format, binary data statistics, code descriptions, etc.
Signature-based static analysis can be used to train the detection model.
• Post-Execution Phase: It focuses on behavior analysis of the code and the series of events caused by its
execution. The observed behavior of the file can be mapped to known malware and can be used to train the
detection model.
The research work discussed in this paper is based on the static features of malware, i.e., the pre-execution
phase.

6.1 Application of machine learning in malware analysis

Our malware detection system uses both supervised and unsupervised learning models. The overall objective is
to make the system learn and evolve by itself without human intervention. Hence unsupervised learning is used.
However, some attacks are very novel and do not have any known pattern. In such a situation, supervised
learning through manual intervention is required (Hou et al. 2017; Ali et al. 2020).

6.2 Training dataset

It is important to understand that the success of the machine learning model heavily depends on the training
dataset. A large dataset with varied samples and proper labeling can make the model accurate. The dataset
should contain a large number of conditions to match real-life scenarios. Collecting such a dataset is a
crucial job. Anti-malware development companies collect these samples from their clients and users. Many of
these organizations and websites provide malware samples in the public domain for research purposes. The
details of the data sources used in this research work are mentioned in Sect. 8.1.

6.3 Model

The machine learning model is like a black box. It takes input X and produces output Y through a complex
sequence of processing steps. Generally, the models are so complex that it is difficult for a human to
interpret every input and output. However, though these models are complex in nature, the developer should
have a thorough understanding of them.
Machine learning-based systems often generate false alarms. There are two types of false alarms.
• False Positive: The system mistakenly identifies a benign file as a malicious file.
• False Negative: The system mistakenly identifies a malicious file as a benign file.
The objective of the machine learning model is to reduce false alarms to as few as possible, ideally zero.
This is complicated by the fact that novel malware keeps arriving every day which the learning model has not
seen during the training phase and may classify as benign. Occasionally the system may detect a benign file as
malware because the latest software and files developed by software vendors may show new patterns and
behavior. The system may also raise a false alarm because of a weak training dataset or a fault in the model
itself. In such a scenario, we need to thoroughly examine the system and correct either the dataset or the
model.
As mentioned, many new benign files, as well as malware files, keep coming up. Over time, the number of false
alarms increases, which decreases the overall performance of the system. To keep the system performance high,
the system must be flexible enough to allow changes in the learning model and the addition of new datasets to
train the model on the fly. The model should evolve by learning and updating with the new datasets. The major
steps of a machine learning-based static malware analysis system are shown in Fig. 4.

7 Malware detection using locality-sensitive hashing

The main objective behind the implementation of the machine learning model for malware detection is to reduce
human intervention. Malware detection systems use file properties, file behavior, code fragments, hash values,
or a combination of these approaches to detect malware. The current malware detection systems are facing
performance issues for the following reasons.
• Manual creation of detection rules cannot keep up with the emerging flow of new malware.
• The system cannot detect new malware until the signature database is updated with the detection rule.
We wanted to make the system robust enough to detect even small changes in a file, so that it would detect
new malware even with minor modifications. Above all, with the added features, we wanted to make the system
scalable, fast, and efficient. To achieve our objectives and expected outcomes from the system, we have
selectively used many
concepts, which are discussed in detail in the following sections.

7.1 Locality-sensitive hashing (LSH)

One of the major challenges of the current malware detection system is to analyze and detect new malware.
Much new malware is an improvised or slightly changed version of existing malware. Any change in the existing
malware code makes it undetectable because of a mismatch in pattern with known malware signatures.
So, the challenge is how effectively we can match the known signature patterns with new malware files which
have minor changes. To accomplish this task, we used locality-sensitive hashing (LSH).
LSH is different from normal hashing techniques. As shown in Fig. 5, in the normal hashing technique, the hash
values of two almost identical files are as different as the hash values of two completely different files.
The normal cryptographic hash algorithm does not reflect the near similarity between two files (Yuxin and Siyi
2019a; Gupta 2019; Naderi et al. 2019). However, LSH is an algorithmic technique that hashes similar inputs
into the same "bucket" with high probability, while inputs which are very different are likely to land in
different "buckets." This technique helps with clustering and nearest neighbor search. The
overall process can be divided into three major steps as shown in Fig. 6.

7.2 Shingling

In this process, the file is converted into a set. Each file is converted into a set of character sequences
of length k. The objective is to represent each document in the collection as a set of k-shingles.

7.2.1 Jaccard index

The Jaccard Index is used for measuring the similarity and diversity of sample sets. It is also known as the
Jaccard similarity coefficient and as intersection over union. To measure the similarity between two files
(i.e., new malware with small changes and the known malware signature), the Jaccard Index can be used. The
Jaccard Index between the two files A and B can be calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$
Though the Jaccard Index can measure the similarity between modified malware files and known malware samples,
it raises some scalability issues. The documents need to be stored in a sparse matrix to calculate the
similarity index, which requires a huge amount of memory. The time complexity of the algorithm is $O(n^2)$. As
malware detection systems receive millions of files per day, similarity matching would require huge memory and
time. Hashing is the way to solve the time and space complexity issues.
Hashing is a function that converts the file into a small signature. Instead of storing the complete file in
memory, the hash value of the file is small enough to fit into memory, which greatly reduces the space
complexity of the system. If $H$ is a hash function and $f_1$ and $f_2$ are two sample files, the similarity
index using the hash function can be represented as follows.
• If $\mathrm{similarity}(f_1, f_2)$ is high, then $\mathrm{probability}(H(f_1) = H(f_2))$ is high.
• If $\mathrm{similarity}(f_1, f_2)$ is low, then $\mathrm{probability}(H(f_1) = H(f_2))$ is low.
Jaccard similarity calculation is feasible if the number of files and samples is small. However, in malware
analysis, a huge number of sample files are received every day. To calculate Jaccard similarity, the system
needs to load all the sets and calculate their intersection and union. The whole calculation process is
computationally resource-intensive. Instead of dealing with large sets which require a lot of computing time
and memory, Min-Hash can provide approximate measurements in a scalable manner.

7.2.2 Min-hash

Min-Hash is a technique to quickly estimate the similarity between two files. The Jaccard similarity of two
files $J(f_1, f_2)$ is calculated as the ratio of their intersection and union using Eq. 1. The value of the
Jaccard similarity is 0 if the two files have nothing in common, 1 if the two files are identical, and between
0 and 1 otherwise. The goal of Min-Hash is to quickly estimate $J(f_1, f_2)$ without explicitly computing the
intersection and union. It greatly reduces the space complexity and time complexity.

7.2.3 Locality-sensitive hashing using min-hash

Our malware analysis and detection system has a very large signature database. Whenever the system receives a
suspected file for malware detection, it generates a query to find Jaccard similarities. The LSH algorithm
achieves sub-linear time complexity by reducing the number of comparisons needed to find similar objects. LSH
primarily differs from the conventional hashing technique in that normal hashing tries to avoid hash
collisions, whereas LSH aims to maximize collisions between similar items. In-depth detail and working
principles of the Min-Hash algorithm and locality-sensitive hashing can be
found in (Bryłkowski 2017; Pagani et al. 2018; Hashing 8.1.1 String features
2017).
Some of the well-known locality-sensitive hashing String feature extraction analysis is the most common
algorithms are SSDEEP, SDHASH, and TLSH. In this approach in malware static analysis. Extracting these fea-
experiment, we have used TLSH for malware detection. tures involved extracting and analyzing the exe-
TLSH can be used to inspect a large amount of malware in cutable files’ function names, comments, messages, import
very less time. TLSH uses threshold-based Hierarchical commands, etc. to determine the important features.
Agglomerative Clustering (HAC-T), which can cluster However, the string feature analysis is effective only for
hash digest in a scalable manner. The clustering techniques unencrypted files.
used in TLSH can cluster digests in O(n log n) time on average, which significantly improves the malware detection performance. The complete technical details for TLSH can be found in (Dell’Amico 2019; Oliver et al. 2020; TLSH - Technical Overview 2021).

8 Experimental setup

One of the main objectives of this research was to experiment with the parallel machine learning approach for fast training. Hence, multiple nodes were configured with Apache Spark. The system was installed with a 64-bit Ubuntu operating system. The configuration details of the head node and worker nodes are given in Table 3.
We collected malware datasets from various sources. We took 7825 malware samples from VirusShare and 8127 samples from VirusTotal and theZoo. These samples consist of malware of different families. Malware analysis datasets from the IEEE data repository were also used for analysis. It includes approx. 1000 portable executable (PE) header. Some malware samples were also collected from the Malwr, Lenny Zeltser, and Contagio malware repositories. 2128 benign files were collected from various application and utility software such as Media Player, Adobe Reader, Image Editing Software, Office Utility Software, Gaming Applications, Network Tools, etc.

8.1.2 PE header features

This set of features was extracted from the file’s portable executable headers. Executable files have a common format called Common Object File Format (COFF). The Portable Executable (PE) format is one such COFF-based format, used for Windows executables. The PE format is a data structure that gives the Windows OS loader the information required to manage the wrapped executable file. It contains valuable information such as references to the DLLs to be imported and exported, the code and data required to run the executable, and the resources needed by the executable.

Spark machine learning provides a suite of metrics (Table 4) to evaluate the performance of machine learning models.
We analyzed the performance of the proposed Spark-based scalable malware detection system using various classical machine learning algorithms. We carried out three types of analysis, i.e., Training Time, Detection Time, and Accuracy, for these algorithms.
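A minimal sketch of how a few of the PE header fields described in Sect. 8.1.2 can be read, using only Python's standard `struct` module. The function names `parse_pe_header` and `build_demo_header`, the choice of fields, and the synthetic header bytes are illustrative assumptions; the paper does not state which tool or which exact fields were used for feature extraction.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics flag set for DLL images


def parse_pe_header(data: bytes) -> dict:
    """Read a few COFF file-header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte PE signature.
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }


def build_demo_header() -> bytes:
    """Build a synthetic (hypothetical) DOS + COFF header for demonstration."""
    dos = bytearray(64)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature starts at byte 64
    coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x0022)
    return bytes(dos) + b"PE\0\0" + coff


features = parse_pe_header(build_demo_header())
print(features)
```

Each returned dictionary could serve as one row of a feature table; a real pipeline would normally also read the optional header, section table, and import table, which this sketch omits.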
Number of nodes | 2 | 4
Processor | Intel Core i7 (5.0 GHz) | Intel Core i7 (5.0 GHz)
Memory (RAM) | 16 GB | 8 GB
Storage | 1 TB | 1 TB
Operating system | Ubuntu 20.04 | Ubuntu 20.04
where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives
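The four counts defined above combine into the usual evaluation metrics. The sketch below uses the standard textbook definitions of accuracy, precision, recall, and F1 score, on the assumption that these are among the metrics listed in Table 4; the function name `classification_metrics` and the example counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=8, tn=8, fp=2, fn=2)
print(metrics)
```

The guards against zero denominators are a design choice for robustness on empty or one-class test sets; they return 0.0 rather than raising an error.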
Table 5 Training time of machine learning algorithms on stand-alone system (in hours)
Number of files | Decision tree | Logistic model | Random forest | SVM | Naïve Bayes | Gradient boosting | KNN
Table 6 Training time of machine learning algorithms on distributed systems (in hours)
Number of files | Decision tree | Logistic model | Random forest | SVM | Naïve Bayes | Gradient boosting | KNN
analyzed in our experiment. Figure 9 shows that the overall accuracy of the Random Forest algorithm is better than all the other algorithms.
Area under the curve (AUC) better visualizes the performance of the algorithms. The higher the AUC, the better the performance of the model. To make the graph clearly visible, two AUC graphs have been drawn. Figure 10 shows the AUC for Logistic Model, Support Vector Machine (SVM), and Decision Tree (DT). Figure 11 shows the AUC for Random Forest, Naïve Bayes, Gradient Boosting and KNN.

9.3 Malware detection

After analyzing the training performance and accuracy of multiple machine learning algorithms, we found that the
Table 9 (continued)
Authors and year | Techniques used | Dataset | Result accuracy | Strength | Weakness
Cho et al. (2016) | Sequence alignment method | VX Heaven Windows virus collection | 87% | High accuracy and good time efficiency | The research analysis is mainly focused on Windows-based malware variants. Scalability and time efficiency are not considered in the research objectives
Paranthaman and Thuraisingham (2017) | Random Forest, SVM | CAIDA Dataset, CTU-13 Dataset, ISOT BotNet, Android Malware Genome Project Dataset, ADFA IDS dataset, CS Mining Malicious Software Dataset | 96.31% | The proposed framework has good time efficiency | Scalability is not considered as a research objective
Cui et al. (2018) | Deep Learning, CNN, Bat Algorithm | Dataset from the Vision Research Lab (https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis) | 94.5% | Novel approach. This method transformed the malicious code into grayscale images. Next, the images were identified and classified by a CNN that could extract the features of the malware images automatically | The proposed system needs a unique approach to transforming malicious code into color images. Scalability and time efficiency are not considered in the research objectives
Burnap et al. (2018) | The Self Organizing Feature Map (SOFM) | VirusTotal | 93.76% | Good for detection of APTs and polymorphic malware | The system needs to wait for the full execution of the malicious payload. Scalability and time efficiency are not considered in the research objectives
Azmoodeh et al. (2019) | Deep Learning | VirusTotal and Self-Collected Customized Dataset | 99.68% | Sustainable against Junk Code Insertion Attacks | Work is mainly focused on IoT and IoBT malware detection. Scalability and time efficiency are not considered in the research objectives
Vinayakumar et al. (2019) | Deep Neural Network, Classical Machine Learning Algorithms | Publicly available dataset Ember | 96.3% | The framework is capable of analyzing a large number of malware in real-time. It could be scaled out to analyze an even larger number of malware | The author has claimed that the system is scalable, but the same has not been shown in experimental results
Yuxin and Siyi (2019b) | Deep Belief Network (DBN) | VirusTotal and Self-Collected Customized Dataset | 98.37% | The classification result is better and accuracy is higher than the baseline models | Scalability and time efficiency are not considered in the research objectives
Choi (2020) | Combined kNN Classification and Hierarchical Similarity Hash for Fast Malware Detection | HAURI, Antivirus Company | 99.8% | The system has significantly reduced the malware detection time using Trend Micro locality-sensitive hashing (TLSH) | The author in the paper has implemented vantage-point (VP) tree using a similarity hash. The search time of the VP tree needs improvement
Chen et al. (2021) | A Learning-based Static Malware Detection System with Integrated Feature | VirusTotal | 99.12% | The accuracy of the proposed system is high | The malware detection time of the proposed system is high
Serpanos et al. (2021) | Random Forest | EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models | 99.21% | Sisyfos is a modular and extensible platform for malware analysis. It addresses multiple operating systems | The research work is focused on the development of a modular and extensible platform for supporting multiple operating systems. Detection time and scalability are not the research objective
Proposed architecture | Random Forest with TLSH | VirusShare, VirusTotal, theZoo, IEEE, Malwr, Lenny Zeltser and Contagio malware repositories | 99.8% | Highly Scalable and Time Efficient | The malware detection time of the proposed system is almost half compared to the existing model proposed by other researchers. However, the proposed system is based on Apache Spark which requires higher RAM capacity
Cho IK, Kim TG, Shim YJ, Ryu M, Im EG (2016) Malware analysis and classification using sequence alignments. Intell Autom Soft Comput 22(3):371–377. https://doi.org/10.1080/10798587.2015.1118916
Choi S (2020) Combined kNN classification and hierarchical similarity hash for fast malware detection. Appl Sci 10(15):5173. https://doi.org/10.3390/app10155173
Cui Z, Xue F, Cai X, Cao Y, Wang G, Chen J (2018) Detection of malicious code variants based on deep learning. IEEE Trans Industr Inf 14(7):3187–3196. https://doi.org/10.1109/TII.2018.2822680
Dell’Amico M (2019) Fishdbc: Flexible, incremental, scalable, hierarchical density-based clustering for arbitrary data and distance. arXiv preprint 1910.07283
Gupta S (2019) Locality sensitive hashing. Medium. Towards Data Science. https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
Gupta D, Rani R (2018) Big data framework for zero-day malware detection. Cybern Syst 49(2):103–121. https://doi.org/10.1080/01969722.2018.1429835
Hordri NF, Ahmad NA, Yuhaniz SS, Sahibuddin S, Ariffin AF, Saupi NA, Zamani NA, Jeffry Y, Senan MF (2018) Classification of malware analytics techniques: a systematic literature review. Int J Secur Appl 12(2):9–18
Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) HinDroid: An intelligent android malware detection system based on structured heterogeneous information network. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1507–1515. https://doi.org/10.1145/3097983.3098026
Kaspersky-Lab-Whitepaper-Machine-Learning. Accessed March 23, 2020. https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf
Kolosnjaji B, Demontis A, Biggio B, Maiorca D, Giacinto G, Eckert C, Roli F (2018) Adversarial malware binaries: evading deep learning for malware detection in executables. In 2018 26th European Signal Processing Conference (EUSIPCO), pp 533–537. IEEE
Li J, Sun L, Yan Q, Li Z, Srisa-an W, Ye H (2018) Significant permission identification for machine-learning-based android malware detection. IEEE Trans Industr Inf 14(7):3216–3225. https://doi.org/10.1109/TII.2017.2789219
Masabo E, Kaawaase KS, Sansa-Otim J (2018) Big data. Proceedings of the 2018 International Conference on Software Engineering in Africa - SEiA 18. https://doi.org/10.1145/3195528.3195533
Naderi H, Vinod P, Conti M, Parsa S, Alaeiyan MH (2019) Malware signature generation using locality sensitive hashing. Commun Comput Inf Sci Secur Privacy. https://doi.org/10.1007/978-981-13-7561-3_9
Oliveira A (2019) Malware analysis datasets: Top-1000 PE imports. IEEE Dataport. https://doi.org/10.21227/004e-v304
Oliver J, Ali M, Hagen J (2020) HAC-T and Fast Search for Similarity in Security. 2020 International Conference on Omni-Layer Intelligent Systems (COINS). https://doi.org/10.1109/coins49042.2020.9191381
Pagani F, Dell’Amico M, Balzarotti D (2018) Beyond Precision and recall: Understanding uses (and misuses) of similarity hashes in binary analysis. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy (CODASPY ’18). Association for Computing Machinery, New York, NY, USA, 354–365. https://doi.org/10.1145/3176258.3176306
Paola A De, Lo Re G (2020) A hybrid system for malware detection on big data - IEEE Conference Publication. Accessed March 23. https://ieeexplore.ieee.org/document/8406963/
Paranthaman R, Thuraisingham B (2017) Malware collection and analysis. 2017 IEEE International Conference on Information Reuse and Integration (IRI), San Diego, CA, pp 26–31. https://doi.org/10.1109/IRI.2017.92
Poudyal S, Akhtar Z, Dasgupta D, Gupta KD (2019) Malware analytics: review of data mining, machine learning and big data perspectives. 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, pp 649–656. https://doi.org/10.1109/SSCI44817.2019.9002996
Rathore H, Agarwal S, Sahay SK, Sewak M (2019) Malware detection using machine learning and deep learning. arXiv.org. https://arxiv.org/abs/1904.02441v1
Serpanos D, Michalopoulos P, Xenos G, Ieronymakis V (2021) Sisyfos: A modular and extendable open malware analysis platform. Appl Sci 11(7):2980. https://doi.org/10.3390/app11072980
Smart Whitelisting Using Locality Sensitive Hashing (2017) Trend Micro. https://www.trendmicro.com/en_us/research/17/c/smart-whitelisting-using-locality-sensitive-hashing.html
TLSH - Technical Overview (2021) TLSH Technical Overview. https://tlsh.org/papers.html
Ullah F, Babar MA (2019) Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw 151:81–118. https://doi.org/10.1016/j.jss.2019.01.051
Venkatraman S, Alazab M (2018) Use of data visualisation for zero-day malware detection. Secur Commun Netw 2018:1–13. https://doi.org/10.1155/2018/1728303
Vinayakumar R, Soman K (2018) Deepmalnet: evaluating shallow and deep networks for static PE malware detection. ICT Express 4(4):255–258
Vinayakumar R, Alazab M, Soman KP, Poornachandran P, Venkatraman S (2019) Robust intelligent malware detection using deep learning. IEEE Access 7(2019):46717–46738. https://doi.org/10.1109/access.2019.2906934
Wassermann S, Casas P (2018) Bigmomal. Proceedings of the 2018 Workshop on Traffic Measurements for Cybersecurity - WTMC 18. https://doi.org/10.1145/3229598.3229600
Wu Q, Zhu X, Liu B (2021) A survey of android malware static detection technology based on machine learning. Mob Inf Syst 2021:1–18. https://doi.org/10.1155/2021/8896013
Ye Y, Li T, Adjeroh D, Iyengar SS (2017) A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR). https://doi.org/10.1145/3073559
Yuxin D, Siyi Z (2019a) Malware detection based on deep learning algorithm. Neural Comput Appl 31(2):461–472
Yuxin D, Siyi Z (2019b) Malware detection based on deep learning algorithm. Neural Comput Appl 31:461–472. https://doi.org/10.1007/s00521-017-3077-6

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.