
Soft Computing (2022) 26:3987–4003

https://doi.org/10.1007/s00500-021-06492-9

APPLICATION OF SOFT COMPUTING

Scalable malware detection system using big data and distributed machine learning approach
Manish Kumar1

Accepted: 24 October 2021 / Published online: 5 November 2021

© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
Computers, the Internet, and smartphones have changed our lives as never before, and today we cannot imagine life without these technologies. Almost everything around us is connected to and controlled by systems and software, and software and mobile applications have become the nerve center of daily life. Our dependency on these systems is so great that it is frightening to imagine what would happen if they failed. They are under constant threat from many types of cyber-attack, and cybercriminals evolve their attack strategies every day. Attacks that use ever more sophisticated malware are a major concern for all types of users. The cyber world has witnessed rapid changes in malware attack strategy in the recent past. The volume, velocity, and complexity of malware pose new challenges for malware detection systems. A scalable malware detection system capable of detecting complex attacks is the need of the hour. In this paper, we propose a scalable malware detection system using big data and a machine learning approach. The machine learning model in the system is implemented using Apache Spark, which supports distributed learning. Locality-sensitive hashing is used for malware detection, which significantly reduces the malware detection time. A five-stage iterative process has been used to carry out the implementation and experimental analysis. The proposed model achieved 99.8% accuracy. It also significantly reduced the learning and malware detection time compared to models proposed by other researchers.

Keywords Malware · Big data · Machine learning · Static analysis · Dynamic analysis · Locality-sensitive hashing

Manish Kumar
manishkumarjsr@yahoo.com

1 Department of Master of Computer Applications, M. S. Ramaiah Institute of Technology, Bangalore, India

1 Introduction

There are various types of malware, and each has its own way of attacking its target. The most innovative part of any malware is the way it delivers its payload to the target and evades detection (Azmoodeh et al. 2019; Gupta and Rani August 2018; Venkatraman and Alazab February 2018). It can be delivered through an email attachment, an executable shared through social media, malicious web links distributed through instant messaging, etc.

It is also important to understand the motive behind the malware. Every piece of malware has a specific target and objectives. Some malware targets individual users, whereas some aims to jeopardize corporate networks. Some malware has the clear objective of affecting critical infrastructure, whereas some is intended for financial gain (Catak 2019; Hordri et al. 2018). Malware is written to target specific operating systems, software, and file types. With the proliferation of mobile applications and cloud computing, there are now malware variants that target smartphone users and cloud resources.

Malware is one of the major threats to IT infrastructure. Many researchers around the world are actively working on malware detection and analysis techniques. There are many well-known open-source as well as commercial malware detection systems available in the market. However, continuous changes in malware attack techniques are driving the need for malware detection systems to evolve. Real-time detection, accuracy, and


Content courtesy of Springer Nature, terms of use apply. Rights reserved.


scalability issues are some of the major challenges which need to be addressed.

High-speed networks and the increased number of users generate large numbers of suspect files to be scanned by the malware detection system. As the number of sample files and the size of the signature database increase, scalability and performance issues arise. Another major disadvantage of the traditional malware detection system is that it needs continuous updates to detect novel malware attacks (Kolosnjaji et al. 2018; Masabo et al. 2018; Poudyal et al. 2019).

One of the objectives of the research work proposed in this paper is to develop a system that can handle a large number of files efficiently. The system should also evolve with time without manual intervention and detect novel attacks (Wu et al. 2021). In such a situation, a big data system seems to be a feasible solution: it can deal with the storage, ingestion, and extraction activities and support the machine learning system as it learns and grows.

Various classical machine learning algorithms can be used for malware detection. However, not all algorithms perform equally well. We need to identify the machine learning algorithm that supports fast learning and provides high accuracy on a big-data-based platform. Hence, many classical algorithms (i.e., Decision Tree, Logistic Model, Random Forest, SVM, Naïve Bayes, Gradient Boosting, and KNN) were evaluated experimentally, and Random Forest, which performed best, was selected for implementation in the proposed malware detection system. To reduce the malware detection time, the Trend Micro Locality-Sensitive Hashing (TLSH) algorithm has been implemented along with the Random Forest algorithm.

Many researchers have already experimented with machine learning algorithms for malware detection. However, the implementation of a machine learning algorithm on a distributed parallel system, combined with the application of locality-sensitive hashing for malware detection, is a novel contribution of this paper.

Section 2 gives a short history of malware, followed by malware analysis techniques and challenges. Section 3 highlights recent work by other researchers on similar topics. Section 4 discusses the research objectives and methodology, followed by Sects. 5 and 6, which briefly discuss big data, machine learning, and their applications in malware detection. Section 7 discusses in detail locality-sensitive hashing, which is used for malware detection in the proposed model. The experimental setup is discussed in Sect. 8, followed by Sect. 9, which briefly discusses the experimental results. Section 10 presents a comparative analysis of the proposed work with some of the most recent and relevant works, followed by a conclusion in Sect. 11.

2 Malware

The history of malware is quite interesting. According to Scientific American, the idea of a computer virus was born in 1949, when the well-known computer scientist John von Neumann published a paper, "Theory and Organization of Complicated Automata." In this paper, von Neumann hypothesized how a computer program could reproduce itself. In the 1950s, Bell Labs employees implemented von Neumann's vision in a game called "Core Wars." In this game, programmers would unleash software "organisms" that competed for control of the computer.

Computer viruses began to appear in the early 1970s. Historians often credit the "Creeper Worm," an experimental self-replicating program written by Bob Thomas at BBN Technologies, with being the first virus. However, the term virus was not introduced into computer science jargon until the mid-eighties. Fred Cohen coined the term in his Ph.D. thesis in 1986. He defined a "virus" in a single sentence as: "A program that can infect other programs by modifying them to include a, possibly evolved, version of itself."

There are various definitions of malware. Put simply, malware is any piece of code or software that is intentionally developed with malicious intent. Malware is generally used as an umbrella term for many different types of threats, including viruses, trojans, adware, riskware, etc. The objectives of malware may be multifaceted, such as stealing data or damaging the device (Higuera 2020; Vinayakumar and Soman 2018; Rathore et al. 2019).

2.1 Malware analysis techniques

Malware is evolving continuously, becoming more complex and stealthy. New types of malware keep appearing with more damaging characteristics than before. The evolving complexity of malware poses a constant challenge for analysts trying to detect it. Attackers always find new and advanced techniques to escape detection. This is where malware analysis plays a crucial role. The core objectives of malware analysis are to understand the malware's method of infection, the threat and risk it poses, and the associated damage. Analyzing malware requires a series of methods and techniques. Better analysis helps an organization develop better defensive techniques. There are various approaches (Fig. 1) to malware analysis, of which Static Analysis and Dynamic Analysis are the most


common and well-known approaches (Agarkar and Ghosh 2020; Higuera 2020).

Fig. 1 Malware analysis methods and techniques

2.2 Static analysis

Static analysis is the process of analyzing the malware binary code without executing it. It is performed by determining the signature of the binary file or by calculating the hash of the binary file. The binary file can be reverse-engineered by disassembling it, converting the machine-executable code into assembly language code. From the assembly language code, the analyst can better understand what the program is designed to do.

2.3 Dynamic analysis

Unlike static analysis, dynamic analysis is behavior-based: it is performed by executing the malware code and observing its behavior. Dynamic analysis is performed in a closed and isolated virtual environment so that the malware can be analyzed thoroughly without affecting the host system. It is used to determine the functionality of the malware, which is generally difficult to establish through any other approach.

2.4 Malware analysis challenges

Malware analysis is a challenging task, and some of the common challenges are as follows:

(1) Most malware detection software is based on a signature-based approach. The detection software matches the hash value of the suspected binary file against its signature database. The signature-based approach is simple; however, it cannot detect novel malware attacks (Gupta 2019).
(2) Malware writers use code obfuscation to change the structure and pattern of the malware program. This makes it difficult for the malware analyst to understand the code or reverse engineer it using available tools (Anderson et al. 2017).
(3) There is always a need to monitor live network data. However, live network monitoring has never been an easy task. Monitoring petabytes and exabytes of live network data raises challenging scalability and performance issues. The traditional malware detection system is not capable of processing big data (AlAhmadi et al. 2018).
(4) Different types of devices connected to the network communicate data in different formats. The malware detection system must be able to interpret heterogeneous data formats, which is not a simple job. Many researchers have worked on this issue. However, it remains a challenging task (Burnap et al. 2018; Gupta and Rani August 2018; Higuera 2020).
(5) Some static malware analysis and detection systems show promising results, but time complexity and delay in detection are among the major challenges (Vinayakumar and Soman 2018).

A malware detection system using big data and a machine learning approach effectively handles the above-mentioned challenges. Implementation of big data makes the malware detection system scalable. The machine


learning approach makes the system evolve by learning from new attack patterns.

3 Related works

Malware analysis is an active research field, and we reviewed many research papers in the course of this work. Table 1 summarizes some of the work related to our proposed system.

4 Research objectives and methodology

4.1 Research objectives

The main research objectives of the work presented in this paper are as follows:

1. Fast Learning Model with High Accuracy: Machine learning is a time-consuming and resource-intensive process. We wanted to experiment with a parallel machine learning process to reduce the learning time as well as to achieve higher accuracy.
2. Fast and Scalable Malware Detection System: The objective behind making the system scalable is to satisfy on-demand performance. A scalable system can easily be scaled up for higher performance as requirements grow. We have used locality-sensitive hashing for malware detection, which is highly scalable and time-efficient.

4.2 Research methodology

In a machine learning experiment, the workflow is an iterative process that can be divided into five stages: gathering the data, cleaning and preprocessing the data, building the machine learning model, validating the model, and finally deploying the model. The entire process is repeated (Fig. 2) until the expected outcome is achieved.

5 Big data perspective

Malware collection and processing is a critical and challenging job for most organizations in the cybersecurity domain. The proliferation of the internet and exponential growth in the number of users generate a flood of malware samples. Major anti-malware software development companies receive millions of samples a day. It is difficult to collect, store, and analyze such a huge number of samples in a scalable manner. Effectively combating and triaging malware is now a big data problem (Anderson et al. 2017; Paranthaman and Thuraisingham 2017; Wassermann and Casas 2018).

The traditional malware detection system has several technical limitations that must be addressed if it is to face next-generation malware detection challenges. Contemporary systems are not good at handling large numbers of files or very large files. It is also practically difficult to add storage and processing capacity dynamically on demand (Oliveira 2019; Paola et al. 2020; Ullah and Babar 2019). The system is not flexible enough to accommodate dynamic changes in programming scripts or database schemas. The lack of fault tolerance is also a major concern.

The traditional approach to static malware analysis is neither scalable nor efficient enough. High-speed networks and the increased number of users pose scalability challenges, as delays in analysis and detection can severely affect overall performance. Big data and a machine learning approach can greatly improve the efficiency of the malware detection system, making the overall system highly scalable and improving its performance (Kolosnjaji et al. 2018; Masabo et al. 2018; Poudyal et al. 2019).

Big data deals with storage, ingestion, and extraction activities. As the name suggests, it helps the system cope with large volumes of data for analysis, discovery of hidden patterns, and information extraction.

Machine learning helps the system learn and make predictive decisions without human intervention, improving the accuracy with which the system predicts outcomes. Machine learning teaches computers how to take an input, process or interpret it according to the machine learning model, and produce the desired output. The overall architecture of the proposed big-data- and machine-learning-based malware detection system is shown in Fig. 3.

There are two major phases in the proposed system: the training phase and the detection phase. The training phase is based on a machine learning model. The machine learning models are executed on Apache Spark, which provides a scalable and parallel machine learning platform. A parallel machine learning platform significantly reduces the training time compared to a stand-alone system, and it is highly scalable. To train the machine learning model, malware sample files are collected and stored in a repository. The malware files from the repository are processed for feature extraction, and the extracted features are fed into the machine learning model for training. Script files are used for the various pre-processing and processing steps applied to the malware samples. The script files are also used for passing


Table 1 Recent works on malware analysis using Machine Learning

Authors | Techniques used | Summary/Result
In Cho et al. (2016) | Sequence Alignment Method (widely used in the bioinformatics field) | The experimental results show that the pairwise sequence alignment algorithm is useful. However, the algorithm has high computing overheads. Because of this overhead, a framework with a pairwise sequence alignment algorithm may not be suitable for high-speed malware classification
Anderson et al. (2017) | Reinforcement Learning Approach | The authors implemented a black-box attack using a reinforcement learning model which consists of an agent and an environment. The agent learns incrementally through the environment. The authors concluded that although the reinforcement learning models and manipulations are relatively rudimentary, the modest evasion rate demonstrates that black-box machine learning models for malware detection can be evaded
Ramkumar et al. (2017) | Random Forest, SVM | The paper shows the efficiency of Apache Spark over Apache Hadoop in terms of performance. Apache Spark performs approximately two times better than Apache Hadoop
Al Ahmadi et al. (2018) | K-Nearest Neighbour (KNN) and Random Forest (RF) | The authors concluded that malware classification systems that apply a supervised machine learning approach require continuous training on new malware variants to adapt to behavioral changes. However, applying a fuzzy similarity measure allows a degree of flexibility in malware behavioral change and thus only needs training on samples of a new malware family
Cui et al. (2018) | Deep Learning, CNN, Bat Algorithm | The authors proposed the application of deep learning techniques to improve the detection rate of malware variants. The method discussed in the paper transforms the malicious code into a grayscale image. The grayscale images were identified and classified by a CNN, which could automatically extract the malware features. Since a CNN is very effective and efficient for identifying malware images, the proposed model using a CNN was significantly faster than other approaches
Li et al. (2018) | SVM, Association Rules | The authors showed that it is possible to reduce the number of permissions to be analyzed for mobile malware detection without compromising high accuracy and effectiveness. The experimental results show that with fewer permissions, the runtime performance improved by 85.6%, and over 90% detection accuracy was achieved. The authors also concluded that a smaller feature set can reduce memory consumption. Based on the 67 machine learning algorithms tested, it was shown that machine learning methods based on tree structures can produce better results
Burnap et al. (2018) | The Self Organizing Feature Map (SOFM) | The result is promising when SOFM is used with a Logistic Regression Model. The proposed method is good for the detection of APTs and polymorphic malware
Yuxin and Siyi (2019b) | Deep Belief Network (DBN) | Experimental results show that a DBN can learn from unlabeled data. The proposed model in the paper produces better classification results than classical ML algorithms
Azmoodeh et al. (2019) | Deep Learning | The authors achieved a malware detection accuracy of 98.37% and a precision rate of 98.59%. They also discussed in detail the mitigation techniques for junk code insertion attacks
Vinayakumar et al. (2019) | Deep Neural Network, Classical Machine Learning Algorithms | The experiment was conducted with various classical machine learning classifiers such as Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), k-Nearest Neighbor (KNN), Naive Bayes (NB), and SVM with linear (Sl) and rbf kernel (Sr). The ROC for the Deep Neural Network model (DNN) showed the highest AUC of 0.9983
Choi (2020) | kNN Classification and Hierarchical Similarity Hash | The author addressed the detection time of an AI-based malware detection system. To reduce the detection time, the author proposed k-nearest-neighbor (kNN) classification for malware detection with a vantage-point (VP) tree using a similarity hash. The author reduced the detection time by 67% and increased the detection rate by 25%
Chen et al. (2021) | K-Nearest Neighbors (KNN), Decision Tree (DT), Gradient Boosting Decision Tree (GBDT), and Extreme Gradient Boosting (XGBoost) | The experimental results in the paper show a detection accuracy of 99.55% and an F1-score of 99.55% using the XGBoost algorithm. The authors claimed higher accuracy; however, the detection process of the proposed system is time-consuming
Serpanos et al. (2021) | Random Forest | Sisyfos is a modular and extensible platform for malware analysis. It addresses multiple operating systems. Sisyfos has been developed based on open software for feature extraction and is available as a stand-alone tool with a web interface
the required parameters and settings of the machine learning models.

Fig. 2 Machine Learning-based Malware Analysis Research Methodology

The detection phase scans files and detects malware. The proposed model can detect malware directly through the trained machine learning model. However, malware detection through a machine learning system is slightly time-consuming. To reduce the detection time, the TLSH algorithm is used. TLSH is a locality-sensitive hashing technique that quickly matches the hash of a suspect file against the malware database (Ali et al. 2020). If TLSH finds a matching hash for the suspect file, the file is flagged as malware. If TLSH does not find a match, the suspect file is passed to the trained machine learning model for scanning and malware detection. The use of TLSH drastically reduces the malware detection time. Overall, the proposed architecture is designed to address the major challenges mentioned in Sect. 2.4.

5.1 Scalable malware processing architecture

One of the core objectives of the research is to design a malware detection system that is scalable and efficient enough to handle a large number of malware sample files. The simplest approach to scalability is to add more hardware resources to the system. However, just adding hardware does not guarantee efficiency. It is quite challenging to handle complex parallelization, synchronization, communication, and resource utilization when hardware is added to an existing system. Therefore, a framework is required to handle these issues. We determined that Apache Spark can be used to address the scalability, reliability, efficiency, and maintainability concerns of large-scale malware processing.
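The two-stage detection phase described in this section (similarity lookup first, classifier only on a miss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the classifier is any callable standing in for the trained model, and Jaccard similarity over byte 4-grams is used as a simple stand-in for a TLSH comparison, since TLSH itself is not reproduced here.

```python
import hashlib
from typing import Callable, FrozenSet, Iterable


def shingle_set(data: bytes, k: int = 4) -> FrozenSet[bytes]:
    """Toy similarity digest: the set of all k-byte substrings.
    Assumption: this stands in for a TLSH digest; it is NOT TLSH."""
    return frozenset(data[i:i + k] for i in range(max(1, len(data) - k + 1)))


def similarity(a: FrozenSet[bytes], b: FrozenSet[bytes]) -> float:
    """Jaccard similarity of two digests, in the range [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def detect(
    sample: bytes,
    known_digests: Iterable[FrozenSet[bytes]],
    classifier: Callable[[bytes], bool],
    threshold: float = 0.8,  # hypothetical cut-off, chosen for illustration
) -> str:
    """Stage 1: cheap similarity lookup against known malware digests.
    Stage 2: fall back to the (slower) trained model only on a miss."""
    digest = shingle_set(sample)
    if any(similarity(digest, known) >= threshold for known in known_digests):
        return "malware (similarity match)"
    return "malware (classifier)" if classifier(sample) else "benign"


def sha256_hex(data: bytes) -> str:
    """Cryptographic hash, shown only for contrast with the similarity digest."""
    return hashlib.sha256(data).hexdigest()
```

In this sketch, a known sample with a single byte changed still passes the similarity lookup, whereas its SHA-256 value no longer matches the original, which is the property that motivates a similarity hash over an exact signature.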


Fig. 3 Machine Learning-Based Malware Detection System’s Architecture
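In the training phase of this architecture, sample files are processed for feature extraction before being fed to the model. As a rough sketch of what a static (pre-execution) feature extractor might compute, the example below derives a few simple byte-level statistics. The feature set and the function name are assumptions made for illustration; the features actually used in the paper are not specified in this section.

```python
import math
from collections import Counter
from typing import Dict, Union


def static_features(data: bytes) -> Dict[str, Union[int, float, bool]]:
    """Compute a few illustrative static features without executing the file.
    Assumption: these features are examples, not the paper's feature set."""
    n = len(data)
    counts = Counter(data)  # byte value -> number of occurrences
    # Shannon entropy in bits per byte; packed or encrypted code tends to score high.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    printable = sum(counts[b] for b in range(32, 127))
    return {
        "size": n,
        "entropy": entropy,
        "printable_ratio": printable / n if n else 0.0,
        # "MZ" is the magic number at the start of DOS/PE executables.
        "has_mz_header": data[:2] == b"MZ",
    }
```

A feature dictionary like this would be computed once per sample and then converted into a numeric vector for whichever classifier is being trained.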

5.1.1 Apache Spark

Apache Spark is an open-source software framework. It can run in stand-alone mode, in the cloud, under a cluster manager, or on other platforms. Spark is designed for fast performance, using RAM for caching and data processing. It allows the distributed processing of large volumes of data across clusters of multiple computing nodes. The Spark platform is highly scalable and fault-tolerant. Some of the major benefits of Spark are shown in Table 2.

Table 2 Apache Spark Features and Benefits

Feature | Benefit
Platform | It can run on various platforms such as Kubernetes, EC2, Hadoop YARN, Mesos, and its stand-alone cluster mode. It can access data in HDFS, Apache Cassandra, Apache Hive, Apache HBase, Alluxio, and hundreds of other data sources
Performance | Apache Spark achieves high performance for both batch and streaming data. The state-of-the-art query optimizer, DAG scheduler, and physical execution engine boost its performance significantly. One of the main reasons for its high performance is that it does not read and write intermediate data from disk but uses RAM
Cost | Open-source platform
Data processing | Suitable for live-stream and iterative data analysis. It works with DAGs and RDDs to run operations
Fault tolerance | The system can rebuild the dataset if a partition fails. The system tracks the RDD block creation process, and it can use the DAG to rebuild the data across nodes
Scalability | Highly scalable system. Nodes can be added based on requirements to achieve the desired performance
Security | It relies on integration with Hadoop to achieve the necessary security level. By default, security is turned off
Machine learning support | It has in-memory processing and uses MLlib for computations. This makes the system faster and suitable for machine learning
Scheduling and resource management | Has built-in tools for resource allocation, scheduling, and monitoring
Ease of use and language support | Spark powers a stack of libraries including GraphX, Spark Streaming, DataFrames, MLlib for machine learning, and SQL. These libraries can be combined seamlessly in the same application. APIs can be written in Java, Scala, R, Python, and Spark SQL

6 Machine learning

The machine learning approach can greatly boost the performance of malware detection systems (Bryłkowski 2017; Kaspersky-Lab-Whitepaper-Machine-Learning 2020; Ye et al. 2017). A machine learning model is a mathematical representation of a real-world process. A machine learning algorithm can be used to analyze data and discover hidden patterns. For the machine learning model to make predictive decisions, we first need to train it using training datasets. In a malware detection system, the training datasets could be statistical and dynamic behaviors of known malware samples collected from individual users,


computer networks, cloud-based infrastructure, etc. These data can be collected in two different ways:

• Pre-Execution Phase: In this phase, static information about a file is collected without executing it. This may be the file header information, file format, binary data statistics, code descriptions, etc. Signature-based static analysis can be used to train the detection model.
• Post-Execution Phase: This phase focuses on behavior analysis of the code and the series of events caused by its execution. The observed behavior of the file can be mapped to that of known malware and used to train the detection model.

The research work discussed in this paper is based on the static features of malware, i.e., the pre-execution phase.

6.1 Application of machine learning in malware analysis

Our malware detection system uses both supervised and unsupervised learning models. The overall objective is to make the system learn and evolve by itself without human intervention. Hence, unsupervised learning is used. However, some attacks are very novel and do not have any known pattern. In such a situation, supervised learning through manual intervention is required (Hou et al. 2017; Ali et al. 2020).

6.2 Training dataset

It is important to understand that the success of the machine learning model depends heavily on the training dataset. A large dataset with varied samples and proper labeling can make the model accurate. The dataset should cover a large number of conditions so that it matches real-life scenarios. Collecting such a dataset is a crucial job. Anti-malware development companies collect these samples from their clients and users. Many of these organizations and websites provide malware samples in the public domain for research purposes. The details of the data sources used in this research work are given in Sect. 8.1.

6.3 Model

The machine learning model is like a black box. It takes an input X and produces an output Y through a complex sequence of processing steps. Generally, the models are so complex that it is difficult for a human to interpret every input and output. However, even though these models are complex in nature, the developer should have a thorough understanding of them.

Machine learning-based systems often generate false alarms. There are two types of false alarms.

• False Positive: The system mistakenly identifies a benign file as a malicious file.
• False Negative: The system mistakenly identifies a malicious file as a benign file.

The objective of the machine learning model is to reduce false alarms as far as possible, ideally to zero. This is complicated by the fact that novel malware appears every day; the learning model has not seen it during the training phase and may classify it as benign. Occasionally the system may classify a benign file as malware because the latest software and files developed by software vendors may show new patterns and behavior. The system may also raise a false alarm because of a weak training dataset or a fault in the model itself. In such a scenario, we need to thoroughly examine the system and correct either the dataset or the model.

As mentioned, many new benign files, as well as malware files, keep appearing. Over time, the number of false alarms increases, which decreases the overall performance of the system. To keep system performance high, the system must be flexible enough to accommodate changes in the learning model and the addition of new datasets to train the model on the fly. The model should evolve by learning from and updating with the new datasets. The major steps of a machine learning-based static malware analysis system are shown in Fig. 4.

7 Malware detection using locality-sensitive hashing

The main objective behind implementing a machine learning model for malware detection is to reduce human intervention. Malware detection systems use file properties, file behavior, code fragments, hash values, or a combination of these approaches to detect malware. Current malware detection systems face performance issues for the following reasons.

• Manual creation of detection rules cannot keep up with the emerging flow of new malware.
• The system cannot detect new malware until the signature database is updated with the detection rule.

We wanted to make the system robust enough to detect even small changes in a file, so that it would detect new malware even with minor modifications. Above all, with the added features, we wanted to make the system scalable, fast, and efficient. To achieve our objectives and the expected outcomes from the system, we have selectively used many

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

Scalable malware detection system using big data and distributed machine learning approach 3995

Fig. 4 Major steps for machine learning-based static malware analysis

concepts, which are discussed in detail in the following minor changes. To accomplish this task, we used locality-
sections. sensitive hashing (LSH).
LSH is different from normal hashing techniques. As
7.1 Locality-sensitive hashing (LSH) shown in Fig. 5, in the normal hashing technique, the hash
value of two almost identical files is as different as the hash
One of the major challenges of the current malware value of two completely different files. The normal cryp-
detection system is to analyze and detect new malware. tographic hash algorithm does not reflect the near similarity
These new malware are improvised or minor changed between two files (Yuxin and Siyi 2019a; Gupta 2019;
versions of existing malware. Any changes in the existing Naderi et al. 2019). However, LSH is an algorithmic
malware code make it undetectable because of a mismatch technique that hashes similar inputs into the same ‘‘bucket’’
in pattern with known malware signatures. with high probability, while input data which are far dif-
So, the challenge is how effectively we match the ferent are likely to be in different ‘‘buckets.’’ This tech-
known signature patterns with new malware files which has nique helps for clustering and nearest neighbor search. The

Fig. 5 Normal Cryptographic Hash and Locality-Sensitive Hash

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

3996 M. Kumar

Fig. 6 Major Steps for

Calculating Locality-Sensitive
Hash

overall process can be divided into three major steps as • If similarityðf 1 ,f 2 Þ is low then
shown in Fig. 6. probabilityðH ðf 1 Þ ¼¼ Hðf 2 Þ) is low.
Jaccard similarity calculation is feasible if the number of
7.2 Shingling
files and samples is less. However, in malware analysis, a
huge number of sample files are received every day. To
In this process, the file is converted into sets. Each file is
calculate Jaccard similarity, the system needs to load all the
converted into a set of characters of length k. The objective
sets and calculate intersection and union. The whole cal-
is to represent each document in the collection as a set of
culation process computationally resources intensive.
k-shingles.
Instead of dealing with large sets which require a lot of
computing time and memory, Min-Hash can provide the
7.2.1 Jaccard index
approximate measurements in a scalable manner.
The Jaccard Index is used for measuring the similarity and
7.2.2 Min-hash
diversity of sample sets. It is also known as the Jaccard
similarity coefficient and the intersection over the union.
Min-Hash is a technique to quickly estimate the similarity
Now to measure the similarity between two files (i.e., new
between two files. Jaccard similarity of two files Jðf 1 ,f 2 Þ
malware with small changes and the known malware sig-
are calculated by the ratio of their intersection and union
nature), Jaccard Index can be used. The Jaccard Index
using Eq. 1. The value of Jaccard similarity would be 0, if
between the two files A and B can be calculated as:
the two files are different, it would be 1, if the two files are
j A \ Bj completely similar, and it would be between 0 and 1,
J ðA; BÞ ¼ ð1Þ
j A [ Bj otherwise. The goal of Min-hash is to quickly calculate
Jðf 1 ,f 2 Þ without explicitly computing the intersection and
Though the Jaccard Index can measure the similarity
union. It greatly reduces the space complexity and time
between modified malware files and known malware
complexity.
samples, it raises some scalability issues. The documents
need to be stored in a sparse matrix to calculate similarity
7.2.3 Locality-sensitive hashing using min-hash
index and it’s required huge memory. The time complexity
of the algorithm is O(n2). As malware detection systems
Our malware analysis and detection system have a very
receive millions of files per day, it would require huge
large collection of signature database. Whenever the sys-
memory and time for similarity matching. Hashing is the
tem receives a suspected file for malware detection, it
way to solve time and space complexity issues.
generates a query to find Jaccard similarities. LSH algo-
Hashing is a function to convert the file into a small
rithm achieves the sub-linear time complexity by reducing
signature. Instead of storing the complete file in memory,
the number of comparisons needed to find the similarity
the hash value of the file is small enough to fit into memory
between two objects. LSH primarily differs from the con-
and greatly reduces the space complexity of the system. If
ventional hashing technique in the sense that the normal
H is a hash function, f 1 and f 2 are two sample file, the
hashing technique tries to avoid hash collision, whereas
similarity index using the hash function can be represented
LSH aims to maximize collision for similarity in the items.
as.
In-depth detail and working principles of Min-Hash algo-
• If similarityðf 1 ,f 2 Þ is high then rithm and locality-sensitive hashing for example can be
probabilityðH ðf 1 Þ ¼¼ Hðf 2 Þ) is high.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

Scalable malware detection system using big data and distributed machine learning approach 3997

found in (Bryłkowski 2017; Pagani et al. 2018; Hashing 8.1.1 String features
2017).
Some of the well-known locality-sensitive hashing String feature extraction analysis is the most common
algorithms are SSDEEP, SDHASH, and TLSH. In this approach in malware static analysis. Extracting these fea-
experiment, we have used TLSH for malware detection. tures involved extracting and analyzing the exe-
TLSH can be used to inspect a large amount of malware in cutable files’ function names, comments, messages, import
very less time. TLSH uses threshold-based Hierarchical commands, etc. to determine the important features.
Agglomerative Clustering (HAC-T), which can cluster However, the string feature analysis is effective only for
hash digest in a scalable manner. The clustering techniques unencrypted files.
used in TLSH can cluster digest in O(nlogn) time on
average which significantly improves the malware detec- 8.1.2 PE header features
tion performance. The complete technical details for TLSH
can be found in (Dell’Amico 1910; Oliver et al. 2020; The set of features were extracted from the file’s
Technical Overview 2021). portable executable headers. Executable files have a com-
mon format called Common Object File Format (COFE).
Portable Executable (PE) format is one such COFE format
8 Experimental setup available for the Windows executables. PE format is
actually a data structure that tells the Windows OS loader
One of the main objectives of this research was to exper- about the information required to manage the wrapped
iment with the parallel machine learning approach for fast executable files. It contains valuable information such as
training. Hence multiple nodes were configured with references to DLL libraries needed to be imported and
Apache Spark. The system was installed with a 64-bit exported, code, data required to run an executable, and the
Ubuntu operating system. The configuration details of the resources needed by the executable, etc.
head node and worker nodes are mentioned in Table 3.

8.1 Datasets 9 Experimental analysis

We collected malware datasets from various sources. We Spark machine learning provides a suite of metrics
took 7825 Malware samples from VirsusShare and 8127 (Table 4) to evaluate the performance of machine learning
samples from VirusTotal and theZoo. These samples con- models.
sist of malware of different families. Malware analysis We analyzed the performance of the proposed spark-
datasets from IEEE data repository were also used for based scalable malware system using various classical
analysis. It includes approx. 1000 portable executable (PE) machine learning algorithms. We did three different types
header. Some malware samples were also collected from of analysis i.e., Training Time, Detection Time, and
Malwr, Lenny Zelter, and Contagio malware repositories. Accuracy for various classical machine learning
2128 benign files were collected from various application algorithms.
and utility software like Media Player, Adobe Reader,
Image Editing Software, Office Utility Software, Gaming
Applications, Network Tools, etc.
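As an illustration of the similarity machinery described in Sect. 7.2 (shingling, the Jaccard index of Eq. 1, Min-Hash, and LSH bucketing), the following is a minimal sketch in Python. It is not the TLSH implementation used in the experiments. The function names, the shingle length k = 4, the 128 hash functions, the 32 bands, and the sample byte strings are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(data: bytes, k: int = 4) -> set:
    """Return the set of overlapping k-byte shingles of data."""
    if len(data) < k:
        return {data}
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index |A & B| / |A | B| (Eq. 1)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def seeded_hash(shingle: bytes, seed: int) -> int:
    """64-bit hash of a shingle; each seed selects a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set: set, num_hashes: int = 128) -> list:
    """Keep, for every hash function, the minimum hash value over the set."""
    return [min(seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of equal signature positions estimates the Jaccard index."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures: dict, bands: int = 32) -> set:
    """Return pairs of names whose signatures collide in at least one band."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        pairs.update((x, y) for i, x in enumerate(ordered)
                     for y in ordered[i + 1:])
    return pairs


# Hypothetical byte strings standing in for a known sample, a slightly
# modified variant of it, and an unrelated file.
known = b"MZ" + b"push ebp; mov ebp, esp; call decrypt_payload; jmp loader;" * 20
variant = known.replace(b"decrypt_payload", b"decrypt_payl0ad", 1)
unrelated = bytes(range(256)) * 5

sets = {"known": shingles(known),
        "variant": shingles(variant),
        "unrelated": shingles(unrelated)}
sigs = {name: minhash_signature(s) for name, s in sets.items()}

print("exact Jaccard:    ", jaccard(sets["known"], sets["variant"]))
print("Min-Hash estimate:", estimate_jaccard(sigs["known"], sigs["variant"]))
print("LSH candidates:   ", lsh_candidates(sigs))
```

Only pairs that share a bucket need to be compared in detail, which is how LSH avoids comparing a suspected file against every signature in the database.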

Table 3 System setup and configuration

                    Head node                  Worker node
Number of nodes     2                          4
Processor           Intel Core i7 (5.0 GHz)    Intel Core i7 (5.0 GHz)
Memory (RAM)        16 GB                      8 GB
Storage             1 TB                       1 TB
Operating system    Ubuntu 20.04               Ubuntu 20.04
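The string and PE header features of Sects. 8.1.1 and 8.1.2 can be illustrated with a short sketch that uses only the Python standard library. It is a simplified illustration, not the feature extractor used in the experiments: the function names and the synthetic test image are hypothetical, and only the COFF file header that follows the PE signature is parsed.

```python
import re
import struct


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes as strings."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def coff_header(data: bytes) -> dict:
    """Parse the COFF file header of a PE image held in memory."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The 4-byte field at offset 0x3C of the DOS header points to the
    # PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    fields = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    names = ("machine", "number_of_sections", "time_date_stamp",
             "pointer_to_symbol_table", "number_of_symbols",
             "size_of_optional_header", "characteristics")
    return dict(zip(names, fields))


# Synthetic, hypothetical image: a 64-byte DOS header, the PE signature,
# a COFF header, and two import-like strings.
dos = bytearray(0x40)
dos[:2] = b"MZ"
struct.pack_into("<I", dos, 0x3C, 0x40)
coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 0xF0, 0x22)
image = bytes(dos) + b"PE\0\0" + coff + b"\x00KERNEL32.dll\x00LoadLibraryA\x00"

header = coff_header(image)
print(header["machine"], header["number_of_sections"])
print(printable_strings(image))
```

On a packed or encrypted sample the string list would mostly contain noise, which is consistent with the limitation noted in Sect. 8.1.1.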


Table 4 Evaluation metrics

Measure      Description                                                                        Formula
Accuracy     Proportion of all samples (positive and negative) that are classified correctly    $AC = \frac{TP + TN}{TP + FP + TN + FN}$
Precision    Proportion of samples labeled positive that are truly positive                     $P = \frac{TP}{TP + FP}$
Recall       Proportion of truly positive samples that are labeled positive                     $R = \frac{TP}{TP + FN}$
F-measure    Harmonic mean of Precision and Recall                                              $FM = \frac{2PR}{P + R}$

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives
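The formulas in Table 4 can be written as a few small functions. The sketch below is illustrative only: the function names and the confusion-matrix counts are hypothetical. The last call applies the F-measure formula to the Naïve Bayes precision and recall reported in Table 7.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """AC = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(p: float, r: float) -> float:
    """FM = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, tn, fp, fn = 90, 85, 5, 20
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn), p, r, f_measure(p, r))

# Precision and recall of Naive Bayes as reported in Table 7.
print(round(f_measure(1.000, 0.375), 3))
```

Applying the formula to the other rows of Table 7 is a quick way to check that the reported F1-scores agree with the reported precision and recall.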

Table 5 Training time of machine learning algorithms on stand-alone system (in hours)

Number of files   Decision tree   Logistic model   Random forest   SVM      Naïve Bayes   Gradient boosting   KNN
2000              7.2             8.73             6.48            9.72     5.01          26.73               6.48
4000              10.86           9.69             11.94           28.68    6.93          29.01               8.61
6000              12.72           13.56            12.66           42.63    12.36         30.69               13.68
8000              14.91           20.34            13.59           64.29    21.96         32.91               17.01
10,000            15.96           21.69            14.79           87.93    33.93         34.71               25.29
12,000            17.73           23.67            16.92           93.96    49.23         36.09               27.63
14,000            18.69           26.01            21.33           111.81   80.13         38.01               34.02
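To make the comparison between the stand-alone training times (Table 5) and the distributed training times (Table 6) concrete, the short sketch below computes the speedup, i.e., the stand-alone time divided by the distributed time, for the 14,000-file rows of both tables. The variable names are hypothetical; the numbers are copied from the two tables.

```python
# Training times in hours at 14,000 files, copied from Tables 5 and 6.
standalone = {"Decision tree": 18.69, "Logistic model": 26.01,
              "Random forest": 21.33, "SVM": 111.81, "Naive Bayes": 80.13,
              "Gradient boosting": 38.01, "KNN": 34.02}
distributed = {"Decision tree": 1.4, "Logistic model": 2.1,
               "Random forest": 1.5, "SVM": 4.7, "Naive Bayes": 3.2,
               "Gradient boosting": 4.6, "KNN": 3.1}

# Speedup = stand-alone training time / distributed training time.
speedup = {name: standalone[name] / distributed[name] for name in standalone}

for name, factor in sorted(speedup.items(), key=lambda item: -item[1]):
    print(f"{name:18s} {factor:5.1f}x")
```

With these figures, every algorithm trains at least eight times faster on the Spark cluster than on the stand-alone system.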

9.1 Training time analysis for machine learning models

Training time analysis for the classical machine learning algorithms was conducted with a dataset containing 14,000 samples with 70 attributes. The training process was started with 2000 sample files, and the number of files was subsequently increased up to 14,000. Training a machine learning model is a time-consuming process.

Table 5 and Fig. 7 show the training time of various classical machine learning models on a stand-alone system. The stand-alone system runs the Ubuntu 20.04 operating system on an Intel Core i7 (5.0 GHz) processor with 16 GB RAM. Table 5 shows that the training time of each algorithm increases as the number of sample files grows.

Fig. 7 Comparison of machine learning algorithm’s training time on stand-alone system

We analyzed the performance of the machine learning algorithms on Apache Spark with 2 head nodes and 4 worker nodes. The hardware details of the head nodes and worker nodes are given in Table 3. The training time of the classical machine learning models on the parallel system configured with Apache Spark is shown in Table 6 and Fig. 8. Table 6 shows that the parallel system significantly reduces the training time compared to the stand-alone system. For example, at 14,000 files the Random Forest training time falls from 21.33 h to 1.5 h, and the SVM training time falls from 111.81 h to 4.7 h.

Table 6 Training time of machine learning algorithms on distributed systems (in hours)

Number of files   Decision tree   Logistic model   Random forest   SVM    Naïve Bayes   Gradient boosting   KNN
2000              0.3             0.5              0.5             0.6    0.4           1.5                 0.6
4000              0.9             0.7              0.8             1.2    0.7           2.1                 0.9
6000              1               1.2              0.9             1.8    1.1           2.9                 1.1
8000              1.1             1.7              1.1             2.3    1.8           3.4                 1.5
10,000            1.2             1.9              1.2             2.9    2.3           3.7                 1.9
12,000            1.3             2                1.4             3.1    2.9           4.3                 2.3
14,000            1.4             2.1              1.5             4.7    3.2           4.6                 3.1

Fig. 8 Comparison of training time of machine learning algorithms on distributed parallel systems

9.2 Algorithm accuracy

Several metrics, namely Accuracy, Precision, Recall, and F1-score, are equally important for measuring the performance of a machine learning algorithm. Table 7 shows the Accuracy, Precision, Recall, and F1-score for all the machine learning algorithms we analyzed in our experiment. Figure 9 shows that the overall accuracy of the Random Forest algorithm is better than that of all the other algorithms.

Table 7 Performance of machine learning algorithms

Algorithm           Accuracy   Precision   Recall   F1-Score
Decision Tree       0.935      0.981       0.946    0.964
Logistic Model      0.871      0.944       0.911    0.927
Random Forest       0.998      0.982       1.000    0.991
SVM                 0.871      0.962       0.893    0.926
Naïve Bayes         0.435      1.000       0.375    0.545
Gradient Boosting   0.935      1.000       0.929    0.963
KNN                 0.790      0.939       0.821    0.876

Fig. 9 Comparative performance analysis of machine learning algorithms

The area under the curve (AUC) gives a clearer picture of the performance of the algorithms. The higher the AUC, the better the performance of the model. To keep the graphs clearly readable, two AUC graphs have been drawn. Figure 10 shows the AUC for the Logistic Model, Support Vector Machine (SVM), and Decision Tree (DT). Figure 11 shows the AUC for Random Forest, Naïve Bayes, Gradient Boosting, and KNN.

Fig. 10 Comparative AUC results of Logistic, SVM and DT Model

Fig. 11 Comparative AUC results of Random Forest, Naïve Bayes, Gradient Boosting, and KNN Model

9.3 Malware detection

After analyzing the training performance and accuracy of multiple machine learning algorithms, we found that the Random Forest model performs best. Hence, the Random Forest algorithm was used in the final prototype of the proposed malware analysis and detection system.

Since we have used TLSH in our proposed model for malware detection (Fig. 3), we evaluated the proposed model both with Random Forest alone and with Random Forest combined with TLSH. The malware detection time of the proposed system is shown in Table 8 and Fig. 12. The use of TLSH along with Random Forest significantly improves the performance of the system. In Table 8, TLSH reduces the detection time by roughly 30% to 50%: from 0.042 s to 0.021 s at 2000 files and from 0.053 s to 0.037 s at 14,000 files.

Table 8 Malware detection time of proposed system (in sec.)

Number of files   Random Forest   TLSH with Random Forest
2000              0.042           0.021
4000              0.045           0.023
6000              0.046           0.027
8000              0.048           0.029
10,000            0.049           0.031
12,000            0.051           0.035
14,000            0.053           0.037

Fig. 12 Malware detection time of the proposed malware detection system

10 Comparison of proposed system with existing malware detection techniques

We have compared our results, as shown in Table 9, with the most relevant existing works mentioned in the literature survey. The main objective of our work was to build a scalable and time-efficient system with high accuracy. Most of the recent works have focused more on higher accuracy than on scalability or time efficiency. The work carried out in this paper, however, has collectively focused on scalability, time efficiency, and accuracy, which is a comparatively new line of research. The experimental results show that the proposed system has achieved high accuracy, scalability, and efficiency.

Table 9 Comparative analysis of proposed system with existing systems

Authors and year: Cho et al. (2016)
  Techniques used: Sequence alignment method
  Dataset: VX Heaven Windows virus collection
  Result accuracy: 87%
  Strength: High accuracy and good time efficiency
  Weakness: The research analysis is mainly focused on Windows-based malware variants. Scalability and time efficiency are not considered in the research objectives

Authors and year: Paranthaman and Thuraisingham (2017)
  Techniques used: Random Forest, SVM
  Dataset: CAIDA Dataset, CTU-13 Dataset, ISOT BotNet, Android Malware Genome Project Dataset, ADFA IDS dataset, CS Mining Malicious Software Dataset
  Result accuracy: 96.31%
  Strength: The proposed framework has good time efficiency
  Weakness: Scalability is not considered as a research objective

Authors and year: Cui et al. (2018)
  Techniques used: Deep Learning, CNN, Bat Algorithm
  Dataset: Dataset from the Vision Research Lab (https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis)
  Result accuracy: 94.5%
  Strength: Novel approach. This method transformed the malicious code into grayscale images. Next, the images were identified and classified by a CNN that could extract the features of the malware images automatically
  Weakness: The proposed system needs a unique approach to transforming malicious code into color images. Scalability and time efficiency are not considered in the research objectives

Authors and year: Burnap et al. (2018)
  Techniques used: The Self Organizing Feature Map (SOFM)
  Dataset: VirusTotal
  Result accuracy: 93.76%
  Strength: Good for detection of APTs and polymorphic malware
  Weakness: The system needs to wait for the full execution of the malicious payload. Scalability and time efficiency are not considered in the research objectives

Authors and year: Azmoodeh et al. (2019)
  Techniques used: Deep Learning
  Dataset: VirusTotal and Self-Collected Customized Dataset
  Result accuracy: 99.68%
  Strength: Sustainable against Junk Code Insertion Attacks
  Weakness: Work is mainly focused on IoT and IoBT malware detection. Scalability and time efficiency are not considered in the research objectives

Authors and year: Vinayakumar et al. (2019)
  Techniques used: Deep Neural Network, Classical Machine Learning Algorithms
  Dataset: Publicly available dataset Ember
  Result accuracy: 96.3%
  Strength: The framework is capable of analyzing a large number of malware in real-time. It could be scaled out to analyze an even larger number of malware
  Weakness: The author has claimed that the system is scalable, but the same has not been shown in experimental results

Authors and year: Yuxin and Siyi (2019b)
  Techniques used: Deep Belief Network (DBN)
  Dataset: VirusTotal and Self-Collected Customized Dataset
  Result accuracy: 98.37%
  Strength: The classification result is better and accuracy is higher than the baseline models
  Weakness: Scalability and time efficiency are not considered in the research objectives

Authors and year: Choi (2020)
  Techniques used: Combined kNN Classification and Hierarchical Similarity Hash for Fast Malware Detection
  Dataset: HAURI, Antivirus Company
  Result accuracy: 99.8%
  Strength: The system has significantly reduced the malware detection time using Trend Micro locality-sensitive hashing (TLSH)
  Weakness: The author in the paper has implemented vantage-point (VP) tree using a similarity hash. The search time of the VP tree needs improvement

Authors and year: Chen et al. (2021)
  Techniques used: A Learning-based Static Malware Detection System with Integrated Feature
  Dataset: VirusTotal
  Result accuracy: 99.12%
  Strength: The accuracy of the proposed system is high
  Weakness: The malware detection time of the proposed system is high

Authors and year: Serpanos et al. (2021)
  Techniques used: Random Forest
  Dataset: EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models
  Result accuracy: 99.21%
  Strength: Sisyfos is a modular and extensible platform for malware analysis. It addresses multiple operating systems
  Weakness: The research work is focused on the development of a modular and extensible platform for supporting multiple operating systems. Detection time and scalability are not the research objective

Authors and year: Proposed architecture
  Techniques used: Random Forest with TLSH
  Dataset: VirusShare, VirusTotal, theZoo, IEEE, Malwr, Lenny Zeltser and Contagio malware repositories
  Result accuracy: 99.8%
  Strength: Highly Scalable and Time Efficient
  Weakness: The malware detection time of the proposed system is almost half compared to the existing model proposed by other researchers. However, the proposed system is based on Apache Spark which requires higher RAM capacity

11 Conclusion

Cybersecurity is a global concern. The malware landscape has grown in parallel with software and emerging technologies. The cat-and-mouse race between anti-malware and malware developers has existed for a long time and seems set to continue. Malware developers are as aggressive as anti-malware developers. In such a scenario, when we know that malware developers are putting their best effort into causing damage, we need to continuously evolve our defense strategies. The objective of the big data and machine learning-based malware detection system is to be able to combat next-generation malware attacks. Currently, major anti-malware organizations receive millions of samples a day for scanning and detection. It is quite challenging to process such a large number of samples with a short response time. Manually creating and updating the signature database is not a feasible approach to combat a flood of novel and complex malware attacks. The objective of using a machine learning model in a malware detection system is to make the system learn and evolve by itself over time. The malware detection architecture discussed in this paper has addressed various challenges, i.e., scalability, time efficiency, and accuracy. The proposed scalable malware detection system using a big data, machine learning, and locality-sensitive hashing approach could be a potential solution for combating the growing malware attacks in almost real time. The experimental results shown in the paper are quite encouraging and satisfactory. We are further working on a similar approach for scalable malware detection systems using dynamic malware analysis and deep learning.

Author contributions The paper is authored by a single author, and all the work in the paper was carried out by him.

Funding Not applicable.

Availability of data and material Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Code availability Not applicable.

Declarations

Conflict of interest The author hereby declares that they have no conflict of interest. No research grant or fund has been received from any agency to carry out the research work discussed in the manuscript.

Ethical approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Human and animal rights This article does not contain any studies with human participants or animals performed by any of the authors.

References

Agarkar S, Ghosh S (2020) Malware detection & classification using machine learning. 2020 IEEE International Symposium on Sustainable Energy, Signal Processing and Cyber Security (ISSSC). https://doi.org/10.1109/isssc50941.2020.9358835
Al Ahmadi BA, Martinovic I (2018) MalClassifier: Malware family classification using network flow sequence behavior. 2018 APWG Symposium on Electronic Crime Research (eCrime), San Diego, CA, pp 1–13. https://doi.org/10.1109/ECRIME.2018.8376209
Ali M, Hagen J, Oliver J (2020) Scalable malware clustering using multi-stage tree parallelization. IEEE Int Conf Intell Secur Informatics (ISI) 2020:1–6. https://doi.org/10.1109/ISI49825.2020.9280546
Anderson HS, Kharkar A, Filar B, Roth P (2017) Evading machine learning malware detection. Black Hat
Azmoodeh A, Dehghantanha A, Choo KKR (2018) Robust malware detection for internet of (Battlefield) things devices using deep eigenspace learning. IEEE Trans Sustain Comput 4(1):88–95. https://doi.org/10.1109/TSUSC.2018.2809665
Bermejo Higuera J, Abad Aramburu C, Bermejo Higuera JR, Sicilia Urban MA, Sicilia Montalvo JA (2020) Systematic approach to malware analysis (SAMA). Appl Sci 10(4):1360. https://doi.org/10.3390/app10041360
Bryłkowski H (2017) Locality sensitive hashing - LSH explained. Medium. Brainly Engineering. https://medium.com/engineering-brainly/locality-sensitive-hashing-explained-304eb39291e4
Burnap P, French R, Turner F, Jones K (2018) Malware classification using self organising feature maps and machine activity data. Comput Secur 73:399–410. https://doi.org/10.1016/j.cose.2017.11.016
Catak FO (2019) Malware API call dataset. IEEE Dataport. https://doi.org/10.21227/crfp-kd68
Chen Z, Zhang X, Kim S (2021) A learning-based static malware detection system with integrated feature. Intell Autom Soft Comput 27(3):891–908


Cho IK, Kim TG, Shim YJ, Ryu M, Im EG (2016) Malware analysis and classification using sequence alignments. Intell Autom Soft Comput 22(3):371–377. https://doi.org/10.1080/10798587.2015.1118916
Choi S (2020) Combined kNN classification and hierarchical similarity hash for fast malware detection. Appl Sci 10(15):5173. https://doi.org/10.3390/app10155173
Cui Z, Xue F, Cai X, Cao Y, Wang G, Chen J (2018) Detection of malicious code variants based on deep learning. IEEE Trans Industr Inf 14(7):3187–3196. https://doi.org/10.1109/TII.2018.2822680
Dell’Amico M (2019) Fishdbc: Flexible, incremental, scalable, hierarchical density-based clustering for arbitrary data and distance. arXiv preprint 1910.07283
Gupta S (2019) Locality sensitive hashing. Medium. Towards Data Science. https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
Gupta D, Rani R (2018) Big data framework for zero-day malware detection. Cybern Syst 49(2):103–121. https://doi.org/10.1080/01969722.2018.1429835
Hordri NF, Ahmad NA, Yuhaniz SS, Sahibuddin S, Ariffin AF, Saupi NA, Zamani NA, Jeffry Y, Senan MF (2018) Classification of malware analytics techniques: a systematic literature review. Int J Secur Appl 12(2):9–18
Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) HinDroid: An intelligent android malware detection system based on structured heterogeneous information network. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1507–1515. https://doi.org/10.1145/3097983.3098026
Kaspersky-Lab-Whitepaper-Machine-Learning. Accessed March 23, 2020. https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf
Kolosnjaji B, Demontis A, Biggio B, Maiorca D, Giacinto G, Eckert C, Roli F (2018) Adversarial malware binaries: evading deep learning for malware detection in executables. In 2018 26th European Signal Processing Conference (EUSIPCO), pp 533–537. IEEE
Li J, Sun L, Yan Q, Li Z, Srisa-an W, Ye H (2018) Significant permission identification for machine-learning-based android malware detection. IEEE Trans Industr Inf 14(7):3216–3225. https://doi.org/10.1109/TII.2017.2789219
Masabo E, Kaawaase KS, Sansa-Otim J (2018) Big data. Proceedings of the 2018 International Conference on Software Engineering in Africa - SEiA 18. https://doi.org/10.1145/3195528.3195533
Naderi H, Vinod P, Conti M, Parsa S, Alaeiyan MH (2019) Malware signature generation using locality sensitive hashing. Commun Comput Inf Sci Secur Privacy. https://doi.org/10.1007/978-981-13-7561-3_9
Oliveira A (2019) Malware analysis datasets: Top-1000 PE imports. IEEE Dataport. https://doi.org/10.21227/004e-v304
Oliver J, Ali M, Hagen J (2020) HAC-T and Fast Search for Similarity in Security. 2020 International Conference on Omni-Layer Intelligent Systems (COINS). https://doi.org/10.1109/coins49042.2020.9191381
Pagani F, Dell’Amico M, Balzarotti D (2018) Beyond Precision and recall: Understanding uses (and misuses) of similarity hashes in binary analysis. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy (CODASPY ’18). Association for Computing Machinery, New York, NY, USA, 354–365. https://doi.org/10.1145/3176258.3176306
Paola A De, Lo Re G (2020) A hybrid system for malware detection on big data - IEEE Conference Publication. Accessed March 23. https://ieeexplore.ieee.org/document/8406963/
Paranthaman R, Thuraisingham B (2017) Malware collection and analysis. 2017 IEEE International Conference on Information Reuse and Integration (IRI), San Diego, CA, pp 26–31. https://doi.org/10.1109/IRI.2017.92
Poudyal S, Akhtar Z, Dasgupta D, Gupta KD (2019) Malware analytics: review of data mining, machine learning and big data perspectives. 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, pp 649–656. https://doi.org/10.1109/SSCI44817.2019.9002996
Rathore H, Agarwal S, Sahay SK, Sewak M (2019) Malware detection using machine learning and deep learning. arXiv.org. https://arxiv.org/abs/1904.02441v1
Serpanos D, Michalopoulos P, Xenos G, Ieronymakis V (2021) Sisyfos: A modular and extendable open malware analysis platform. Appl Sci 11(7):2980. https://doi.org/10.3390/app11072980
Smart Whitelisting Using Locality Sensitive Hashing (2017) Trend Micro. https://www.trendmicro.com/en_us/research/17/c/smart-whitelisting-using-locality-sensitive-hashing.html
TLSH - Technical Overview (2021) TLSH Technical Overview. https://tlsh.org/papers.html
Ullah F, Babar MA (2019) Architectural tactics for big data cybersecurity analytics systems: a review. J Syst Softw 151:81–118. https://doi.org/10.1016/j.jss.2019.01.051
Venkatraman S, Alazab M (2018) Use of data visualisation for zero-day malware detection. Secur Commun Netw 2018:1–13. https://doi.org/10.1155/2018/1728303
Vinayakumar R, Soman K (2018) Deepmalnet: evaluating shallow and deep networks for static pe malware detection. ICT Express 4(4):255–258
Vinayakumar R, Alazab M, Soman KP, Poornachandran P, Venkatraman S (2019) Robust intelligent malware detection using deep learning. IEEE Access 7(2019):46717–46738. https://doi.org/10.1109/access.2019.2906934
Wassermann S, Casas P (2018) Bigmomal. Proceedings of the 2018 Workshop on Traffic Measurements for Cybersecurity - WTMC 18. https://doi.org/10.1145/3229598.3229600
Wu Q, Zhu X, Liu B (2021) A survey of android malware static detection technology based on machine learning. Mob Inf Syst 2021:1–18. https://doi.org/10.1155/2021/8896013
Ye Y, Li T, Adjeroh D, Iyengar SS, et al. (2017) A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR). https://doi.org/10.1145/3073559
Yuxin D, Siyi Z (2019a) Malware detection based on deep learning algorithm. Neural Comput Appl 31(2):461–472
Yuxin D, Siyi Z (2019b) Malware detection based on deep learning algorithm. Neural Comput Appl 31:461–472. https://doi.org/10.1007/s00521-017-3077-6

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

