A Comprehensive Review On Malware Detection Approaches
ABSTRACT According to recent studies, malicious software (malware) is increasing at an alarming
rate, and some malware can hide in the system by using different obfuscation techniques. In order to protect
computer systems and the Internet from malware, malware needs to be detected before it affects a
large number of systems. Recently, several studies have been conducted on malware detection approaches.
However, the detection of malware still remains problematic. Signature-based and heuristic-based detection
approaches are fast and efficient at detecting known malware, but the signature-based approach in particular
fails to detect unknown malware. On the other hand, behavior-based, model checking-based, and
cloud-based approaches perform well for unknown and complicated malware, and deep learning-based,
mobile devices-based, and IoT-based approaches have also emerged to detect some portion of known and unknown
malware. However, no approach can detect all malware in the wild. This shows that building an effective
method to detect malware is a very challenging task, and there is a huge gap for new studies and methods.
This paper presents a detailed review of malware detection approaches and recent detection methods which
use these approaches. The goal of the paper is to give researchers a general idea of the malware detection
approaches, the pros and cons of each detection approach, and the methods that are used in these approaches.
INDEX TERMS Cyber security, malware classification, malware detection approaches, malware features.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 6249
Ö. Aslan, R. Samet: Comprehensive Review on Malware Detection Approaches
TABLE 1. Traditional versus new generation malware.
incrementally. According to scientific and business reports, approximately 1 million malware files are created every day, and cybercrime will damage the world economy by approximately $6 trillion annually by 2021 [1]. Recent studies show that mobile malware is on the rise. According to the McAfee mobile threat report, there is a huge increase in backdoors, fake applications, and banking Trojans for mobile devices [2]. In addition, malware attacks related to social media, the healthcare industry, cloud computing, the internet of things (IoT), and cryptocurrencies are also on the rise. According to Cybersecurity Ventures, ransomware will cost around $11.5 billion globally by the end of 2019 [1].
To protect legitimate users and companies from malware, malware needs to be detected. Malware detection is the process of determining whether a given program has malicious intent or not. In the early days, the signature-based detection approach was widely used to detect malware. However, this approach has limitations; for example, it cannot detect unknown and new generation malware. Over time, researchers proposed new approaches, including behavior-, heuristic-, and model checking-based detection. Along with these approaches, data mining and machine learning (ML) algorithms have also come into wide use in malware detection. Recently, new approaches have been proposed, such as deep learning-, cloud-, mobile devices-, and IoT-based detection. For known malware and some unknown malware, the heuristic detection approach performs well. On the other hand, for unknown and complicated malware, behavior-, model checking-, and cloud-based approaches perform better. Deep learning-, mobile devices-, and IoT-based approaches have also emerged to detect some portion of known and unknown malware. It has not been proved that one detection approach is more effective than the others. This is because each method has its own advantages and disadvantages, and in different situations one method can detect better than another. Even though several new methods have been proposed by using different malware detection approaches, no method could detect all new generation and sophisticated malware. This shows that building an effective method to detect malware is a very challenging task, and there is a huge demand for new studies and methods.
This paper presents a literature review in order to investigate the current state of malware detection approaches. The paper makes the following contributions:
• Explains new technological trends for malware creation and new approaches to detect malware.
• Investigates the probability of detecting malware.
• Presents a summary of the current studies on malware detection.
• Explains important approaches and methods for malware detection.
• Discusses current challenges and proposes new assumptions for malware detection approaches.
• Provides a systematic overview of malware detection approaches and methods for further studies.
The rest of the paper is organized as follows: Section II presents the problem definition. Malware detection techniques and algorithms are explained in Section III, and malware detection approaches are explained in Section IV. An evaluation of malware detection approaches is presented in Section V. Finally, the conclusion and future work are given in Section VI.
II. PROBLEM DEFINITION
This section investigates the problem of malware and the possibility of detecting it. It can be said that it is impossible to design an algorithm which can detect all malware. This is because the problem of detecting malware has been shown to be NP-complete in many studies. This matters because, before starting to build an effective detection system, it is good practice for a researcher to understand the scope, limitations, and possibilities of a malware detector. Malware detection remains problematic because it is a hard problem in theory, and in practice malware creators use complicated techniques such as obfuscation to make the detection process very challenging.
A. DIFFICULTY OF PROBLEM IN THEORY
Since the first malware that appeared in the wild was a virus, most theoretical studies were based on the detection of viruses. According to early studies, the detection of viruses is impossible [3]–[5] and NP-complete [6]–[9]. According to F. Cohen, the detection of computer viruses is undecidable because the detection process itself contains a contradiction [3], [5], [6]. If the detection problem is seen as a decision-making problem, a decision-maker D will decide whether P is a virus or not. According to Cohen, it cannot be decided whether P is a virus because if P is a virus, it will be marked by D as a virus and will not be able to make changes to other programs, so it will not act as a virus. If D does not identify P as a virus, P will interact with other programs to spread and infect them.
This decision process involves a contradiction, and therefore it is not possible to identify P as a virus. According to M. Chess and R. White, there is no program that detects all viruses without false positives (FPs) because viruses are polymorphic and can exist in different forms [5]. According to M. Adleman, detecting a virus is quite intractable and almost impossible [7]. This is because, according to Gödel numberings of the partial recursive functions, it is not possible to create a detecting mechanism. Reliably identifying a bounded-length mutating virus is shown to be NP-complete in [8]. According to the author, a virus detector for a certain virus strain can be used to solve the satisfiability problem. Since the satisfiability problem is known to be NP-complete, the detection of such malware is NP-complete as well. Zuo et al. claim that there exist computer viruses whose detecting procedures have sufficiently large time complexity, and there are undecidable viruses which have no minimal detecting procedure [9].
B. DIFFICULTY OF PROBLEM IN PRACTICE
New generation malware uses common obfuscation techniques such as encryption, oligomorphic, polymorphic, metamorphic, stealth, and packing methods to make the detection process more difficult. This kind of malware can easily bypass protection software that runs in kernel mode, such as firewalls and antivirus software, and some malware instances can also present the characteristics of multiple classes at the same time. This makes it practically almost impossible to detect all malware with a single detection approach. The common obfuscation techniques are defined as follows:
• Encryption: Malware uses encryption to hide the malicious code block within its code [10]. Hence, the malware becomes invisible in the host.
• Oligomorphic: In the oligomorphic method, a different key is used when encrypting and decrypting the malware payload [11]. Thus, it is more difficult to detect malware which uses the oligomorphic method than malware which uses plain encryption.
• Polymorphic: In the polymorphic method, malware uses a different key to encrypt and decrypt [12], like the key used in the oligomorphic method. However, the encrypted payload portion contains several copies of the decoder and can be encrypted in layers [13]. Thus, it is more difficult to detect polymorphic malware than oligomorphic malware.
• Metamorphic: The metamorphic method does not use encryption. Instead, it uses dynamic code hiding, in which the opcode changes on each iteration when the malicious process is executed [14]. It is very difficult to detect such malware because each new copy has a completely different signature.
• Stealth: The stealth method, also called code protection, implements a number of counter techniques to prevent the malware from being analyzed correctly [11]. For instance, it can make changes on the system and keep them hidden from detection systems.
• Packing: Packing is an obfuscation technique that compresses malware to prevent detection or hides the actual code by using encryption [15], [16]. With this technique, malware can easily bypass firewalls and antivirus software. Packed malware needs to be unpacked before being analyzed. Packers can be divided into 4 groups: compressors, crypters, protectors, and bundlers.
In this section, the limitations of malware detection systems have been summarized. Current studies demonstrate that it is almost impossible to write an algorithm to detect all malware. This is because the computational complexity of malware is not clear, and the malware detection problem has been proved to be NP-complete. Besides, the use of new techniques (obfuscation and packing) during malware creation also makes the detection process more challenging.
III. MALWARE DETECTION TECHNIQUES AND ALGORITHMS
In recent years, data mining and ML algorithms have been used extensively for malware detection. Malware detection is the process of investigating the content of a program and deciding whether the analyzed program is malware or benign. The malware detection process includes 3 stages: malware analysis, feature extraction, and classification.
A. MALWARE ANALYSIS
In order to understand the content and behaviors of malware, it needs to be analyzed. Malware analysis is the process of determining the functionality of malware and answers questions such as the following [17], [18]: how the malware works, which machines and programs are affected, and which data is being damaged or stolen. There are mainly two techniques to analyze malware: static and dynamic [17]. Static analysis examines the malware without running the actual code [19]. On the other hand, dynamic analysis examines the behavior of the malware while its code is running. Malware analysis starts with basic static analysis and finishes with advanced dynamic analysis. The malware is analyzed by using reverse engineering [20] and other malware analysis tools to represent the malware in a different format. The reverse engineering process can be seen in Figure 1.
FIGURE 1. A flow chart of reverse engineering process.
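As a minimal sketch of what basic static analysis can look like in practice, the following Python example computes a file hash and extracts printable strings from raw bytes without executing them. The function names, the sample byte blob, and the minimum string length are illustrative assumptions and are not taken from any of the cited studies.

```python
import hashlib
import re
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII characters of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def static_report(data: bytes) -> Dict[str, object]:
    """Collect simple static attributes; the code under analysis is never run."""
    return {
        "size": len(data),
        "sha256": sha256_hex(data),
        "strings": extract_strings(data),
    }


# Hypothetical byte blob standing in for a suspicious file (assumption).
SAMPLE = (
    b"MZ\x90\x00" + b"\x00" * 8
    + b"CreateRemoteThread\x00" + b"\x01\x02"
    + b"http://example.invalid/payload\x00"
)

if __name__ == "__main__":
    for key, value in static_report(SAMPLE).items():
        print(key, "=", value)
```

Attributes of this kind (hashes, embedded API names, URLs) are typical inputs for the feature extraction stage; packed or encrypted samples would hide most strings, which is one reason static analysis is usually followed by dynamic analysis.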
B. MALWARE FEATURE EXTRACTION
Malware features are extracted by using data mining techniques. Data mining is the process of extracting new, meaningful, previously unknown information from large datasets or databases. In recent years, new models and datasets have been created by using data mining [21]. There are different models, such as the n-gram and graph models, to create malware datasets and features.
1) THE n-gram MODEL
The n-gram is a feature extraction technique which has been widely used in many areas, including malware detection. The n-gram model can use both static and dynamic attributes to create features. To create features from behaviors, the n-gram model groups the system calls or application programming interfaces (APIs) in consecutive order by a specified n value (n = 2, n = 3, n = 4, n = 6, etc.). Although the n-gram model has been widely used in malware detection, it has some drawbacks when determining features. This is because sequential static and dynamic attributes are not necessarily related to one another. This makes classification and clustering more challenging in later processes. Besides, the n-gram model generates an enormous feature space, which increases the analysis time and decreases model performance. For these reasons, there is a huge demand for new models that achieve better performance than n-gram.
2) GRAPH-BASED MODEL
The graph-based model is another commonly used technique to generate features. System calls made in this method are converted into a graph G(V, E) such that V represents the nodes, which identify system calls, and E represents the edges, which identify the relationships among the system calls. Since the size of the graph increases over time, subgraphs can be used to describe the graph. The subgraph problem is described in many studies as NP-complete. This means that it requires a lot of time to define each subgraph. After the whole graph is expressed with fewer nodes and edges, the programs are identified as malicious or benign.
3) MALWARE DATASET
As in other research areas, there are not many previously published datasets that are accepted and widely used for malware detection. In addition, most of the existing datasets are not accessible for research, and in most cases the accessible datasets are not in appropriate formats for data mining processes and ML algorithms. The datasets used in malware analysis can be listed as follows: NSL-KDD, Drebin, Microsoft malware classification challenge, ClaMP (Classification of Malware with PE headers), AAGM, and EMBER.
• NSL-KDD dataset (2009): It is an updated version of the KDD’99 dataset which consists of approximately 125,000 records and 41 features [22]. It covers network-related attacks and is used for intrusion detection systems.
• Drebin dataset (2014): This dataset was created for smartphones to examine the effectiveness of existing antivirus software [23]. It consists of 5560 malware samples across 20 families and 123,453 benign samples.
• Microsoft malware classification challenge dataset (2015): It was published by Microsoft and consists of 20,000 malware samples [24]. The malware has been analyzed using the IDA disassembler, and the output should be processed using data mining prior to ML.
• ClaMP (Classification of Malware with PE headers) dataset (2016): It consists of 5184 records and has 55 properties [25]. The dataset uses API arrays and contains examples of malicious and benign software with their features.
• AAGM dataset (2017): It is a network-based dataset for Android malware [26]. It consists of 400 malware and 1500 benign samples from 12 families [26].
• EMBER dataset (2018): It consists of 1 million records and holds malware and benign features [27].
These datasets can be used by researchers who want to gain some experience before proposing a new malware detection approach.
C. MALWARE CLASSIFICATION
Machine learning (ML) is a set of algorithms that estimate the outcomes of applications without being explicitly programmed. The purpose of ML is to convert the input data into acceptable value intervals by using statistical analysis. By using ML, many operations can be performed on the data, such as classification, regression, and clustering. ML algorithms have been used in malware detection for many years [28]. Well-known ML algorithms are Bayesian network (BN), naive Bayes (NB), C4.5 decision tree variant (J48), logistic model trees (LMT), random forest (RF), k-nearest neighbor (KNN), multilayer perceptron (MLP), simple logistic regression (SLR), support vector machine (SVM), and sequential minimal optimization (SMO). These algorithms are used especially in behavior-based detection and in some other detection approaches. Although each algorithm has its own advantages and disadvantages, it cannot be concluded that one algorithm is more efficient than another. However, one algorithm can perform better than others depending on the distribution of the data, the number of features, and the dependencies between properties.
IV. MALWARE DETECTION APPROACHES
In recent years, there has been a rapid increase in the number of academic studies on malware detection. In the early days, the signature-based detection method was widely used. This method works fast and efficiently against known malware, but does not perform well against zero-day malware [21], [29]. Over time, researchers have started to use techniques such as behavior-, heuristic-, and
model checking-based detection, as well as new techniques such as deep learning-, cloud-, mobile devices-, and IoT-based detection. An overview of malware detection approaches, features, and techniques can be seen in Figure 2.
In each approach, the feature extraction method differs from the others. It has not been proven that one detection method works better than another because each method has its own advantages and disadvantages. By using behavior-, heuristic-, and model checking-based detection approaches, a huge number of malware samples can be detected with a few behaviors and specifications. In addition, new malware can be detected by using these approaches as well. However, they cannot detect all malware. There is a great need for a method which effectively detects more complex and unknown malware. Before explaining each detection approach in detail, some well-known methods in each detection approach and their related works are summarized in Table 2. Then, a detailed literature review is presented, and the pros and cons of each study are explained.
A. SIGNATURE-BASED MALWARE DETECTION
A signature is a malware feature which encapsulates the program structure and identifies each malware uniquely. The signature-based detection approach is widely used in commercial antivirus software. This approach is fast and efficient at detecting known malware, but insufficient for detecting unknown malware. In addition, malware belonging to the same family can easily escape signature-based detection by using obfuscation techniques. A general view of the signature-based detection schema can be seen in Figure 3.
1) SIGNATURE GENERATION PROCESS
During signature generation, features are first extracted from executables (Figure 3). Then, the signature generation engine generates signatures and stores them in the signature database. When a sample program needs to be marked as malware or benign, the signature of the sample is extracted in the same way as before and compared with the signatures in the database. Based on the comparison, the sample program is marked as malware or benign. There are many different techniques to create a signature, such as string scanning, top-and-tail scanning, entry point scanning, and integrity checking.
• String Scanning: Compares the byte sequence in the analyzed file with the byte sequences previously saved in the database. Byte signatures have been
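The string scanning idea described above can be illustrated with a short Python sketch. It is only a toy under stated assumptions: the signature names and byte sequences are invented for illustration, and matching is a plain substring search rather than the optimized multi-pattern matching a real scanner would use. The XOR helper shows, in the simplest possible form, why an encrypted copy of the same payload no longer matches a stored byte signature.

```python
from typing import Dict, List

# Hypothetical signature database: family name -> byte sequence (assumption).
SIGNATURES: Dict[str, bytes] = {
    "Example.Dropper": bytes.fromhex("deadbeef4d5a"),
    "Example.Keylogger": b"GetAsyncKeyState|log.txt",
}


def scan(data: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose bytes occur in the data."""
    return sorted(name for name, sig in signatures.items() if sig in data)


def xor_encode(data: bytes, key: int) -> bytes:
    """Toy single-byte XOR encoding, standing in for payload encryption."""
    return bytes(b ^ key for b in data)


if __name__ == "__main__":
    sample = b"\x00\x01" + bytes.fromhex("deadbeef4d5a") + b"\x02"
    print("plain sample matches:", scan(sample, SIGNATURES))
    print("encoded sample matches:", scan(xor_encode(sample, 0x5A), SIGNATURES))
```

The second scan finds nothing even though the payload is functionally the same once decoded, which is the weakness of byte signatures against obfuscated malware noted in this section.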
achieved detection rate of 90.83% with a FPR of 0.80%. The authors claim that the proposed schema is resistant to obfuscation techniques, and that it can be used for the generic detection of all types of polymorphic malware rather than being limited to a specific malware type. The authors also claim that the suggested system outperformed state-of-the-art signature generation methods previously reported in the literature, including Tang et al. [64], Newsome et al. [66], and Perdisci et al. [67]. The proposed method is limited to polymorphic malware, and it has been tested on only hundreds of malware samples, which is not enough to determine the performance of the proposed method.
Automatic string signature generation (Hancock) is explained in [41]. According to the paper, the proposed schema can automatically generate high-quality string signatures with minimal FPs and maximal malware coverage. The proposed method uses a set of library code identification techniques and diversity-based heuristic techniques to ensure that the contexts in which a signature is embedded in the containing malware files are similar to one another [41]. Although the authors claim that Hancock can automatically generate string signatures with a FPR below 0.1%, this FPR will change depending on the benign samples that are analyzed. This is because the benign set is constantly growing, and a satisfying result on part of the benign set cannot be generalized to the whole set. Thus, these problems need to be solved in further studies. Santos et al. proposed n-gram-based file signatures to detect malware [68]. First, n-grams are extracted from every known file and used as a file signature. Then, for any unknown instance, n-grams are generated, and by using a measuring function and the k-nearest neighbor algorithm [69], the file is marked as malware or benign. The paper demonstrated that n-gram-based signatures can detect unknown malware to a certain degree.
Efficient signature-based malware detection on mobile devices is proposed in [70]. First, signatures are created. Second, a hash table is used to store the hash values of the signatures to increase scanning speed. Finally, a signature matching algorithm is used to compare the signatures. To eliminate mismatches, the probability of occurrence of signature bytes in non-malicious content is used. According to the authors, the results show that the suggested schema performs well when compared to the Clam-AV scanner, and provides huge memory savings while maintaining fast scanning speed. The proposed system was only compared with the Clam-AV scanner, which is not enough for an overall evaluation. Zheng et al. presented DroidAnalytics, an Android malware analytic system which can automatically collect malware, generate signatures for applications, identify malicious code segments, and associate the malware under study with various malware in the database [71]. In the proposed system, a three-level signature generation schema is used to identify each application. The authors assert that the proposed signature methodology provides significant advantages over traditional cryptographic hash-based signatures such as MD5, and is resistant to packing and mutations. The proposed system has not been compared with other studies in the literature, and the evaluation metrics are not very high and are not explained in detail.
3) EVALUATION OF SIGNATURE-BASED DETECTION
In the literature review, signature-based detection methods have been summarized. The signature-based detection schema has been used by antivirus vendors for many years, and it is quite fast and effective at detecting known malware. This approach is generally used to detect malware which belongs to the same family. However, it fails to detect new generation malware which uses obfuscation and polymorphic techniques. Besides, it is prone to many FPs, and extracting signatures takes a lot of manpower.
Although previous signature-based methods have achieved some success, they are not enough to detect new generation malware. To build an effective signature, the following key points should be taken into consideration:
• The signature should be as short as possible and should represent many malware samples with a single signature,
• An effective automatic signature generation mechanism must be built,
• During signature generation, data mining and ML techniques need to be used more,
• The signature should be resistant to packing and obfuscation techniques.
B. BEHAVIOR-BASED MALWARE DETECTION
The behavior-based malware detection approach observes program behaviors with monitoring tools and determines whether the program is malware or benign. Although the program code may change, the behavior of the program will remain similar; thus, the majority of new malware can be detected with this method [29]. On the other hand, some malware binaries do not run properly in a protected environment (virtual machine, sandbox environment). Hence, malware samples may be incorrectly marked as benign.
1) BEHAVIOR DETECTION PROCESS
When establishing a behavior-based detection system, behaviors are obtained by using one of the following procedures:
• Automatic analysis by using sandbox [18];
• Monitoring of system calls [36], [38];
• Monitoring of file changes [18];
• Comparison of registry snapshots [29];
• Monitoring network activities [17];
• Process monitoring [18].
In behavior-based detection, behaviors are first determined by using one of the techniques listed above, and the dataset is created by extracting features using data mining. Then, specific features are selected from the dataset, and classification is done by using ML algorithms. A general view of the behavior-based schema can be seen in Figure 4.
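The three steps described above (observe behaviors, extract features, classify) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited study: the system call names, training traces, and labels are hypothetical; features are n-grams (n = 2) of the call sequence; and classification is a 1-nearest-neighbor rule using cosine similarity.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Trace = Sequence[str]


def ngrams(calls: Trace, n: int = 2) -> Counter:
    """Count every run of n consecutive system calls in the trace."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(trace: Trace, labelled: List[Tuple[Trace, str]], n: int = 2) -> str:
    """Return the label of the most similar labelled trace (1-NN)."""
    features = ngrams(trace, n)
    scored = ((cosine(features, ngrams(t, n)), label) for t, label in labelled)
    return max(scored, key=lambda pair: pair[0])[1]


# Hypothetical labelled traces, e.g. as recorded in a sandbox (assumption).
TRAINING: List[Tuple[Trace, str]] = [
    (["open", "read", "close"], "benign"),
    (["open", "read", "write", "close"], "benign"),
    (["open", "write", "connect", "send"], "malware"),
    (["regset", "open", "write", "connect", "send", "close"], "malware"),
]

if __name__ == "__main__":
    unknown = ["regset", "open", "write", "connect", "send"]
    print(classify(unknown, TRAINING))
```

Even in this toy setting the number of distinct n-grams grows quickly with n and with the number of different calls, which reflects the large feature space attributed to the n-gram model earlier in the paper.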
Liu et al. used MapReduce to group malware behaviors and detect malware [76]. According to the authors, most of the studies done so far were process-oriented and determined whether a process is malware only by the system calls it invokes. However, most current malware, which is defined as complex malware, consists of several processes and is transmitted to the system by a driver or a DLL [77]. In such cases, malware performs actions on the victim machine by using more than one process instead of only its own process. When only one process is analyzed, malware can be marked as benign. The paper emphasized persistent behaviors by using Auto-Start Extensibility Points (ASEP), and based on these behaviors it differentiated malware from benign software. The experimental results showed that the detection rate (DR) improved on previous research by 28%. However, the proposed method has some limitations, which can be summarized as follows:
• Some malware binaries do not require persistent behavior ASEP,
• Persistent malware behaviors can be completed without using system calls,
• The cost of data transmission has not been measured.
Besides, the results of the proposed method have not been compared with other studies in the literature. Eliminating the above limitations can improve the performance of the method.
A supervised ML model is proposed in [78]. The model applied a kernel-based SVM that used weighting measures, which calculate the frequency of each library call, to detect Mac OS X malware. The DR was calculated as 91% with a 3.9% FP rate. Test results indicated that increasing sample size increased the detection accuracy, but decreased the FPR. Combining static and dynamic features and using other techniques such as fuzzy classification and deep learning can increase the performance.
A graph-based detection schema was defined in [79], [37]. Kolbitsch et al. [79] proposed graph-based detection in which the system calls are converted into a behavior graph, where nodes represent system calls and edges indicate transitions among system calls that show the data dependency. The graph of the program to be marked is extracted and compared with the existing graphs to determine whether the given program is malware. Even though the proposed model performed well for known malware, it has difficulties in detecting unknown malware. A graph-based method which specifies the common behaviors of malware and benign samples is presented in [37]. In the proposed system, kernel objects were determined by system calls, and behaviors were determined according to these objects. According to the authors, the proposed method is scalable and can detect unknown malware with a high DR and low FP rates. In addition, the proposed model is highly scalable regardless of new instances added and robust against system call attacks. However, the proposed method can observe only partial behavior of an executable. Exploring more possible execution paths would improve the accuracy of this method.
Graph-based malware detection using dynamic analysis is proposed in [42]. The proposed schema works on graphs which are constructed from dynamically collected instruction traces of the target executable. Markov chains are used, in which the vertices are the instructions and the transition probabilities are estimated from the data contained in the trace. The authors constructed a similarity matrix which is a combination of graph kernels between the instruction trace graphs, and performed classification by using an SVM on the similarity matrix. The results showed a significant improvement over signature-based and other machine learning-based detection techniques. For the test case, a modified version of the Ether framework was used. The Ether system has some limitations, including:
• Ether is not completely invisible, which means that some intelligent malware can detect it and not show its real behaviors,
• The Ethernet card can be emulated by the underlying Xen system, and string settings can be changed by malware,
• Ether is quite slow for malware analysis.
Using a different framework can increase the performance.
Mojtaba and Hashemi proposed a graph mining method for detecting unknown malware binaries [80]. First, the paper extracted the control flow graph (CFG) from programs and combined it with extracted API calls to have more information about executable files. This new representation model was called API-CFG. Then, the CFGs were converted to a set of feature vectors. Finally, the classification was performed by ML algorithms. According to the authors, the proposed method classified unseen benign and malicious code with high accuracy and outperformed the n-gram-based detection method. However, the paper did not evaluate the performance for obfuscated malware, and also did not compare the results with known methods. Comparing the performance with other graph mining approaches may generate more trustworthy results.
3) EVALUATION OF BEHAVIOR-BASED DETECTION
In the literature review, the behavior-based detection approach and related methods have been summarized. A detection schema based on behaviors consists of 3 steps:
• Determine behaviors (data mining can be used),
• Extract features from behaviors (data mining is used),
• Apply classification (machine learning is used).
Data mining techniques such as n-gram, n-tuple, bag, and graph models have been used to determine the features from behaviors, and distance measures based on probability and statistics, such as Hellinger distance, cosine coefficient, and chi-square, are used to specify similarities among features. The difficulties in defining a behavior, the large number of extracted features (when using n-grams, etc.), and the difficulties in identifying the similarities and differences among the extracted properties have prevented the creation of an effective detection system. Besides, some malware does not run properly within virtual machines/sandboxes, and
advanced code obfuscating techniques prevent malware from being analyzed correctly. The use of new methods and techniques along with ML and data mining algorithms has begun to play a major role in generating meaningful features for malware detection. There is a huge demand for more scientific studies to cover the shortcomings of existing methods. This study has summarized the existing research and makes suggestions to fill the gap.
C. HEURISTIC-BASED MALWARE DETECTION
In recent years, the heuristic-based detection approach has been used frequently [81]. It is a complex detection method which uses experience and different techniques such as rules and ML techniques [10]. Although it detects zero-day malware with a high accuracy rate to a certain degree, it cannot detect complicated malware. The heuristic-based detection schema can be seen in Figure 5.
FIGURE 5. Heuristic-based malware detection schema.
1) RELATED WORKS FOR HEURISTIC-BASED DETECTION
Arnold and Tesauro proposed automatically generated Win32 heuristic virus detection in [82]. They automatically construct multiple neural network classifiers which can detect unknown Win32 viruses. Generally, a heuristic schema has a high FP rate, but the authors claim that by combining the individual classifier outputs using a voting procedure, the risk of FP is reduced to an arbitrarily low level. The study is limited to Win32 viruses and can be extended to other malware. More malware needs to be examined for this method. Expert-designed heuristic features can improve the performance. Yanfang et al. proposed post-processing techniques of associative classification for malware detection [83]. The proposed system greatly reduced the number of generated rules by using rule pruning, rule ranking, and rule selection. This way, the technique did not need to deal with a large database of rules, which also improves the detection time and accuracy rate. According to the paper, the proposed system outperformed popular antivirus software tools such as McAfee VirusScan and Norton AntiVirus, and outperformed data-mining-based detection systems including naive Bayes, support vector machine (SVM), and decision tree techniques. Collecting more API calls, which can provide more information about malware, and identifying complex relationships among the API calls may improve the performance.
Since traditional signature-based anti-virus systems fail to detect polymorphic, metamorphic, and previously unknown malicious executables, heuristic-based malware detection is explained in [84], [85]. Yanfang et al. proposed the intelligent malware detection system (IMDS) [84]. The IMDS used objective-oriented association (OOA) mining that works based on Windows API calls. The method consists of 3 parts: a PE (portable executable) parser, an OOA rule generator, and a rule-based classifier. The PE parser extracted Windows API execution calls from the PE. The OOA Fast_FP-Growth algorithm used the API calls and generated association rules. Finally, based on the association rules, the OOA mining algorithms were performed and executables were marked malicious or benign. The paper claims that the proposed system performed better than other techniques including anti-virus software such as Norton AntiVirus, McAfee VirusScan and KAV, as well as systems using data mining techniques such as naive Bayes, SVM and decision tree. To overcome the disadvantages of signature-based and behavior-based malware detection approaches, B. Zahra et al. proposed a heuristic type of method which can detect malware that cannot be detected by the previous two approaches [85]. The authors applied a learning algorithm to generate a pattern which was similar to a signature. Based on the signature, new suspicious programs were marked malware or benign. The paper mentioned API system calls, operational code (Opcode), n-grams, control flow graph (CFG), and hybrid features that are used extensively in the heuristic approach [85].
A statistical analysis of opcode frequency distributions to identify and differentiate modern (polymorphic and metamorphic) malware is explained in [86]. A total of 67 malware executables were sampled and statically disassembled, and their statistical opcode frequency distributions were compared with the aggregate statistics of 20 non-malicious samples. Test results showed that there is a statistically significant difference in opcode distribution between malware and benign samples. To get more reliable results, more samples need to be analyzed, and the results of the suggested method need to be compared with other well-known heuristic methods. A detection system that combines static and dynamic features has been suggested in [43]. According to the paper, combining static and dynamic features improves the method performance. By combining these features, the feature vector was constructed and classified using ML classifiers. The paper claims that the detection rate of the proposed system is satisfactory and increased when compared to their first study. However, the probability of
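As a rough illustration of the opcode-frequency comparison described for [86], the sketch below computes relative opcode frequencies and a Pearson chi-square statistic between one sample's opcode counts and a reference (for example, benign) distribution. The function names, the toy opcode lists, and the choice of chi-square as the statistic are assumptions made here for illustration; they are not taken from [86].

```python
from collections import Counter


def opcode_frequencies(opcodes):
    """Return the relative frequency of each opcode in a disassembled sample."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def chi_square_statistic(sample_counts, reference_freqs):
    """Pearson chi-square of observed opcode counts against a reference distribution.

    Opcodes that never occur in the reference are ignored (a simplification
    assumed for this sketch).
    """
    total = sum(sample_counts.values())
    stat = 0.0
    for op, p in reference_freqs.items():
        expected = total * p
        if expected > 0:
            observed = sample_counts.get(op, 0)
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: a benign reference profile and a suspicious sample.
benign_profile = opcode_frequencies(["mov", "mov", "push", "call"])
suspicious = Counter(["mov", "mov", "mov", "mov"])
score = chi_square_statistic(suspicious, benign_profile)
```

A larger statistic means the sample's opcode mix deviates more from the reference profile; any decision threshold would have to be calibrated on real data.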
• Malicious behaviors are split across several procedures and cannot be identified unless procedures are inlined, which decreases the method performance,
• The macro does not cover all instruction sets of the x86 architecture.
Eliminating or decreasing these deficiencies will surely improve the performance.
Beaucamps et al. presented rewriting and model checking which capture high-level malware behaviors when detecting malware [89]. The proposed method uses a rewriting-based abstraction mechanism which produces abstracted forms of program traces, independent of the program implementation. It can handle similar behaviors in a generic way and is thus robust with respect to variants. The authors claim that this method can be useful for both static and dynamic analysis. This approach is at an early stage, and only theoretical results are presented in the study. To see the method's efficiency, the proposed method needs to be tested.
Song and Touili proposed a pushdown model-checking method for malware detection [90]. The proposed schema works as follows:
• Binary code is translated into pushdown systems (PDS),
• The paper introduces a stack computation tree predicate logic (SCTPL) to represent the malicious behaviors,
• It provides an algorithm to model-check pushdown systems against SCTPL specifications.
The proposed method reduced the model-checking problem to checking the emptiness of Symbolic Alternating Büchi Pushdown Systems. The authors claim that they obtained encouraging experimental results. However, the suggested method works only if the data in the stack cannot be changed by direct memory access. Identification of Android malware families with model checking is presented in [91]. To show the effectiveness of the suggested system, the most common malware families in the Android environment, DroidKungFu and Opfake, have been analyzed. The suggested algorithm can analyze and verify the Java bytecode that is produced when the source code is compiled. A preliminary investigation has also been conducted to assess the validity of the proposed method. The authors mentioned that test results are promising, and that they can identify malicious payloads with very high accuracy in a reasonable time. The paper has analyzed only a few malware families; extending the analysis and evaluating more malware families will produce more reliable results. Also, investigating the payload family tree can give clues about phylogenies of malware, which will result in better classification.
2) EVALUATION OF MODEL CHECKING-BASED DETECTION
The literature review of the model checking-based detection schema has been summarized. This approach is generally used for program verification and is not used sufficiently for malware detection. Although it is effective at detecting some new malware variants, it is still insufficient to detect all complex malware. Besides, it is a complex and resource-intensive approach which can provide only a limited view of the malware. Identifying behavioral dependencies more accurately, extracting more accurate specifications, and using effective LTL, CTL, and CTPL formulas can improve the performance. The model checking-based detection approach is at an early stage, so more studies need to be done to see the effectiveness of the approach.
E. DEEP LEARNING-BASED MALWARE DETECTION
Deep learning is a subfield of ML that derives from artificial neural networks (ANN), which learn from examples. It is a new approach and is widely used for image processing, driverless cars, and voice control, but it is not used sufficiently in malware detection. Although it is quite effective and reduces feature space drastically, it is not resistant to evasion attacks. The deep learning-based schema can be seen in Figure 7.
FIGURE 7. Deep learning-based malware detection schema.
1) RELATED WORKS FOR DEEP LEARNING-BASED DETECTION
Large-scale malware classification using random projections and neural networks is presented in [92]. In the suggested system, the dimensionality of the original input space was reduced by a factor of 45 (179K/4K). Using the suggested architecture, several very large-scale neural network systems with over 2.6 million labeled samples were trained and achieved classification results with a two-class error rate of 0.49% for a single neural network and 0.42% for an ensemble of neural networks. The authors emphasized that using more hidden layers did not improve the accuracy. For example, a one-layer neural network performed better than two- and three-layer neural networks. Droid-Sec, which uses deep learning-based detection, is proposed in [93]. It used both static and dynamic analysis and extracted more than 200 features. They used an unsupervised pre-training phase and a supervised back-propagation phase. In the pre-training phase, they adopted
the deep belief network (DBN) [94] that utilizes restricted Boltzmann machines (RBM), which is beneficial for better characterizing Android apps. In the back-propagation phase, the pre-trained neural network is fine-tuned with labeled values in a supervised manner. This way, the whole deep learning model is built completely. According to test results, 96% accuracy has been measured, which outperformed SVM, C4.5, LR, and naïve Bayes. Analyzing more apps and automating the analysis processes can be useful to build a more reliable detector.
Deep neural network-based malware detection using two-dimensional binary program features is explained in [50]. The proposed framework consists of 3 main parts:
• In the first part, 4 different types of complementary features are extracted from the benign and malicious binaries,
• In the second part, a deep neural network which consists of an input layer, two hidden layers and an output layer is used,
• In the third part, a score calibrator, which translates the outputs of the neural network, is used, and the probability of the file being malware is measured.
According to the authors, the suggested system achieves a 95% DR at 0.1% FPR over an experimental dataset of over 400,000 software binaries. Even though the proposed approach achieved a high accuracy rate on standard cross-validation, the performance decreased sharply when split validation was used. This can be mitigated by deobfuscating the binary before feature extraction. Besides, the number of benign samples is too small when compared with the number of malware samples analyzed. To get an accurate estimation, more benign samples need to be analyzed.
Huang and Stokes proposed a new multi-task deep learning architecture (multi-task neural network, MtNet) for malware classification [51]. The proposed model is trained with data extracted from dynamic analysis of malicious and benign files. The system is trained on 4.5 million files and tested on a holdout test set of 2 million files. The paper claims that MtNet has made a big improvement compared to a shallow neural architecture. Multi-task learning encourages the hidden layers to learn a more generalized representation at lower levels in the neural architecture. Besides, the MtNet architecture also employs rectified linear unit (ReLU) activation functions and dropout for the hidden layers. ReLU activation functions cut the number of epochs needed for training a binary malware classifier in half, while dropout leads to significant reductions in the test error rate. The main challenge of this study is that it is almost impossible to increase the model performance by adding extra layers. Besides, MtNet is susceptible to attacks and can be evaded. Overcoming these challenges may improve the model performance.
fooled by evasion attacks. Grosse et al. investigated the viability of adversarial crafting against deep neural networks [95]. The authors mentioned that crafted inputs deceive ML models, which results in misclassifications. For evaluation, the DREBIN dataset has been used. They achieved misclassification rates of up to 80% against the neural network, which shows that adversarial crafting is indeed a real threat in security-critical domains. Kolosnjaji et al. investigated the vulnerabilities of malware detection methods that use deep networks to learn from raw bytes [96]. They proposed a gradient-based attack that is capable of evading a recently proposed deep network by changing only a few specific bytes at the end of each malware sample, while preserving its intrusive functionality. According to their test results, adversarial malware binaries evade the targeted network with high probability, even though less than 1% of their bytes are modified.
The literature review of deep learning-based malware detection has been summarized. Even though it is powerful, effective, and reduces feature space drastically, it is not resistant to evasion attacks. Besides, building a hidden layer takes time, and adding extra hidden layers rarely increases the model performance. The deep learning-based malware detection approach is at quite an early stage, so more studies need to be done to characterize this approach more accurately.
F. CLOUD-BASED MALWARE DETECTION
Cloud computing has been developing rapidly because it provides many advantages including easy accessibility, on-request storage, and decreasing costs. Since the cloud has become so popular, it has also been used to detect malware. Cloud-based malware detection enhances the detection performance for PCs and mobile devices with much bigger malware databases and intensive computational resources. Cloud-based detection uses different types of detection agents over the cloud servers and offers security as a service. A user can upload any type of file and receive a report on whether the uploaded file is malware or not. The cloud-based detection schema can be seen in Figure 8.
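The cloud-based workflow described in Section F, in which a client submits a file and receives a verdict, can be sketched minimally as follows. Everything here is a hypothetical stand-in: `VerdictService` is an in-memory placeholder for a remote service, and sending only a SHA-256 digest instead of the full file is a simplification assumed for this sketch, not a description of any reviewed system.

```python
import hashlib


class VerdictService:
    """In-memory stand-in for a cloud-side detection service (hypothetical)."""

    def __init__(self, known_malicious_digests):
        self._known = set(known_malicious_digests)

    def check(self, digest):
        # A real service could run several detection agents; this one only
        # looks the digest up in a set of known-malicious digests.
        return "malware" if digest in self._known else "unknown"


def scan_file_bytes(data, service):
    """Client side: hash the file contents and ask the service for a report."""
    digest = hashlib.sha256(data).hexdigest()
    return {"sha256": digest, "verdict": service.check(digest)}


# Hypothetical usage with toy byte strings instead of real files.
service = VerdictService({hashlib.sha256(b"evil").hexdigest()})
report = scan_file_bytes(b"evil", service)
```

A digest lookup can only recognize files already known to the service; a verdict of "unknown" is not evidence that a file is benign.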
Even though the cloud-based detection approach has many advantages, there are some issues with this detection schema. Some of the disadvantages are the following:
(1) The user needs to upload file contents to the cloud, which can disclose sensitive data such as location, password, and credit card information,
(2) The cloud detection mechanism has some overhead compared with other detection mechanisms, so communication between the client and server must be optimized, especially for IoT and mobile devices.
(3) The lack of real-time monitoring for all files within all locations.
1) RELATED WORKS FOR CLOUD-BASED DETECTION
Sang Kil et al. proposed a design and implementation of a novel anti-malware system called SplitScreen [32]. It is a distributed malware detection schema which uses an additional screening step prior to the signature matching phase found in existing approaches. SplitScreen's two-phase scanning enables fast and memory-efficient malware detection that can be decomposed into a client/server process that reduces the amount of storage. The proposed method was implemented as an extension of ClamAV and improves scanning throughput using today's signature sets by over 2x while using half the memory. According to the authors, the speedup and memory savings of SplitScreen improve further as the number of signatures increases. The proposed method is scalable on a wide range of low-end consumer and handheld devices. Since a single server is used in the cloud, it would be better to optimize the server performance and move some work to the client side.
Yanfang et al. presented a cloud-based schema which combines file content and file relations to improve malware detection results and develops a file verdict system [97]. The system was incorporated into Comodo's anti-malware products, and empirical studies were conducted on large daily datasets collected by the Comodo cloud security center. The authors claim that their experimental results demonstrated that the accuracy and efficiency of the Valkyrie system outperform other popular anti-malware software tools such as Kaspersky AntiVirus and McAfee VirusScan, as well as other alternative data mining-based detection systems. However, since file relations and file content have different properties, combining these 2 features directly can decrease the quality of information and cause correlation and consistency issues. Using different approaches, such as a Joint-Embedding approach, can help to solve the correlation and consistency problem.
Martignoni et al. presented a framework that enhances the capabilities of existing dynamic behavior-based detectors. The proposed framework enables sophisticated behavior-based analysis of suspicious programs in multiple realistic and heterogeneous environments in the cloud [54]. The suggested schema forces sample programs to execute in a distributed environment including the security lab and potential victim machines. The evaluation results demonstrated that the analysis of multiple execution traces of the same malware sample in multiple end-users' environments can improve the results of the analysis with very small overhead. On the other hand, the suggested framework raises privacy and security issues and is prone to various forms of detection and evasion attacks. Solving the security-related issues and implementing a framework that is resistant to evasion attacks will increase the framework performance.
A cloud-based anti-malware system called CloudEyes, which provides efficient and trusted security services for resource-constrained IoT devices, is presented in [98]. For the client side, CloudEyes implemented a lightweight scanning agent that utilizes the digest of signature fragments to dramatically reduce the range of accurate matching. For the cloud server side, CloudEyes presented suspicious bucket cross filtering, a novel signature detection mechanism based on the reversible sketch structure, which provides retrospective and accurate orientations of malicious signature fragments. Furthermore, by transmitting sketch coordinates and the modular hashing, CloudEyes guarantees both data privacy and low-cost communications. The authors claim that the mechanisms in CloudEyes are effective and practical, and can outperform other existing systems with less time and communication consumption. On the other hand, the detection rate and accuracy can be further improved. Also, methods such as Winnowing Block Shingling and Winnowing Multi-Hashing can be used to reduce the size of the data in order to optimize the storage and matching performance during signature initialization.
Xiao, Liang, et al. investigated the cloud-based malware detection game, in which mobile devices offload their application traces to security servers via base stations or access points in dynamic networks [56]. They designed a malware detection scheme with Q-learning for a mobile device to derive the optimal offloading rate without knowing the trace generation and the radio bandwidth model of other mobile devices. The Dyna architecture is used to improve performance, and a post-decision state learning-based scheme is used to accelerate the reinforcement learning process.
According to the authors, test results showed that the proposed schemes improve the detection accuracy, reduce the detection delay, and increase the utility of a mobile device in the dynamic malware detection game when compared with the benchmark strategy. Since many different parties communicate with each other during the detection process, some overhead can degrade the performance, including the network transmission delay, the detection delay for the mobile device, the cloud processing time, and the local detection delay. Reducing these delays will improve the performance.
Yadav R. Mahesh presented a malware detection system for the cloud environment [99]. The proposed work consists of 2 modules, clustering and classification. In the clustering module, the input dataset is gathered into clusters using the Weighted Fuzzy C-means clustering (MFCM) algorithm. In the classification module, the centroid from the clusters is given to the intermittent Auto Associative Neural Network
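The idea of a cheap screening step before exact signature matching, as described for SplitScreen [32] in this section, can be illustrated with the toy sketch below. It is not SplitScreen's actual algorithm or data structure: using the first `fragment_len` bytes of each signature as the screening key, and all function names, are assumptions made purely for illustration. The sketch also assumes every signature is at least `fragment_len` bytes long.

```python
def build_screen(signatures, fragment_len=4):
    """Collect a short leading fragment of every signature for cheap screening."""
    return {sig[:fragment_len] for sig in signatures}


def screen(data, fragments, fragment_len=4):
    """Phase 1: report whether any signature fragment occurs anywhere in the data."""
    return any(
        data[i:i + fragment_len] in fragments
        for i in range(len(data) - fragment_len + 1)
    )


def exact_match(data, signatures):
    """Phase 2: full (more expensive) signature matching."""
    return [sig for sig in signatures if sig in data]


def two_phase_scan(data, signatures, fragment_len=4):
    """Run exact matching only for inputs that pass the screening step."""
    fragments = build_screen(signatures, fragment_len)
    if not screen(data, fragments, fragment_len):
        return []
    return exact_match(data, signatures)


# Hypothetical toy signatures and inputs.
SIGNATURES = [b"EVILCODE", b"BADBYTES"]
hits = two_phase_scan(b"xxEVILCODExx", SIGNATURES)
```

Most clean inputs are rejected in phase 1, so the exact matcher and the full signature set are consulted only for the small fraction of inputs that contain a fragment.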
G. MOBILE DEVICES-BASED MALWARE DETECTION
In the mobile device world, the Android platform has become the market leader. According to recent studies, a new malicious app for Android is introduced every 10 seconds. Because of that, researchers have focused on the Android platform rather than other platforms for malware detection. Numerous malware detection methods have been proposed for smartphones, especially for the Android platform. Generally, these methods use data mining and ML algorithms to detect malware. A number of different features such as system calls, security-sensitive APIs, information flows, and control flow structures are used. Even though current studies have made improvements in detecting traditional and new generation malware for mobile devices, detecting complex malware and scaling the detection techniques to a large bundle of apps still remain challenging tasks. The mobile devices-based detection schema can be seen in Figure 9.
1) RELATED WORKS FOR MOBILE DEVICES-BASED MALWARE DETECTION
Isohara et al. proposed a kernel-based behavior analysis for Android malware inspection [57]. The system consists of a log collector and a log analysis application. The log collector records all system calls and filters events with the target application, and the log analyzer matches activities with signatures described by regular expressions to detect malicious activity. They evaluated 230 applications in total. According to the authors, the system can effectively detect malicious behaviors of unknown applications. 230 apps are not enough to measure the efficiency of the suggested system, so more apps need to be analyzed. Besides, there is not enough information about DR, accuracy, and FP.
A new framework to obtain and analyze smartphone application activity is presented in [100]. Two types of datasets have been used: those from artificial malware created for test purposes, and those from real malware found in the wild. A simple 2-means clustering algorithm is chosen to distinguish benign applications from their corresponding malware versions. The authors specified that API call analysis, information flow tracking, and network monitoring techniques contribute to a deeper analysis of the malware, and provide malware behaviors and more accurate results. The authors identified that open(), read(), access(), chmod(), and chown() are the system calls most used by malware. The authors mentioned that the proposed method has been shown to be an effective means of isolating the malware and alerting users to downloaded malware. However, the tests have mostly been done on self-written malware and a few real malware samples, which is not enough for a real evaluation. Thus, more real malware needs to be analyzed. Moreover, there is not enough information about metrics which represent the framework performance such as DR, accuracy, and FP. In addition, the authors did not mention how they handle zero-day malware.
Host-based malware detection systems for Android are presented in [58], [101]. Andromaly, a behavioral malware detection framework for Android devices, is presented in [58]. The proposed framework used a host-based malware detection system that continuously monitors various features and events obtained from the mobile device and then applies ML anomaly detectors to classify the collected data as normal or malicious. They evaluated several combinations of anomaly detection algorithms, feature selection techniques, and numbers of top features to find the combination that yields the best performance when detecting new malware on Android. The authors claim that the proposed framework is effective both for mobile devices in general and for Android in particular. However, experiments have been done on artificially created malware rather than real malware. Saracino et al. proposed MADAM, a novel multi-level host-based malware detection system for Android devices that
simultaneously analyzes and correlates features at 4 levels: kernel, application, user, and package to detect malicious behaviors [101]. In this study, the actions of each malware are examined and misbehavior classes are generated from malware behaviors, which encompass most of the known malware behaviors. According to the authors, MADAM detects and effectively blocks more than 96% of malicious apps among the 2800 apps. MADAM is subject to mimicry attacks, which insert malicious code into benign apps to mislead the detection system. Besides, the paper did not mention how they handled unknown malware.
Li et al. introduced the significant permission identification (SigPID) method to detect Android malware [102]. Instead of extracting and analyzing all Android permissions, three levels of pruning by mining the permission data have been developed, which identify the most significant permissions for distinguishing malware from benign apps. SigPID then utilizes ML classification algorithms to classify different families of malware and benign apps. According to the authors' findings, only 22 permissions out of 135 are significant when over 2000 malware samples are analyzed. The test results indicated that when an SVM is used as the classifier, they could achieve over 90% precision, recall, accuracy, and f-measure, which are about the same as those produced by the baseline technique. When the proposed schema is compared with other state-of-the-art methods, SigPID is more effective, detecting 93.62% of malware in the dataset and 91.4% of new malware samples. Using SigPID features together with static features can improve the performance. A review on feature selection in mobile malware detection is presented in [103]. In the paper, 100 studies were examined based on feature selection techniques. They categorized features into 4 groups: static, dynamic, and hybrid features, and application metadata. The authors identified that the most common and distinctive static features are Android permission, network address, strings, and hardware components; dynamic features are system calls, network traffic, system components, and user interaction; hybrid features are permissions and Java code, system calls, and AndroidManifest.xml; metadata features are category, description, permissions, contact email, number of screenshots, and version. The authors emphasized that some of the examined papers introduced novel methods; however, due to the lack of malware samples, their authors could not test their systems thoroughly.
Malware detection using graph kernels for Android is presented in [59], [104]. Narayanan et al. proposed CASANDRA, a context-aware, adaptive and scalable Android malware detector based on online learning [59]. The authors proposed a novel graph kernel, which facilitates capturing apps' security-sensitive behaviors along with their context information from dependence. The authors mentioned that CASANDRA has specific advantages: it is adaptive to the evolution in malware features over time, and it explains the significant features that led to an app's classification as being malware or benign. According to the authors, CASANDRA outperforms two state-of-the-art methods on a benchmark dataset, achieving a 99.23% f-measure. Furthermore, when evaluated with more than 87,000 apps collected in the wild, CASANDRA achieves 89.92% accuracy, which outperformed existing methods by more than 25% in their typical batch learning setting and by more than 7% when they are continuously retrained. The authors did not mention how they handle unknown malware and malware which uses obfuscation techniques. To improve the model performance, different graph kernels and API dependencies such as information flows and permission dependencies can be used.
Narayanan et al. proposed MKLDROID, a unified framework for Android that systematically integrates multiple views of apps for performing comprehensive malware detection and malicious code localization [104]. MKLDROID uses a graph kernel to capture structural and contextual information from apps' dependency graphs when identifying malicious code patterns. Then, it employs multiple kernel learning (MKL) to find a weighted combination of the views which yields the best detection accuracy. Through large-scale experiments on several datasets of wild apps, the authors claim that MKLDROID consistently outperforms three state-of-the-art methods in terms of accuracy. In addition, in malicious code localization experiments on a dataset of repackaged malware, MKLDROID was able to identify all the malware classes with 94% average recall. On the other hand, MKLDROID cannot detect all sorts of malicious behaviors and is not resistant to obfuscation techniques. Furthermore, MKLDROID can be fooled by adversarial attacks. MKLDROID used only user-awareness contextual information to separate malware from benign apps. However, other types of contextual information such as probing and device-specific privileges could be used.
2) EVALUATION OF MOBILE DEVICES-BASED MALWARE DETECTION
The literature review of the mobile devices-based detection approach has been summarized. It can use both static and dynamic features. Although the proposed methods seem effective when detecting traditional malware, they need to be improved to detect up-to-date malware. Besides, they are not scalable to a large bundle of apps. In the mobile area, malware detection is still at an early stage, and more studies are needed in this area to fill the gaps.
H. IoT-BASED MALWARE DETECTION
Internet of Things (IoT) architecture generally consists of a wide range of Internet-connected smart devices such as home appliances, network cameras, and sensors. IoT and mobile devices have started to dominate the Internet more than PCs. Since mobile and IoT devices are becoming more popular among users day by day, they are also becoming more attractive targets for attackers. Because of that, the malware detection landscape is shifting from computers to IoT and mobile devices. The IoT-based detection schema can be seen in Figure 10.
FIGURE 10. IoT-based malware detection schema.
1) RELATED WORKS FOR IoT-BASED MALWARE DETECTION
Malware detection approaches for IoT devices are presented in [105], [60]. A novel lightweight technique for detecting DDoS malware in IoT environments is explained in [105]. They extracted malware images, such as a one-channel gray-scale image, from a malware binary, then utilized a lightweight convolutional neural network for classifying their families. According to the paper, experimental results showed that the proposed system can achieve 94.0% accuracy for the classification of benign and DDoS malware, and 81.8% accuracy for the classification of benign and two main malware families. Even though the proposed method is fast and lightweight, it is vulnerable to complex code obfuscation techniques. The author mentioned that this problem can be partially reduced by using more complex static features, such as Opcode sequences and API calls. Detecting crypto-ransomware in IoT networks based on the energy consumption footprint of Android devices is presented in [60]. The proposed system uses ML algorithms and specifically monitors the energy consumption patterns of different processes to classify ransomware from non-malicious applications. According to the authors, the proposed technique outperformed KNN, neural networks, SVM and RF in terms of accuracy rate, recall rate, precision rate and f-measure. The description of the proposed method is not clear. Besides, there is no information about which ransomware family was analyzed and how they handled unknown ransomware. Also, the paper did not mention any limitations or challenging tasks.
detection is still in the earlier stages for IoT, and there need to be more studies on this area to fill the gaps.
V. EVALUATION ON MALWARE DETECTION APPROACHES
In the previous section, malware detection approaches were analyzed based on the main idea, algorithm types, feature extraction methods, etc. This section summarizes detection approaches and their methods, provides advantages and disadvantages of each detection approach, and provides some suggestions to build a more effective detection schema. The comparison of malware detection approaches and the advantages and disadvantages of each malware detection approach can be seen in Table 6 and Table 7, respectively.
Signature-, behavior-, heuristic-, and model checking-based approaches are well-known and have been used for malware detection for more than a decade. These approaches use reverse engineering, data mining, and ML techniques to detect malware.
The signature-based detection approach is fast and effective at detecting known malware. During signature generation, static features such as byte sequences, assembly instructions, strings, Opcode, and lists of DLLs are used. The signature detection schema has been used for many years and decreases overhead and execution time. However, it cannot detect new generation malware (Table 6), it is vulnerable to obfuscation and polymorphic techniques, and it omits feature selection. To build an effective signature-based detection schema: some dynamic features can be used to avoid obfuscation; a feature selection phase can be added; and new technologies such as deep learning, active learning, and ML can be used to increase the detection rate.
The behavior-based detection approach is used to determine the functionality of malware. Thus, even if the malware instruction sequence and signature may change, the functionality of malware will be more or less the same. So, it can detect new
malware, and different variants of the same malware [106].
It is also effective against obfuscation and polymorphic tech-
2) EVALUATION OF IoT-BASED MALWARE DETECTION niques (Table 7). However, it produces high FPs. Besides,
The literature review of the IoT-based detection approach some behaviors are similar in malware and benign sam-
has been summarized. Although the proposed methods seem ples, so grouping these behaviors is difficult, and some mal-
effective when detecting traditional malware, it needs to be ware does not run in protected environment and mistakenly
improved to detect up-to-date malware. Besides, the malware marked as benign. To specify all behaviors correctly, multiple
execution paths can be gathered using different machines on clouds. This can decrease the number of malware samples mistakenly marked as benign.
The heuristic-based detection approach can use both static and dynamic features such as API calls, Opcode, CFG, n-gram, lists of DLLs, and hybrid features. It can detect some previously unknown malware, but it is vulnerable to metamorphic techniques, and its numerous rules and training phases [107] make this detection approach complicated (Table 7). Decreasing the number of rules and building a more efficient learning phase can improve the performance of the method.
The model checking-based approach is powerful, can detect unknown malware, and is resistant to obfuscation and polymorphic techniques (Table 7). However, it obtains only a limited view of the malware, is not resistant to evasion attacks, and cannot detect all new generation malware. Identifying more accurate formulas and using an effective model checker may improve performance.
Recently, deep learning-, cloud-, mobile devices-, and IoT-based approaches have started to be used in malware detection (VI). Deep learning-based detection approaches are effective at detecting new malware and sharply reduce the feature space [108], [109], but they are not resistant to some evasion attacks. On the other hand, cloud-based detection approaches increase the DR, decrease FPs, and provide bigger malware databases and powerful computational resources [110]. The overhead between client and server and the lack of real-time monitoring are still challenging in cloud environments. Mobile devices- and IoT-based detection approaches can use both static and dynamic features, and improve detection rates on traditional and new generation malware [111]. However, they have difficulties detecting complex malware and are not scalable for large bundles of apps. Integrating the mobile and IoT-based approaches with the cloud-based approach can improve the DR and scale better for large bundles of apps.
Even though each detection method has its own advantages and works better for different datasets, no detection method could detect all malware. Malware detection rate versus complexity of malware can be seen in Figure 11. When the complexity of malware (unknown malware, new generation malware, obfuscated malware) increases, the detection rate decreases for all detection approaches. It can be seen that signature-type detection approaches such as signature-, heuristic-, and most of the time mobile devices- and IoT-based schemas show lower performance than other approaches such as behavior-, model checking-, cloud-, and deep learning-based approaches (Figure 11).
This is because the latter approaches are more effective at detecting unknown and obfuscated malware. The behavior-based detection approach performs well, while the signature-based detection approach shows the lowest performance (Figure 11). Model checking- and cloud-based detection approaches perform slightly better than deep learning-, heuristic-, mobile devices-, and IoT-based detection approaches. Combining malware detection approaches can provide a better detection mechanism. For example, combining behavior-based with model checking-based approaches, and using deep learning and cloud at the same time, is likely to provide a better detection mechanism. Besides, using new technologies such as blockchain and big data may give more opportunity to build a more effective detector.
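The suggestion above, that combining detection approaches can provide a better detection mechanism, can be illustrated with a small sketch. It fuses a toy signature check (byte-pattern lookup) with a toy behavior check (suspicious API calls) through a weighted score. This is only one simple way to combine detectors and is not a method taken from the surveyed papers; the `Sample` type, the byte patterns, the API list, the weights, and the threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Sample:
    """Hypothetical sample: raw file bytes plus API calls seen in a sandbox run."""
    raw: bytes
    api_calls: List[str]


# Toy signature database of byte patterns (illustrative values, not real signatures).
SIGNATURES: List[bytes] = [b"\xde\xad\xbe\xef", b"EVIL_MARKER"]

# Toy set of API calls treated as suspicious (an assumption made for this sketch).
SUSPICIOUS_APIS: Set[str] = {
    "VirtualAllocEx",
    "WriteProcessMemory",
    "CreateRemoteThread",
}


def signature_score(sample: Sample) -> float:
    """Return 1.0 if any known byte pattern occurs in the file, else 0.0."""
    return 1.0 if any(sig in sample.raw for sig in SIGNATURES) else 0.0


def behavior_score(sample: Sample) -> float:
    """Return the fraction of the suspicious APIs that the sample invoked."""
    seen = SUSPICIOUS_APIS.intersection(sample.api_calls)
    return len(seen) / len(SUSPICIOUS_APIS)


def combined_verdict(
    sample: Sample,
    detectors: Dict[str, Callable[[Sample], float]],
    weights: Dict[str, float],
    threshold: float = 0.5,
) -> bool:
    """Flag the sample when the weighted average of detector scores reaches the threshold."""
    total_weight = sum(weights[name] for name in detectors)
    if total_weight <= 0:
        raise ValueError("detector weights must sum to a positive value")
    score = sum(weights[name] * det(sample) for name, det in detectors.items())
    return score / total_weight >= threshold


# Example configuration: both detectors weighted equally.
DETECTORS = {"signature": signature_score, "behavior": behavior_score}
WEIGHTS = {"signature": 0.5, "behavior": 0.5}

known = Sample(raw=b"..EVIL_MARKER..", api_calls=[])
obfuscated = Sample(raw=b"packed-bytes", api_calls=sorted(SUSPICIOUS_APIS))
benign = Sample(raw=b"hello world", api_calls=["CreateFileW"])

for name, sample in [("known", known), ("obfuscated", obfuscated), ("benign", benign)]:
    print(name, combined_verdict(sample, DETECTORS, WEIGHTS))
```

In this toy setting the obfuscated sample carries no known byte pattern, so the signature check alone would miss it, while the sample that was never executed has no behavior trace, so the behavior check alone would miss it. The fused score flags both, which is the intuition behind combining approaches; real systems would need calibrated scores and a threshold tuned against FPs.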
FIGURE 11. Malware detection rate versus complexity of malware based on previous studies.
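The detection rate (DR), false positives (FPs), and false negatives (FNs) used throughout this evaluation can be made concrete with a short sketch. It assumes the usual convention that DR is the true positive rate, TP / (TP + FN); the function name `detection_metrics` and the example counts are hypothetical and do not come from any of the reviewed studies.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common detector metrics from confusion-matrix counts.

    tp: malware correctly flagged      fp: benign wrongly flagged
    tn: benign correctly passed        fn: malware wrongly passed
    """

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes so the sketch never divides by zero.
        return num / den if den else 0.0

    dr = ratio(tp, tp + fn)                 # detection rate (recall, TPR)
    fpr = ratio(fp, fp + tn)                # false positive rate
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f_measure = ratio(2 * precision * dr, precision + dr)
    return {
        "detection_rate": dr,
        "false_positive_rate": fpr,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical test set: 100 malware and 100 benign samples.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting DR together with the false positive rate matters because a detector can reach a high DR simply by flagging almost everything, which is the trade-off behind the high-FP problem noted for behavior-based approaches.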
Although malware detectors are being improved every day, the following research challenges remain open issues in malware detection approaches:
• New generation malware uses obfuscation and packing techniques to hide itself. By using these techniques, malware can prevent itself from being correctly analyzed and avoid detection. The signature-based detection approach is not resistant to malware obfuscation. Even though behavior- and model checking-based approaches are effective against most obfuscation techniques, they cannot be resistant to all of them.
• Real-time monitoring and detection are challenging tasks. Most of the studies so far detect malware by using a dataset and are not appropriate for real-time monitoring.
• Most of the malware detection approaches are prone to FPs and FNs. Some features and signatures can be very similar in malware and benign samples, which raises FPs and FNs.
• No detection method can effectively detect all unknown malware.
• Generally, learning algorithms are prone to bias and overfitting. This leads to decreased DRs and increased FPs.
• There is no well-known and accepted dataset which can be used to evaluate the performance of malware detection approaches. This is because each malware detection method uses different malware and a different dataset.
VI. CONCLUSION
Even though several new methods have been proposed by using these different malware detection approaches, no method could detect all new generation and sophisticated malware. For known malware, signature- and heuristic-based detection approaches perform well. On the other hand, for unknown and complicated malware, behavior-, model checking-, and cloud-based approaches perform better. Deep learning-, mobile-, and IoT-based approaches have also emerged to detect some portion of known and unknown malware. However, some portion of malware could not be detected by using these approaches. This shows that building an effective method to detect malware is a very challenging task, and there is a huge gap to fill with new studies and methods. Even though the trends in malware creation and detection approaches are changing rapidly, this study can still be considered a key reference for computer scientists and developers who work in this field. As future work, new approaches and methods need to be proposed. To do that, combining malware detection approaches can be one solution among many. For instance, combining behavior-based with model checking-based approaches, and using deep learning and cloud at the same time, is likely to provide a better detection mechanism.
Recently, the number, severity, and sophistication of malware attacks, and the cost that malware inflicts on the world economy, have been increasing exponentially. Attacks with these kinds of software have a disastrous effect and cause considerable material damage to the assets of individuals, private companies, and governments. Thus, malware should be detected before it damages important assets. However, there are large gaps in the research area of malware detection and prevention. The aim of this study is to contribute to malware research. In this context, the paper has presented a detailed review of the state-of-the-art studies on malware detection approaches, and of the techniques and algorithms that are used for malware detection. The advantages and disadvantages of each malware detection approach have been explained. As well as data mining and ML, new technologies such as deep learning-, cloud-, mobile devices-, and IoT-based detection schemas have also become popular.
[53] Y. Ye, L. Chen, S. Hou, W. Hardy, and X. Li, "DeepAM: A heterogeneous deep learning framework for intelligent malware detection," Knowl. Inf. Syst., vol. 54, no. 2, pp. 265–285, Feb. 2018.
[54] L. Martignoni, R. Paleari, and D. Bruschi, "A framework for behavior-based malware analysis in the cloud," in Proc. Int. Conf. Inf. Syst. Secur. Berlin, Germany: Springer, 2009.
[55] H. Sun, X. Wang, J. Su, and P. Chen, "RScam: Cloud-based anti-malware via reversible sketch," in Proc. Int. Conf. Secur. Privacy Commun. Syst. Cham, Switzerland: Springer, 2015.
[56] L. Xiao, Y. Li, X. Huang, and X. Du, "Cloud-based malware detection game for mobile devices with offloading," IEEE Trans. Mobile Comput., vol. 16, no. 10, pp. 2742–2750, Oct. 2017.
[57] T. Isohara, K. Takemori, and A. Kubota, "Kernel-based behavior analysis for Android malware detection," in Proc. 7th Int. Conf. Comput. Intell. Secur., Dec. 2011.
[58] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss, "Andromaly: A behavioral malware detection framework for Android devices," J. Intell. Inf. Syst., vol. 38, no. 1, pp. 161–190, Feb. 2012.
[59] A. Narayanan, M. Chandramohan, L. Chen, and Y. Liu, "Context-aware, adaptive, and scalable Android malware detection through online learning," IEEE Trans. Emerg. Topics Comput., vol. 1, no. 3, pp. 157–175, Jun. 2017.
[60] A. Azmoodeh, A. Dehghantanha, M. Conti, and K.-K.-R. Choo, "Detecting crypto-ransomware in IoT networks based on energy consumption footprint," J. Ambient Intell. Hum. Comput., vol. 9, no. 4, pp. 1141–1152, Aug. 2018.
[61] K. Hahn, "Robust static analysis of portable executable malware," in Proc. HTWK Leipzig, 2014.
[62] Hooked on Mnemonics Worked for Me. Accessed: Nov. 15, 2019. [Online]. Available: http://hooked-on-mnemonics.blogspot.com/2011/01/intro-to-creating-anti-virus-signatures.html
[63] M. F. Zolkipli and A. Jantan, "A framework for malware detection using combination technique and signature generation," in Proc. 2nd Int. Conf. Comput. Res. Develop., May 2010.
[64] Y. Tang, B. Xiao, and X. Lu, "Using a bioinformatics approach to generate accurate exploit-based signatures for polymorphic worms," Comput. Secur., vol. 28, no. 8, pp. 827–842, Nov. 2009.
[65] H. Razeghi Borojerdi and M. Abadi, "MalHunter: Automatic generation of multiple behavioral signatures for polymorphic malware detection," in Proc. ICCKE. Mashhad, Iran: Ferdowsi Univ. Mashhad, vol. 1, Oct. 2013.
[66] J. Newsome, B. Karp, and D. Song, "Polygraph: Automatically generating signatures for polymorphic worms," in Proc. IEEE Symp. Secur. Privacy (S&P), Oakland, CA, USA, May 2005, pp. 226–241.
[67] R. Perdisci, W. Lee, and N. Feamster, "Behavioral clustering of HTTP-based malware and signature generation using malicious network traces," in Proc. 7th USENIX Conf. Netw. Syst. Design Implement., San Jose, CA, USA, 2010, pp. 391–404.
[68] I. Santos, Y. K. Penya, J. Devesa, and P. G. Bringas, "N-grams-based file signatures for malware detection," in Proc. 11th Int. Conf. Enterprise Inf., vol. 9, 2009, pp. 317–320.
[69] E. Fix and J. L. Hodges, "Discriminatory analysis: Nonparametric discrimination: Small sample performance," Univ. California, Berkeley, Berkeley, CA, USA, Tech. Rep. 11, 1952.
[70] D. Venugopal and G. Hu, "Efficient signature based malware detection on mobile devices," Mobile Inf. Syst., vol. 4, no. 1, pp. 33–49, 2008.
[71] M. Zheng, M. Sun, and J. C. Lui, "Droid analytics: A signature based analytic system to collect, extract, analyze and associate android malware," in Proc. 12th IEEE Int. Conf. Trust, Secur. Privacy Comput. Commun., Jul. 2013.
[72] Y. Fukushima, A. Sakai, Y. Hori, and K. Sakurai, "A behavior based malware detection scheme for avoiding false positive," in Proc. 6th IEEE Workshop Secure Netw. Protocols, Oct. 2010, pp. 79–84.
[73] M. Christodorescu, S. Jha, S. Seshia, D. Song, and R. Bryant, "Semantics-aware malware detection," in Proc. IEEE Symp. Secur. Privacy (S&P), May 2005.
[74] A. Lanzi, D. Balzarotti, C. Kruegel, M. Christodorescu, and E. Kirda, "AccessMiner: Using system-centric models for malware protection," in Proc. 17th ACM Conf. Comput. Commun. Secur. (CCS), 2010, pp. 399–412.
[75] M. Chandramohan, H. B. K. Tan, L. C. Briand, L. K. Shar, and B. M. Padmanabhuni, "A scalable approach for malware detection through bounded feature space behavior modeling," in Proc. 28th IEEE/ACM Int. Conf. Autom. Softw. Eng. (ASE), Nov. 2013, pp. 312–322.
[76] S.-T. Liu, H.-C. Huang, and Y.-M. Chen, "A system call analysis method with mapreduce for malware detection," in Proc. IEEE 17th Int. Conf. Parallel Distrib. Syst., Dec. 2011, pp. 631–637.
[77] U. Bayer, I. Habibi, D. Balzarotti, E. Kirda, and C. Kruegel, "A view on current malware behaviors," in Proc. USENIX Workshop, 2009.
[78] H. H. Pajouh, A. Dehghantanha, R. Khayami, and K.-K.-R. Choo, "Intelligent OS X malware threat detection with code inspection," J. Comput. Virol. Hack Tech., vol. 14, no. 3, pp. 213–223, Aug. 2018.
[79] C. Kolbitsch, P. M. Comparetti, C. Kruegel, E. Kirda, X.-Y. Zhou, and X. Wang, "Effective and efficient malware detection at the end host," in Proc. USENIX Secur. Symp., Aug. 2009, vol. 4, no. 1, pp. 351–366.
[80] M. Eskandari and S. Hashemi, "A graph mining approach for detecting unknown malwares," J. Vis. Lang. Comput., vol. 23, no. 3, pp. 154–162, Jun. 2012.
[81] F. Adkins, L. Jones, M. Carlisle, and J. Upchurch, "Heuristic malware detection via basic block comparison," in Proc. 8th Int. Conf. Malicious Unwanted Softw., Amer. (MALWARE), Oct. 2013.
[82] W. Arnold and G. Tesauro, "Automatically generated Win32 heuristic virus detection," in Proc. Int. Virus Bull. Conf., vol. 200, 2000.
[83] Y. Ye, T. Li, Q. Jiang, and Y. Wang, "CIMDS: Adapting postprocessing techniques of associative classification for malware detection," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 3, pp. 298–307, May 2010.
[84] Y. Ye, D. Wang, T. Li, D. Ye, and Q. Jiang, "An intelligent PE-malware detection system based on association mining," J. Comput. Virol., vol. 4, no. 4, pp. 323–334, Nov. 2008.
[85] Z. Bazrafshan, H. Hashemi, S. M. H. Fard, and A. Hamzeh, "A survey on heuristic malware detection techniques," in Proc. 5th Conf. Inf. Knowl. Technol., May 2013.
[86] D. Bilar, "Opcodes as predictor for malware," Int. J. Electron. Secur. Digit. Forensics, vol. 1, no. 2, p. 156, 2007.
[87] A. Holzer, J. Kinder, and H. Veith, "Using verification technology to specify and detect malware," in Proc. Int. Conf. Comput. Aided Syst. Theory. Berlin, Germany: Springer, 2007.
[88] J. Kinder, S. Katzenbeisser, C. Schallhart, and H. Veith, "Proactive detection of computer worms using model checking," IEEE Trans. Depend. Sec. Comput., vol. 7, no. 4, pp. 424–438, Oct. 2010.
[89] P. Beaucamps, I. Gnaedig, and J. Marion, "Abstraction-based malware analysis using rewriting and model checking," in Proc. Eur. Symp. Res. Comput. Secur. Berlin, Germany: Springer, 2012.
[90] F. Song and T. Touili, "Pushdown model checking for malware detection," Int. J. Softw. Tools Technol. Transf., vol. 16, no. 2, pp. 147–173, Apr. 2014.
[91] P. Battista, F. Mercaldo, V. Nardone, A. Santone, and C. A. Visaggio, "Identification of Android malware families with model checking," in Proc. 2nd Int. Conf. Inf. Syst. Secur. Privacy, 2016.
[92] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, "Large-scale malware classification using random projections and neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013.
[93] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, "Droid-Sec: Deep learning in Android malware detection," ACM SIGCOMM Comput. Commun. Rev., vol. 44, no. 4, pp. 371–372, 2014.
[94] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[95] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, "Adversarial perturbations against deep neural networks for malware classification," 2016, arXiv:1606.04435. [Online]. Available: https://arxiv.org/abs/1606.04435
[96] B. Kolosnjaji, A. Demontis, B. Biggio, D. Maiorca, G. Giacinto, C. Eckert, and F. Roli, "Adversarial malware binaries: Evading deep learning for malware detection in executables," in Proc. 26th Eur. Signal Process. Conf. (EUSIPCO), Sep. 2018.
[97] Y. Ye, T. Li, S. Zhu, W. Zhuang, E. Tas, U. Gupta, and M. Abdulhayoglu, "Combining file content and file relations for cloud based malware detection," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2011.
[98] H. Sun, X. Wang, R. Buyya, and J. Su, "CloudEyes: Cloud-based malware detection with reversible sketch for resource-constrained Internet of Things (IoT) devices," Softw. Pract. Exper., vol. 47, no. 3, pp. 421–441, Mar. 2017.
[99] R. M. Yadav, "Effective analysis of malware detection in cloud computing," Comput. Secur., vol. 83, pp. 14–21, Jun. 2019.
[100] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: Behavior-based malware detection system for Android," in Proc. 1st ACM Workshop Secur. Privacy Smartphones Mobile Devices (SPSM), 2011.
[101] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, "MADAM: Effective and efficient behavior-based Android malware detection and prevention," IEEE Trans. Depend. Sec. Comput., vol. 15, no. 1, pp. 83–97, Jan. 2018.
[102] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-an, and H. Ye, "Significant permission identification for machine-learning-based Android malware detection," IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3216–3225, Jul. 2018.
[103] A. Feizollah, N. B. Anuar, R. Salleh, and A. W. A. Wahab, "A review on feature selection in mobile malware detection," Digit. Invest., vol. 13, pp. 22–37, Jun. 2015.
[104] A. Narayanan, M. Chandramohan, L. Chen, and Y. Liu, "A multi-view context-aware approach to Android malware detection and malicious code localization," Empirical Softw. Eng., vol. 23, no. 3, pp. 1222–1274, Jun. 2018.
[105] J. Su, V. Danilo Vasconcellos, S. Prasad, S. Daniele, Y. Feng, and K. Sakurai, "Lightweight classification of IoT malware based on image recognition," in Proc. IEEE 42nd Annu. Comput. Softw. Appl. Conf. (COMPSAC), vol. 2, Jul. 2018.
[106] E. B. Karbab and M. Debbabi, "MalDy: Portable, data-driven malware detection using natural language processing and machine learning techniques on behavioral analysis reports," Digit. Invest., vol. 28, pp. S77–S87, Apr. 2019.
[107] C. Choi, C. Esposito, M. Lee, and J. Choi, "Metamorphic malicious code behavior detection using probabilistic inference methods," Cognit. Syst. Res., vol. 56, pp. 142–150, Aug. 2019.
[108] W. Wang, M. Zhao, and J. Wang, "Effective Android malware detection with a hybrid model based on deep autoencoder and convolutional neural network," J. Ambient Intell. Hum. Comput., vol. 10, no. 8, pp. 3035–3043, Aug. 2019.
[109] H. Zhang, W. Zhang, Z. Lv, A. K. Sangaiah, T. Huang, and N. Chilamkurti, "MALDC: A depth detection method for malware based on behavior chains," in Proc. World Wide Web, 2019, pp. 1–20.
[110] Q. K. Ali Mirza, I. Awan, and M. Younas, "CloudIntell: An intelligent malware detection system," Future Gener. Comput. Syst., vol. 86, pp. 1042–1053, Sep. 2018.
[111] Z. Ma, H. Ge, Y. Liu, M. Zhao, and J. Ma, "A combination method for Android malware detection based on control flow graphs and machine learning algorithms," IEEE Access, vol. 7, pp. 21235–21245, 2019.
ÖMER ASLAN received the B.Sc. degree in computer engineering from the University of Trakya, Turkey, in 2009, and the M.Sc. degree in information security from the University of Texas at San Antonio, USA, in 2014. He is currently pursuing the Ph.D. degree in computer engineering with Ankara University, Turkey. He is a Research Assistant in cyber security with the University of Siirt, Turkey. He has published seven conference papers and one book chapter.
REFIK SAMET (Member, IEEE) is currently a Professor with the Computer Engineering Department, Ankara University, Turkey. He has worked and managed projects at TUBITAK, NATO, European Union, and University Scientific Research Units. He is working on reliability and fault-tolerance of real-time computer systems, information security, cyber security, malware analysis, and computer forensics. He has six patents, four books, and four book chapters. He has over 60 articles published in National and International journals and more than 60 conference papers. He is a member of scientific committee at many National and International science conferences and journals. He is a member of the IEEE Computer Society and IEEE Turkey Section.