Malware Detection and Analysis Challenges and Rese
arXiv:2101.08429v1 [cs.CR] 21 Jan 2021

Malwares are continuously growing in sophistication and numbers. Over the last decade, remarkable progress has been achieved in anti-malware mechanisms. However, several pressing issues (e.g., detection of unknown malware samples) still need to be addressed adequately. This article first presents a concise overview of malware along with anti-malware and then summarizes various research challenges. This is a theoretical and perspective article that is hoped to complement earlier articles and works.

I. INTRODUCTION

The use of personal computers and mobile devices coupled with the internet has now become an integral part of everyday life. This ubiquity of high interconnectivity has prompted many serious privacy and security menaces as well as other malicious activities. For instance, the emails and passwords of 117 million LinkedIn users were made publicly available by hackers in 2016. In 2017, Uber revealed that its server was attacked and the data of 57 million drivers and riders were stolen. In 2018, almost 50 million Facebook accounts were compromised due to a security breach. Similarly, cyberattacks on Norway’s ‘Health South East RHF’ healthcare organization in 2018 exposed the health records of more than half of the country’s population. Moreover, it is estimated that, on average, a new malicious code specimen is released every 10 seconds to attack mobile devices [1]. A surge of cyberattacks with increasing number and sophistication can be seen with each passing year, which is impacting governments, enterprises and individuals alike and causing severe reputational, financial and social damage. For example, malicious cyber activities in 2016 cost the U.S. economy alone up to 109 billion USD [2].

Different types of cyberattacks are presently being performed by cybercriminals, e.g., man-in-the-middle, malware or birthday attacks. In particular, malware attacks have advanced as one of the most formidable issues in the cybersecurity domain as well as the primary tool utilized by cybercriminals [3]. Malware is a short form of malicious software. In French, ‘mal’ means ‘bad’. Malware is a catch-all term widely employed to denote various kinds of unwanted harmful software programs with malicious motives [4]. When malware is executed on a system or computing device, it attempts to breach the system/device’s security policies regarding integrity, confidentiality and availability of data. Other names for malware are badware, malicious code, malicious executable and malicious program. Malwares are developed or used by hobbyists and cyber-offenders trying to show their ability by causing havoc and to steal information potentially for monetary gains, respectively. They are popularly known as hackers, black hats and crackers, and could be external/internal menaces, industrial spies or foreign governments. Malwares can be used to change or erase data from victim computers, to collect confidential information, or to hijack systems in order to attack other devices, send spam, host and share illicit content, bring down servers, penetrate networks, and cripple critical infrastructures.

Consequently, a broad range of tools and schemes have been devised to detect and mitigate malware attacks [1]. Anti-malware systems thwart malwares by determining whether a given program has malign intent or not [4]. Despite great advancement of malware defense techniques and their incessant evolution, badwares can still bypass anti-malware solutions, owing mainly to sophisticated packers and the weakest link, i.e., humans. Namely, most anti-malware methods do not exhibit low enough error rates. Additionally, their performance particularly drops when they face unknown malwares, while 360,000 novel malware samples hit the scene daily [4]. As anti-malware becomes more avant-garde, so do malwares in the wild, thereby escalating the arms race between malware guardians and writers. The quest for scalable and robust automated malware detection frameworks still has a long way to go. This article presents an overview of malwares and their defenses formulated in recent years, and highlights challenges, open issues and research opportunities for researchers and engineers. It is a perspective and academic article, which is aimed at complementing existing studies and prompting interdisciplinary research.

II. MALWARE CATEGORIES

Malwares, as depicted in Fig. 1, can be divided into various types depending on how they infect, propagate or exploit the target system, as follows [3]. Please note that some of the malware types/tools/names fall in a gray area of features intended for benign purposes as well, e.g., cookie, Wireshark, etc.

A. Virus

A piece of code that duplicates, reproduces or propagates itself across files, programs and machines if they are network-connected. Viruses cannot execute independently, therefore
[Fig. 1 (diagram): Malware divided into types: Virus, Worm, Trojan, Spyware, Ransomware, Scareware, Diallerware, Bot, Rootkit, Backdoor, Browser Hijackers, Downloader. Subtypes shown: Adware, Keylogger, Trackware, Cookie, Riskware, Sniffer, Spamware, Reverse Shell, Bootkit.]
a command and control (C&C) channel from a system called the C&C server. A cluster of bots controlled by a sole server is known as a botnet. Botnets can be employed to organize DDoS attacks, phishing fraud, spam sending, etc. Well-known examples are Sdbot and Agobot.

1) Spamware

Spamware (aka spam sending malware or spambot) is malicious software designed to search for and compile lists of email addresses as well as to send large numbers of spam emails. It is an element of a botnet functioning as a distributed spam-sending network. Spamware can use an infected user’s email ID or IP address to send emails, which may consume a great amount of bandwidth and slow down the system. Examples are Trik Spam and the Necurs botnet.

2) Reverse Shell

A reverse shell is an unauthorized program (malware) that gives the attacker access to a compromised computer. A reverse shell enables the attacker to type and run commands on the host as if the attacker were local. Examples are Netcat and JSP web shell.

I. Rootkit

A rootkit is stealthy software that is devised to conceal specific programs/processes and to enable privileged access to a computer/data. A rootkit allows the attacker to access and control the system remotely without being detected, as it normally runs with root privileges and subverts system logs and security software. Examples are NTRootkit and Stuxnet.

1) Bootkit

A bootkit is an advanced form of rootkit that infects the master boot record or volume boot record. Since it resides in the boot sector, it is difficult for security software to detect, and it stays active after a system reboot. Well-known examples are BOOTRASH and FinFisher.

J. Backdoor

A backdoor is malware that installs by itself and creates a secret entrance for attackers to bypass the system’s authentication procedures and to access the system and perform illegitimate activities. Backdoors are never utilized alone but as a precursor to malware attacks of other kinds, as they do no harm themselves but furnish wider attack surfaces. A notable backdoor tool is Remote Access Terminal/Trojan (RAT). Other examples are Basebridge and Olyx.

K. Browser Hijackers

A browser hijacker is undesired software that alters the settings of a web browser without the user’s consent, either to inject ads into the browser or to replace the home/error page and search engine. Some of them may access sensitive data with spyware. Examples are CoolWebSearch and RocketTab.

L. Downloader

A downloader is a malicious program that downloads and installs/runs new versions of malwares from the internet on compromised computers. A downloader is usually embedded in websites and software. Examples are Trojan-Downloader:W32/JQCN and Trojan-Downloader:OSX/Jahlev.A.

III. MALWARE CONCEALMENT TECHNIQUES

To evade anti-malwares, malware writers have applied the following malware camouflage approaches [3]:

A. Encryption

Malware encrypted by this method consists of encryption and decryption algorithms, keys and malicious code. Each time, the attacker employs a new encryption algorithm and key to generate a novel malware version. Since the decryption algorithm remains the same, such malware has a higher probability of being detected. The main aim of this procedure is to avoid static analysis and delay the investigation process. CASCADE was reported as the first encrypted malware in 1987.

B. Packing

A packing mechanism is utilized to compress/encrypt the malware executable file. To detect malwares that use packing, reverse engineering methods or the correct unpacking algorithm are needed, which is sometimes hard as it requires knowledge of the true packing/compression algorithm. UPX and Upack are examples of packers.

C. Obfuscation

This technique obscures a program’s principal logic to stop others from gaining knowledge of the code. Malwares with obfuscation and their deleterious functionality stay unintelligible until activated. Quintessential obfuscation strategies are inserting inessential jumps and including garbage commands.
[Fig. 2 (diagram): Input Sample → Feature Extraction → Feature Selection → Classifier/Clustering → Decision (Malware or Benign (Cleanware)), with an optional Explanation: Malware Analysis stage. Insets: a feature vector x1, …, xd reduced to y1, …, yk (k<d); a probability density plot; a confusion matrix of actual class versus predicted class (TP, FN, FP, TN); and malware families.]
Fig. 2: A generic malware detection and analysis system. First, the input sample is provided to the feature extraction module, which yields a feature representation vector. A feature reduction/selection process is carried out on the feature representation vector to obtain a fixed dimensionality regardless of the length of the input sample, for enhanced performance. A classification/clustering technique is trained on the available set of malware and benign samples. During detection/analysis, an unseen sample is reported by the classification/clustering technique as malware or not. Further analysis is also sometimes performed, e.g., describing suspicious (or benign) characteristics present in the sample.
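The stages in Fig. 2 can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the system of any cited work: byte histograms as features, variance ranking as feature selection, a nearest-centroid classifier, and made-up byte strings standing in for samples are all choices introduced here for the example. The final step tallies the confusion-matrix cells (TP, FP, TN, FN) that appear in the figure, with malware as the positive class.

```python
from collections import Counter
from statistics import pvariance


def extract_features(sample: bytes) -> list:
    """Feature extraction: normalized 256-bin byte histogram (fixed length)."""
    counts = Counter(sample)
    total = max(len(sample), 1)
    return [counts.get(byte, 0) / total for byte in range(256)]


def select_features(vectors, k):
    """Feature selection: indices of the k highest-variance dimensions."""
    variances = [pvariance([v[i] for v in vectors]) for i in range(len(vectors[0]))]
    return sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)[:k]


def project(vector, keep):
    """Reduce a d-dimensional vector to the k selected dimensions."""
    return [vector[i] for i in keep]


def train_centroids(vectors, labels):
    """Training: one mean vector (centroid) per class."""
    sums, counts = {}, Counter(labels)
    for vector, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vector):
    """Decision: label of the closest centroid (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vector, center))
    return min(centroids, key=lambda label: dist(centroids[label]))


def confusion(y_true, y_pred, positive="malware"):
    """Return (TP, FP, TN, FN) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn


# Synthetic stand-ins for samples (illustrative only, not real malware).
train_samples = [bytes([0x90] * 40 + [0xCC] * 10 + list(range(i, i + 10))) for i in range(3)]
train_samples += [b"hello world, this is plain text " * 2 + bytes([65 + i]) for i in range(3)]
train_labels = ["malware"] * 3 + ["benign"] * 3

train_vectors = [extract_features(s) for s in train_samples]
keep = select_features(train_vectors, k=8)
centroids = train_centroids([project(v, keep) for v in train_vectors], train_labels)

# Unseen samples pass through the same extraction and selection steps.
test_samples = [bytes([0x90] * 30 + [0xCC] * 20 + [1, 2, 3]), b"another plain text sample with words"]
y_true = ["malware", "benign"]
y_pred = [predict(centroids, project(extract_features(s), keep)) for s in test_samples]

tp, fp, tn, fn = confusion(y_true, y_pred)
tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate
print(y_pred, (tp, fp, tn, fn), tpr, fpr)
```

The two classes here are trivially separable by construction, so the perfect score says nothing about real-world performance; the point is only the order of the stages and that unseen samples reuse the feature indices chosen during training.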
D. Evaluation Metrics

Performance of malware detection methods is generally evaluated by False Positive Rate = FP/(FP + TN), True Positive Rate = TP/(TP + FN), specificity = TN/(TN + FP), precision = TP/(TP + FP), and accuracy = (TP + TN)/(TP + TN + FP + FN), where TP, FP, TN and FN are true positives, false positives, true negatives and false negatives, respectively. Malware samples are commonly considered as positive instances. Moreover, the Matthews correlation coefficient, F-score, Kappa statistic, confusion matrix, receiver operating characteristic and area under the curve measures have been used. For clustering-based algorithms, Macro-F1 and Micro-F1 metrics are used to accentuate the performance on rare and common categories, respectively [3], [4].

V. RESEARCH CHALLENGES AND OPPORTUNITIES

The ever-growing demand for minimized failure rates of anti-malware solutions has opened up exigent research opportunities and challenges that are yet to be resolved.

A. Issues in existing anti-malware methods

Malwares are still exponentially evolving in sophistication, and more difficult plights lie ahead. Most prior static methods do not work for novel/unknown/zero-day signatures, while dynamic or hybrid methods require a virtual environment and are time-consuming. Moreover, virtual environments are becoming less effective as malware writers are usually one step ahead by implementing high-level new techniques to conceal malicious features. Though efforts are afoot to design multi-level and parallel processing systems, existing anti-malware methods/tools are, all in all, not adequate or potent for higher levels of concealment. Current anti-malware systems also face challenges like scalability, lack of truly real-world representative datasets, irreproducibility of published results, low generalization and detection disagreement among them for the same samples. There is a need for improved and comprehensive malware countermeasures, which could be developed by utilizing recent advanced machine/deep learning, data mining and adaptive schemes. Also, approaches embodying anomaly analysis with behavioral data should be designed to investigate what the malware is doing rather than how it is doing it. This may result in minimized error and false alarm rates.

B. Advanced machine learning (AML) techniques for anti-malware

Quintessential anti-malwares often depend on non-linear adversary explicit models and expert domain knowledge, thereby making them prone to overfitting and lower overall reliability. Conversely, AML techniques attempt to imitate attackers with various content, contexts, etc. rather than explicit models/systems/attacks. A few preliminary studies on shallow AML usage for anti-malware have been conducted, but a lot of effort is still needed regarding AML anti-malware. For improved accuracy, flexibility and scalability on a wide range of samples, including unknown ones, AML paradigms like open set recognition, more complex and residual deep learning, dictionary learning and data mining should be explored for feature segmentation/representation learning/selection/classification and for determining temporal relationships within and between malware sections.

C. Mobile device malwares

The number of smart-devices connected to the internet is growing exponentially, and so are malwares (especially via third party apps) against them. Few studies have been conducted on mobile device malwares. Moreover, most existing anti-malware techniques are not real-time and are unsuited for mobile devices because of the high computational cost and/or complexity of the features used for analysis. Thus, real-time lightweight mobile anti-malware via Bayesian classification is an interesting research direction to be explored. Information from multiple in-built sensors (e.g., accelerometer) may enhance mobile anti-malware performance. Mobile hardware malware detection and removal is another issue that needs serious exploration. Sooner or later, mobile anti-malware-inspired techniques will substantially impact smart-device design. In any case, smart-device malwares should be tackled by both preventive and effective countermeasures. App developers should ensure that their apps abide by security and privacy policies. App store administrators should vet and remove dubious apps. Users should use superior anti-malwares and install trusted apps. On the whole, wearable and mobile device malware and anti-malware form a new research field in cybersecurity with pressing problems worth researching, like malware affecting the device’s master boot record or stealthily exploiting the device to mine cryptocurrency, and how well an anti-malware that performs well on benchmark data will perform under real-world environments.

D. Large-scale benchmark databases

Advancement in malware research deeply depends on the public availability of comprehensive benchmark datasets incorporating accurate labels and contexts. Most existing databases suffer from limitations like small size, missing information/features, imbalanced classes, and not being publicly available. Lack of adequate large-scale public datasets has stymied research on malware. Public benchmark datasets will help to compare independent anti-malware schemes, determine inter and intra relationships between security infringement phenomena, and unify malware findings to draw firm conclusions with reference to statistical significance. Nevertheless, collecting large-scale heterogeneous annotated databases is challenging and time- and resource-consuming due to the diversity of malware attributes, forms and behaviors. Crowdsourcing may help to accumulate different annotated large-scale datasets.

E. Graph-based malware analysis

Malwares with concealments are dominant nowadays and effective in evading conventional anti-malwares that largely disregard learning and identifying the underlying relationships between samples and variants, and contextual information. Graph-based relationship representations and features (e.g.,
data- and control-flow graphs, call graphs, data-, program-, and control-dependency graphs) offer interesting possibilities even when malware code is altered, as they help in tracking malware genealogy in different settings. Devising graph-based anti-malwares still faces issues from data heterogeneity, noisy and incomplete labels, and computational cost during real-time detection. To some extent, such challenges may be addressed in a decentralized fashion. Furthermore, the use of multiple directed and undirected graphs, multi-view spectral clustering, heterogeneous networks, multiple graph kernel learning, dynamic graph mining and deep graph convolution kernels to capture contextual and structural information could be a fruitful area of research.

F. Bio-inspired anti-malware

Several limitations of traditional anti-malwares could be suppressed by bio-inspired techniques (e.g., biological immune system, biological evolution, genetic algorithms and swarm intelligence). Comparatively, these techniques are lightweight, highly scalable and less resource-constrained. Adaptive bio-inspired techniques that are used for both intelligent concealment-invariant feature extraction and classification can dramatically enhance accuracy in the wild. Bio-inspired methods that define particular objective functions to discriminate a system under attack from a malfunctioning or failing one may also help to strengthen security. The combination of bio-inspired algorithms with deep neural networks is one of the most promising directions; however, it has been explored little in anti-malwares.

G. Defense-in-depth anti-malware

An anti-malware strategy that is composed of multiple defense levels/lines rather than a single one is called defense-in-depth. Such a strong defensive mechanism is expected to be more robust as it does not depend on one defense technique: if one is breached, the others are not necessarily breached. Each machine/cyber-system architecture can be divided into various levels of depth, e.g., in a power grid system, the meters, communication frameworks, and smaller components could be envisaged as the lowest, intermediate and highest level, respectively. Another solution is active or adaptive malware defense. In active defense, the developer anticipates attack scenarios at different levels and devises malware countermeasures accordingly; it has received little attention due to its inherent complexity. In adaptive defense, the system is persistently updated by retraining/appending novel features or dynamically adjusted in response to changing environments. Adaptive defenses would require fast, automated and computationally effective techniques and could use unsupervised learning and concept drift.

H. Internet of things (IoT) attacks

IoT devices are progressively being used in different domains ranging from smart-cities to smart- and military-grids. Despite the finest security endeavors, IoT devices/systems can also be compromised by innovative cyber-attacks. Security of IoT technology is more crucial as it is always connected to a network. IoT cyber-security is a relatively new research realm and quite challenging owing to heterogeneous networks with multisource data and several categories of nodes. To this end, different routes (e.g., predictive and blockchain) could be effective. Predictive security attains cyber resiliency by devising models that predict future attacks and prevent them in advance. As there is a strong correlation between security infractions and human blunders, predictive models should consider computer networking, social sciences, descriptive theory, uncertain behavior theory and psychology from attackers’, users’ and administrators’ perspectives at different granularity levels. Blockchain can be utilized for self-healing of compromised devices/systems. Models could be devised that exploit, e.g., redundancy to heal corrupted code/software by replacing it with good code, since in blockchain one can trace and roll back the firmware versions. However, such models should also be capable of handling resource, energy and communication constraints, which may be achieved by lightweight machine/transfer/reinforcement learning based access control protocols.

I. Deception and moving target anti-malware techniques

Deception techniques (e.g., honeypots), which lure adversaries to strike in order to mislead them with false information, are being used to detect and prevent malwares. There are two kinds of honeypots, i.e., client and server. Honeypots help to reduce false positives and prevent DDoS attacks. Complex attacks/tools (e.g., polymorphic malware) are increasingly able to identify honeypots or to alter their behaviors to deceive the honeypots themselves. Also, a honeypot can be exploited by attackers to undermine other sensitive parts of frameworks. More complicated honeypot and honeynet (i.e., a bunch of honeypots) schemes (e.g., shadow honeypots) should be devised, as a compromised honeypot will put the security of the whole organization in danger.

Moving target techniques (aka dynamic platform methods, DPMs) dynamically randomize system components to reduce the likelihood of successful attacks and shorten attack lifetime. Though an adversary must undermine all platforms, not just one, to evade DPMs, DPMs require complicated application state synchronization among varying platforms and expand the system’s attack surface. Much less effort has been dedicated to developing well-articulated attack models and to how deception elements and strategy should be upgraded to confront dynamic changes in attack behaviors. Future research should concentrate on devising unorthodox methodologies, performing real-world analyses to compute and compare the effectiveness of deception and DPM techniques, and studying whether DPMs conflict or can co-exist with other anti-malwares.

J. Decentralized anti-malware

Data sharing and trust management hinder the advancement of current anti-malwares, which can be resolved by decentralized malware detectors using blockchain technology. However, this has received little attention until now. For the intersection of anti-malware and blockchain technology, future directions will include exploring overhead traffic handling, quality and sparse
malware signatures, building accurate dynamic normal nature of traffic, reducing massive false alerts, energy and cost, blockchain latency, case-by-case scenario investigation, and more proof-of-concept implementations.

K. Botnet countermeasures

Thwarting botnets has become a key area. Several botnet detection and defense architectures have been proposed. Various issues surround the study of botnet countermeasures, e.g., difficulties in testing devised botnet defenses in real scenarios/data. Besides, there is a lack of a widely acknowledged benchmark or standard methodology to quantitatively evaluate or compare bot defenses, presumably due to privacy and data sharing concerns. Botnets, including IoT bots and socialbots, will continue to rise until effective measures, both technical and non-technical, are taken. Technical factors include passive internet service providers and unassertive software. Non-technical factors include establishing a distributed global environment, local and multinational legal issues and poor user awareness.

L. Privacy preservation

Malwares that steal sensitive information have received much attention. However, preserving user privacy in malware analysis (especially at the cloud or a third party server) and in malware data sharing is still an open and seldom touched concern. Establishing privacy and regaining trust in commercial anti-malwares would become difficult if a user’s privacy/data is compromised even once. The majority of prior anti-malwares overlook the privacy and security of the user, data and network. Thus, relatively little work has been done on privacy protection frameworks that respect public and legal opinions. Privacy preservation mechanisms that do not influence the detection performance are practically worthy of contemplation. Formulating lightweight detection and privacy protection systems usable on mobile devices to balance security, efficacy, privacy and power consumption demands special consideration. More innovative privacy preservation approaches (e.g., allowing the user to balance privacy, convenience and security levels) in malware analysis have been highlighted by many experts as essential future research to be carried out.

M. Big data malware analysis

The demand for big data malware analysis frameworks is steadily expanding. Practitioners are working to resolve big data malware challenges such as volume (e.g., collecting, cleaning and compressing data/labels), velocity (e.g., real-time online training, learning, processing or streaming big data), variety (e.g., heterogeneous multi-view data learning/embedding), veracity (e.g., learning with contradicting and unreliable data), and value (explainable ML based malware analysis). Another promising future research direction is devising large-scale feature selection techniques, which are less dependent on feature engineering, via distributed feature selection, low-rank matrix approximation, adaptive feature scaling, spectral graph theory, and fuzzy and neuro-fuzzy clustering. Rigorous efforts need to be made to investigate the use of synchronous parallel processing (e.g., Spark) and to develop a body of knowledge on the pros and cons of big data anti-malware tools to assist practitioners.

N. Malware analysis visualization systems

Existing methods to analyze malwares are time-consuming for malware analysts. Highly interactive visual analysis would help researchers and practitioners to forensically investigate, summarize, classify and compare malwares more easily. Most prior techniques are very limited with regard to interactivity, mapping temporal dimensions, scalability and representation space (e.g., they are superficially 2D rather than 3D). The field of developing malware visualization systems covering a consequential range of malware types and environments is vital and emerging. Encyclopedic visualization systems will lead analysts/researchers to ascertain novel research domains in the years to come.

O. Multimodal anti-malwares

Multimodal anti-malwares, which consolidate evidence from different kinds of features/sources (e.g., strings, permissions, elements converted to image matrices), can overcome numerous constraints in frameworks that consider only one or a few features. Multimodal frameworks are more flexible and can significantly improve on the accuracy of unimodal ones in the wild. Multimodal frameworks may include multiple sensors, algorithms and instances, and information can be fused at the feature, score or decision level. There is ample room to develop novel fusion architectures. Moreover, multimodal frameworks are expected to be intrinsically more robust to concealments, but no study has investigated how robust they are to concealments.

P. Clustering for malware analysis

Previous works have shown that clustering could be a useful tool to effectively classify unknown malwares for improved generalization, to underline an unseen family’s behaviors for thorough analysis that may help build more robust anti-malware schemes, and to label huge volumes of malwares in a fast and automatic fashion, which has become a major challenge. A future goal should be to further improve the accuracy of clustering-based malware analysis using cluster quality measurements, contextual/metadata information, boosted genetic algorithms, etc. Attention should also be given to rectifying security issues, e.g., poisoning and obfuscation attacks against targeted clusters.

Q. Hardware-based Solutions

Hardware-based detectors have recently been gaining momentum against the proliferation of malware. Such detection mechanisms utilize low-level architectural features, which are obtained by redesigning the micro-architecture of computer processors, e.g., CPUs with special registers providing hardware and software anomaly events. Nevertheless, research in this domain and in trustworthy systems (i.e., inherently secure and reliable against human errors and hostile parties) is still in its initial stage and has a long way to go. Furthermore, there is a dearth of studies on the efficacy of anti-malwares combining hardware-
and software-based techniques that have exceptional potential to uncover more elaborate malwares. Likewise, data from smart devices’ sensors (e.g., GPS and ambient light sensors) could also be used as an additional feature vector to profile malware.

current. Some training camps/workshops are being held by companies/organizations also for the general public, but they are exceptionally expensive. More online free-to-access training courses will surely diminish malware damages.