Proud Mal Static Analysis Based Malware Analysis For Exes
https://doi.org/10.1007/s40747-021-00560-1
ORIGINAL ARTICLE
Abstract
Enterprises are striving to remain protected against malware-based cyber-attacks on their infrastructure, facilities, networks
and systems. Static analysis is an effective approach to detect the malware, i.e., malicious Portable Executable (PE). It performs
an in-depth analysis of PE files without executing, which is highly useful to minimize the risk of malicious PE contaminating
the system. Yet, instant detection using static analysis has become very difficult due to the exponential rise in volume and
variety of malware. The compelling need of early stage detection of malware-based attacks significantly motivates research
inclination towards automated malware detection. The recent machine learning aided malware detection approaches using
static analysis are mostly supervised. Supervised malware detection using static analysis requires manual labelling and human
feedback; therefore, it is less effective in rapidly evolutionary and dynamic threat space. To this end, we propose a progressive
deep unsupervised framework with feature attention block for static analysis-based malware detection (PROUD-MAL). The
framework is based on cascading blocks of unsupervised clustering and features attention-based deep neural network. The
proposed deep neural network embedded with feature attention block is trained on the pseudo labels. To evaluate the proposed
unsupervised framework, we collected a real-time malware dataset by deploying low and high interaction honeypots on
an enterprise organizational network. Moreover, an endpoint security solution was also deployed on the enterprise organizational
network to collect malware samples. After post processing and cleaning, the novel dataset consists of 15,457 PE samples
comprising 8775 malicious and 6681 benign ones. The proposed PROUD-MAL framework achieved an accuracy of more
than 98.09% with better quantitative performance in standard evaluation parameters on collected dataset and outperformed
other conventional machine learning algorithms. The implementation and dataset are available at https://bit.ly/35Sne3a.
Keywords Unsupervised classification · Progressive learning · Malware detection · Static analysis · Feature attention
Complex & Intelligent Systems
According to a recent analysis by Juniper Research, the financial impact of data breaches will increase by 11% per year and will reach a level from $3 trillion to $5 trillion in 2024. Therefore, it is the utmost requirement of every business to protect its information-based assets, since even a single attack can result in critical data loss. There are several classes of malware, including Ransomware [11], Trojan [14], Key Logger [3], Backdoor [21], Launcher [13], Remote Access Toolkits (RAT) [33], Spam-Sending malware [34], etc. The approaches for malware detection are either signature-based [2] or behavior-based [31]. The first approach is good for identifying known attacks without producing an overwhelming number of false alarms [3], but it requires frequent manual updates of the database with rules and signatures. The latter approach, on the other hand, can be used to generalize signatures related to host and network that are used to identify the presence of an unwanted piece of code or activity on victim computers or networks. The use of packers [46], encryption [5], polymorphism [31] and obfuscation [28] techniques can easily bypass signature-based detection, as it only performs pattern or string matching [11]. Behavior-based [36] approaches, which focus on pattern identification including file activity, registry activity and API calls [8], are based upon either static [7] or dynamic analysis [6]. The latter form of analysis requires execution of the malicious code [35] in a controlled setup, i.e., a sandbox, and is often slow, resource intensive and not suitable for deployment in the production environment, as also discussed in [22]. Moreover, due to the geometric rise in zero-day malware, existing approaches have become less efficient at detecting zero-day attacks, and there is a dire need for an automated malware detection and classification system equipped with machine learning techniques [9]. Machine learning can be either supervised or unsupervised: supervised learning, or discriminative deep architectures, conducts the training over labelled data, i.e., classification, regression or predictive analytics, whereas unsupervised learning, or so-called generative architectures, draws inferences from datasets consisting of input data without labels [43].

Keeping in view the ongoing huge growth in the number of malware samples, the time complexity of malware analysis, the acute shortage of domain experts and the demand for earliest detection, considerable research on machine-learning-based techniques is being conducted for automated malware analysis and classification [10, 19], but most of the static analysis-based approaches are supervised in nature. The availability of an updated malware dataset along with labels is also a major hurdle for malware analysts. The aforementioned limitations and gap motivated the development of an automated unsupervised malware analysis system that investigates portable executables and makes a classification decision based on static analysis. Moreover, it is essential to have a suitable representation of feature vectors to make decisions regarding malware classification. This paper proposes a progressive deep unsupervised framework (PROUD-MAL) for classifying Windows PE files using static analysis of executables. The major contributions are described as follows:

(a) The purpose of this research is to present a framework for unsupervised classification of Portable Executables (PEs) using static features. We term this framework PROUD-MAL. To this end, we propose a two-phase cascaded formulation of progressive unsupervised clustering followed by an attention-based deep neural network for static feature-based malware classification.
(b) Attention models have shown promising outputs in various domains such as image analysis and natural language processing but, to the best of our knowledge, there is no research on applying an attention-based mechanism to malware classification using unsupervised clustering over static features. To this end, we propose a Feature Attention-based Neural Network (FANN) architecture for malware classification. The Attention Block (AB) considers the correlation of a feature to the target or to other features, besides considering the feature weight. It puts relatively more weight on features that contribute more to minimizing the validation loss and maximizing the classification accuracy.
(c) We also collected a novel real-time malware dataset comprising 15,457 (25 GB) PE samples collected over a period of seven months (200 days). The dataset was collected by deploying low and high interaction honeypots as well as an enterprise endpoint security solution over a large organizational network. It is publicly available to the research community.
(d) The quantitative assessment shows that the proposed model achieved superior performance and outperformed state-of-the-art supervised approaches as well as an unsupervised one. The high classification accuracy demonstrates the significance and utility of the proposed framework.

The rest of the paper is organized as follows: Section “Background and context” describes the background and structure of Windows-based PEs. Section “Related work” narrates the related work. Section “Methodology and architecture” describes the dataset acquisition, data pre-processing, feature extraction and the proposed framework, i.e., PROUD-MAL, followed by the FANN architecture. Section “Experiments and results” narrates the implementation details including the experimental setup, results obtained and discussion. Section “Conclusion and future direction” gives the concluding remarks followed by the future direction.
researchers that the classification accuracy depends on the model applied as well as the disassembler chosen. In 2018, it was also shown that a machine learning model can learn from a sequence of raw bits without the explicit feature extraction used in conventional malware classification practice [19]. The use of machine learning-based classifiers for malware intrusion detection is a well-known approach in network analysis [25]. In addition to string extraction, researchers [30] have also used statistical approaches such as raw byte and byte entropy histograms. In [20], researchers presented an approach using static analysis of features from the PE Optional Header fields by employing the Phi (φ) coefficient and the Chi-square (KHI2) score. In [23], features were extracted from system calls and submitted to a neural network for classification using 170 samples, obtaining an Area under Curve (AUC) of 0.96. In [24], experiments were performed to identify the critical point at which to quarantine the activity of malicious code related to its communication with a remote command and control server. Researchers [26] presented a framework that ensures the protection of application programs against malware on mobile platforms. In 2017, researchers used static analysis to extract key information, i.e., headers, strings and sequences, from the metadata of PE files. The model was trained over a dataset of 4783 samples using Random Forest and achieved 96% accuracy. The researchers in [42] designed a malware classification method for several malware variants based on signature prediction. The proposed solution was based on static analysis of features including strings, n-grams and hashes extracted from the PE header. In [27], the researchers proposed a malware detection system based on supervised learning. They devised a tool for feature extraction from the header of PE files. Later, the system was trained using supervised machine learning classification algorithms such as Support Vector Machine and Decision Trees. In [47], the authors proposed Virtual Machine Introspection, a machine learning-based approach for malware detection in virtualized environments. The researchers extracted opcodes using static analysis and trained the classifier with selected features. Later, Term Frequency-Inverse Document Frequency (TF-IDF) and Information Gain (IG) were also applied as classification algorithms. In 2019, researchers [29] proposed a malware detection approach for the IoT environment based on a similarity hashing algorithm. In the proposed technique, scores of binaries were calculated to identify the similarity between malicious PEs. Numerous hashing techniques [21], including PEHash, Imphash and Ssdeep, were used. Later, the researchers integrated the hash results using fuzzy logic. Recently, attention models have shown promising output in tasks such as image analysis, machine translation, computer vision and natural language processing [32]. The attention mechanism helps the model to focus on the most relevant features as required. Therefore, we employed the attention-based mechanism over static features using unsupervised clustering for malware classification.

Methodology and architecture

In this section, the design of our proposed unsupervised framework, i.e., PROUD-MAL, for classifying Windows-based PEs using clustering based on static analysis is explained. PROUD-MAL is a custom-built unsupervised framework composed of multiple modules including novel dataset collection, dataset pre-processing and feature extraction, and unsupervised clustering of the malicious and benign PE samples, as illustrated in Figs. 1 and 2. Moreover, the designed Feature Attention-based Neural Network (FANN) is trained over pseudo labels. The proposed classifier is evaluated over the test dataset, which was kept hidden during training, as depicted in Fig. 3.

Malware dataset acquisition

The first stage of the proposed framework is the indigenous dataset collection. In this research work, a pilot attempt is made to collect a dataset including malware and benign samples, which will be extended as future research work. A major obstacle in leveraging machine learning techniques for malware analysis is the lack of sufficiently big, labelled datasets that contain malicious as well as benign samples. Moreover, it is very important to keep updating the malware dataset due to the ever-changing smart evasion approaches adopted by malware authors. The collection of malicious samples was difficult, but the collection of benign samples was also not easy. To this end, we used two different approaches for collecting the malicious and benign samples, as illustrated in Fig. 1. First, we deployed low and high interaction honeypots as production units and intentionally configured them in a vulnerable way to collect malicious files and log unauthorized behavior. The low interaction honeypot, i.e., Honeyd [34], and the high interaction honeypot, i.e., SMB Honey Pot [4], were deployed over the enterprise organizational network to emulate the services frequently targeted by attackers and the production systems, respectively. Second, the Kippo-Malware collector and the Kaspersky endpoint security solution were also deployed over the enterprise organizational network to collect malware as well as benign samples. Benign PEs, including .exe and .dll files, were also collected from machines with licensed and updated versions of the Windows operating system including Windows XP, 7, 8 and 10. Special precautions were taken to comply with licensing and regulatory requirements while collecting benign samples. Moreover, additional precautionary measures such as establishment and configuration of sandbox environment for dataset collection and
further processing were also taken into consideration. We collected 19,000 samples (31 GB) over a period of seven months (200 days) but, after performing dataset verification, the samples were reduced to 15,457 (25 GB) PE samples comprising 8775 (17 GB) malicious and 6681 (8 GB) benign samples. The reduction in the number of samples resulted from filtering out corrupt and duplicate samples. The verification of samples and removal of duplicates were done using hash values. The dataset is divided in a 60:20:20 ratio for training, validation and testing, respectively, of the proposed model.
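The hash-based duplicate removal and the 60:20:20 split described above can be sketched as follows. This is a minimal illustration and not the authors' tooling: the helper names `deduplicate` and `split_60_20_20`, the fixed shuffle seed and the toy byte strings standing in for PE files are all assumptions made for the example. MD5 is used because the pre-processing section of the paper names it for duplicate detection.

```python
import hashlib
import random


def deduplicate(samples):
    """Keep the first (name, data) pair seen for each distinct MD5 digest."""
    seen = set()
    unique = []
    for name, data in samples:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


def split_60_20_20(items, seed=0):
    """Shuffle reproducibly, then cut into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(items) * 60 // 100
    n_val = len(items) * 20 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Toy stand-ins for PE files: 20 names but only 7 distinct byte contents.
samples = [(f"sample_{i}.exe", bytes([i % 7]) * 16) for i in range(20)]
unique = deduplicate(samples)

# Hypothetical sample identifiers 0..99 split into the three partitions.
train, val, test = split_60_20_20(range(100))
```

Any remainder left by the integer division falls into the test partition, so no sample is dropped.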
Data pre-processing and feature extraction

To prepare the dataset, a series of pre-processing steps was performed to transform the raw data into a meaningful format, i.e., identification of file type, removal of corrupt and duplicate samples, unpacking of the binaries and verification of labels. MD5 hashes were used to ensure that the dataset does not contain any duplicate binaries. It was also ensured that only unpacked binaries are submitted for feature extraction; therefore, section names were examined using the tool PEStudio [45] to see whether any of them indicates a popular packer [46] such as UPX, ASPack or FSG.

Moreover, verification of labels is a significant activity, which was performed by deploying signature-based anti-virus solutions in parallel and finally using the cloud-based service VirusTotal. We used the VirusTotal API [44] as well as the VirusTotal web interface to submit the binaries for verification. The VirusTotal API does not require the web interface for file submission. It is pertinent to mention here that labelling samples in datasets of text, images or speech is a relatively easy task, but labelling, as well as verifying whether a sample is benign or malicious, was a very time-intensive task. Handling malicious files needs extra precautionary measures such as the establishment and configuration of a sandbox environment. During the process, findings were observed such as the existence of overlapping segments, the usage of non-standard version details and section names, and zero size of raw data, which also results in a high virtual size of the section in the case of packed PE files. It was also observed that some packers attempt to reduce entropy by embedding zero bytes in data to bypass screening. Moreover, in malicious files, the data section is missing or has a relatively lower value (if present), and the permissions assigned to the section are found to be inconsistent with standard practices. It was also observed that the resource size is relatively small, as malicious files are mostly non-GUI. The study of compilation times revealed that malware is mostly compiled during non-working days and also does not have a genuine creation time. After the dataset preparation and pre-processing, feature extraction was performed. Features were extracted by parsing the headers of Portable Executables (PEs). A custom parser was developed to read the PE headers and tokenize the features and their respective values. Finally, the tokens are organized in a CSV-compatible format. More than 35 features were extracted; below is a brief description of selected features.

• MD5 is a cryptographic hash. It is a 128-bit value written as 32 hexadecimal digits, and each file has its own MD5 value.
• Machine represents the target machine such as Intel 386, MIPS little endian, Motorola, etc.
• Size of optional header is a mandatory feature irrespective of the name and provides information related to the PE. It is included only for executable files and not for object files.
• Characteristics represent attributes of the file such as base relocation address, local symbols, user program or system file, little-endian or big-endian architecture, or whether the file is a DLL or not, etc.
• Major/Minor Linker Version records the version number of the linker in the header of the .dll or .exe file.
• Code Size represents the size of the code (text) section.
• Size of Uninitialized Data is the size of the uninitialized data section.
• Address of Entry Point is the address where the PE loader will begin execution; this address is relative to the image base of the executable. It is the address of the initialization function for device drivers and is optional for DLLs.
• Base of Code is the pointer to the beginning of the code section, which is relative to the image base.
• Image Base is the preferred address of the first byte of the executable when it is loaded in memory.
• Section Alignment: The address assignment to the PE requires section loading. The section alignment is set to 0x2000. This means that the code section starts at 0x2000 and the section after that starts at 0x4000.
• File Alignment: Just like the section alignment, the data also needs to be loaded. It is set to 512 bytes or 0x200.
• Major/Minor Operating System Version is the version supported by the PE.
• Major Image Version is the major version number of the image.
• Size of Image is the size of the executable after being loaded into memory. It must be a multiple of the section alignment.
• Size of headers represents the size of all headers, i.e., PE header, the optional header, DOS header.
• Checksum is used for file validation at load time and to confirm whether a file is undamaged or has been corrupted.
• Sub System: This field points to the user interface type required by the operating system.
• Size of Stack Reserve is the number of bytes allocated for the stack and determines the stack region utilized by threads.
• Size of Stack Commit is the amount of memory that is committed to the stack at startup.
• Size of Heap represents the space to reserve for loading.
• Loader Flags informs upon loading whether to break upon loading, debug on loading or to set to default.
• Number of RVA is the number of relative virtual addresses in the rest of the optional header. Each entry describes a location and size. The structures contain critical information about specific regions of the PE file.
• Load Configuration size is usually used for exceptions. It is only utilized in Windows NT, 2000 and XP.

given an empirical validation regarding the appropriate selection of the number of clusters in a dataset, which is also depicted in Fig. 4b. Nevertheless, if such information is not known in advance, other clustering algorithms, e.g., mean-shift [20] or unsupervised deep embedding [21], etc., may be more appropriate. Therefore, the extracted features $F = \{f_1, f_2, \ldots, f_N\}$ are submitted to the k-means algorithm, which clusters similar features (i.e., the corresponding binaries). Using k-means allows us to obtain a set $C = \{c_1, c_2, \ldots, c_k\}$ of $k\ (= M)$ cluster centroids by keeping the following optimization function at minimum:

$$C \leftarrow \arg\min_{C} \sum_{i=1}^{N} \sum_{j=1}^{M} \lVert f_i - c_j \rVert^2, \tag{1}$$
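As an illustration of how k-means can turn unlabelled feature vectors into pseudo labels, the sketch below runs a plain Lloyd's iteration with k = 2 on synthetic data. It is not the authors' implementation. Several choices are assumptions made for the example: each vector is assigned to its nearest centroid under squared Euclidean distance (the usual k-means reading of the objective in Eq. (1)), the centroids are initialized by a deterministic farthest-point rule, and the two Gaussian groups merely stand in for extracted PE header features.

```python
import numpy as np


def assign(features, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row."""
    diffs = features[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def farthest_point_init(features, k):
    """Assumed initialization: first row, then rows farthest from those chosen."""
    centroids = [features[0]]
    for _ in range(1, k):
        dist = np.min(
            [((features - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        centroids.append(features[dist.argmax()])
    return np.array(centroids)


def kmeans_pseudo_labels(features, k=2, max_iter=100):
    """Plain Lloyd's iteration; returns (centroids, pseudo labels)."""
    centroids = farthest_point_init(features, k)
    for _ in range(max_iter):
        labels = assign(features, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(features, centroids)


# Synthetic stand-in for static features: two well-separated groups.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
features = np.vstack([group_a, group_b])

centroids, labels = kmeans_pseudo_labels(features, k=2)
```

The resulting `labels` array plays the role of the pseudo labels on which a downstream classifier would be trained. Which cluster corresponds to "malicious" is not determined by clustering alone and has to be validated separately.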
ing subsection. The third and fourth layers comprise 13 neurons each, followed by an output layer using sigmoid activation for binary classification. The model is further fine-tuned by adjusting the hyperparameters to achieve the optimum results.

Attention mechanism

As introduced above, the PE header has numerous features, some of which might have a higher impact on identifying malicious PEs. Therefore, we employ an attention mechanism to prioritize the significance of important features while penalizing the “noise” fields. The main principle behind the proposed Attention Block (AB) is as follows: selecting significant features rather than examining the entire feature set improves classification. To this end, a feature vector sequence of length n is extracted from the PE header. After processing the feature vector at the first iteration, a significant combination of length k is selected based upon an attention threshold. Subsequently, this subsequence is utilized as prior knowledge to train the model to predict the classification of PEs. The sequence can be represented as $\{F_1, F_2, \ldots, F_n\}$. The weighted vector containing the weight $W_i$ of each data point $S_i$ in the feature combination sequence is represented as $\{(F_1, W_1), (F_2, W_2), \ldots, (F_n, W_n)\}$. Next, we extract the subsequence with the k highest weights: $\{(F_1, W_1), (F_2, W_2), \ldots, (F_k, W_k)\}$. As discussed earlier, the AB connects two parallel attention networks/layers of opposite directions to the same output. Each network/layer computes $attn_{(i,h)}$ for the features of a PE instance given as input, where i represents features and h represents the number of units. One network processes the sequence from top to bottom (forwards) and the other processes the sequence from bottom to top (backwards). Let $x_t$ denote the current step of the input sequence and $h_{t-1}$ the previous hidden state. The next hidden state $h_t$ can be calculated as follows:

$$h_t = f(A x_t + W h_{t-1}), \tag{2}$$

where f is a non-linear activation function. A and W represent the weight matrices of the current input vector $x_t$ and previous
hidden state $h_{t-1}$. At each time step t, the forward pass calculates the hidden state $h_t$ by considering the previous hidden state $h_{t-1}$ and the new input $x_t$. At the same time, the backward flow computes the hidden state $h_t$ considering the future hidden state $h_{t+1}$ and the current input $x_t$. Afterward, the best output among the forward and backward $h_t$ is selected to obtain a refined vector representation. The first network or set of layers in the AB uses the sigmoid function while the other uses the ReLU function. Finally, the best output is applied to the feature importance map while taking the product of the learned parameters with the respective probabilities. Each layer computes $attn_{(i,h)}$ for the features of a PE instance given as input, where i represents features and h represents the number of units. The feature weights for the first layer can be learned as in Eq. (3):

$$a_{(i,h)} = \sigma(x_{(i,h)}, W), \tag{3}$$

where $x_i$ is the input to the layer, W denotes the weights of the layer and σ represents the sigmoid activation function applied to the feature map $w_{(i,h)}$ for the first attention layer. The second attention layer is given by Eq. (4):

$$b_{(i,h)} = \partial(x_{(i,h)}, W). \tag{4}$$

Similarly, ∂ represents the ReLU activation function employed by the second attention layer on the feature map $w_{(i,h)}$, followed by selecting the maximum of Eqs. (3) and (4):

$$w_{(i,h)} = \max(a_{(i,h)}, b_{(i,h)}). \tag{5}$$

The attention is computed by multiplication of $w_{(i,h)}$ with the output of the sigmoid function as:

$$attn_{(i,h)} = \frac{\exp(w_{(i,h)})}{\exp(w_{(i,h)}) + 1} \otimes w_{(i,h)}. \tag{6}$$

The feature attention-based layer learns to put relatively more weight on those features that have contributed more to minimizing the validation loss while learning the accurate classification, by applying the sigmoid function to the feature importance map and subsequently multiplying the learned parameters with the respective probabilities. The dataset based on validated predicted clusters is split in a 60:20:20 ratio for classification training, validation and testing, respectively. The model is trained over the predicted cluster dataset using classification algorithms including Random Forest, Support Vector Machine (SVM), Gradient Boost, Ada Boost, Naive Bayes and PROUD-MAL. The training is performed for 60 epochs (i.e., approx. 23,185 iterations) and the input was submitted to the network in batches of 32, with Adam as the optimizer and the learning rate initialized at 0.001 ($10^{-3}$) with stepwise decay. Binary cross entropy is utilized for loss calculation over the training data. After model training, it is tested to make predictions against the validation dataset. Finally, the empirical validation of the proposed PROUD-MAL approach is also performed against standard metrics over the test dataset, which is kept hidden during the training phase.

Experiments and results

Implementation details

This section describes the configuration and performance metrics used for the experiments to classify Windows-based PEs. The run-time environment configured for the experiments includes a workstation with an Intel Core i5-9500 processor @ 3.0 GHz with 6 cores and 6 logical processors, 32 GB RAM, 20.0 GB of virtual memory with virtualization enabled, an NVIDIA GeForce GTX 1650 graphics card with 4 GB RAM and the Windows 10 Pro 64-bit operating system. In terms of software, both Keras and TensorFlow were employed at the back end for implementation of our proposed framework. The training is performed for 60 epochs (i.e., approx. 23,185 iterations) and the input was submitted to the network in batches of 32, with Adam as the optimizer and the learning rate initialized at 0.001 ($10^{-3}$) with stepwise decay. Dropout regularization of 0.5 is placed after the fully connected layers, which helps to prevent overfitting. Generally, dropout removes neurons and their connections randomly. Moreover, we adopted the binary cross-entropy loss function, which minimizes the negative log-likelihood between the prediction and the ground-truth data. The momentum helps accelerate ADAM in the relevant direction and mitigates oscillations by adding a fraction of the update vector of the past time step to the current update vector. The accuracy and loss values provided by Keras are visualized using TensorBoard and console logs. The summary of hyperparameters is provided in Table 2.

Table 2 Hyper-parameters and associated values

Parameters       Values
Batch size       32
Epochs           60
Dropout          0.5
Learning rate    0.001
Loss function    Binary cross entropy
Optimizer        ADAM
Momentum         0.9

Results and discussion

We compared our proposed method with state-of-the-art supervised approaches. Despite this challenging comparison, the utility of our proposed framework is
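Returning to the attention block of Eqs. (3) to (6) in the methodology, the following NumPy sketch shows one possible numerical reading of those equations. It is an interpretation, not the authors' Keras layer: σ(x, W) and ∂(x, W) are assumed to mean a bias-free dense transform followed by the sigmoid and ReLU activations, the two branches are given separate weight matrices (`w_sigmoid` and `w_relu`, hypothetical names), ⊗ is taken as an element-wise product, and the shapes (38 input features, 13 units) are borrowed from figures quoted elsewhere in the paper purely for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function exp(z) / (exp(z) + 1), written in its stable form."""
    return 1.0 / (1.0 + np.exp(-z))


def attention_block(x, w_sigmoid, w_relu):
    """One forward pass through the assumed reading of Eqs. (3)-(6)."""
    a = sigmoid(x @ w_sigmoid)        # Eq. (3): sigmoid branch
    b = np.maximum(x @ w_relu, 0.0)   # Eq. (4): ReLU branch
    w = np.maximum(a, b)              # Eq. (5): element-wise maximum
    return sigmoid(w) * w             # Eq. (6): gate w by its own sigmoid


# Random stand-ins for a batch of 4 PE feature vectors and layer weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 38))
w_sigmoid = rng.normal(scale=0.1, size=(38, 13))
w_relu = rng.normal(scale=0.1, size=(38, 13))
attn = attention_block(x, w_sigmoid, w_relu)

# Scalar sanity case: x = 0 gives a = 0.5, b = 0, w = 0.5.
single = attention_block(np.zeros((1, 1)), np.ones((1, 1)), np.ones((1, 1)))
```

Because the sigmoid branch is always positive, the maximum in Eq. (5) is positive as well, so under this reading every attention value lies strictly between half of $w_{(i,h)}$ and $w_{(i,h)}$ itself.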
shows better predictive power and can also provide better sensitivity tuning. To the best of our knowledge, this is due to the unsupervised clustering cascaded with a classifier with embedded attention layers. However, RF, SVM and NB also showed good performance by achieving AUC of 98.78%, 97.40% and 95.37%, respectively. GB and Ada Boost achieved a relatively lower AUC, i.e., 90.99%. The comparison with the unsupervised approach [Hyrum S. Anderson et al. 2018] also showed superior performance: our approach demonstrated 5.2% higher classification accuracy. The detailed comparative assessment with supervised approaches as well as an unsupervised one has shown the utility and significance of the proposed architecture. It is also pertinent to mention that for classification of an unknown PE using anti-virus software, the training time is not important because a pre-trained neural network can be used. As the test time of the FANN model is less than 21 ms per step, the model is appropriate for subsequent use in real anti-virus software.

Experiments show that the True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) rates for FANN are 0.98, 0.02, 0.98 and 0.02, respectively. The quantitative assessment was conducted over 60 epochs with a batch size of 32. The training and validation accuracy as well as the training and validation loss are depicted in Fig. 6. The training and validation graphs in Fig. 6 show that PROUD-MAL is trained quite well at around 60 epochs. We also employed an early stopping criterion to discontinue further training at an appropriate stage. It is worth mentioning that as we increase the number of iterations, the loss or learning rate descends gradually (not shown for later iterations, as it is not significant in the figure). As our dataset has 15,457 binaries comprising 8775 (17 GB) malicious and 6681 (8 GB) benign samples, we also calculated the area under the ROC curve (AUC), as given in Tables 3 and 4, which is a widely used performance metric for imbalanced datasets. A visual inspection of the Receiver Operating Characteristic (ROC) curve (Fig. 7) shows that our framework performs better than other state-of-the-art supervised approaches. PROUD-MAL achieved a ROC value of 0.99 with a small discrepancy of 0.01. The visualization of the cluster prediction is generated by applying t-SNE to the dataset and is depicted in Fig. 8. The blue dots represent the malicious binaries and the yellow marks represent the benign PEs. The visualization shows minor overlap between the malicious and benign samples. There were 38 features in the vector space. However, applying the attention mechanism revealed that features with numerical values, e.g., section entropy, size of sections and image base, were given more weight by the AB. On the other hand, features that represent either a unique numerical value or a fixed-length value with a specific format, e.g., MD5 and checksum, are given relatively less weight than ordinary numerical values such as section entropy. But these attributes are given more consideration than features with string values, e.g., machine, characteristics, compiler, etc. The proposed scheme of using feature subsequence combinations by applying the attention mechanism resulted in a more refined feature representation. Subsequently, the quantitative results of the comparative assessment have demonstrated the utility of the attention mechanism for unsupervised classification of PEs using static features.

Conclusion and future direction

We have proposed and presented a progressive deep unsupervised malware classification framework, i.e., PROUD-MAL, with a deep neural network architecture that uses dense layers and an attention block for binary classification of Windows-based PEs based on features extracted from the header in a static fashion. Our proposed feature attention mechanism-based neural network for malware classification learns to put relatively more weight on those features that contributed more to
minimize the validation loss while learning the accurate classification. We also collected a novel real-time malware dataset by deploying low and high interaction honeypots as well as an endpoint security solution on an enterprise organizational computer network for validation of the proposed framework. This indigenously collected dataset is novel and is made public for the research community. We also intend to expand the volume of this dataset. The quantitative assessment shows that the proposed PROUD-MAL framework achieved an accuracy of more than 98.09%, with better performance on standard evaluation metrics on the indigenously collected dataset, and outperformed other conventional machine learning algorithms. As a way forward, our framework can be enhanced to explore behavioral analysis based on API calls [49] using reinforcement learning [50] for malware analysis. Future work also includes transforming PEs into malware images and performing entropy-based semantic segmentation of those images. This will potentially help malware analysts to use malware visualization to perform malware analysis more effectively for zero-day malware samples. The scope of future work may also include Non-Portable Executable (NPE) files.

Acknowledgements The authors would like to express their gratitude for the research grant of the Higher Education Commission (HEC) of Pakistan under the International Research Support Initiative Program (IRSIP) and the institutional support from the Department of Computer Science, University of Warwick, Coventry, United Kingdom (UK).

Declarations

Conflict of interest The authors declared that they have no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Tang Y, Xiao B, Lu X (2011) Signature tree generation for polymorphic worms. IEEE Trans Comput 60:565–579. https://doi.org/10.1109/TC.2010.130
2. Internet security threat report (2019) https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf
3. Gandotra E, Bansal D, Sofat S (2014) Malware analysis and classification: a survey. J Inf Secur 5:56–64
4. Provos N (2004) A virtual honeypot framework. In: Proceedings of the 13th conference on USENIX Security Symposium
5. Sung AH, Xu J, Chavez P, Mukkamala S (2004) Static analyzer of vicious executables (SAVE). In: Proceedings of the 20th annual computer security applications conference
6. Wuchner T, Cisłak A, Ochoa M, Pretschner A (2020) Leveraging compression-based graph mining for behavior-based malware detection. IEEE Trans Depend Secure Comput 16:1
7. Ghafir I, Hammoudeh M, Prenosil V (2018) Defending against the advanced persistent threat: Detection of disguised executable files. PeerJ Preprints 6:e2998. https://doi.org/10.7287/peerj.preprints.2998v2
8. Alazab M, Layton R, Venkataraman S, Watters P (2010) Malware detection based on structural and behavioral features of API calls. In: International cyber resilience conference 2010, Perth, pp 1–10
9. Devesa J, Santos I, Cantero X, Penya YK, Bringas PG (2010) Automatic behavior-based analysis and classification system for malware detection. In: ICEIS 2:395–399
10. Islam R, Tian R, Batten LM, Versteeg S (2013) Classification of malware based on integrated static and dynamic features. J Netw Comput Appl 36(2):646–656
11. Egele M, Scholte T, Kirda E, Kruegel C (2012) A survey on automated dynamic malware-analysis techniques and tools. ACM Comput Surv 44:2
12. Christodorescu M, Jha S (2003) Static analysis of executables to detect malicious patterns. In: Proceedings of the 12th conference on USENIX security symposium, vol 12, p 12
13. Ye Y, Wang D, Li T, Ye D (2007) IMDS: intelligent malware detection system. In: Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD), pp 1043–1047
14. Gandotra E, Bansal D, Sofat S (2014) Malware analysis and classification: a survey. J Inf Secur 2:56–64
15. Schultz MG (2001) Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE symp. on security and privacy, pp 38–49
16. Shabtai A, Moskovitch R, Elovici Y, Glezer C (2009) Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey. Inf Secur Tech Rep 14:16–29
17. Eskandari M, Khorshidpour Z, Hashemi S (2013) Hdm-analyser: a hybrid analysis approach based on data mining techniques for malware detection. J Comput Virol Hack Tech 9(2):77–93
18. Khodamoradi P, Fazlali M, Mardukhi F, Nosrati M (2015) Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms. In: 18th CSI international symposium on computer architecture and digital systems (CADS), IEEE, pp 1–6
19. Raff E, Sylvester J, Nicholas C (2017) Learning the PE header, malware detection with minimal domain knowledge. In: Proc. 10th ACM workshop artificial intelligence secure, New York, ACM, pp 121–132
20. Belaoued M (2015) A real-time PE-malware detection system based on CHI-Square test and pe-file features. In: IFIP international conference on computer science and its applications CIIA 2015: computer science and its applications, pp 416–425
21. Pietrek M (1994) Peering inside the PE: a tour of the Win32 portable executable file format
22. Rossow C et al (2012) Prudent practices for designing malware experiments: status quo and outlook. In: Proc. IEEE symp. secur. privacy (SP), pp 65–79
23. Tobiyama S, Yamaguchi Y, Shimada H, Ikuse T, Yagi T (2016) Malware detection with deep neural network using process behavior. In: Proc. IEEE 40th annu. comput. softw. appl. conf. (COMPSAC), vol 2, pp 577–582
24. Shibahara T, Yagi T, Akiyama M, Chiba D, Yada T (2016) Efficient dynamic malware analysis based on network behavior using deep learning. In: Proc. IEEE Global Commun. Conf. (GLOBECOM), pp 1–7
25. Mutz D, Valeur F, Vigna G (2006) Anomalous system call detection. ACM Trans Inf Syst Secur 9(1):61–93
26. Zhauniarovich Y, Russello G, Conti M, Crispo B, Fernandes E (2014) Moses: supporting and enforcing security profiles on smartphones. Depend Secure Comput IEEE Trans 11(3):211–223
27. Raff E, Nicholas C (2017) An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (ACM, 2017), pp 1007–1015
28. Rastogi V, Qu Z, McClurg J, Cao Y, Chen Y (2015) Uranine: real-time privacy leakage monitoring without system modification for android. Springer Int. Pub, Cham, pp 256–276
29. Namanya AP, Awan IU, Disso JP, Younas M (2019) Similarity hash-based scoring of portable executable files for efficient malware detection in IoT. Fut Gen Comput Syst 110:824–832. https://doi.org/10.1016/j.future.2019.04.044
30. Merkel R (2010) Statistical detection of malicious PE executables for fast offline analysis. In: Springer, Berlin, Heidelberg, ISBN 978-3-642-13241-4, pp 93–105. https://doi.org/10.1007/978-3-642-132414_10
31. Catak FO, Yazı AF, Elezaj O, Ahmed J (2020) Deep learning based Sequential model for malware analysis using Windows exe API Calls. PeerJ Comput Sci 6:e285. https://doi.org/10.7717/peerj-cs.285
32. Rush AM, Harvard SEAS, Chopra S, Weston J (2015) A neural attention model for sentence summarization. In: Proceedings of the international conference on empirical methods in natural language processing, Lisbon, Portugal
33. Collberg C, Thomborson C (2002) Watermarking, tamperproofing, and obfuscation - tools for software protection. IEEE Trans Software Eng 28(8):735–746
34. Koniaris I, Papadimitriou G, Nicopolitidis P, Obaidat M (2014) Honeypots deployment for the analysis and visualization of malware activity and malicious connections. In: 2014 IEEE international conference on communications (ICC), Sydney, pp 1819–1824. https://doi.org/10.1109/ICC.2014.6883587
35. Zhou Y, Jiang X (2012) Dissecting android malware: characterization and evolution. In: IEEE symposium on security and privacy, pp 95–109
36. Wan M, Shang W, Zeng P (2017) Double behavior characteristics for one-class classification anomaly detection in networked control systems. IEEE Trans Inf Forens Secur 12(12):3011–3023
37. Hearst M, Dumais S, Osman E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Appl 13(4):18–28. https://doi.org/10.1109/5254.708428
38. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern A Syst Hum 40(1):185–197
39. Caruana A et al (2006) An empirical comparison of supervised learning algorithms. In: ICML ’06 proceedings of the 23rd international conference on machine learning, pp 161–168
40. Kanungo T, Mount D, Netanyahu N, Piatko C, Silverman R, Wu A (2002) An efficient K-means clustering algorithm analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24:881–892. https://doi.org/10.1109/TPAMI.2002.1017616
41. Boutsidis C, Mahoney M, Drineas P (2009) Unsupervised feature selection for the k-means clustering problem. In: Advances in neural information processing systems 22-proceedings of the conference, pp 153–161
42. Gandotra E, Bansal D, Sofat S (2016) Zero-day malware detection. In: Sixth international symposium on embedded computing and system design (IEEE, 2016), pp 171–175
43. Ng CK, Jiang F, Zhang LY, Zhou W (2019) Static malware clustering using enhanced deep embedding method. Concurr Computat Pract Exper 2019:e5234. https://doi.org/10.1002/cpe.5234
44. Algaith A, Gashi I, Sobesto B, Cukier M, Haxhijaha S, Bajrami G (2016) Comparing detection capabilities of antivirus products: an empirical study with different versions of products from the same vendors. In: 2016 46th Annual IEEE/IFIP international conference on dependable systems and networks workshop (DSN-W), Toulouse, pp 48–53. https://doi.org/10.1109/DSN-W.2016.45
45. Kozachok AV, Kozachok VI (2018) Construction and evaluation of the new heuristic malware detection mechanism based on executable files static analysis. J Comput Virol Hack Tech 14:225–231. https://doi.org/10.1007/s11416-017-0309-3
46. Hassen M, Carvalho MM, Chan PK (2017) Malware classification using static after validation analysis-based features. In: 2017 IEEE symposium series on computational intelligence (SSCI), Honolulu, pp 1–7. https://doi.org/10.1109/SSCI.2017.8285426
47. Vadrevu P, Perdisci R (2016) MAXS: scaling malware execution with sequential multi-hypothesis testing. In: Proceedings of the 11th ACM on Asia conference on computer and communications security (ACM, 2016), pp 771–782
48. Anderson HS, Kharkar A, Filar B, Evans D, Roth P (2018) Learning to evade static PE machine learning malware models via reinforcement learning. http://arxiv.org/abs/1801.08917
49. Wang Y, Stokes J, Marinescu M (2020) Actor critic deep reinforcement learning for neural malware control. Proc AAAI Conf Artif Intell 34(01):1005–1012
50. Wu C, Shi J, Yang Y, Li W (2018) Enhancing machine learning based malware detection model by reinforcement learning. In: Proceedings of the 8th international conference on communication and network security (ICCNS 2018), pp 74–78. https://doi.org/10.1145/3290480.3290494

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
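
As a rough illustration of the AUC metric that the results discussion relies on for the imbalanced dataset, the following minimal Python sketch computes the area under the ROC curve from the rank-sum (Mann-Whitney) identity. The function name `roc_auc`, the label encoding (1 = malicious, 0 = benign) and the toy scores are assumptions made for illustration only; they are not taken from the PROUD-MAL implementation or dataset.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for malicious, 0 for benign (hypothetical encoding).
    scores: classifier outputs, higher means "more likely malicious".
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one sample of each class")

    # Sort sample indices by score, then give tied scores their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U counts (positive, negative) pairs where the positive is ranked higher.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Toy, made-up scores (not the paper's data): 3 malicious, 2 benign.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs ordered correctly
```

AUC depends only on how often a malicious sample is scored above a benign one, not on the class ratio, which is why it is commonly preferred over plain accuracy when the classes are imbalanced.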
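
The conclusion describes a feature attention block that learns to weight the more useful header features more heavily. A generic version of that idea can be sketched as a softmax gate over the input features. This is only a sketch under stated assumptions: the exact formulation used in PROUD-MAL is defined by the authors and is not reproduced here, and the names `softmax`, `feature_attention`, `weight` and `bias`, as well as the toy values, are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def feature_attention(
    features: List[float],
    weight: List[List[float]],
    bias: List[float],
) -> Tuple[List[float], List[float]]:
    """Generic feature-attention gate (assumed form, not the paper's exact block).

    scores = weight @ features + bias   (one score per input feature)
    alpha  = softmax(scores)            (attention weights, sum to 1)
    output = alpha * features           (element-wise re-weighting)
    """
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    alpha = softmax(scores)
    attended = [a * x for a, x in zip(alpha, features)]
    return alpha, attended


# Toy example with three made-up, already-normalised header features.
toy_features = [0.2, 0.7, 0.1]
toy_weight = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
toy_bias = [0.0, 0.0, 0.0]
alpha, attended = feature_attention(toy_features, toy_weight, toy_bias)
```

In a trained network `weight` and `bias` would be learned jointly with the dense layers, so features that help reduce the validation loss end up with larger attention weights.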