Data-Centric Machine Learning Approach For Early Ransomware Detection and Attribution
Abstract—Researchers have proposed a wide range of ransomware detection and analysis schemes. However, most of these efforts have focused on older families targeting Windows 7/8 systems. Hence there is a critical need to develop efficient solutions to tackle the latest threats, many of which may have relatively fewer samples to analyze. This paper presents a machine learning (ML) framework for early ransomware detection and attribution. The solution pursues a data-centric approach which uses a minimalist ransomware dataset and implements static analysis using portable executable (PE) files. Results for several ML classifiers confirm strong performance in terms of accuracy and zero-day threat detection.

Index Terms—Cybersecurity, malware analysis, ransomware detection and attribution

I. INTRODUCTION

Ransomware operates by encrypting files on a host computer and demanding some form of payment to release the keys. This malware has become the most lucrative revenue source for cybercriminals, and many ransomware "families" have impacted a wide range of users. Moreover, numerous cyber-criminal affiliates are also offering ransomware-as-a-service (RaaS), further reducing the barrier to such extortion [1].

Ransomware follows a multi-stage "kill-chain" comprising reconnaissance, distribution, installation, communication, encryption, and extortion [2], [3]. To date, numerous designs have evolved with increasing levels of secrecy, speed, and complexity. For example, various methods have been used to breach systems (e.g., remote access, drive-by, and privilege escalation) and encrypt data in collaboration with command and control (C&C) servers. Data exfiltration has also been used to extort users (double extortion) [2]. As this threat continues to grow, surveys indicate that almost half of large corporations have experienced such attacks [4]. Windows ransomware is of particular concern as this operating system (OS) is still the most prevalent.

In light of the above, researchers have proposed a range of ransomware analysis solutions. Many of these schemes extract information from network traces or host files/logs to train advanced machine learning (ML) classifiers. However, most efforts have focused on a specific ransomware family or older families targeting dated Windows 7/8 systems. As such, these methods may not be applicable to the latest threats facing Windows 10/11 users. Hence there is a pressing need to detect new ransomware designs and classify them for improved mitigation, i.e., attribution. Preferably, ransomware should be tackled early in the kill-chain to minimize damage [1]. Since new ransomware releases will likely have fewer available samples, solutions must also operate effectively with smaller "minimalist" datasets. This requirement is very much in line with current trends in artificial intelligence (AI) to develop more focused "data-centric" solutions [5].

Accordingly, this paper presents a novel ML solution for ransomware detection and attribution using static analysis. First, a unique malware repository is built by collecting samples of some of the latest ransomware families, i.e., Babuk/Babyk, BlackCat, Chaos, DJVu/STOP, Hive, LockBit, Netwalker, Sodinokibi/REvil, and WannaCry (after 2017). Next, feature extraction is done using Windows portable executable (PE) format file information. Finally, several supervised ML classifiers are trained and tested on these extracted features, including support vector machines (SVM), random forest (RF), extreme gradient boosting (XGBoost), and feed-forward neural networks (FNN) [6]. Overall, this solution has very amenable run-times and can be integrated into network/host-based defenses to target ransomware early in the kill-chain (prevention).

This paper is organized as follows. Section II reviews some key studies on ransomware analysis. Next, Section III details the proposed ML-based framework, including dataset collection and feature extraction. Performance results are then presented in Section IV, followed by future work directions in Section V.

II. LITERATURE REVIEW

A range of ransomware analysis schemes have been proposed, and survey articles have detailed various (overlapping) taxonomies to classify these methods, e.g., static or dynamic analysis, network- or host-based, etc. [1]-[3]. These efforts are further reviewed here.
Static analysis examines executable files to detect artifacts of maliciousness, e.g., via author attribution, code/segment identification (de-anonymization), etc. [1]. Some common methods used here include binary code analysis (BCA), source code analysis via reverse engineering, and C&C server domain prediction [2]. For example, [7] specifies a multi-level framework to detect ransomware from raw binaries, assembly code, and libraries. ML classifiers are then trained with the extracted data, yielding detection rates around 90%. Meanwhile, [8] transforms code sequences into N-grams and extracts frequency-based features for classification. Results show detection rates around 91% for several ML classifiers (decision tree, RF, etc.). However, code-based analysis is very labor-intensive [9] and represents a more latent "post-infection" forensics approach.

Recent efforts have also used other static features to analyze ransomware. For example, [10] leverages image processing techniques to convert ransomware binary files into grayscale images and then performs texture analysis for feature extraction. Results for several ML classifiers show high accuracy (97%) for a small dataset with a mix of old and new ransomware (379 samples). However, this scheme imposes added computational burdens and does not consider benign applications. Meanwhile, [11] details another static analysis scheme which extracts entropy and image-based features to train a specialized Siamese NN classifier. Tests with a small dataset (about 1,000 samples and 10 families) show accuracy values in the mid-90% range but notably lower precision and recall rates (upper 70% range). Also, most of the ransomware families used here are older (mid-2010s) and benign applications are not considered.

Studies have also used static PE header file data for broader malware detection (not just ransomware). However, these efforts focus on detection and not attribution. For example, [12] collects many samples (over 100,000) from a repository called VX Heaven (now inactive) and trains ML classifiers using 7-10 extracted PE file features. Results show detection rates in the upper 90% range. Also, [13] extracts PE features from about 5,500 malware samples and 1,200 benign applications (early 2010s). Detection is done using a set of heuristics, achieving 95% accuracy. Finally, [14] extracts 9 PE file features (on sections, data directories, and entropy) from a dataset with 1,200 malicious and benign samples each. Results for several classifiers show 95% detection rates. However, these studies present no details on their malware datasets, most of which are over a decade old.

By contrast, dynamic analysis scans run-time actions and event sequences for ransomware activity. Specifically, dynamic network-based schemes examine packet traces for C&C communications, domain name service (DNS) queries, network storage access, etc. For example, [15] presents a detection system for Locky ransomware which uses traffic features to train classifiers and yields over 95% detection rates. Meanwhile, [16] analyzes server message block (SMB) protocol patterns to detect older ransomware (2015-2017). The NetConverse scheme [17] also uses ML methods to analyze host traffic for earlier threats and achieves high detection rates (over 95%). Finally, [18] uses deep learning to analyze network activity and classify abnormal operation in Windows 7. Results show high detection rates for several families (over 97%).

Meanwhile, dynamic host-based schemes monitor local system activity to detect ransomware, e.g., memory and file operations, application programmer interface (API) function calls, dynamic link library (DLL) calls, etc. For example, [19] uses a sandbox to track file encryption/deletion, persistent messages, etc. Results show 96% detection rates for older ransomware types (mid-2010s). Also, [20] presents a scheme to monitor and store encryption keys for ransomware detection and file recovery. Results show successful mitigation of 12 out of 20 families. Similarly, [21] scans input/output requests for ransomware activity and flags affected files. Studies have also considered ransomware "paranoia" behaviors, where malware tries to detect analysis environments and avoid fingerprinting/detection, e.g., [22] tracks API calls to expose such evasion.

Although the above works present some notable contributions, key concerns still remain. Foremost, studies have largely focused on older ransomware targeting Windows 7/8 systems (mid-2010s). Given the expanding nature of this threat, it is imperative to study newer families targeting Windows 10/11. However, there are few datasets here, and new malwares may have smaller sample sizes to analyze (a challenge for ML schemes). Hence effective "data-centric" [5] schemes are required for minimalist datasets. Finally, ransomware detection and attribution schemes must have amenable run-times and preferably target ransomware earlier, in the distribution/delivery stages, to minimize damage [2]. It is here that static analysis offers an expedient approach for tackling malicious payloads prior to infection. By contrast, dynamic analysis requires more in-depth examination of network or host activities over longer intervals in virtual environments. As a result, a static analysis solution is presented using PE format file analysis.

III. DATA-CENTRIC STATIC ANALYSIS USING ML

Fig. 1. Overview of static analysis ML framework for ransomware detection and attribution

The static analysis framework for ransomware detection and attribution (classification) is shown in Fig. 1 and comprises several stages. The first stage (Empirical Data Collection) builds an up-to-date repository of some of the latest Windows 10/11 ransomware threats (since 2017). Regular benign Windows-based applications are also added here to improve classifier performance. The second stage (Feature Selection/Extraction) processes raw executables to extract key features.
An efficient static analysis approach is proposed here using Windows PE format files. Finally, the last stage (ML Training/Testing) uses the feature datasets to train ML classifiers to detect and attribute ransomware. On a high level, this setup follows a well-defined ML approach, similar to that used in other studies. However, the novel contributions here include the collection of new ransomware datasets and the extraction of lightweight static feature sets. Further details are now presented.

TABLE I
EMPIRICAL DATASET

Family          Samples   Avg. Size   Avg. PE File Size
Babuk (Babyk)   140       0.19 MB     32.68 KB
BlackCat        120       3.91 MB     1,147 KB
Chaos           140       0.49 MB     35.2 KB
DJVu (STOP)     140       0.71 MB     66.2 KB
Hive            140       3.51 MB     403.9 KB
LockBit         140       1.30 MB     171.5 KB
Netwalker       140       0.26 MB     35.72 KB
Sodinokibi      140       0.30 MB     50.89 KB
WannaCry        140       7.62 MB     21.83 KB
Benign          2,000     26.86 MB    155.88 KB

A. Empirical Data Collection

As per Section II, existing studies on PE file analysis provide little/no details on their datasets, e.g., type of malwares, executable file sizes, collection time frames, percentage of ransomware, etc. Many of these malwares are old and related repositories are inactive [12]. Hence a new repository is curated for the latest ransomware families. Given the rapidly changing nature of the ransomware threat, it may be difficult to get sufficient samples of each family. Hence realistic "data-centric" ML frameworks must achieve good detection and attribution with minimalist datasets (perhaps only hundreds of samples). However, limited dataset size/diversity can also have a negative impact on classifier performance.

Many active repositories host malware executables, e.g., MalwareBazar, Triage, VirusShare, and VirusTotal. These sites provide varying degrees of access and usability, e.g., VirusTotal and VirusShare require registration to access uploads. Detailed cross-checking and analysis also shows notable duplication across portals, e.g., many Sodinokibi samples on MalwareBazar match those on Triage. There are also discrepancies between the number of samples for each family, e.g., DJVu is abundant whereas Babuk/Babyk and BlackCat are more scarce. Finally, some repositories (VirusShare and VirusTotal) do not organize or label their data, further complicating collection. Hence unlabeled data dumps have to be tediously analyzed using hashing and cross-checked with labelled samples. Overall, there is potential for a lack of diversity, even scarcity, of new ransomware samples.
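As an illustration of this curation step, the sketch below deduplicates a downloaded dump by SHA-256 digest and cross-checks it against labelled hashes. It is only a sketch of the general workflow implied above; the folder layout and the CSV of known hashes (a hypothetical sha256,family file) are illustrative assumptions, not artifacts from the paper.

```python
import csv
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def curate(dump_dir: str, labelled_csv: str):
    """Deduplicate an unlabelled dump and split it into labelled/unknown sets.

    labelled_csv is assumed to hold two columns: sha256, family.
    """
    with open(labelled_csv, newline="") as f:
        known = {r["sha256"].lower(): r["family"] for r in csv.DictReader(f)}

    seen, labelled, unknown = set(), [], []
    for exe in sorted(p for p in Path(dump_dir).iterdir() if p.is_file()):
        digest = sha256_of(exe)
        if digest in seen:        # duplicate across portals -> skip
            continue
        seen.add(digest)
        if digest in known:       # matches a labelled sample
            labelled.append((exe.name, known[digest]))
        else:                     # needs manual analysis or portal lookup
            unknown.append(exe.name)
    return labelled, unknown
```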
In light of the above, a smaller "minimalist" data repository is curated with 9 active ransomware families, i.e., Babuk/Babyk, BlackCat, Chaos, DJVu/STOP, Hive, LockBit, Netwalker, Sodinokibi/REvil, and WannaCry (Table I). These families are amongst the most prevalent ransomware threats in 2022, as per the IBM X-Force Threat Intelligence Index, i.e., LockBit (17%) followed by WannaCry (11%) and BlackCat (9%). A total of 140 unique executables are collected for each family, except for BlackCat which only yielded 120 samples due to scarcity, i.e., a total of 1,240 malicious samples. Many Windows 10/11 applications are also added to construct a benign class (2,000 samples). These programs are collected from a range of websites and include system utility, entertainment, and productivity tools (Fig. 1).
Overall, having a large set of non-malicious training data is very beneficial since regular application downloads will exceed (unintended) ransomware downloads. This addition contrasts with work in [10], [11].

B. Feature Selection/Extraction

ML classifier performance is heavily dependent upon input training data. Hence feature extraction (engineering) plays a vital role in transforming raw executables to generate meaningful information for classifiers [6]. As per Section II, static analysis is more expedient for tracking ransomware early in its kill-chain. Hence this strategy is applied to Windows PE format files, which contain data structures to support program execution in 32-bit and 64-bit Windows OS environments. Namely, these files use the common object file format (COFF) and contain information for the OS loader to set up and run the wrapped executable code (including memory mapping and permissions). For example, a PE format file has several initial lead-in headers along with multiple sections. Here each section specifies file content (i.e., code or data) and also contains its own section header.

As per Section II, studies on PE format files have considered a range of malwares for Windows 7/8 [12]-[14] (mostly unspecified and not necessarily ransomware). Hence there is a further need to extend such analysis to Windows 10/11 ransomware threats. Now PE files contain a wealth of information, and programs can have unique non-overlapping parameters (depending upon functionality). Hence when extracting PE format data, it is important to select a subset of parameters which exist across all sample files and also exhibit good variability.

In light of the above, PE files are generated for all executables, with the resultant sizes shown in Table I. A total of 4 datasets are built by extracting feature vectors with 5, 7, 10, and 15 parameters, labeled as Datasets 1-4, respectively (Fig. 1). Each successive vector expands upon its predecessor by adding new parameters. The exact parameters are chosen using careful experimentation with the Image File Header, Image Optional Header, and Image Section Header sections. Some key features include NumberOfSections, SizeOfCode, SizeOfHeaders, etc. Note that PE files also contain information on dynamic-link library (DLL) calls, which are indicative of functionality. For example, ransomware typically calls encryption, socket communication, and registry-modification functions. Hence the total number of DLL calls is also added to the 10- and 15-parameter feature vectors (TotalDLLCalls, Fig. 1). Note that this is a computed feature and not an extracted parameter.
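A minimal sketch of this kind of extraction is given below using the open-source pefile library in Python. The specific fields shown, and the import-based TotalDLLCalls count, are illustrative choices under the above description rather than the paper's exact 5/7/10/15-parameter vectors.

```python
import pefile  # pip install pefile

def extract_features(path: str) -> dict:
    """Parse a Windows executable and pull a few PE header fields.

    The selected fields are examples only; the paper's Datasets 1-4 draw
    their parameters from the Image File/Optional/Section headers.
    """
    pe = pefile.PE(path)
    features = {
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "SizeOfHeaders": pe.OPTIONAL_HEADER.SizeOfHeaders,
        "SizeOfImage": pe.OPTIONAL_HEADER.SizeOfImage,
        "DllCharacteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
    }
    # Computed feature: total number of imported DLL functions.
    total_calls = 0
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        total_calls += len(entry.imports)
    features["TotalDLLCalls"] = total_calls
    pe.close()
    return features
```

In practice, each sample's feature dictionary would be appended to a Pandas DataFrame and labelled by family to form the feature datasets.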
IV. PERFORMANCE EVALUATION

The static analysis framework is now evaluated using the data repository from Section III-A. Namely, feature vectors extracted from the PE files are labelled to generate input datasets. These datasets are then used to train/test supervised ML classifiers, i.e., SVM, RF, XGBoost, and FNN [6] (Fig. 1). All evaluation is done using the Keras and TensorFlow toolkits, as well as Pandas and Sklearn. As per Section III-A, a total of 9 malicious ransomware families are evaluated along with a set of benign applications, i.e., 10 classes. As noted earlier, there are a total of 1,240 malicious samples (140 samples for each family except BlackCat, which has 120 samples). The samples for each class are further partitioned to generate separate training and testing pools. Namely, 20 random samples of each class are selected for testing and the remainder are used for training, i.e., 120 training samples for all classes except BlackCat, which only has 100. Furthermore, 1,700 benign samples are selected for training and the remaining 300 samples are used for testing. This partitioning reflects an approximate 85/15 training/testing split. All results are averaged over 100 trial runs, with each using a different randomized 85/15 partitioning of the datasets. Detailed findings are now presented.
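The per-class partitioning and trial averaging could be sketched as follows (Python, with scikit-learn shown for the RF case; the SVM, XGBoost, and Keras FNN models would be swapped into the same loop). The feature matrix X, label vector y, and hyperparameters are placeholder assumptions, not values from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def split_per_class(y, test_counts, rng):
    """Hold out a fixed number of test samples per class; the rest train."""
    train_idx, test_idx = [], []
    for label, n_test in test_counts.items():
        idx = rng.permutation(np.where(y == label)[0])
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

def run_trials(X, y, n_trials=100):
    """Average multi-class accuracy over repeated randomized partitions.

    X: (n_samples, n_features) PE feature matrix (NumPy array).
    y: labels 0-8 for the ransomware families, 9 for the benign class.
    """
    test_counts = {c: 20 for c in range(9)}   # 20 test samples per family
    test_counts[9] = 300                      # 300 benign test samples
    scores = []
    for trial in range(n_trials):
        rng = np.random.default_rng(trial)    # fresh random split each trial
        tr, te = split_per_class(y, test_counts, rng)
        clf = RandomForestClassifier(n_estimators=100, random_state=trial)
        clf.fit(X[tr], y[tr])
        scores.append(accuracy_score(y[te], clf.predict(X[te])))
    return float(np.mean(scores))
```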
Fig. 2. Average multi-class accuracy (100 trials)

The average accuracy values (over all runs) are plotted for different feature vector sizes in Fig. 2, i.e., multi-class attribution. Results show improved performance for all schemes with increasing feature vector sizes. In particular, the SVM and FNN classifiers give the best improvement, with accuracy gains of 15-20%. Conversely, the RF and XGBoost classifiers have much lower gains as feature vector sizes increase from 5 to 15 parameters, i.e., in the 0.5-1.5% range. These two classifiers also give the best accuracy (94-96% range). However, the FNN scheme approaches these methods with 15 features, i.e., 91% accuracy. These findings are very encouraging given the relatively small-sized training datasets and feature vectors used. The results also match those for other schemes using much heavier feature extraction and ML algorithms, e.g., image and entropy-based features, deep NN algorithms, etc. [10], [11].
Next, consider attribution errors in more detail. Indeed, mis-classifying ransomware as benign is much more harmful than mis-classifying it as the wrong type of ransomware, since such errors can allow malware to bypass network or host defenses and infect host machines. Hence to quantify this behavior, a modified ransomware detection rate (RDR) is defined as:

    RDR = T_rs / (T_rs + F_rs)    (1)

where T_rs is the total number of ransomware samples classified as (any class of) ransomware, and F_rs is the total number of ransomware samples mis-classified as benign, i.e., the total number of ransomware test samples is (T_rs + F_rs). This metric essentially captures the binary detection capability of a multi-class classifier and is similar to the recall formula, i.e., it tracks false negatives. A benign detection rate (BDR) is also defined as:

    BDR = T_bn / (T_bn + F_bn)    (2)

where T_bn is the total number of benign samples classified as benign, and F_bn is the total number of benign samples mis-classified as ransomware. In general, though, mis-classification of benign executables (false positives) is less of a security concern.
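Both rates follow directly from a multi-class confusion matrix, as in the sketch below (Python/NumPy). The assumption that the benign class is the last label, and the function and variable names, are illustrative rather than taken from the paper.

```python
import numpy as np

def rdr_bdr(conf, benign_idx=-1):
    """Compute RDR (Eq. 1) and BDR (Eq. 2) from a multi-class confusion matrix.

    conf[i, j] counts test samples of true class i predicted as class j;
    every class except benign_idx is a ransomware family.
    """
    conf = np.asarray(conf)
    benign_idx = benign_idx % conf.shape[0]
    ransom = [i for i in range(conf.shape[0]) if i != benign_idx]

    # RDR: ransomware samples flagged as *any* ransomware class.
    t_rs = conf[np.ix_(ransom, ransom)].sum()
    f_rs = conf[ransom, benign_idx].sum()
    rdr = t_rs / (t_rs + f_rs)

    # BDR: benign samples kept out of the ransomware classes.
    t_bn = conf[benign_idx, benign_idx]
    f_bn = conf[benign_idx, ransom].sum()
    bdr = t_bn / (t_bn + f_bn)
    return rdr, bdr
```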
The RDR results are plotted in Fig. 3. As with the accuracy results, larger feature vector sizes give smaller improvements with the RF and XGBoost classifiers, i.e., in the 2% range. By contrast, the SVM and FNN schemes give very poor results for small feature vectors, with ransomware mis-classification rates (1-RDR) around 50%. These classifiers are also very sensitive to feature vector size. Nevertheless, the FNN scheme still approaches the performance of the RF and XGBoost schemes with larger feature vectors, i.e., 92% RDR. The BDR results are also plotted in Fig. 4. As expected, these values are higher than the RDR values since a larger amount of benign data is used for training. Again, the RF and XGBoost schemes give the lowest benign program mis-classification rates, with BDR values close to 99%. Although the other methods (SVM, FNN) give slightly lower BDR rates, they are still over 92% (less than 1 error in 12). Note that these binary detection rates closely match those from other malware detection studies which make use of much larger datasets and more elaborate feature extraction schemes (Section II).

Meanwhile, Fig. 5 shows an average confusion matrix for the XGBoost classifier (classes 0-8 represent the 9 ransomware families and class 9 represents the benign class). Here, the numbers in row 9 are larger as there are more benign test samples. These results confirm that most samples are classified correctly, i.e., the diagonal entries dominate. Moreover, even when ransomware samples are mis-classified, they are mostly flagged as another ransomware family (mirroring the RDR results in Fig. 3).