Malware Classification using Graph Neural Networks

A Project
Presented to
The Faculty of San José State University

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

by
Manasa Mananjaya

May 2023
© 2023
Manasa Mananjaya
ABSTRACT

by Manasa Mananjaya
Word embedding techniques are widely used in Natural Language Processing for capturing semantic relationships between words. In this study, we conduct experiments with Graph Neural Networks (GNNs) applied to knowledge graphs constructed from opcode sequences of malware files. In the first set of experiments, we compare the classification performance of knowledge graphs built with different word embedding techniques such as Bag-of-Words, TF-IDF, and Word2Vec. Our results indicate that Word2Vec produces the most effective word embeddings, serving as a baseline for comparison with three GNN models: Graph Convolutional Network (GCN), Graph Attention Network (GAT), and GraphSAGE. We then generate Word2Vec embeddings of various lengths and construct knowledge graphs with these embeddings as node features. The results show that larger vector embeddings improve the models' performance in classifying the malware files into their respective families. Our experiments demonstrate that GCN achieves the highest classification accuracy, followed closely by GAT and GraphSAGE.
ACKNOWLEDGMENTS

I am grateful to my advisor for his guidance, encouragement and mentorship during my research. His vast knowledge and valuable suggestions have been essential in directing my research toward the right path and ensuring its successful completion. I would also like to thank my committee members, including Dr. Thomas Austin, for their valuable time and constructive comments.
TABLE OF CONTENTS
CHAPTER
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Bag-of-Words . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 GCN-Bag-of-Words . . . . . . . . . . . . . . . . . . . . . . 17
3.3.2 GCN-TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.3 GCN-Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.5 Results for Word Embedding Experiments . . . . . . . . . 20
3.4.1 Word2Vec-GCN . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.2 Word2Vec-GAT . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.3 Word2Vec-GraphSAGE . . . . . . . . . . . . . . . . . . . 23
3.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
APPENDIX
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
A.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
LIST OF TABLES
1 Malware Families . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
LIST OF FIGURES
1 Opcode Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . 15
CHAPTER 1
Introduction
The threat of malware to the security of computer systems and networks is growing
rapidly. Malicious software can be used to steal sensitive data, gain unauthorized
access, disrupt services, and cause damage to systems. Due to their increasing volume and sophistication, detecting and classifying malware has become a challenging task. Traditional signature-based methods are often insufficient to detect new and unknown malware, while behavioral-based methods may produce high false positives. Machine learning techniques have emerged as a promising alternative,
offering the potential for accurate and efficient detection of both known and unknown
malware.
One such machine learning technique called Graph Neural Network (GNN) has
recently gained attention in the field of malware analysis as a powerful tool for
capturing the structural relationships between features of malware samples [1]. GNNs
can process data represented as graphs, where individual elements are depicted as
nodes and edges depict relationships between the elements [2]. They have proven effective in fields such as Natural Language Processing, Computer Vision, and Social Network Analysis, where the data is represented as a
graph.
Word embedding techniques represent words in a high-dimensional space as vectors. These vectors capture the meaning and structural relationships between words and can be used for various NLP tasks. By applying word embeddings to opcode sequences, we can capture the semantic relationships between different parts of the code and use them for classifying
malware [3].
In this research, we explore the use of three GNN models for malware classification
using word embedding techniques. We focus on the application of GNNs to knowledge
graphs constructed from opcode sequences of malware files. Opcode sequences are a representation of the low-level machine instructions executed by the program. Knowledge graphs are graphs that represent structured knowledge in a
form that can be processed by machines. In our case, we construct knowledge graphs
where nodes represent opcodes, and edges represent the co-occurrence of opcodes in
malware samples.
We compare the effectiveness of three word embedding techniques, Word2Vec [4], TF-IDF [5], and Bag-of-Words [6], in classifying malware using GNNs.
First, we investigate the performance of GCN applied to knowledge graphs built using
various word embedding techniques. The best word embedding technique is then
deduced and used to build knowledge graphs. Finally, the performances of GCN,
GAT, and GraphSAGE in classifying the word-embedded knowledge graphs into their respective malware families are compared.
The remaining sections of this paper are organized in the following manner. In Chapter 2, we present background on related work and the techniques used for malware analysis, including word embedding techniques and GNNs. Chapter 3 describes the data set used in our experiments, the methodology for constructing knowledge graphs, and the experiments conducted with the GNN models. We conclude our report and present potential directions for future work in Chapter 4.
CHAPTER 2
Background
Malware refers to malicious software designed to steal sensitive data, damage computer systems, or gain unauthorized access to the network. Malware samples are difficult to detect and analyze due to their increasing sophistication and the harm they cause to computer systems and networks [7]. According to the Cybersecurity Ventures Report 2021, cybercrime will cause damages worth $10.5 trillion annually by 2025 [8]. Malware is a major contributor to these damages. The sheer volume and diversity of malware pose a significant challenge to cybersecurity experts. For instance, in 2019, the number of malware families increased by 13.7%. In addition, the detection speed of new malware variants is crucial, as delays can lead to serious security breaches. Therefore, the need for effective and efficient malware detection and classification techniques is greater than ever to combat the growing number of malware attacks.
Signature-based detection methods are based on the idea that malware has distinguishing characteristics that can be utilized to identify it [10]. They compare the code or behavior of a file to a database of known malware signatures to determine whether it is malicious.
On the other hand, malware detection methods based on behavior focus on the
actions performed by the malware rather than just its signature [11]. These methods
monitor the system's activities and detect unusual behavior that may indicate malware infection. Such methods can identify new and unknown malware that do not have signatures in the database. However, behavioral-based methods can produce high false positives since legitimate software can exhibit behavior that resembles malware.

Since these traditional approaches alone are insufficient to detect new and unknown malware, researchers have turned to machine learning techniques to detect malware. Among these techniques, graph
neural networks (GNNs) and word embedding have gained considerable attention for
their effectiveness in identifying and classifying malware. In [12], the authors conducted a survey of various graph-based methods for detecting malware and discussed the advantages and
limitations of these methods. The survey concluded that GNNs show potential as
a viable method for malware detection due to their ability to capture the complex structural relationships within malware samples.

A dynamic evolving graph convolutional network (DEGCN) is proposed in [13] to detect malware. The malware files are represented as graphs where the
API calls are represented as nodes and the order in which they appear is captured
through the edges between the nodes. The DEGCN model dynamically adjusts the
node weights based on the significance of the API call and updates the edge weights
according to their temporal sequence. The proposed model achieved a 98.3% detection rate.
The proposed malware detection method in [14] implements GCNs for identi-
fying malware. The authors used a graph representation of malware, where nodes
represent API calls and edges represent their dependencies. This method achieved strong detection performance.

The authors of [15] implemented GATs for detecting malware in intelligent transportation systems. The authors used a
graph-based representation of network traffic, where nodes represented the source and destination of network flows and edges represented the connections between them.

The authors of [16] proposed a multi-view attention-based deep learning framework for detecting malware in smart healthcare systems. They used multiple views of
malware, including API calls, system calls, and static features, and applied attention
mechanisms to capture the most relevant features. The attention mechanism focuses
on the most significant segments of the input data, thereby reducing the feature space's dimensionality.

In another study [17], over 15,000 malware samples are represented as graphs and evaluated using GCN. The method is compared against traditional machine learning methods such as SVM, and other graph-based methods such as GAT and
ChebNet. This method shows promise in improving the detection of malware through
the use of GCNs. Comparably, the authors of [18] proposed a GNN model that groups similar malware samples into the same category, regardless of their behavior. One of the strengths of the
proposed method is that it does not need any feature engineering or prior knowledge
about malware. The approach extracts the structural information of malware samples
automatically and captures the underlying similarities between them. This makes the approach applicable even to previously unseen malware variants.

The authors of [19] proposed a hybrid attention network that uses multi-feature alignment and fusion to detect malware. The proposed
model combines the strength of GCNs and attention mechanisms to capture both
local and global features of malware effectively. To evaluate the proposed model, the authors conducted experiments on a dataset of malware samples and achieved an accuracy of 99.2% and an AUC of 0.998. The multi-
feature alignment and fusion technique improves the alignment of multiple features
and enhances the model’s performance by providing high accuracy and robustness.
Various word embedding techniques are explored in [20]. The authors propose a
novel method for representing malware samples as sequences of opcodes, which are
then converted into Word2Vec embeddings or HMM states. The experiments assessed how effectively eight distinct machine learning techniques classified malware. The
results showed that the machine learning methods based on Word2Vec embeddings
achieved the highest accuracy of 96.2%, while the HMM-RF model achieved a somewhat lower accuracy. The work in [21] compares three word embedding techniques, Word2Vec, PCA2Vec, and HMM2Vec, for classifying malware. The results
show that Word2Vec-based techniques perform the best and are more computationally
efficient than HMM2Vec and PCA2Vec. The paper provides a useful comparison of
different machine learning techniques with word embedding for malware classification,
which can help researchers and practitioners in the field make informed decisions when selecting suitable techniques.
During our review of previous research, we understood that utilizing word embedding techniques is an effective way to represent opcode sequences and that, by integrating this with GNNs, we can create a robust model capable of detecting malware.
2.2 Word Embedding Techniques

Word embedding techniques represent words as multi-dimensional vectors of numerical values. These techniques map words with common contexts to nearby points in the vector space, which enables machine learning algorithms to process text data in a more efficient and effective manner, improving the performance of tasks such as text classification. In this section, we explore three word embedding techniques that we use to classify malware using GNNs. These techniques are utilized to generate feature vectors that are incorporated as node features in the knowledge graphs.
2.2.1 Bag-of-Words
Bag-of-words (BoW) [6] is a simple and widely used word embedding technique
in NLP. This method counts the occurrences of each word in a text document to produce a feature vector. It considers each word independently of the other words in the text and disregards their order, syntax, or structure. The first step is to build a vocabulary of all unique words in the corpus to create a BoW model. We build a matrix of word frequency counts for each document in the corpus using this vocabulary, where each document is represented by a row and each word by a column. To account for the different document lengths and phrase frequencies, we finally normalize the matrix.
Let W stand for the vocabulary set, D for the collection of documents, and n(d, w) for the number of occurrences of word w in document d. The BoW representation of a document d is a vector x(d) of size |W|, where each element of the vector is given by:

$$x^{(d)}[w] = n(d, w)$$

The rows of the normalized count matrix give one-dimensional feature vectors. These vectors will serve as node features during the
classification of graphs. More information on BoW can be found in [22] and [23].
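As a minimal illustration of this counting scheme (a simplified sketch with a toy opcode corpus, not the exact code used in this project), the count matrix n(d, w) can be built and normalized by document length as follows:

```python
from collections import Counter

# Toy corpus: each "document" is a sequence of opcodes (hypothetical example).
corpus = [
    ["mov", "push", "mov", "call", "ret"],
    ["push", "pop", "mov", "jmp"],
]

# Vocabulary W: all unique opcodes across the corpus, in a fixed order.
vocab = sorted({op for doc in corpus for op in doc})

# Count matrix: x(d)[w] = n(d, w), normalized by document length.
bow_matrix = []
for doc in corpus:
    counts = Counter(doc)
    row = [counts[w] / len(doc) for w in vocab]
    bow_matrix.append(row)

print(vocab)        # ['call', 'jmp', 'mov', 'pop', 'push', 'ret']
print(bow_matrix)   # one normalized frequency vector per document
```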
2.2.2 TF-IDF
The basic principle of the TF-IDF technique is that a word's relevance in a document correlates with its frequency (TF) in that document and inversely with its frequency across all documents (IDF). In other words, a word is likely to be more significant for a particular document if it appears frequently in that document but rarely in the rest of the corpus.
The TF value of a word w in a document d is given by:

$$\mathrm{TF}(w, d) = \frac{\text{number of occurrences of } w \text{ in } d}{\text{total number of words in } d}$$
The logarithm of the ratio of the total number of documents N to the number of documents containing the word w gives the IDF value:

$$\mathrm{IDF}(w) = \log\left(\frac{N}{\text{number of documents that include } w}\right)$$
The logarithmic function is used to reduce the impact of rare words on the IDF
score. The IDF score of words that frequently appear in documents will be lower, and
as a result, will have a lower impact on the TF-IDF score. The TF-IDF scores for
each word in a document may then be computed and used as features in machine learning models.
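To make the two formulas concrete, the following illustrative sketch (a toy example under the same definitions, not this project's pipeline) computes the TF-IDF score of every opcode in every document:

```python
import math
from collections import Counter

corpus = [
    ["mov", "push", "mov", "call"],
    ["push", "pop", "jmp"],
    ["mov", "jmp", "jmp", "ret"],
]
N = len(corpus)
vocab = sorted({op for doc in corpus for op in doc})

# Document frequency: number of documents that include each word.
df = {w: sum(1 for doc in corpus if w in doc) for w in vocab}
idf = {w: math.log(N / df[w]) for w in vocab}

tfidf = []
for doc in corpus:
    counts = Counter(doc)
    tf = {w: counts[w] / len(doc) for w in vocab}
    tfidf.append({w: tf[w] * idf[w] for w in vocab})

print(tfidf[0])  # TF-IDF scores for the first document
```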
2.2.3 Word2Vec
Word2Vec is a widely used technique for learning word embeddings [25]. Word embeddings are word representations that are distributed across a high-
dimensional vector space, with each dimension representing a feature of the word.
These embeddings can be used in a variety of NLP tasks such as text categorization and sentiment analysis.
Word2Vec creates word embeddings using a neural network that has been trained
on a large text corpus. This neural network learns to either predict a word given
its context or to anticipate a word given its nearby words. The word embeddings
are subsequently created using the weights from the neural network’s hidden layer.
Capturing semantic similarity between words in this way is Word2Vec's key advantage. For example, words with similar meanings, such as "car" and "automobile", are mapped to nearby points in the embedding space.
Two architectures used to train Word2Vec are the Continuous Bag of Words
(CBOW) and the Skip-Gram model [27]. The CBOW technique utilizes a group of
words surrounding the target word to make a prediction, whereas the Skip-Gram
approach takes the target word as input and tries to anticipate the surrounding
context words [28]. We experiment with the Skip-Gram model in this research.
2.3 Graph Neural Networks
Graph Neural Networks (GNNs) are a class of neural networks designed to operate on structured data represented as graphs. Unlike traditional neural networks that take
fixed-length data as input, GNNs take graphs as input where individual elements are
represented as nodes and edges represent the relationship between the elements [29].
A typical GNN layer updates the representation of each node by aggregating the representations of its neighbors:

$$h_v^{(k)} = \sigma\Big(W^{(k)} \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} + b^{(k)}\Big)$$

where h_v^(k) is the representation of node v at the k-th iteration, N(v) is the set of neighboring nodes of v, W^(k) and b^(k) are the learnable weight matrix and bias vector at the k-th iteration, and σ is a non-linear activation function such as ReLU or sigmoid [30].
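The following toy sketch, included purely for illustration and not taken from this project's code, implements one such update with a neighbor sum and a ReLU activation:

```python
import numpy as np

def gnn_layer(h, adj, W, b):
    """One message-passing step: sum neighbor features, transform, apply ReLU.

    h:   (num_nodes, in_dim)  node representations from the previous iteration
    adj: (num_nodes, num_nodes) adjacency matrix (1 if two nodes are connected)
    W:   (in_dim, out_dim) learnable weights, b: (out_dim,) learnable bias
    """
    aggregated = adj @ h                       # sum of neighbor representations per node
    return np.maximum(0, aggregated @ W + b)   # sigma = ReLU

# Tiny example: 3 nodes in a path graph, 4-dimensional features.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
h1 = gnn_layer(h0, adj, rng.normal(size=(4, 8)), np.zeros(8))
print(h1.shape)  # (3, 8)
```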
In our case of malware classification, the nodes in the input graphs represent
the opcodes whereas the edges connect the opcodes that appear together frequently.
The GNN performs message passing between the nodes to capture information about
the relationships between them. This involves computing node embeddings based
on the embeddings of its neighbors, and then using these embeddings to update the
central node's representation. This process can be repeated multiple times to capture information from increasingly distant parts of the graph.

Graph Convolutional Networks (GCNs) extend the idea of convolution to graph-structured data. They are built from graph convolutional layers, which allow for shared weights and translation invariance, as well as pooling layers, which allow for hierarchical learning. GCNs learn a set of filters that can operate on the graph structure to extract features from the data. The filters are typically defined as functions that operate on the node's local neighborhood and aggregate features from the neighboring nodes.
During training, the weights of the filters are learned through backpropagation,
which enables the GCN to learn to extract meaningful features from the graph
structure. By applying these filters repeatedly, the GCN is able to learn hierarchical
representations of the graph. In the classification phase, the feature vectors of each
node are taken as input by the GCN and a label for each graph is produced. Labeling
the graph is accomplished by applying a pooling operation to the output of the last
layer of the GCN. The pooling layer aggregates the feature vectors of all nodes into
a single vector. This vector is then fed into a fully connected layer, which produces
a final output vector that represents the predicted class probabilities for the input
graph. Chapter 3 provides the architecture of GCN and presents several experiments conducted with it.

Graph Attention Networks (GATs) use attention mechanisms to learn the importance of each node's neighbors for a given task. This is achieved
through a series of weighted linear combinations of the neighbors’ hidden states, with
the weights learned through a self-attention mechanism. In other words, GATs use
the graph structure to determine which nodes are most relevant for a given task and weight their contributions accordingly.
GAT optimizes the loss function with respect to model parameters during the
training phase. This typically involves computing the model’s predictions for a set
of labeled examples and comparing them to the true labels using a loss function.
Computing the predictions for these examples involves computing the hidden states of all nodes in the graph and obtaining a
probability distribution over the possible labels. The label with the highest probability
is then assigned to the graph. More information on the application of GATs in text classification can be found in [34].
GraphSAGE (Graph Sample and Aggregate) networks [35] are a class of GNNs
that learn representations for nodes in a graph by aggregating information from their sampled local neighborhoods. They improve upon earlier graph-based learning methods by leveraging graph convolutions, which can learn from previously unseen nodes and generalize to new graphs.
The GraphSAGE algorithm converts every node in the input graph to a low-
dimension vector. A multi-layer neural network operates on each node and its neighbors
in the graph to achieve this. At each layer, the model aggregates information from
the local neighborhood of each node by sampling a fixed number of neighbors and
performing a mean or max pooling operation. The resulting representations are then
passed through a non-linear activation function and fed into the next layer. This
process is carried out repeatedly for a specified number of layers until the final node
embeddings are obtained. This ability of the GraphSAGE network makes it highly scalable and well suited to large graphs.
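To illustrate the sampling-and-aggregation idea concretely, the following toy sketch (not tied to any particular library, with hypothetical dimensions and a simple mean aggregator) performs one GraphSAGE-style step:

```python
import numpy as np

rng = np.random.default_rng(0)

def graphsage_layer(h, neighbors, W_self, W_neigh, num_samples=2):
    """One GraphSAGE step: sample a fixed number of neighbors per node,
    mean-pool their features, combine with the node's own features, apply ReLU."""
    out = []
    for v in range(len(h)):
        sampled = rng.choice(neighbors[v], size=num_samples, replace=True)
        agg = h[sampled].mean(axis=0)                        # mean aggregation
        out.append(np.maximum(0, h[v] @ W_self + agg @ W_neigh))
    return np.stack(out)

# Tiny example: 4 nodes with 3-dimensional features.
h = rng.normal(size=(4, 3))
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
h_next = graphsage_layer(h, neighbors, rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
print(h_next.shape)  # (4, 8)
```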
In our study, GraphSAGE learns embeddings for each opcode by taking into account its neighboring opcodes, and a single vector is then computed for the entire graph based on the embeddings of its constituent nodes. This vector is
then passed through a fully-connected neural network to obtain the final graph-level
classification [36]. [37] provides more details on how GraphSAGE networks can be applied to such classification tasks.
CHAPTER 3

Methodology and Experiments

This chapter describes our approach to classifying malware using word embeddings and graph neural networks. We provide a brief overview of feature engineering and highlight the experiments conducted in this study.
3.1 Dataset
The dataset experimented with in this study is taken from the VirusShare
website that hosts malware files belonging to various families. It consists of 13,597
malware families with at least one malware file belonging to each family. However,
due to the significant number of families and the large number of opcodes in each file, experimenting with the entire dataset was computationally infeasible. Therefore, we limited our experiments to only five families listed in Table 1.
To ensure balance in the dataset, 1,000 samples are selected randomly from each of these five families. More information about this dataset can be found in [20] and [21]. We will briefly discuss the characteristics of each malware family below.
BHO - This is a type of Trojan malware that is used by attackers for malicious
activities like tracking user activities or installing other malware in user systems [38].
OnLineGames - This malware family targets users of online games and often spreads through fake games. Once installed, it can steal sensitive information such as login credentials for gaming accounts [39].

Renos - This Trojan downloader performs various malicious activities such as displaying fake alerts or redirecting web
traffic [40].
VBInject - Malware of this family injects malicious code into legitimate processes
running on the operating systems. Attackers use this malware to steal sensitive information while evading detection [41].

Winwebsec - This is a family of rogue security software that displays fake virus alerts and tricks users into paying for unnecessary antivirus licenses [42].
For our research, we classify the executable malware files into their respective
families using a .csv file containing the file names and families as reference. Specifically,
we sort 1,000 original malware files into each of the five families, resulting in a total
of 5,000 files. The malware files are disassembled into .asm files for opcode
extraction. This was achieved on a Linux system using the Objdump command, which
is part of the GNU Binutils package. Opcodes extracted from each binary file are
stored in a text file with the same file name. Including all of the distinct opcodes
present in each file would have resulted in additional overhead during the machine
learning model training process, as there are a considerable number of such opcodes.
Moreover, the majority of opcodes contributed to less than 1% of the total number of
opcodes. The top 50 opcodes and their frequencies are shown in Figure 1.
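A simplified sketch of this extraction step is shown below. It assumes objdump from GNU Binutils is available and uses hypothetical directory and file names, so the actual scripts used for this project may differ:

```python
import subprocess
from collections import Counter
from pathlib import Path

def extract_opcodes(exe_path):
    """Disassemble a binary with objdump and return its opcode sequence."""
    asm = subprocess.run(["objdump", "-d", str(exe_path)],
                         capture_output=True, text=True).stdout
    opcodes = []
    for line in asm.splitlines():
        # Disassembly lines look like: " 401000:\t55  \tpush   %ebp"
        parts = line.split("\t")
        if len(parts) == 3 and parts[2].strip():
            opcodes.append(parts[2].strip().split()[0])
    return opcodes

# Hypothetical layout: one directory of malware executables, one output directory.
Path("opcodes").mkdir(exist_ok=True)
counts = Counter()
for exe in Path("malware_samples").glob("*"):
    ops = extract_opcodes(exe)
    counts.update(ops)
    Path("opcodes", exe.name + ".txt").write_text(" ".join(ops))

top_20 = [op for op, _ in counts.most_common(20)]
print(top_20)
```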
Figure 1: Opcode Frequency
It can be observed that a large portion of the opcodes in the top 50 account for only a small share of the total occurrences. To determine a suitable number of opcodes to retain, we conducted binary classification using the BHO and OnLineGames malware families. 1,000 malware files from each family are considered for classification. Our methodology involved training a CNN model on feature vectors generated with a word embedding technique. The experiments consisted of training the CNN model with the
top 10, 20, 30, 40, and 50 opcodes, with embedded vector lengths of 2, 10, 50, and 100.
We compared the accuracy of the CNN model for varying numbers of opcodes and vector lengths. It can be observed that using more opcodes does not necessarily lead to higher accuracy. Therefore, it is more practical to use fewer opcodes to reduce computation time and memory usage.
It can be observed that the highest average accuracy is achieved when utilizing
the top 20 opcodes across all vector lengths. Consequently, we extracted the top
20 opcodes and processed the opcode files to contain only these opcodes, which are
subsequently utilized for further experimentation. For each file, a pre-processing step
was performed to remove any punctuations and tokenize the text into individual
words.
3.3 Word Embedding Experiments

In this section, we describe different word embedding methods that are implemented to create feature vectors for the sequence of opcodes obtained from the malware
files. We also examine the impact of each embedding technique on the performance
of the malware file classification process. All of the embedding techniques are tested
using the same GCN model described in Section 3.4.1. Additionally, a separate model
was developed that did not implement any of the word embedding techniques, which
served as a base model for comparison with other word embedding implemented mod-
els. The resulting feature vectors are subsequently incorporated into the knowledge graphs as node features.
3.3.1 GCN-Bag-of-Words
The BoW technique described in Section 2.2.1 offers a simple yet effective way to extract features and create a feature vector without considering the order of the opcodes. It also helps determine how the frequency of certain opcodes influences the type of malware family a file belongs to. Each malware file is represented by a feature vector of length 20, where each vector value represents the frequency of a particular opcode in that file. Zeros are appended to the end of the vector if an opcode is missing. This step is necessary because the BoW technique used to generate feature vectors requires a fixed-length vector for each malware file. By appending zeros to the end of the vector, the missing opcodes are effectively assigned a frequency of zero, allowing a fixed vector length to be maintained across all malware files. This feature vector is incorporated in the knowledge graphs as described in Section 3.3.4. Figure 3 depicts how the feature vectors are generated with the BoW technique; a brief code sketch follows the figure.
Figure 3: Bag-of-Words Feature Generation
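The sketch below illustrates the fixed-length padding scheme under the stated assumptions (a top-20 opcode list and zero frequency for absent opcodes); the listed opcodes are hypothetical and this is a simplified illustration rather than the project's exact code:

```python
from collections import Counter

# Hypothetical top-20 opcode list obtained from the frequency analysis above.
TOP_20 = ["mov", "push", "call", "pop", "cmp", "jz", "lea", "test", "jmp", "add",
          "jnz", "retn", "xor", "and", "sub", "inc", "dec", "or", "shl", "nop"]

def bow_vector(opcode_sequence):
    """Fixed-length frequency vector; opcodes absent from the file get zero."""
    counts = Counter(op for op in opcode_sequence if op in TOP_20)
    return [counts[op] for op in TOP_20]

sample = ["mov", "push", "mov", "call", "retn", "mov"]
print(bow_vector(sample))  # length-20 vector, mostly zeros
```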
3.3.2 GCN-TF-IDF
This section presents the TF-IDF technique for generating feature vectors for each
malware file containing the opcode sequences. In our implementation, we first create a
document-term matrix, where each row represents a document (i.e., malware file) and
each column represents an opcode. The entries in the matrix correspond to the term
frequency of each opcode in each document. Next, we compute the inverse document
frequency for each opcode as described in Section 2.2.2. Finally, we calculate the
TF-IDF score for each opcode by multiplying its TF score with the IDF value. The
resulting TF-IDF matrix is then used to generate a feature vector for each file, where
each vector represents a document and contains the TF-IDF scores for every opcode
in the malware file. Figure 4 illustrates the process of generating feature vectors using the TF-IDF technique.
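As a rough sketch of this pipeline, the per-file TF-IDF vectors could be produced with scikit-learn's TfidfVectorizer instead of the manual computation described above; note that scikit-learn applies a smoothed variant of the IDF formula, and the file paths here are hypothetical:

```python
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# Each opcode file contains a space-separated opcode sequence for one malware sample.
opcode_files = sorted(Path("opcodes").glob("*.txt"))
documents = [f.read_text() for f in opcode_files]

# Rows of the resulting matrix are per-file feature vectors; columns are opcodes.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())   # the opcode vocabulary
print(tfidf_matrix.shape)                    # (num_files, num_opcodes)
```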
3.3.3 GCN-Word2Vec
Word2Vec is the third technique we use to generate feature vectors in this research [7]. In our experiment, we implement the gensim Word2Vec
model and train it using the opcode sequences. The vector length is kept as 100
and the window size is set to the default value of 5. The Word2Vec library offers both the CBOW and Skip-Gram training algorithms. To conduct our experiments, we opted to use the CBOW algorithm to train Word2Vec.
The trained Word2Vec model is then employed to generate feature vectors for each
document by averaging the opcode vectors of all the opcodes present in the document.
Figure 5 illustrates the usage of the Word2Vec model in generating feature vectors for each malware file.
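A simplified sketch of this step, assuming gensim and the opcode text files described earlier (parameters beyond those stated above are illustrative assumptions), is:

```python
import numpy as np
from pathlib import Path
from gensim.models import Word2Vec

# Each document is the opcode sequence of one malware file.
documents = [f.read_text().split() for f in sorted(Path("opcodes").glob("*.txt"))]

# vector_size=100 and window=5 as described above; sg=0 selects the CBOW algorithm.
w2v = Word2Vec(sentences=documents, vector_size=100, window=5, sg=0, min_count=1)

def document_vector(opcodes):
    """Average the Word2Vec vectors of all opcodes present in one file."""
    vectors = [w2v.wv[op] for op in opcodes if op in w2v.wv]
    return np.mean(vectors, axis=0)

features = [document_vector(doc) for doc in documents]
print(features[0].shape)  # (100,)
```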
3.3.4 Knowledge Graph Construction

We utilize Python's NetworkX library to create knowledge graphs from the opcode
files. The nodes in the graph represent the opcodes, and the edges between them
indicate the relationship between the opcodes. Specifically, we create an edge between
two opcodes if they occur consecutively in the file. To compute the edge weights, we count how often each pair of opcodes occurs consecutively in the file. As a baseline, we first build these graphs directly from the opcode files without any word embedding. This baseline model is used to compare
with the word-embedded graphs. After creating the graph, we save it in a .pkl file
format, along with its corresponding label. During classification, this data is retrieved
from the .pkl file and used for further analysis. An example of a regular graph built in this way is shown in the Appendix.
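The following sketch shows one way such a graph could be built and pickled with NetworkX; it is a simplified illustration that assumes the consecutive-pair counting scheme described above and uses hypothetical file names:

```python
import pickle
from pathlib import Path
import networkx as nx

def build_opcode_graph(opcodes):
    """Nodes are opcodes; an edge links opcodes that occur consecutively.

    The edge weight counts how often the pair appears next to each other
    (an assumption about the weighting scheme, for illustration only).
    """
    g = nx.Graph()
    for a, b in zip(opcodes, opcodes[1:]):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)
    return g

opcode_file = Path("opcodes/sample_0001.txt")   # hypothetical file name
graph = build_opcode_graph(opcode_file.read_text().split())
label = "BHO"                                    # family label taken from the reference .csv

Path("graphs").mkdir(exist_ok=True)
with open("graphs/sample_0001.pkl", "wb") as fh:
    pickle.dump({"graph": graph, "label": label}, fh)
```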
We store the feature vectors generated by the word embedding techniques in the .pkl files along with the graph and label data.
During the training phase, these feature vectors are embedded as node features in the
graph that is loaded. This additional information enhances the performance of the
GNN models. Figures A.16, A.17, and A.18 illustrate the knowledge graphs created
using BoW, TF-IDF, and Word2Vec techniques respectively. Each of these techniques produces a different set of node features for the same underlying graph structure.
3.3.5 Results for Word Embedding Experiments

The classification results obtained for various word embedding techniques are
summarized in Table 2. The same GCN model described in Section 3.4.1 is used
to classify the graphs generated using these word embedding techniques. These
results help us in deciding which word embedding technique to continue with for
our implementation using GNN models. We can see that only Word2Vec is giving a higher accuracy than the baseline model. On the other hand, TF-IDF and BoW worsen the classification accuracy, with BoW giving 59.80% and TF-IDF only 22.60%. One likely reason is that the feature vectors generated by the BoW and TF-IDF techniques are 20x1, whereas the feature vectors generated by Word2Vec
are of size 20x100. This indicates the dimensionality of the vector space that each
technique uses to represent the text data. A larger vector size allows for more complex
and nuanced relationships between opcodes to be captured, which can improve the
performance in downstream tasks such as classification. Therefore, we select Word2Vec
as the primary word embedding technique for our GNN models. Section 3.4 describes
this in detail.
Table 2: Classification accuracy for each word embedding technique

Model            Accuracy
Baseline Model   71.60%
GCN-BoW          59.80%
GCN-TFIDF        22.60%
GCN-Word2Vec     91.91%
3.4 GNN Models

This section presents the architecture of the three GNNs used in this research for
malware classification. All models are designed to handle graph data with Word2Vec
generated feature vectors that are embedded as node features in the graphs.
In our implementation, the first step involves loading all the graph data from the
stored .pkl files. The feature vectors are then read from the files and embedded in
the nodes after the graph is loaded. Before training the GNN models, we use graph
generators to create graph data generators, which are used to feed the graph data to
the GNN models during the training phase. This enables the models to learn from the graph samples in batches.
3.4.1 Word2Vec-GCN
GCN is a type of GNN that operates directly on graph-structured data [43], and it performs message passing over the graph to compute node representations.
The GCN model architecture consists of two graph convolutional layers with
64 units each and a ReLU activation function. A dropout rate of 0.4 is applied to the convolutional layers to prevent the model from overfitting. The global
average pooling layer aggregates the node features of the graph into a single vector
representation. The output of the global average pooling layer is then fed into two
fully connected dense layers, the first with 32 units and ReLU activation, and the second
with 5 units and softmax activation, which generates a probability distribution over
the five possible classes. The Adam optimizer is used to minimize sparse categorical
cross-entropy, which computes the difference between the predicted and true class
labels. The accuracy metric is used to assess the model’s performance on the test data.
These values are summarized in Table 3. Experiment results obtained with GCN are presented in Section 3.5.1, and a code sketch of this architecture is shown after the table.
Table 3: GCN hyperparameters

Hyperparameter                   Value
Number of GCN layers             2
Number of units per GCN layer    64
Dense layer sizes                [32, 5]
Activation                       [relu, softmax]
Dropout rate                     0.4
Learning rate                    0.001
Optimization algorithm           Adam
Loss function                    Sparse categorical cross-entropy
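A minimal sketch of this architecture is shown below. It assumes the spektral library with Keras in batch mode and hypothetical input dimensions (20 opcode nodes with 100-dimensional Word2Vec features); the project's actual data generators and training loop are not reproduced here.

```python
import tensorflow as tf
from spektral.layers import GCNConv, GlobalAvgPool

N, F, n_classes = 20, 100, 5          # nodes per graph, feature size, malware families

x_in = tf.keras.Input(shape=(N, F))   # node feature matrix
a_in = tf.keras.Input(shape=(N, N))   # adjacency matrix (normalized, as GCNConv expects)

x = GCNConv(64, activation="relu")([x_in, a_in])
x = tf.keras.layers.Dropout(0.4)(x)
x = GCNConv(64, activation="relu")([x, a_in])
x = tf.keras.layers.Dropout(0.4)(x)
x = GlobalAvgPool()(x)                               # one vector per graph
x = tf.keras.layers.Dense(32, activation="relu")(x)
out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)

model = tf.keras.Model([x_in, a_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```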
3.4.2 Word2Vec-GAT
The GAT model uses attention mechanisms to compute the node embeddings in the graph by aggregating information from the neighboring
nodes. The GATConv layer, available in the spektral library of Python, is used to
implement the GAT model. In our implementation, we use two GAT layers that
consist of 64 hidden units with a dropout rate of 0.5 and elu activation function.
The attn_heads parameter defines the number of attention heads used by the
GAT model. Each head computes a separate attention coefficient for each neighbor
of a node and then concatenates the results. The value of this hyperparameter is set
to 8 and the dropout rate is set to 0.4 in the GAT layers. A global sum pooling layer, a dense layer with 64 units and relu activation function, and an output layer with 5 units and softmax activation function are added. The Adam optimizer is used with a learning rate of 0.005. The categorical cross-entropy loss function measures the variance between the predicted and actual labels, and the accuracy metric is used to assess the performance of the model. These values are summarized in Table 4; a code sketch of this model follows the table.
Table 4: GAT hyperparameters

Hyperparameter                   Value
Number of GAT layers             2
Number of units per GAT layer    64
Dense layer sizes                [64, 5]
Activation                       [elu, relu, softmax]
Attention heads                  8
Dropout rate                     0.4
Learning rate                    0.005
Optimization algorithm           Adam
Loss function                    Categorical cross-entropy
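For completeness, here is a comparable sketch of the GAT variant, again assuming spektral's GATConv and GlobalSumPool in batch mode with the same hypothetical input dimensions:

```python
import tensorflow as tf
from spektral.layers import GATConv, GlobalSumPool

N, F, n_classes = 20, 100, 5

x_in = tf.keras.Input(shape=(N, F))
a_in = tf.keras.Input(shape=(N, N))

x = GATConv(64, attn_heads=8, dropout_rate=0.4, activation="elu")([x_in, a_in])
x = GATConv(64, attn_heads=8, dropout_rate=0.4, activation="elu")([x, a_in])
x = GlobalSumPool()(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)

model = tf.keras.Model([x_in, a_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```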
3.4.3 Word2Vec-GraphSAGE
The GraphSAGE model samples and aggregates information from node neighborhoods, with the batch size and the number of samples specified in the generator. The model consists of two GraphSAGE layers with hidden dimensions of 32 and ReLU activation function.
The dropout rate is set to 0.5 in each layer. Global max pooling operation
is applied to obtain a single feature vector representing the entire graph, which is
then fed into a dense output layer with a softmax activation function to generate
the final classification output. The model is compiled using the Adam optimizer with a learning rate of 0.005 and the categorical cross-entropy loss function. These values are summarized in the following table.
GraphSAGE hyperparameters

Hyperparameter                         Value
Number of GraphSAGE layers             2
Number of units per GraphSAGE layer    32
Dense layer sizes                      [32, 5]
Activation                             [relu, softmax]
Dropout rate                           0.5
Learning rate                          0.005
Optimization algorithm                 Adam
Loss function                          Categorical cross-entropy
3.5 Classification Results

We train the GNN models on graph samples using the best hyperparameter values described above. Each model is first trained on graphs without any word embeddings (the baseline) and then on graphs with Word2Vec node features, varying the vector length from 1 to 100, to observe the effect of word embeddings. 5,000 graph samples are generated
for each vector length category and the classification performance is evaluated using
accuracy and confusion matrices for each GNN model. Comparing the classification
results helps in gaining insights into how the quality of feature vectors affects the models' classification performance.
3.5.1 GCN Results
Figure 6 gives the confusion matrices for GCN. This model achieves an accuracy
of 79.60% for the baseline model, 60.20% for Word2Vec with vector length of 1, 84.70%
for Word2Vec with vector length of 20, 85.3% for Word2Vec with vector length of 50, and 91.10% for Word2Vec with vector length of 100.
3.5.2 GAT Results
Figure 7 gives the confusion matrices for GAT. This model achieves an accuracy
of 73.80% for the baseline model, 42.90% for Word2Vec with vector length of 1, 80.80%
for Word2Vec with vector length of 20, and 83.80% for Word2Vec with vector length of 50. The accuracy improves further for a vector length of 100.
3.5.3 GraphSAGE Results
Figure 8 gives the confusion matrices for GraphSAGE. This graph model achieves
an accuracy of 75.90% for the baseline model, 47.50% for Word2Vec with vector
length of 1, 76.80% for Word2Vec with vector length of 20, 82.70% for Word2Vec
with vector length of 50 and 84.70% for Word2Vec with vector length of 100.
3.5.4 Discussion
Figure 9 shows the accuracy achieved by each GNN architecture for the baseline
model and Word2Vec embeddings with vector lengths of 1, 20, 50, and 100. The
results indicate that the classification accuracy improves significantly as the length of
the embedded vector increases. The experiments are concluded at vector length 100 because the accuracy does not improve noticeably even when the vector length is increased up to 200. The results also show that GCN outperforms
GAT and GraphSAGE, achieving an accuracy of 91.10% for a vector length of 100.
GAT and GraphSAGE produced very similar results, with GAT performing slightly
better.
Based on the findings, it can be inferred that as the length of the Word2Vec
vectors increases, the models become capable of capturing more fine-grained details
of the opcode sequence. This, in turn, leads to the creation of higher-quality feature
vectors, which are more effectively utilized by the GNNs to capture the underlying graph structure and classify the malware samples accurately.
Figure 9: Accuracy for GNN models with varying Word2Vec vector lengths
CHAPTER 4

Conclusion and Future Work

In this research, we investigated the use of GNNs for classifying malware based on opcode sequences. The results of our experiments provide strong evidence that using word embeddings together with GNNs is an effective approach for classifying malware files using knowledge graphs constructed from opcode sequences. Our results are summarized below.
The first set of experiments investigated the impact of different word embedding techniques on the classification performance of GNN models. The results showed that Word2Vec produces the most effective word embeddings. We then generated Word2Vec embeddings of various lengths and constructed knowledge graphs with these embeddings as node features. Through these experiments, we found that knowledge graphs with larger feature vectors improve the models' performance in classifying the malware files into their respective families.
Our results indicate that the length of the Word2Vec vectors has a significant impact on classification accuracy. As the length of the embedded vector increases, the models become capable of capturing more fine-grained details of
the opcode sequence, leading to the creation of higher-quality feature vectors that are
more effectively utilized by the GNNs to capture the underlying graph structure and
classify nodes accurately. The experiments showed that the classification accuracy
improves significantly as the length of the embedded vector increases. However, there is little additional improvement beyond a vector length of 100.
Our study has several implications for the field of malware analysis. First,
the use of GNNs for malware classification has shown promising results, indicating
the potential of using graph-based approaches for malware analysis. Second, the comparison of word embedding techniques shows that the quality of the feature vectors strongly influences classification performance. Finally, our study highlights the importance of selecting appropriate GNN architectures and feature vector lengths for the task at hand.
One possible direction for future work in this field is to investigate the effectiveness
of other word embedding techniques, such as GloVe [44] and FastText [45]. Moreover, it would be worth examining the effect of graph sampling strategies, such as subgraph sampling [46] and random walks [47], on the classification performance
of GNN models. As we are dealing with only five malware families with 1,000 malware
files in each, it would be valuable to assess the performance of GNN models on larger and more diverse datasets. Another potential direction for future research is to explore other GNN architectures, such as the recently proposed Transformer-based GNNs [48], and to compare their
performance with traditional GNN models. Other than word embedding generated
feature vectors, additional features, such as file size and entropy, can be added to the knowledge graphs. Finally, incorporating temporal information into the graph, such as the order in which opcodes were executed, could further improve classification performance.
LIST OF REFERENCES
[1] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun,
‘‘Graph neural networks: A review of methods and applications,’’ 2021.
[11] W. Liu, P. Ren, K. Liu, and H. Duan, ‘‘Behavior-based malware analysis and
detection,’’ in 2010 International Conference on Intelligent Computing and Inte-
grated Systems. IEEE, 2010, pp. 455--458.
[12] T. Bilot, N. E. Madhoun, K. A. Agha, and A. Zouaoui, ‘‘A survey on malware
detection with graph representation learning,’’ 2023.
[13] Z. Zhang, Y. Li, W. Wang, H. Song, and H. Dong, ‘‘Malware detection
with dynamic evolving graph convolutional networks,’’ Int. J. Intell.
Syst., vol. 37, no. 10, pp. 7261--7280, Mar. 2022. [Online]. Available:
https://doi.org/10.1002/int.22880
[14] S. Li, Q. Zhou, R. Zhou, J. Li, and H. Chen, ‘‘Intelligent malware detection based
on graph convolutional network,’’ Journal of Supercomputing, vol. 78, no. 5, pp.
4182--4198, 2022.
[15] C. Catal, H. Gündüz, and A. Ozcan, ‘‘Malware detection based on graph attention
networks for intelligent transportation systems,’’ Electronics, vol. 10, p. 2534, 10
2021.
[16] V. Ravi, M. Alazab, S. Selvaganapathy, and R. Chaganti, ‘‘A multi-view
attention-based deep learning framework for malware detection in smart
healthcare systems,’’ Computer Communications, vol. 195, pp. 73--81,
2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/
S0140366422003231
[17] S. Li, Q. Zhou, R. Zhou, and Q. Lv, ‘‘Intelligent malware detection based on
graph convolutional network,’’ The Journal of Supercomputing, vol. 78, 02 2022.
[18] Y.-H. Chen, J.-L. Chen, and R.-F. Deng, ‘‘Similarity-based malware classification
using graph neural networks,’’ Applied Sciences, vol. 12, no. 21, 2022. [Online].
Available: https://www.mdpi.com/2076-3417/12/21/10837
[19] X. Yang, D. Yang, and Y. Li, ‘‘A hybrid attention network for malware detection
based on multi-feature aligned and fusion,’’ Electronics, vol. 12, no. 3, 2023.
[Online]. Available: https://www.mdpi.com/2079-9292/12/3/713
[20] A. Kale, F. Di Troia, and M. Stamp, ‘‘Malware classification with word embedding
features,’’ 03 2021.
[21] A. Chandak, W. Lee, and M. Stamp, ‘‘A comparison of word2vec, hmm2vec, and
pca2vec for malware classification,’’ 03 2021.
[22] W. Qader, M. M. Ameen, and B. Ahmed, ‘‘An overview of bag of
words; importance, implementation, applications, and challenges,'' 06 2019, pp.
200--204.
[23] K. Juluru, H.-H. Shih, K. N. Keshava Murthy, and P. Elnajjar, ‘‘Bag-of-words
technique in natural language processing: A primer for radiologists,’’
RadioGraphics, vol. 41, no. 5, pp. 1420--1426, 2021, PMID: 34388050. [Online].
Available: https://doi.org/10.1148/rg.2021210025
[24] S. Qaiser and R. Ali, ‘‘Text mining: Use of tf-idf to examine the relevance of
words to documents,’’ International Journal of Computer Applications, vol. 181,
07 2018.
[31] S. Zhang, H. Tong, J. Xu, and J. Ye, ‘‘Graph convolutional networks: a compre-
hensive review,’’ Computational Social Networks, vol. 6, no. 1, p. 11, 2019.
[34] Y. Liu and X. Gou, ‘‘A text classification method based on graph attention
networks,’’ in 2021 International Conference on Information Technology and
Biomedical Engineering (ICITBE), 2021, pp. 35--39.
[36] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec,
‘‘Graph convolutional neural networks for web-scale recommender systems,’’
in Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. ACM, jul 2018. [Online]. Available:
https://doi.org/10.1145%2F3219819.3219890
[37] L. Yao, C. Mao, and Y. Luo, ‘‘Graph convolutional networks for text classifica-
tion,’’ 2018.
[38] Microsoft Security Intelligence, ‘‘Trojan:win32/bho,’’ https://www.microsoft.
com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Trojan:
Win32/BHO&threatId=-2147364778, 2010.
[39] Microsoft Security Intelligence, ‘‘Pws:win32/onlinegames,’’ https:
//www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?
Name=PWS%3AWin32%2FOnLineGames, 2010.
[40] Microsoft Security Intelligence, ‘‘TrojanDownloader:Win32/Renos,’’
https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-
description?Name=TrojanDownloader:Win32/Renos&threatId=16054, 2010.
[41] Microsoft Security Intelligence, ‘‘VBInject,’’ https://www.microsoft.com/en-
us/wdsi/threats/malware-encyclopedia-description?Name=VirTool:
Win32/VBInject%26ThreatID=-2147367171, 2010.
[42] Microsoft Security Intelligence, ‘‘Winwebsec,’’ https://www.microsoft.com/
security/portal/threat/encyclopedia/entry.aspx?Name=Win32%2fWinwebsec,
2010.
[43] A. Voytetskiy, A. Herbert, and M. Poptsova, ‘‘Graph neural networks
for z-dna prediction in genomes,’’ bioRxiv, 2022. [Online]. Available:
https://www.biorxiv.org/content/early/2022/08/25/2022.08.23.504929
[44] J. Pennington, R. Socher, and C. Manning, ‘‘Glove: Global vectors for word
representation,’’ vol. 14, 01 2014, pp. 1532--1543.
[45] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, ‘‘Enriching word vectors
with subword information,’’ 2017.
[46] J. Wang, P. Chen, B. Ma, J. Zhou, Z. Ruan, G. Chen, and Q. Xuan, ‘‘Sampling
subgraph network with application to graph classification,’’ 2021.
[47] D. Jin, R. Wang, M. Ge, D. He, X. Li, W. Lin, and W. Zhang, ‘‘Raw-gnn:
Random walk aggregation based graph neural network,’’ 2022.
[48] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, ‘‘Graph transformer networks,’’
2020.
APPENDIX
Appendix
A.1 Additional Experiments
Bidirectional Encoder Representations from Transformers (BERT) has the ability to capture word
context and produce more significant representations. The following sections present
the implementation details and the outcomes achieved by training GNN models with
BERT embeddings.
A.1.1 BERT Implementation

To implement BERT, the BERT model and tokenizer are imported from the Hugging Face transformers library. Since the model can only process text of length 512, the opcode sequence from the malware file is split into
chunks of length 512 and each chunk is processed sequentially. The tokenizer is then
used to tokenize the opcodes and generate token IDs. For each opcode in the sequence,
the model takes these token IDs as input and generates embeddings of length 768. To
generate feature vectors for unique opcodes in the file, the feature vector values for
the same opcodes are averaged. These features are then used as node embeddings in the knowledge graphs.
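A rough sketch of this step, assuming the Hugging Face transformers package with PyTorch, is shown below. The model name, file path, and the simplification that subword pieces are treated as separate tokens are illustrative assumptions rather than the exact configuration used here:

```python
import torch
from collections import defaultdict
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

opcodes = open("opcodes/sample_0001.txt").read().split()   # hypothetical opcode file

# Collect 768-dimensional embeddings for every opcode occurrence, chunk by chunk.
per_opcode = defaultdict(list)
chunk_size = 510                       # leave room for the [CLS] and [SEP] tokens
for i in range(0, len(opcodes), chunk_size):
    chunk = opcodes[i:i + chunk_size]
    inputs = tokenizer(" ".join(chunk), return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for tok, vec in zip(tokens, hidden):
        if tok not in ("[CLS]", "[SEP]"):
            per_opcode[tok].append(vec)

# Average the embeddings of identical opcodes to get one 768-dim vector per opcode.
node_features = {op: torch.stack(vs).mean(dim=0) for op, vs in per_opcode.items()}
print(len(node_features), next(iter(node_features.values())).shape)
```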
A.1.2 Results
The performance results of GCN, GAT, and GraphSAGE using BERT embeddings are presented below.
The obtained results are quite comparable to those achieved by GNN models
trained with Word2Vec for a vector length of 100. It is evident from the results that
using a pre-trained model like BERT in our case is not providing any significant advantage in terms of improved classification performance when compared to Word2Vec.
Classification accuracy with BERT embeddings

Model       Accuracy
GCN         89.90%
GAT         87.80%
GraphSAGE   87.00%
The classification results obtained using BERT are compared to the results
obtained using Word2Vec with vector length of 100. As shown in Figure A.11, it can
be observed that the accuracies achieved by BERT and Word2Vec are quite similar.
A.2 Additional Results
A.2.1 Accuracy-Loss Graphs
Below are the accuracy-loss graphs of all three models, taken during their training phase.
A.2.2 Knowledge Graphs
Figure A.16: Bag-of-Words Graph
Figure A.17: TF-IDF Graph
Figure A.18: Word2Vec Graph