
Deep Learning-Based Code Vulnerability Detection:

A New Perspective

Bachelor Thesis

Bachelor of Science
Department of Business Information Systems
Major Data Science
Baden-Wuerttemberg Cooperative State University

Amos Dinh

Prof. Dr. Maximilian Scherer, Academic Supervisor
Dr. rer. nat. Martin Härterich, SAP SE, Company Supervisor

12th of February - 6th of May 2024

Declaration of Originality
I herewith declare that I have composed the thesis
“Deep Learning-Based Code Vulnerability Detection: A New Perspective”
myself and without the use of any other than the cited sources and aids. Furthermore,
the submitted electronic version of the thesis matches the printed version.

Speyer, 6th of May 2024

Amos Dinh

Abstract
Automatic code vulnerability detection is an ongoing research field. The employed algo-
rithms detect whether a piece of source code contains a vulnerability that could render
the whole application open to malicious attacks. Among recent methods, Deep Learn-
ing-based approaches have been proposed which leverage token- or graph-based source
code representations to discover vulnerabilities.
In the current work, the performance of Deep Learning-based methods is investigated by
employing Graph Neural Networks on a large vulnerability detection dataset. In detail,
we examine the dimensions data, architecture, training and evaluation and show how
a simple baseline which measures only the code complexity outperforms both Graph
Neural Networks and Large Language Model-based approaches.
Further, performance is not improved with different architectures such as Graph Struc-
ture Learning-based and heterogeneous models, nor with specifically devised multitask-
and multistage pretraining on the code graphs.
We demonstrate how the dataset composition skews the performance and communicates
overoptimistic results. Consequently, only rigorous evaluation, including careful
train-test separation on code-project level, stratification of the predictions by code
complexity and comparison against an appropriate baseline, depicts the models' detection
capability truthfully. The findings are not specific to this dataset but affect
multiple other datasets in the field.
Being now able to measure the detection capability of the models more precisely, we
conclude that for increasing vulnerability detection performance more data is needed,
and that simple model architectures suffice in the current setting.

Table of Contents
List of Figures
List of Tables
1 Introduction
2 Preliminaries
2.1 Graph Concepts
2.2 Automatic Vulnerability Detection
2.3 Graph Neural Networks
2.3.1 Graph Classification
2.3.2 Metrics
2.4 Graph Structure Learning
2.5 Pretraining Methods
2.5.1 Multitask Pretraining
2.5.2 Pretraining on Graphs
3 Related Work
3.1 Deep Learning-based approaches to Automatic Vulnerability Detection
3.2 Respecting the Graph Structure
4 Experiments
4.1 Experiment 1: The Importance of Project-Based Train-Test Separation
4.2 The DiverseVul Dataset and Splitting Approach
4.3 Experiment 2: num_nodes Baseline
4.4 Experiment 3: Graph Representation
4.5 Experiment 4: Architecture
4.6 Experiment 5: Pretraining
4.7 Experiment 6: Stratification
4.8 Experiment 7: Performance per CWE
4.9 Technical Details
5 Discussion
Bibliography
Index of Appendices

List of Abbreviations
AST: Abstract Syntax Tree
CFG: Control Flow Graph
CG: Call Graph
CPG: Code Property Graph
CVE: Common Vulnerabilities and Exposures
CWE: Common Weakness Enumeration
DFG: Data Flow Graph
DL: Deep Learning
GCN: Graph Convolutional Network
GGNN: Gated Graph Neural Network
GIN: Graph Isomorphism Network
GNN: Graph Neural Network
GSL: Graph Structure Learning
GraphGLOW: Graph Structure Learning Model for
Open-World Generalization
HGP-SL: Hierarchical Graph Pooling with Structure Learning
LLM: Large Language Model
MLP: Multilayer Perceptron
MSE: Mean Squared Error
NLL: Negative Log Likelihood
NN: Neural Network
OSS: Open Source Software
RGCN: Relational Graph Convolutional Network
WL test: Weisfeiler-Lehman Graph Isomorphism Test

List of Figures
Figure 1: GNN aggregators fail to compute distinguishable node representations.
Figure 2: Function samples per project and median size of the extracted graphs.
Figure 3: Number of nodes in the CPG.
Figure 4: The 𝜎 parameter during pretraining and link prediction performance.
Figure 5: Training and validation loss of different pretraining methods.
Figure 6: Prediction performance stratified by graph size.
Figure 7: Optimal logit threshold per graph size and code length of samples.
Figure 8: Larger dataset size increases model performance.
Figure 9: Predictions per CWE.

List of Tables
Table 1: Test results on Previous & DiverseVul.
Table 2: Test performance of num_nodes.
Table 3: Average performance of models trained on 20% of the training data.
Table 4: Validation performance of models.
Table 5: Test performance of models.
Table 6: Performance of overfitting the training set.


1 Introduction
Security flaws in program code leave software and applications vulnerable to attacks
with malicious intent. Applications are increasingly based on a variety of underlying
Open Source Software (OSS) libraries. Thereby, the attack surface is drastically
expanded: software protocols differ, and while sufficient security checks might exist
within a single library, the interfaces between libraries, as well as the modularity
and complexity of today's applications, create room for weaknesses. Simultaneously,
the manual labor of security experts is costly, time consuming [1] and grows with
project complexity. As a result, it might be infeasible to verify the safety of all
code manually.
Automatic methods have been developed to aid the manual vulnerability discovery
process. Static methods such as “FlawFinder” [2], [3] analyze source code by matching
it against a known list of tokens or patterns which indicate vulnerabilities. Dynamic
methods analyze code at runtime. Techniques such as fuzzing inject pseudo-random
input into code and examine the output to discover bugs or vulnerabilities [4].
The availability of open source code as a data source permits the application of
Deep Learning (DL)-based methods. However, how to effectively gather and curate
datasets for vulnerability discovery is still an unsolved problem. The promise of DL-
based methods is that they may detect more vulnerabilities while reducing the num-
ber of false alerts. When applied effectively, they could alleviate the shortcomings
of traditional static and dynamic detection such as limitations of manually defined
rule sets or randomized compute-limited fuzzing that requires program code to be
executable. More effective automatic means would both increase security detection
coverage across the OSS landscape and correspond to many saved security expert-
hours. Ultimately, they would help defend against cyber-security attacks, both of
economical and political value.
For this purpose, the present work examines Graph Neural Networks (GNNs) to carry
out the task of binary vulnerability classification on function-level Code Property
Graphs (CPGs). The graphs are extracted from the function files of the relatively new
DiverseVul dataset [5]. We employ GNNs because they are small in parameter size,
allowing for fast experimentation. Additionally, the authors of the DiverseVul
dataset [5] have already examined the performance of Large Language Models (LLMs).
We focus on exploring the relevant factors data, architecture, training and evaluation
within the domain of GNNs to gain a holistic understanding of the current state
of DL-based automatic vulnerability detection. Overall, it is discovered that careful
evaluation of the models reveals their inability to learn significantly more than a
naive baseline.
Accordingly, the key contributions of this work include:
Data: The most important pillar of the work is the splitting of the DiverseVul dataset
[5] on project level, preventing test-data leakage. Project-level splits were previously
not practical, as former datasets were not sufficiently large, which gave rise to inflated
performance results. Earlier work such as [5] and [6] has examined the setting of
out-of-training-distribution testing. However, their work still measures performance
and draws conclusions from settings in which training and test samples share the
same projects. We reason that the test-set contamination the authors notice in their
own work makes it difficult to draw any conclusions in the former setting. Therefore,
we abandon it entirely in the present work.
This seemingly minor detail of train-test separation allows for a more truthful
depiction and examination of DL-based vulnerability detection performance, measured in
a realistic and arguably more useful scenario where methods are applied to unseen
codebases.
Further, we empirically investigate the impact on performance of adding features
such as node degree and triangle counts to the graph samples. Additionally, it is ex-
amined whether directed, undirected and heterogeneous variations of the underlying
graphs yield best performance, also by leveraging the Relational Graph Convolutional
Network (RGCN) [7] architecture.
Architecture: We investigate the performance of GNNs on the DiverseVul dataset
[5]. In particular, we compare different architectures, including the state-of-the-art
method ReVeal [6], base architectures such as the Graph Convolutional Network (GCN)
[8] and the Graph Isomorphism Network (GIN) [9], as well as Graph Structure
Learning-based methods.
Training: We derive and examine the performance of various pretraining methods,
including multi-task and multi-stage pretraining.
Evaluation: An initial evaluation reveals surprisingly similar performance between
the models. Delving deeper, we find several simple tools helpful for comparing
performance more accurately, controlling for unwanted factors of variation in the
dataset. These tools include a simple baseline as well as stratification per
vulnerability type.


In the following, in Chapter 2 we will first establish important concepts related to the
pillars data, architecture, training and performance, such as how the CPG of program
code is constructed, an introduction to GNNs and graph classification, as well as
pretraining and the balanced accuracy metric. Chapter 3 introduces related work
and in Chapter 4 we present the experiments. Finally, in Chapter 5 we re-evaluate
and highlight the conclusions drawn.
The code for all experiments is available at
https://github.com/AmosDinh/security-research-graph-learning.


2 Preliminaries
This chapter introduces the knowledge utilized throughout the work. First, general
graph concepts are defined, followed by an overview of automatic vulnerability
detection. Finally, GNNs are reviewed in conjunction with related concepts such as
graph classification, the expressiveness of GNNs, Graph Structure Learning (GSL) and
pretraining on graphs, upon which the attempts at increasing the vulnerability
detection performance are based.

2.1 Graph Concepts

Graphs have become an important tool in areas such as geography, chemistry, soci-
ology, linguistics or computer science to model and reason about concepts [10] or
solve problems.
A simple graph 𝐺 can be defined in terms of a set of vertices 𝒱 and a set of edges ℰ
which connect vertices in a directed or undirected manner, (𝑣1, 𝑣2) or {𝑣1, 𝑣2} [10].
For graphs in general, multiple edges between the same pair of vertices might exist,
and the graph may contain self-loops connecting a node to itself.

$$\mathbf{A}_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in \mathcal{E} \\ 0 & \text{else} \end{cases} \qquad (1)$$

For a simple directed graph, the set of edges can be expressed in terms of the
adjacency matrix (Equation 1). The representation as an adjacency matrix allows for
a simple formulation of message passing in GNNs, explained in a subsequent section.
A heterogeneous graph additionally includes the mapping functions 𝜏 (𝑣) : 𝒱 → 𝒜
and 𝜑(𝑒) : ℰ → ℛ which map vertices, also called nodes, and edges to their respec-
tive node and edge types [11]. This translates to the graph representation containing
multiple adjacency matrices, one for each edge type. In a social network, where nodes
might represent people, categories such as “child”, “adult” could be formulated as
node types and edge types might represent relationships between people such as “is
friend”, “is spouse”.


In the present paper, node degree and triangle count are examined for their usefulness
as features for DL-based vulnerability detection. The degree is the number of
neighbors a node is connected to through edges. In the directed case, there exist
two degree statistics for each node, an in-degree and an out-degree. The triangle
count is the number of 3-cliques a node 𝑣 participates in [12]: node 𝑣 is part of a
triangle if there are distinct nodes 𝑢 and 𝑤 such that 𝑣, 𝑢 and 𝑤 are pairwise
connected by edges. Degree and triangle counts can directly be employed to determine
the local clustering coefficient of a node [12], which can, for instance, help detect
spamming activity in web graphs [12].
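To make these statistics concrete, the following minimal sketch computes all three
quantities with networkx; the library choice and the toy graph are illustrative and
not part of the thesis pipeline:

```python
import networkx as nx

# Toy graph: a triangle {0, 1, 2} plus a pendant node 3.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

degrees = dict(G.degree())     # neighbor count per node
triangles = nx.triangles(G)    # number of 3-cliques each node participates in
clustering = nx.clustering(G)  # local clustering coefficient per node

# The clustering coefficient follows directly from the two statistics:
# c(v) = 2 * triangles(v) / (deg(v) * (deg(v) - 1))
v = 2
c_v = 2 * triangles[v] / (degrees[v] * (degrees[v] - 1))
assert abs(c_v - clustering[v]) < 1e-12
```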
In the following sections we use 𝐼 to denote the identity matrix of appropriate size,
which has ones on the diagonal and zeros elsewhere.

2.2 Automatic Vulnerability Detection

This section introduces common vulnerability detection concepts.


Definition:

“[A code vulnerability can be defined as a] weakness in the computational logic
[…] found in software and hardware components that, when exploited, results in
a negative impact to confidentiality, integrity, or availability. […]”
— CVE Definition, National Institute of Standards and Technology [13]

The most common categorization of vulnerabilities is provided by the Common
Vulnerabilities and Exposures (CVE) database [14]. Companies and individuals
can submit discovered instances of vulnerabilities in program code to the CVE
database, which are then mapped to one or multiple underlying weaknesses, categorized
by the Common Weakness Enumeration (CWE) [15] classes in a hierarchical
manner.
To give an example, common weaknesses include CWE-787 “Out-of-bounds Write”
or CWE-79 “Improper Neutralization of Input During Web Page Generation (‘Cross-
site Scripting’)” [16]. In CWE-787 the program writes data into computer memory
past the intended bounds. An attacker can thereby overwrite parts of computer mem-
ory which were not intended to be writable. Subsequent access to the overwritten
locations may now result in operations specified by the attacker.

The field of automatic vulnerability detection is concerned with discovering
security-relevant weaknesses in software by automatic means. Especially C and C++
code is in the focus of security researchers because of its widespread use, and
thereby data availability, and its larger attack surface in comparison to higher-level
programming languages like Python.
Code vulnerability detection is inherently a difficult task. Grounded in the halting
problem, detecting all vulnerabilities in a program with the help of another program
which terminates in finite time is theoretically intractable [17].
The project OSS-Fuzz [18] illustrates this claim in a practical manner. It tests the
code of multiple OSS projects continuously against trillions of generated test cases
every week to discover new vulnerabilities. The endeavor emphasizes that efficient
and effective vulnerability detection is still a largely unsolved challenge. [19] group
efforts of automatic vulnerability detection into three categories:
Static analysis examines a program without execution. Graph-based static analyz-
ers model the program as a heterogeneous graph with vertices representing program
statements. The implementation employed in this work [20], [21] converts C files
to their graph representation, a Code Property Graph (CPG). This graph contains
multiple directed edge types:
• The Abstract Syntax Tree (AST) contains the main edge type in the graph.
It splits the program on syntactical symbol level. For example, the assignment
*prelink = '0'; is split into nodes * , prelink , = and '0' , which are
connected in a hierarchical structure with = as the root node.
• The Data Flow Graph (DFG) represents data dependencies between opera-
tions. It tracks access and modification of variables [22].
• The Control Flow Graph (CFG) marks a subset of the AST nodes as control
nodes and connects them to model the program control flow. The flow de-
scribes all possible paths which could be taken over the course of execution
and is determined by statements such as if, for or switch [22].
• The Call Graph (CG) represents method caller and receiver relationships.
Parsers such as [20] are “forgiving” in that they do not require programs to be
executable to create the CPG. Graph-based static analyzers then model and identify
vulnerabilities based on the CPG. They can examine high volumes of code but suffer
from the lack of run-time information, making them susceptible to a high
false-positive rate [23].
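For illustration, the AST fragment for the assignment example from the bullet list
above can be written down as a plain edge list; the node ids and dictionary layout
are hypothetical and not the actual output format of the parser [20]:

```python
# Hand-built AST edges for *prelink = '0'; with "=" as the root node.
nodes = {0: "=", 1: "*", 2: "prelink", 3: "'0'"}
ast_edges = [
    (0, 1),  # "=" -> "*"        (left-hand side subtree)
    (1, 2),  # "*" -> "prelink"  (dereferenced variable)
    (0, 3),  # "=" -> "'0'"      (right-hand side literal)
]
```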


Dynamic analysis checks programs for vulnerabilities during run-time. Fuzzers like
the aforementioned OSS-Fuzz [18] inject random data into program code to induce
unexpected behavior. The process is augmented with heuristic tools. Dynamic taint
analysis tracks the data flow of user-controlled data [19]. Any operation which uses
the data is also marked as tainted. Potential vulnerabilities are then discovered
by identifying security-critical operations, such as control flow operations or system
calls, which come in contact with the tainted data. Drawbacks of dynamic analyzers
include their low code coverage. These methods therefore suffer from a high
false-negative rate, as they cannot generate sufficiently diverse input data to reveal
that a code sample is vulnerable [23].
Mixed analysis combines both static and dynamic methods. Concolic execution
[19] executes a program with a random input while collecting symbolic constraints.
A solver then creates a new input based on the symbolic information which steers
the execution towards a different path. Therefore, mixed analysis can alleviate the
drawbacks of static and dynamic analysis, combining high code coverage with access
to run-time information.

2.3 Graph Neural Networks

In this section, the concept of Graph Neural Networks (GNNs) is introduced, as well
as the GNN architectures which we use for the vulnerability classification.
GNNs are utilized in various fields which use graphs to model concepts (Section 2.1).
They have been employed to recommend content to users [24], model football dy-
namics [25] and accelerate fluid simulation [26].
They are amongst the most general classes of Artificial Neural Networks (ANNs) [27].
For example, architectures such as the Convolutional Neural Network (CNN) [28] as
well as the Transformer [29] can be formulated as GNNs [30].
Most common GNNs use the adjacency matrix 𝐀 of a graph to operate on a node
feature matrix 𝐗 which specifies an informative vector 𝐱 per node. They apply per-
mutation equivariant functions 𝐅(𝐗) by the use of permutation invariant functions
such as SUM, MEAN and MAX to aggregate local neighborhoods [30]. A Recurrent
Neural Network [31] may also be used as aggregator [32].


$$H^{l+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right), \qquad
H^{l+1}_{i,\cdot} = \sigma\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{\hat{A}_{ij}}{\sqrt{\hat{D}_{ii} \hat{D}_{jj}}} H^{l}_{j,\cdot} W^{l}\right) \qquad (2)$$

The GCN [8] introduces the concept of graph convolution (Equation 2), where σ is
a non-linear activation function, H^0 = X and the subsequent H^l are the learned node
representations at each layer l, Â = A + I, D̂_ii = Σ_j Â_ij, and W^l is a learnable
weight matrix. Similar to the convolution operation in CNNs, the graph convolution
is an aggregation of the local neighborhood around each node i.
The weight matrices W^l can be seen as “message” weights which transform the
representations H^l_{j,·} of the neighbors j, as shown in the lower part of Equation 2.
The neighbors' messages are aggregated by the SUM aggregator and finally transformed
by a non-linearity σ, “sending” a single message to create node i's representation
H^{l+1}_{i,·} at layer l + 1. Before aggregation, each neighbor j's message is
normalized by the geometric mean of i's and j's (in-)degree. In combination with the
SUM aggregator, this operation does not amount to a mere averaging of the neighbors
of i but models more complex relationships.
Besides a simple binary connection indicator, Â_ij can also represent continuously
weighted connections. Furthermore, the addition of the identity matrix I to
A allows nodes' representations at layer l + 1 to include information of their own
representation at the previous layer l.
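A dense, single-layer sketch of Equation 2 in PyTorch, assuming a graph small enough
that the full adjacency matrix fits into memory; variable names are illustrative:

```python
import torch

def gcn_layer(H, A, W):
    """One graph convolution (Equation 2) on a dense adjacency matrix.

    H: (n, d_in) node representations, A: (n, n) binary adjacency,
    W: (d_in, d_out) learnable weight matrix.
    """
    n = A.shape[0]
    A_hat = A + torch.eye(n)                # add self-loops: A + I
    deg = A_hat.sum(dim=1)                  # \hat{D}_ii, >= 1 due to self-loops
    D_inv_sqrt = torch.diag(deg.pow(-0.5))  # \hat{D}^{-1/2}
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy usage: 4 nodes on a path, 3 input features, 2 output features.
A = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
H1 = gcn_layer(torch.randn(4, 3), A, torch.randn(3, 2))
```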
The authors of the work introducing the GIN [9] examine the expressivity of GNNs
and derive the GIN architecture as a result of their theoretical examination. The
basis for their examination is the Weisfeiler-Lehman graph isomorphism test [33],
which is used to determine whether two graphs 𝐺 and 𝐺′ have the same structure.
1. Initially the same label is assigned to all 𝑛 nodes in a graph
2. In each following iteration, the node labels of each neighbor are passed to the
node, creating a multiset of node labels. This multiset, besides the node labels,
also contains the information about the count of each label.
3. Then each node is assigned a new label based on a hash of the multiset of their
neighbors’ labels.


4. Steps 2 and 3 are repeated. If the algorithm converges before reaching 𝑛 repeti-
tions, meaning no new label-hashes are created and both graphs have identical
label structure after sorting, the graphs can be declared as isomorphic.
The test is known to fail in some cases [34], but is overall able to distinguish graph
structures.
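A simplified sketch of the label-refinement rounds, using Python's built-in hash in
place of a perfect injective relabeling; the graph encoding and helper names are
illustrative:

```python
from collections import Counter

def wl_labels(adj, rounds):
    """WL refinement. adj: {node: [neighbors]}. Returns the multiset of final
    labels; two graphs with differing multisets are certainly non-isomorphic."""
    labels = {v: 0 for v in adj}  # step 1: the same label for all nodes
    for _ in range(rounds):
        labels = {  # steps 2 and 3: hash own label plus sorted neighbor labels
            v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
            for v in adj
        }
    return Counter(labels.values())

# A triangle and a 3-node path are distinguished after one round.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_labels(triangle, 1) == wl_labels(path, 1))  # False
```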

Figure 1: Common GNN aggregators fail to compute distinguishable node representations
for 𝑣1 and 𝑣2 given their neighborhood structure and neighbor features represented by
the color (a: Mean and Max fail, b: Max fails, c: Mean and Max fail).

Examining the message passing mechanism commonly shared between GNN architectures,
[9] argue that GNNs can be at most as powerful as the Weisfeiler-Lehman
Graph Isomorphism Test (WL test) in distinguishing graph structures. Especially the
MEAN and MAX aggregators fail to compute distinguishable node representations
based on the multiset of neighbors, which is illustrated in Figure 1.

$$H^{l+1} = \text{MLP}\left(\hat{A} H^{l}\right), \qquad
H^{l+1}_{i,\cdot} = \text{MLP}\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \hat{A}_{ij} H^{l}_{j,\cdot}\right) \qquad (3)$$

They consequently propose the GIN architecture (Equation 3), which utilizes the SUM
aggregator and, unlike conventional GNN architectures, employs a Multilayer
Perceptron (MLP) [35] instead of a single-layer transformation to process the
neighbors' messages. The MLP in combination with the SUM aggregator allows the model
to maintain injectivity, by the universal approximation theorem [36], similar to the
hash function in the WL test. The authors show empirically that the model is as
expressive as the WL test and more expressive than models such as the GCN by
measuring the models' overfitting performance.
Expressivity is beneficial in the case of model pretraining, as observed by the
authors of [37] following [38]. The present work employs the GIN in an attempt to
improve model performance.
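A minimal GIN layer following Equation 3, with the learnable ε of the original
formulation [9] omitted (i.e., fixed to zero); a sketch, not the exact configuration
used in the experiments:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One dense GIN layer: H^{l+1} = MLP((A + I) H^l)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(),
                                 nn.Linear(d_out, d_out))

    def forward(self, H, A):
        A_hat = A + torch.eye(A.shape[0])  # SUM over neighbors and the node itself
        return self.mlp(A_hat @ H)
```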

2.3.1 Graph Classification

GNNs can be utilized for a range of tasks on graph data. Widely used tasks include:
• link prediction to predict missing links between nodes such as to fill in missing
knowledge in relational data [39],
• node classification to detect bots in social networks [40],
and graph classification, which is employed in the present paper as binary
classification to determine whether program code includes a vulnerability. Because the
number of nodes varies between graphs, a fixed classification head such as an MLP
alone is not sufficient.

$$\hat{Y} = \text{MLP}(\text{READOUT}(H)) \qquad (4)$$

A “readout” layer is added before the MLP to aggregate the node representations 𝐻
computed by the GNN along the node dimension (Equation 4).

$$\hat{Y} = \text{MLP}(\text{top-rank}(H, Z, n)) \qquad (5)$$

$$\hat{Y} = \text{MLP}(\text{READOUT}(H \odot Z)) \qquad (6)$$

Readout operations include the conventional aggregators SUM, MEAN and MAX,
but can also be more complex, like attention-based pooling [41], which pools nodes
selectively using separately computed attention scores Z = GNN(A, X). In Equation 5,
the top n node representations are selected to be passed to the MLP. In Equation 6,
nodes are soft-selected based on the attention scores.
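A minimal sketch of Equation 4 with a SUM readout; the dimensions and the binary
sigmoid output are illustrative:

```python
import torch
import torch.nn as nn

# Graph-classification head: SUM readout followed by an MLP, assuming H holds
# the per-node representations of a single graph.
mlp = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

def classify_graph(H):                      # H: (num_nodes, 64), num_nodes varies
    graph_repr = H.sum(dim=0)               # READOUT collapses the node dimension
    return torch.sigmoid(mlp(graph_repr))   # probability of "vulnerable"

p = classify_graph(torch.randn(7, 64))
```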


2.3.2 Metrics

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad
\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$

$$\text{Balanced Accuracy} = \frac{1}{2} \cdot \left( \frac{\text{TP}}{\text{TP} + \text{FN}} + \frac{\text{TN}}{\text{TN} + \text{FP}} \right) \qquad (8)$$

The F1-Score (Equation 7) is a commonly employed metric for binary classification
with imbalanced classes. However, in cases where the number of correctly classified
negatives (TN) is as relevant as the number of correctly classified positives (TP),
balanced accuracy (Equation 8) can be a more appropriate performance measure.
Balanced accuracy reduces to the standard accuracy in the case of a balanced dataset.
A perfect classifier attains 100% balanced accuracy, a random classifier 50%.
For illustration, we examine the task of vulnerability detection on the DiverseVul
dataset utilized in this paper [5]. The dataset is imbalanced, since only approximately
6% of the samples belong to the positive class. A classifier classifying all samples as
positive reaches an F1-Score of 11%, outperforming 9 out of 11 initial models on the
task of detecting vulnerabilities in “unseen projects” ([5], Table 5). The outperformed
models include CodeT5 [42] and GPT-2 [43] variations. The equivalent balanced
accuracy amounts to exactly 50%. It can be concluded that the naive classifier's
F1-Score depends on the class ratio of the dataset, as opposed to the balanced
accuracy, which is independent of the class ratio. As illustrated, the F1-Score can
communicate a false sense of performance.
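The illustration can be reproduced in a few lines with scikit-learn, here with 100
hypothetical samples at the 6% positive ratio:

```python
from sklearn.metrics import f1_score, balanced_accuracy_score

# 6% positive class; a naive classifier predicts "vulnerable" for everything.
y_true = [1] * 6 + [0] * 94
y_pred = [1] * 100

print(f1_score(y_true, y_pred))                 # ~0.113, i.e. the 11% above
print(balanced_accuracy_score(y_true, y_pred))  # exactly 0.5
```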


2.4 Graph Structure Learning

In the task of GSL the premise is that existing graph structures in a dataset are not
optimal for learning-based optimization of downstream tasks using GNNs. Often,
real-world datasets are noisy, incomplete or simply do not model structures beneficial
for certain tasks.

$$\min_{\hat{A}} \; \mathbb{E}_{\hat{A} \sim \mathcal{G}} \, \mathcal{L}\left(Y, F_{\theta}(X, A, \hat{A})\right) \qquad (9)$$

Concretely, GSL aims at learning the optimal graph structure Â, either in conjunction
with or separately from the downstream task, whose negative performance is measured
by ℒ. Different approaches for finding the optimal Â exist [44]:
Metric-based GSL employs learned node embeddings and derives the adjacency
matrix Â by computing the pairwise similarity between learned node representations
H̃_{i,·} with the help of a metric function. Conceptually, such methods are closely
related to attention-based mechanisms [45] such as the Graph Attention Network [46]
or non-graph architectures such as the Transformer [29]. However, different from the
Graph Attention Network, they can consider all node-pair combinations in a graph,
and different from the Transformer, they enforce a graph-specific prior through
regularization terms and through the fact that the representations H̃_{i,·} are
learned on the original graph structure.
Direct GSL treats the adjacency matrix 𝐴 ̂ itself as a learnable parameter and opti-
mizes it directly.

$$\min_{M} \; - \sum_{c=1}^{C} \mathbb{1}_{[y=c]} \, \log P_{\theta}\left(Y = y \mid A = \mathbf{A} \odot \sigma(M), \; X = \mathbf{X}\right) \qquad (10)$$

In GNNExplainer [47], the end goal is not a downstream task but finding a suitable
graph structure itself, which explains a separate GNN's classification predictions
F_θ(A, X) on node, edge or graph level (Equation 10). Here, the mask M of size n × n
is directly optimized and mapped to [0, 1] by σ; ⊙ denotes the Hadamard product.


$$\mathcal{L}_{sp} = \alpha \|\hat{A}\|_0 \qquad (11)$$

$$\mathcal{L}_{h}(H, \hat{A}) = \beta \, \frac{1}{2} \sum_{i,j=1}^{N} \hat{A}_{ij} \, (h_i - h_j)^2 \qquad (12)$$

Regularization terms include sparsity constraints (Equation 11) to reduce the number
of edges in the learned adjacency matrix; the 𝐿0 norm is often replaced by the
𝐿1 norm such that optimization becomes tractable. ℒ_h (Equation 12) regularizes
the graph structure by modeling a homophily assumption, meaning that neighboring
nodes should have similar representations h [48]. α and β are manually defined
hyperparameters.
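A sketch of both regularizers in PyTorch, using the 𝐿1 relaxation of Equation 11 and
squared Euclidean distances between node representations in Equation 12; the function
and variable names are our own:

```python
import torch

def gsl_regularizers(A_hat, H, alpha, beta):
    """Sparsity (Equation 11, L1 relaxation) and homophily (Equation 12)
    penalties on a learned adjacency A_hat, given node representations H."""
    l_sp = alpha * A_hat.abs().sum()           # ||A_hat||_1
    dist = torch.cdist(H, H).pow(2)            # ||h_i - h_j||^2 for all pairs
    l_h = beta * 0.5 * (A_hat * dist).sum()    # weighted by edge strength
    return l_sp + l_h
```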

$$H^{l} = \lambda \cdot F_{\theta}^{l}(H^{l-1}, \mathbf{A}) + (1 - \lambda) \cdot F_{\theta}^{l}(H^{l-1}, \hat{A}) \qquad (13)$$

The initial graph structure A can be included as a prior in the computation of node
representations 𝐻 𝑙 for the downstream task with hyperparameter 𝜆 (Equation 13)
determining the influence of A [48].
As open challenges to GSL, [44] list the learning of heterogeneous, heterophilous and
task-agnostic structures.

2.5 Pretraining Methods

In DL, pretraining describes the procedure of first training a model on an auxiliary
task, followed by finetuning the model on the actual task. Pretraining often increases
the model's performance on the actual task [37], [49]. The combination of pretraining
and finetuning may also be referred to as transfer learning [50].
Whereas in supervised machine learning we are interested in learning P_θ(Y|X),
pretraining learns some P_θ(X), or P_θ(Y_k|X) for some different task k. The extent to
which pretraining is useful is therefore determined by how relevant the pretraining
tasks are for the final supervised task [38]. Concretely, in pretraining the model must
learn representations of the input data X which are useful for the later supervision
task.
[38] posit that pretraining can be seen both as a form of regularization and as an
improvement to the optimization procedure itself. Pretraining solely changes the
random initialization point of the model weights to a “prelearned” initialization
point at the start of the supervised stage.
From the regularization perspective, pretraining finds an initialization point in the
parameter space which reduces the model’s dependence on the first training samples
seen. The dependence on the first data points stems from the fact that with more
training steps, the magnitude of weights in the model increases, rendering the op-
timization surface increasingly non-convex and making it difficult for the learning
procedure to escape the path taken [38]. Essentially, pretraining already narrows
down the parameter location such that, during supervision, the dependence on the
first samples is reduced. The regularization perspective implies that pretraining finds
the “hills” in optimization space which, after descent, increase the model's
generalization ability but decrease its training performance.
From the optimization perspective, pretraining finds specific “basins of attraction”
[38] which increase the model's training performance and thereby its generalization
ability as well.
[38] find evidence supporting both views. On the one hand, pretraining decreases
performance for small models while increasing larger models' performance. As small
models are not expressive enough in the first place, pretraining, much like 𝐿1 and 𝐿2
regularization, diminishes their performance. On the other hand, they find that even
when the training distribution converges against the real distribution, by becoming
increasingly large, the pretrained model's performance is higher than without
pretraining. If pretraining could be described as a regularizer alone, it would hurt
performance, as the introduced bias would become counterproductive on increasingly
large datasets. On the contrary, this finding supports the optimization theory that
pretraining finds regions in the parameter space which yield better performance
overall.

2.5.1 Multitask Pretraining

Instead of pretraining a model on one task, one can employ multiple pretraining tasks,
with the idea that the model can find initial representations of the supervision data
which incorporate the information required for all pretraining tasks. These robust
representations could lead to better performance, as the model can “choose” the
learned features important for the following supervision task. The main concern
centers around how one can jointly learn these pretraining tasks.
In this paper we experiment with four ideas:


1. Given three tasks 𝐴, 𝐵 and 𝐶, the straightforward way is to learn the tasks
sequentially in a blocked fashion. The drawback of this approach is the loss of
information from the earlier pretraining tasks [51].

2. Another way would be to formulate a new loss function summing all participating
losses: ℒ = ℒ_A + ℒ_B + ℒ_C.

However, if the pretraining tasks have different loss formulations, such as the
Negative Log Likelihood (NLL) for classification and the Mean Squared Error
(MSE) for regression, the simple addition is not intuitively justifiable [52]:
the losses operate on different scales of magnitude and would therefore
contribute unevenly to the gradient descent procedure.

3. One possibility to alleviate this problem is to learn the tasks in an interleaved
fashion, repeatedly changing the task and resetting any momentum terms in the
optimizer after n minibatch samples [51].

4.
$$\mathcal{L}(\theta, \sigma_r, \sigma_c) = \frac{1}{2\sigma_r^2} \mathcal{L}_r(\theta) + \frac{1}{\sigma_c^2} \mathcal{L}_c(\theta) + \log \sigma_r + \log \sigma_c \qquad (14)$$

Along with the mentioned methods, we experiment with the approach of [52]
(Equation 14). The equation combines the losses of a regression and a classification
task. The loss terms are derived from probabilistic formulations of the MSE
as a Gaussian NLL and of the classification likelihood as the NLL of a
temperature-scaled softmax. ℒ_r represents the MSE of the regression task and ℒ_c
the negative logarithm of the softmax for the classification task. In the loss
formulation, σ_r and σ_c are learned alongside θ. They model the variance of the
respective pretraining tasks, which is determined by both the scale of the task
loss and the difficulty of the task. When the difficulty of task ℒ_r is high, σ_r will
increase, to decrease ℒ_r's influence on the total loss. At the same time, log σ_r
acts as a regularizer, penalizing large σ_r. We employ the modified version
of [53], where the regularization terms become log(1 + σ²) to ensure that the loss
formulation cannot exploit simple tasks, for which log σ could become negative.


The loss formulation (Equation 14) allows the addition of arbitrarily many re-
gression and classification tasks.
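A sketch of Equation 14 as a PyTorch module, parameterized by log σ² for numerical
stability and using the modified regularizer log(1 + σ²) of [53]; this is our reading
of the formulation, not code from the referenced papers:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combines one regression and one classification loss (Equation 14)."""
    def __init__(self):
        super().__init__()
        self.log_var_r = nn.Parameter(torch.zeros(()))  # log(sigma_r^2)
        self.log_var_c = nn.Parameter(torch.zeros(()))  # log(sigma_c^2)

    def forward(self, loss_r, loss_c):
        var_r, var_c = self.log_var_r.exp(), self.log_var_c.exp()
        weighted = loss_r / (2 * var_r) + loss_c / var_c
        reg = torch.log(1 + var_r) + torch.log(1 + var_c)  # modified regularizer
        return weighted + reg
```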

2.5.2 Pretraining on Graphs

GNNs can benefit from pretraining when training data is scarce [49]. The graph
structure information naturally permits the formulation of un- or semi-supervised
pretraining tasks. In general, transfer learning for GNNs can be modeled as follows,
similar to [37]:
1. First, we select a GNN architecture, a set of pretraining tasks 𝑇 = {𝐴, 𝐵, 𝐶, …}
as well as a pretraining schedule 𝑆, which specifies the ordering and frequency
of each task during pretraining as well as the loss strategy used (in the case of
multitask pretraining). Further, for each task a task head is specified, often a
shallow MLP with the correct dimensionality and, depending on the task, a
READOUT layer: 𝐻 = {MLP_A, MLP_B, MLP_C, …}. Also, for each task a loss
function must be selected: 𝐿 = {ℒ_A, ℒ_B, ℒ_C, …}.
2. During pretraining, for each task t we compute H* = GNN(A, X) and subsequently
Ŷ_t = MLP_t(H*) as well as 𝓁_t = ℒ_t(Ŷ_t), in accordance with S. Then the
GNN and the task heads can be optimized with a stochastic gradient descent
procedure [54]. Some synergies may arise in the case of multitask pretraining, as
the task-head computations of different tasks can depend on the same H*. The
pretraining is terminated when a number of predefined epochs or another stopping
criterion is reached.
3. Finally, the task heads are discarded and the GNN is finetuned for the super-
vision task with a new task head.
Because Neural Networks (NNs) can be viewed as computing increasingly abstract
representations of input X [38], this transfer learning approach encodes task-specific
information in the MLP layers while the GNN learns to output more general features
𝐻 ∗.
GNN-specific pretraining tasks include link prediction, feature masking [37] and graph
contrastive learning [49]. In link prediction, the task is to classify whether two
nodes are connected in the real graph. When the size of the adjacency matrix A of
size n × n is sufficiently small, for example approximately n < 3000 on an NVIDIA T4
GPU, graph convolution and the pretraining methods can be conducted in a full-graph
fashion.


$$\hat{Y} = \sigma(H H^{T}) \qquad (15)$$

$$H = \text{GNN}(\mathbf{X}, \mathbf{A} \odot (1 - M)), \qquad \ell_p = \mathcal{L}(Y \odot M, \hat{Y} \odot M) \qquad (16)$$

In a full-graph fashion, the link-prediction probabilities can be computed with
Equation 15, where σ squashes the dot products into the range (0, 1). As the GNN would
otherwise have no reference for computing the representations H, only a certain
fraction of the entries of A is selected for supervision with the binary mask M of
size n × n (Equation 16). Link prediction employs a contrastive loss with negative
examples [24] to ensure that the representations of disconnected nodes differ. Here,
it is ensured that one node of each negative pair is part of an existing edge among
the positive examples, to create a more difficult supervision objective.
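A condensed full-graph sketch of Equations 15 and 16; for brevity, the contrastive
scheme with hard negatives described above is replaced by plain binary cross-entropy
over the masked entries, and `gnn` stands for any encoder mapping (X, A) to node
representations:

```python
import torch

def link_prediction_loss(gnn, X, A, mask_ratio=0.1):
    """Full-graph link-prediction pretraining loss. A must be a float 0/1
    adjacency matrix; M marks the entries held out for supervision."""
    n = A.shape[0]
    M = (torch.rand(n, n) < mask_ratio).float()  # supervision mask
    H = gnn(X, A * (1 - M))                      # message passing without held-out edges
    logits = H @ H.T                             # Equation 15, before the sigmoid
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits[M.bool()], A[M.bool()])           # loss only on masked entries
```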
For feature masking, we similarly apply a mask (1 − M) of size n × d to the feature
matrix X, masking nodes with probability p and feature dimensions with probability q,
and try to predict the missing features. The features can be categorical or numerical,
requiring a classification or regression objective, respectively.
Graph contrastive learning [49] leverages augmentations of the graph, such as edge
masking or feature masking, and, similar to link prediction, tries to maximize the
agreement of the graph-level representations 𝑦⃗ and 𝑦⃗_p while minimizing the agreement
of 𝑦⃗ and 𝑦⃗_n, where negative examples are augmentations of a different graph. Graph
contrastive learning can only be effective when small augmentations do not drastically
alter the semantic meaning of the graph, and it is therefore unsuitable for certain
tasks: the addition or deletion of connections in molecule graphs might drastically
alter their chemical properties [37], and in code vulnerability detection single edits
might render a program vulnerable. For such tasks, the former pretraining methods can
find application.


3 Related Work
This chapter first introduces relevant work from the domain of DL-based automatic
vulnerability detection. Afterwards, two GSL architectures are discussed, which we
experiment with in comparison to the conventional architectures.

3.1 Deep Learning-based approaches to Automatic Vulnerability Detection

Data for training is collected through variations of the following procedure:
the CVE database [14] serves as the label source, since it collects vulnerabilities in
source code reported by developers and companies. For OSS, the vulnerability-fixing
commits linked to a CVE are available. The authors then declare all functions that are
part of the vulnerability-fixing commit as benign and their counterparts in the
previous commit as vulnerable. The label quality is diminished by factors such as that
not every fixing commit actually fixes the vulnerability and not every file or function
changed in such a commit is relevant to the vulnerability.
In general, DL-based approaches to automatic vulnerability detection can be divided
into token- and graph-based approaches [23]:
Graph-based approaches such as Devign [22] leverage the CPG of program code.
The Devign architecture consists of a Gated Graph Neural Network (GGNN) [55]
which considers a representation of the CPG with one directed adjacency matrix A𝑟
for each edge type 𝑟.

$$a_r^{t-1} = \mathbf{A}_r^{T} (W_r H^{t-1} + b), \qquad H^{t} = \text{GRU}\left(H^{t-1}, \sum_r a_r^{t-1}\right) \qquad (17)$$

Concretely, they compute the state a_r separately for each edge type and aggregate
the states before computing the nodes' hidden states H of the next timestep with a
Gated Recurrent Unit [45]. ŷ is computed with a custom architecture incorporating a
CNN with 1-d convolutions to select the nodes and features important for the task.


The authors of REVEAL [23] apply a GGNN in a similar fashion. The node features
X are composed of a categorical vector for the vertex type in the CPG, for example
“ArithmeticExpression” or “CallStatement”, and, as in Devign, a Word2Vec [56]
embedding of the respective node, learned on the same dataset. The authors separate
the task of learning code representations from the vulnerability classification task:
in a first stage, node representations are learned with the GGNN by classifying the
node labels. In the second stage, an MLP head classifies the graphs into vulnerable or
non-vulnerable based on the already learned node embeddings of each graph. They
leverage a cross-entropy loss to learn the true label and furthermore add a contrastive
loss with weighting α to encourage representations of the same class to be similar
and of different classes to be dissimilar. Most importantly, the authors make some
key observations regarding the challenges for DL-based vulnerability detection.
key observations regarding the challenges for DL-based vulnerability detection.
They criticize previous approaches, because of the lack of out-of-training-distribution
evaluation and curate a new dataset based on the two open source projects Linux
Debian Kernel and Chromium. Evaluating models on the Reaveal dataset which were
trained on the previous datasets, they find significant performance degradation by
on average 73% for the F1-Score, compared to the models’ performance on their own
datasets. For example, a drop from 73.26% to 16.68% F1 for the Devign model.
Manually inspecting predictions of models, they state that many predictions are
grounded in irrelevant features. Additionally, the note that GNNs learn more rele-
vant features than LLMs, because they are able to use the CPG. To address the
data imbalance, they propose to oversample the minority vulnerability class. Further
they observe that the dataset curation method induces duplicate sample issues which
degrades the dataset quality.
In contrast to the present work, the authors do not establish a naive baseline and
correspondingly conduct no examinations to put their approach into perspective.
Further, the limited dataset size prevents the authors from evaluating real-world
performance quantitatively in greater detail.
Token-based approaches leverage the text representation of source code to detect
vulnerabilities. [57] employ Recurrent Neural Networks [31], CNNs and Decision Tree
approaches. To be applicable to larger program code, [58] use a bidirectional Long
Short-Term Memory network [59] only on a subset of the code, which they compose into
“code gadgets”. Focusing on code vulnerabilities related to function calls, they
extract the relevant information using data flow and control flow analysis tools.
[5] conduct large scale vulnerability detection experiments testing LLMs of the model
families RoBERTa [60], GPT-2 [43] and T5 [42]. They create the DiverseVul vul-
nerability detection dataset which is 60% larger than the previously largest C++
and C dataset. To enlarge the training set they add multiple previous datasets and
conclude that pretraining strategies can improve the LLMs’ performance when the
pretraining tasks are code specific and not only natural language ones. Comparing
with REVEAL, they suggest that LLMs might be superior for vulnerability detec-
tion, especially when trained on large corpora. They further investigate the models’
performance, when tested on samples from an isolated set of projects, and find similar
performance drops as the REVEAL authors. Finally, they find that class weighting,
as an alternative to minority-class oversampling can help the models’ performance.
We find several points for discussion in their approach, and consequently in the
conclusions drawn, which we elaborate on in Chapter 4.

3.2 Respecting the Graph Structure

Previous CPG-based models for vulnerability detection rely only on the CPG of the
given program. In the current work, we investigate whether loosening this constraint
provides benefits. The hypothesis is that the provided CPG structure is not
optimal for the task of vulnerability detection. Therefore, inferring different
connections between code nodes could create different neighborhood structures and
thereby better inform the node representations learned by the GNNs for the task of
vulnerability detection.
The metric-based GSL approach Graph Structure Learning Model for
Open-World Generalization (GraphGLOW) [61] applies the “Iterative Deep Graph
Learning” framework [48] in an inductive setting of node classification in social
networks.
In general, the architecture consists of two components, a GCN [8] with an MLP head
f_w and a structure learner g_θ, which learns to find an optimal adjacency matrix
A* = g_θ*(A, X). The training of GraphGLOW can be summarized as a nested optimization
problem:


$$\theta^{*} = \arg\min_{\theta} \; \min_{w_1, \ldots, w_M} \; \sum_{m=1}^{M} \mathcal{L}\left(f_{w_m}\left(g_{\theta}(\mathbf{A}_m, \mathbf{X}_m), \mathbf{X}_m\right), \mathbf{Y}_m\right) \qquad (18)$$

The approach is applied in an inductive setting, meaning training and testing are
conducted on different graphs. During training, we aim to find the θ* which minimizes
the loss ℒ(f_{w*_m}(A*_m, X_m), Y_m) on each training graph m. In detail, GraphGLOW
trains the structure learner g_θ on a number of social graphs with ℒ being a node
classification loss. Across the M training graphs, one g_θ is learned while f_{w_m}
is relearned for every m. During testing or inference on a new social graph,
A* = g_θ*(A, X) is computed with the found θ* and a final f_w is learned based on A*.

$$\alpha_{uv} = \delta\left(\frac{1}{K} \sum_{k=1}^{K} \text{SIM}\left(w_k^{1} \odot h_u, \; w_k^{2} \odot h_v\right)\right) \qquad (19)$$

Equation 19 depicts the node-centric view of g_θ, determining the connection strength,
i.e., the entry of the adjacency matrix A* between nodes u and v, using their learned
representations h. It learns K heads with parameters w_k; SIM denotes the cosine
similarity and δ converts the input into values within [0, 1]. During one forward
pass, g_θ and the GCN are applied in an iterative fashion, A^t = g_θ(A^{t-1}, H^t) and
H^t = GCN(A^{t-1}, H^{t-1}), until A^t converges as measured by the Frobenius norm of
A^t − A^{t-1}, followed by computing Ŷ = MLP(H^t). The method further leverages
graph-specific regularization such as Equation 12, the prior of Equation 13, as well
as a regularization term which penalizes certainty in g_θ.
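A sketch of the multi-head metric function of Equation 19, with cosine similarity
implemented via normalized dot products and δ taken to be a sigmoid (the concrete
choice of δ is our assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureLearner(nn.Module):
    """Multi-head metric learner: each head k owns two element-wise weight
    vectors w_k1, w_k2 applied before the cosine similarity."""
    def __init__(self, d, K):
        super().__init__()
        self.w1 = nn.Parameter(torch.ones(K, d))
        self.w2 = nn.Parameter(torch.ones(K, d))

    def forward(self, H):  # H: (n, d) node representations
        Hu = F.normalize(self.w1[:, None, :] * H[None], dim=-1)  # (K, n, d)
        Hv = F.normalize(self.w2[:, None, :] * H[None], dim=-1)
        sim = torch.einsum('knd,kmd->knm', Hu, Hv).mean(dim=0)   # average over heads
        return torch.sigmoid(sim)  # delta maps entries into [0, 1]
```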
Since GraphGLOW is applied to node classification in social networks, the
regularization terms reflect an assumption of homophily, as people often converse with
like-minded individuals. Hierarchical Graph Pooling with Structure Learning (HGP-SL)
[62] is applied on graph level and models the prior belief that redundant information,
in the form of similar node representations in the graph, can be pooled away.
Therefore, it maximizes heterophily in the graph. In each layer, HGP-SL pools the node
representations H^l of the layer by some ratio ρ, keeping only the top n ⋅ ρ nodes
which are most different from their neighbors.


$$\vec{p}^{\,l} = \left\| \left( I - (D^{l})^{-1} A^{l} \right) H^{l} \right\|_{1} \qquad (20)$$

After pooling, some of the kept nodes might not have any edges to other nodes;
therefore, a new graph structure is learned using a metric-based approach. The
node-information score 𝑝⃗^l determines how different nodes are from their neighbors
and is computed with Equation 20, where ‖…‖_1 denotes the row-wise 𝑙1 norm. 𝑝⃗^l is
simply the 𝑙1 norm of the difference between h_i and the average of the neighbors'
representations. HGP-SL is applied to molecule datasets and achieves state-of-the-art
results on the PROTEINS [63] dataset.
The RGCN [7] extends the GCN [8] to heterogeneous graphs, which contain
one adjacency matrix for each edge type r: A_{r1}, …, A_{rn}. The idea is to learn
one weight matrix W_r for each edge type and to aggregate the neighbor
representations specified by each A_r, which are computed in a similar fashion to the
GCN. In heterogeneous graphs, some edge types might not have enough edges to learn the
corresponding W_r well.

$$W_r = \sum_{b=1}^{B} a_{rb} V_b \qquad (21)$$

Therefore, the authors propose to decompose all W_r into a series of shared basis
matrices V_b (Equation 21), learning both the V_b and the coefficients a_rb, which
depend on r. Through the decomposition, information can be shared between edge types
to learn better W_r.
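A sketch of the basis decomposition of Equation 21; the shapes and initialization are
illustrative:

```python
import torch
import torch.nn as nn

class BasisDecomposedWeights(nn.Module):
    """R per-edge-type weight matrices W_r built from B shared bases V_b."""
    def __init__(self, R, B, d_in, d_out):
        super().__init__()
        self.V = nn.Parameter(torch.randn(B, d_in, d_out))  # shared bases V_b
        self.a = nn.Parameter(torch.randn(R, B))            # coefficients a_rb

    def forward(self):
        # W_r = sum_b a_rb * V_b, computed for all edge types at once: (R, d_in, d_out)
        return torch.einsum('rb,bio->rio', self.a, self.V)
```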


4 Experiments
In the following, the experiments are presented which are conducted with the focus
on the four dimensions data, architecture, training and evaluation. First, we revisit
the motivation from Chapter 1, then examine the DiverseVul dataset [5] and proceed
with each experiment, first stating the motivation, followed by the empirical results
and interpretation thereof.

4.1 Experiment 1: The Importance of Project-Based Train-Test Separation

The promise of DL-based automatic vulnerability detection is to increase efficiency
and effectiveness compared to conventional static and dynamic approaches. To support
progress in the field, we aim to evaluate the models accurately. As described
in Chapter 1, we observe that previous work [5], [22], [6] evaluates DL-based
approaches to automatic vulnerability detection mostly in the setting where both
the training and test data stem from the same distribution, meaning from the same
projects (the “same” setting). Upon evaluation on out-of-distribution samples, such
as other datasets or unseen projects, [5] and [6] find unexpectedly large performance
deterioration. Nevertheless, they proceed to draw conclusions based on the empirical
results from the “same” setting.

Model    Training Set          F1       Precision   Recall   (Balanced Accuracy)
NatGen   CVEFixes              0.1183   0.3617      0.0707   (0.5300)
NatGen   Previous              0.4694   0.5181      0.4292   (0.6974)
NatGen   Prev. & DiverseVul    0.4715   0.5181      0.4325   (0.6989)

Table 1: Test results on Previous & DiverseVul from the DiverseVul paper [5].
CVEFixes [64] is a dataset and “Previous” is a combination of datasets.

For example, [5] claim that increasing the dataset size increases model performance.
Table 1 shows their results for the best LLM. It can be observed that in the first row
where the model is not evaluated in the “same” setting, since training and test sets
are distinct, the performance is near random: We calculated a balanced accuracy of
53% with the false-positive rate provided by the authors. It can not be ruled out that

23
Graph Structure Learning Experiments

the increase of performance by 14% in balanced accuracy in the second row is not
simply due to a larger training set, but due to the fact that training and testing sets
are not disjunct. Likewise, in the third row, if a larger training set were to increase
performance, addition of roughly 50% more data only corresponds to an increase
in performance by 0.015% balanced accuracy. Notably, about half of the samples
of “Previous” and DiverseVul overlap, which would be in line with our hypothesis
that the increase in performance is due to insufficient train-test separation (Since
the model has seen many similar samples in the second and third row, explaining
the almost-equal performance).
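For reference, the stated 53% follows from the reported recall and the true-negative
rate implied by the authors' false-positive rate:

$$\text{Balanced Accuracy} = \tfrac{1}{2}\left(\text{Recall} + \text{TNR}\right) = \tfrac{1}{2}\left(0.0707 + \text{TNR}\right) = 0.53 \;\Rightarrow\; \text{TNR} \approx 0.989$$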
A small experiment supports our hypothesis: comparing the results of the GIN [9]
network in a project-level split on DiverseVul [5], the balanced accuracy decreases by
15 percentage points, from 80% to 65% (Table 4).
Determining the factors responsible for the diminished performance is left for further
investigation. However, they correspond to some form of train-test leakage, as the
model learns non-generalizable factors which are only relevant within one project.
Consequently, in the following we only examine results on data which is split on
project level, unless stated otherwise.

4.2 The DiverseVul Dataset and Splitting Approach

We utilize DiverseVul dataset [5] as the data source to train the DL-based models
because of its size which allows for meaningful project level splits. The authors also
manually check the vulnerable label quality: Only 25% of samples labeled as vulner-
able which were manually checked are found to be vulnerable in the next largest
dataset BigVul [65] with a size of 260,000 functions. DiverseVul has the highest rate of
all compared datasets with 60%. The dataset contains 330,000 C and C++ functions
of which 19,000 are classified as vulnerable. 85% of all functions are mapped to one
or multiple of 150 CWE classes, the types of vulnerabilities. The data corresponds
to 7500 commits from 800 different projects. To obtain the label for each function,
the authors leverage websites which list vulnerability fixing commits similar to the
approach mentioned in Chapter 3.1. Functions which are part of files changed in
vulnerability fixing commits are labeled as benign, the same functions from the pre-
vious commit are labeled as vulnerable. To increase the dataset size, the authors
furthermore label as benign the functions in all C and C++ files which are part of
neither of the two commits.

Figure 2: Left: Function samples per project on a logarithmic scale; the largest project
is Linux. Right: Median size of the extracted graphs per project; purple samples
mostly represent functions for which the CPG parsing failed.

Figure 2 on the left side depicts the number of functions extracted per project on
a logarithmic scale. The project with the most samples is Linux with about 70,000.
More than half of the projects have fewer than 100 samples. The CPGs for each func-
tion are extracted with the help of an extraction tool by Fraunhofer-AISEC [20], [21].
The tool contains a “forgiving” parser, which makes it possible to build the CPG
even for incomplete or semantically incorrect source code. Nevertheless, the tool fails
in some instances: We filter out all code graphs which contain fewer than 10 nodes.
Figure 2 shows how the purple data points fall “out of distribution”. The tool returns
a category for each node, as described in Chapter 3.1. Further, each node's 100-
dimensional Word2Vec [56] embedding is added, similar to [22]; the embedding model
was trained on data overlapping with the “Previous” datasets in [5].
We split the dataset into 6 folds on project level and try to ensure similar project sizes,
benign-to-vulnerable ratios as well as CWE ratios. In detail, we randomly apply
scikit-learn's GroupKFold [66] 1000 times, selecting the variation with the smallest
distance between the largest and smallest fold's size. Because of its size, the Linux
project constitutes one whole fold. One of the other folds is randomly selected as
test fold. In addition to the functions on which the CPG parser failed, we remove
about 3000 graphs with 𝑛 > 1000 nodes, which allows for faster experimentation with
the GNN architecture. In total, the cross-validation set now contains 205,000 graphs
with 11,300 vulnerable samples, and the separate test set 43,500 samples with 2900
vulnerable samples. This amounts to a vulnerable sample ratio of 5.6% and 6.4%
respectively.
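As an illustration, the following sketch selects the best of 1000 GroupKFold variations.
The function and variable names are ours, and the way randomness is injected into the
otherwise deterministic GroupKFold (random relabeling of the project ids) is an
assumption about the procedure.

    import numpy as np
    from sklearn.model_selection import GroupKFold

    # Hedged sketch of the fold selection described above. `groups` maps
    # each function to its project; randomly permuting the project labels
    # lets GroupKFold's greedy fold assignment break ties differently in
    # each trial (an assumption about how the randomness was injected).
    def select_balanced_folds(groups, n_splits=6, n_trials=1000, seed=0):
        rng = np.random.default_rng(seed)
        projects = np.unique(groups)
        best_spread, best_folds = np.inf, None
        for _ in range(n_trials):
            relabel = dict(zip(projects, rng.permutation(len(projects))))
            shuffled = np.array([relabel[g] for g in groups])
            folds = list(GroupKFold(n_splits=n_splits)
                         .split(np.zeros(len(groups)), groups=shuffled))
            sizes = [len(test_idx) for _, test_idx in folds]
            spread = max(sizes) - min(sizes)  # largest minus smallest fold
            if spread < best_spread:
                best_spread, best_folds = spread, folds
        return best_folds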

Figure 3: Left: Smaller projects contain relatively more vulnerable samples. Right:
The number of nodes in the CPG is indicative of the vulnerability label.

The dataset is inspected further. Figure 3 on the left shows that smaller projects
generally have larger vulnerability ratios. This could simply be related to the fact
that larger projects contain a larger codebase, thus the number of additionally added
benign samples could be higher.
More importantly, Figure 3 on the right depicts that benign and vulnerable samples
already have disparate distributions of the number of nodes in the graph. The number
of nodes in a graph should be roughly proportional to the code length. We believe it is
also related to the fact that, in addition to the benign and vulnerable pairs, all other
unmodified functions are added to the dataset and declared as benign. Generally,
larger functions are both simply more likely and, because of increased complexity,
more prone to contain vulnerabilities, which leaves the simpler functions unmodified.
These functions, such as simple “setters” and “getters”, are then added as benign
samples. Notably, the original authors [5] do not examine the different models' per-
formance in light of this observation.
Further, it is examined whether graph sizes differ between small, medium and large
projects; however, the distributions appear visually similar.
The findings related to the dataset prompt multiple courses of action:


• The even visually distinct node-count distributions of benign and vulnerable sam-
ples suggest the use of the “num_nodes” baseline: The naive classifier classifies
samples as vulnerable if the number of nodes exceeds a threshold 𝑡. 𝑡 is chosen to
maximize the balanced accuracy on the validation folds. Plotting the balanced ac-
curacy against the threshold yields a smooth curve with its peak at 𝑡 = 102, which
is consistent with Figure 3. The baseline serves as point of comparison in the
subsequent experiments; a minimal sketch follows after this list. The performance
of the baseline is reported in the next experiment.
• The imbalanced nature of the dataset prompts the use of balanced accuracy as
primary metric. As elaborated in Section 2.3.2, balanced accuracy weighs positive
and negative predictions equally. The F1-score is biased towards weighing correct
positive predictions higher. In the case of the current dataset, a naive one-class
predictor achieves an F1 of 11%, which conveys a false sense of performance, whereas
the balanced accuracy of 50% communicates the random performance. To preserve
comparability to previous works, the F1 score is reported as well.
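As referenced above, a minimal sketch of the num_nodes baseline; the function names
are illustrative, not part of the released code.

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score

    def fit_threshold(num_nodes, y_true):
        # Evaluate every distinct node count as threshold t and keep the
        # one maximizing balanced accuracy on the validation folds.
        candidates = np.unique(num_nodes)
        scores = [balanced_accuracy_score(y_true, (num_nodes > t).astype(int))
                  for t in candidates]
        return candidates[int(np.argmax(scores))]

    def predict(num_nodes, t):
        # A sample is classified as vulnerable iff its graph exceeds t nodes.
        return (num_nodes > t).astype(int)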

4.3 Experiment 2: num_nodes Baseline

Here the performance of the num_nodes baseline is reported. The threshold 𝑡 is
determined on the cross-validation folds of the training set. The final performance
on the test set, consisting of more than 100 projects, is only determined at the end
together with all other models, so as not to influence the other experiments. We
nevertheless report it here early to establish it as a reference.

Model      Balanced Accuracy  F1      Precision  Recall

num_nodes  0.6439             0.1950  0.1154     0.6284

Table 2: Test performance of num_nodes.

Surprisingly, the baseline outperforms the best LLM, CodeT5 Small [42], in the set-
ting of evaluation on unseen projects. The DiverseVul authors report an F1-score
of 17.21% [5] and we calculate a balanced accuracy of 57.12%, evaluated on 95 ran-
domly chosen projects. Notably, in their setting the train and test sets correspond to
DiverseVul as well as previous datasets, such that performance can only be compared
with lower confidence. However, as we infer from the context, the CodeT5 Small
model is pretrained on C and C++ code as well as code-specific tasks and finally
finetuned on vulnerability detection. The experimental outcome raises the question
how well the models generalize at all in this setting.
Consequently, it can be confirmed that LLMs are not necessarily the ideal architec-
ture. For our experiments, we focus on GNNs with a small parameter count in the
single-digit millions, which allows for fast experimentation compared to LLM-based
methods with over 100 million parameters [5]. GNN-based methods have shown
better generalization and robustness to variability in code style and formatting
compared to LLM-based approaches [67], [22], since they can access the CPG. Thus,
by employing these architectures we can diminish the susceptibility to spurious
features as an unwanted factor of variation in the dataset and increase generalization.

4.4 Experiment 3: Graph Representation

Under the new setting of training data split on project level into 5 folds, we aim
to evaluate whether the information of edge direction provided by the directed edges
of the CPG is useful for vulnerability classification. In general, CPG edge directions
carry semantic meaning: control-flow edges indicate the order of operations, and
data-flow edges indicate which node is a call node and which one a receiving node.
The Devign model [22] considers the edges in the forward direction with the adjacency
matrix A; however, it is not mentioned whether the backward direction Aᵀ is
considered as well. From a node-level perspective, the forward direction would inform
a code node's representation about which other computational nodes it depends on;
in a backward perspective, it would inform the node's representation about which
operations it gives rise to. [67] consider the adjacency matrix in an undirected
fashion, meaning Â = min(A + Aᵀ, 1), which helps the model consider both views;
however, the information of direction is lost in this setting.
In a later experiment the effect of considering both directions separately and training
on the CPG as a heterogeneous graph with multiple adjacency matrices, one for each
edge type, is examined. In the current setting, only a single matrix summarizes all
matrices of the heterogeneous CPG: Â = min(A_r1 + A_r2 + … + A_rn, 1).
In addition, we test the effect of including node degrees and node triangle counts in X.
The counts are added per adjacency direction and per edge type separately. These
features could distinguish nodes with large involvement in the computational graph
or might contain relevant patterns for the vulnerability detection. A sketch of the
summarized adjacency matrix and these structural features is given below.
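A hedged sketch of this graph representation, assuming a list of dense per-edge-type
adjacency matrices A_r; the function names are ours.

    import numpy as np

    def summarize(adjs):
        # Â = min(A_r1 + A_r2 + ... + A_rn, 1), then made undirected.
        A = np.minimum(np.sum(adjs, axis=0), 1)
        return np.minimum(A + A.T, 1)

    def structural_features(adjs):
        # Degree and triangle counts per edge type and direction.
        feats = []
        for A in adjs:
            feats.append(A.sum(axis=1))            # out-degree
            feats.append(A.sum(axis=0))            # in-degree
            U = np.minimum(A + A.T, 1)
            # diag(U^3) / 2 counts undirected triangles through each node.
            feats.append(np.diag(U @ U @ U) / 2)
        return np.stack(feats, axis=1)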

Model & Dataset    Balanced Acc.  F1        Precision  Recall

GCN & dir. deg.    0.6407         0.1673    0.0976     0.6350 ⯅
GCN & dir.         0.6266         0.1599    0.0982     0.5612
GCN & undir. deg.  0.6434 ◆       0.1723    0.1009     0.6127 ⯅
GCN & undir.       0.6365         0.1862    0.1148     0.5051
GIN & dir. deg.    0.6173 ●       0.1643    0.0980     0.5324 ⯅
GIN & dir.         0.6278         0.1733 ●  0.1045     0.5245
GIN & undir. deg.  0.6427 ● ◆     0.1737    0.1021     0.5998 ⯅
GIN & undir.       0.6393         0.1817 ●  0.1099     0.5378

Table 3: Average performance of models trained on 20% of the training data on the
five validation folds per dataset variation. Metrics are calculated as average over
different model configurations (hidden dimension: [128, 256], dropout: [0, 0.3, 0.5])
(deg.: with degree features, dir.: directed graphs).

The GNN models GCN [8] and GIN [9] are trained on 20% of the data per training
split and evaluated on the full validation split in a 5-fold cross-validation. They
use an MLP task head with SUM-aggregation and are trained until the validation loss
does not decrease for 10 epochs. For this and the following experiments, different
model configurations are trained, and the metrics in Table 3 represent the average
across these configurations as well as all cross-validation variations. The indicators
“dir”, “undir” and “deg” denote whether a directed adjacency matrix is used and
whether degree and triangle counts are added to X.
We make several observations (Table 3):
⯅ In all four cases the addition of degree and triangle features improves recall.
● In the case of the GIN model, the addition of the features increases balanced
accuracy but decreases F1; considering the recall increase, the model classifies
more vulnerable samples correctly in exchange for fewer correctly classified be-
nign samples.
◆ For both architectures, the use of the additional features with an undirected
adjacency matrix achieves the highest balanced accuracy.
Many GNN architectures such as the GCN and the GIN do not consider the inclu-
sion of both A and Aᵀ. One theory explaining the increase in balanced accuracy in
Table 3 is that the undirected graphs allow the GNNs to pass neighbor messages
more efficiently than if an arbitrary direction of A were chosen. As we find later, a
more likely reason is that the setup allows the models to learn naive features which
describe the complexity of the graph and correlate with the number of nodes, similar
to what the num_nodes baseline computes.

4.5 Experiment 4: Architecture

Several GNN architectures are implemented and evaluated in a cross-validation fash-
ion before finally being tested on the test set. The project-level split folds of
DiverseVul [5] are used. Before the final cross-validation, a hyperparameter search
attempts to find, for each model, the hyperparameters achieving the highest balanced
accuracy. For this purpose, we cross-validate on all validation folds, training only on
20% of each fold to enable a broader hyperparameter search. The searched hyperpara-
meters are provided in Appendix 1. The final cross-validation mirrors the variance of
the performance between different projects, whereas the test set depicts the vulner-
ability detection performance on unseen data.
Some details of training include the class weighting according to [5] and the calibra-
tion of the bias of the task head such that it initially reflects the class imbalance of
5.4% vulnerable samples in the predicted probability, to speed up the training. The
Adam optimizer [68] with learning rate 0.001 and without weight decay is employed.
For measuring the training performance, all models are trained on the same random
split of the training data. Training is halted after 15 epochs of stagnating loss.
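As a minimal sketch, the bias calibration amounts to initializing the task head's
output bias to the log-odds of the positive-class prior; the hidden size below is
illustrative.

    import math
    import torch.nn as nn

    p = 0.054                    # vulnerable-sample ratio
    head = nn.Linear(256, 1)     # hidden size is illustrative
    # sigmoid(log(p / (1 - p))) = p, so the untrained head already
    # predicts the class imbalance.
    nn.init.constant_(head.bias, math.log(p / (1 - p)))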
The performance of the baseline num_nodes, the GCN, the GIN, RGCN variants,
REVEAL [6], and the GSL approaches GraphGLOW and HGP-SL is compared.
• The GCN and GIN represent baselines because of their simple architecture.
• For the RGCN, we employ the basis decomposition (Equation 21) in the following
way: In the heterogeneous CPG, each directed edge type 𝑟 has a “forward” and a
“backward” representation A_rf and A_rb = A_rfᵀ. All r_1f, …, r_nf, r_1b, …, r_nb
share a set of basis matrices 𝒱_all, while each pair r_xf and r_xb shares a set of
basis matrices 𝒱_x (a sketch of this basis sharing follows the list). Thus, the model
utilizes both the edge type information as well as edge direction information which
was not accounted for in Experiment 3, Section 4.4. Two options with different
aggregation methods are explored, RGCN-SUM and RGCN-MEAN.
• REVEAL [6] is trained without the contrastive loss component from the original
paper in an end-to-end fashion.
• The application of GraphGLOW and HGP-SL investigates whether these meth-
ods can learn more optimal graph structures, as elaborated in Chapter 3.2.
To restate, GraphGLOW could model homogeneous relationships while HGP-SL
would summarize the graph, keeping only the most distinct node representations.
For GraphGLOW, the correct transition from node classification in the original
paper to graph classification in the vulnerability detection task is not immediately
clear. We experiment with keeping the structure learner 𝑔𝜃 constant while resetting
the task head 𝑓𝑤 every n minibatches, without added benefit. Consequently, we
resort to not resetting 𝑓𝑤.
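As referenced in the RGCN bullet, a hedged sketch of the basis decomposition;
parameter names and initialization are ours.

    import torch
    import torch.nn as nn

    class BasisDecomposition(nn.Module):
        """W_r = sum_b a_{r,b} * V_b: relation weights as combinations of
        shared basis matrices, so forward/backward edge types can share 𝒱."""
        def __init__(self, n_relations, n_bases, dim_in, dim_out):
            super().__init__()
            self.bases = nn.Parameter(torch.randn(n_bases, dim_in, dim_out) * 0.01)
            self.coeffs = nn.Parameter(torch.randn(n_relations, n_bases))

        def weight(self, r):
            # Combine the shared bases into relation r's weight matrix.
            return torch.einsum('b,bio->io', self.coeffs[r], self.bases)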

Model           Balanced Acc.  F1            Precision     Recall

GCN             0.6475±0.017   0.1680±0.021  0.0963±0.013  0.6647±0.064
GIN             0.6467±0.015   0.1719±0.025  0.0996±0.015  0.6315±0.077
RGCN-mean       0.6478±0.019   0.1722±0.019  0.1001±0.013  0.6303±0.078
RGCN-sum        0.6411±0.020   0.1901±0.028  0.1178±0.021  0.5077±0.064
REVEAL          0.6365±0.023   0.1901±0.025  0.1186±0.017  0.4850±0.069
HGP-SL          0.6456±0.020   0.1722±0.019  0.0998±0.011  0.6322±0.092
GraphGLOW       0.6477±0.016   0.1681±0.018  0.0963±0.011  0.6649±0.079
num_nodes       0.6494±0.018   0.1699±0.027  0.0980±0.018  0.6547±0.056
GIN mixed data  0.7964±0.024   0.2869±0.040  0.1738±0.030  0.8409±0.034

Table 4: Validation performance of models trained on 100% of the respective splits'
training data. The mixed-data model is trained and evaluated on non-projectwise
split data.

Table 4 depicts the models' validation performance. No model is able to outperform
the num_nodes baseline. Inspecting the individual validation splits, the performance
among the models is similar for each validation split; the standard deviation shown
is therefore rather an effect of the specific split division. RGCN-sum and REVEAL
seem to have similar performance profiles. Both trade balanced accuracy for a higher
F1 score.

Model      Balanced Acc.  F1      Precision  Recall

GCN        0.6467         0.2011  0.1205     0.6061
GIN        0.6456         0.1992  0.1190     0.6106
RGCN-mean  0.6429         0.1910  0.1119     0.6507
RGCN-sum   0.6345         0.2104  0.1334     0.4976
REVEAL     0.6269         0.2177  0.1452     0.4350
HGP-SL     0.6410         0.1943  0.1153     0.6165
GraphGLOW  0.6441         0.1958  0.1161     0.6239
num_nodes  0.6439         0.1950  0.1154     0.6284

Table 5: Test performance of models.

Table 5 shows the test performance of the models. Again, no model is able to outper-
form the num_nodes baseline. Neither the information about edge type nor edge
direction allows the RGCN variants to outperform the baseline on the DiverseVul
dataset. In parallel, the structure learning of the GSL variants provides no benefit.
These results, and the next experiment regarding model pretraining, motivate the
more thorough investigation of why all models in these experiments plateau at
around 65% balanced accuracy.

4.6 Experiment 5: Pretraining

Pretraining can improve a DL model's downstream performance, as elaborated in
Chapter 2.5 [38]. In the setting of graph classification, specifically molecular property
prediction, it has been shown to increase the model's performance [37]. In an abstract
way, this task is similar to vulnerability detection, since both are graph classifica-
tion tasks and even small changes in the adjacency matrix can lead to drastically
different properties. Crucial to an effective pretraining setup in [37] is the multistage
pretraining. The authors combine a first stage of node-level pretraining (attribute
masking) with a second stage of graph-level pretraining (predicting domain-specific
molecule attributes). In their setting, solely applying the graph-level pretraining has
a negative impact on the downstream task. The reasoning is similar to REVEAL [6],
where node-level code features are learned independently from the downstream task,
and only then the model is optimized for the graph classification task. [37] believe
the node-level pretraining assists the GNN in learning relevant node representations,
which increases the generalization ability. When the model is only pretrained on
graph level, it could overfit on the node level, learning features which only maximize
the graph-level pretraining performance but which are not relevant node-level fea-
tures, which in turn would decrease the generalization ability.

Model & Dataset  Balanced Acc.  F1      Precision  Recall

GCN & deg.       0.8860         0.4022  0.2572     0.9327
GIN & deg.       0.8985         0.4266  0.2759     0.9428

Table 6: Performance of overfitting the training set for three-layer GCN and GIN
models.

[37] find more expressive models to benefit the most from pretraining, which is in line
with the observations of [38]. In particular, they suggest the GIN architecture. We
test and confirm the model's expressivity compared to the GCN (Table 6). Because
it has shown performance equal to the other models in Experiment 4, and because of
the computational efficiency of the architecture, the GIN is employed to test different
pretraining strategies:
First, multiple pretraining tasks are devised. On the node level, one can formulate
link prediction and feature masking as straightforward tasks. For the feature mask-
ing in X, the triangle counts, node degrees and Word2Vec [56] embeddings are learned
by minimizing the MSE, the node category by minimizing the NLL.
On the graph level, more than 85% of the samples in DiverseVul [5] are mapped to
one or more CWEs. For benign samples the mapping is determined by which CWEs
the corresponding vulnerable samples belong to. This information is utilized for multi-
label classification: Predicting whether a sample is assigned to a given CWE is treated
as a separate binary classification task for each CWE. Also, each sample is weighted
equally, regardless of how many CWEs it is categorized by. This pretraining task
could help the model establish a focus on code patterns which are prevalent in each
CWE, presumably beneficial for the final vulnerability detection task.
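A hedged sketch of this multi-label setup; the hidden size and names are illustrative.

    import torch.nn as nn
    import torch.nn.functional as F

    N_CWE = 150
    cwe_head = nn.Linear(256, N_CWE)   # hidden size is illustrative

    def cwe_pretraining_loss(graph_embedding, cwe_multi_hot):
        # One binary logit per CWE; BCE treats each class as a separate
        # binary task, and mean reduction weights every sample equally.
        logits = cwe_head(graph_embedding)
        return F.binary_cross_entropy_with_logits(logits, cwe_multi_hot)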
Since the code graphs used for training are rather small (𝑛 < 1000), all pretraining
is conducted in a full-graph fashion and across minibatches of 8 graphs. We pretrain
and train only on one cross-validation split.

Figure 4: Left: Evolution of the 𝜎 parameter during pretraining when learning the
“variance” of the pretraining tasks. Right: Link prediction validation performance of
different pretraining methods in conjunction with the GIN architecture.

To combine the pretraining tasks, three methods are experimented with, as presented
in Chapter 2.5: loss addition, alternating between pretraining tasks, and learning
a weight 𝜎 for each task (Equation 14), which can be interpreted as learning both
the scale of the loss and the difficulty of its corresponding task. Figure 4 on the left
shows the learned 𝜎 for each task during training, when all tasks are jointly learned
with Equation 14. The tasks can be divided into two groups: regression tasks (feature
masking of the embedding, degree and triangle counts) and classification tasks (link
prediction, node category masking and CWE classification). The figure shows the
different learned magnitudes for the classification-based vs. regression-based losses.
Within each group, the task difficulties can be compared: Predicting embedding
features seems to be more difficult than predicting triangle counts, which in turn
is more difficult than predicting degree counts. Predicting links and the graph's
CWEs is more challenging than the node category prediction. Figure 4 on the right
depicts link prediction performance for different pretraining routines. As expected,
the pure link prediction pretraining task achieves the highest F1 (blue), followed by
the routine which alternates between link prediction and feature masking, followed
by CWE classification (not shown in this figure). Comparing the “all” pretraining
routines which pretrain with link prediction, feature masking and CWE prediction,
alternating the loss is most efficient, and learning the task variance is slightly better
than simple loss addition. For the “all” pretraining routines, the same patterns are
observed when inspecting CWE classification and feature masking performance.
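A hedged sketch of the learned task weighting, using a common simplification of the
uncertainty-based formulation behind Equation 14 [52] that parameterizes s_i = log 𝜎_i².

    import torch
    import torch.nn as nn

    class UncertaintyWeighting(nn.Module):
        # Combines task losses as sum_i exp(-s_i) * L_i + s_i.
        def __init__(self, n_tasks):
            super().__init__()
            self.log_vars = nn.Parameter(torch.zeros(n_tasks))

        def forward(self, losses):
            total = torch.zeros(())
            for s, loss in zip(self.log_vars, losses):
                total = total + torch.exp(-s) * loss + s
            return total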

Figure 5: Training and validation loss of different pretraining methods with the GIN
architecture. Exponential smoothing is applied for clearer comparability. Pretraining
shows no clear benefit compared to no pretraining.

Figure 5 depicts both the training and validation loss when the pretrained GIN is
trained on the final vulnerability detection task. Training is stopped after 10 epochs
without validation loss improvement. The left plot illustrates that when pretrained
with “link prediction”, “all tasks in an alternating fashion” or “link prediction and
feature masking in an alternating fashion, followed by CWE classification”, the
achieved training loss is the lowest (red, brown, grey). In the right plot, only “pre-
training with link prediction”, “all tasks with learned variance” and “alternating
link prediction and feature masking followed by CWE classification” (red, pink, grey)
achieve minimally lower validation loss. When evaluated on the test set, models
with pretraining achieve no higher balanced accuracy than the original GIN.
In summary, evaluating a broad range of pretraining tasks and schedules, pretraining
on DiverseVul shows no clear benefit. However, for future experiments, link predic-
tion, combining losses while learning 𝜎, and task alternation followed by graph-level
pretraining seem to be the most promising directions, compared to
the other pretraining tasks or simple loss addition, because those methods achieved
the lowest validation loss.
The findings hint at the hypothesis that pretraining mostly assists the models in
learning faster how to “count” graph complexity, in effect learning only what the
num_nodes baseline counts: When inspecting the training loss in Figure 5, almost
all pretraining schedules start at a higher loss value than the GIN baseline at around
0.75, but they quickly proceed to pass the GIN baseline in terms of training loss
at around epoch 10. This phenomenon seems to be related to the link prediction
task, because the training loss of the CWE classification and feature masking models
progresses at a rate similar to the GIN baseline. Although their training loss is lower,
no visibly lower validation loss is achieved, and therefore it is likely that pretraining
in the current setting leads to the models overfitting faster. These findings prompt
the following experiments.

4.7 Experiment 6: Stratification

Observing that all model architectures as well as the pretrained GIN models, without
exception, achieve a balanced accuracy no higher than 65% leads us to investigate
further. Seeing how the num_nodes baseline also achieves a balanced accuracy of
65% (Table 5), the question arises whether the models learn anything besides a
representation of the complexity of the graph, which correlates with the number of
nodes the num_nodes baseline uses to detect vulnerable samples. For this purpose,
the predictions of the models from the architecture and pretraining experiments are
re-evaluated under stratification.
Figure 6: Predictions on the test set are stratified by the number of nodes in the
graph with equal-count bins. "num_nodes" represents the baseline performance.

First, results are stratified by the number of nodes in the graph: For each model,
all predictions are sorted by the number of nodes of the underlying sample's code
graph. Then they are assigned to equal-count bins. With a larger number of bins,
the performance converges to the theoretical setting in which models have no direct
access to the information about the sample's graph size (Figure 6). This is because
within each bin, only the performance on graphs of the same size is compared,
measuring how well models can distinguish vulnerable samples from benign ones
when both have the same graph size. Subsequently, the total performance is
determined by averaging across all bins.
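A minimal sketch of this stratified evaluation; the function name is ours.

    import numpy as np
    from sklearn.metrics import balanced_accuracy_score

    def stratified_score(num_nodes, y_true, y_pred, n_bins=7):
        # Sort by graph size, split into equal-count bins, score each bin,
        # and average; bins containing only a single class would leave
        # the metric undefined (cf. Experiment 7).
        order = np.argsort(num_nodes)
        scores = [balanced_accuracy_score(y_true[idx], y_pred[idx])
                  for idx in np.array_split(order, n_bins)]
        return float(np.mean(scores))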
Figure 6 depicts both the balanced accuracy and the F1 score in this setting.
The success of the stratification can be verified, since the balanced accuracy of the
num_nodes classifier drops to a random guesser's performance of 50% already
at around 3 bins (black). Similarly, its F1 score drops to a performance worse than
the 12% F1 score of a one-class classifier. Considering both plots, the GIN archi-
tecture performs best both in terms of balanced accuracy and F1 score (orange).
Its performance, although modest, is above random in the balanced accuracy case
and better than a one-class classifier. Also, a GraphGLOW [61] model was trained
(grey), exchanging the GCN [8] for the GIN architecture. However, its performance
is worse than the GIN's. Notably, the REVEAL [6] architecture, originally designed
for vulnerability detection, is unable to base predictions on more than the graph com-
plexity (purple).
Examining the balanced accuracy, the GIN's performance drops by 12 percentage
points compared to the unstratified setting. Without access to the node count infor-
mation, the model's performance is only 2.5 percentage points higher than random,
while it can simultaneously not be ruled out that the predictions are based on other
naive features.
These findings are concerning, since the DL methods synthesize almost no complex
information which a simple baseline would not be able to detect. When comparing
the num_nodes performance on unseen projects, 64% balanced accuracy and 20%
F1 score, to the best LLM's performance in a similar setting, 57% balanced accuracy
and 17% F1 [5], it seems unlikely that LLMs learn significantly more than the GNN-
based models under stratified evaluation.

Figure 7: Left: Predictions on the test set are stratified by the number of nodes in
the graph with equal-count bins. The optimal threshold increases in a linear pattern
with increasing graph size, indicating that the original predictions are biased towards
classifying smaller graphs as non-vulnerable. Right: Code length of positive, nega-
tive and additional samples.

Figure 7 on the left confirms this conclusion. The figure lists the optimal
logit thresholds per graph size in a stratified setting with 7 bins for the GIN's pre-
dictions (darker is better). An almost linear pattern shows that with increasing graph
size, higher thresholds would be more optimal, instead of the default threshold at 0.
This indicates that the model primarily looks at features correlated with graph size,
since the optimal threshold changes drastically between graph sizes. The model has
to compromise and predicts such that the optimal threshold is at 0 exactly for the
median bin, which also resembles the average graph size.
In conclusion, the models primarily learn to approximate the code length. The
reason num_nodes performs well is the data collection strategy in DiverseVul. We
confirm our hypothesis from Section 4.2 that the benign and vulnerable pairs in
general have larger code length than the 88% of functions which are modified in nei-
ther commit and are additionally collected in DiverseVul [5]: Based on the function
names we are able to approximately identify benign functions which correspond to
vulnerable ones, and retrieve 16,541 of the 18,945 corresponding benign functions.
Consequently, we compare code lengths between the pairs and the other additional
functions (Figure 7, right plot). Indeed, the vulnerable and benign pairs have similar
medians of 112 and 115, but the additional functions have a median of 35. A similar
collection of additional functions is present in a number of other works introducing
new datasets [6], [65] and becomes a systematic issue, tainting performance mea-
sures when not accounted for by an appropriate baseline.
However, this dataset composition is not inherently an issue; additional negative
examples should presumably help the model. Multiple questions can be raised in
response to the results, which we investigate:
Do the additional files mislead the model to only learn features based on the code
length?
A GIN is trained in a setting where the graph size is explicitly added to the
node features X. By providing the code length as a feature from the start, the model
is not forced to derive it from other features such as degree and triangle counts.
The GIN model is unable to achieve better performance, and therefore we dismiss
this concern.
Figure 8: Larger dataset size increases model performance. Predictions on the test
set are stratified by the number of nodes in the graph with equal-count bins.

Is it an issue of insufficient training data?
A GIN is trained with 25% and 50% of the training data. Figure 8 compares the
performance with the 100% setting. Indeed, a larger training set size increases per-
formance in the stratified setting. When disregarding the additional samples, Diver-
seVul only contains about 40,000 samples belonging to vulnerable-benign pairs. It is
plausible that significantly larger datasets will lead to better performance. The
scalability under the stratified setting should be investigated in the future, adding
more data through the use of the previous datasets.
Is it an issue of architecture, meaning the GNNs are unable to find relevant patterns?
This question is left for future research. The references from the literature imply that
token-based models such as LLMs perform no better [5], even when pretrained on
code-specific tasks. The experiments in the stratified setting suggest that, at least
within the GNN domain, of the evaluated models the simple GIN architecture per-
forms best, outperforming more complex architectures such as the RGCN and the
GSL approaches. Perhaps the expressivity of the GIN, which is as expressive as the
WL test, as demonstrated both theoretically and empirically [9], plays a crucial role;
the other architectures are derived from the GCN.
Summarizing the experiment, it is found that when stratified by the number of nodes
in the sample graphs, the prediction performance of the models decreases drastically.
Although the GIN architecture performs best, it is only 2.5 percentage points better than
a random classifier. Examining the threshold heatmap and the different median code
lengths of the vulnerable-benign pairs and the additional samples, it is concluded
that the models mostly learn only the code graph length. Neither GNNs nor LLMs
achieve good performance on unseen projects, but larger datasets might boost their
performance.

4.8 Experiment 7: Performance per CWE

In the final experiment, the performance per CWE is studied. It is attempted to
stratify both per CWE and graph size. However, the resulting bins of some CWEs
are too small to derive interpretable results. Also, for some bins balanced accuracy
and F1 score are undefined, because only a single class is present in the bin.

Figure 9: Predictions per CWE are shown. Not every CWE has samples in the test
set. num_nodes is calibrated for each CWE separately to maximize balanced
accuracy.

Instead, Figure 9 depicts the unstratified balanced accuracy of the models. For the
most conservative comparison, the num_nodes baseline is tuned to the best thresh-
old for each CWE separately. Under the presumption that the other models are only
variations of the num_nodes baseline with a threshold which is less optimal than the
globally optimal one, they trade off balanced accuracy differently between CWEs.
Therefore, the union of the models might “visually” outperform the
num_nodes baseline, which is avoided by setting the baseline to the most optimal
thresholds for each CWE.
In Figure 9, the CWEs are ordered by their sample size in the whole DiverseVul
[5] dataset. In many instances, the optimally calibrated num_nodes baseline outper-
forms the other models. On the CWEs with the largest occurrence (on the left), some
models learn additional facts about the data, indicating that larger datasets might
improve the detection capability. Supporting this claim, the performance between
models varies drastically for the smaller CWEs compared to the bigger ones, indi-
cating that the models have not been trained on enough data.

4.9 Technical Details

Technical details regarding the experiments, such as pretraining and hyperparame-
ters, are provided in Appendix 1 and Appendix 2. We leave GraphGLOW-GIN as an
interesting architecture for future research (Appendix 3).
5 Discussion
In the following, we summarize the previous experiments and contextualize them to
derive primary directives for future work in DL-based vulnerability detection.
In a theoretical analysis it is shown how the balanced accuracy presents a better
candidate to measure performance than the F1 score, because a one-class classifier
always achieves a balanced accuracy of 50% while the associated F1 score changes with
the data imbalance ratio. The plots under stratification in Experiment 6 illustrate
how balanced accuracy is more interpretable in this setting.
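As a short worked example with positive-class ratio p: the one-class classifier that
predicts every sample as vulnerable has precision p and recall 1, hence

    F1 = 2 · p · 1 / (p + 1) = 2p / (1 + p) ≈ 0.12 for p ≈ 0.064,

while its balanced accuracy is (TPR + TNR)/2 = (1 + 0)/2 = 0.5 independent of p.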
Further, it is demonstrated that to effectively evaluate DL-based methods, a project-
level split separating train and test data is crucial to omit factors of variation
specific to individual projects. A GIN network trained and evaluated on the same
projects achieves a balanced accuracy 15 percentage points higher than one trained
in the project-split setting (Experiment 1).
In previous work, authors base their experiments on DiverseVul or similar datasets
[5], [6], fail to abandon the simple data setting, and adopt the project-level split
only for some experiments. They proceed to draw conclusions based on the un-sep-
arated data, although attributing performance gains is difficult when models partially
overfit to the testing data.
Supporting the argument from Experiment 1, we show that the num_nodes base-
line, which bases decisions only on the graph size, outperforms even large, pretrained
LLMs in the project-split setting (Experiment 2).
Different GNN-based and GSL-based approaches fail, similarly to the LLMs, to out-
perform the baseline. Including edge direction or edge type to train the RGCN [7]
cannot improve the outcome either (Experiment 4); the same holds for pretraining
(Experiment 5). At least, graph structural features, including triangle counts and
node degrees, allow the models to perform on par with the baseline (Experiment 3).
When stratifying for the samples' graph sizes, we find that the models learn only
marginally more than the baseline. The best performing model, the GIN, measures
only 2.5 percentage points of balanced accuracy above num_nodes. Adding the graph
size as a feature and subsequent training reveals that it is unlikely that the models
are “misled” by the large portion of additional samples which are collected in
DiverseVul to increase the dataset size. Rather, we attribute the insufficient perfor-
mance to a lack of training data (Experiment 6).
Our findings demonstrate how DL-based vulnerability detection is still not effective
enough to be applied in a real-world setting on “unseen” data. However, the following
observations might benefit its future progress.
The findings are interpreted in order of decreasing importance:
• Evaluation: We illustrate how, in the context of the DiverseVul dataset, the right
train-test separation and stratification are necessary in order to evaluate vulnerabil-
ity detection methods and to validate one's approach. Future work would benefit
from comparing against the right baselines and metrics.
• Regarding the data dimension, a shared topic in Chapter 4 is how improving the
datasets both in quantity and quality is the most promising direction forward.
In Experiment 7, models could not be evaluated in a setting stratified by CWE
and graph size because of insufficient data. They showed drastically different per-
formance for CWEs with the least data compared to CWEs with more data, in-
dicating that performance would increase with more data. Accordingly, the
GIN model also achieved better balanced accuracy when trained with more data.
However, publicly available vulnerability data is limited to OSS, of which the two
largest projects, Linux and Chromium, are already covered by vulnerability detec-
tion datasets [5], [6], [65]. The limited availability is best demonstrated by the
overlap of over 50% between the DiverseVul dataset and previous data sources [5].
Thus, improving data quality above the 60% true-positive rate of DiverseVul [5]
is equally important.
Related to data quality is the question whether the function level is the right
representation to detect code vulnerabilities. [5] find that for many samples it is
impossible to determine whether they are vulnerable without additional context.
Other work explores higher-level representations, such as an interprocedural one
[1], [58]. Such representations provide detection methods with more context, for
instance about which variables are user-controlled. However, they come with the
drawback of increased complexity and are selective about the data which can be
used.
• Architecture: In the context of the lack of training data, and consequently also
the difficulty of devising effective evaluation metrics, authors should be flexible
when selecting and judging the performance of different architectures. In our
experiments, although simple, the GIN [9] architecture achieved the best results.
Because of the small model size, experimentation regarding pretraining or the effect
of increasing the dataset could be carried out quickly. Fast iteration would not
have been possible with larger architectures such as GraphGLOW or LLM-based
methods. After the data challenges have been solved, the RGCN represents an
interesting choice of architecture, since it incorporates the information about both
edge type and edge direction. LLM-based methods could show their effectiveness
with a larger abundance of vulnerability detection data. The 40,000 samples in the
DiverseVul dataset [5], disregarding the additionally collected data, did not suffice
for outperforming the num_nodes baseline in the “unseen projects” setting.
• Training: While the models pretrained in the current work achieve lower train-
ing loss, only in three instances is their validation loss slightly lower than the
baseline's. Perhaps more sophisticated pretraining methods, for example ones
which mimic code execution, could be developed to aid performance. Other direc-
tions such as contrastive learning with vulnerable samples and their respective
patches [6], [69], [70] could show promising results as well.
In conclusion, the recipe of more careful evaluation in combination with simple model
architectures and a special focus on data quality and quantity would greatly benefit
progress in DL-based automatic vulnerability detection.

Acknowledgments
The author would like to thank Erik Imgrund for sharing his insights along the way
and Dr. rer. nat. Martin Härterich for his guidance on the topic and feedback on the
paper.
Bibliography

[1] T. Ganz, E. Imgrund, M. Härterich, and K. Rieck, “PAVUDI: Patch-based Vul-
nerability Discovery using Machine Learning,” in Proceedings of the 39th Annual
Computer Security Applications Conference, 2023, pp. 704–717.
[2] D. A. Wheeler, “FlawFinder.” [Online]. Available: https://github.com/david-a-
wheeler/flawfinder
[3] T. Ganz, P. Rall, M. Härterich, and K. Rieck, “Hunting for Truth: Analyzing
Explanation Methods in Learning-based Vulnerability Discovery,” in 2023 IEEE
8th European Symposium on Security and Privacy (EuroSP), 2023, pp. 524–
541.
[4] C. Beaman, M. Redbourne, J. D. Mummery, and S. Hakak, “Fuzzing vulnerabil-
ity discovery techniques: Survey, challenges and future directions,” Computers
& Security, vol. 120, p. 102813–102814, 2022.
[5] Y. Chen, Z. Ding, L. Alowain, X. Chen, and D. Wagner, “Diversevul: A new
vulnerable source code dataset for deep learning based vulnerability detection,”
in Proceedings of the 26th International Symposium on Research in Attacks,
Intrusions and Defenses, 2023, pp. 654–668.
[6] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vul-
nerability detection: Are we there yet?,” IEEE Transactions on Software Engi-
neering, vol. 48, no. 9, pp. 3280–3296, 2021.
[7] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M.
Welling, “Modeling relational data with graph convolutional networks,” in The
semantic web: 15th international conference, ESWC 2018, Heraklion, Crete,
Greece, June 3–7, 2018, proceedings 15, 2018, pp. 593–607.
[8] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolu-
tional networks,” arXiv preprint arXiv:1609.02907, 2016.
[9] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural
networks?,” arXiv preprint arXiv:1810.00826, 2018.
[10] R. J. Wilson, Introduction to graph theory. Pearson Education India, 1979.
[11] Z. Hu, Y. Dong, K. Wang, and Y. Sun, “Heterogeneous graph transformer,” in
Proceedings of the web conference 2020, 2020, pp. 2704–2710.
[12] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient semi-streaming al-
gorithms for local triangle counting in massive graphs,” in Proceedings of the
14th ACM SIGKDD international conference on Knowledge discovery and data
mining, 2008, pp. 16–24.
[13] National Institute of Standards and Technology, “Vulnerability Definition.” [On-
line]. Available: https://nvd.nist.gov/vuln
[14] MITRE Corporation, “Common Weakness Enumeration.” [Online]. Available:
https://cwe.mitre.org/
[15] MITRE Corporation, “Common Vulnerabilities and Exposures.” [Online]. Available:
https://cve.mitre.org/
[16] MITRE Corporation, “Stubborn Weaknesses in the CWE Top 25.” [Online]. Avail-
able: https://cwe.mitre.org/top25/archive/2023/2023_stubborn_weaknesses.
html
[17] J. M. Spring, “An analysis of how many undiscovered vulnerabilities remain in
information systems,” Computers & Security, vol. 131, p. 103191–103192, 2023.
[18] M. Aizatsky, K. Serebryany, O. Chang, A. Arya, and M. Whittaker, “OSS-Fuzz.”
[Online]. Available: https://security.googleblog.com/2016/12/announcing-oss-fuzz-
continuous-fuzzing.html
[19] T. Ji, Y. Wu, C. Wang, X. Zhang, and Z. Wang, “The coming era of alpha-
hacking?: A survey of automatic software vulnerability detection, exploitation
and patching techniques,” in 2018 IEEE third international conference on data
science in cyberspace (DSC), 2018, pp. 53–60.
[20] Fraunhofer-AISEC, “CPG Extractor.” [Online]. Available: https://github.com/
Fraunhofer-AISEC/cpg
[21] K. Weiss and C. Banse, “A Language-Independent Analysis Platform for Source
Code,” arXiv preprint arXiv:2203.08424, 2022.
[22] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability
identification by learning comprehensive program semantics via graph neural
networks,” Advances in neural information processing systems, vol. 32, 2019.
[23] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vul-
nerability detection: Are we there yet?,” IEEE Transactions on Software Engi-
neering, vol. 48, no. 9, pp. 3280–3296, 2021.
[24] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on
large graphs,” Advances in neural information processing systems, vol. 30, 2017.
[25] Z. Wang et al., “TacticAI: an AI assistant for football tactics,” Nature Commu-
nications, vol. 15, no. 1, pp. 1–13, 2024.
[26] Z. Li and A. B. Farimani, “Graph neural network-accelerated Lagrangian fluid
simulation,” Computers & Graphics, vol. 103, pp. 201–211, 2022.
[27] F. Rosenblatt, “The perceptron: a probabilistic model for information storage
and organization in the brain.,” Psychological review, vol. 65, no. 6, p. 386–
387, 1958.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning ap-
plied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp.
2278–2324, 1998.
[29] A. Vaswani et al., “Attention is all you need,” Advances in neural information
processing systems, vol. 30, 2017.
[30] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric
deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv preprint
arXiv:2104.13478, 2021.
[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Internal Repre-
sentations by Error Propagation, Parallel Distributed Processing, Explorations
in the Microstructure of Cognition, ed. DE Rumelhart and J. McClelland. Vol.
1. 1986,” Biometrika, vol. 71, pp. 599–607, 1986.
[32] R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro, “Janossy pooling: Learning
deep permutation-invariant functions for variable-size inputs,” arXiv preprint
arXiv:1811.01900, 2018.
[33] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M.
Borgwardt, “Weisfeiler-lehman graph kernels.,” Journal of Machine Learning
Research, vol. 12, no. 9, 2011.
[34] N. T. Huang and S. Villar, “A short tutorial on the weisfeiler-lehman test and its
variants,” in ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2021, pp. 8533–8537.
[35] S. Linnainmaa, “Taylor expansion of the accumulated rounding error,” BIT
Numerical Mathematics, vol. 16, no. 2, pp. 146–160, 1976.
[36] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation of an
unknown mapping and its derivatives using multilayer feedforward networks,”
Neural networks, vol. 3, no. 5, pp. 551–560, 1990.
[37] W. Hu et al., “Strategies for pre-training graph neural networks,” arXiv preprint
arXiv:1905.12265, 2019.
[38] D. Erhan, A. Courville, Y. Bengio, and P. Vincent, “Why does unsupervised
pre-training help deep learning?,” in Proceedings of the thirteenth international
conference on artificial intelligence and statistics, 2010, pp. 201–208.
[39] H. Ren, W. Hu, and J. Leskovec, “Query2box: Reasoning over knowledge graphs
in vector space using box embeddings,” arXiv preprint arXiv:2002.05969, 2020.
[40] S. Feng et al., “Twibot-22: Towards graph-based twitter bot detection,” Ad-
vances in Neural Information Processing Systems, vol. 35, pp. 35254–35269,
2022.
[41] J. Lee, I. Lee, and J. Kang, “Self-attention graph pooling,” in International
conference on machine learning, 2019, pp. 3734–3743.
[42] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified
pre-trained encoder-decoder models for code understanding and generation,”
arXiv preprint arXiv:2109.00859, 2021.
[43] A. Radford et al., “Language models are unsupervised multitask learners,” Ope-
nAI blog, vol. 1, no. 8, p. 9–10, 2019.
[44] Y. Zhu et al., “A survey on graph structure learning: Progress and opportuni-
ties,” arXiv preprint arXiv:2103.03036, 2021.
[45] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[46] P. Veličković et al., “Graph attention networks,” arXiv preprint
arXiv:1710.10903, 2017.
[47] Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec, “Gnnexplainer: Gen-
erating explanations for graph neural networks,” Advances in neural informa-
tion processing systems, vol. 32, 2019.
[48] Y. Chen, L. Wu, and M. Zaki, “Iterative deep graph learning for graph neural
networks: Better and robust node embeddings,” Advances in neural information
processing systems, vol. 33, pp. 19314–19326, 2020.
[49] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive
learning with augmentations,” Advances in neural information processing sys-
tems, vol. 33, pp. 5812–5823, 2020.
[50] R. Entezari, M. Wortsman, O. Saukh, M. M. Shariatnia, H. Sedghi, and L.
Schmidt, “The role of pre-training data in transfer learning,” arXiv preprint
arXiv:2302.13602, 2023.
[51] D. Mayo et al., “Multitask learning via interleaving: A neural network investi-
gation,” in Proceedings of the Annual Meeting of the Cognitive Science Society,
2023.
[52] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to
weigh losses for scene geometry and semantics,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2018, pp. 7482–7491.
[53] L. Liebel and M. Körner, “Auxiliary tasks in multi-task learning,” arXiv preprint
arXiv:1805.06334, 2018.
[54] H. Robbins and S. Monro, “A stochastic approximation method,” The annals
of mathematical statistics, pp. 400–407, 1951.
[55] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural
networks,” arXiv preprint arXiv:1511.05493, 2015.
[56] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word
representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[57] R. Russell et al., “Automated vulnerability detection in source code using deep
representation learning,” in 2018 17th IEEE international conference on ma-
chine learning and applications (ICMLA), 2018, pp. 757–762.
[58] Z. Li et al., “Vuldeepecker: A deep learning-based system for vulnerability de-
tection,” arXiv preprint arXiv:1801.01681, 2018.
[59] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidi-
rectional LSTM networks,” in Proceedings. 2005 IEEE International Joint Con-
ference on Neural Networks, 2005., 2005, pp. 2047–2052.
[60] Y. Liu et al., “Roberta: A robustly optimized bert pretraining approach,” arXiv
preprint arXiv:1907.11692, 2019.
[61] W. Zhao, Q. Wu, C. Yang, and J. Yan, “Graphglow: Universal and generaliz-
able structure learning for graph neural networks,” in Proceedings of the 29th
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023,
pp. 3525–3536.
[62] Z. Zhang et al., “Hierarchical graph pooling with structure learning,” arXiv
preprint arXiv:1911.05954, 2019.
[63] C. Gallicchio and A. Micheli, “Fast and deep graph neural networks,” in Pro-
ceedings of the AAAI conference on artificial intelligence, 2020, pp. 3898–3905.
[64] G. Bhandari, A. Naseer, and L. Moonen, “CVEfixes: automated collection of
vulnerabilities and their fixes from open-source software,” in Proceedings of the
17th International Conference on Predictive Models and Data Analytics in Soft-
ware Engineering, 2021, pp. 30–39.
[65] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, “A C/C++ code vulnerability dataset
with code changes and CVE summaries,” in Proceedings of the 17th Interna-
tional Conference on Mining Software Repositories, 2020, pp. 508–512.
[66] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of
Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[67] E. Imgrund, T. Ganz, M. Härterich, L. Pirch, N. Risse, and K. Rieck, “Broken
Promises: Measuring Confounding Effects in Learning-based Vulnerability Dis-
covery,” in Proceedings of the 16th ACM Workshop on Artificial Intelligence and
Security, 2023, pp. 149–160.
[68] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[69] N. Risse and M. Böhme, “Limits of machine learning for automatic vulnerability
detection,” arXiv preprint arXiv:2306.17193, 2023.
[70] Y. Mirsky et al., “VulChecker: Graph-based Vulnerability Localization in
Source Code,” in 32nd USENIX Security Symposium (USENIX Security 23),
2023, pp. 6557–6574.
Index of Appendices
1 Appendix: Hyperparameter Search ...................................................................... B
2 Appendix: Model Training ................................................................................. B
3 Appendix: GIN-GraphGLOW ............................................................................ C
1 Appendix: Hyperparameter Search
All hyperparameter configurations can also be found in the repository linked in the
Introduction.
For the GCN [8] and GIN [9] networks, which, because of their simplicity, represent
baseline GNN performance, hyperparameters are searched in batch_size: [64, 128],
hidden_channels: [64, 128], dropout: [0, 0.3, 0.7], depth: [2, 3].
For GraphGLOW [61] the number of iterations is searched in [1, 5]; the batch size is
32, the GNN dropout is 0.3, the number of layers is 2, and the hidden channels are
129; the sparsity ratio is searched in [0, 0.2, 0.4] and the skip ratio in [0, 0.3, 0.7].
For HGP-SL [62] the batch size is searched in [32, 128], the pooling ratio in [0.3, 0.5,
0.8], and the number of layers in [2, 3, 4].
For the RGCN [7] the optimal parameters from the GCN hyperparameter search are
taken and additionally the number of globally shared basis matrices searched in [3,
8], the number of matrices shared between the forward and backward direction is
searched in [1, 2], and for each direction separately the number of matrices is searched
in [1, 2].
For REVEAL [23] we search the batch size in [32, 128], the feature dropout in [0,
0.5] and the global hidden channels in [200, 256].

2 Appendix: Model Training


Here we briefly describe the training setup. Since all graphs with more than 1000
nodes are ignored (which removes about 3500 graphs), training is sped up signifi-
cantly by implementing the models in a dense fashion. Further, this reduces the mem-
ory footprint, such that GraphGLOW [61] can be trained faster, since the setup allows
us to batch graphs. For creating minibatches, adjacency matrices and feature vectors
are padded to size 1000.
This further allows us to leverage the static-optimization options of PyTorch, since
the input has fixed dimensions, as opposed to the sparse-matrix setting. Therefore,
we can employ the structure learner and message-passing scheme of GraphGLOW
in a non-approximate fashion, achieving similar speeds to the original approximate
implementation. Furthermore, the setup enables full-graph pretraining instead of
stochastic sampling for the node-level pretraining tasks.
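A hedged sketch of this dense batching; the constant and names are illustrative.

    import torch

    MAX_NODES = 1000

    def pad_graph(A, X):
        # Zero-pad adjacency and features to a fixed size so minibatches
        # have static shapes; the mask marks real nodes vs. padding.
        n = A.shape[0]
        A_pad = torch.zeros(MAX_NODES, MAX_NODES)
        X_pad = torch.zeros(MAX_NODES, X.shape[1])
        A_pad[:n, :n] = A
        X_pad[:n] = X
        mask = torch.zeros(MAX_NODES, dtype=torch.bool)
        mask[:n] = True
        return A_pad, X_pad, mask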
3 Appendix: GIN-GraphGLOW
Another interesting research direction for the GraphGLOW architecture would be
to study its performance when the GIN architecture is used instead of the GCN as
base model on the datasets from the original work [61]. Because the GIN applies no
normalization terms to the aggregated neighborhood embeddings, or more specifically
to the adjacency matrix, when the gradient of the loss is backpropagated through the
learned adjacency matrix Â, it is independent of Â.
In the GCN's case, the adjacency matrix is normalized by the degree matrix as
D̂^(-1/2) Â D̂^(-1/2); thus the gradient changes depending on the number of neighbors
a node has. For the GCN this means that the gradient becomes smaller when connec-
tions to neighbors have higher probability, so for nodes for which many connections
have been learned, the representations change more slowly; the model could therefore
be more inflexible. Similarly, the gradient is very high for nodes without connections,
which could lead the model to learn too many connections. Essentially, the rate of
learning for edges becomes dependent on the other edges in the graph, which is not
well justified.
For the GIN's case, the gradient does not change in response to more or fewer neigh-
bors; thus the model might be more flexible in dropping and learning new connections
in the graph.
