
Conference Paper · July 2024
DOI: 10.1109/QRS-C63300.2024.00017
Available at: https://www.researchgate.net/publication/380365637


Vulnerability Classification on Source Code using Text Mining and Deep Learning Techniques

Ilias Kalouptsoglou(1,2,*), Miltiadis Siavvas(1), Apostolos Ampatzoglou(2), Dionysios Kehagias(1), and Alexander Chatzigeorgiou(2)

(1) Centre for Research and Technology Hellas, Thessaloniki, Greece
(2) University of Macedonia, Thessaloniki, Greece
iliaskaloup@iti.gr, siavvasm@iti.gr, a.ampatzoglou@uom.edu.gr, diok@iti.gr, achat@uom.edu.gr
* corresponding author

Abstract—Nowadays, security testing is an integral part of the testing activities during the software development life-cycle. Over the years, various techniques have been proposed to identify security issues in the source code, especially vulnerabilities, which can be exploited and cause severe damage. Recently, Machine Learning (ML) techniques capable of predicting vulnerable software components and indicating high-risk areas have appeared, among others, accelerating the effort-demanding and time-consuming process of vulnerability localization. For effective subsequent vulnerability elimination, there is a need for automating the process of labeling detected vulnerabilities with vulnerability categories, i.e., identifying the type of the vulnerability. Several techniques have been proposed over the years for automating the labeling process of vulnerabilities. However, the vast majority of the proposed methods attempt to identify the type of vulnerabilities based on their textual description that is provided by experts, such as the description provided by the vulnerability report in the National Vulnerability Database, and not on their actual source code, hindering their full automation and the vulnerability categorization from the software testing phase. This work examines vulnerability classification directly from the source code during the vulnerability detection step. Moreover, this way, a vulnerability detection method will be able to provide complete information and interpretation of its findings. Leveraging the advances in the fields of Artificial Intelligence and Natural Language Processing, we construct and compare several multi-class classification models for categorizing vulnerable code snippets. The results highlight the importance of the context-aware embeddings of the pre-trained Transformer-based models, as well as the significance of transfer learning from a programming language-related domain.

Index Terms—security testing; vulnerability classification; natural language processing; contextual word embedding; large language models; transfer learning

1. INTRODUCTION

Software security holds a crucial role in the software development life-cycle (SDLC), as described in the International Standard on Software Quality ISO/IEC 25010 [1]. A primary concern of the software community is the identification, characterisation and mitigation of vulnerabilities present in the source code. These vulnerabilities represent weaknesses in software systems that external threats could exploit [2][3]. Prior to the release of any software product, it is essential to conduct thorough security testing, with particular emphasis on the use of vulnerability detection tools. By employing such mechanisms, software organisations can efficiently allocate their resources and efforts to strengthen potentially vulnerable components.

As modern software systems become more complex and interconnected, there is a constantly increasing number of newly identified vulnerabilities [4]. Thus, there is a need for a classification scheme to group related and similar vulnerabilities. Therefore, the Common Weakness Enumeration (CWE) system [5] was introduced by MITRE to categorize weaknesses found in software systems. Each individual CWE represents a single vulnerability type. However, CWE is an expert-based system, where those who find a vulnerability are often different from those who categorize it [6]. Moreover, this manual categorization is usually a time-consuming procedure. According to Spanos et al. [7], there are delays between when a vulnerability is reported, when the technical description is written, and when the vulnerability is characterized with a CWE and a severity score.

Recently, approaches employing Machine Learning (ML) algorithms to automate the procedure of categorizing vulnerabilities have appeared. To enhance the manual classification carried out by security experts, Aivatoglou et al. used ML to categorize reported vulnerabilities into CWEs utilizing the textual descriptions provided by NVD for each vulnerability [8]. Additionally, Liu et al. classified vulnerabilities gathered from cybersecurity articles and websites into the groups of code injection, access issues, buffer errors, and SQL injection [9]. However, although these approaches may automate the categorization of already reported vulnerabilities, they still need a formal and accurate vulnerability-specific technical description to be written in textual form by an expert first. In other words, none of these methods are part of software testing, since they utilize vulnerability-related meta-data instead of information retrieved from the analyzed source code itself.

In contrast with vulnerability prediction techniques, which are commonly built on ML models that utilize software attributes to predict vulnerable hot-spots (e.g., files, classes, methods, etc.) in a software product [10][11][12], the procedure of labeling the predicted vulnerabilities is currently disconnected from the software testing phase, raising questions about the interpretability, and hence the reliability, of the findings of the testing activities. ML-enabled vulnerability
prediction has grown significantly, but it still presents important drawbacks, which stand as obstacles to its adoption in practice. For instance, although it overcomes the limitations of static analysis by producing much fewer false positives and being able to identify not only coding violations of pre-defined rules but also complex vulnerability patterns that cannot be expressed in the form of static analysis rules [13][14], it lacks the ability of static code analyzers to present the specific lines and categories of the security alerts [15].

In simple words, vulnerability prediction models are able to highlight software components that are likely to be vulnerable, but they typically do not provide information about the type and the exact location of the potential vulnerability, leaving a lot of manual work to the engineers. Knowing at least the type of the vulnerability would greatly help the development team, as it would narrow down their focus to specific security weaknesses, allowing them to employ dedicated testing techniques and test cases that are more effective for this type of issue.

Recent research endeavors in the field of software vulnerability prediction have attempted to reduce the level of granularity of their predictions, focusing on localizing as much as possible the vulnerable lines of code per component [16][17]. For this purpose, they often employ explainable Artificial Intelligence (AI) methods (e.g., the Attention mechanism) in order to recognize the parts of the code that were the most important in the model's prediction of a vulnerability. However, these techniques do not provide further information about the category of the vulnerability, since they try to explain the model's decision of classifying a file or a function as vulnerable in a binary classification scheme.

The purpose of this study is to present a mechanism for classifying detected vulnerabilities, providing developers with valuable insights into the kind of security issues that exist in their software (e.g., Command Injection, Deadlock, SQL Injection, etc.) during the testing phase of the SDLC, when the vulnerabilities are actually identified. To this end, we conduct an extensive comparative analysis of several Natural Language Processing (NLP) techniques applied to the textual form of vulnerable source code snippets.

In particular, we examine two different text representation methods, (i) Bag-of-Words (BoW) and (ii) sequences of tokens, as well as four different word embedding algorithms, (i) Word2vec [18], (ii) fastText [19], (iii) Bidirectional Encoder Representations from Transformers (BERT) [20], and (iv) CodeBERT [21]. We also train different kinds of ML models (e.g., Random Forest, Support Vector Machines, and Transformer). Furthermore, special attention is paid to the adoption of transfer learning. Last but not least, we provide insightful observations on the strengths and weaknesses of each examined algorithm based on the results.

The rest of the paper comprises five main sections. Section 2 provides an overview of the state-of-the-art methods for categorizing vulnerabilities, while Section 3 presents the relevant theoretical framework for the text mining techniques under examination. Section 4 describes the utilized dataset, the machine learning algorithms, and the overall methodology of this study. The findings of our evaluation approach are presented and discussed in Section 5, and Section 6 provides the conclusions of the study along with directions for future research.

2. RELATED WORK

Research studies on the characterization of security vulnerabilities have focused on the technical description provided for the reported software flaws in textual form. Initially, Neuhaus and Zimmermann analyzed vulnerability reports in the CVE database, represented as BoW, utilizing an unsupervised ML algorithm (i.e., Latent Dirichlet Allocation) to find the most usual types of vulnerability [22].

Then, Yamamoto et al. developed a method able to calculate vulnerability scores based on the natural language description provided by CVE [23], while Wen et al. proposed an automatic vulnerability categorization framework using text mining on the vulnerability descriptions in NVD [6]. Subsequently, Aghaei et al. presented ThreatZoom, a tool capable of classifying CVEs into CWEs from CVE descriptions using Artificial Neural Networks (ANNs) [24].

Furthermore, Yosifova et al. examined the performance of several baseline ML models on predicting the vulnerability type using the CVE descriptions as features [25]. In addition, Aivatoglou et al. proposed a text analysis and ML-based method to automate the process of vulnerability classification using NVD descriptions, showcasing the high prediction accuracy of tree-based models [8].

In the vulnerability prediction-related literature, where studies utilize software attributes to classify software components as vulnerable or not, text mining techniques have demonstrated encouraging results [11][26][27][28]. However, a very limited number of studies has dealt with the objective of classifying vulnerabilities into vulnerability categories, as highlighted in a systematic mapping study on software vulnerability prediction [29].

Siavvas et al. observed that software metrics are not sufficient indicators of specific vulnerability types [30]. Then Wartschinski et al. proposed VUDENC, a Deep Learning (DL) and text mining-based vulnerability detection model [31] that considers several different vulnerability categories. Although in their study they also presented results per vulnerability category, they focused on identifying which components are vulnerable, by performing binary classification. Moreover, Kong et al. dealt with the problem of identifying multi-type vulnerabilities, using graph embeddings and graph neural networks, but, although they included several CWEs in their dataset, they also focused on the discrimination between vulnerable and non-vulnerable components [32]. In fact, they were more interested in exploring how best to capture the diverse code representations of different types of vulnerabilities.

One of the initial attempts to classify software vulnerabilities during the security testing phase is [33], where the authors noticed the need for pinpointing types of vulnerabilities during
their detection phase. They proposed the use of DL, specifically a Bidirectional Long Short-Term Memory (BiLSTM) neural network, to implement a multi-class vulnerability prediction model. They also noticed that training a separate model for each type of vulnerability and applying all of them to every single sample of the testing data is neither a scalable nor an effective enough solution.

Next, Mamede et al. explored the capabilities of Transformer-based models on the classification of software vulnerabilities [34]. Particularly, they trained several BERT variants for multi-label vulnerability classification using a synthetic dataset of Java source code. Moreover, Mazuera-Rozo et al. examined several different source code representations and ML classifiers in both binary and multi-class vulnerability prediction at function-level granularity, showing a very high accuracy drop in both cases when using real-world data instead of synthetic data [35].

From the above analysis, we can argue that there is a lack of fine-grained vulnerability classification mechanisms operating in the testing phase of the SDLC. Moreover, existing methods have focused primarily on C/C++ and secondarily on Java vulnerabilities. The most promising work seems to be the study of Zou et al., but their study [33] has several limitations, since it: (i) identifies solely vulnerabilities related to API/library function calls, (ii) cannot localize the identified vulnerabilities, (iii) is based on a dataset that is largely synthetic, and (iv) deals only with C/C++ code.

In the present study, we propose a method that enhances the automation of the vulnerabilities' categorization as opposed to the traditional expert-based labeling, by leveraging AI and NLP algorithms. Concisely, the contributions of the mechanism presented in this study to the relevant literature can be summarised as follows. Firstly, this mechanism achieves classification of vulnerabilities from the early phase of security testing, as opposed to other ML-based methods that classify reported vulnerabilities based on their description provided by NVD [8][9]. Secondly, the proposed scheme provides increased confidence in the security test findings by complementing vulnerability predictions with the categories of the identified vulnerabilities.

In addition, this study categorizes vulnerable lines of code, and hence achieves a lower level of granularity of the vulnerability classification process, as opposed to studies that use the source code to implement multi-class vulnerability prediction by predicting which files or methods contain vulnerabilities and of what kind [33][34][35]. Actually, to the best of our knowledge, the present study is the first to delve into the classification of specific vulnerable lines of code into vulnerability categories, by examining and comparing several different embedding algorithms and NLP models.

Moreover, we classify real-world vulnerabilities (instead of synthetic ones), which belong to categories of vulnerabilities (e.g., Command Injection, Path Disclosure, Open Redirect, etc.) that are considered major security issues in software engineering. Furthermore, this study performs vulnerability classification on source code written in Python, which is one of the fastest growing and most popular programming languages [36], but not so much explored in this field. Finally, a discussion is conducted regarding the relationship between the results obtained and the differences, as well as the technological developments, in the NLP techniques considered.

3. THEORETICAL BACKGROUND

In this section, a description of the main text mining techniques utilized in the current study is provided. In particular, details about the main concepts that we adopted from the NLP domain to perform vulnerability classification are presented. To train ML models to classify vulnerabilities based on code snippets identified as vulnerable, we first needed to represent the source code in a numerical way. For this purpose, we used common textual representation methods. In particular, we used (i) Bag-of-Words (BoW) and (ii) sequences-of-tokens representations.

On the one hand, BoW is the simplest NLP method that allows training ML models on text data. In the BoW approach, code snippets are collected in a "bag" that contains the words of each snippet, without considering their sequential order. This method constructs a vocabulary from the words of the entire training dataset, with each code snippet represented as a vector aligned with the vocabulary. The values of this vector are the number of occurrences of each word in the code snippet. This way, the textual data of the code snippets are transformed into tabular data, allowing the code to be analyzed through a structured numerical format.
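To make this representation concrete, the short sketch below builds such a count matrix over two hypothetical pre-processed snippets using scikit-learn's CountVectorizer. The permissive token pattern and the example snippets are assumptions made for illustration; the paper does not specify the exact tokenizer used for the BoW features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical pre-processed snippets (literals already abstracted).
snippets = [
    "cursor . execute ( strId$ % user_input )",
    "os . system ( strId$ + cmd )",
]

# Permissive token pattern so that code punctuation is kept as tokens instead
# of being silently dropped (an assumption of this sketch).
bow = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X = bow.fit_transform(snippets)        # snippet x vocabulary matrix of counts

print(bow.get_feature_names_out())     # the learned vocabulary
print(X.toarray())                     # the tabular data fed to the ML classifiers
```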
On the other hand, in the sequences-of-tokens approach, each code snippet is represented as a sequence of the words included in the snippet. Therefore, the sequential order of the tokens within the code snippets is preserved, giving a benefit to ML algorithms that are capable of capturing the structural dependencies hidden among the tokens of the source code. After constructing token sequences, a word embedding method has to be applied in order to feed ML models with numerical vectors, as well as to better represent the meaning of the words.

In order to vectorize the sequences of tokens, we employed various sophisticated word embedding algorithms, such as Word2vec and fastText, as well as the word embeddings provided by Transformers. These algorithms are DL models capable of learning semantic and syntactic relationships among the tokens and placing them in the vector space based on their similarity. More specifically, words that are close to each other in the text (i.e., in the source code) are also placed close to each other in the vector space.

Word2vec, which was initially proposed by Mikolov et al. and Google [18], is one of the most popular and widely used techniques for vectorizing source code [37][38][39]. This method employs two primary architectures, namely the Continuous Bag-of-Words (CBOW) and Skip-gram [18]. The former aims at predicting a target word based on the words of its context in a sentence, whereas the latter aims at predicting context words based on a target word. Although Word2vec has proved to be efficient in a variety of NLP tasks [40], it has several drawbacks, as well. Specifically, it neither
handles the out-of-vocabulary (OOV) problem nor is able to capture contextual relationships between the words. In other words, Word2vec assigns unique and global embeddings to the words regardless of their context.

On the contrary, fastText, which is another efficient word embedding algorithm, proposed by Bojanowski et al. and Facebook AI [19], manages to handle the OOV problem by considering sub-word information. More specifically, fastText breaks words into smaller units, such as character n-grams, and is therefore able to handle OOV words. Similarly to Word2vec, it has both CBOW and Skip-gram architectures. However, it still has difficulty in understanding the complex semantic relationships and multiple meanings of words. Therefore, both fastText and Word2vec embeddings are called static or global embeddings, as they are unique per word and do not change based on the context.

A more evolved architecture, namely the Transformer, which was originally proposed by Vaswani et al. and Google [41], managed to surpass the aforementioned issues of the traditional word embedding techniques. Specifically, the Transformer architecture, which revolutionised the NLP field, introduced positional embeddings that can capture the relative positions of the tokens in the sequences. In other words, the Transformer, through its positional embeddings, can capture contextual patterns across the whole input sequence with positional information. Therefore, each word's embedding vector is not unique but depends on the context. As opposed to the static vectors, Transformer-based word embeddings are considered contextual embeddings.

One popular model based on the Transformer architecture is the Bidirectional Encoder Representations from Transformers (BERT) model that was proposed by Google [20]. It is an encoder-only Transformer, which has been pre-trained on the task of masked language modeling (MLM) using a large corpus consisting of English Wikipedia and BookCorpus [42]. More specifically, it was trained to predict the original tokens (i.e., words) in sentences where 15% of them were randomly masked by a special mask token. In a replication study of BERT, Liu et al. [43] observed that BERT was significantly undertrained, and therefore they proposed a robustly optimized pre-training approach, called RoBERTa.

Based on the RoBERTa model, Microsoft pre-trained a model called CodeBERT [21], which is a bimodal model that has been pre-trained not only on natural language but also on six programming languages (i.e., Java, Python, Go, Ruby, PHP, and JavaScript). In particular, Feng et al. pre-trained it on pairs of function-level source code and the corresponding documentation in natural language. During pre-training, CodeBERT learnt general-purpose representations that proved to be useful in tasks such as the generation of code documentation and code search based on natural language queries [21].

It is worth noting that BERT and its variants have been widely used in various software engineering tasks, ranging from software requirements extraction to requirements classification and vulnerability repair, demonstrating remarkable capabilities [44][45][46].

TABLE I
DATASET CLASS DISTRIBUTION

Vulnerability category        No. of vulnerabilities
SQL Injection                 1431
XSRF                          976
Command Injection             721
Path Disclosure               481
Open Redirect                 442
Remote Code Execution         334
XSS                           145

4. METHODOLOGY

This section describes the overall methodology that we followed in order to predict the types of software vulnerabilities. In particular, we present the dataset utilized, the strategy for constructing the ML models, and the evaluation procedure. Figure 1 illustrates all the steps of our experimental setup: (i) data collection and preparation, (ii) model selection and training setup, (iii) model training, parameterization and prediction, and (iv) model evaluation and comparison.

4.1 Dataset

For training and evaluating the examined vulnerability classification approaches, we utilized a dataset that consists of Python source code. This dataset is provided by Wartschinski et al. [31], who employed a version control system, namely GitHub, as a data source for gathering software components. To construct a dataset containing files labeled to indicate their vulnerability status (i.e., vulnerable or not), they analyzed numerous commit messages from Python projects hosted on GitHub.

They specifically went through commits with keywords in the commit messages that were indicative of a vulnerability fix. They accumulated a large number of Python source files related to these commits. The parent version of each file that existed before the vulnerability-fixing commit was flagged as vulnerable, because it contained the vulnerability that needed to be fixed. They also collected the diff files, which list the source code-level changes made between two successive commits. Thus, they were able to extract the exact lines that were repaired and therefore to determine which lines were vulnerable.

All the collected vulnerable blocks of lines were also characterized by a unique vulnerability category. This labeling was conducted based on the keywords included in the commit messages of the fixing commits. The authors of [31] chose to include keywords indicative of seven common vulnerability types, taking into account the OWASP Top 10 list [47]. Specifically, the dataset contains 4530 code blocks categorized as SQL Injection, Cross-Site Request Forgery (XSRF), Command Injection, Path Disclosure, Open Redirect, Remote Code Execution, and Cross-Site Scripting (XSS) vulnerabilities. Table I presents the distribution of the classes (i.e., vulnerability categories) in the dataset.
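The following sketch illustrates, under stated assumptions, the kind of mining procedure described above: it searches a repository for commits whose message matches a fix-related keyword and recovers the parent (pre-fix) version of each touched Python file, which would then be labeled as vulnerable. The keyword, repository path, and helper name are illustrative and are not the exact keyword lists or tooling used by Wartschinski et al. [31].

```python
import subprocess

def vulnerable_parent_versions(repo_path, keyword="sql injection"):
    """Yield (commit, path, pre-fix file content) for Python files touched by
    commits whose message matches a vulnerability-fix keyword (illustrative)."""
    commits = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H", "-i", f"--grep={keyword}"],
        capture_output=True, text=True, check=True).stdout.split()
    for commit in commits:
        changed = subprocess.run(
            ["git", "-C", repo_path, "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
            capture_output=True, text=True, check=True).stdout.split()
        for path in changed:
            if path.endswith(".py"):
                parent = subprocess.run(
                    ["git", "-C", repo_path, "show", f"{commit}^:{path}"],
                    capture_output=True, text=True)
                if parent.returncode == 0:            # file existed before the fixing commit
                    yield commit, path, parent.stdout  # this pre-fix version is flagged as vulnerable
```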
4.2 Study Design

In the first step of our methodology, after retrieving the vulnerability-related data, we pre-processed the code snippets. Specifically, we substituted all numerical constants (integers, floats, etc.) and string literals with two distinct identifiers, "numId$" and "strId$" respectively. This substitution made the code snippets more abstract and independent of application-specific constants, which might otherwise impact the performance of the models.
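A minimal sketch of this substitution step is given below, using Python's standard tokenize module. The choice of tokenizer is an assumption of the sketch, since the paper does not name the tool used to detect literals.

```python
import io
import tokenize

def abstract_literals(code: str) -> str:
    """Replace numeric and string literals with the generic identifiers used in
    the paper ("numId$" / "strId$"); everything else is kept as-is."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NUMBER:
            out.append((tokenize.NAME, "numId$"))
        elif tok.type == tokenize.STRING:
            out.append((tokenize.NAME, "strId$"))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

print(abstract_literals("query = 'SELECT * FROM t WHERE id=%s' % uid\nlimit = 10\n"))
# literals replaced with strId$ / numId$; identifiers and operators preserved
```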
Subsequently, in the model selection phase, we represented the source code in two formats, namely as BoW and as sequences of tokens. The former is in numerical form, as it represents the source code in a table of words with their numbers of occurrences as features. For the latter, we transformed each sequence into a numerical vector called an embedding, comparing several embedding methods (i.e., Word2vec, fastText, BERT, and CodeBERT). For the actual implementation of these techniques, we employed the algorithms provided by the Gensim (https://radimrehurek.com/gensim/) and Hugging Face (https://huggingface.co/) libraries for the global and contextual embeddings respectively.

For the cases of Word2vec and fastText (i.e., global embeddings), since we deal with a programming language-related task, we trained word embedding vectors leveraging a large code corpus that is provided by Wartschinski et al. [48]. This corpus consists of functions written in the Python programming language and contains 11.5 million lines of code in total. Then we computed the mean of the embedding vectors that correspond to the different words in each input sequence, resulting in a single vector that represents the average of all the word vectors in the sequence.
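A compact sketch of this training and averaging step with the Gensim API is shown below, using the hyperparameters reported later in this section (100-dimensional vectors, a context window of 20, Skip-gram for Word2vec and CBOW for fastText). The tiny in-memory corpus is only a stand-in for the 11.5-million-line Python corpus of [48].

```python
import numpy as np
from gensim.models import FastText, Word2Vec

# Stand-in corpus: each entry is one pre-processed function as a list of tokens.
token_sequences = [
    ["def", "run", "(", "cmd", ")", ":", "os", ".", "system", "(", "cmd", ")"],
    ["cursor", ".", "execute", "(", "strId$", "%", "uid", ")"],
]

# Reported settings: 100-dimensional vectors, window of 20 tokens,
# Skip-gram (sg=1) for Word2vec and CBOW (sg=0) for fastText.
w2v = Word2Vec(token_sequences, vector_size=100, window=20, sg=1, min_count=1)
ft = FastText(token_sequences, vector_size=100, window=20, sg=0, min_count=1)

def mean_embedding(tokens, keyed_vectors, dim=100):
    """Average the static vectors of the in-vocabulary tokens of one snippet."""
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(mean_embedding(token_sequences[1], w2v.wv).shape)   # (100,)
```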
On the other hand, for the case of the Transformer-based models, there was no need to train embedding vectors, as these models are already pre-trained on large datasets. Therefore, we utilized the pre-trained models provided by Hugging Face. In addition, in this case, we fed the sequences of tokens to the pre-trained models in inference mode and extracted the sentence-level embedding vectors from the last hidden state of the Transformer. This way we obtained a contextual embedding vector for each sequence of tokens given as input.
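The sketch below shows one way to extract such sentence-level vectors with the Hugging Face transformers library, using the codebert-base-mlm checkpoint named later in this section. Mean pooling over the last hidden state (ignoring padding) is an assumption of this sketch, as the paper does not state the exact pooling strategy.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
encoder = AutoModel.from_pretrained("microsoft/codebert-base-mlm")
encoder.eval()                                   # inference mode only, no fine-tuning

def snippet_embedding(snippet: str) -> torch.Tensor:
    inputs = tokenizer(snippet, truncation=True, padding="max_length",
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state          # (1, 512, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)             # ignore padded positions
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)   # (768,)

vec = snippet_embedding("os.system(strId$ + user_input)")
print(vec.shape)   # torch.Size([768]); this vector is then fed to the ML classifier
```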
Furthermore, either the BoW vectors or the embedding vectors of the sequences were fed to an ML model in order to perform multi-class classification into vulnerability categories. For the selection of the classifier, we examined several ML models, including Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF) classifiers. We also examined the approach of fine-tuning the pre-trained BERT and CodeBERT models on the downstream task of vulnerability category prediction.

As opposed to the embeddings extraction approach, during fine-tuning the whole model participates in the training for the downstream task. Regarding the Transformer-based models utilized (i.e., BERT and CodeBERT), we leveraged the pre-trained models that are provided by Hugging Face. In particular, the bert-base model, which has 110 million parameters, 12 layers of size 768, and 12 attention heads [41], was utilized, whereas for CodeBERT we leveraged the codebert-base-mlm version, which, similarly to roberta-base, has the BERT architecture but 125M parameters.

In step 3 of Figure 1, we trained the aforementioned models using the training set of the dataset. We have to clarify that we conducted several experiments until ending up with the optimal hyperparameters. For the training of the static word embeddings using a Python code corpus, we utilized a vector dimension equal to 100 and a context window size of 20 words. As embedding architectures, we used Skip-gram and CBOW for Word2vec and fastText respectively. Regarding the ML classifiers, for RF we used 100 decision trees with bootstrap sampling and the square root of the total number of features at each split. For the DT model, the tree was allowed to grow until it reaches a maximum depth of 120 levels, while the SVM utilized a radial basis function kernel and a gamma value equal to 100. Interestingly, KNN ended up with a single neighbor.
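For reference, the classifier configurations described above map onto the following scikit-learn estimators; values not reported in the paper (e.g., random seeds) are left at the library defaults, so this is a sketch rather than the authors' exact setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    # 100 trees, bootstrap sampling, sqrt(#features) considered at each split
    "RF": RandomForestClassifier(n_estimators=100, bootstrap=True, max_features="sqrt"),
    # single tree allowed to grow up to 120 levels deep
    "DT": DecisionTreeClassifier(max_depth=120),
    # radial basis function kernel with gamma = 100
    "SVM": SVC(kernel="rbf", gamma=100),
    # a single neighbor, as selected by the hyperparameter search
    "KNN": KNeighborsClassifier(n_neighbors=1),
}
# Each classifier is then fit on the BoW matrix or on the snippet embeddings:
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```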
Furthermore, BERT and CodeBERT were fine-tuned with a learning rate equal to 0.00005, the Adam optimizer, and a maximum input sequence length of 512, which is the maximum length that they can support. To determine the number of epochs, the early stopping technique was utilized, with 100 maximum epochs and a patience of 5 consecutive epochs without improvement. Moreover, during the tokenization of the textual data with BERT and CodeBERT, zero padding was applied to make all the sequences have the same length, and truncation was employed to cut off sequences longer than the maximum length.
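A hedged sketch of this fine-tuning setup with the Hugging Face Trainer is given below (learning rate 5e-5, up to 100 epochs, early stopping with patience 5, padding/truncation to 512 tokens, 7 target classes). The toy two-snippet dataset and the use of the evaluation loss as the early-stopping metric are assumptions for illustration, the Trainer's default AdamW stands in for the Adam optimizer mentioned in the text, and argument names may differ slightly across transformers versions.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

CHECKPOINT = "microsoft/codebert-base-mlm"   # use bert-base-uncased for the BERT run
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=7)

# Toy stand-in for the real train/validation splits of labelled code snippets.
toy = Dataset.from_dict({"text": ["os.system(strId$ + cmd)", "cursor.execute(strId$ % uid)"],
                         "label": [2, 0]})

def encode(batch):
    # zero padding up to the 512-token limit, truncation of longer snippets
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

train_ds = val_ds = toy.map(encode, batched=True)

args = TrainingArguments(
    output_dir="vuln-cls",
    learning_rate=5e-5,
    num_train_epochs=100,                # upper bound; early stopping decides the rest
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
trainer.train()
```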
Then we utilized the optimised models to predict vulnerability categories for the code snippets that comprise the unseen testing set. Finally, in step 4, we evaluated the optimised models of all the different approaches using common classification metrics to measure their predictive performance on the test data. Therefore, we managed to compare those approaches and to identify which method achieves the highest accuracy scores.

4.3 Evaluation Scheme

To evaluate the models that we trained on the task of vulnerability classification, we applied the well-established technique called k-fold cross-validation [29]. During this process, the dataset is separated into k different parts, specifically ten folds for the purposes of our experiment. Then, nine folds are utilized as the training set and the remaining one as the testing set. The training and testing are repeated ten times, each time selecting a different fold as the test set. This way, we avoid introducing data bias into the developed models, ensuring that the models can perform well on various parts of the dataset and not on one random split.
Fig. 1. Overview of the overall approach.

As regards the measurement of the performance of the models, we utilized the accuracy, precision (P), recall (R), and F1-score classification metrics, which are frequently utilized in the literature [8][29]. The value of these metrics is determined by the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) produced by the models. To facilitate the comparison among the examined techniques, we pay more attention to the F1-score, which considers both precision and recall, giving equal weight to them, and is therefore the most suitable metric for evaluating these models. On the contrary, we do not pay much attention to accuracy, since our dataset is imbalanced and accuracy could therefore be misleading.

We have to clarify also that we present the macro average values of the aforementioned metrics, instead of the micro or weighted averages. Macro averaging is equal to the arithmetic mean of all the per-class scores, without using any class weights for the aggregation. On the contrary, weighted averaging, for instance of the F1-score, calculates all the per-class F1-scores but, when adding them, uses a weight depending on the number of true labels of every class, whereas micro averaging computes TP, FP, TN, and FN separately for every class and then computes the global F1-score [49]. Since we consider all classes equally important, and therefore do not have to take into account the number of samples per class, we adopt the macro averaging method. Equations (1), (2), and (3) present the mathematical formulas of the macro average values of recall, precision, and F1-score respectively.

R = \frac{1}{N} \sum_{i=1}^{N} R_i = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}    (1)

P = \frac{1}{N} \sum_{i=1}^{N} P_i = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}    (2)

F_1 = \frac{1}{N} \sum_{i=1}^{N} F_{1i} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times P_i \times R_i}{P_i + R_i}    (3)

where i is the index of each class and N is the number of classes.
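The evaluation loop described in this subsection can be sketched as follows for the Random Forest case: 10-fold cross-validation with macro-averaged precision, recall, and F1 computed per fold and then averaged. Shuffling and the fixed seed are choices of this sketch rather than details stated in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import KFold

def evaluate_10_fold(X, y, n_splits=10, seed=42):
    """X: BoW or embedding matrix, y: vulnerability-category labels
    (both assumed to be NumPy arrays). Returns the mean accuracy,
    macro precision, macro recall, and macro F1 over the folds."""
    results = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=seed).split(X):
        clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx], pred, average="macro", zero_division=0)
        results.append((accuracy_score(y[test_idx], pred), p, r, f1))
    return np.mean(results, axis=0)
```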
are the second and third best models in all cases, with
5. RESULTS AND DISCUSSION

In this section we present the findings of our experimental analysis. The experiments were run on a GeForce RTX 3060 Nvidia GPU with the CUDA parallel computing platform (https://developer.nvidia.com/cuda-toolkit) installed. For the implementation of the DL algorithms we used the TensorFlow platform (https://www.tensorflow.org/), while for the ML classification models we utilized the scikit-learn library (https://scikit-learn.org/stable/). To enhance the reproducibility of the experiments, we provide our scripts online [50].

Initially, an empirical comparison of several ML classifiers (i.e., Decision Trees, Support Vector Machines, KNN, and Random Forest) was conducted so as to identify the best performing one for each utilized token numerical representation technique (i.e., BoW, Word2vec, fastText, BERT, and CodeBERT). Figure 2 provides a bar chart that shows the F1-scores of all the examined models.

Fig. 2. Comparison of different ML classification models per NLP approach.

As can be seen, Decision Tree (DT) is the least accurate model regardless of the NLP method, whereas Random Forest (RF) achieves the highest scores. The K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) models are the second and third best models in all cases, with
one outperforming the other in some cases and vice versa. Therefore, we can argue that RF, which is an ensemble model that combines the output of several decision trees [51], is superior to the other examined classifiers. Hence, we adopt it as our predictor in the subsequent experiments.

Regarding the comparison among the NLP methods utilized for text representation, Table II presents the macro average values of the classification metrics achieved by the best performing ML model (i.e., Random Forest). We can see that all of the examined methods achieve adequate prediction performance, with F1-scores above 75%, except for Word2vec, which suffers from the OOV problem and also does not consider the context of each token when representing it with a numerical vector.

On the contrary, fastText operates at the character and sub-word level, and therefore it can represent OOV tokens efficiently. However, it still overlooks the context of tokens in different sequences. This is where BERT excels, since it provides context-aware embeddings, managing to outperform Word2vec by almost 10%, but it lacks knowledge of the programming language. The BERT variant called CodeBERT encompasses all the aforementioned concepts, as it is context-aware (like BERT), handles the OOV problem using sub-word tokenization (similarly to fastText), and has prior knowledge of programming languages.

The findings presented in Table II show that the OOV problem is a significant one, resulting in low performance by Word2vec. Furthermore, fastText managed to surpass the context-aware BERT, highlighting the importance of having domain-specific prior knowledge (i.e., of the programming language). Moreover, the fact that fastText and CodeBERT are very close not only to each other but also to BoW, with BoW actually being a bit higher, suggests that the classification process may pay more attention to the occurrence in the code snippets of specific tokens that are indicative of a vulnerability category and, on the contrary, may have difficulty in capturing syntactic patterns that reside in the source code.

Subsequently, in order to depict more clearly the role of prior knowledge in transfer learning-based vulnerability classification, we present Table III. On the one hand, Table III contains the classification scores of our approach when using pre-trained Word2vec, fastText, and BERT embeddings. These pre-trained embeddings have been trained on large corpora of natural language and have learnt the syntax and semantics of words. Word2vec was originally pre-trained on a Google News dataset, while fastText was pre-trained on Wikipedia data and BERT on a dataset comprising English Wikipedia and BookCorpus (https://en.wikipedia.org/wiki/BookCorpus). On the other hand, Table III presents the classification scores for the Word2vec and fastText embeddings that we trained using a Python corpus, and also for CodeBERT, which is already pre-trained on programming language-related data.

Based on the findings presented in Table III, it is clear that in all cases the prior knowledge of a domain-specific language (i.e., a programming language) is very beneficial in the task of vulnerability classification. In particular, we can see that when representing code snippets with code-aware embeddings, we achieve higher scores in all of the accuracy, precision, recall, and F1-score metrics.

Furthermore, we proceeded with training the whole
TABLE II
EVALUATION RESULTS OF THE RANDOM FOREST CLASSIFIER PER TEXT VECTORIZING METHOD

Vectorizing Method   Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
Bag-of-Words         81.9           82.3            77.2         79.1
Word2vec             71.6           76.2            64.3         68.0
fastText             80.2           84.0            73.9         77.7
BERT                 76.9           86.6            69.4         75.1
CodeBERT             80.7           87.6            72.9         78.0

TABLE III
CLASSIFICATION PERFORMANCE OF NLP MODELS WITH PRIOR KNOWLEDGE OF NATURAL LANGUAGE VERSUS PROGRAMMING LANGUAGE

Vectorizing Method     Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
pre-trained Word2vec   68.1           73.2            59.9         63.8
re-trained Word2vec    71.6           76.2            64.3         68.0
pre-trained fastText   74.9           78.0            68.0         71.5
re-trained fastText    80.2           84.0            73.9         77.7
pre-trained BERT       76.9           86.6            69.4         75.1
pre-trained CodeBERT   80.7           87.6            72.9         78.0

TABLE IV
COMPARISON OF THE EMBEDDINGS EXTRACTION AND FINE-TUNING APPROACHES FOR THE TRANSFORMER MODELS

Approach               Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
BERT + RF              76.9           86.6            69.4         75.1
BERT fine-tuning       84.5           82.4            82.7         82.5
CodeBERT + RF          80.7           87.6            72.9         78.0
CodeBERT fine-tuning   87.4           86.3            85.2         85.5

Transformer-based models. More specifically, instead of extracting their embeddings and feeding them to the ML classifiers, we fine-tuned the BERT and CodeBERT models on the downstream task of vulnerability classification. Table IV compares the fine-tuning and embeddings extraction approaches of employing Large Language Models (LLMs).

By inspecting Table IV, we can see that when utilizing BERT embeddings with an RF classifier, the accuracy achieved is 76.9%, with corresponding precision, recall, and F1-score of 86.6%, 69.4%, and 75.1%, respectively. However, fine-tuning BERT significantly improves the results, with a boost in F1-score from 75.1% to 82.5%. This improvement suggests that fine-tuning enables BERT to adapt better to the specific nuances of the downstream task, leading to enhanced predictive performance.

In addition, fine-tuned CodeBERT demonstrates higher scores than fine-tuned BERT, confirming the findings of Table III regarding the effectiveness of CodeBERT in capturing code-related features. Moreover, fine-tuning CodeBERT results in a substantial performance gain compared to the features (i.e., embeddings) extraction approach, achieving an F1-score of 85.5%, which is by far the highest F1-score reported in this study.

This improvement highlights the importance of fine-tuning pre-trained models, pointing to the capability of the Transformer architecture to capture long-term dependencies that RF cannot. Hence, the fine-tuning approach seems to be the optimal one, at least for this specific objective (i.e., vulnerability classification), especially when working with domain-specific data (i.e., source code). All things considered, the results showcase (i) the benefit of performing fine-tuning over feature extraction in both the BERT and CodeBERT cases, and (ii) the superiority of the fine-tuned CodeBERT over the fine-tuned BERT.

For reasons of completeness, at this point we provide the detailed results for each of the seven vulnerability categories achieved by each of the best performing vulnerability classification approaches. Specifically, Table V showcases the F1-scores of the fine-tuned CodeBERT, fine-tuned BERT, BoW, CodeBERT embeddings with ML classifier, and fastText embeddings with ML classifier approaches, which are the ones that achieved the highest F1-scores, in that order.

As presented in Table V, on the one hand, it seems that our best model, which is the fine-tuned CodeBERT, manages to classify effectively all the vulnerability categories included in the analysis. In particular, it achieves an F1-score of 90% and above in the three most numerous categories (i.e., SQL Injection, XSRF, and Command Injection). It also achieves high scores close to 90% in the cases of Path Disclosure and XSS, and over 80% even for the Remote Code Execution category, despite the fact that the XSS and Remote Code Execution categories have the fewest samples in the dataset (especially XSS). Only the prediction of Open Redirect has an F1-score lower than 80%, but it still reaches 75%.

On the other hand, although it is clear that fine-tuned CodeBERT achieves not only higher average scores but also higher per-category scores in most cases, there are some exceptions. In the case of Open Redirect vulnerabilities, which is the weakest category of CodeBERT, there are other
TABLE V
F1-SCORE PER CATEGORY OF THE FIVE BEST PERFORMING APPROACHES

Category                CodeBERT fine-tuning   BERT fine-tuning   BoW + RF   CodeBERT + RF   fastText + RF
SQL Injection           90                     86                 89         82              86
XSRF                    90                     91                 86         86              80
Open Redirect           75                     72                 82         77              77
XSS                     86                     87                 77         67              73
Remote Code Execution   81                     71                 86         80              81
Command Injection       91                     86                 77         85              81
Path Disclosure         87                     85                 68         72              79

approaches that achieve more accurate predictions. In particular, all of the BoW, fastText, and CodeBERT with RF approaches surpass CodeBERT fine-tuning in classifying Open Redirect vulnerabilities. Especially BoW seems to have a substantial benefit (i.e., a 7% higher F1-score). It also manages to surpass the fine-tuned CodeBERT by 5% in the Remote Code Execution classification, which is the second weakest category of the fine-tuned CodeBERT. Hence, CodeBERT could be used as the main model for categorizing the detected vulnerabilities, and in case the detected vulnerability is classified as Open Redirect (or Remote Code Execution), one could supplementarily utilize another model that has higher predictive performance in this category (e.g., the BoW with RF model in our case), to reach safer conclusions.
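The combination suggested above can be illustrated with a simple decision rule. This is only a sketch of the suggestion, not an approach that was evaluated in the paper, and the two classifier callables are assumed to be already trained.

```python
def classify_snippet(snippet, codebert_model, bow_rf_model,
                     weak_classes=("Open Redirect", "Remote Code Execution")):
    """Use the fine-tuned CodeBERT as the main classifier and ask the BoW + RF
    model for a second opinion on CodeBERT's two weakest categories (illustrative)."""
    label = codebert_model(snippet)        # assumed: returns a vulnerability category name
    if label in weak_classes:
        return bow_rf_model(snippet)       # the model that is stronger for these classes
    return label
```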
6. CONCLUSION AND FUTURE WORK

In this study, our primary purpose was to categorize detected security vulnerabilities, such as Path Disclosure, Command Injection, etc., from the testing phase of the SDLC, using source code attributes as features, as opposed to traditional techniques, which classify reported vulnerabilities based on descriptions provided by NVD. To this end, we leveraged several NLP text representation and text classification techniques, adopting a multi-class classification procedure.

During the conducted investigation, we considered and compared two distinct approaches: (i) building traditional ML models based on numerical representations of vulnerable code snippets produced by a range of common NLP techniques (e.g., BoW, Word2vec, fastText, BERT, and CodeBERT), and (ii) building Transformer-based models through fine-tuning popular pre-trained models like BERT and CodeBERT. The findings demonstrate the superiority of the CodeBERT model, which is the one that handles the OOV issue using sub-word tokenization, has domain-specific prior knowledge, and also has contextual understanding. The results also show an important benefit of fine-tuning over the embeddings extraction approach.

Several directions for future work can be defined. For instance, an interesting analysis could include the application of explainable AI techniques to reveal which parts of the code snippets were the most influential in the models' category predictions. Additionally, we aim at implementing a complete working prototype that will be able to parse the source code of software projects, identify vulnerable components, detect the specific code snippets that contain vulnerabilities, and then classify them into vulnerability categories.

ACKNOWLEDGMENT

Work reported in this paper has received funding from the European Union's Horizon Europe Research and Innovation Program through the DOSS project, Grant Number 101120270.

REFERENCES

[1] ISO/IEC, ISO/IEC 25010 - Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - System and software quality models. ISO/IEC, 2011.
[2] M. C. Sánchez, J. M. C. de Gea, J. L. Fernández-Alemán, J. Garcerán, and A. Toval, "Software vulnerabilities overview: A descriptive study," Tsinghua Science and Technology, vol. 25, no. 2, pp. 270–280, 2019.
[3] ISO/IEC, ISO/IEC 27000:2018 - Information technology — Security techniques — Information security management systems. ISO/IEC, 2005.
[4] "Cyber Crime & Security," https://www.statista.com/statistics/500755/worldwide-common-vulnerabilities-and-exposures, Accessed: 2024-03-15.
[5] "Common Weakness Enumeration," https://cwe.mitre.org/, Accessed: 2024-03-15.
[6] T. Wen, Y. Zhang, Q. Wu, and G. Yang, "ASVC: An automatic security vulnerability categorization framework based on novel features of vulnerability data," J. Commun., vol. 10, no. 2, pp. 107–116, 2015.
[7] G. Spanos and L. Angelis, "A multi-target approach to estimate software vulnerability characteristics and severity scores," Journal of Systems and Software, vol. 146, pp. 152–166, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121218302061
[8] G. Aivatoglou, M. Anastasiadis, G. Spanos, A. Voulgaridis, K. Votis, and D. Tzovaras, "A tree-based machine learning methodology to automatically classify software vulnerabilities," in 2021 IEEE International Conference on Cyber Security and Resilience (CSR), 2021, pp. 312–317.
[9] C. Liu, J. Li, and X. Chen, "Network vulnerability analysis using text mining," in Intelligent Information and Database Systems: 4th Asian Conference, ACIIDS 2012, Kaohsiung, Taiwan, March 19-21, 2012, Proceedings, Part II. Springer, 2012, pp. 274–283.
[10] R. Scandariato, J. Walden, A. Hovsepyan, and W. Joosen, "Predicting vulnerable software components via text mining," IEEE Transactions on Software Engineering, vol. 40, no. 10, pp. 993–1006, 2014.
[11] J. Walden, J. Stuckman, and R. Scandariato, "Predicting vulnerable components: Software metrics vs text mining," in 2014 IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 2014, pp. 23–33.
[12] I. Kalouptsoglou, M. Siavvas, D. Tsoukalas, and D. Kehagias, "Cross-project vulnerability prediction based on software metrics and deep learning," in International Conference on Computational Science and Its Applications. Springer, 2020, pp. 877–893.
[13] B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge, "Why don't software developers use static analysis tools to find bugs?" in 2013 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 672–681.
[14] M. Siavvas, I. Kalouptsoglou, D. Tsoukalas, and D. Kehagias, "A self-adaptive approach for assessing the criticality of security-related static analysis alerts," in Computational Science and Its Applications – ICCSA 2021: 21st International Conference, Cagliari, Italy, September 13–16, 2021, Proceedings, Part VII. Springer, 2021, pp. 289–305.
[15] M. Siavvas, D. Kehagias, D. Tzovaras, and E. Gelenbe, "A hierarchical model for quantifying software security based on static analysis alerts and software metrics," Software Quality Journal, vol. 29, no. 2, pp. 431–507, 2021.
[16] Y. Li, S. Wang, and T. N. Nguyen, "Vulnerability detection with fine-grained interpretations," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 292–303.
[17] F. Yang, F. Zhong, G. Zeng, P. Xiao, and W. Zheng, "LineFlowDP: A deep learning-based two-phase approach for line-level defect prediction," Empirical Software Engineering, vol. 29, no. 2, pp. 1–49, 2024.
[18] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[19] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "CodeBERT: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
[22] S. Neuhaus and T. Zimmermann, "Security trend analysis with CVE topic models," in 2010 IEEE 21st International Symposium on Software Reliability Engineering. IEEE, 2010, pp. 111–120.
[23] Y. Yamamoto, D. Miyamoto, and M. Nakayama, "Text-mining approach for estimating vulnerability score," in 2015 4th International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS). IEEE, 2015, pp. 67–73.
[24] E. Aghaei, W. Shadid, and E. Al-Shaer, "ThreatZoom: Hierarchical neural network for CVEs to CWEs classification," in International Conference on Security and Privacy in Communication Systems. Springer, 2020, pp. 23–41.
[25] V. Yosifova, A. Tasheva, and R. Trifonov, "Predicting vulnerability type in common vulnerabilities and exposures (CVE) database with machine learning classifiers," in 2021 12th National Conference with International Participation (ELECTRONICA), 2021, pp. 1–6.
[26] Y. Pang, X. Xue, and H. Wang, "Predicting vulnerable software components through deep neural network," in Proceedings of the 2017 International Conference on Deep Learning Technologies, 2017, pp. 6–10.
[27] R. Ferenc, P. Hegedűs, P. Gyimesi, G. Antal, D. Bán, and T. Gyimóthy, "Challenging machine learning algorithms in predicting vulnerable JavaScript functions," in 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). IEEE, 2019, pp. 8–14.
[28] I. Kalouptsoglou, M. Siavvas, D. Kehagias, A. Chatzigeorgiou, and A. Ampatzoglou, "Examining the capacity of text mining and software metrics in vulnerability prediction," Entropy, vol. 24, no. 5, 2022. [Online]. Available: https://www.mdpi.com/1099-4300/24/5/651
[29] I. Kalouptsoglou, M. Siavvas, A. Ampatzoglou, D. Kehagias, and A. Chatzigeorgiou, "Software vulnerability prediction: A systematic mapping study," Information and Software Technology, p. 107303, 2023.
[30] M. Siavvas, D. Kehagias, and D. Tzovaras, "A preliminary study on the relationship among software metrics and specific vulnerability types," in 2017 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, 2017, pp. 916–921.
[31] L. Wartschinski, Y. Noller, T. Vogel, T. Kehrer, and L. Grunske, "VUDENC: Vulnerability detection with deep learning on a natural codebase for Python," Information and Software Technology, p. 106809, 2022.
[32] L. Kong, S. Luo, L. Pan, Z. Wu, and X. Li, "A multi-type vulnerability detection framework with parallel perspective fusion and hierarchical feature enhancement," Computers & Security, p. 103787, 2024.
[33] D. Zou, S. Wang, S. Xu, Z. Li, and H. Jin, "A deep learning-based system for multiclass vulnerability detection," IEEE Transactions on Dependable and Secure Computing, vol. 18, no. 5, pp. 2224–2236, 2019.
[34] C. Mamede, E. Pinconschi, R. Abreu, and J. Campos, "Exploring transformers for multi-label classification of Java vulnerabilities," in 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), 2022, pp. 43–52.
[35] A. Mazuera-Rozo, A. Mojica-Hanke, M. Linares-Vásquez, and G. Bavota, "Shallow or deep? An empirical study on detecting vulnerabilities using deep learning," in 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 2021, pp. 276–287.
[36] "Top programming languages that will rule in 2022," https://fireart.studio/blog/top-programming-languages-that-will-rule-in-2021/, Accessed: 2024-03-20.
[37] I. Kalouptsoglou, M. Siavvas, D. Kehagias, A. Chatzigeorgiou, and A. Ampatzoglou, "An empirical evaluation of the usefulness of word embedding techniques in deep learning-based vulnerability prediction," in EuroCybersec 2021, Lecture Notes in Communications in Computer and Information Science, Oct. 2021.
[38] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," arXiv preprint arXiv:1909.03496, 2019.
[39] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "VulDeePecker: A deep learning-based system for vulnerability detection," arXiv preprint arXiv:1801.01681, 2018.
[40] L. Wolf, Y. Hanani, K. Bar, and N. Dershowitz, "Joint word2vec networks for bilingual semantic representations," Int. J. Comput. Linguistics Appl., vol. 5, no. 1, pp. 27–42, 2014.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[42] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[43] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[44] A. F. de Araújo and R. M. Marcacini, "RE-BERT: Automatic extraction of software requirements from app reviews using BERT language model," in Proceedings of the 36th Annual ACM Symposium on Applied Computing, 2021, pp. 1321–1327.
[45] K. Kaur and P. Kaur, "Improving BERT model for requirements classification by bidirectional LSTM-CNN deep model," Computers and Electrical Engineering, vol. 108, p. 108699, 2023.
[46] K. Huang, S. Yang, H. Sun, C. Sun, X. Li, and Y. Zhang, "Repairing security vulnerabilities using pre-trained programming language models," in 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 2022, pp. 111–116.
[47] "Explore the world of cyber security," https://owasp.org/, Accessed: 2024-03-20.
[48] L. Wartschinski, "VUDENC - Python corpus for word2vec," Dec. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3559480
[49] "Understanding Micro, Macro, and Weighted Averages," https://iamirmasoud.com/2022/06/19/understanding-micro-macro-and-weighted-averages-for-scikit-learn-metrics-in-multi-class-classification-with-example/, Accessed: 2024-02-10.
[50] I. Kalouptsoglou, M. Siavvas, A. Ampatzoglou, D. Kehagias, and A. Chatzigeorgiou, "Software Vulnerability Classification using Text Mining and Deep Learning Techniques," https://sites.google.com/view/vulnerabilityclassification/, Accessed: 2024-04-01.
[51] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

