A Systematic Literature Review on Large Language Models for Automated Program Repair
1 INTRODUCTION
Software bugs are recognized as inevitable and destructive, posing safety issues for users worldwide
and costing billions of dollars in financial losses annually [11, 158]. It is non-trivial and time-
consuming for developers to fix detected software bugs manually [13]. Automated Program
Repair (APR) plays a crucial role in software development and maintenance with the aim of fixing
software bugs without human intervention. Following the foundational work GenProg [80, 157]
in 2009, APR has been extensively investigated over the past decades [43, 106], and researchers
have proposed a variety of APR techniques, including heuristic-based [64, 80, 99, 180], constraint-
based [31, 100, 171, 173], and pattern-based [76, 91, 92] ones. Recently, inspired by the advances of
Deep Learning (DL), an increasing number of learning-based APR techniques have been proposed
that utilize neural network models to automatically learn bug-fixing patterns [18, 66, 84, 85, 96, 144,
176–178, 203, 204]. Thanks to the powerful ability of DL models to learn hidden repair patterns from
massive code corpora, learning-based APR has achieved remarkable performance in the last couple
of years [185], attracting considerable attention from both academia and industry [69, 70, 73].
Very recently, Large Language Models (LLMs) have been successfully applied to a broad
range of source code-related tasks [149, 187, 195], such as code generation [82, 150, 152, 205], code
summarization [134, 135, 148], and test generation [4, 24, 57, 109, 129]. Benefiting from massive
model parameters and vast training data, LLMs have demonstrated impressive performance and
fundamentally revolutionized the research paradigm in the Software Engineering (SE) commu-
nity. In the domain of APR, beginning with pioneering studies, e.g., TFix [7], CIRCLE [179] and
AlphaRepair [165], the community has witnessed an explosion of repair studies utilizing LLMs,
already achieving considerable advantages and further indicating significant potential for future
research. However, the integration of LLMs within APR is a considerably complex undertaking,
making it difficult for interested researchers to understand existing work. For example, existing
LLM-based APR studies encompass different research perspectives (e.g., empirical [164], techni-
cal [165] and benchmark studies [190]), repair phases (e.g., patch generation [189] and correctness
assessment [186]), repair scenarios (e.g., static warnings [69] and syntax errors [70]), model architectures (e.g., encoder-only [188] and decoder-only [101]) and model utilization paradigms (e.g.,
fine-tuning [179], few-shot [109] and zero-shot [189]). Despite ongoing explorations in the field,
the literature currently lacks a detailed and systematic review of the applications of LLMs in APR,
making it challenging for researchers to understand the multitudinous design choices of existing
work and conduct follow-up research.
This Paper. To bridge this gap, our work provides the first systematic literature review on the
deployment of rapidly emerging LLM-based APR studies. Based on this, the community can gain a
comprehensive understanding of the strengths, weaknesses, and gaps in existing LLM-based APR
techniques. We discuss which LLMs are widely adopted in state-of-the-art APR research and how they
are integrated into the repair workflow. We collect 127 relevant papers and perform a systematic
analysis from LLMs, APR, and integration perspectives. From our analysis, we reveal the current
challenges and point out possible future directions for LLM-based APR research. Overall, this work
offers a thorough overview of the ongoing progress within the LLM-based APR community, aiding
researchers in navigating this burgeoning field and advancing toward innovative practices.
Contributions. To sum up, this work makes the following contributions:
• Survey Methodology. We conduct the first systematic literature review with 127 high-quality
APR papers that utilize recent LLMs to address repair challenges from 2020 to April 2024.
• Trend Analysis. We perform a detailed analysis of selected APR studies in terms of publication
trends, distribution of publication venues, and types of contributions.
• LLMs Perspective. We summarize 46 LLMs utilized to support program repair and provide a
summary of the typical usage and trends of different LLM categories in the APR domain.
• APR Perspective. We describe common repair scenarios that LLMs are applied to, encom-
passing 18 bug types, such as security vulnerabilities and programming problems.
• Integration Perspective. We discuss some key factors, including datasets, input representa-
tions and open science, that impact the performance of integrating LLMs into APR.
• Challenges and Opportunities. We summarize some crucial challenges of applying LLMs in
the APR field, and pinpoint some potential guidelines for future LLM-based APR research.
Paper Organization. Section 2 introduces some basic concepts about APR and LLMs. Then,
according to the contributions listed above, Section 3 lists our research questions (RQs) and the
research methodology to collect papers related to our work. Section 4 investigates the trend and
distribution of LLM-based APR studies. Section 5 summarizes LLMs that are used by existing APR
studies. Section 6 illustrates the primary repair scenarios that LLMs are applied to and provides
a brief description of each work. Section 7 discusses some crucial factors during the integration
of LLMs and APR, including datasets, input representation, patch correctness, and open science.
Section 8 discusses some challenges and practical guidelines. Section 9 draws the conclusions.
2.1.2 Repair Workflow. Fig. 1 illustrates the typical generate-and-validate workflow of APR, which
is usually composed of three parts. Specifically, for a detected bug, (1) the fault localization
phase identifies suspicious code elements that need to be fixed based on off-the-shelf localization
techniques [160]; (2) the patch generation phase generates program variants (i.e., candidate
patches) via transformation rules [185]; (3) the patch validation phase utilizes available test
suites as the oracle to identify correct patches by dynamic execution [186]. It is important to note
that a candidate patch that successfully passes the available test suite is termed a plausible patch.
However, if such a plausible patch fails to generalize to additional test cases, it is considered an
overfitting patch. Conversely, a plausible patch that remains effective across broader test cases is
recognized as a correct patch, one that is semantically in alignment with patches written manually
by developers.
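To make the three phases concrete, the following minimal Python sketch outlines a generate-and-validate loop; the fault localizer, patch generator, and test runner are hypothetical placeholders standing in for the off-the-shelf components described above.

```python
from typing import Callable, Iterable, List

def generate_and_validate(buggy_program: str,
                          localize: Callable[[str], List[int]],
                          generate_patches: Callable[[str, int], Iterable[str]],
                          run_tests: Callable[[str], bool]) -> List[str]:
    """Return plausible patches, i.e., program variants that pass the available test suite."""
    plausible_patches = []
    # (1) Fault localization: rank suspicious code elements (e.g., line numbers).
    for suspicious_line in localize(buggy_program):
        # (2) Patch generation: produce candidate variants for this location.
        for candidate in generate_patches(buggy_program, suspicious_line):
            # (3) Patch validation: a candidate passing all available tests is only
            #     *plausible*; deciding correct vs. overfitting needs further checks.
            if run_tests(candidate):
                plausible_patches.append(candidate)
    return plausible_patches
```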
2.1.3 Repair Techniques. The literature has introduced a multitude of APR techniques to generate
correct patches from various perspectives. Such APR techniques can be categorized into four
categories: (1) heuristic-based APR utilizes genetic programming to explore the search space of
the correct patch, such as GenProg [80], Astor [99], ARJA [180], and SimFix [64]; (2) constraint-
based APR usually focuses on the condition synthesis problem by treating program repair as
a constraint-solving task, such as Nopol [173], Cardumen [100], ACS [171], and Dynamoth [31];
(3) pattern-based APR utilizes pre-defined repair templates that are usually hand-crafted by
experts to transform buggy code snippets into correct ones, such as TBar [92], FixMiner [76] and
Avatar [91]; (4) learning-based APR treats patch generation as a neural machine translation
task with the advance of DL models, such as Tufano et al. [144], SequenceR [18], CoCoNut [96],
DLFix [84], CURE [66], Recoder [203], DEAR [85], SelfAPR [176], RewardAPR [177], and Tare [204].
Among the four types of APR techniques, learning-based APR has achieved remarkable perfor-
mance by learning hidden bug-fixing patterns automatically from extensive source code databases [185].
Recently, inspired by the success of LLMs in NLP and SE tasks, researchers have increasingly been
utilizing advanced LLMs to address software bugs [164, 165, 188, 189]. Compared with learning-
based APR, such LLM-based APR techniques have demonstrated significantly better performance
and have received growing attention in the community, which is the focus of our work.
2.2.2 Model Categories. The literature has seen a variety of LLMs supporting NLP and SE re-
search, which can be categorized into three main categories based on their model architectures. (1)
Encoder-only LLMs, such as CodeBERT [35], GraphCodeBERT [46], train the encoder part of the
Transformer to generate a fixed-dimensional bidirectional representation with Masked Language
Modeling (MLM) and Next Sentence Prediction (NSP). MLM aims to predict the original tokens
that have been randomly masked out, and NSP predicts whether two given sentences actually
follow each other in a text. (2) Decoder-only LLMs, such as CodeGPT [95], train the decoder
part of the Transformer to support auto-regressive tasks with Causal Language Modeling (CLM),
which aims to predict new tokens in a sequence based on previous tokens. (3) Encoder-decoder
LLMs, such as CodeT5 [154], train both encoder and decoder parts of the Transformer to support
sequence-to-sequence generation tasks with denoising objectives. We will summarize existing
LLMs and how they are leveraged to support program repair in Section 5.1.
3 SURVEY METHODOLOGY
In this section, guided by the principles outlined by Petersen et al. [118] and Kitchenham et al. [74],
we present details of our systematic literature review methodology.
1 http://program-repair.org/bibliography.html
Inclusion criteria
① The paper utilizes an LLM in its framework.
② The paper addresses the task of program repair.
③ The paper is accessible with full text.
④ The paper is written in English.
Exclusion criteria
❶ The paper is fewer than seven pages.
❷ The paper is an old conference version extended to journals with the same authors.
❸ The paper uses repair methods to contribute to LLMs.
❹ The paper is published as an SLR, review, or survey only mentioning the development of LLMs and APR.
❺ The paper is published in a workshop or a doctoral symposium.
❻ The paper is a grey publication, e.g., a technical report or thesis.
❼ The paper falls into the category of short papers, tool demonstrations, and editorials.
Table 2. Checklist of Quality Assessment Criteria (QAC) for LLM-based APR studies.
It is noteworthy that we also include the list of keywords related to machine/deep learning from
Zhang et al. [185], so as to avoid missing any papers related to our work as much as possible.
[Figure: (a) Number of publications per year: 2020: 1, 2021: 5, 2022: 14, 2023: 65, 2024 (to April): 42. (b) Cumulative number of publications per year: 1, 6, 20, 85, 127.]
Since many relevant studies, particularly those involving recently released LLMs, have not completed the peer review process,
we consider papers from arXiv and select high-quality papers with the quality assessment process,
to make our survey more comprehensive and up-to-date. We obtain 110 papers that are related to
our work.
prevalent trend since 2020, and a growing number of studies are expected to adopt LLMs to address the challenges
of APR in the future.
Table 3. Publication venues with LLM-based APR studies.
[Figure: distribution of programming languages targeted by LLM-based APR studies: Java 37%, Python 24%, C 11%, C++ 8%, JavaScript 7%, C# 3%, Verilog 2%, and 1% each for Solidity, Rust, Go, PHP, Kotlin, PowerShell, Excel, Power Fx, OCaml, Isabelle/HOL, and Ruby.]
Big-Vul [33], and TFix [7]. We also find that LLM-based APR encompasses a broader range of
programming languages compared to traditional APR. For example, our collected papers involve
18 different programming languages in total, whereas learning-based APR techniques are typically
limited to only five languages [185]. Importantly, we notice that some rare languages, previously
overlooked in the APR community, are now being addressed. This broader language adaptability
of LLM-based APR might stem from the inherent capabilities of LLMs to encapsulate general
programming knowledge that can be transferred across multiple languages by fine-tuning [179].
Besides, LLMs’ robust natural language understanding capabilities facilitate the few-shot or zero-
shot repair settings with limited learning samples, which is a significant advantage over DL
models [185] that typically require extensive repair corpora for training. Consequently, LLMs can
efficiently handle lesser-known programming languages, like Verilog [1] and Rust [22], which are
often underrepresented in previous APR research [107]. This phenomenon highlights the promising
prospects and scalability of APR brought about by recent LLMs.
We categorize collected papers according to their main contributions into four categories: new
technique or methodology, empirical study, benchmark, and human study, as illustrated in Fig. 4.
We find that 78 relevant papers are published, with the aim of proposing a novel repair approach
or framework with LLMs to address various issues in the APR community. Besides, 38 papers
concentrate on conducting empirical studies to explore the actual benefits of LLMs in fixing various
[Figure: usage counts of individual LLMs across the collected studies (bar chart; the most frequently used models are ChatGPT 37, GPT-4 25, CodeT5 23, and Codex 21) and the distribution of adaptation strategies: Zero-shot 48%, Fine-tuning 37%, Few-shot 15%.]
bugs, such as the potential of fine-tuning LLMs in vulnerability repair [44, 188]. We further notice
nine relevant studies constructing new benchmarks to evaluate the performance of LLMs and two
papers [44, 102] administering a survey to offer insights into how practitioners or developers think
about and employ LLMs to fix software bugs in practice.
(1) LLMs have shown a booming trend in fixing software bugs, with 127 papers between
2020 and 2024. (2) The number of conference papers employing LLMs for APR significantly
exceeds that of journal papers, with ICSE and TOSEM being the most popular conference
and journal venues, respectively. (3) LLM-based APR papers are published in different
research fields, including SE, AI, and Security. (4) There are 18 programming languages
that LLM-based APR has been applied to, with Java, Python, C, and C++ being the most
frequently targeted. (5) LLMs have been applied to some underrepresented programming
languages, such as Verilog and Rust. (6) The vast majority of collected studies primarily
focus on introducing new techniques and conducting empirical research, while two papers
perform user studies to understand practitioners’ attitudes and experiences regarding
leveraging various LLMs for solving bug-fixing tasks.
corpus using a Masked Language Modeling (MLM) task, which learns to predict the identity
of masked words based on their context. In the APR community, prominent models like CodeBERT
(11) [35] and GraphCodeBERT (6) [46] have been investigated in semantic bugs [59, 101, 165, 206]
and security vulnerabilities [188]. As such LLMs only contain an encoder component that is
capable of generating context-aware representations for inputs, they are particularly suited for code
understanding tasks, such as code search, while not directly applicable to code generation tasks,
such as program repair. Thus, in the APR community, as Zhang et al. [188] mention, researchers
may need to attach a new decoder, initialized from scratch, to the pre-trained encoder to
construct an encoder-decoder architecture for patch generation. We also notice such encoder-only
LLMs are utilized to identify patch correctness [79, 137–139, 186, 202], as discussed in Section 7.3.
5.1.2 Encoder-decoder LLMs. Encoder-decoder LLMs denote a category of LLMs that utilize
both the encoder and decoder stacks of the Transformer architecture, thus inherently suitable for
transforming one sequence into another. Particularly, the encoder takes one sequence as input and
encodes it into a fixed-size hidden state, which effectively captures the semantics and meaning of the
input sequence. Then the decoder processes the hidden state and produces the corresponding output
sequence using attention mechanisms to refer back to parts of the input sequence as needed. Thanks
to the encoder-decoder architecture, such LLMs are particularly suited for code generation tasks in
a sequence-to-sequence learning setting, such as program repair. In the APR community, a wide range of encoder-decoder LLMs have been adopted, such as CodeT5 (23) [154], PLBART (10) [2], UniXcoder (4) [45] and T5 (3) [122]. Similar to traditional learning-based APR [185], such studies usually treat APR as a neural machine translation (NMT) task by supervised sequence-to-sequence
learning, such as CIRCLE [179], TFix [7], VulRepair [39] and RAP-Gen [153].
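A minimal sketch of treating repair as sequence-to-sequence generation with an encoder-decoder model is shown below; the checkpoint, toy buggy snippet, and decoding settings are illustrative, and real systems such as CIRCLE, TFix, VulRepair, and RAP-Gen add task-specific fine-tuning and prompt formats on top.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Salesforce/codet5-base"   # illustrative encoder-decoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

buggy_code = "if (a = b) { return true; }"   # toy buggy input sequence

# The encoder consumes the buggy sequence; the decoder emits a fixed sequence token
# by token. Beam search returns several ranked candidate patches.
inputs = tokenizer(buggy_code, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=5, num_return_sequences=5)
candidate_patches = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```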
5.1.3 Decoder-only LLMs. Decoder-only LLMs denote a category of LLMs that utilize only the
decoder stack of the Transformer architecture. Decoder-only LLMs are typically pre-trained using
a causal language modeling (CLM) objective, learning to predict the next word in a sentence
based on the preceding words. These models are specifically designed for generating sequences
of text autoregressively, i.e., producing one token at a time and using what has been generated
so far as context for subsequent tokens. In the APR community, decoder-only LLMs are the most
popular and widely used group, compared with encoder-only and encoder-decoder LLMs. Notable
repair applications of decoder-only LLMs include GPT-series models (e.g., GPT-1 [120], GPT-
2 [121], GPT3 [14], GPT-3.5 [112], ChatGPT [113], and GPT-4 [114]), some open-sourced models,
(e.g., CodeGPT [95], GPT-Neo [10], GPT-NeoX [9], GPT-J [147], InCoder [37], CodeGen [110],
CodeLLaMA [125] and StarCoder [83]), as well as some closed-source models (e.g., Codex [16]
in Fan et al. [34] and CEDAR [109]). The emergence of decoder-only LLM-based APR studies is
primarily due to two reasons. The first reason is that these models can naturally perform program
repair from a few examples or simple instructions without any fine-tuning. The second reason is
the recent surge in decoder-only LLMs, marked by the introduction of commercial products by
leading Internet companies, such as ChatGPT and GPT-4 by OpenAI.
5.2 What approaches are employed to optimize LLMs for program repair?
LLMs typically acquire general knowledge from extensive datasets. Thus, a fundamental research
issue arises when integrating off-the-shelf LLMs with APR: how to adapt general-purpose LLMs to
the specific program repair task. Fig. 5 displays the prevalence of three common adaptation strategies
in LLM-based APR research: fine-tuning, few-shot learning, and zero-shot learning. Our findings
indicate that zero-shot learning, employed in 48% of the studies, is the most popular approach,
suggesting a trend towards using LLMs as-is for program repair tasks. Meanwhile, fine-tuning is
utilized in 37% of the cases, followed by few-shot learning at 15%.
5.2.1 Fine-tuning. Fine-tuning refers to a process where LLMs are further trained on a smaller,
task-specific dataset. This is an intuitive way to allow LLMs to adjust their weights and biases
through supervised learning, enabling them to perform as expected on new tasks that are similar to,
but not exactly the same as, those they were initially trained on. In the APR community, fine-tuning
is widely utilized during the early emergence of LLMs with millions of parameters, such as T5 and
CodeT5, as it can significantly improve performance on the target program repair task without the
need to train an LLM from scratch.
We summarize the existing APR studies that fine-tune LLMs into three stages. First, researchers directly regard program repair as a downstream task and fine-tune LLMs on task-specific datasets, e.g., T5 [7], CodeT5 [39], CodeBERT [101], and GPT-2 [77]. Second, researchers utilize more advanced fine-tuning strategies for better performance. For example, CIRCLE [179] utilizes continual learning to
repair multiple languages with a single model, and RAP-Gen [153] utilizes retrieval-augmented
generation to guide patch search space. Recently, Zirak et al. [206] empirically explore the domain
shift problem in APR with two LLMs, i.e., TFix and CodeBERT, and three fine-tuning methods, i.e.,
Full-Fine-Tuning, Tuning-With-Light-Weight-Adapter-Layers, and Curriculum-Learning. Third,
researchers conduct empirical studies to explore the actual fix capabilities of various LLMs in
different repair scenarios. For example, Zhang et al. [188] fine-tune five LLMs to repair C/C++
security vulnerabilities, Wu et al. [161] involve four LLMs in Java vulnerabilities, and Jiang et
al. [65] consider four LLMs on Java semantic bugs.
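The core of such supervised fine-tuning on buggy/fixed pairs can be sketched as follows, assuming a tiny illustrative dataset; the studies above differ in data scale, pre-processing, and training schedules.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy buggy/fixed pair; real studies fine-tune on large bug-fix corpora.
pairs = [("if (a = b) return true;", "if (a == b) return true;")]

model.train()
for epoch in range(3):
    for buggy, fixed in pairs:
        inputs = tokenizer(buggy, return_tensors="pt", truncation=True)
        labels = tokenizer(fixed, return_tensors="pt", truncation=True).input_ids
        # Standard seq2seq cross-entropy: the decoder learns to emit the fixed code.
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```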
5.2.2 Few-shot Learning. Few-shot learning refers to the ability of LLMs to learn or adapt to new
tasks with a very limited amount of data—often only a few examples. This is an effective way
to use examples to help LLMs understand the targeted task and generate appropriate responses
without any explicit retraining or fine-tuning. In the APR community, few-shot learning is typically applied to LLMs with billions of parameters, as it relies on the models' powerful ability to generalize from very limited data. Researchers typically provide LLMs with a
small number of repair examples directly in the input prompt and require LLMs to generate correct
patches. For example, Nashid et al. [109] construct effective prompts by retrieving similar repair
demonstrations for CodeX, and Xia et al. [164] provide LLMs with examples from the same buggy
project to learn the coding style.
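A minimal sketch of assembling such a few-shot prompt is given below; the retrieval of similar demonstrations is assumed to have been done beforehand (e.g., as in Nashid et al. [109]), and the formatting is illustrative rather than any tool's exact template.

```python
def build_few_shot_prompt(buggy_code: str,
                          examples: list,
                          k: int = 2) -> str:
    """Concatenate k (buggy, fixed) demonstrations ahead of the query bug.

    `examples` is assumed to be pre-ranked by similarity to `buggy_code`,
    e.g., via embedding-based or lexical retrieval.
    """
    parts = ["Fix the following buggy code snippets."]
    for demo_buggy, demo_fixed in examples[:k]:
        parts.append(f"### Buggy code:\n{demo_buggy}\n### Fixed code:\n{demo_fixed}")
    parts.append(f"### Buggy code:\n{buggy_code}\n### Fixed code:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "if (a = b) return true;",
    examples=[("while (i < n);", "while (i < n) i++;")],
    k=1,
)
```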
5.2.3 Zero-shot Learning. Zero-shot learning takes the concept of few-shot learning even further
by requiring LLMs to perform program repair without any explicit examples. This is a recently
popular way to query LLMs to perform a variety of unseen tasks, where LLMs are given a task
description and must use their pre-existing knowledge and understanding to generate a response
or solution. In the APR community, zero-shot rapidly emerges following the advent of LLMs with
record-breaking parameters, exemplified by ChatGPT, as it requires a powerful foundation model
to perform human-like chats.
There are two typical development routes that utilize LLMs for program repair in a zero-shot
learning setting. The first one is cloze-style repair, i.e., reframing program repair as a cloze-style
task, and then invoking LLMs to predict partially correct code with the help of repair patterns,
such as AlphaRepair [165], GAMMA [189], FitRepair [163] and Repilot [156]. The second one is
conversational-based repair, i.e., constructing complex prompts with various valuable infor-
mation (e.g., buggy code, failure diagnostics, even execution feedback), and then chatting with
LLMs to generate correct patches, such as Pearce et al. [116], TypeFix [117], RustAssistant [22],
Zhang et al. [190], Prenner et al. [119], Sobania et al. [133], and Napoli et al. [108]. Such repair routes
usually require LLMs capable of processing long-text prompts and human-like conversations, thus
predominantly employing powerful LLMs with billion-level parameters, like ChatGPT and GPT-4.
Besides, zero-shot gets rid of training datasets, thus generalizing to various repair scenarios where
[Figure: distribution of repair scenarios addressed by LLM-based APR studies: Semantic Bug 48%, Security Vulnerability 14%, Programming Problem 9%, Static Warning 7%, Syntax Error 6%, Hardware Bug 3%, Type Error 3%, Performance Bug 2%, Smart Contract 2%, and 1% each for Crash Bug, Web UI Test, API Misuse, Test Case, Translation Bug, Motion Planner, GitHub Issue, Formal Proof, and Code Review.]
gathering training data is challenging or impossible, such as hardware bugs [1], DL programs [15]
and crash bugs [30].
Overall, fine-tuning, which uses supervised learning to adapt LLMs more closely to the specifics
of program repair, becomes popular with the emergence of early million-parameter-level models
like CodeT5. Few-shot learning demonstrates LLMs’ capability to generalize from a few examples,
popular with the emergence of subsequent billion-parameter-level models like Codex. Zero-shot
learning showcases LLMs’ capability to tackle the program repair task without any prior direct
exposure to training or examples, particularly with the emergence of recent models with tens or
hundreds of billions of parameters like ChatGPT and GPT-4.
(1) We summarize 46 different LLMs already utilized to fix bugs, and these LLMs can be
classified into three categories based on model architectures, i.e., encoder-only, encoder-
decoder, and decoder-only. (2) Decoder-only LLMs are the most frequently utilized model
architecture, and four of the top popular LLMs are decoder-only models. (3) ChatGPT,
GPT-4, CodeT5, and Codex are the most popular LLMs in existing LLM-based APR studies,
utilized 37, 25, 23, and 21 times, respectively. (4) We summarize three typical ways of
leveraging the vast knowledge encapsulated in LLMs for the specific program repair task,
i.e., fine-tuning, few-shot, and zero-shot.
cybersecurity challenges. The third highest number of studies is observed in the programming
problems domain, constituting approximately 9% of the total research volume. We also find a
growing interest in rare bug types that are usually ignored by prior work, such as static warnings
(7%), syntax errors (6%), and hardware bugs (3%). This underscores that, thanks to LLMs’ general
knowledge gleaned from vast amounts of data, researchers have begun to explore repair scenarios
not previously addressed in prior works.
knowledge-intensified fine-tuning and repair-oriented fine-tuning to help CodeT5 learn the buggy
project-specific knowledge and the cloze-style task knowledge, respectively. FitRepair then retrieves
relevant identifiers with static analysis, which are fed into fine-tuned CodeT5 to generate candidate
patches. Similarly, Repilot [156] improves the cloze-style APR AlphaRepair with a completion
engine. Repilot builds an interaction between LLMs and a completion engine to generate more
valid patches by first pruning away infeasible tokens suggested by LLMs and then completing
the token based on the suggestions provided by the completion engine. GAMMA [189] further
explores the potential of using LLMs to generate patches in a zero-shot learning scenario with a
list of well-summarized fix templates. Particularly, GAMMA attempts to address the donor code
retrieval issue of traditional template-based APR (e.g., TBar [92]) and regards patch generation
as a fill-in-the-blank task by querying LLMs to predict the correct code for masked tokens in a
pre-defined fix pattern. Unlike the cloze-style APR, Ribeiro et al. [123] frame the APR problem as a
code completion task and apply CodeGPT to fix bugs from ManySStuBs4J [72].
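The cloze-style formulation can be illustrated with the masked-span prediction interface of an encoder-decoder code model, as in the sketch below; the checkpoint, mask placement, and toy buggy line are illustrative, and approaches such as AlphaRepair and GAMMA add repair patterns and candidate re-ranking on top.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# The suspicious expression is replaced by a sentinel mask token, turning repair into
# a fill-in-the-blank task aligned with the model's pre-training objective.
masked_line = "if (<extra_id_0>) { return true; }"

inputs = tokenizer(masked_line, return_tensors="pt")
outputs = model.generate(**inputs, max_length=20, num_beams=10, num_return_sequences=10)
# Each decoded sequence proposes code for the masked span, i.e., a candidate patch.
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```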
Conversation-style APR leverages the powerful natural language and programming language
understanding capabilities of LLMs to generate patches iteratively using test failure information.
In 2023, Xia et al. [166, 167] propose ChatRepair, the first conversation-driven APR approach
that interleaves patch generation with instant feedback to perform APR in a dialogue manner.
ChatRepair constructs prompts with test failure information and queries ChatGPT to generate
correct patches from previously incorrect and plausible patches. However, ChatRepair mainly relies
on negative feedback (i.e., failure information derived from failing tests) to guide the conversations,
which may not always offer specific and adequate prompts for an effective repair. Thus, Kong et
al. [75] introduce ContrastRepair, which includes positive feedback from passing tests to supplement
the negative feedback. Given a buggy program and a failing test case, ContrastRepair generates a
similar passing test case by making minimal modifications to the failing test case. ContrastRepair
then constructs a contrastive pair to LLMs, allowing them to better pinpoint the root cause of
the bug and generate accurate patches. In the above conversation-style APR scenarios, LLMs take
a prompt containing some tokens about buggy code as the input and infer the following tokens
about patches as the output. During the conversations, all tokens in the input prompt and output
answer incur computational and financial costs, e.g., $0.03 per 1k input tokens and $0.06 per 1k generated tokens for GPT-4. To reduce the computational cost of ChatRepair, Hidvegi et al. [52] propose CigaR, a token-efficient LLM-based APR approach that concentrates on token cost
minimization of ChatGPT. CigaR designs three prompts to help ChatGPT minimize the overall token
cost with previous responses, including (1) an initiation prompt to initialize the repair process,
(2) an improvement prompt to refine partial patches, avoiding discarding potentially valuable
patches, and (3) a multiplication prompt that builds upon the already generated plausible patches to
synthesize more plausible patches with diversity maximization. Unlike previous work relying on
a fixed prompt template, RepairAgent [12] treats ChatGPT as an agent capable of autonomously
planning and executing actions to generate patches by using dynamic prompts and a state machine
to select suitable tools.
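The conversation-driven loop shared by these approaches can be sketched as follows; chat and run_tests are hypothetical placeholders for a chat-capable LLM API and dynamic test execution, not the interface of any specific tool.

```python
from typing import Callable, Optional

def conversational_repair(buggy_code: str,
                          failing_test_output: str,
                          chat: Callable[[list], str],
                          run_tests: Callable[[str], Optional[str]],
                          max_rounds: int = 5) -> Optional[str]:
    """Iteratively refine patches using test feedback; return a plausible patch or None.

    `run_tests` is assumed to return None when all tests pass, otherwise the failure message.
    """
    messages = [{"role": "user",
                 "content": f"Fix this bug.\nBuggy code:\n{buggy_code}\n"
                            f"Failing test output:\n{failing_test_output}"}]
    for _ in range(max_rounds):
        patch = chat(messages)          # candidate patch proposed by the LLM
        failure = run_tests(patch)      # dynamic validation against the test suite
        if failure is None:
            return patch                # plausible patch found
        # Feed the new failure diagnostics back into the dialogue.
        messages.append({"role": "assistant", "content": patch})
        messages.append({"role": "user",
                         "content": f"The patch still fails:\n{failure}\nPlease try again."})
    return None
```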
6.1.4 Empirical Study. In addition to the above novel techniques, researchers have conducted
numerous empirical studies to explore the capabilities of LLMs in fixing semantic bugs. As early as
2021, Mashhadi et al. [101] preliminarily evaluate the performance of fine-tuning CodeBERT for
fixing Java bugs from ManySStuBs4J. Lajko et al. [77] empirically fine-tune GPT-2 to automatically
generate candidate patches for JavaScript bugs. Prenner et al. [119] investigate Codex in fixing QuixBugs in a zero-shot setting, and Sobania et al. [133] utilize a more powerful LLM, ChatGPT,
with prompt engineering to generate patches for QuixBugs. In 2023, Horvath et al. [54] explore the
impact of model architectures and program representations, involving two popular programming
languages, i.e., Java and JavaScript, three different code representations, i.e., raw text, command
sequences, and ASTs, and four LLMs, i.e., T5, CodeT5, RoBERTa, and GPTNeo. Meanwhile, Zhao et
al. [198] conduct a comprehensive investigation into the effective utilization of LLMs for repairing
code review bugs, involving two LLMs (i.e., ChatGPT and GPT-4) in a zero-shot learning manner,
and six LLMs (InCoder, CodeT5+, CodeFuse, LLaMA, CodeGen-2, CodeLLaMA) in a fine-tuning
manner. Recently, inspired by the phenomenon that grammatical errors in natural language can be
fixed by round-trip translation, i.e., translating sentences to another intermediate language and
then back, Ruiz et al. [126] investigate to what extent the RTT pipeline can fix bugs in code in a
zero-shot fashion. The comprehensive study involves eight LLMs: PLBART, CodeT5, TransCoder,
SantaCoder, InCoder, StarCoderBase, GPT-3.5, and GPT-4, and four APR benchmarks: Defects4J-v1.2,
Defects4J-v2.0, QuixBugs, and HumanEval-Java.
Furthermore, more comprehensive empirical studies are published in top-tier SE conferences.
Xia et al. [164] conduct a comprehensive study to explore the performance of LLMs in program
repair, involving nine LLMs from two categories (i.e., infilling and generative models), and five
datasets across three programming languages. They explore three ways to use LLMs for patch
generation: complete function generation, correct code infilling, and single-line generation. Mean-
while, Jiang et al. [65] empirically explore the fixing capabilities of ten variants from four LLMs
under zero-shot and fine-tuning settings, involving four Java benchmarks. They also construct a
new benchmark, HumanEval-Java, which none of the LLMs has seen during training to address the
data leakage issue. Huang et al. [61] conduct an empirical study on fixing capabilities of LLMs in
the fine-tuning paradigm, involving five LLMs, three programming languages, and three repair
scenarios.
LLMs in a fine-tuning manner on two real-world Java vulnerability datasets. Recently, Le et al. [78]
conduct a preliminary study of ChatGPT and Bard in detecting and fixing security vulnerabilities
in JavaScript programs.
type errors in a zero-shot manner. Similar to GAMMA, TypeFix queries LLMs to generate patches
by filling the masks in code prompts using fix templates. However, TypeFix distinguishes itself
by automatically mining these templates through a hierarchical clustering algorithm, rather than
relying on predefined ones in GAMMA. Furthermore, Ribeiro et al. [124] present Mentat, a type
error repair technique for OCaml programs powered by GPT-3. Mentat first analyzes the source code
to generate contextually relevant prompts, then exploits GPT-3’s advanced language understanding
and generation capabilities to produce potential patches.
framework encompasses data generation through reverse engineering, a search engine to enhance
the retrieval-augmented generation, and a fine-tuning approach to train retrieval-augmented LLMs.
Similarly, Fu et al. [40] focus on the hardware design iteration process and introduce LLM4SECHW,
an LLM-based hardware debugging framework. LLM4SECHW constructs a hardware debugging-
oriented dataset from open-source hardware projects and fine-tunes a suite of hardware domain-
specific LLMs capable of automatically reading hardware designs and fixing bugs.
using both natural and programming languages. DrPlanner then queries GPT-4 to repair motion planning algorithms with continuous diagnostic feedback in a closed-loop manner.
Software Formal Proof. Formal software verification aims to validate the correctness of software
properties. First et al. [36] introduce Baldur, an LLM-based approach to automate the generation and
repair of whole formal proofs. Baldur leverages two versions of Minerva [81], which is pre-trained
on a mathematics corpus based on the PaLM [20]: one with eight billion parameters and another
with 62 billion parameters. Baldur first fine-tunes a generation model to synthesize whole proofs
for theorems on a proof dataset, and then fine-tunes a repair model based on the proof assistant’s
error messages to repair incorrectly generated proofs.
GitHub Issue. Unlike most prior work [16], which evaluates LLMs in fixing self-contained
problems, such as programming problems, Jimenez et al. [68] explore the potential of LLMs to
resolve GitHub issues in a realistic software engineering setting. Given an issue (such as a bug report
or a feature request) submitted to popular GitHub Python repositories, they fine-tune CodeLlama-7B
and CodeLlama-13B to generate a patch that passes the unit and system tests.
Code Review Refinement. Guo et al. [48] conduct the first empirical study to explore the
potential of ChatGPT in code review, specifically focusing on automated code refinement based
on existing code reviews. They leverage prompt engineering to compare ChatGPT with CodeRe-
viewer [87] using two datasets: an established one named CodeReview [87] and a newly introduced
one named CodeReview-New. They also design several strategies to improve the performance of
ChatGPT, such as using more advanced models.
Overall, we observe that LLMs have been applied in a wide array of repair scenarios in the
literature, involving 18 bug types. In some common scenarios dominated by traditional APR,
such as semantic bugs, researchers continue to invest substantial efforts in investigating the
application of LLMs. Besides, thanks to LLMs’ general knowledge learned from all possible
Internet data, LLM-based APR has been extended to some rare scenarios that are previously
unexplored, such as hardware bugs and Web UI.
[Figure: usage counts of evaluation benchmarks across the collected studies (bar chart; Defects4J, QuixBugs, BFP, CVEfixes, and Big-Vul are the most frequently used) and the distribution of input forms: Prompt Input 52%, Raw Input 18%, Conversation-Style Input 18%, Mask Input 9%, Structure-Aware Input 3%.]
We classify these new benchmarks into three categories. The first category of datasets is tailored for rare repair
scenarios. As discussed in Section 6, LLMs have been used in a variety of scenarios, some of which
have not been considered by previous work, leading to a gap in relevant benchmarks. As a result,
with the advent of LLM-based APR techniques, researchers have also developed corresponding
new datasets, such as TFix [7] for static warnings, DeepDev-PERF [41] for performance bugs, Du et
al. [30] for crash bugs, and Zhang et al. [192] for API misuse. The second category of datasets attempts to address the limitations of previous benchmarks. For example, considering that BFP [144] lacks test suites, FixEval [51] offers a collection of unit tests for a large set of competitive programming
problems and is evaluated with PLBART and CodeT5. The third category of datasets is designed to
address the issues unique to LLMs, particularly the data leakage problem. For example, Jiang et
al. [65] create a new evaluation benchmark, HumanEval-Java, that has not been seen by LLMs
during pre-training. Zhang et al. [190] extensively explore the data leakage issue of ChatGPT in
the APR domain and introduce EvalGPTFix, a new benchmark from competitive programming
problems after the training cutoff point of ChatGPT. DebugBench [140] is a follow-up of EvalGPTFix
with a larger scale and more diverse types of bugs. DebugBench contains 4,253 buggy programs
from the LeetCode community, covering four major bug categories and 18 minor types in C++,
Java, and Python. Similarly, ConDefects [162] contains 1,254 Java faulty programs and 1,625 Python
faulty programs from the online competition platform AtCoder. These collected programs are
produced between October 2021 and September 2023 to address the data leakage issue for LLM-
based APR approaches. Different from the aforementioned benchmarks derived from programming
problems [47], Silva et al. [132] introduce GitBug-Java, a reproducible benchmark comprising 199
recent Java bugs. These bugs are extracted from the 2023 commit history of 55 notable open-source
repositories to mitigate the risk of data leakage.
7.2 What input forms are software bugs transformed into when utilizing LLMs?
Thanks to the powerful natural language understanding capabilities of LLMs, the inputs of LLM-based APR contain rich information and are thus more complex than those of traditional APR techniques [185].
We summarize various input forms into five categories according to their data types. As illustrated
in Fig. 8, we find 52% of collected papers leverage prompt engineering to feed LLMs with bug-fixing
information, and 18% utilize a conversational-style representation to provide dynamic information.
We also find only 18% of LLM-based APR studies adopt raw bug-fixing inputs in a manner similar
to traditional DL/ML models [185]. We will discuss the five input representations utilized by LLMs
as follows.
❶Raw Bug-fixing Input. Similar to most traditional learning-based APR, this type of input
regards APR as an NMT task, which translates a sentence from one source language (i.e., buggy
code) to another target language (i.e., fixed code). Such representation directly feeds LLMs with
the buggy code snippet and has typically been employed to train LLMs with supervised learning for semantic bugs [28, 101, 206], security vulnerabilities [39, 188], and static warnings [73]. For
example, Zhang et al. [188] investigate the performance of three bug-fixing representations (i.e.,
context, abstraction, and tokenization) to fine-tune five LLMs for vulnerability repair.
❷Prompt Input. This type of input incorporates more information to the buggy code. The
prompt concatenates different input components with some prefixed prompt, thus effectively bridg-
ing the gap between pre-training tasks and the APR downstream task. For example, CIRCLE [179] utilizes a manually designed prompt template to convert buggy code and its corresponding context into a unified fill-in-the-blank format. Particularly, they utilize “Buggy line:” and “Context:” to denote the buggy and contextual code, and “The fixed code is:” to query a T5-based model to generate candidate patches according to the previous input (a minimal sketch of this prompt style is shown after this list). Besides, TFix [7], Zirak et al. [206] and
Kim et al. [73] represent all valuable information about the bug as a single piece of text, including
bug type, bug message, bug line, and bug context. Furthermore, InferFix [69] and RAP-Gen [153]
construct prompts by retrieving relevant repair examples from an external codebase.
❸Mask Input. This type of input masks the buggy code and queries LLMs to fill the masks
with the correct code tokens. Unlike the above input forms, the mask input reformulates the
APR problem as a cloze-style task and directly leverages LLMs’ pre-training objectives in a zero-
shot setting. AlphaRepair [165] is considered the first work to demonstrate the potential of mask
inputs, and researchers have proposed various follow-ups to better perform mask prediction for
patch generation, such as GAMMA [189] with well-summarized repair patterns, FitRepair [163]
with the plastic surgery hypothesis, Repilot [156] with a completion engine, as well as empirical
studies [65, 164].
❹Conversation-Style Representation. This type of input further extends the prompt input
with feedback-driven chats like humans. Conversation-style representation contains more complex
information, such as dynamic execution results, while iteratively improving generated patches
through multiple rounds of dialogue. For example, Sobania et al. [133] conduct an early explo-
ration into the feasibility of leveraging ChatGPT’s conversational capabilities for program repair,
motivating some follow-ups [167, 190].
❺Structure-Aware Input. This type of input represents source code as syntactic structures, such
as Abstract Syntax Trees (ASTs). For example, Horvath et al. [54] utilize RoBERTa and GPTNeo to
encode ASTs for program repair. Besides, VulMaster [201] utilizes the AST as part of its input to
capture the structural aspects of the vulnerable code.
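To make the prompt input (❷ above) concrete, the sketch below assembles a prompt using the prefixes reported for CIRCLE; the helper function, wording, and toy snippet are illustrative rather than a faithful reproduction of any specific tool.

```python
def build_repair_prompt(buggy_line: str, context: str) -> str:
    # Prefixed natural-language cues bridge the gap between the model's pre-training
    # format and the fill-in-the-blank repair task.
    return (f"Buggy line: {buggy_line}\n"
            f"Context: {context}\n"
            f"The fixed code is:")

prompt = build_repair_prompt(
    buggy_line="if (a = b) return true;",
    context="boolean equal(int a, int b) { if (a = b) return true; return false; }",
)
```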
For example, PATCH-SIM [170] assesses the correctness of patches by calculating the similarity of
dynamic execution traces. PATCH-SIM is acknowledged as a foundational work in the APCA field,
providing crucial guidance for the development of follow-up works, particularly learning-based or
LLM-based ones [185].
Table 5. A summary of existing APR studies using LLMs to predict patch correctness.
Table 5 presents existing APCA studies involving LLMs. We summarize them into three stages.
❶LLMs as Feature Extractor. In 2020, Tian et al. [137, 138] empirically explore the performance
of code embeddings via representation learning models in reasoning about patch correctness.
Following the similarity-based pipeline from PATCH-SIM, they first calculate the similarities of
patched and buggy code snippets based on code embeddings, and then predict patch correctness
with a binary classifier. They consider four embedding models, including re-trained (i.e., Doc2vec,
code2vec and CC2vec) and pre-trained models (i.e., BERT), which is the first APCA study empowered
with LLMs. Recently, Le et al. [79] propose Invalidator, to assess the correctness of patches via
semantic and syntactic reasoning. Similar to Tian et al. [137], they utilize CodeBERT to extract
code features and train a classifier for prediction. Unlike the above studies calculating similarities
of patches, Tian et al. [139] formulate APCA as a question-answering (QA) problem and propose
Quatrain. Quatrain first utilizes CodeBERT to encode bug reports and patch descriptions and trains
a QA model for prediction.
❷Fine-tuning LLM-based APCA. In 2024, Zhang et al. [186] propose APPT, equipped with
BERT as the encoder stack, followed by an LSTM stack and a deep learning classifier. Unlike previous
studies [137, 138] limiting BERT to extract features without benefiting from training, APPT further
fine-tunes LLMs in conjunction with other components as a whole pipeline to fully adapt it
specifically for reasoning about patch correctness. APPT is implemented with BERT by default and
is also proven generalizable to other advanced LLMs, such as CodeBERT and GraphCodeBERT.
❸Zero-shot LLM-based APCA. In 2023, Zhou et al. [202] propose PatchZero to explore the
feasibility of LLMs in predicting patch correctness with a zero-shot setting. PatchZero directly
queries LLMs to generate the next token about patch correctness (i.e., a token either “correct” or
“overfitting”) based on previous tokens, which is similar to LLMs’ original pre-training objective.
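A minimal PyTorch sketch of the encoder-plus-classifier design behind the first two stages is given below; the pooling strategy, hidden sizes, and input formatting are simplifying assumptions, not APPT's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PatchCorrectnessClassifier(nn.Module):
    """Encode a (buggy, patched) pair with a pre-trained encoder, then classify correctness."""

    def __init__(self, encoder_name: str = "microsoft/codebert-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)    # fine-tuned end to end
        hidden = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)                # correct vs. overfitting

    def forward(self, buggy: str, patched: str) -> torch.Tensor:
        inputs = self.tokenizer(buggy, patched, return_tensors="pt",
                                truncation=True, max_length=512)
        token_states = self.encoder(**inputs).last_hidden_state   # (1, seq_len, hidden)
        lstm_out, _ = self.lstm(token_states)
        # Mean-pool the LSTM states and predict a correctness label.
        return self.classifier(lstm_out.mean(dim=1))
```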
7.4 How are LLMs utilized to facilitate both code generation and program repair?
Compared with traditional APR approaches [185] that usually employ heuristics or neural networks
to generate a multitude of patches in one go, LLMs can iteratively refine the generated patches
based on the outcomes of dynamic execution. As mentioned in Section 5.2, the iterative patch
generation capabilities brought by LLMs have facilitated the emergence of conversational-based
repair techniques [22, 108, 116, 117, 119, 133, 190]. In addition to conversational-based repair, we
have identified some efforts that improve code generation performance by repairing LLM-generated
code with feedback, referred to as self-repair. Different from conversational-based repair, which
utilizes feedback for patch generation, self-repair approaches leverage LLMs to identify code errors
implemented by themselves via investigating execution results and explaining the generated code.
Such self-repair approaches integrate code generation and program repair, which can be considered
a further step toward automated programming.
Table 6. A summary of existing studies that use APR to boost code generation.
Table 6 presents some LLM-based studies that leverage program repair to boost code generation.
Self-Edit [183] represents the first attempt to adopt a neural code editor that takes both the generated
code and error messages as inputs to improve the code quality on the competitive programming
task. Self-Edit is evaluated with both fine-tuned models (i.e., PyCodeGPT, GPT-Neo, CodeGen, InCoder, GPT-J) and prompt-based LLMs (i.e., InCoder, CodeGen, Codex). Chen et al. [17]
from DeepMind propose Self-Debugging to teach LLMs to debug their own predicted code via
few-shot prompting, including Codex, ChatGPT, GPT-4 and StarCoder. There also exist some similar
self-repair studies, including OpenCodeInterpreter [199], Cycle [26], LDB [200], SelfEvolve [67],
Self-Refine [98], Hu et al. [56], AgentCoder [60]. Recently, Olausson et al. [111] conduct an empirical
study to investigate the ability of CodeLlama, GPT-3.5, and GPT-4 to perform self-repair in code
generation. They find that self-repair is not a panacea for code generation challenges, as existing
LLMs often fail to provide reliable, accurate, and valuable feedback on why the code is incorrect.
7.5 How often do the collected LLM-based APR papers provide publicly available
artifacts?
Open science plays a crucial role in advancing scientific progress through principles of transparency,
reproducibility, and applicability. Given the benefits, the SE community has been actively promoting
open science principles and encouraging all researchers to share their artifacts, thereby bolstering
the reliability of research findings. In this section, we investigate the extent to which the analyzed
papers make their artifacts publicly accessible.
We find that 80 studies provide the replication packages in their papers, accounting for 62.99%
(80/127) of all collected studies. Among 78 studies that propose novel LLM-based APR approaches,
which is the largest contribution type in Table 4, we find that 53.85% (42/78) of them fail to make
their artifacts publicly available. This makes it difficult for researchers to validate experimental
findings, conduct quantitative comparisons with existing studies, and build follow-ups instead of
reinventing the wheels. Considering that some papers have not been published, we then focus on
top-tier SE venues, i.e., ICSE, ASE, FSE, ISSTA, TSE, and TOSEM, and identify that 86.84% of papers (33/38) make related artifacts publicly available, indicating a strong commitment to reproducibility among high-quality papers. Besides, some studies only provide datasets or trained models without source
code or essential instructions. Overall, open science remains a critical challenge in advancing LLM-based APR research because many factors, such as datasets, data pre-processing methods, source code, hyper-parameters, and documentation, affect the reproducibility of studies. Therefore,
we hope that researchers in the LLM-based APR community can provide high-quality open-source
artifacts for convenient reproduction.
(1) We summarize 78 different datasets that are utilized to benchmark LLMs in fixing bugs.
(2) Defects4J, QuixBugs, BFP, CVEfixes, and Big-Vul are most frequently adopted in the
LLM-based APR. (3) We categorize the input forms within all collected papers into five
groups: raw bug-fixing input, prompt input, mask input, conversation-style input, and
structure-aware input. (4) Prompt input is the most frequently used form in applying LLMs
to program repair, indicating that designing effective prompts is particularly important for
leveraging LLMs’ natural language processing capabilities. (5) We summarize some studies
that leverage LLMs to predict patch correctness. (6) 62.99% of all collected papers have made
their artifacts open source, and the ratio increases to 86.84% for top-tier SE publications.
even more) of parameters is a highly time-consuming and resource-intensive process. The GPU
resources required for such training are often prohibitively expensive, making them inaccessible
for many researchers in both academic and industrial settings. For example, in Section 5.2, we
observe that most fine-tuning-LLM-based APR studies utilize CodeT5/T5 or similar-sized models,
except for InferFix, which fine-tunes Codex. However, InferFix is proposed by the world-leading technology company Microsoft and trained with industrial-grade hardware, i.e., 64 32-GB V100 GPUs. Second, despite the increased likelihood of generating correct patches
with larger models, the patch generation time cost also increases. For example, Jiang et al. [65]
demonstrate that PLBART takes 0.70–0.89 seconds on average to generate a correct patch, whereas CodeGen, although fixing more bugs than PLBART thanks to its larger parameter count, requires 3.64–13.88 seconds on average to generate a correct patch. Third, the increase in patch generation time further compresses
the time available for patch validation, as developers need to spend more time waiting for the
model to complete its inference. For example, Shi et al. [130] demonstrate that the vast storage and
runtime memory consumption, coupled with high inference latency, make these LLMs prohibitive
for integration in modern IDEs, especially on resource-constrained or real-time terminal devices.
In the future, to address the first challenge, it is promising to explore the potential of parameter-
efficient fine-tuning approaches on APR, such as prefix-tuning and low-rank adaptation [25]. To
address the second challenge, researchers can optimize the size of LLMs without significantly com-
promising their performance, such as model pruning, quantization, and knowledge distillation [130].
To address the third challenge, we recommend boosting patch validation with advanced strategies,
such as mutation testing [168], or utilizing LLMs to rank candidate patches before validation.
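As an example of the first direction, low-rank adaptation could be applied to a repair model roughly as sketched below using the peft library; the rank, target modules, and checkpoint are assumed values for illustration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# Only small low-rank adapter matrices are trained; the original weights stay frozen,
# which sharply reduces the GPU memory and storage cost of per-task fine-tuning.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q", "v"])   # T5-style attention projections
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the full model
```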
Human Study with LLMs. As summarized in Section 6, with the introduction of LLMs, APR
has achieved groundbreaking progress in terms of the number of correctly fixed bugs in popular
benchmarks, exemplified by the advancements on Defects4J-1.2 from 64 bugs by CIRCLE, to 74
bugs by AlphaRepair, and 82 bugs by GAMMA. The progress prompts us to consider whether
LLMs have indeed facilitated improvements in real-world debugging and how LLM-based APR has
broader implications for developers’ daily activities. Previous research [159, 191] highlights the
potential pitfalls of developing APR tools without adequate feedback from developers, which could
compromise their effectiveness in real-world deployment. However, there remains a significant
gap in our understanding of how software engineers tackle software problems in practical settings,
including their use of dedicated debugging tools and their expertise in debugging techniques.
Thus, in the future, researchers should conduct human studies to gain deeper insights into the
maturity and reliability of LLM-based APR tools in terms of human factors. Possible directions
are to investigate whether LLMs can assist developers in reducing the debugging cost, such as
fixing more bugs, accelerating the bug-fixing process, and handling more complex bugs. Besides,
it would be valuable to investigate developers’ perceptions and interactions with LLMs based on
their practical experiences and established debugging practices.
Exploring More and Rare Repair Scenarios. As summarized in Section 6, we observe
that most existing LLM-based APR studies are concentrated on a limited number of bug types,
particularly semantic bugs. However, there exist some rare repair scenarios that benefit less from
LLMs, such as hardware bugs (one paper) and concurrency bugs (zero papers). The key challenge
lies in insufficient training data from which LLMs can learn.
We suggest that future work concentrates on three possible directions to broaden the scope of
LLM applications for more repair scenarios, such as software requirement [169] and fuzzing [194].
First, transfer learning is an effective training approach for rare scenarios. We can first fine-tune
LLMs with abundant data in a source scenario and then utilize a small amount of data to transfer
the acquired knowledge to a target scenario. The source and target scenarios should have similar
data distributions. For example, Zhang et al. [188] demonstrate that transferring learning from
bug-fixing can improve the vulnerability repair performance of five LLMs by 9.40% on average.
Second, it is promising to utilize the cloze-style APR with repair patterns to generate patches
for rare scenarios. Unlike fine-tuning, which requires a substantial amount of labeled data, this
approach directly leverages expert domain knowledge (such as pre-defined patterns) to guide the
general pre-trained knowledge of LLMs. Besides, for scenarios not previously encountered by LLMs,
researchers can employ unlabeled data in the target scenario to learn the data distribution of the
project or language under test in an unsupervised learning setting. Third, in-context learning and
prompt engineering are feasible solutions for directly querying billion-parameter LLMs to generate
correct code in a target scenario, given that these models are immensely large and encompass
virtually all data available on the Internet.
Integration with Off-the-Shelf APR. As mentioned in Section 6, researchers typically utilize
LLMs as core backbones to design novel repair approaches in an end-to-end setting, such as
sequence-to-sequence learning [7, 179]. In parallel, the community has seen some explorations
treating LLMs as components integrated into existing repair workflow. These studies attempt
to boost the capabilities of off-the-shelf APR approaches instead of proposing new techniques.
For example, CURE [66] combines GPT and CoCoNut [96] to capture code syntax for the APR
task. Built on top of DLFix [84], DEAR [85] attempts to fix multi-hunk, multi-statement bugs by
fine-tuning BERT to learn fixing-together relationships among statements, i.e., identifying whether
two statements need to be fixed together. Recently, GAMMA [189] integrates LLMs into the
traditional template-based APR TBar by querying LLMs to generate masked code tokens instead of
retrieving donor code from local files. These efforts demonstrate the potential of integrating LLMs
with off-the-shelf APR techniques, yet there is currently a lack of more in-depth work in this area.
In the future, researchers could attempt to combine LLMs with more traditional APR techniques.
For example, it is promising to utilize LLMs to help SMT solvers generate patches for constraint-
based APR, or to feed search algorithms with LLM-generated candidate patches to build the search space
for heuristic-based APR. Besides, domain-specific repair techniques can benefit from the powerful
code-understanding capabilities of LLMs, thus extending to a broader range of repair scenarios.
For example, we can design fix templates for specific scenarios, such as static warnings, and then
utilize the general knowledge contained in LLMs to generate correct patches.
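To illustrate the last point, the sketch below shows one way a fix template can be combined with the
mask-prediction ability of an LLM in a GAMMA-like manner. It assumes the Hugging Face transformers
library and the microsoft/codebert-base-mlm masked language model; the buggy line and the "mutate
relational operator" template are made-up examples, and each predicted token merely yields a
candidate patch that still must be validated against the test suite.

# A minimal cloze-style sketch: a TBar-like fix template leaves a hole (<mask>)
# in the patched line, and a masked language model proposes tokens for the hole.
# The checkpoint and the example template are illustrative assumptions.
from transformers import pipeline

fill = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

buggy_line = "if (index > data.length) return null;"
# Template "mutate relational operator": mask the operator instead of
# retrieving donor code from local files.
masked_patch = "if (index <mask> data.length) return null;"

for cand in fill(masked_patch, top_k=5):
    # Each filled token gives one concrete candidate patch.
    print(f"{cand['token_str'].strip():>3}  ->  {cand['sequence']}")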
Data Leakage Issue. As highlighted in Section 7.1, Zhang et al. [190] identify that existing
repair benchmarks have been inadvertently included in the pre-training data of popular LLMs, such
as ChatGPT, through web scraping and other methods. For example, ChatGPT is able to enumerate
all projects within Defects4J [71], one of the most popular APR benchmarks. Researchers [189] can
ascertain such exposure for open-source LLMs by inspecting their pre-training data against existing
benchmarks; doing so is significantly more challenging for powerful black-box LLMs due to the lack
of training details. Although a few benchmarks have recently been constructed to avoid leakage,
these preliminary explorations are mainly limited in the number of involved bugs and the variety of
bug types. For example, all of them are created from programming problems
with only small-scale or medium-scale buggy solutions, without delving into large-scale, real-
world projects that encompass complex API calls, such as Defects4J [71]. More importantly, data
leakage issues in other repair scenarios, such as security vulnerabilities and API misuse, continue
to be overlooked. There may be overlaps among datasets across different repair scenarios, such as
Defects4J, which also serves as a source for a subset of the API misuse dataset [192]. Overall, the
risk of data leakage introduces bias into the benchmarking of existing work, necessitating urgent
efforts from researchers to mitigate it.
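As a rough illustration of what such an inspection could look like for open-source LLMs whose
pre-training corpora are released, the following self-contained sketch flags benchmark patches whose
token shingles are largely contained in the corpus. The corpus, benchmark, 8-token shingle size, and
0.5 containment threshold are all placeholder assumptions rather than a validated leakage detector.

# A minimal leakage-inspection sketch: flag benchmark samples whose token
# shingles are mostly contained in the (openly released) pre-training corpus.
# The corpus, benchmark, shingle size, and threshold are placeholder assumptions.
def shingles(code, n=8):
    toks = code.split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def containment(sample, corpus_index):
    s = shingles(sample)
    return len(s & corpus_index) / max(len(s), 1)

corpus_index = set()
for doc in ["int add(int a, int b) { return a + b; }"]:   # placeholder corpus
    corpus_index |= shingles(doc)

benchmark = ["int add(int a, int b) { return a + b; }"]   # placeholder benchmark patch
flagged = [b for b in benchmark if containment(b, corpus_index) > 0.5]
print(f"{len(flagged)}/{len(benchmark)} benchmark samples appear to overlap the corpus")

In practice, scalable exact or near-duplicate matching (e.g., suffix arrays or MinHash) would
replace this toy set intersection.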
We recommend that future work be conducted in the following two directions. First, it
is crucial to construct a large-scale benchmark free from data leakage that contains real-world
projects so as to evaluate the actual fix capabilities of LLMs in a more practical debugging scenario.
Commercial closed-source software, real-time updated programming websites, or manually written
programs may serve as potential data sources. Second, given the variety of bug types that LLMs
have been applied to, researchers need to account for and attempt to mitigate the data leakage
risk when conducting related studies.
9 CONCLUSION
Automated Program Repair (APR) tackles the long-standing challenge of fixing software bugs
automatically, thus facilitating software testing, validation, and debugging practices. Very recently,
Large Language Models (LLMs) have brought significant changes to the APR domain, already
yielding impressive progress and further demonstrating a promising future in follow-up research.
In this paper, we provide a systematic literature review of existing LLM-based APR techniques from
LLMs, APR, and their integration perspectives. We summarize popular LLMs, typical utilization
strategies, and repair scenarios. We also discuss some crucial factors, such as input forms and
self-debug metrics, within the LLM-based APR community. Finally, we outline several challenges,
such as data leakage issues, and suggest potential directions for future research.
ACKNOWLEDGMENTS
This work is supported partially by the National Natural Science Foundation of China (61932012,
62141215, 62372228).
REFERENCES
[1] Baleegh Ahmad, Shailja Thakur, Benjamin Tan, Ramesh Karri, and Hammond Pearce. 2024. On Hardware Security
Bug Code Fixes By Prompting Large Language Models. IEEE Transactions on Information Forensics and Security (2024).
Early Access, DOI: 10.1109/TIFS.2024.3374558.
[2] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-Training for Program
Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2655–2668.
[3] Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. SynShine: Improved Fixing of Syntax Errors.
IEEE Transactions on Software Engineering 49, 4 (2022), 2169–2181.
[4] Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. 2023. A3test: Assertion-Augmented Automated
Test Case Generation. arXiv preprint arXiv:2302.10352 (2023).
[5] Kamel Alrashedy and Abdullah Aljasser. 2023. Can LLMs Patch Security Issues? arXiv preprint arXiv:2312.00024
(2023).
[6] Bandit. 2024. A Static Tool to Find Common Security Issues in Python Code. URL: https://github.com/PyCQA/bandit.
Last accessed: 2024-04-01.
[7] Berkay Berabi, Jingxuan He, Veselin Raychev, and Martin Vechev. 2021. TFix: Learning to Fix Coding Errors with a
Text-to-Text Transformer. In International Conference on Machine Learning. PMLR, 780–791.
[8] Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: Automated Collection of Vulnerabilities and Their
Fixes from Open-source Software. In Proceedings of the 17th International Conference on Predictive Models and Data
Analytics in Software Engineering. 30–39.
[9] Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor
Leahy, Kyle McDonell, Jason Phang, et al. 2022. GPT-Neox-20b: An Open-Source Autoregressive Language Model.
In Proceedings of BigScience Episode# 5–Workshop on Challenges & Perspectives in Creating Large Language Models.
95–136.
[10] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive
Language Modeling with Mesh-Tensorflow. URL: https://doi.org/10.5281/zenodo.5297715. Last accessed: 2024-04-01.
[11] CO Boulder. 2013. Failure to Adopt Reverse Debugging Costs Global Economy $41 Billion Annually. https://totalview.
io/press-releases/university-cambridge-study-failure-adopt-reverse-debugging-costs-global-economy-41 Last
accessed: 2024-04-01.
[12] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent
for Program Repair. arXiv preprint arXiv:2403.17134 (2024).
[13] Tom Britton, Lisa Jeng, Graham Carver, Paul Cheak, and Tomer Katzenellenbogen. 2013. Reversible Debugging
Software: Quantify the Time and Cost Saved Using Reversible Debuggers. Judge Bus. School, Univ. Cambridge,
Cambridge, UK, Tech. Rep 229 (2013).
[14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models Are Few-Shot Learners. In Advances in
Neural Information Processing Systems, Vol. 33. 1877–1901.
[15] Jialun Cao, Meiziniu Li, Ming Wen, and Shing-chi Cheung. 2023. A Study on Prompt Design, Advantages and
Limitations of ChatGPT for Deep Learning Program Repair. arXiv preprint arXiv:2304.08191 (2023).
[16] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison
Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy
Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias
Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino,
Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse,
Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever,
and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374
(2021).
[17] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug.
In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=KuPixIqPiq
[18] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus.
2019. Sequencer: Sequence-To-Sequence Learning for End-To-End Program Repair. IEEE Transactions on Software
Engineering 47, 9 (2019), 1943–1959.
[19] Yiu Wai Chow, Luca Di Grazia, and Michael Pradel. 2024. PyTy: Repairing Static Type Errors in Python. In Proceedings
of the 46th International Conference on Software Engineering. 871–871.
[20] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson,
Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan
Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai,
Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou,
Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling Language Modeling with Pathways. Journal of
Machine Learning Research 24, 240 (2023), 1–113.
[21] David de Fitero-Dominguez, Eva Garcia-Lopez, Antonio Garcia-Cabot, and Jose-Javier Martinez-Herraiz. 2024.
Enhanced Automated Code Vulnerability Repair Using Large Language Models. arXiv preprint arXiv:2401.03741
(2024).
[22] Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, and Aseem Rastogi. 2023. Fixing Rust Compilation Errors Using
LLMs. arXiv preprint arXiv:2308.05177 (2023).
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). Association for
Computational Linguistics, 4171–4186.
[24] Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, and Shuvendu K Lahiri. 2022. TOGA: A Neural Method for Test
Oracle Generation. In Proceedings of the 44th International Conference on Software Engineering. ACM, 2130–2141.
[25] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min
Chan, Weize Chen, et al. 2023. Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models. Nature
Machine Intelligence 5, 3 (2023), 220–235.
[26] Yangruibo Ding, Marcus J Min, Gail Kaiser, and Baishakhi Ray. 2024. CYCLE: Learning to Self-Refine the Code
Generation. arXiv preprint arXiv:2403.18746 (2024).
[27] Tung Do Viet and Konstantin Markov. 2023. Using Large Language Models for Bug Localization and Fixing. In 2023
12th International Conference on Awareness Science and Technology. IEEE, 192–197.
[28] Dawn Drain, Colin B Clement, Guillermo Serrato, and Neel Sundaresan. 2021. DeepDebug: Fixing Python Bugs Using
Stack Traces, Backtranslation, and Code Skeletons. arXiv preprint arXiv:2105.09352 (2021).
[29] Dawn Drain, Chen Wu, Alexey Svyatkovskiy, and Neel Sundaresan. 2021. Generating Bug-Fixes Using Pretrained
Transformers. In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming. 1–8.
[30] Xueying Du, Mingwei Liu, Juntao Li, Hanlin Wang, Xin Peng, and Yiling Lou. 2023. Resolving Crash Bugs Via Large
Language Models: An Empirical Study. arXiv preprint arXiv:2312.10448 (2023).
[31] Thomas Durieux and Martin Monperrus. 2016. DynaMoth: Dynamic Code Synthesis for Automatic Program Repair.
In Proceedings of the 11th International Workshop on Automation of Software Test. 85–91.
[32] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023.
Large Language Models for Software Engineering: Survey and Open Problems. arXiv preprint arXiv:2310.03533 (2023).
[33] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes
and CVE Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories. 508–512.
[34] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of
Programs from Large Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering.
IEEE, 1469–1481.
[35] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In
Findings of the Association for Computational Linguistics. 1536–1547.
[36] Emily First, Markus Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large
language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on
the Foundations of Software Engineering. 1229–1241.
[37] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke
Zettlemoyer, and Mike Lewis. 2023. InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh
International Conference on Learning Representations. https://openreview.net/forum?id=hQwb-lbM6EL
[38] Michael Fu, Van Nguyen, Chakkrit Tantithamthavorn, Dinh Phung, and Trung Le. 2023. Vision Transformer-Inspired
Automated Vulnerability Repair. ACM Transactions on Software Engineering and Methodology (2023).
[39] Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. 2022. VulRepair: A T5-Based
Automated Software Vulnerability Repair. In the ACM Joint European Software Engineering Conference and Symposium
on the Foundations of Software Engineering. ACM, 935–947.
[40] Weimin Fu, Kaichen Yang, Raj Gautam Dutta, Xiaolong Guo, and Gang Qu. 2023. LLM4SecHW: Leveraging domain-
specific large language model for hardware debugging. In 2023 Asian Hardware Oriented Security and Trust Symposium
(AsianHOST). IEEE, 1–6.
[41] Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B Clement, Neel Sundaresan, and Chen Wu. 2022. DeepDev-
PERF: A Deep Learning-Based Approach for Improving Software Performance. In Proceedings of the 30th ACM Joint
European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 948–958.
[42] Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2023. Rapgen: An Approach for Fixing Code
Inefficiencies in Zero-Shot. arXiv preprint arXiv:2306.17077 (2023).
[43] Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions
on Software Engineering 45, 1 (2019), 34–67.
[44] Haotong Ge and Yuemeng Wu. 2023. An Empirical Study of Adoption of ChatGPT for Bug Fixing among Professional
Developers. Innovation & Technology Advances 1, 1 (Jun. 2023), 21–29. https://doi.org/10.61187/ita.v1i1.19
[45] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal
Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). 7212–7225.
[46] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy,
Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang,
and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In Proceedings of the 9th
International Conference on Learning Representations. 1–18.
[47] Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue
Wang, et al. 2024. CodeEditorBench: Evaluating Code Editing Capability of Large Language Models. arXiv preprint
arXiv:2404.03543 (2024).
[48] Qi Guo, Junming Cao, Xiaofei Xie, Shangqing Liu, Xiaohong Li, Bihuan Chen, and Xin Peng. 2023. Exploring the
Potential of ChatGPT in Automated Code Refinement: An Empirical Study. In 2024 IEEE/ACM 46th International
Conference on Software Engineering (ICSE). IEEE Computer Society, 379–391.
[49] Sichong Hao, Xianjun Shi, and Hongwei Liu. 2024. Exploring the Potential of Pre-Trained Language Models of Code
for Automated Program Repair. Electronics 13, 7 (2024), 1200.
[50] Sichong Hao, Xianjun Shi, Hongwei Liu, and Yanjun Shu. 2023. Enhancing Code Language Models for Program
Repair by Curricular Fine-tuning Framework. In 2023 IEEE International Conference on Software Maintenance and
Evolution (ICSME). IEEE, 136–146.
[51] Md Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou, and Chris Brown. 2023. FixEval: Execution-Based
Evaluation of Program Fixes for Programming Problems. In 2023 IEEE/ACM International Workshop on Automated
Program Repair. IEEE, 11–18.
[52] Dávid Hidvégi, Khashayar Etemadi, Sofia Bobadilla, and Martin Monperrus. 2024. CigaR: Cost-efficient Program
Repair with LLMs. arXiv preprint arXiv:2402.06598 (2024).
[53] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8
(1997), 1735–1780.
[54] Dániel Horváth, Viktor Csuvik, Tibor Gyimóthy, and László Vidács. 2023. An Extensive Study on Model Architecture
and Program Representation in the Domain of Learning-Based Automated Program Repair. In 2023 IEEE/ACM
International Workshop on Automated Program Repair (APR). 31–38.
[55] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu
Wang. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv preprint
arXiv:2308.10620 (2023).
[56] Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, and Fei Wu. 2024. Leveraging Print Debugging to Improve Code
Generation in Large Language Models. arXiv preprint arXiv:2401.05319 (2024).
[57] Xing Hu, Zhuang Liu, Xin Xia, Zhongxin Liu, Tongtong Xu, and Xiaohu Yang. 2023. Identify and Update Test Cases
When Production Code Changes: A Transformer-Based Approach. In 2023 38th IEEE/ACM International Conference
on Automated Software Engineering (ASE). IEEE, 1111–1122.
[58] Yang Hu, Umair Z Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury. 2019. Re-Factoring Based Program
Repair Applied to Programming Assignments. In 2019 34th IEEE/ACM International Conference on Automated Software
Engineering (ASE). IEEE Computer Society, 388–398.
[59] Yaojie Hu, Xingjian Shi, Qiang Zhou, and Lee Pike. 2022. Fix Bugs with Transformer through a Neural-Symbolic Edit
Grammar. In Deep Learning for Code Workshop. https://openreview.net/forum?id=SBgE6i_WkZq
[60] Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code
Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023).
[61] Kai Huang, Xiangxin Meng, Jian Zhang, Yang Liu, Wenjie Wang, Shuhao Li, and Yuqing Zhang. 2023. An Empirical
Study on Fine-Tuning Large Language Models of Code for Automated Program Repair. In 2023 38th IEEE/ACM
International Conference on Automated Software Engineering. IEEE, 1162–1174.
[62] Qing Huang, Jiahui Zhu, Zhenchang Xing, Huan Jin, Changjing Wang, and Xiwei Xu. 2023. A Chain of AI-Based
Solutions for Resolving Fqns and Fixing Syntax Errors in Partial Code. arXiv preprint arXiv:2306.11981 (2023).
[63] Ryosuke Ishizue, Kazunori Sakamoto, Hironori Washizaki, and Yoshiaki Fukazawa. 2024. Improved Program Repair
Methods using Refactoring with GPT Models. In Proceedings of the 55th ACM Technical Symposium on Computer
Science Education V. 1. 569–575.
[64] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping Program Repair Space
with Existing Patches and Similar Code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software
Testing and Analysis. 298–309.
[65] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program
Repair. In Proceedings of the 45th International Conference on Software Engineering. 1430–1442.
[66] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural Machine Translation for Automatic
Program Repair. In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering. 1161–1173.
[67] Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A Code Evolution Framework via Large Language
Models. arXiv preprint arXiv:2306.02907 (2023).
[68] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024.
SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on
Learning Representations. https://openreview.net/forum?id=VTF8yNQM66
[69] Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023.
InferFix: End-to-End Program Repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering
Conference and Symposium on the Foundations of Software Engineering. ACM, 1646–1656.
[70] Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan Radiček. 2023. Repair
Is Nearly Generation: Multilingual Program Repair with LLMs. In Proceedings of the AAAI Conference on Artificial
Intelligence, Vol. 37. 5131–5140.
[71] René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled
Testing Studies for Java Programs. In Proceedings of the 23rd International Symposium on Software Testing and Analysis.
437–440.
[72] Rafael-Michael Karampatsis and Charles Sutton. 2020. How Often Do Single-Statement Bugs Occur? The ManySStuBs4J
Dataset. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR’20). 573–577.
[73] Misoo Kim, Youngkyoung Kim, Hohyeon Jeong, Jinseok Heo, Sungoh Kim, Hyunhee Chung, and Eunseok Lee. 2022.
An Empirical Study of Deep Transfer Learning-Based Program Repair for Kotlin Projects. In Proceedings of the 30th
ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
1441–1452.
[74] Barbara Ann Kitchenham and Stuart Charters. 2007. Guidelines for Performing Systematic Literature Reviews in
Software Engineering. Technical Report EBSE 2007-001. Keele University and Durham University Joint Report. 1–65
pages.
[75] Jiaolong Kong, Mingfei Cheng, Xiaofei Xie, Shangqing Liu, Xiaoning Du, and Qi Guo. 2024. ContrastRepair: Enhancing
Conversation-Based Automated Program Repair Via Contrastive Test Case Pairs. arXiv preprint arXiv:2403.01971
(2024).
[76] Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon.
2020. FixMiner: Mining Relevant Fix Patterns for Automated Program Repair. Empirical Software Engineering 25, 3
(2020), 1980–2024.
[77] Márk Lajkó, Viktor Csuvik, and László Vidács. 2022. Towards Javascript Program Repair with Generative Pre-trained
Transformer. In 2022 IEEE/ACM International Workshop on Automated Program Repair. IEEE, 61–68.
[78] Tan Khang Le, Saba Alimadadi, and Steven Y Ko. 2024. A Study of Vulnerability Repair in JavaScript Programs with
Large Language Models. arXiv preprint arXiv:2403.13193 (2024).
[79] Thanh Le-Cong, Duc-Minh Luong, Xuan Bach D Le, David Lo, Nhat-Hoa Tran, Bui Quang-Huy, and Quyet-Thang
Huynh. 2023. Invalidator: Automated Patch Correctness Assessment Via Semantic and Syntactic Reasoning. IEEE
Transactions on Software Engineering 49, 06 (2023), 3411–3429.
[80] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for
Automatic Software Repair. IEEE Transactions on Software Engineering 38, 01 (2012), 54–72.
[81] Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh
Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy
Gur-Ari, and Vedant Misra. 2022. Solving Quantitative Reasoning Problems with Language Models. In Advances in
Neural Information Processing Systems. https://openreview.net/forum?id=IFXTZERXdM7
[82] Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. SkCoder: A Sketch-based Approach for Automatic
Code Generation. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2124–2135.
[83] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone,
Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier
Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade,
Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo
Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan
Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas,
Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey
Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-
Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis,
Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: May the Source
be With You! arXiv preprint arXiv:2305.06161 (2023).
[84] Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. DLFix: Context-based Code Transformation Learning for Automated
Program Repair. In Proceedings of the 42nd ACM/IEEE International Conference on Software Engineering. 602–614.
[85] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. DEAR: A Novel Deep Learning-based Approach for Automated
Program Repair. In Proceedings of the 44th International Conference on Software Engineering. 511–523.
[86] Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2024. LoftQ:
LoRA-Fine-Tuning-aware Quantization for Large Language Models. In The Twelfth International Conference on
Learning Representations. https://openreview.net/forum?id=LzPWWPAdY4
[87] Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey
Svyatkovskiy, Shengyu Fu, and Neel Sundaresan. 2022. Automating Code Review Activities by Large-Scale Pre-
Training. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the
Foundations of Software Engineering. 1035–1047.
[88] Jingjing Liang, Ruyi Ji, Jiajun Jiang, Shurui Zhou, Yiling Lou, Yingfei Xiong, and Gang Huang. 2021. Interactive Patch
Filtering as Debugging Aid. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME).
IEEE, 239–250.
[89] Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A Multi-Lingual Program Re-
pair Benchmark Set Based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International
Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH Companion’17).
55–56.
[90] Yuanfei Lin, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan, and Matthias Althoff. 2024. DrPlanner:
Diagnosis and Repair of Motion Planners Using Large Language Models. arXiv preprint arXiv:2403.07470 (2024).
[91] Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Avatar: Fixing Semantic Bugs with Fix
Patterns of Static Analysis Violations. In Proceedings of the 26th IEEE International Conference on Software Analysis,
Evolution and Reengineering (SANER). IEEE, 1–12.
[115] Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris
Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in Translation: A Study of Bugs
Introduced by Large Language Models While Translating Code. In 2024 IEEE/ACM 46th International Conference on
Software Engineering. IEEE Computer Society, 866–866.
[116] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining
Zero-Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy. IEEE,
2339–2356.
[117] Yun Peng, Shuzheng Gao, Cuiyun Gao, Yintong Huo, and Michael Lyu. 2024. Domain Knowledge Matters: Improving
Prompts with Fix Templates for Repairing Python Type Errors. In Proceedings of the 46th IEEE/ACM International
Conference on Software Engineering. 1–13.
[118] Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for Conducting Systematic Mapping Studies
in Software Engineering: An Update. Information and Software Technology 64 (2015), 1–18.
[119] Julian Aron Prenner, Hlib Babii, and Romain Robbes. 2022. Can OpenAI’s Codex Fix Bugs? An Evaluation on
QuixBugs. In Proceedings of the Third International Workshop on Automated Program Repair. 69–75.
[120] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. GPT-1: Improving Language Understand-
ing by Generative Pre-Training. URL: https://cdn.openai.com/research-covers/language-unsupervised/language_
understanding_paper.pdf. Last accessed: 2024-04-01.
[121] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models Are
Unsupervised Multitask Learners. OpenAI blog 1, 8 (2019), 9.
[122] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The
Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[123] Francisco Ribeiro, Rui Abreu, and João Saraiva. 2022. Framing Program Repair as Code Completion. In Proceedings of
the Third International Workshop on Automated Program Repair. IEEE, 38–45.
[124] Francisco Ribeiro, José Nuno Castro de Macedo, Kanae Tsushima, Rui Abreu, and João Saraiva. 2023. GPT-3-Powered
Type Error Debugging: Investigating the Use of Large Language Models for Code Repair. In Proceedings of the 16th
ACM SIGPLAN International Conference on Software Language Engineering. 111–124.
[125] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal
Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer,
Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas
Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open Foundation Models for Code. arXiv
preprint arXiv:2308.12950 (2023).
[126] Fernando Vallecillos Ruiz, Anastasiia Grishina, Max Hort, and Leon Moonen. 2024. A Novel Approach for Automatic
Program Repair Using Round-Trip Translation with Large Language Models. arXiv preprint arXiv:2401.07994 (2024).
[127] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning Representations by Back-Propagating
Errors. Nature 323, 6088 (1986), 533–536.
[128] Ahmadreza Saboor Yaraghi, Darren Holden, Nafiseh Kahani, and Lionel Briand. 2024. Automated Test Case Repair
Using Language Models. arXiv e-prints (2024), arXiv–2401.
[129] Max Schafer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. An Empirical Evaluation of Using Large Language
Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering 50, 1 (2024), 85–105.
[130] Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2022. Compressing Pre-Trained Models of Code into
3 Mb. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12.
[131] André Silva, Sen Fang, and Martin Monperrus. 2023. RepairLLaMA: Efficient Representations and Fine-Tuned
Adapters for Program Repair. arXiv preprint arXiv:2312.15698 (2023).
[132] André Silva, Nuno Saavedra, and Martin Monperrus. 2024. GitBug-Java: A Reproducible Benchmark of Recent Java
Bugs. arXiv preprint arXiv:2402.02961 (2024).
[133] Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An Analysis of the Automatic Bug Fixing
Performance of ChatGPT. In 2023 IEEE/ACM International Workshop on Automated Program Repair. 23–30.
[134] Chia-Yi Su and Collin McMillan. 2024. Distilled GPT for Source Code Summarization. Automated Software Engineering
31, 1 (2024), 22.
[135] Weisong Sun, Chunrong Fang, Yudu You, Yun Miao, Yi Liu, Yuekang Li, Gelei Deng, Shenghan Huang, Yuchen
Chen, Quanjun Zhang, et al. 2023. Automatic Code Summarization Via ChatGPT: How Far Are We? arXiv preprint
arXiv:2305.12865 (2023).
[136] Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu. 2014. Automatically Generated Patches As Debugging Aids:
A Human Study. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software
Engineering. 64–74.
[137] Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, and Tegawendé F Bissyandé. 2020.
Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair. In
Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 981–992.
[138] Haoye Tian, Kui Liu, Yinghua Li, Abdoul Kader Kaboré, Anil Koyuncu, Andrew Habib, Li Li, Junhao Wen, Jacques
Klein, and Tegawendé F Bissyandé. 2023. The Best of Both Worlds: Combining Learned Embeddings with Engineered
Features for Accurate Prediction of Correct Patches. ACM Transactions on Software Engineering and Methodology 32,
4 (2023), 1–34.
[139] Haoye Tian, Xunzhu Tang, Andrew Habib, Shangwen Wang, Kui Liu, Xin Xia, Jacques Klein, and TegawendÉ F
BissyandÉ. 2022. Is This Change the Answer to That Problem? Correlating Descriptions of Bug and Code Changes
for Evaluating Patch Correctness. In 37th IEEE/ACM International Conference on Automated Software Engineering.
IEEE, 1–13.
[140] Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. DebugBench:
Evaluating Debugging Capability of Large Language Models. arXiv preprint arXiv:2401.04621 (2024).
[141] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume
Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[142] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open Foundation and Fine-tuned Chat Models.
arXiv preprint arXiv:2307.09288 (2023).
[143] YunDa Tsai, Mingjie Liu, and Haoxing Ren. 2023. RTLFixer: Automatically Fixing RTL Syntax Errors with Large
Language Models. arXiv preprint arXiv:2311.16543 (2023).
[144] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019.
An Empirical Study on Learning Bug-Fixing Patches in the Wild Via Neural Machine Translation. ACM Transactions
on Software Engineering and Methodology 28, 4 (2019), 1–29.
[145] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems. 5998–6008.
[146] Nalin Wadhwa, Jui Pradhan, Atharv Sonwane, Surya Prakash Sahu, Nagarajan Natarajan, Aditya Kanade, Suresh
Parthasarathy, and Sriram Rajamani. 2023. Frustrated with Code Quality Issues? LLMs Can Help! arXiv preprint
arXiv:2309.12938 (2023).
[147] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. URL:
https://github.com/kingoflolz/mesh-transformer-jax. Last accessed: 2024-04-01.
[148] Deze Wang, Boxing Chen, Shanshan Li, Wei Luo, Shaoliang Peng, Wei Dong, and Xiangke Liao. 2023. One Adapter for
All Programming Languages? Adapter Tuning for Code Search and Summarization. arXiv preprint arXiv:2303.15822
(2023).
[149] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software Testing with
Large Language Models: Survey, Landscape, and Vision. IEEE Transactions on Software Engineering (2024). Early
Access, DOI: 10.1109/TSE.2024.3368208.
[150] Shangwen Wang, Mingyang Geng, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Li Li, Tegawendé F Bissyandé, and
Xiaoguang Mao. 2023. Natural Language to Code: How Far Are We?. In Proceedings of the 31st ACM Joint European
Software Engineering Conference and Symposium on the Foundations of Software Engineering. 375–387.
[151] Simin Wang, Liguo Huang, Amiao Gao, Jidong Ge, Tengfei Zhang, Haitao Feng, Ishna Satyarth, Ming Li, He Zhang,
and Vincent Ng. 2022. Machine/Deep Learning for Software Engineering: A Systematic Literature Review. IEEE
Transactions on Software Engineering (2022).
[152] Shangwen Wang, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Yan Lei, and Xiaoguang Mao. 2023. Two Birds with
One Stone: Boosting Code Generation and Code Search via a Generative Adversarial Network. Proceedings of the
ACM on Programming Languages 7, OOPSLA2 (2023), 486–515.
[153] Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. 2023. RAP-Gen: Retrieval-Augmented Patch Generation
with CodeT5 for Automatic Program Repair. In Proceedings of the 31st ACM Joint European Software Engineering
Conference and Symposium on the Foundations of Software Engineering. ACM, 146–158.
[154] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained
Encoder-decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing. 8696–8708.
[155] Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2022. A Systematic
Literature Review on the Use of Deep Learning in Software Engineering Research. ACM Transactions on Software
Engineering and Methodology 31, 2 (2022), 1–58.
[156] Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023. Copiloting the Copilots: Fusing Large Language
Models with Completion Engines for Automated Program Repair. In Proceedings of the 31st ACM Joint European
Software Engineering Conference and Symposium on the Foundations of Software Engineering. 172–184.
[157] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically Finding Patches
Using Genetic Programming. In 2009 IEEE 31st International Conference on Software Engineering. IEEE, 364–374.
[158] Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, and Andreas Zeller. 2007. How Long Will It Take to Fix This
Bug?. In Fourth International Workshop on Mining Software Repositories. IEEE, 1–1.
[159] Emily Winter, Vesna Nowack, David Bowes, Steve Counsell, Tracy Hall, Sæmundur Haraldsson, and John Woodward.
2022. Let’s Talk with Developers, Not about Developers: A Review of Automatic Program Repair Research. IEEE
Transactions on Software Engineering 49, 1 (2022), 419–436.
[160] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa. 2016. A Survey on Software Fault Localization. IEEE Transactions
on Software Engineering 42, 8 (2016), 707–740.
[161] Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah. 2023.
How Effective Are Neural Networks for Fixing Security Vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT
International Symposium on Software Testing and Analysis (ISSTA 2023). Association for Computing Machinery,
1282–1294.
[162] Yonghao Wu, Zheng Li, Jie M Zhang, and Yong Liu. 2023. ConDefects: A New Dataset to Address the Data Leakage
Concern for LLM-based Fault Localization and Program Repair. arXiv preprint arXiv:2310.16253 (2023).
[163] Chunqiu Steven Xia, Yifeng Ding, and Lingming Zhang. 2023. The Plastic Surgery Hypothesis in the Era of Large
Language Models. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering. IEEE, 522–534.
[164] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-
trained Language Models. In 2023 IEEE/ACM 45th International Conference on Software Engineering. IEEE Computer
Society, 1482–1494.
[165] Chunqiu Steven Xia and Lingming Zhang. 2022. Less Training, More Repairing Please: Revisiting Automated Program
Repair Via Zero-Shot Learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering. 959–971.
[166] Chunqiu Steven Xia and Lingming Zhang. 2023. Conversational Automated Program Repair. arXiv preprint
arXiv:2301.13246 (2023).
[167] Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going: Fixing 162 Out of 337 Bugs for $0.42
Each Using ChatGPT. arXiv preprint arXiv:2304.00385 (2023).
[168] Yuan-An Xiao, Chenyang Yang, Bo Wang, and Yingfei Xiong. 2024. Accelerating Patch Validation for Program Repair
with Interception-Based Execution Scheduling. IEEE Transactions on Software Engineering 01 (2024), 1–18.
[169] Danning Xie, Byungwoo Yoo, Nan Jiang, Mijung Kim, Lin Tan, Xiangyu Zhang, and Judy S Lee. 2023. Impact of Large
Language Models on Generating Software Specifications. arXiv preprint arXiv:2306.03324 (2023).
[170] Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, and Gang Huang. 2018. Identifying Patch Correctness in
Test-Based Program Repair. In Proceedings of the 40th IEEE/ACM International Conference on Software Engineering.
789–799.
[171] Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise Condition
Synthesis for Program Repair. In Proceedings of the 39th IEEE/ACM International Conference on Software Engineering.
IEEE, 416–426.
[172] Zhuolin Xu, Yuanzhang Lin, Qiushi Li, and Shin Hwei Tan. 2023. Guiding ChatGPT to Fix Web UI Tests Via
Explanation-Consistency Checking. arXiv preprint arXiv:2312.05778 (2023).
[173] Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian Lamelas Marcote, Thomas Durieux, Daniel
Le Berre, and Martin Monperrus. 2016. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs.
IEEE Transactions on Software Engineering 43, 1 (2016), 34–55.
[174] Yanming Yang, Xin Xia, David Lo, and John Grundy. 2022. A Survey on Deep Learning for Software Engineering.
Comput. Surveys 54, 10s (2022), 1–73.
[175] Xufeng Yao, Haoyang Li, Tsz Ho Chan, Wenyi Xiao, Mingxuan Yuan, Yu Huang, Lei Chen, and Bei Yu. 2024.
HDLdebugger: Streamlining HDL debugging with Large Language Models. arXiv preprint arXiv:2403.11671 (2024).
[176] He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, and Martin Monperrus. 2022. SelfAPR: Self-Supervised Program
Repair with Test Execution Diagnostics. In 2022 37th IEEE/ACM International Conference on Automated Software
Engineering. IEEE.
[177] He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural Program Repair with Execution-Based Backpropagation.
In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering. 1506–1518.
[178] He Ye and Martin Monperrus. 2024. ITER: Iterative Neural Repair for Multi-Location Patches. In Proceedings of the
46th IEEE/ACM International Conference on Software Engineering. 79–91.
[179] Wei Yuan, Quanjun Zhang, Tieke He, Chunrong Fang, Nguyen Quoc Viet Hung, Xiaodong Hao, and Hongzhi Yin.
2022. CIRCLE: Continual Repair across Programming Languages. In Proceedings of the 31st ACM SIGSOFT International
Symposium on Software Testing and Analysis. ACM, 678–690.
[180] Yuan Yuan and Wolfgang Banzhaf. 2018. ARJA: Automated Repair of Java Programs Via Multi-objective Genetic
Programming. IEEE Transactions on Software Engineering 46, 10 (2018), 1040–1067.
[181] He Zhang, Muhammad Ali Babar, and Paolo Tell. 2011. Identifying Relevant Studies in Software Engineering.
Information and Software Technology 53, 6 (2011), 625–637.
[182] Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022.
Repairing Bugs in Python Assignments Using Large Language Models. arXiv preprint arXiv:2209.14876 (2022).
[183] Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023. Self-Edit: Fault-Aware Code Editor for Code Generation.
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
769–787.
[184] Lyuye Zhang, Kaixuan Li, Kairan Sun, Daoyuan Wu, Ye Liu, Haoye Tian, and Yang Liu. 2024. ACFIX: Guiding LLMs
with Mined Common RBAC Practices for Context-Aware Repair of Access Control Vulnerabilities in Smart Contracts.
arXiv preprint arXiv:2403.06838 (2024).
[185] Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023. A Survey of Learning-Based
Automated Program Repair. ACM Transactions on Software Engineering and Methodology 33, 2 (2023), 1–69.
[186] Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, and Zhenyu Chen. 2024. APPT:
Boosting Automated Patch Correctness Prediction via Fine-Tuning Pre-Trained Models. IEEE Transactions on Software
Engineering 50, 03 (2024), 474–494.
[187] Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen.
2023. A Survey on Large Language Models for Software Engineering. arXiv preprint arXiv:2312.15223 (2023).
[188] Quanjun Zhang, Chunrong Fang, Bowen Yu, Weisong Sun, Tongke Zhang, and Zhenyu Chen. 2023. Pre-Trained
Model-Based Automated Software Vulnerability Repair: How Far are We? IEEE Transactions on Dependable and
Secure Computing (2023). Early Access, DOI: 10.1109/TDSC.2023.3308897.
[189] Quanjun Zhang, Chunrong Fang, Tongke Zhang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023. GAMMA:
Revisiting Template-based Automated Program Repair via Mask Prediction. In 2023 38th IEEE/ACM International
Conference on Automated Software Engineering. IEEE, 535–547.
[190] Quanjun Zhang, Tongke Zhang, Juan Zhai, Chunrong Fang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023. A
Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated
Program Repair. arXiv preprint arXiv:2310.08879 (2023).
[191] Quanjun Zhang, Yuan Zhao, Weisong Sun, Chunrong Fang, Ziyuan Wang, and Lingming Zhang. 2022. Program
Repair: Automated vs. Manual. arXiv preprint arXiv:2203.05166 (2022).
[192] Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, David Lo, Asankhaya Sharma, and Lingxiao Jiang. 2023. Evaluating
Pre-Trained Language Models for Repairing API Misuses. arXiv preprint arXiv:2310.16390 (2023).
[193] Yuntong Zhang, Xiang Gao, Gregory J. Duck, and Abhik Roychoudhury. 2022. Program Vulnerability Repair Via
Inductive Inference. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis.
691–702.
[194] Yuntong Zhang, Ridwan Shariffdeen, Gregory J Duck, Jiaqi Tan, and Abhik Roychoudhury. 2023. Program Repair by
Fuzzing over Patch and Input Space. arXiv preprint arXiv:2308.00666 (2023).
[195] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023. Unifying the
Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. arXiv preprint arXiv:2311.07989
(2023).
[196] Qianhui Zhao, Fang Liu, Li Zhang, Yang Liu, Zhen Yan, Zhenghao Chen, Yufei Zhou, Jing Jiang, and Ge Li. 2024.
Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student Assignments. arXiv preprint
arXiv:2404.01754 (2024).
[197] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie
Zhang, Zican Dong, et al. 2023. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223 (2023).
[198] Zelin Zhao, Zhaogui Xu, Jialong Zhu, Peng Di, Yuan Yao, and Xiaoxing Ma. 2023. The Right Prompts for the Job:
Repair Code-Review Defects with Large Language Model. arXiv preprint arXiv:2312.17485 (2023).
[199] Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. 2024.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658
(2024).
[200] Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Language Model Debugger via Verifying Runtime
Execution Step-by-step. arXiv preprint arXiv:2402.16906 (2024).
[201] Xin Zhou, Kisub Kim, Bowen Xu, DongGyun Han, and David Lo. 2024. Out of Sight, Out of Mind: Better Automatic
Vulnerability Repair by Broadening Input Ranges and Sources. In 2024 IEEE/ACM 46th International Conference on
Software Engineering. IEEE Computer Society, 872–872.
[202] Xin Zhou, Bowen Xu, Kisub Kim, DongGyun Han, Thanh Le-Cong, Junda He, Bach Le, and David Lo. 2023. PatchZero:
Zero-Shot Automatic Patch Correctness Assessment. arXiv preprint arXiv:2303.00202 (2023).
[203] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A Syntax-
Guided Edit Decoder for Neural Program Repair. In Proceedings of the 29th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering. 341–353.
[204] Qihao Zhu, Zeyu Sun, Wenjie Zhang, Yingfei Xiong, and Lu Zhang. 2023. Tare: Type-Aware Neural Program Repair.
In 2023 IEEE/ACM 45th International Conference on Software Engineering. IEEE, 1443–1455.
[205] Yuqi Zhu, Jia Li, Ge Li, YunFei Zhao, Zhi Jin, and Hong Mei. 2024. Hot or Cold? Adaptive Temperature Sampling
for Code Generation with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence,
Vol. 38. 437–445.
[206] Armin Zirak and Hadi Hemmati. 2024. Improving Automated Program Repair with Domain Adaptation. ACM
Transactions on Software Engineering and Methodology 33, 3 (2024), 1–43.