A Survey on Large Language Models for Code Generation

JUYONG JIANG, The Hong Kong University of Science and Technology (Guangzhou), China
FAN WANG, The Hong Kong University of Science and Technology (Guangzhou), China
JIASI SHEN, The Hong Kong University of Science and Technology, Hong Kong, China
SUNGJU KIM, NAVER Cloud, Seoul, South Korea
SUNGHUN KIM† , The Hong Kong University of Science and Technology (Guangzhou), China
Large Language Models (LLMs) have achieved remarkable advances across diverse code-related tasks, giving
rise to what are known as Code LLMs, particularly in code generation, i.e., producing source code from natural
language descriptions. This burgeoning field has attracted significant interest from both academic researchers
and industry professionals because of its practical significance for software development, as exemplified by
GitHub Copilot. Despite active exploration of LLMs for a variety of code tasks, from the perspectives of natural
language processing (NLP), software engineering (SE), or both, there is a noticeable absence of a comprehensive
and up-to-date literature review dedicated to LLMs for code generation. In this survey, we aim to bridge this gap by
providing a systematic literature review that serves as a valuable reference for researchers investigating the
cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the
recent developments in LLMs for code generation, covering aspects such as data curation, latest advances,
performance evaluation, and real-world applications. In addition, we present a historical overview of the evo-
lution of LLMs for code generation and offer an empirical comparison using the widely recognized HumanEval
and MBPP benchmarks to highlight the progressive enhancements in LLM capabilities for code generation.
We identify critical challenges and promising opportunities regarding the gap between academia and practical
development. Furthermore, we have established a dedicated resource website (https://codellm.github.io) to
continuously document and disseminate the most recent advances in the field.
CCS Concepts: • General and reference → Surveys and overviews; • Software and its engineering →
Software development techniques; • Computing methodologies → Artificial intelligence.
Additional Key Words and Phrases: Large Language Models, Code Large Language Models, Code Generation
Authors’ addresses: Juyong Jiang, jjiang472@connect.hkust-gz.edu.cn, The Hong Kong University of Science and Technology
(Guangzhou), Guangzhou, China; Fan Wang, fwang380@connect.hkust-gz.edu.cn, The Hong Kong University of Science
and Technology (Guangzhou), Guangzhou, China; Jiasi Shen, sjs@cse.ust.hk, The Hong Kong University of Science and
Technology, Hong Kong, China; Sungju Kim, sungju.kim@navercorp.com, NAVER Cloud, Seoul, South Korea; Sunghun
Kim, hunkim@cse.ust.hk, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2024 Association for Computing Machinery.
1049-331X/2024/9-ART1 $15.00
https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
The advent of Large Language Models (LLMs) such as ChatGPT1 [171] has profoundly transformed
the landscape of automated code-related tasks [45], including code completion [78, 152, 233, 244],
code translation [48, 121, 211], and code repair [109, 170, 176]. A particularly intriguing application
of LLMs is code generation, a task that involves producing source code from natural language
descriptions. Despite varying definitions across studies [47, 191, 204, 232], for the purposes of
this survey, we adopt a consistent definition of code generation as the natural-language-to-code
(NL2Code) task [15, 16, 264]. This area has garnered substantial interest from both academia and
industry, as evidenced by the development of tools like GitHub Copilot2 [45], CodeGeeX3 [275],
and Amazon CodeWhisperer4 , which leverage groundbreaking code LLMs to facilitate software
development.
Initial investigations into code generation primarily utilized heuristic rules or expert systems,
such as probabilistic grammar-based frameworks [9, 57, 113] and specialized language models [59,
74, 106]. These early techniques were typically rigid and difficult to scale. However, the introduction
of Transformer-based LLMs has shifted the paradigm, establishing them as the preferred method
due to their superior proficiency and versatility. One remarkable aspect of LLMs is their capability
to follow instructions [51, 164, 173, 238, 250], enabling even novice programmers to write code by
simply articulating their requirements. This emergent ability has democratized coding, making it
accessible to a broader audience [264]. The performance of LLMs on code generation tasks has improved
remarkably, as illustrated by the HumanEval leaderboard5 , which charts the evolution from a Pass@1 of
3.6% with PaLM 8B [49] to 95.1% with LDB [279]. Along the way, the HumanEval benchmark [45] has become
a de facto standard for evaluating the coding proficiency of LLMs.
To offer a comprehensive chronological evolution, we present an overview of the development
of LLMs for code generation, as illustrated in Figure 1. The landscape of LLMs for code generation
is characterized by a spectrum of models, with certain models like ChatGPT [173], GPT4 [5],
LLaMA [217, 218], and Claude 3 [13] serving general-purpose applications, while others such
as StarCoder [132, 151], Code LLaMA [196], DeepSeek-Coder [79], and Code Gemma [54] are
tailored specifically for code-centric tasks. The convergence of code generation with the latest LLM
advancements is pivotal, especially when programming languages can be considered as distinct
dialects of multilingual natural language [15, 275]. These models are not only tested against software
engineering (SE) requirements but also propel the advancement of LLMs into practical production
[271].
While recent surveys have shed light on code LLMs from the lenses of Natural Language Process-
ing (NLP), Software Engineering (SE), or a combination of both disciplines [91, 264, 271, 278], they
have often encompassed a broad range of code-related tasks. There remains a dearth of literature
specifically reviewing advanced topics in code generation, such as meticulous data curation, in-
struction tuning, alignment with feedback, prompting techniques, the development of autonomous
coding agents, retrieval augmented code generation, LLM-as-a-Judge for code generation, among
others. Notably pertinent studies [15, 264] also concentrate on LLMs for text-to-code generation
(NL2Code), yet they primarily examine models released from 2020 to 2022. Consequently, this notice-
able temporal gap has resulted in an absence of up-to-date literature reviews that contemplate the
1 https://chat.openai.com
2 https://github.com/features/copilot
3 https://codegeex.cn/en-US
4 https://aws.amazon.com/codewhisperer
5 https://paperswithcode.com/sota/code-generation-on-humaneval
latest advancements, including models like CodeQwen [215], WizardCoder [154], and PPOCoder
[204], as well as the comprehensive exploration of the advanced topics previously mentioned.
Recognizing the need for a dedicated and up-to-date literature review, this survey endeavors to fill
that void. We provide a systematic review that will serve as a foundational reference for researchers
quickly exploring the latest progress in LLMs for code generation. A taxonomy is introduced to
categorize and examine recent advancements, encompassing data curation [154, 231, 240], advanced
topics [42, 47, 94, 125, 146, 152, 164, 166, 177, 205, 266], evaluation methods [45, 85, 111, 284], and
practical applications [45, 275]. This categorization aligns with the complete lifecycle of an LLM for
code generation. Furthermore, we pinpoint critical challenges and identify promising opportunities
to bridge the research-practicality divide. This survey therefore equips NLP and SE researchers
with a thorough understanding of LLMs for code generation, highlighting cutting-edge directions
as well as current hurdles and prospects.
The remainder of the survey is organized following the structure outlined in our taxonomy
in Figure 3. In Section 2, we introduce the preliminaries of LLM with Transformer architecture
and formulate the task of LLM for code generation. Then, in Section 3, we propose a taxonomy,
categorizing the complete process of LLMs in code generation. Section 4 delves into the specifics of
LLMs for code generation within this taxonomy framework. In Section 5, we underscore the critical
challenges and promising opportunities for bridging the research-practicality gap and conclude
this work in Section 6.
2 BACKGROUND
2.1 Large Language Models
The effectiveness of large language models (LLMs) is fundamentally attributed to their substantial
quantity of model parameters, large-scale and diversified datasets, and the immense computational
power utilized during training [87, 114]. Generally, scaling up language models consistently results
in enhanced performance and sample efficiency across a broad array of downstream tasks [238, 273].
However, once the model size expands beyond a certain scale (e.g., GPT-3 [31] with 175B
parameters and PaLM [49] with 540B), LLMs have exhibited an unpredictable phenomenon known
as emergent abilities6 , including instruction following [173], in-context learning [65], and step-by-
step reasoning [95, 239], which are absent in smaller models but apparent in larger ones [238].
Adhering to the same architectures of the Transformer [222] in LLMs, code LLMs are specifically
pre-trained on large-scale unlabeled code corpora, whereas general-purpose LLMs (e.g., ChatGPT
[171]) are pre-trained on a blend of code and text data. Analogous to LLMs, Code LLMs can also
be classified into three architectural categories: encoder-only models, decoder-only models, and
encoder-decoder models. Encoder-only models, such as CodeBERT [68], are typically suitable for
code comprehension tasks, including type prediction, code retrieval, and clone detection. Decoder-only
models, such as StarCoder [132], predominantly excel in generation tasks, such as code generation,
code translation, and code summarization. Encoder-decoder models, such as CodeT5 [234], can
accommodate both code understanding and generation tasks but do not necessarily outperform
encoder-only or decoder-only models. The overall architectures of the
different Code LLMs for code generation are depicted in Figure 2.
In the following subsection, we will delineate the key modules of the Transformer layers in Code
LLMs.
2.1.1 Multi-Head Self-Attention Modules. Each Transformer layer incorporates a multi-head self-
attention (MHSA) mechanism to discern the inherent semantic relationships within a sequence
6 It should be noted that an LLM is not necessarily superior to a smaller language model, and emergent abilities may not manifest in all LLMs [273].
[Figure 1 timeline, May 2020 to Apr. 2024: GPT-C (May 2020); PyMT5 (Oct. 2020); CodeGPT (Feb. 2021); GPT-Neo, PLBART (Mar. 2021); GPT-J (May 2021); Codex (Jul. 2021); CodeParrot (Nov. 2021); JuPyT5 (Jan. 2022); CodeGen (Mar. 2022); CodeGeeX (Sep. 2022); SantaCoder, ERNIE-Code (Dec. 2022); PPOCoder (Jan. 2023); GPT-4 (Mar. 2023); Self-Debugging (Apr. 2023); DeepSeek-Coder (Nov. 2023); Magicoder, AlphaCode 2, phi-2, WaveCoder (Dec. 2023); Claude 3, CodeS, OpenDevin, Devin (Mar. 2024); StarCoder2-Instruct, ProCoder, CodeGemma, CodeQwen1.5, Llama 3 (Apr. 2024).]
Fig. 1. A chronological overview of large language models (LLMs) for code generation in recent years. The
timeline was established mainly according to the release date. The models with publicly available model
checkpoints are highlighted in green color.
of tokens across ℎ distinct latent representation spaces. Formally, the MHSA employed by the
Transformer can be formulated as follows:
h^{(l)} = MultiHeadSelfAttn(Q, K, V) = Concat({Head_i}_{i=1}^{h}) W^O,   (1)

Head_i = Attention(H^{(l-1)} W_i^Q, H^{(l-1)} W_i^K, H^{(l-1)} W_i^V),   (2)

Attention(Q, K, V) = softmax( Q K^T / \sqrt{d_model / h} ) V,   (3)

where H^{(l-1)} ∈ R^{n×d_model} denotes the input to the l-th Transformer layer, while h^{(l)} ∈ R^{n×d_model}
represents the output of the MHSA sub-layer. The number of distinct attention heads is denoted
by h, and d_model refers to the model dimension. The set of projections {W_i^Q, W_i^K, W_i^V, W_i^O} ∈
R^{d_model×d_model/h} encompasses the affine transformation parameters for each attention head Head_i,
transforming the query Q, key K, value V, and the output of the attention sub-layer. The softmax
function is applied in a row-wise manner. The dot products of queries and keys are divided by
a scaling factor \sqrt{d_model / h} to counteract the risk of excessively large inner products and the
correspondingly diminished gradients in the softmax function, thus encouraging a more balanced
attention landscape.
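To make Equations (1)-(3) concrete, the following minimal NumPy sketch (an illustration under simplifying assumptions, not any particular model's implementation) computes multi-head self-attention for a single sequence; for simplicity, the output projection is taken as a single d_model × d_model matrix and the weights are random.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, Wo, h):
    """H: (n, d_model); Wq/Wk/Wv: lists of h per-head projections of shape
    (d_model, d_model // h); Wo: (d_model, d_model) output projection."""
    n, d_model = H.shape
    d_head = d_model // h
    heads = []
    for i in range(h):
        Q, K, V = H @ Wq[i], H @ Wk[i], H @ Wv[i]      # per-head projections, Eq. (2)
        scores = Q @ K.T / np.sqrt(d_head)             # scaled dot products, Eq. (3)
        heads.append(softmax(scores, axis=-1) @ V)     # row-wise softmax, weighted sum
    return np.concatenate(heads, axis=-1) @ Wo         # Concat(...) W^O, Eq. (1)

# toy usage with random weights
n, d_model, h = 4, 8, 2
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_model // h)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_model // h)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_model // h)) for _ in range(h)]
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_self_attention(H, Wq, Wk, Wv, Wo, h)  # shape (n, d_model)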
In addition to multi-head self-attention, there are two other types of attention based on the
source of queries and key-value pairs:
• Masked Multi-Head Self-Attention. Within the decoder layers of the Transformer, the
self-attention mechanism is constrained by introducing an attention mask, ensuring that
queries at each position can only attend to all key-value pairs up to and inclusive of that
position. To facilitate parallel training, this is typically executed by assigning a value of 0
to the lower triangular part and setting the remaining elements to −∞. Consequently, each
item attends only to its predecessors and itself. Formally, this modification in Equation 3 can
be depicted as follows:
Attention(Q, K, V) = softmax( Q K^T / \sqrt{d_model / h} + M_mask ) V,   (4)

M_mask = [m_{ij}]_{n×n},   m_{ij} = I(i ≥ j) = 0 for i ≥ j, and −∞ otherwise.   (5)

This form of self-attention is commonly denoted as autoregressive or causal attention [141]; a minimal sketch of this mask construction is given after this list.
• Cross-Layer Multi-Head Self-Attention. The queries are derived from the outputs of the
preceding (decoder) layer, while the keys and values are projected from the outputs of the
encoder.
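Complementing Equations (4)-(5), the small sketch below builds the additive mask M_mask and applies it before the row-wise softmax so that each position attends only to itself and its predecessors; it is an illustrative fragment rather than a production implementation.

import numpy as np

def causal_mask(n):
    # 0 on and below the diagonal (i >= j), -inf strictly above it, as in Equation (5)
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_attention(Q, K, V):
    n, d_head = Q.shape
    scores = Q @ K.T / np.sqrt(d_head) + causal_mask(n)      # Equation (4)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V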
2.1.2 Position-wise Feed-Forward Networks. Within each Transformer layer, a Position-wise Feed-
Forward Network (PFFN) is leveraged following the MHSA sub-layer to refine the sequence
embeddings at each position 𝑖 in a separate and identical manner, thereby encoding more intricate
feature representations. The PFFN is composed of a pair of linear transformations, interspersed
with a ReLU activation function. Formally,
PFFN(h^{(l)}) = Concat({FFN(h_i^{(l)})^T}_{i=1}^{n})^T,   (6)

FFN(h_i^{(l)}) = ReLU(h_i^{(l)} W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)},   (7)
[Figure 2 diagram: stacks of N Transformer blocks, each combining (masked) multi-head self-attention, position-wise feed-forward networks, residual additions, and layer normalization, topped by a linear and softmax layer producing output probabilities, shown for both encoder-decoder and decoder-only variants.]
Fig. 2. The overview of large language models (LLMs) with encoder-decoder and decoder-only Transformer
architecture for code generation, adapted from [222].
where h^{(l)} ∈ R^{n×d_model} is the output of the MHSA sub-layer in the l-th Transformer layer, and
h_i^{(l)} ∈ R^{d_model} denotes the latent representation at each sequence position. The projection matrices
W^{(1)}, (W^{(2)})^T ∈ R^{d_model×4d_model} and the bias vectors b^{(1)} ∈ R^{4d_model} and b^{(2)} ∈ R^{d_model} are parameters learned
during training. These parameters remain consistent across all positions but are initialized
independently from layer to layer. In this context, T represents the transpose operation on a matrix.
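As a minimal illustration of Equations (6)-(7), the following sketch applies the position-wise feed-forward network to every position of an input matrix at once; the shapes follow the description above and the parameters are assumed to be given.

import numpy as np

def pffn(h, W1, b1, W2, b2):
    """h: (n, d_model); W1: (d_model, 4*d_model); W2: (4*d_model, d_model).
    The same parameters are shared across all n positions."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2  # ReLU(h W1 + b1) W2 + b2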
2.1.3 Residual Connection and Normalization. To alleviate the issue of vanishing or exploding
gradients resulting from network deepening, the Transformer model incorporates a residual con-
nection [84] around each of the aforementioned modules, followed by Layer Normalization [17].
For the placement of Layer Normalization operation, there are two widely used approaches: 1)
Post-Norm: Layer normalization is implemented subsequent to the element-wise residual addition,
in accordance with the vanilla Transformer [222]. 2) Pre-Norm: Layer normalization is applied to
the input of each sub-layer, as seen in models like GPT-2 [186]. Formally, it can be formulated as:
Post-Norm :  H^{(l)} = LayerNorm(PFFN(h^{(l)}) + h^{(l)}),   h^{(l)} = LayerNorm(MHSA(H^{(l-1)}) + H^{(l-1)})   (8)

Pre-Norm :   H^{(l)} = PFFN(LayerNorm(h^{(l)})) + h^{(l)},   h^{(l)} = MHSA(LayerNorm(H^{(l-1)})) + H^{(l-1)}   (9)
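The sketch below contrasts the two sub-layer orderings of Equations (8)-(9); layer_norm, mhsa, and pffn are stand-ins for the components defined earlier and are assumed to be supplied by the caller.

def post_norm_block(H_prev, mhsa, pffn, layer_norm):
    # vanilla Transformer: residual addition first, then LayerNorm (Equation 8)
    h = layer_norm(mhsa(H_prev) + H_prev)
    return layer_norm(pffn(h) + h)

def pre_norm_block(H_prev, mhsa, pffn, layer_norm):
    # GPT-2 style: LayerNorm on the sub-layer input, residual added afterwards (Equation 9)
    h = mhsa(layer_norm(H_prev)) + H_prev
    return pffn(layer_norm(h)) + h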
2.1.4 Positional Encoding. Given that self-attention alone cannot discern the positional information
of each input token, the vanilla Transformer introduces an absolute positional encoding method to
supplement this positional information, known as sinusoidal position embeddings [222]. Specifically,
for a token at position 𝑝𝑜𝑠, the position embedding is defined as:
p_{pos,2i} = sin( pos / 10000^{2i/d_model} ),   (10)

p_{pos,2i+1} = cos( pos / 10000^{2i/d_model} ),   (11)
where 2𝑖, 2𝑖 + 1 represent the dimensions of the position embedding, while 𝑑𝑚𝑜𝑑𝑒𝑙 denotes the model
dimension. Subsequently, each position embedding is added to the corresponding token embedding,
and the sum is fed into the Transformer. Since the inception of this method, a variety of innovative
positional encoding approaches have emerged, such as learnable embeddings [61], relative position
embeddings [199], RoPE [209], and ALiBi [183]. For more detailed descriptions of each method,
please consult [141, 272].
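A minimal sketch of the sinusoidal position embeddings in Equations (10)-(11); the returned matrix is added element-wise to the token embeddings before the first Transformer layer (d_model is assumed to be even).

import numpy as np

def sinusoidal_position_embeddings(n, d_model):
    pos = np.arange(n)[:, None]                    # positions 0..n-1
    i = np.arange(d_model // 2)[None, :]           # dimension index i
    angles = pos / np.power(10000.0, 2 * i / d_model)
    P = np.zeros((n, d_model))
    P[:, 0::2] = np.sin(angles)                    # even dimensions, Eq. (10)
    P[:, 1::2] = np.cos(angles)                    # odd dimensions, Eq. (11)
    return P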
Consequently, following [166], a more general formulation of LLMs for code generation with
few-shot (or zero-shot) exemplars can be revised as:
P_θ(y | x) = P_θ(y | prompt(x, {(x_i, y_i)}_{i=1}^{k})),   k ∈ {0, 1, . . . , M},   (12)

where prompt(x, {(x_i, y_i)}_{i=1}^{k}) is a string representation of the overall input, and {(x_i, y_i)}_{i=1}^{k}
denotes a set of k exemplars randomly selected from {(x_i, y_i)}_{i=1}^{M}. In particular, when k = 0,
this denotes zero-shot code generation, which is equivalent to the vanilla setting without in-context learning.
Subsequently, a variety of decoding strategies can be performed for code generation, including
deterministic-based strategies (e.g., greedy search and beam search) and sampling-based strategies
(e.g., temperature sampling, top-k sampling, and top-p (nucleus) sampling). For more detailed
descriptions of each decoding strategy, please consult [89].
Greedy Search :  y* = argmax_y P_θ(y | prompt(x, {(x_i, y_i)}_{i=1}^{k})),   k ∈ {0, 1, . . . , M}   (13)
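To illustrate the decoding strategies mentioned above, the sketch below contrasts greedy selection with temperature and nucleus (top-p) sampling at a single decoding step; probs is assumed to be the model's next-token probability vector over the vocabulary, and the default hyperparameters are illustrative.

import numpy as np

def greedy(probs):
    return int(np.argmax(probs))                   # deterministic choice, cf. Equation (13)

def temperature_sample(probs, tau=0.8, rng=np.random.default_rng()):
    logits = np.log(probs + 1e-12) / tau           # sharpen (tau < 1) or flatten (tau > 1)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def top_p_sample(probs, p=0.95, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]                # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, p)) + 1]  # smallest nucleus covering mass p
    p_keep = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p_keep))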
3 TAXONOMY
The recent surge in the development of Large Language Models (LLMs) has led to a significant
number of these models being repurposed for the code generation task through continued pre-training
or fine-tuning. This trend is particularly observable in the realm of open-source models. For instance,
Meta AI initially made the LLaMA [217] model publicly available, which was followed by the release
of Code LLaMA [196], designed specifically for code generation. Similarly, DeepSeek LLM [25]
developed and released by DeepSeek has been extended to create DeepSeek-Coder [79], a variant
tailored for code generation. The Qwen team has developed and released Code Qwen [215], building
on their original Qwen [19] model. Microsoft, on the other hand, has unveiled WizardLM [250]
and is exploring its coding-oriented counterpart, WizardCoder [154]. Google has joined the fray
by releasing Gemma [214], subsequently followed by Code Gemma [54]. Beyond simply adapting
general-purpose LLMs for code-related tasks, there has been a proliferation of models specifically
engineered for code generation. Notable examples include StarCoder [132], OctoCoder [164], and
CodeGen [169]. These models underscore the trend of LLMs being developed with a focus on code
generation.
Recognizing the importance of these developments, we propose a taxonomy that categorizes
and evaluates the latest advances in LLMs for code generation. This taxonomy, depicted in Figure
3, serves as a comprehensive reference for researchers seeking to quickly familiarize themselves
with the state-of-the-art in this dynamic field.
In the subsequent sections, we will provide an in-depth analysis of each category related to code
generation. This will encompass a definition of the problem, the challenges to be addressed, and a
comparison of the most prominent models and their performance evaluation.
Fig. 4. A diagram depicting the standard data preprocessing workflow utilized in the pre-training phase of
large language models (LLMs) for code generation.
The pre-training corpora of general-purpose LLMs typically span diverse sources such as web pages, books,
news, scientific data, and code [19, 31, 49, 217, 218, 256]; these data are often crawled from the
web and must undergo meticulous and aggressive pre-processing [189, 271]. Fortunately, multiple
platforms and websites offer large-scale, open-source, and permissively licensed code corpora, such
as GitHub7 and Stack Overflow8 . Notably, the number of stars or forks of GitHub repositories has
emerged as a valuable metric for filtering high-quality code datasets. In a similar vein, the quantity
of votes on Stack Overflow can serve to discern the most relevant and superior answers.
Nonetheless, raw datasets are frequently laden with redundant, noisy data and personal infor-
mation, eliciting concerns regarding privacy leakage, which may include the names and email
addresses of repository contributors [7, 34, 123]. Consequently, it is essential to undertake rigorous
data-cleaning procedures. Typically, this process encompasses exact match deduplication, code
data filtering based on average line length and a defined threshold for the fraction of alphanumeric
characters, the removal of auto-generated files through keyword searches, and the expunction of
personal user data [118, 219]. Specifically, the standard data preprocessing workflow is depicted in
Figure 4.
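The sketch below illustrates the kind of heuristic filters described above (average line length, alphanumeric fraction, auto-generated-file keywords, and exact-match deduplication via hashing); the specific thresholds and keywords are illustrative assumptions rather than the exact values used by any particular dataset.

import hashlib

def keep_code_file(text,
                   max_avg_line_length=100,
                   min_alnum_fraction=0.25,
                   autogen_keywords=("auto-generated", "do not edit")):
    """Return True if a source file passes simple quality heuristics."""
    lines = text.splitlines() or [""]
    avg_line_length = sum(len(l) for l in lines) / len(lines)
    alnum_fraction = sum(c.isalnum() for c in text) / max(len(text), 1)
    is_autogenerated = any(k in text.lower() for k in autogen_keywords)
    return (avg_line_length <= max_avg_line_length
            and alnum_fraction >= min_alnum_fraction
            and not is_autogenerated)

def content_hash(text):
    # exact-match deduplication can be performed by hashing file contents
    return hashlib.sha256(text.encode("utf-8")).hexdigest()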
The development of a proficient LLM for code generation necessitates the utilization of various
types of code data at different developmental stages. Therefore, we categorize code data into three
distinct classes: pre-training datasets, instruction-tuning datasets, and benchmarks for performance
evaluation. The subsequent subsections will provide a detailed illustration of code data within each
classification.
7 https://github.com
8 https://stackoverflow.com
4.1.1 Pre-training. The remarkable success of bidirectional pre-trained language models (PLMs)
such as BERT [61] and unidirectional PLMs like GPT [185] has firmly established the practice of
pre-training on large-scale unlabeled datasets to endow models with a broad spectrum of general
knowledge. Extending this principle to the realm of code generation enables Large Language
Models (LLMs) to assimilate fundamental coding principles, including the understanding of code
structure dependencies, the semantics of code identifiers, and the intrinsic logic of code sequences
[45, 76, 232, 234]. In light of this advancement, there has been a proliferation of large-scale unlabeled
code datasets proposed to serve as the foundational training ground for LLMs to develop coding
proficiency. A brief introduction of these datasets is as follows, with the statistics available in Table
1.
• CodeSearchNet [99]: CodeSearchNet corpus is a comprehensive dataset, consisting of 2
million (comment, code) pairs from open-source repositories on GitHub. It includes code
and documentation in several programming languages including Go, Java, PHP, Python,
JavaScript, and Ruby. The dataset was primarily compiled to promote research into the
problem of code retrieval using natural language.
• Google BigQuery [86]: the Google BigQuery Public Datasets program offers a full snapshot
of the content of more than 2.8 million open source GitHub repositories in BigQuery.
• The Pile [70]: the Pile is an 825 GiB diverse and open source language modeling dataset
aggregating 22 smaller, high-quality datasets including GitHub, Books3, and Wikipedia (en).
It aims to encompass text from as many modalities as possible, thereby facilitating the
development of models with broader generalization capabilities. For code generation, the
GitHub composite is specifically utilized.
• CodeParrot [219]: the CodeParrot dataset contains Python files used to train the code genera-
tion model in Chapter 10: Training Transformers from Scratch in the “NLP with Transformers
book” [219]. Created with the GitHub dataset available via Google’s BigQuery, the CodeParrot
dataset includes approximately 22 million Python files, totaling 180 GB (50 GB compressed).
• GitHub Code [219]: the GitHub Code dataset comprises 115M code files derived from GitHub,
spanning 32 programming languages and 60 extensions totaling 1TB of data. The dataset
was created from the public GitHub dataset on Google BigQuery.
• ROOTS [123]: the BigScience ROOTS Corpus is a 1.6TB dataset spanning 59 languages that
was used to train the 176B BigScience Large Open-science Open-access Multilingual (BLOOM)
language model. For the code generation task, the code subset of the ROOTS Corpus will be
specifically utilized.
• The Stack [118]: the Stack contains over 6TB of permissively licensed source code files that
cover 358 programming languages. The dataset was compiled as part of the BigCode Project,
an open scientific collaboration working on the responsible development of Large Language
Models for Code (Code LLMs).
• The Stack v2 [151]: The Stack v2, a dataset created as part of the BigCode Project, contains
over 3B files across more than 600 programming and markup languages. The dataset is
derived from the Software Heritage archive9 , the largest public archive of software source
code and accompanying development history.
4.1.2 Instruction Tuning. Instruction tuning refers to the process of fine-tuning large language
models (LLMs) using a collection of datasets that are structured as instructions. This method
has demonstrated a considerable improvement in model performance and an enhanced ability
to generalize to unseen tasks that the model has not previously encountered, as evidenced by
9 https://archive.softwareheritage.org
Table 1. The statistics of some commonly-used pre-training datasets for large language models (LLMs) aimed
at code generation. The column labeled ‘#PL’ indicates the number of programming languages included in
each dataset. It should be noted that in the CodeSearchNet [99] dataset, each file represents a function, and
for the Pile [70] and ROOTS [123] datasets, only the code components are considered.
Table 2. The statistics of several representative datasets used in instruction-tuning large language models
(LLMs) for code generation. The column labeled ‘#PL’ indicates the number of programming languages
encompassed by each dataset.
recent studies [51, 173]. Leveraging these benefits, instruction tuning has been extended to coding
domains, especially code generation, which involves automatically generating the intended code
from a natural language description. The promise of instruction
tuning in this area has led numerous researchers to develop large-scale instruction-tuning datasets
tailored for code generation. Below, we provide an overview of several notable datasets tailored for
instruction tuning, with their respective statistics detailed in Table 2.
• CodeAlpaca-20k [40]: CodeAlpaca-20k is a collection of 20K instruction-following samples
generated using the Self-Instruct data synthesis technique outlined in [231], with
modifications targeting code generation, editing, and optimization tasks instead of general tasks.
• CommitPackFT [164]: CommitPackFT is a 2GB refined version of CommitPack. It is filtered
to only include high-quality commit messages that resemble natural language instructions.
• Evol-Instruct-Code-80k [195]: Evol-Instruct-Code-80k is an open-source implementation of
Evol-Instruct-Code described in the WizardCoder paper [154], which enhances the fine-tuning
effect of pre-trained code large models by adding complex code instructions.
• Magicoder-OSS-Instruct-75k [240]: a dataset of 75K synthetic instruction samples generated through
OSS-Instruct with gpt-3.5-turbo-1106 and used to train both the Magicoder and Magicoder-S series models.
• Self-OSS-Instruct-SC2-Exec-Filter-50k [261]: Self-OSS-Instruct-SC2-Exec-Filter-50k is gen-
erated by StarCoder2-15B using the OSS-Instruct [240] data synthesis approach. It was
subsequently used to fine-tune StarCoder2-15B without any human annotations or distilled
data from large, proprietary LLMs.
4.1.3 Benchmarks. To rigorously assess the efficacy of Large Language Models (LLMs) for code
generation, the research community has introduced a variety of high-quality benchmarks in
recent years. Building on the foundational work by [45], numerous variations of the HumanEval
dataset and additional benchmarks have emerged, aiming to evaluate a broader spectrum of code
generation capabilities in LLMs. We roughly divide these benchmarks into six distinct categories
based on their application contexts, including general-purpose, competitive programming, data
science, multilingual, logical reasoning, and repository-level. The statistics for these benchmarks
are presented in Table 3.
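Most of these benchmarks report the pass@k metric. For reference, the sketch below implements the unbiased pass@k estimator introduced with HumanEval [45], where n is the number of samples drawn per problem and c the number of samples that pass all unit tests; the example numbers are purely illustrative.

import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g., 200 samples per problem, 31 of them correct
print(pass_at_k(n=200, c=31, k=1))   # ~= 0.155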
General
• HumanEval [45]: HumanEval comprises 164 manually scripted Python programming prob-
lems, each featuring a function signature, docstring, body, and multiple unit tests.
• HumanEval+ [145]: HumanEval+ extends the original HumanEval [45] benchmark by in-
creasing the scale of the test cases by 80 times. As the test cases increase, HumanEval+ can
catch significant amounts of previously undetected incorrect code synthesized by LLMs.
• HumanEvalPack [164]: expands HumanEval [45] by extending it to encompass three coding
tasks across six programming languages, namely code synthesis, code repair, and code
explanation.
• MBPP [16]: MBPP is a collection of approximately 974 Python programming problems, crowd-
sourced and designed for entry-level programmers. Each problem comes with an English
task description, a code solution, and three automated test cases.
• MBPP+ [145]: MBPP+ enhances MBPP [16] by eliminating ill-formed problems and rectifying
problems with incorrect implementations. The test scale of MBPP+ is also expanded by 35
times for test augmentation.
• CoNaLa [255]: CoNaLa contains almost 597K data samples for evaluating Python code
generation. The curated part of CoNaLa is crawled from Stack Overflow, automatically
filtered, and then curated by annotators. The mined part of CoNaLa is automatically mined
and contains almost 600K examples.
• Spider [258]: Spider is a large-scale complex text-to-SQL dataset covering 138 different domains.
It has over 10K questions and 5.6K complex SQL queries on 200 databases. This dataset aims
to test a model’s ability to generalize to SQL queries, database schemas, and new domains.
• CONCODE [102]: CONCODE is a dataset with over 100K samples consisting of Java classes
from public GitHub repositories. It provides near zero-shot conditions that can test the
model’s ability to generalize to unseen natural language tokens with unseen environments.
• ODEX [236]: ODEX is an open-domain dataset focused on the execution-based generation
of Python code from natural language. It features 945 pairs of natural language queries and
their corresponding Python code, all extracted from StackOverflow forums.
• CoderEval [257]: CoderEval is a pragmatic code generation benchmark that includes 230
Python and 230 Java code generation problems. It can be used to evaluate the model perfor-
mance in generating pragmatic code beyond just generating standalone functions.
• ReCode [226]: ReCode serves as a comprehensive robustness evaluation benchmark. It
applies perturbations to docstrings, function and variable names, code syntax, and code
format, thereby providing a multifaceted assessment of a model's robustness.
• StudentEval [18]: StudentEval is a dataset of 1,749 prompts for 48 problems, authored by 80
students who have completed only a one-semester Python programming class. Unlike many
other benchmarks, it has multiple prompts per problem and multiple attempts by the same
participant; each problem is also accompanied by a set of instructor-written test cases.
Competitions
• APPS [85]: The APPS benchmark is composed of 10K Python problems, spanning three levels
of difficulty: introductory, interview, and competition. Each entry in the dataset includes a
problem statement in English, ground-truth Python solutions, and test cases for checking
functional correctness.
Table 3. The detailed statistics of commonly-used benchmarks used in evaluating large language models
(LLMs) for code generation. The column labeled ‘#PL’ indicates the number of programming languages
included in each dataset. For brevity, we list the programming languages (PLs) for benchmarks
that support five or fewer PLs; for benchmarks with six or more PLs, we provide only a numerical
count of the PLs supported.
Repository
• RepoEval [266]: RepoEval enables the evaluation of repository-level code completion. It can
offer different levels of granularity and improved evaluation accuracy through the use of unit
tests.
• Stack-Repo [205]: Stack-Repo is a dataset of 200 Java repositories from GitHub with near-
deduplicated files. These files are augmented with three types of repository contexts: prompt
proposal contexts, BM25 Contexts (based on BM25 similarity scores), and RandomNN Con-
texts (obtained using the nearest neighbors in the representation space of an embedding
model).
• RepoBench [150]: RepoBench is a benchmark specifically designed for evaluating repository-
level code auto-completion systems. Supporting both Python and Java, it consists of three
interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion),
and RepoBench-P (Pipeline).
general purposes. These models have consistently outperformed larger counterparts across various
benchmarks, demonstrating the efficacy of synthetic data in model training.
Drawing on the successes of data synthesis for general-purpose Large Language Models (LLMs),
researchers have expanded the application of synthetic data to the realm of code generation. The
Code Alpaca model, as described in [40], has been fine-tuned on a 7B and 13B LLaMA model using
a dataset of 20k instruction-following examples for code generation. This dataset was created
by text-davinci-00310 and employed the Self-Instruct technique [231]. Building on this, the
WizardCoder 15B [154] utilizes the Evol-Instruct technique to create an enhanced dataset of
78k evolved code instruction examples. This dataset originates from the initial 20k instruction-
following dataset used by Code Alpaca [40], which was also generated by text-davinci-003. The
WizardCoder model, fine-tuned on the StarCoder [132] base model, achieved a 57.3% pass@1 on
the HumanEval benchmarks. This performance not only surpasses all other open-source Code
LLMs by a significant margin but also outperforms leading closed LLMs such as Anthropic’s Claude
and Google’s Bard. In a similar vein, Magicoder [240] introduces a novel data synthesis approach
termed OSS-INSTRUCT which enlightens LLMs with open-source code snippets to generate high-
quality instruction data for coding tasks. It aims to address the inherent biases often present
in synthetic data produced by LLMs. Building upon CodeLlama [196], the MagicoderS-CL-7B
model — fine-tuned with 75k synthetic instruction data using the OSS-INSTRUCT technique and
with gpt-3.5-turbo-1106 as the data generator — has outperformed the prominent ChatGPT
on the HumanEval Plus benchmark, achieving pass@1 of 66.5% versus 65.9%. In a noteworthy
development, Microsoft has introduced the phi-1 model [75], a more compact LLM of only 1.3B
parameters. Despite its smaller size, phi-1 has been trained on high-quality textbook data sourced
from the web (comprising 6 billion tokens) and supplemented with synthetic textbooks and exercises
generated with GPT-3.5 (1 billion tokens). It has achieved pass@1 of 50.6% on HumanEval and
55.5% on MBPP, setting a new state-of-the-art for Python coding performance among existing small
language models (SLMs). The latest contribution to this field is from the BigCode team, which has
presented StarCoder2-15B-instruct [261], the first entirely self-aligned code LLM trained with a
transparent and permissive pipeline. This model aligns closely with the OSS-INSTRUCT principles
established by Magicoder, generating instructions based on seed functions filtered from the Stack
v1 dataset [118] and producing responses through self-validation. Unlike Magicoder, StarCoder2-
15B-instruct employs its base model, StarCoder2-15B, as the data generator, thus avoiding reliance
on large and proprietary LLMs like GPT-3.5-turbo [171].
While synthetic data has demonstrated its potential across both small- and large-scale LMs for a
variety of general and specialized tasks, including code generation, it also poses several challenges
that must be addressed. These challenges include a lack of data diversity [242], the need to ensure
the factuality and fidelity of the information [221, 243], and the potential to amplify existing biases
or introduce new ones [23, 80].
4.3 Pre-Training
4.3.1 Model Architectures. Since the inception of the Transformer architecture for machine trans-
lation [222], it has become the de facto backbone for a multitude of large language models (LLMs)
that address a wide range of downstream tasks. The Transformer and its derivatives owe their
prominence to their exceptional ability to parallelize computation and their powerful representa-
tional capacities [256, 273]. Through innovative scaling techniques, such as Mixture-of-Experts
(MoE) [33, 200] and Depth-Up-Scaling (DUS) [117], the capacity of Transformer-based LLMs has
expanded to encompass hundreds of billions or even trillions of parameters. These scaled-up models
10 https://platform.openai.com
Table 4. The overview of large language models (LLMs) with encoder-decoder architectures for code genera-
tion.
Model | Institution | Size | Vocabulary | Context Window | Date | Open Source
PyMT5 [52] | Microsoft | 374M | 50K | 1024+1024 | 2020-10 | No
PLBART [6] | UCLA | 140M | 50K | 1024+1024 | 2021-03 | Yes
CodeT5 [234] | Salesforce | 60M, 220M, 770M | 32K | 512+256 | 2021-09 | Yes
JuPyT5 [38] | Microsoft | 350M | 50K | 1024+1024 | 2022-01 | No
AlphaCode [136] | DeepMind | 284M, 1.1B, 2.8B, 8.7B, 41.1B | 8K | 1536+768 | 2022-02 | No
CodeRL [125] | Salesforce | 770M | 32K | 512+256 | 2022-06 | Yes
ERNIE-Code [37] | Baidu | 560M | 250K | 1024+1024 | 2022-12 | Yes
PPOCoder [204] | Virginia Tech | 770M | 32K | 512+256 | 2023-01 | No
CodeT5+ [232] | Salesforce | 220M, 770M, 2B, 6B, 16B | 50K | 2048+2048 | 2023-05 | Yes
CodeFusion [207] | Microsoft | 75M | 32K | 128+128 | 2023-10 | Yes
AST-T5 [73] | UC Berkeley | 226M | 32K | 512+200/300 | 2024-01 | Yes
have exhibited a range of emergent abilities [87, 114, 238], such as instruction following [173],
in-context learning [65], and step-by-step reasoning [95, 239] that were previously unforeseen.
In the domain of code generation using LLMs, the architecture of contemporary models generally
falls into one of two categories: encoder-decoder models, such as CodeT5 [234], CodeT5+ [232],
and CodeRL [125]; or decoder-only models, such as Codex [45], StarCoder [132], Code Llama [196],
and CodeGemma [54]. These architectures are depicted in Figure 2(b) and (c), respectively. For a
comprehensive overview, Table 4 details the encoder-decoder architectures, while Table 5 focuses
on the decoder-only models utilized in code generation.
4.3.2 Pre-training Tasks. In the initial phase, language models for code generation are typically
trained from scratch using datasets consisting of manually annotated pairs of natural language
descriptions and corresponding code snippets, within a supervised learning framework. However,
manual annotation is not only laborious and time-consuming, but the efficacy of the resulting
models is also constrained by both the volume and the quality of the available annotated data. This
limitation is especially pronounced for low-resource languages, such as Swahili and Yoruba,
where annotated examples are scarce [35, 43]. In light of these challenges,
there has been a shift towards an alternative training strategy that involves pre-training models on
extensive and unlabelled code corpora. This method is aimed at imbuing the models with a broad
understanding of programming knowledge, encompassing elements like identifiers, code structure,
and underlying semantics [45]. In this regard, two pre-training tasks have gained prominence
for their effectiveness, namely Causal Language Modeling (CLM), also known as unidirectional
language modeling or next-token prediction, and Denoising Autoencoding (DAE). The CLM task
can be applied to both decoder-only and encoder-decoder model architectures, while DAE tasks are
specifically designed for encoder-decoder frameworks. It should also be noted that there is a variety
of additional auxiliary pre-training tasks that can further enhance model performance. These
include Masked Identifier Prediction, Identifier Tagging, Bimodal Dual Generation [234], Text-Code
Matching, and Text-Code Contrastive Learning [232]. These tasks contribute to a more nuanced
and comprehensive pre-training process, equipping the models with the capabilities necessary to
handle a wide range of code generation scenarios.
Causal Language Modeling. In decoder-only LLMs, given a sequence of tokens x = {𝑥 1, . . . , 𝑥𝑛 },
the CLM task refers to autoregressively predicting the target token x_i based on the preceding tokens
x_{<i} in the sequence. The causal language modeling objective for training decoder-only LLMs is to minimize
Table 5. The overview of large language models (LLMs) with decoder-only architectures for code generation.
Model | Institution | Size | Vocabulary | Context Window | Date | Open Source
GPT-C [210] | Microsoft | 366M | 60K | 1024 | 2020-05 | No
CodeGPT [153] | Microsoft | 124M | 50K | 1024 | 2021-02 | Yes
GPT-Neo [29] | EleutherAI | 125M, 1.3B, 2.7B | 50K | 2048 | 2021-03 | Yes
GPT-J [223] | EleutherAI | 6B | 50K | 2048 | 2021-05 | Yes
Codex [45] | OpenAI | 12M, 25M, 42M, 85M, 300M, 679M, 2.5B, 12B | - | 4096 | 2021-07 | No
CodeParrot [219] | Hugging Face | 110M, 1.5B | 33K | 1024 | 2021-11 | Yes
PolyCoder [251] | CMU | 160M, 400M, 2.7B | 50K | 2048 | 2022-02 | Yes
CodeGen [169] | Salesforce | 350M, 2.7B, 6.1B, 16.1B | 51K | 2048 | 2022-03 | Yes
GPT-NeoX [28] | EleutherAI | 20B | 50K | 2048 | 2022-04 | Yes
PaLM-Coder [49] | Google | 8B, 62B, 540B | 256K | 2048 | 2022-04 | No
InCoder [69] | Meta | 1.3B, 6.7B | 50K | 2049 | 2022-04 | Yes
PanGu-Coder [50] | Huawei | 317M, 2.6B | 42K | 1024 | 2022-07 | No
PyCodeGPT [263] | Microsoft | 110M | 32K | 1024 | 2022-06 | Yes
CodeGeeX [275] | Tsinghua | 13B | 52K | 2048 | 2022-09 | Yes
BLOOM [126] | BigScience | 176B | 251K | - | 2022-11 | Yes
ChatGPT [171] | OpenAI | - | - | 16K | 2022-11 | Yes
SantaCoder [8] | Hugging Face | 1.1B | 49K | 2048 | 2022-12 | Yes
LLaMA [217] | Meta | 6.7B, 13.0B, 32.5B, 65.2B | 32K | 2048 | 2023-02 | Yes
GPT-4 [5] | OpenAI | - | - | 32K | 2023-03 | No
CodeGen2 [168] | Salesforce | 1B, 3.7B, 7B, 16B | 51K | 2048 | 2023-05 | Yes
replit-code [193] | replit | 3B | 33K | 2048 | 2023-05 | Yes
StarCoder [132] | Hugging Face | 15.5B | 49K | 8192 | 2023-05 | Yes
WizardCoder [154] | Microsoft | 15B, 34B | 49K | 8192 | 2023-06 | Yes
phi-1 [75] | Microsoft | 1.3B | 51K | 2048 | 2023-06 | Yes
CodeGeeX2 [275] | Tsinghua | 6B | 65K | 8192 | 2023-07 | Yes
PanGu-Coder2 [201] | Huawei | 15B | 42K | 1024 | 2023-07 | No
Llama 2 [218] | Meta | 7B, 13B, 70B | 32K | 4096 | 2023-07 | Yes
OctoCoder [164] | Hugging Face | 15.5B | 49K | 8192 | 2023-08 | Yes
Code Llama [196] | Meta | 7B, 13B, 34B | 32K | 16384 | 2023-08 | Yes
CodeFuse [143] | Ant Group | 350M, 13B, 34B | 101K | 4096 | 2023-09 | Yes
phi-1.5 [135] | Microsoft | 1.3B | 51K | 2048 | 2023-09 | Yes
CodeShell [247] | Peking University | 7B | 70K | 8192 | 2023-10 | Yes
Magicoder [240] | UIUC | 7B | 32K | 16384 | 2023-12 | Yes
AlphaCode 2 [10] | Google DeepMind | - | - | - | 2023-12 | No
StableCode [182] | StabilityAI | 3B | 50K | 16384 | 2024-01 | Yes
WaveCoder [259] | Microsoft | 6.7B | 32K | 16384 | 2023-12 | Yes
phi-2 [161] | Microsoft | 2.7B | 51K | 2048 | 2023-12 | Yes
DeepSeek-Coder [79] | DeepSeek | 1.3B, 6.7B, 33B | 32K | 16384 | 2023-11 | Yes
StarCoder 2 [151] | Hugging Face | 15B | 49K | 16384 | 2024-02 | Yes
Claude 3 [13] | Anthropic | - | - | 200K | 2024-03 | No
CodeGemma [54] | Google | 2B, 7B | 25.6K | 8192 | 2024-04 | Yes
Code-Qwen [215] | Qwen Group | 7B | 92K | 65536 | 2024-04 | Yes
Llama3 [160] | Meta | 8B, 70B | 128K | 8192 | 2024-04 | Yes
StarCoder2-Instruct [261] | Hugging Face | 15.5B | 49K | 16384 | 2024-04 | Yes
L_CLM^{Decoder-only}(x) = − Σ_{i=1}^{n} log P_θ(x_i | x_{<i}),

where x_{<i} represents the sequence of preceding tokens {x_1, . . . , x_{i−1}} before x_i in the input, and θ
denotes the model parameters. The conditional probability P_θ(x_i | x_{<i}) is modeled by adding a
causal attention mask to the multi-head self-attention matrix of each Transformer block. To be
specific, causal attention masking is implemented by setting the lower triangular part of the
matrix to 0 and the remaining elements to −∞, ensuring that each token 𝑥𝑖 attends only to its
predecessors and itself. In contrast, for encoder-decoder LLMs, a pivot token x_k is randomly
selected in a sequence of tokens; the context before it is treated as the source sequence
x_in = {x_1, . . . , x_k} for the encoder, and the sequence after it as the target output x_out = {x_{k+1}, . . . , x_n}
for the decoder. Formally, the causal language modeling objective for training encoder-decoder LLMs is
to minimize the following loss function:
L_CLM^{Encoder-Decoder}(x) = − log Π_{i=k+1}^{n} P_θ(x_i | x_{≤k}, x_{<i}) = − Σ_{i=k+1}^{n} log P_θ(x_i | x_{≤k}, x_{<i}),   (16)
where x ≤𝑘 is the source sequence input and x<𝑖 denotes the target sequence autoregressively
generated so far. During the inference phase, pre-trained LLMs that have been trained on large-
scale code corpus can generate code in a zero-shot manner without the need for fine-tuning. This
is achieved through the technique of prompt engineering, which guides the model to produce the
desired output11 [31, 186]. Additionally, recent studies have explored the use of few-shot learning,
also referred to as in-context learning, to enhance model performance further [131, 178].
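As a concrete illustration of the decoder-only CLM objective, the following sketch accumulates the negative log-likelihood of a token sequence under a toy next-token predictor; next_token_probs is a hypothetical stand-in for a Transformer decoder with a causal attention mask.

import numpy as np

def clm_loss(token_ids, next_token_probs):
    """token_ids: list of ints; next_token_probs(prefix) -> probability vector
    over the vocabulary for the next token, given the prefix x_{<i}."""
    loss = 0.0
    for i in range(1, len(token_ids)):
        p = next_token_probs(token_ids[:i])       # P_theta(x_i | x_{<i})
        loss -= np.log(p[token_ids[i]] + 1e-12)   # accumulate negative log-likelihood
    return loss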
Denoising Autoencoding. In addition to causal language modeling (CLM), the denoising
autoencoding (DAE) task has been extensively applied in pre-training encoder-decoder architectures
for code generation, such as PLBART [6], CodeT5 [234], and its enhanced successor, CodeT5+ [232].
Following T5 [189] and CodeT5 [234], the DAE refers to initially perturbing the source sequence
by introducing randomly masked spans of varying lengths. This corrupted sequence serves as the
input for the encoder. Subsequently, the decoder employs an autoregressive strategy to reconstruct
the masked spans, integrating sentinel tokens to facilitate the generation process. This method
has proven effective in improving the model’s ability to generate semantically and syntactically
accurate code by learning robust contextual representations [232, 234]. Formally, the denoising
autoencoding objective for training encoder-decoder LLMs is to minimize the following likelihood:
L_DAE^{Encoder-Decoder}(x) = − Σ_{i=1}^{k} log P_θ(x_i^{masked_spans} | x^{\masked_spans}, x_{<i}^{masked_spans}),   (17)

where θ denotes the model parameters, x^{\masked_spans} is the noisy input with masked spans,
x^{masked_spans} denotes the masked spans to be predicted by the decoder, with k denoting the number of
tokens in x^{masked_spans}, and x_{<i}^{masked_spans} is the span sequence autoregressively generated so far.
Compared with CLM, the DAE task presents a more challenging scenario, as it necessitates a deeper
understanding and capture of the intrinsic semantic relationships among token sequences by LLMs
[189].
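To make the denoising objective concrete, the sketch below performs T5-style span corruption on a token sequence: randomly chosen contiguous spans are replaced with sentinel tokens in the encoder input, and the decoder target lists each sentinel followed by the original span. The sentinel naming and span-length choices are illustrative assumptions.

import random

def corrupt_spans(tokens, num_spans=2, max_span_len=3, seed=0):
    """Return (encoder_input, decoder_target) for one DAE / span-corruption example."""
    rng = random.Random(seed)
    tokens = list(tokens)
    starts = sorted(rng.sample(range(len(tokens) - max_span_len), num_spans))
    encoder_input, decoder_target, cursor = [], [], 0
    for s_id, start in enumerate(starts):
        if start < cursor:                      # skip spans overlapping a previous one
            continue
        length = rng.randint(1, max_span_len)
        sentinel = f"<extra_id_{s_id}>"
        encoder_input += tokens[cursor:start] + [sentinel]
        decoder_target += [sentinel] + tokens[start:start + length]
        cursor = start + length
    encoder_input += tokens[cursor:]
    return encoder_input, decoder_target

# e.g., corrupting a small code snippet tokenized by whitespace
enc, dec = corrupt_spans("def add ( a , b ) : return a + b".split())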
Instruction: Write code to create a list of all prime numbers between 2 and 100.
Input: N/A
Output:
def find_primes(n):
    prime_list = [2]
    for number in range(2, n + 1):
        is_prime = True
        for k in range(2, number):
            if number % k == 0:
                is_prime = False
        if is_prime:
            prime_list.append(number)
    return prime_list

Instruction: Generate a snippet of code to extract all the URLs from the given string.
Input: This string contains some urls such as https://www.google.com and https://www.facebook.com.
Output:
import re

string = "This string contains some urls such as https://www.google.com and https://www.facebook.com."

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)

print(urls)
Fig. 5. Two exemplars of instruction data sampled from Code Alpaca [40], used to instruction-tune pre-trained
code LLMs to enhance their alignment with natural language instructions. The instruction corpus encompasses
a variety of tasks, each accompanied by distinct instructions, such as prime number generation and URL
extraction.
In the realm of code generation, natural language descriptions serve as the instructions guiding
the model to generate corresponding code snippets. Consequently, a line of research on instruction
tuning LLMs for code generation has garnered substantial interest across academia and industry.
To perform instruction tuning, instruction data are typically compiled from source code with
permissive licenses [99, 118, 151] (refer to Section 4.1.2) or are constructed from synthetic code data
[154, 240, 261] (refer to Section 4.2). These datasets are then utilized to fine-tune LLMs through
a supervised learning paradigm. However, the substantial computational resources required for
full parameter fine-tuning (FFT) of LLMs pose a notable challenge, particularly in scenarios with
constrained resources [62, 138]. To mitigate this issue, parameter-efficient fine-tuning (PEFT) has
emerged as a compelling alternative strategy, gaining increasing attention for its potential to reduce
resource consumption [62]. In the following subsection, we categorize existing works based on
their instruction-tuning strategies to provide a comprehensive and systematic review.
4.4.1 Full Parameter Fine-tuning. Full parameter fine-tuning (FFT) involves updating all parameters
within a pre-trained model, as shown in Figure 6(a). This approach is often preferred when ample
computational resources and substantial training data are available, as it typically leads to better
performance. [234] introduces an encoder-decoder pre-trained language model for code generation,
named CodeT5+. They instruction-tune this model on a dataset comprising 20k instruction samples
from Code Alpaca [40], resulting in an instruction-following model called InstructCodeT5+, which
exhibited improved capabilities in code generation. [154] leverages the Evol-Instruct data synthesis
technique from WizardLM [250] to evolve 20K code Alpaca [40] instruction samples into a 78K
code instruction dataset. This enriched dataset is then used to fine-tune the StarCoder base model,
h = W_0 x + ΔW x = W_0 x + (α/r) B_up A_down x,

where W_0 ∈ R^{d×k} denotes a pre-trained weight matrix, and B_up ∈ R^{d×r} and A_down ∈ R^{r×k} are
two trainable low-rank matrices, initialized with a zero matrix and a random Gaussian distribution
N(0, σ²), respectively, to ensure ΔW = 0 at the beginning of training. The rank r ≪ min(d, k), and
α/r is a scaling coefficient that balances the importance of the LoRA module, much like a learning rate.
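A minimal sketch of a LoRA-adapted linear layer corresponding to the formulation above: the pre-trained weight W_0 is kept frozen, and only the low-rank factors B_up and A_down would be trained, with B_up initialized to zero so that ΔW = 0 at the start; only the forward pass is shown, and the initialization scale is an illustrative choice.

import numpy as np

class LoRALinear:
    def __init__(self, W0, r=8, alpha=16, rng=np.random.default_rng(0)):
        d, k = W0.shape
        self.W0 = W0                                   # frozen pre-trained weight, shape (d, k)
        self.B_up = np.zeros((d, r))                   # zero init => delta_W = 0 at the start
        self.A_down = rng.normal(0.0, 0.02, (r, k))    # Gaussian init
        self.scaling = alpha / r                       # alpha / r scaling coefficient

    def forward(self, x):
        # h = W0 x + (alpha / r) * B_up A_down x; only B_up and A_down would receive gradients
        return self.W0 @ x + self.scaling * (self.B_up @ (self.A_down @ x))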
Despite the advancements in PEFT methods, their application in code generation remains limited.
For instance, [108] pioneered the use of parameter-efficient instruction-tuning on a Llama 2 [218]
model with a single RTX 3090 GPU, leading to the development of a multilingual code generation
model called CodeUp. More recently, ASTRAIOS [285] conducted a thorough empirical examination
of parameter-efficient instruction tuning for code comprehension and generation tasks. This study
Fig. 6. An illustration of full parameter fine-tuning (FFT) and parameter-efficient fine-tuning (PEFT) methods.
(a) refers to the Full Fine-tuning method, which updates all parameters of the base model during fine-tuning.
(b) stands for the Specification-based PEFT method that conditionally fine-tunes a small subset of the model
parameters while freezing the rest of the model, e.g. BitFit [262]. (c) represents the Addition-based PEFT
method that fine-tunes the incremental parameters introduced into the base model or input, e.g. Adapter
[92], Prefix-tuning [134], and Prompt-tuning [128]. (d) symbolizes the Reparameterization-based method
which reparameterizes existing model parameters by low-rank transformation, e.g. LoRA [93], QLoRA [60],
and AdaLoRA [267].
yielded several perceptive observations and conclusions, contributing valuable insights to the
domain.
enhances code compilability by employing compiler feedback, including language model fine-
tuning, compilability reinforcement, and compilability discrimination strategies. Subsequently,
PPOCoder [204] integrates pre-trained code model CodeT5 [234] with Proximal Policy Optimization
(PPO) [198]. This integration not only utilizes execution (i.e., compilers or interpreters) feedback to
assess syntactic and functional correctness but also incorporates a reward function that evaluates
the syntactic and semantic congruence between abstract syntax tree (AST) sub-trees and data flow
graph (DFG) edges in the generated code against the ground truth. Additionally, the framework
applies a KL-divergence penalty to maintain fidelity between the actively learned policy and the
referenced pre-trained model, enhancing the optimization process. More recently, RLTF [146] has
proposed an online reinforcement learning framework that provides fine-grained feedback based
on compiler error information and location, along with adaptive feedback that considers the ratio
of passed test cases.
Despite these successes, reinforcement learning algorithms face inherent limitations such as
inefficiency, instability, extensive resource requirements, and complex hyperparameter tuning,
which can impede the performance and scalability of LLMs. To overcome these challenges, recent
studies have introduced various variants of RL methods that do not rely on PPO, including DPO
[188], RRHF [260], and sDPO [116]. In essence, these methods aim to maximize the margin between
the log-probabilities of preferred and rejected responses, which may
be produced by LLMs with varying capabilities. Inspired by RRHF [260], PanGu-Coder 2 [201]
leverages a novel framework, Reinforcement Learning via Rank Responses to align Test & Teacher
Feedback (RRTF), significantly enhancing code generation capabilities, as evidenced by pass@1 of
62.20% on the HumanEval benchmark.
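For reference, the pairwise objective at the core of DPO-style methods can be sketched as follows; this is a simplified illustration that assumes summed per-token log-probabilities for each response and is not the exact formulation of any of the cited works:

import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logp: torch.Tensor,
                   policy_rejected_logp: torch.Tensor,
                   ref_chosen_logp: torch.Tensor,
                   ref_rejected_logp: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss: push the policy to prefer the chosen (e.g., test-passing)
    code sample over the rejected one, measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()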
Taking a step forward, the integration of more non-differentiable code features, such as coding
style [41, 158] and readability [32], into the reinforcement learning feedback for LLM-based code
generation, presents an exciting avenue for future research.
Fig. 7. An illustration of the self-improving code generation pipeline using prompts for large language models
(LLMs). This process incorporates iterative self-refinement by integrating execution outcomes and includes
an optional self-reflection mechanism to enhance generation quality.
light on the self-refinement effectiveness of these LLMs. Moreover, Reflexion [202] introduces a
general approach for code generation wherein LLM-powered agents engage in verbal self-reflection
on task feedback signals, storing these reflections in an episodic memory buffer to inform and
improve decision-making in subsequent interactions. LATS [280] adopts a novel strategy, utilizing
LLMs as agents, value functions, and optimizers. It enhances decision-making by meticulously
constructing trajectories through Monte Carlo Tree Search (MCTS) algorithms, integrating external
feedback, and learning from experience. This approach has demonstrated remarkable results in
code generation, achieving a pass@1 of 94.4% on the HumanEval benchmark with GPT-4.
Distinct from the aforementioned methods, CodeT [42] and LEVER [166] prompt LLMs to
generate numerous code samples, which are then re-ranked based on execution outcomes to select
the optimal solution. Notably, these approaches do not incorporate a self-refinement step to further
improve code generation.
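To summarize the pipeline of Fig. 7 in code, a minimal generate-execute-refine loop might look as follows; llm_generate is a placeholder callable and the unit tests are assumed to be plain Python scripts, so this is a sketch rather than the implementation of any method discussed above:

import subprocess
import tempfile

def run_tests(code: str, test_code: str, timeout: int = 10):
    """Execute candidate code together with its unit tests and return (passed, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, proc.stdout + proc.stderr

def self_refine(task: str, test_code: str, llm_generate, max_iters: int = 3) -> str:
    """Iteratively regenerate code from execution feedback, with an optional self-reflection step."""
    code = llm_generate(task)
    for _ in range(max_iters):
        passed, feedback = run_tests(code, test_code)
        if passed:
            break
        reflection = llm_generate(f"Explain why this code may fail:\n{code}\n\nExecution output:\n{feedback}")
        code = llm_generate(
            f"{task}\n\nPrevious attempt:\n{code}\n\nExecution feedback:\n{feedback}\n\n"
            f"Reflection:\n{reflection}\n\nRevise the code."
        )
    return code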
• LLMs may not have been adequately trained on extensive sets of repository data, such as
proprietary software or projects that are still in development [205].
Given that the scope of a typical software repository encompasses hundreds of thousands of
tokens, it is imperative to enhance the capacity of LLMs to handle extensive contexts when they
are employed for repository-level code generation. Fortunately, recent advancements in positional
encoding techniques, such as ALiBi [183] and RoPE [209], have shown promise in improving the
Transformer’s ability to generalize from shorter training sequences to longer inference sequences
[272]. This progress addresses the third challenge mentioned above to a certain degree, thereby
enabling better contextualization of coding activities within full repositories.
To further refine LLMs for repository-level code completion, several innovative approaches have
been introduced. RepoCoder [266] leverages a similarity-based retrieval system within an iterative
retrieval-generation paradigm to enrich the context and enhance code completion quality. In a
similar vein, CoCoMIC [64] employs a cross-file context finder named CCFINDER to pinpoint and
retrieve the most relevant cross-file contexts within a repository. RepoHyper [181] introduces a
semantic graph structure, termed RSG, to encapsulate the expansive context of code repositories
and uses an “Expand and Refine” retrieval method to obtain relevant code snippets. Moreover, a
framework known as RLPG [206] has been proposed to generate repository-level prompts that
integrate the repository’s structure with the relevant context across all files. However, the constant
reliance on retrieval mechanisms has raised concerns regarding efficiency and robustness, as some
retrieved contexts may prove unhelpful or harmful. In response, Repoformer [244] introduces a
selective Retrieval-Augmented Generation (RAG) framework that judiciously bypasses retrieval
when it is deemed redundant. This approach incorporates a self-supervised learning strategy that
equips a code LLM with the ability to perform a self-assessment on the utility of retrieval for
enhancing the quality of its output, thereby effectively utilizing potentially noisy retrieved contexts.
Additionally, RepoFusion [205] has been developed to train models to combine multiple relevant
contexts from a repository, aiming to produce more precise and context-aware code completions.
In a novel approach, Microsoft’s CodePlan [21] frames repository-level coding tasks as a planning
problem, generating a multi-step chain of edits (plan) where each step involves invoking an LLM on a
specific code location, considering context from the entire repository, preceding code modifications,
and task-specific instructions.
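A minimal sketch of such an iterative retrieval-generation loop is given below; the lexical (Jaccard) similarity, the chunk granularity, and the prompt format are illustrative stand-ins for the retrievers actually used by the systems above:

def jaccard(a: str, b: str) -> float:
    """Token-set similarity used here as a lightweight, lexical retrieval score."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def iterative_retrieve_generate(query: str, repo_chunks: list[str], llm_generate,
                                rounds: int = 2, top_k: int = 3) -> str:
    """Each round retrieves the repository chunks most similar to the current query
    plus draft completion, then regenerates the completion with that context."""
    draft = ""
    for _ in range(rounds):
        key = query + "\n" + draft
        context = sorted(repo_chunks, key=lambda c: jaccard(key, c), reverse=True)[:top_k]
        prompt = "\n\n".join(context) + "\n\n# Task:\n" + query
        draft = llm_generate(prompt)
    return draft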
Advancing the state-of-the-art, [265] tackles the formidable challenge of NL2Repo, an endeavor
that seeks to create a complete code repository from natural language requirements. To address
this complex task, they introduce the CodeS framework, which strategically breaks down NL2Repo
into a series of manageable sub-tasks using a multi-layer sketch approach. The CodeS framework
comprises three distinct modules: 1) RepoSketcher, for creating a directory structure of the reposi-
tory based on given requirements; 2) FileSketcher, for sketching out each file within that structure;
and 3) SketchFiller, for fleshing out the specifics of each function within the file sketches [265].
Accordingly, a surge of benchmarks tailored for repository-level code generation has emerged,
such as RepoEval [266], Stack-Repo [205], Repobench [150], EvoCodeBench [130], SWE-bench
[111], CrossCodeEval [63], and SketchEval [265]. The detailed statistics and comparisons of these
benchmarks are presented in Table 3.
Despite the progress made by these methods in repository-level code generation, significant chal-
lenges remain to be addressed. Programming developers are often required to invest considerable
time in editing and debugging [24, 27, 163, 205, 220]. However, the advent of LLM-powered coding
agents, such as AutoCodeRover [270], SWE-agent [112], and OpenDevin [172], has demonstrated
their potential to tackle complex problems, paving the way for future exploration in this field (for
more details, see Section 4.9).
Fig. 8. A workflow illustration of the Retrieval-Augmented Code Generation (RACG). Upon receiving a query
(instruction), the retriever selects the relevant contexts from a large-scale vector database. Subsequently, the
retrieved contexts are merged with the query, and this combined input is fed into the generator (LLM) to
produce the target code solution.
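The workflow of Fig. 8 can be sketched as follows, where embedding_model and llm_generate are placeholder callables and cosine similarity over an in-memory array stands in for a real vector database:

import numpy as np

def racg(query: str, code_chunks: list[str], embedding_model, llm_generate, top_k: int = 3) -> str:
    """Retrieval-Augmented Code Generation sketch: embed the code chunks (offline) and
    the query (online), retrieve the most similar chunks by cosine similarity, and
    condition the generator on the merged prompt."""
    chunk_vecs = np.array([embedding_model(c) for c in code_chunks])   # vector database
    q = np.array(embedding_model(query))                               # query embedding
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    context = [code_chunks[i] for i in np.argsort(-sims)[:top_k]]
    prompt = "# Retrieved context:\n" + "\n\n".join(context) + f"\n\n# Instruction: {query}\n"
    return llm_generate(prompt)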
Fig. 9. The general architecture of an LLM-powered autonomous agent system, adapted from [241]. Planning:
The agent decomposes large tasks into smaller, manageable sub-goals or engages in self-criticism and self-
reflection on past actions to learn from mistakes and improve future performance. Memory: This component
enables the agent to store and retrieve past information. Tools: The agent is trained to invoke external
functions or APIs. Action: The agent executes actions, with or without the use of tools, to interact with the
environment. The gray dashed lines represent the data flow within the system.
[94] introduces AgentCoder, a multi-agent framework composed of three specialized agents, each
with distinct roles and capabilities. These roles include a programmer agent responsible for code
generation, a test designer agent tasked with generating unit test cases, and a test executor agent
that executes the code and provides feedback. This division of labor within AgentCoder promotes
more efficient and effective code generation. CodeAct [228] distinguishes itself by utilizing exe-
cutable Python code to consolidate LLM agent actions within a unified action space, in contrast
to the generation of JSON or textual formats. Additionally, AutoCodeRover [270] is proposed to
autonomously resolve GitHub issues for program enhancement.
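A highly simplified sketch of this role-based division of labor is shown below; the prompts and the run_tests helper are illustrative assumptions rather than the actual agents of AgentCoder:

def multi_agent_codegen(task: str, llm, run_tests, max_rounds: int = 3) -> str:
    """Three roles: a test-designer agent writes unit tests, a programmer agent writes
    code, and a test-executor agent runs the tests and reports the results back."""
    tests = llm(f"Write Python unit tests for this task:\n{task}")        # test designer agent
    code = llm(f"Write Python code for this task:\n{task}")               # programmer agent
    for _ in range(max_rounds):
        passed, report = run_tests(code, tests)                           # test executor agent
        if passed:
            break
        code = llm(f"Task:\n{task}\n\nCode:\n{code}\n\nFailing test report:\n{report}\n\nFix the code.")
    return code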
To address the complexity of tasks within software engineering, two innovative autonomous AI
software engineers, Devin14 [56] and OpenDevin15 [172], have been released and rapidly garnered
considerable interest within the software engineering (SE) and artificial general intelligence (AGI)
community. Subsequently, an autonomous system, SWE-agent [112], leverages a language model
to interact with a computer to address software engineering tasks, successfully resolving 12.5% of
issues on the SWE-bench benchmark [111]. L2MAC [88] has been introduced as the first practical,
LLM-based, multi-agent, general-purpose stored-program automatic computer that utilizes a von
Neumann architecture, designed specifically for the generation of long and consistent outputs.
At the time of writing this survey, OpenDevin has enhanced CodeAct with bash command-based
tools, leading to the release of OpenDevin CodeAct 1.0 [249], which sets a new state-of-the-art
performance on the SWE-Bench Lite benchmark [111].
Despite these remarkable advancements, the journey toward fully realized AI software engineers
employing LLM-powered autonomous agents is far from complete [225, 246]. Critical aspects
such as prompt design, context length, agent count, and toolsets call for further refinement and
optimization, especially as problem complexities escalate [100].
14 https://www.cognition.ai/introducing-devin
15 https://github.com/OpenDevin/OpenDevin

4.10 Evaluation
Despite the impressive capabilities of large language models (LLMs), they exhibit a range of behaviors that are both beneficial and potentially risky. These behaviors can enhance performance across various downstream tasks but may also introduce reliability and trustworthiness concerns in LLM deployment [39, 45, 251]. Consequently, it is imperative to develop precise evaluation approaches to discern the qualitative and quantitative differences between models, thereby encouraging further advancements in LLM capabilities.
Evaluation strategies for LLMs in code generation mirror those for general-purpose LLMs and can be divided into three principal categories: metrics-based, human-centered, and LLM-based approaches. Detailed benchmarks for these evaluation strategies are presented in Section 4.1.3 and summarized in Table 3. Subsequent subsections will provide a thorough analysis of each approach.

Table 6. The performance comparison of LLMs for code generation on the HumanEval [45] benchmark, measured by Pass@{1, 10, 100}. For models with various sizes, we report only the largest size version of each model.

Model | Size | pass@1 | pass@10 | pass@100
GPT-4 [5] | - | 84.1 | - | -
GPT-3.5-Turbo [171] | - | 76.2 | - | -
Claude-3-Opus [13] | - | 82.9 | - | -
Claude-3-Haiku [13] | - | 76.8 | - | -
Claude-3-Sonnet [13] | - | 70.7 | - | -
StarCoder2-Instruct [261] | 15.5B | 72.6 | - | -
Llama3 [160] | 70B | 81.7 | - | -
CodeGemma [54] | 7B | 44.5 | - | -
StarCoder 2 [151] | 15B | 46.3 | - | -
phi-2 [161] | 2.7B | 49.4 | - | -
WaveCoder [259] | 6.7B | 75 | - | -
StableCode [182] | 3B | 29.3 | - | -
CodeShell [247] | 7B | 34.32 | - | -
CodeQwen [215] | 14B | 45.1 | - | -
DeepSeek-Coder [79] | 33B | 56.1 | - | -
replit-code [193] | 3B | 20.12 | - | -
Phi-1.5 [135] | 1.3B | 41.4 | - | -
PanGu-Coder2 [201] | 15B | 61.64 | 79.55 | 91.75
WizardCoder [154] | 15B | 57.3 | 73.2 | 90.46
CodeFuse [143] | 34B | 74.4 | - | -
Phi-1 [75] | 1.3B | 50.6 | - | -
Code Llama [196] | 34B | 48.8 | 76.8 | 93.0
OctoCoder [164] | 15.5B | 46.2 | - | -
PaLM-Coder [49] | 540B | 36 | - | 88.4
CodeGeeX2 [275] | 6B | 35.9 | 62.6 | 88.3
InstructCodeT5+ [232] | 16B | 35.0 | 54.5 | 77.9
CodeGen-NL [169] | 16.1B | 14.24 | 23.46 | 38.33
CodeGen-Multi [169] | 16.1B | 18.32 | 32.07 | 50.8
CodeGen-Mono [169] | 16.1B | 29.28 | 49.86 | 75
StarCoder [132] | 15B | 33.60 | 45.78 | 79.82
CodeT5+ [234] | 16B | 30.9 | 51.6 | 76.7
LLaMA2 [218] | 70B | 30.5 | 59.4 | 87.0
Codex [45] | 12B | 28.81 | 46.81 | 72.31
PaLM [49] | 540B | 26.2 | - | 76.2
PanGu-Coder [50] | 2.6B | 23.78 | 35.36 | 51.24
LLaMA [217] | 65B | 23.7 | - | 79.3
CodeGeeX [275] | 13B | 22.89 | 39.57 | 60.92
Replit [192] | 3B | 21.9 | - | -
CodeGen2 [168] | 16B | 20.46 | 36.5 | 56.71
SantaCoder [8] | 1.1B | 18 | 29 | 49
AlphaCode [136] | 1.1B | 17.1 | 28.2 | 45.3
BLOOM [126] | 176B | 15.52 | 32.20 | 55.45
GPT-NeoX [28] | 20B | 15.4 | 25.6 | 41.2
InCoder [69] | 6.7B | 15.2 | 27.8 | 47.0
GPT-J [223] | 6B | 11.62 | 15.74 | 27.74
PyCodeGPT [263] | 110M | 8.33 | 13.36 | 19.13
GPT-Neo [29] | 2.7B | 6.41 | 11.27 | 21.37
PolyCoder [251] | 2.7B | 5.59 | 9.84 | 17.68
JuPyT5 [38] | 300M | 5.4 | 15.46 | 25.60
CodeParrot [219] | 1.5B | 3.99 | 8.69 | 17.88

4.10.1 Metrics. The pursuit of effective and reliable automatic evaluation metrics for generated content is a long-standing challenge within the field of natural language processing (NLP) [46, 140, 175]. At the early stage, most works directly leverage token-matching-based metrics, such as Exact Match, BLEU [175], ROUGE [140], and METEOR [22], which are prevalent in NLP text generation, to assess the quality of code generation.
While these metrics offer a rapid and cost-effective approach for assessing the quality of generated code, they often fall short of capturing the syntactical and functional correctness, as well as the semantic features, of the code. To eliminate this limitation, CodeBLEU [191] was introduced, enhancing the traditional BLEU metric [175] by incorporating syntactic information through abstract syntax trees (AST) and semantic understanding via data-flow graph (DFG). Despite these improvements, the metric does not fully resolve issues pertaining to execution errors or discrepancies in the execution results of the generated code. In light of these challenges, execution-based metrics have been proposed, the most widely adopted being pass@$k$ [45], which estimates the probability that at least one of $k$ sampled solutions passes all unit tests:
\[ \text{pass@}k := \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right], \]
where 𝑛 is the total number of sampled candidate code solutions, 𝑘 is the number of code solutions
randomly selected from these candidates for each programming problem, with 𝑛 ≥ 𝑘, and 𝑐 is the
count of correct candidates among the 𝑛 samples that pass all unit tests. Tables 6 and 7 illustrate the performance of
contemporary large language models (LLMs) for code generation, measured by the pass@k metric
across different values of 𝑘 ∈ {1, 10, 100} on the HumanEval and MBPP benchmarks, respectively.
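The pass@k values in these tables are typically computed with the numerically stable, unbiased estimator popularized by [45]; a minimal implementation is:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn (without
    replacement) from n candidates, of which c are correct, passes all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

For example, pass_at_k(n=200, c=37, k=1) estimates pass@1 when 37 out of 200 sampled programs pass the unit tests.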
Nevertheless, these execution-based methods are heavily dependent on the quality of unit tests
and are limited to evaluating executable code [264]. Consequently, when unit tests are unavailable,
token-matching-based metrics are often employed as an alternative for evaluation. Furthermore, in
scenarios lacking a ground truth label, unsupervised metrics such as perplexity (PPL) [105] can
serve as evaluative tools. Perplexity quantifies an LLM’s uncertainty in predicting new content,
thus providing an indirect measure of the model’s generalization capabilities and the quality of the
generated code.
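For instance, given the per-token log-probabilities that an LLM assigns to a code snippet, perplexity can be computed with the following minimal sketch:

import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)), using natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

Lower perplexity indicates that the model is less surprised by the code, which is only an indirect proxy for its quality.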
Taken together, while the aforementioned methods primarily focus on the functional correctness
of code, they do not provide a holistic evaluation that encompasses other critical dimensions such
as code vulnerability [165], maintainability [14], readability [32], complexity and efficiency [180],
stylistic consistency [158], and execution stability [187]. A comprehensive evaluation framework
that integrates these aspects remains an open area for future research and development in the field
of code generation assessment.
4.10.2 Human Evaluation. Given the intrinsic characteristics of code, the aforementioned automatic
evaluation metrics are inherently limited in their capacity to fully assess code quality. For instance,
metrics specifically designed to measure code style consistency are challenging to develop and
often fail to capture this aspect adequately [41]. When it comes to repository-level code generation,
the evaluation of overall code quality is substantially complicated due to the larger scale of the
task, which involves cross-file designs and intricate internal as well as external dependencies, as
discussed by [21, 205].
To overcome these challenges, conducting human evaluations becomes necessary, as it yields
relatively robust and reliable results. Human assessments also offer greater adaptability across
various tasks, enabling the simplification of complex and multi-step evaluations. Moreover, human
evaluations are essential for demonstrating the effectiveness of certain token-matching-based
metrics, such as CodeBLEU [191]. These studies typically conduct experiments to evaluate the
correlation coefficient between proposed metrics and quality scores assigned by actual users,
demonstrating their superiority over existing metrics.
Moreover, in an effort to better align large language models (LLMs) with human preferences and
intentions, InstructGPT [173] employs human-written prompts and demonstrations, and model
output ranking in the fine-tuning of LLMs using reinforcement learning from human feedback
(RLHF). Although similar alignment learning techniques have been applied to code generation, the
feedback in this domain typically comes from a compiler or interpreter, which offers execution
feedback, rather than from human evaluators. Notable examples include CodeRL [125], PPOCoder
[204], RLTF [146], and PanGu-Coder2 [201]. Further information on this topic is available in Section
4.5.
Nonetheless, human evaluations are not without drawbacks, as they can be prone to certain issues that may compromise their accuracy and consistency. For instance, 1) personalized tastes and varying levels of expertise among human evaluators can introduce biases and inconsistencies into the evaluation process; 2) conducting comprehensive and reliable human evaluations often necessitates a substantial number of evaluators, which is both expensive and time-consuming; and 3) the reproducibility of human evaluations is often limited, which presents challenges in extending previous evaluation outcomes or monitoring the progress of LLMs, as highlighted by [273].

Table 7. The performance comparison of LLMs for code generation on the MBPP [16] benchmark, measured by Pass@{1, 10, 100}. For models with various sizes, we report only the largest size version of each model.

Model | Size | pass@1 | pass@10 | pass@100
GPT-3.5-Turbo [171] | - | 52.2 | - | -
Claude-3-Opus [13] | - | 89.4 | - | -
Claude-3-Haiku [13] | - | 80.2 | - | -
Claude-3-Sonnet [13] | - | 83.6 | - | -
StarCoder2-Instruct [261] | 15.5B | 78 | - | -
CodeGemma [54] | 7B | 65.1 | - | -
StarCoder 2 [151] | 15B | 50.6 | - | -
phi-2 [161] | 2.7B | 64 | - | -
WaveCoder [259] | 6.7B | 74.9 | - | -
CodeFuse [143] | 34B | 61.0 | - | -
CodeQwen [215] | 14B | 51.4 | - | -
DeepSeek Coder [79] | 33B | 66.0 | - | -
Phi-1.5 [135] | 1.3B | 43.5 | - | -
WizardCoder [154] | 16B | 51.8 | - | -
StarCoder [132] | 5.5B | 52.7 | - | -
SantaCoder [8] | 1.1B | 3.65 | 21.33 | 41.92
PyCodeGPT [263] | 110M | 9.39 | 28.37 | 48.71
PolyCoder [251] | 2.7B | 4.39 | 17.99 | 38.17
phi-1 [75] | 1.3B | 55.5 | - | -
PaLM-Coder [49] | 540B | 47 | - | -
PaLM [49] | 540B | 36.8 | - | -
LLaMA [217] | 65B | 37.7 | - | -
LLaMA 2 [218] | 70B | 45.4 | 66.2 | 83.1
CodeT5+ [234] | 16B | 56.6 | - | -
InCoder [69] | 6.7B | 21.3 | 46.5 | 66.2
GPT-Neo [29] | 2.7B | 5.89 | 23.09 | 44.26
GPT-J [223] | 6B | 11.30 | 35.62 | 53.63
CodeT5 [234] | 770M | 15.78 | 38.63 | 50.35
CodeParrot [219] | 1.5B | 1.29 | 8.66 | 27.17
Code Llama [196] | 34B | 55 | 76.2 | 86.6
CodeGen-NL [169] | 16.1B | 10.92 | 38.43 | 62.76
CodeGen-Multi [169] | 16.1B | 20.94 | 51.61 | 70.02
CodeGen-Mono [169] | 16.1B | 35.28 | 67.32 | 80.09
CodeGeeX [275] | 13B | 24.4 | 48 | -
BLOOM [126] | 1.7B | 3.16 | 14.23 | 31.38
PanGu-Coder [50] | 2.6B | 23.0 | 43.60 | 59.64
CodeGeeX2 [275] | 6B | 24.37 | 47.95 | -

4.10.3 LLM-as-a-Judge. The powerful instruction-following capabilities of large language models (LLMs) have stimulated researchers to innovatively investigate the potential of LLM-based evaluations. LLM-as-a-Judge [274] refers to the application of advanced proprietary LLMs (e.g., GPT-4, Gemini, and Claude 3) as proxies for human evaluators. This involves designing prompts with specific requirements to guide LLMs in conducting evaluations, as demonstrated by AlpacaEval [133] and MT-bench [274]. This method reduces reliance on human participation, thereby facilitating more efficient and scalable evaluations. Moreover, LLMs can offer insightful explanations for the assigned rating scores, thereby augmenting the interpretability of evaluations [273].
Nevertheless, the use of LLM-based evaluation for code generation remains relatively underexplored compared with general-purpose LLMs. A recent work [284] introduces the ICE-Score evaluation metric, which instructs LLMs to perform code assessments. This approach attains superior correlations with functional correctness and human preferences, thereby eliminating the requirement for test oracles or references. As the capabilities of LLMs continue to improve, we anticipate seeing more research in this direction.
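A minimal sketch of such an LLM-as-a-Judge setup is shown below; the rubric, the JSON output format, and the llm_generate callable are illustrative assumptions rather than the protocol of any cited benchmark:

JUDGE_PROMPT = """You are a code reviewer. Given a programming task and a candidate solution,
rate the solution on a 1-5 scale for (a) functional correctness and (b) readability.
Return JSON: {{"correctness": <int>, "readability": <int>, "explanation": "<why>"}}.

Task:
{task}

Candidate solution:
{solution}
"""

def llm_as_judge(task: str, solution: str, llm_generate) -> str:
    """Ask a strong LLM (placeholder callable) to grade a candidate code sample."""
    return llm_generate(JUDGE_PROMPT.format(task=task, solution=solution))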
Despite their scalability and explainability, the effectiveness of LLM-based evaluation is con-
strained by the inherent limitations of the chosen LLM. Several studies have shown that most LLMs,
including GPT-4, suffer from several issues, including position, verbosity, and self-enhancement
biases, as well as restricted reasoning ability [274]. Specifically, position bias refers to the tendency
of large language models (LLMs) to disproportionately favor responses that are presented in certain
positions, which can skew the perceived quality of answers based on their order of presentation.
Meanwhile, verbosity bias describes the inclination of LLMs to prefer lengthier responses, even
when these are not necessarily of higher quality compared to more concise ones. Self-enhancement
bias, on the other hand, is observed when LLMs consistently overvalue the quality of the text they
generate [273, 274]. Moreover, due to their inherent limitations in tackling complex reasoning
challenges, LLMs may not be entirely reliable as evaluators for tasks that require intensive rea-
soning, such as those involving mathematical problem-solving. However, these shortcomings can
be partially addressed through the application of deliberate prompt engineering and fine-tuning
techniques, as suggested by [274].
4.11 Applications
Code LLMs have been integrated with development tools and platforms, such as integrated de-
velopment environments (IDEs) and version control systems, improving programming efficiency
substantially. In this section, we will briefly introduce several widely used applications as coding
assistants. The statistics of these applications are provided in Table 8.
GitHub Copilot. GitHub Copilot, powered by OpenAI’s Codex, is an AI pair programmer that
helps you write better code faster. Copilot suggests whole lines or blocks of code as you type, based
on the context provided by your existing code and comments. It’s trained on a dataset that includes
a significant portion of the public code available on GitHub, which enables it to understand a wide
range of programming languages and coding styles. Copilot not only improves productivity but
also serves as a learning tool by providing programmers with examples of how certain functions
can be implemented or how specific problems can be solved.
CodeGeeX. CodeGeeX stands out as a multifaceted programming assistant, proficient in code
completion, comment generation, code translation, and developer interactions. Its underlying code
generation LLM has been refined with extensive training on vast amounts of code data, exhibiting
superior performance on benchmarks like HumanEval, HumanEval-X, and DS1000. Renowned for
supporting multilingual code generation, CodeGeeX plays a pivotal role in enhancing the efficiency
of code development.
CodeWhisperer. Amazon’s CodeWhisperer is a versatile, machine learning-driven code genera-
tor that offers on-the-fly code recommendations. Tailored to your coding patterns and comments,
CodeWhisperer provides personalized suggestions that range from succinct comments to complex
functions, all aimed at streamlining your coding workflow.
Codeium. Codeium is an AI-accelerated coding toolkit that offers a suite of functions, including
code completion, explanation, translation, search, and user chatting. Compatible with over 70
programming languages, Codeium delivers fast and cutting-edge solutions to coding challenges,
simplifying the development process for its users.
CodeArts Snap. Huawei’s CodeArts Snap is capable of generating comprehensive function-level
code from both Chinese and English descriptions. This tool not only reduces the monotony of
manual coding but also efficiently generates test code, in addition to providing automatic code
analysis and repair services.
Tabnine. Tabnine is an AI coding assistant that empowers development teams to leverage
AI for streamlining the software development lifecycle while maintaining strict standards for
privacy, security, and compliance. With a focus on enhancing coding efficiency, code quality, and
developer satisfaction, Tabnine offers AI-driven automation that is tailored to the needs of your
team. Supporting over one million developers worldwide, Tabnine is applicable across various
industries.
Table 8. The overview of code assistant applications powered by large language models (LLMs). The columns labeled ‘PLs’ and ‘IDEs’ indicate programming languages and integrated development environments, respectively [264].

Zhipu AI | CodeGeeX [275] | Base LLM: CodeGeeX | Functions: Code Completion, Code Interpretation, Code Bugs Fix, Comment Generation, AI Chatbot | PLs: Java, Python, JavaScript, TypeScript, Objective C++, Objective C, Pascal, HTML, SQL, Kotlin, R, Shell, Cuda, Fortran, Tex, Lean, Scala | IDEs: IntelliJ IDEA, VS Code, PyCharm, Android Studio, WebStorm, Rider, GoLand, DataGrip, DataSpell

Amazon | CodeWhisperer [11] | Base LLM: − | Functions: Code Completion, Code Explanation, Code Translation, Code Security Identification, Code Suggestion | PLs: Java, Python, TypeScript, JavaScript, C# | IDEs: JetBrains IDE, VS Code, AWS Cloud9, AWS Lambda

Codeium | Codeium [55] | Base LLM: − | Functions: Code Completion, Bug Detection, Code Suggestions, AI Chatbot, Test Type Generation, Test Plan Creation, Codebase Search | PLs: More than 70 languages in total, including but not limited to: C, C#, C++, Dart, CSS, Go, Elixir, HTML, Haskell, Julia, Java, JavaScript, Lisp, Kotlin, Lua, Objective-C, Perl, Pascal, PHP, Protobuf, R, Python, Ruby, Scala, Rust, Swift, SQL, TS, Vue | IDEs: JetBrains, VSCode, Visual Studio, Colab, Jupyter, Deepnote, Notebooks, Databricks, Chrome, Vim, Neovim, Eclipse, Emacs, VSCode Web IDEs, Sublime Text

Huawei | CodeArts Snap [201] | Base LLM: PanGu-Coder | Functions: Code Generation, Code Explanation, Research and Development Knowledge Question and Answer, Code Comment, Code Debug, Unit Test Case Generation | PLs: Java, Python | IDEs: PyCharm, VS Code, IntelliJ

Tabnine | TabNine [212] | Base LLM: − | Functions: Code Generation, Code Completion, Code Explanation, Bug Fix, Code Recommendation, Code Refactoring, Code Test Generation, Docstring Generation | PLs: Python, Javascript, Java, TypeScript, HTML, Haskell, Matlab, Kotlin, Sass, Go, PHP, Ruby, C, C#, C++, Swift, Rust, CSS, Perl, Angular, Dart, React, Objective C, NodeJS, Scala | IDEs: Sublime, PyCharm, Neovim, Rider, VS Code, IntelliJ IDE, Visual Studio, PhpStorm, Vim, RubyMine, DataGrip, Android Studio, WebStorm, Emacs, Clion, Jupyter Notebook, JupyterLab, Eclipse, GoLand, AppCode

Replit | Replit [192] | Base LLM: replit-code | Functions: Code Completion, Code Editing, Code Generation, Code Explanation, Code Suggestion, Code Test Generation | PLs: C#, Bash, C, CSS, C++, Java, Go, HTML, JavaScript, Perl, PHP, Ruby, Python, R, SQL, Rust | IDEs: −
Replit. Replit is a multifunctional platform that caters to a diverse array of software development
needs. As a complimentary online IDE, it facilitates code collaboration and cloud services, and
fosters a thriving developer community. Replit also enables users to compile and execute code in
more than 50 programming languages directly within a web browser, eliminating the need for local
software installations.
fine-tuning phases [119, 242, 281]. Currently, there is a scarcity of large, high-quality datasets that
encompass a wide range of programming tasks, styles, and languages. This limitation constrains the
ability of LLMs to generalize across unseen programming tasks, different coding environments, and
real-world software development scenarios. The development of more sophisticated data acquisition
techniques, such as automated code repositories mining [142], advanced filtering algorithms, and
code data synthesis [148] (see Section 4.2), can lead to the creation of richer datasets. Collaborations
with industry partners (e.g., GitHub) could also facilitate access to proprietary codebases, thereby
enhancing the practical relevance of the training material. Furthermore, the adoption of open-source
models for dataset sharing can accelerate the collective effort to improve the breadth and depth of
code data available for LLM research.
Developing comprehensive benchmarks and metrics for coding proficiency evaluation
in LLMs. Current benchmarks like HumanEval may not capture the full spectrum of coding
skills required for practical software development [167]. Additionally, metrics often focus on
syntactic correctness or functional accuracy, neglecting aspects such as code efficiency [180],
style [41], readability [32], or maintainability [14]. The design of comprehensive benchmarks that
simulate real-world software development challenges could provide a more accurate assessment
of LLMs’ coding capabilities. These benchmarks should include diverse programming tasks of
varying difficulty levels, such as debugging [279], refactoring [203], and optimization [101], and
should be complemented by metrics that evaluate qualitative aspects of code. The establishment of
community-driven benchmarking platforms could facilitate continuous evaluation and comparison
of LLMs for code generation across the industry and academia.
Support for low-resource, low-level, and domain-specific programming languages. LLMs
are predominantly trained on popular high-level programming languages, leaving low-resource, low-
level, and domain-specific languages underrepresented. This lack of focus restricts the applicability
of LLMs in certain specialized fields and systems programming [216]. Intensifying research on
transfer learning and meta-learning approaches may enable LLMs to leverage knowledge from
high-resource languages to enhance their performance on less common ones [35, 43]. Additionally,
partnerships with domain experts can guide the creation of targeted datasets and fine-tuning
strategies to better serve niche markets. The development of LLMs with a capacity for multilingual
code generation also presents a significant opportunity for broadening the scope of applications.
Continuous learning for LLMs to keep pace with evolving coding knowledge. The
software development landscape is continuously evolving, with new languages, frameworks, and
best practices emerging regularly. LLMs risk becoming outdated if they cannot adapt to these
changes and incorporate the latest programming knowledge [104, 227]. While retrieval-augmented
code generation offers a partial solution to these issues, its effectiveness is inherently constrained
by the quality of the retrieved context [152, 266, 283]. Therefore,
establishing mechanisms for continuous learning and updating of LLMs can help maintain their
relevance over time. This could involve real-time monitoring of code repositories to identify trends
and innovations, as well as the creation of incremental learning systems that can assimilate new
information without forgetting previously acquired knowledge. Engaging the LLMs in active
learning scenarios where they interact with human developers may also foster ongoing knowledge
acquisition.
Ensuring code safety and aligning LLM outputs with human coding preferences. Ensuring
the safety and security of code generated by LLMs is a paramount concern, as is their ability to
align with human preferences and ethical standards. Current models may inadvertently introduce
vulnerabilities or generate code that does not adhere to desired norms [45, 252]. Research into
the integration of formal verification tools within the LLM pipeline can enhance the safety of the
produced code. Additionally, developing frameworks for alignment learning that capture and reflect
human ethical preferences can ensure that the code generation process aligns with societal values
[173, 184]. Transparent and explainable AI methodologies can also contribute to building trust in
the LLM-generated code by making the decision-making process more accessible to developers.
6 CONCLUSION
In this survey, we provide a systematic literature review, serving as a valuable reference for
researchers investigating the cutting-edge progress in LLMs for code generation. We provide a thorough
introduction to and analysis of data curation, the latest advances, performance evaluation, and real-world
applications. In addition, we present a historical overview of the evolution of LLMs
for code generation in recent years and offer an empirical comparison using the widely recognized
HumanEval and MBPP benchmarks to highlight the progressive enhancements in LLM capabilities
for code generation. Critical challenges and promising opportunities regarding the gap between
academia and practical development are also identified for future investigation. Furthermore, we
have established a dedicated resource website to continuously document and disseminate the most
recent advances in the field. We hope this survey offers a comprehensive and systematic overview
of LLMs for code generation and promotes their thriving evolution. We optimistically believe that
LLMs will ultimately transform every aspect of coding, automatically writing safe, helpful, accurate,
trustworthy, and controllable code like professional programmers, and even solving coding problems
that humans currently cannot.
REFERENCES
[1] 2023. AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser. https://github.com/
reworkd/AgentGPT.
[2] 2023. AutoGPT is the vision of accessible AI for everyone, to use and to build on. https://github.com/Significant-
Gravitas/AutoGPT.
[3] 2023. BabyAGI. https://github.com/yoheinakajima/babyagi.
[4] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach,
Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model
locally on your phone. arXiv preprint arXiv:2404.14219 (2024).
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[6] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program
understanding and generation. arXiv preprint arXiv:2103.06333 (2021).
[7] Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. 2024. Traces of Memorisation in Large Language Models for
Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12.
[8] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas
Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint
arXiv:2301.03988 (2023).
[9] Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd acm sigsoft
international symposium on foundations of software engineering. 472–483.
[10] Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-
media/AlphaCode2/AlphaCode2_Tech_Report.pdf.
[11] Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-
cwspr.html.
[12] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019.
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). 2357–2367.
[13] Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/
de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
[14] Luca Ardito, Riccardo Coppola, Luca Barbato, and Diego Verga. 2020. A tool-based perspective on software code
maintainability metrics: a systematic literature review. Scientific Programming 2020 (2020), 1–26.
[15] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad,
Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual evaluation of code generation models. arXiv preprint
arXiv:2210.14868 (2022).
[16] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie
Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732
(2021).
[17] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450
(2016).
[18] Hannah McLean Babe, Sydney Nguyen, Yangtian Zi, Arjun Guha, Molly Q Feldman, and Carolyn Jane Anderson. 2023.
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code. arXiv:2306.04556 [cs.LG]
[19] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang,
et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
[20] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna
Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv
preprint arXiv:2212.08073 (2022).
[21] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok,
Shashank Shet, et al. 2023. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499
(2023).
[22] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation
with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine
translation and/or summarization. 65–72.
[23] Enrico Barbierato, Marco L Della Vedova, Daniele Tessera, Daniele Toti, and Nicola Vanoli. 2022. A methodology for
controlling bias and fairness in synthetic data generation. Applied Sciences 12, 9 (2022), 4619.
[24] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with
code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
[25] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi
Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint
arXiv:2401.02954 (2024).
[26] Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Xuanhua Shi,
and Hai Jin. 2024. Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler
Feedback. arXiv preprint arXiv:2403.16792 (2024).
[27] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and
Idan Gazit. 2022. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools.
Queue 20, 6 (2022), 35–57.
[28] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy,
Kyle McDonell, Jason Phang, et al. 2022. Gpt-neox-20b: An open-source autoregressive language model. arXiv
preprint arXiv:2204.06745 (2022).
[29] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive
Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
[30] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein,
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models.
arXiv preprint arXiv:2108.07258 (2021).
[31] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[32] Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software
engineering 36, 4 (2009), 546–558.
[33] Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-connected Expert
Parallelism for Accelerating Mixture-of-Experts. arXiv preprint arXiv:2404.05019 (2024).
[34] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts,
Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th
USENIX Security Symposium (USENIX Security 21). 2633–2650.
[35] Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg,
Abhinav Jangda, and Arjun Guha. 2023. Knowledge Transfer from High-Resource to Low-Resource Programming
[60] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized
llms. Advances in Neural Information Processing Systems 36 (2024).
[61] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[62] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min
Chan, Weize Chen, et al. 2022. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained
language models. arXiv preprint arXiv:2203.06904 (2022).
[63] Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh
Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file
code completion. Advances in Neural Information Processing Systems 36 (2024).
[64] Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia,
Dan Roth, and Bing Xiang. 2022. Cocomic: Code completion by jointly modeling in-file and cross-file context. arXiv
preprint arXiv:2212.10007 (2022).
[65] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022.
A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022).
[66] Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Junjie Shan, Caishuang Huang, Wei Shen, Xiaoran Fan,
Zhiheng Xi, et al. 2024. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback.
arXiv preprint arXiv:2402.01391 (2024).
[67] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng,
and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM
46th International Conference on Software Engineering. 1–13.
[68] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint
arXiv:2002.08155 (2020).
[69] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke
Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. arXiv preprint
arXiv:2204.05999 (2022).
[70] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish
Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint
arXiv:2101.00027 (2020).
[71] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023.
Pal: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
[72] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023.
Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
[73] Linyuan Gong, Mostafa Elhoushi, and Alvin Cheung. 2024. AST-T5: Structure-Aware Pretraining for Code Generation
and Understanding. arXiv preprint arXiv:2401.03003 (2024).
[74] Sumit Gulwani. 2010. Dimensions in program synthesis. In Proceedings of the 12th international ACM SIGPLAN
symposium on Principles and practice of declarative programming. 13–24.
[75] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan
Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all you need. arXiv preprint
arXiv:2306.11644 (2023).
[76] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal
Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). 7212–7225.
[77] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svy-
atkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint
arXiv:2009.08366 (2020).
[78] Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. Longcoder: A long-range pre-trained language
model for code completion. In International Conference on Machine Learning. PMLR, 12098–12107.
[79] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li,
et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence.
arXiv preprint arXiv:2401.14196 (2024).
[80] Aman Gupta, Deepak Bhatt, and Anubha Pandey. 2021. Transitioning from Real to Synthetic data: Quantifying the
bias in model. arXiv preprint arXiv:2105.04144 (2021).
[81] Aman Gupta, Anup Shirgaonkar, Angels de Luis Balaguer, Bruno Silva, Daniel Holstein, Dawei Li, Jennifer Marsman,
Leonardo O Nunes, Mahsa Rouzbahman, Morris Sharp, et al. 2024. RAG vs Fine-tuning: Pipelines, Tradeoffs, and a
Case Study on Agriculture. arXiv preprint arXiv:2401.08406 (2024).
[82] Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2023. Evaluating large language models in generating synthetic
hci research data: a case study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
1–19.
[83] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning
with language model is planning with world model. arXiv preprint arXiv:2305.14992 (2023).
[84] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[85] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874
(2021).
[86] Felipe Hoffa. 2016. GitHub on BigQuery: Analyze all the open source code. URL: https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code (2016).
[87] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las
Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language
models. arXiv preprint arXiv:2203.15556 (2022).
[88] Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. 2023. L2MAC: Large Language Model Automatic
Computer for Unbounded Code Generation. In The Twelfth International Conference on Learning Representations.
[89] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration.
arXiv preprint arXiv:1904.09751 (2019).
[90] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing
Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework.
arXiv preprint arXiv:2308.00352 (2023).
[91] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang.
2024. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs.SE]
[92] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,
Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International conference on
machine learning. PMLR, 2790–2799.
[93] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[94] Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code
Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023).
[95] Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint
arXiv:2212.10403 (2022).
[96] Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In 61st
Annual Meeting of the Association for Computational Linguistics, ACL 2023. Association for Computational Linguistics
(ACL), 1049–1065.
[97] Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui, Jeevana Priya Inala, Colin Clement, Nan
Duan, and Jianfeng Gao. 2022. Execution-based evaluation for data science code generation models. arXiv preprint
arXiv:2211.09374 (2022).
[98] Qiuyuan Huang, Naoki Wake, Bidipta Sarkar, Zane Durante, Ran Gong, Rohan Taori, Yusuke Noda, Demetri Ter-
zopoulos, Noboru Kuno, Ade Famoti, et al. 2024. Position Paper: Agent AI Towards a Holistic Intelligence. arXiv
preprint arXiv:2403.00833 (2024).
[99] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet
challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[100] Yoichi Ishibashi and Yoshimasa Nishimura. 2024. Self-Organized Agents: A LLM Multi-Agent Framework toward
Ultra Large-Scale Code Generation and Optimization. arXiv preprint arXiv:2404.02183 (2024).
[101] Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F Henriques, and
Anthony Hu. 2024. LangProp: A code optimization framework using Language Models applied to driving. arXiv
preprint arXiv:2401.10314 (2024).
[102] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Pro-
grammatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
1643–1652.
[103] Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu
Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through
the lens of generalization. arXiv preprint arXiv:2212.12017 (2022).
[104] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Jungkyu Choi, and Minjoon
Seo. 2022. Towards Continual Knowledge Learning of Language Models. In 10th International Conference on Learning
[127] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and
Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint
arXiv:2309.00267 (2023).
[128] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning.
arXiv preprint arXiv:2104.08691 (2021).
[129] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler,
Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp
tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[130] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation
Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
[131] Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2023. Towards enhancing in-context learning for code generation.
arXiv preprint arXiv:2303.17780 (2023).
[132] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone,
Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161
(2023).
[133] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B.
Hashimoto. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-
lab/alpaca_eval.
[134] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint
arXiv:2101.00190 (2021).
[135] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are
all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023).
[136] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling,
Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624
(2022), 1092–1097.
[137] Zongjie Li, Pingchuan Ma, Huaijin Wang, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2022. Unleashing the power of
compiler intermediate representation to enhance neural program embeddings. In Proceedings of the 44th International
Conference on Software Engineering. 2253–2265.
[138] Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient
fine-tuning. arXiv preprint arXiv:2303.15647 (2023).
[139] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak
Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2023. Holistic Evaluation of Language Models. Transactions on Machine
Learning Research (2023).
[140] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.
74–81.
[141] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of transformers. AI open 3 (2022),
111–132.
[142] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. 2007. Mining internet-scale software
repositories. Advances in neural information processing systems 20 (2007).
[143] Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen,
Hailian Zhou, et al. 2023. Mftcoder: Boosting code llms with multitask fine-tuning. arXiv preprint arXiv:2311.02303
(2023).
[144] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel.
2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural
Information Processing Systems 35 (2022), 1950–1965.
[145] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really
correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing
Systems 36 (2024).
[146] Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023. Rltf: Reinforcement learning
from unit test feedback. arXiv preprint arXiv:2307.04349 (2023).
[147] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt,
and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023),
1–35.
[148] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang,
Denny Zhou, et al. 2024. Best Practices and Lessons Learned on Synthetic Data for Language Models. arXiv preprint
arXiv:2404.07503 (2024).
[149] Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2020. Retrieval-Augmented Generation for Code
Summarization via Hybrid GNN. In International Conference on Learning Representations.
[150] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-
completion systems. arXiv preprint arXiv:2306.03091 (2023).
[151] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang,
Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv
preprint arXiv:2402.19173 (2024).
[152] Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A Retrieval-
Augmented Code Completion Framework. In Proceedings of the 60th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). 6227–6240.
[153] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain,
Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and
generation. arXiv preprint arXiv:2102.04664 (2021).
[154] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin,
and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In The Twelfth
International Conference on Learning Representations.
[155] Wei Ma, Mengjie Zhao, Xiaofei Xie, Qiang Hu, Shangqing Liu, Jie Zhang, Wenhan Wang, and Yang Liu. 2022. Are
Code Pre-trained Models Powerful to Learn Code Syntax and Semantics? arXiv preprint arXiv:2212.10017 (2022).
[156] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri,
Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in
Neural Information Processing Systems 36 (2024).
[157] James Manyika and Sissie Hsiao. 2023. An overview of Bard: an early experiment with generative AI. AI. Google
Static Documents 2 (2023).
[158] Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, and Egor Bulychev. 2019. STYLE-ANALYZER:
fixing code style inconsistencies with interpretable unsupervised algorithms. In 2019 IEEE/ACM 16th International
Conference on Mining Software Repositories (MSR). IEEE, 468–478.
[159] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards
zero-shot language understanding. Advances in Neural Information Processing Systems 35 (2022), 462–477.
[160] Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-
llama-3/.
[161] Mojan Javaheripi and Sébastien Bubeck. 2023. Phi-2: The surprising power of small language models. https://www.
microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models.
[162] Lili Mou, Ge Li, Zhi Jin, Lu Zhang, and Tao Wang. 2014. TBCNN: A tree-based convolutional neural network for
programming language processing. arXiv preprint arXiv:1409.5718 (2014).
[163] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2022. Reading between the lines: Modeling user
behavior and costs in AI-assisted programming. arXiv preprint arXiv:2210.14306 (2022).
[164] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru
Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models.
arXiv preprint arXiv:2308.07124 (2023).
[165] Antonio Nappa, Richard Johnson, Leyla Bilge, Juan Caballero, and Tudor Dumitras. 2015. The attack of the clones: A
study of the impact of shared code on vulnerability patching. In 2015 IEEE symposium on security and privacy. IEEE,
692–708.
[166] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever:
Learning to verify language-to-code generation with execution. In International Conference on Machine Learning.
PMLR, 26106–26128.
[167] Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz,
Caiming Xiong, et al. 2023. L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language
Models. arXiv preprint arXiv:2309.17446 (2023).
[168] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. Codegen2: Lessons for
training llms on programming and natural languages. arXiv preprint arXiv:2305.02309 (2023).
[169] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022.
Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474
(2022).
[170] Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Is Self-Repair
a Silver Bullet for Code Generation? In The Twelfth International Conference on Learning Representations.
[171] OpenAI. 2022. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt.
[172] OpenDevin. 2024. OpenDevin: Code Less, Make More. https://github.com/OpenDevin/OpenDevin.
[173] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
Advances in neural information processing systems 35 (2022), 27730–27744.
[174] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2023. Fine-tuning or retrieval? comparing
knowledge injection in llms. arXiv preprint arXiv:2312.05934 (2023).
[175] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation
of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics.
311–318.
[176] Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl Barr, and Sergey Mechtaev.
2024. The Fact Selection Problem in LLM-Based Program Repair. arXiv preprint arXiv:2404.05520 (2024).
[177] Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented
Code Generation and Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021.
2719–2734.
[178] Arkil Patel, Siva Reddy, Dzmitry Bahdanau, and Pradeep Dasigi. 2023. Evaluating In-Context Learning of Libraries
for Code Generation. arXiv preprint arXiv:2311.09635 (2023).
[179] Indraneil Paul, Jun Luo, Goran Glavaš, and Iryna Gurevych. 2024. IRCoder: Intermediate Representations Make
Language Models Robust Multilingual Code Generators. arXiv preprint arXiv:2403.03894 (2024).
[180] Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund. 2021. Program comprehension and
code complexity metrics: An fmri study. In 2021 IEEE/ACM 43rd International Conference on Software Engineering
(ICSE). IEEE, 524–536.
[181] Huy N Phan, Hoang N Phan, Tien N Nguyen, and Nghi DQ Bui. 2024. RepoHyper: Better Context Retrieval Is All You
Need for Repository-Level Code Completion. arXiv preprint arXiv:2403.06095 (2024).
[182] Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, Ashish Datta, Maksym Zhu-
ravinskyi, Dakota Mahan, Marco Bellagente, Carlos Riquelme, et al. 2024. Stable Code Technical Report. arXiv
preprint arXiv:2404.01226 (2024).
[183] Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input
length extrapolation. arXiv preprint arXiv:2108.12409 (2021).
[184] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning
aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693
(2023).
[185] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by
generative pre-training. (2018).
[186] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are
unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[187] Steven Raemaekers, Arie Van Deursen, and Joost Visser. 2012. Measuring software library stability through historical
version analysis. In 2012 28th IEEE international conference on software maintenance (ICSM). IEEE, 378–387.
[188] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct
preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing
Systems 36 (2024).
[189] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine
learning research 21, 140 (2020), 1–67.
[190] Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-sql capabilities of large
language models. arXiv preprint arXiv:2204.00498 (2022).
[191] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco,
and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297
(2020).
[192] Replit. 2016. Idea to software, fast. https://replit.com.
[193] Replit. 2023. replit-code-v1-3b. https://huggingface.co/replit/replit-code-v1-3b.
[194] Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code Generation with AlphaCodium: From Prompt Engineering
to Flow Engineering. arXiv preprint arXiv:2401.08500 (2024).
[195] Nick Roshdieh. 2023. Evol-Instruct-Code-80k. https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1.
[196] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal
Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950
(2023).
[197] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud
Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations.
[222] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[223] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https:
//github.com/kingoflolz/mesh-transformer-jax.
[224] Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024. Teaching Code LLMs to
Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024).
[225] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen,
Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18,
6 (2024), 1–26.
[226] Shiqi Wang, Li Zheng, Haifeng Qian, Chenghao Yang, Zijian Wang, Varun Kumar, Mingyue Shang, Samson Tan,
Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. 2022.
ReCode: Robustness Evaluation of Code Generation Models. arXiv preprint arXiv:2212.10264 (2022).
[227] Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. 2023. Knowledge editing for large language
models: A survey. arXiv preprint arXiv:2310.16218 (2023).
[228] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code
actions elicit better llm agents. arXiv preprint arXiv:2402.01030 (2024).
[229] Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu.
2022. Compilable Neural Code Generation with Compiler Feedback. In Findings of the Association for Computational
Linguistics: ACL 2022. 9–19.
[230] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny
Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171
(2022).
[231] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi.
2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In The 61st Annual Meeting Of The
Association For Computational Linguistics.
[232] Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi. 2023. CodeT5+: Open Code Large
Language Models for Code Understanding and Generation. In Proceedings of the 2023 Conference on Empirical Methods
in Natural Language Processing. 1069–1088.
[233] Yanlin Wang and Hui Li. 2021. Code completion by modeling flattened abstract syntax trees as graphs. In Proceedings
of the AAAI conference on artificial intelligence, Vol. 35. 14015–14023.
[234] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-
Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing. 8696–8708.
[235] Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and
Qun Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023).
[236] Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022. Execution-based evaluation for open-domain
code generation. arXiv preprint arXiv:2212.10481 (2022).
[237] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and
Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).
[238] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma,
Denny Zhou, Donald Metzler, et al. 2022. Emergent Abilities of Large Language Models. Transactions on Machine
Learning Research (2022).
[239] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing
systems 35 (2022), 24824–24837.
[240] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need.
arXiv preprint arXiv:2312.02120 (2023).
[241] Lilian Weng. 2023. LLM-powered Autonomous Agents. lilianweng.github.io (Jun 2023). https://lilianweng.github.io/
posts/2023-06-23-agent/
[242] Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for
Training Language Models. arXiv preprint arXiv:2402.09739 (2024).
[243] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. 2021.
Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF international
conference on computer vision. 3681–3691.
[244] Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. 2024. Repoformer: Selective
Retrieval for Repository-Level Code Completion. arXiv preprint arXiv:2403.10059 (2024).
[245] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang,
and Chi Wang. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv
preprint arXiv:2308.08155 (2023).
[246] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin,
Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint
arXiv:2309.07864 (2023).
[247] Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, and Wei Ye. 2024. CodeShell Technical Report.
arXiv preprint arXiv:2403.15747 (2024).
[248] Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. 2023. Data selection for language models via
importance resampling. Advances in Neural Information Processing Systems 36 (2023), 34201–34227.
[249] Xingyao Wang, Bowen Li, and Graham Neubig. 2024. Introducing OpenDevin CodeAct 1.0, a new State-of-the-art in
Coding Agents. https://www.cognition.ai/introducing-devin.
[250] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023.
Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[251] Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large
language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming.
1–10.
[252] Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, security, privacy,
explainability, efficiency, and usability of large language models for code. arXiv preprint arXiv:2403.07506 (2024).
[253] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of
thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems
36 (2024).
[254] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct:
Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations
(ICLR).
[255] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned
code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining
software repositories. 476–486.
[256] Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim,
Munhyong Kim, Sungju Kim, et al. 2024. HyperCLOVA X Technical Report. arXiv preprint arXiv:2404.01954 (2024).
[257] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao
Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings
of the 46th IEEE/ACM International Conference on Software Engineering. 1–12.
[258] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle
Roman, et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing
and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
3911–3921.
[259] Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2023.
Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint
arXiv:2312.14187 (2023).
[260] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to
align language models with human feedback without tears. arXiv preprint arXiv:2304.05302 (2023).
[261] Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Harm de Vries, Leandro von Werra, Arjun Guha,
and Lingming Zhang. 2024. StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation.
https://github.com/bigcode-project/starcoder2-self-align.
[262] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. Bitfit: Simple parameter-efficient fine-tuning for
transformer-based masked language-models. arXiv preprint arXiv:2106.10199 (2021).
[263] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-
Guang Lou. 2022. CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint
arXiv:2206.06888 (2022).
[264] Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023.
Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 7443–7464.
[265] Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan,
et al. 2024. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. arXiv preprint arXiv:2403.16443
(2024).
[266] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen.
2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing. 2471–2484.
[267] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. 2023.
Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning
Representations.
[268] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang,
Fei Wu, et al. 2023. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
[269] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong
Chen, et al. 2023. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint
arXiv:2309.01219 (2023).
[270] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program
Improvement. arXiv preprint arXiv:2404.05427 (2024).
[271] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023. Unifying the
perspectives of nlp and software engineering: A survey on language models for code. arXiv preprint arXiv:2311.07989
(2023).
[272] Liang Zhao, Xiaocheng Feng, Xiachong Feng, Bin Qin, and Ting Liu. 2023. Length Extrapolation of Transformers: A
Survey from the Perspective of Position Encoding. arXiv preprint arXiv:2312.17044 (2023).
[273] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie
Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[274] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li,
Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural
Information Processing Systems 36 (2024).
[275] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li,
et al. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684.
[276] Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. 2024.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658
(2024).
[277] Wenqing Zheng, SP Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, and Zhangyang Wang. 2023.
Outline, then details: Syntactically guided coarse-to-fine code generation. In International Conference on Machine
Learning. PMLR, 42403–42419.
[278] Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023. A survey of
large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023).
[279] Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Language Model Debugger via Verifying Runtime
Execution Step-by-step. arXiv preprint arXiv:2402.16906 (2024).
[280] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree
search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406 (2023).
[281] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili
Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems 36 (2024).
[282] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui,
Olivier Bousquet, Quoc V Le, et al. 2022. Least-to-Most Prompting Enables Complex Reasoning in Large Language
Models. In The Eleventh International Conference on Learning Representations.
[283] Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2022. DocPrompting: Generating Code by
Retrieving the Docs. In The Eleventh International Conference on Learning Representations.
[284] Terry Yue Zhuo. 2024. ICE-Score: Instructing Large Language Models to Evaluate Code. In Findings of the Association
for Computational Linguistics: EACL 2024. 2232–2242.
[285] Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, and Niklas
Muennighoff. 2024. Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models. arXiv preprint
arXiv:2401.00788 (2024).