A Survey on Large Language Models for Code Generation


JUYONG JIANG∗ , The Hong Kong University of Science and Technology (Guangzhou), China
FAN WANG∗ , The Hong Kong University of Science and Technology (Guangzhou), China
JIASI SHEN† , The Hong Kong University of Science and Technology, China
SUNGJU KIM† , NAVER Cloud, South Korea
SUNGHUN KIM† , The Hong Kong University of Science and Technology (Guangzhou), China
Large Language Models (LLMs) tailored to code-related tasks, known as Code LLMs, have achieved remarkable advancements across diverse tasks, particularly in code generation, which produces source code from natural
language descriptions. This burgeoning field has captured significant interest from both academic researchers
and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. De-
spite the active exploration of LLMs for a variety of code tasks, either from the perspective of natural language
processing (NLP) or software engineering (SE) or both, there is a noticeable absence of a comprehensive and
up-to-date literature review dedicated to LLM for code generation. In this survey, we aim to bridge this gap by
providing a systematic literature review that serves as a valuable reference for researchers investigating the
cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the
recent developments in LLMs for code generation, covering aspects such as data curation, latest advances,
performance evaluation, and real-world applications. In addition, we present a historical overview of the evo-
lution of LLMs for code generation and offer an empirical comparison using the widely recognized HumanEval
and MBPP benchmarks to highlight the progressive enhancements in LLM capabilities for code generation.
We identify critical challenges and promising opportunities regarding the gap between academia and practical
development. Furthermore, we have established a dedicated resource website (https://codellm.github.io) to
continuously document and disseminate the most recent advances in the field.

CCS Concepts: • General and reference → Surveys and overviews; • Software and its engineering →
Software development techniques; • Computing methodologies → Artificial intelligence.

Additional Key Words and Phrases: Large Language Models, Code Large Language Models, Code Generation

ACM Reference Format:


Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language
Models for Code Generation. ACM Trans. Softw. Eng. Methodol. 1, 1, Article 1 (September 2024), 49 pages.
https://doi.org/XXXXXXX.XXXXXXX

∗ Equally major contributors.


† Corresponding authors.

Authors’ addresses: Juyong Jiang, jjiang472@connect.hkust-gz.edu.cn, The Hong Kong University of Science and Technology
(Guangzhou), Guangzhou, China; Fan Wang, fwang380@connect.hkust-gz.edu.cn, The Hong Kong University of Science
and Technology (Guangzhou), Guangzhou, China; Jiasi Shen, sjs@cse.ust.hk, The Hong Kong University of Science and
Technology, Hong Kong, China; Sungju Kim, sungju.kim@navercorp.com, NAVER Cloud, Seoul, South Korea; Sunghun
Kim, hunkim@cse.ust.hk, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2024 Association for Computing Machinery.
1049-331X/2024/9-ART1 $15.00
https://doi.org/XXXXXXX.XXXXXXX


1 INTRODUCTION
The advent of Large Language Models (LLMs) such as ChatGPT1 [171] has profoundly transformed
the landscape of automated code-related tasks [45], including code completion [78, 152, 233, 244],
code translation [48, 121, 211], and code repair [109, 170, 176]. A particularly intriguing application
of LLMs is code generation, a task that involves producing source code from natural language
descriptions. Despite varying definitions across studies [47, 191, 204, 232], for the purposes of
this survey, we adopt a consistent definition of code generation as the natural-language-to-code
(NL2Code) task [15, 16, 264]. This area has garnered substantial interest from both academia and
industry, as evidenced by the development of tools like GitHub Copilot2 [45], CodeGeeX3 [275],
and Amazon CodeWhisperer4 , which leverage groundbreaking code LLMs to facilitate software
development.
Initial investigations into code generation primarily utilized heuristic rules or expert systems,
such as probabilistic grammar-based frameworks [9, 57, 113] and specialized language models [59,
74, 106]. These early techniques were typically rigid and difficult to scale. However, the introduction
of Transformer-based LLMs has shifted the paradigm, establishing them as the preferred method
due to their superior proficiency and versatility. One remarkable aspect of LLMs is their capability
to follow instructions [51, 164, 173, 238, 250], enabling even novice programmers to write code by
simply articulating their requirements. This emergent ability has democratized coding, making it
accessible to a broader audience [264]. The performance of LLMs on code generation tasks has seen
remarkable improvements, as illustrated by the HumanEval leaderboard5, which charts the progression
from PaLM 8B [49], with a Pass@1 of 3.6%, to LDB [279], with a Pass@1 of 95.1%. Accordingly, the
HumanEval benchmark [45] has been established as a de facto standard for evaluating the coding
proficiency of LLMs.
To offer a comprehensive chronological evolution, we present an overview of the development
of LLMs for code generation, as illustrated in Figure 1. The landscape of LLMs for code generation
is characterized by a spectrum of models, with certain models like ChatGPT [173], GPT4 [5],
LLaMA [217, 218], and Claude 3 [13] serving general-purpose applications, while others such
as StarCoder [132, 151], Code LLaMA [196], DeepSeek-Coder [79], and Code Gemma [54] are
tailored specifically for code-centric tasks. The convergence of code generation with the latest LLM
advancements is pivotal, especially when programming languages can be considered as distinct
dialects of multilingual natural language [15, 275]. These models are not only tested against software
engineering (SE) requirements but also propel the advancement of LLMs into practical production
[271].
While recent surveys have shed light on code LLMs from the lenses of Natural Language Process-
ing (NLP), Software Engineering (SE), or a combination of both disciplines [91, 264, 271, 278], they
have often encompassed a broad range of code-related tasks. There remains a dearth of literature
specifically reviewing advanced topics in code generation, such as meticulous data curation, in-
struction tuning, alignment with feedback, prompting techniques, the development of autonomous
coding agents, retrieval augmented code generation, LLM-as-a-Judge for code generation, among
others. Notably pertinent studies [15, 264] also concentrate on LLMs for text-to-code generation
(NL2Code), yet they primarily examine models released from 2020 to 2022. Consequently, this notice-
able temporal gap has resulted in an absence of up-to-date literature reviews that contemplate the

1 https://chat.openai.com
2 https://github.com/features/copilot
3 https://codegeex.cn/en-US
4 https://aws.amazon.com/codewhisperer
5 https://paperswithcode.com/sota/code-generation-on-humaneval


latest advancements, including models like CodeQwen [215], WizardCoder [154], and PPOCoder
[204], as well as the comprehensive exploration of the advanced topics previously mentioned.
Recognizing the need for a dedicated and up-to-date literature review, this survey endeavors to fill
that void. We provide a systematic review that will serve as a foundational reference for researchers
quickly exploring the latest progress in LLMs for code generation. A taxonomy is introduced to
categorize and examine recent advancements, encompassing data curation [154, 231, 240], advanced
topics [42, 47, 94, 125, 146, 152, 164, 166, 177, 205, 266], evaluation methods [45, 85, 111, 284], and
practical applications [45, 275]. This categorization aligns with the comprehensive lifecycle of an LLM for
code generation. Furthermore, we pinpoint critical challenges and identify promising opportunities
to bridge the research-practicality divide. Therefore, this survey equips NLP and SE researchers
with a thorough understanding of LLMs for code generation, highlighting cutting-edge directions
as well as current hurdles and prospects.
The remainder of the survey is organized following the structure outlined in our taxonomy
in Figure 3. In Section 2, we introduce the preliminaries of LLM with Transformer architecture
and formulate the task of LLM for code generation. Then, in Section 3, we propose a taxonomy,
categorizing the complete process of LLMs in code generation. Section 4 delves into the specifics of
LLMs for code generation within this taxonomy framework. In Section 5, we underscore the critical
challenges and promising opportunities for bridging the research-practicality gap and conclude
this work in Section 6.

2 BACKGROUND
2.1 Large Language Models
The effectiveness of large language models (LLMs) is fundamentally attributed to their substantial
quantity of model parameters, large-scale and diversified datasets, and the immense computational
power utilized during training [87, 114]. Generally, scaling up language models consistently results
in enhanced performance and sample efficiency across a broad array of downstream tasks [238, 273].
However, once the model size expands beyond a certain scale (e.g., GPT-3 [31] with 175B parameters
and PaLM [49] with 540B), LLMs exhibit an unpredictable phenomenon known as emergent abilities6,
including instruction following [173], in-context learning [65], and step-by-step reasoning [95, 239],
which are absent in smaller models but apparent in larger ones [238].
Adhering to the same Transformer architectures [222] as general LLMs, code LLMs are specifically
pre-trained on large-scale unlabeled code corpora, whereas general-purpose LLMs (e.g., ChatGPT
[171]) are pre-trained on a blend of code and text data. Analogous to LLMs, Code LLMs can also
be classified into three architectural categories: encoder-only models, decoder-only models, and
encoder-decoder models. Encoder-only models, such as CodeBERT [68], are typically suitable
for code comprehension tasks including type prediction, code retrieval, and clone detection.
Decoder-only models, such as StarCoder [132], predominantly excel in generation tasks,
such as code generation, code translation, and code summarization. Encoder-decoder models, such
as CodeT5 [234], can accommodate both code understanding and generation tasks but do not
necessarily outperform encoder-only or decoder-only models. The overall architectures of the
different Code LLMs for code generation are depicted in Figure 2.
In the following subsection, we will delineate the key modules of the Transformer layers in Code
LLMs.
2.1.1 Multi-Head Self-Attention Modules. Each Transformer layer incorporates a multi-head self-attention (MHSA) mechanism to discern the inherent semantic relationships within a sequence of tokens across ℎ distinct latent representation spaces.
6 It should be noted that an LLM is not necessarily superior to a smaller language model, and emergent abilities may not manifest in all LLMs [273].


[Figure 1 charts the release timeline of code LLMs from GPT-C (May 2020) through models such as Codex, CodeT5, AlphaCode, CodeGen, ChatGPT, StarCoder, Code Llama, DeepSeek-Coder, and StarCoder2-Instruct (Apr. 2024), distinguishing open-source from closed-source releases.]
Fig. 1. A chronological overview of large language models (LLMs) for code generation in recent years. The
timeline was established mainly according to the release date. The models with publicly available model
checkpoints are highlighted in green color.

Formally, the MHSA employed by the Transformer can be formulated as follows:

$$\mathbf{h}^{(l)} = \mathrm{MultiHeadSelfAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}\big(\{\mathrm{Head}_i\}_{i=1}^{h}\big)\,\mathbf{W}^{O}, \tag{1}$$

$$\mathrm{Head}_i = \mathrm{Attention}\big(\underbrace{\mathbf{H}^{(l-1)}\mathbf{W}_i^{Q}}_{\mathbf{Q}},\ \underbrace{\mathbf{H}^{(l-1)}\mathbf{W}_i^{K}}_{\mathbf{K}},\ \underbrace{\mathbf{H}^{(l-1)}\mathbf{W}_i^{V}}_{\mathbf{V}}\big), \tag{2}$$

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{model}/h}}\right)\mathbf{V}, \tag{3}$$
where $\mathbf{H}^{(l-1)} \in \mathbb{R}^{n \times d_{model}}$ denotes the input to the $l$-th Transformer layer, while $\mathbf{h}^{(l)} \in \mathbb{R}^{n \times d_{model}}$ represents the output of the MHSA sub-layer. The number of distinct attention heads is denoted by $h$, and $d_{model}$ refers to the model dimension. The set of projections $\{\mathbf{W}_i^{Q}, \mathbf{W}_i^{K}, \mathbf{W}_i^{V}, \mathbf{W}_i^{O}\} \in \mathbb{R}^{d_{model} \times d_{model}/h}$ encompasses the affine transformation parameters for each attention head $\mathrm{Head}_i$, transforming the Query $\mathbf{Q}$, Key $\mathbf{K}$, Value $\mathbf{V}$, and the output of the attention sub-layer. The softmax function is applied in a row-wise manner. The dot products of queries and keys are divided by a scaling factor $\sqrt{d_{model}/h}$ to counteract the risk of excessively large inner products and correspondingly diminished gradients in the softmax function, thus encouraging a more balanced attention landscape.
In addition to multi-head self-attention, there are two other types of attention based on the
source of queries and key-value pairs:
• Masked Multi-Head Self-Attention. Within the decoder layers of the Transformer, the
self-attention mechanism is constrained by introducing an attention mask, ensuring that
queries at each position can only attend to all key-value pairs up to and inclusive of that
position. To facilitate parallel training, this is typically executed by assigning a value of 0
to the lower triangular part and setting the remaining elements to −∞. Consequently, each
item attends only to its predecessors and itself. Formally, this modification in Equation 3 can
be depicted as follows:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{model}/h}} + \mathbf{M}_{mask}\right)\mathbf{V}, \tag{4}$$

$$\mathbf{M}_{mask} = \big[m_{ij}\big]_{n \times n} = \big[\mathbb{I}(i \ge j)\big]_{n \times n}, \qquad m_{ij} = \begin{cases} 0 & \text{for } i \ge j \\ -\infty & \text{otherwise,} \end{cases} \tag{5}$$
This form of self-attention is commonly denoted as autoregressive or causal attention [141]; a minimal NumPy sketch of Equations (1)-(5) is given after this list.
• Cross-Layer Multi-Head Self-Attention. The queries are derived from the outputs of the
preceding (decoder) layer, while the keys and values are projected from the outputs of the
encoder.
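To make Equations (1)-(5) concrete, here is a minimal NumPy sketch of multi-head self-attention with an optional causal mask. It is an illustrative reconstruction rather than any surveyed model's implementation: the per-head projections are packed into full d_model × d_model matrices and then split into heads (equivalent to the per-head W_i matrices of Equation (2)), and all weights are random.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with max-subtraction for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, Wo, num_heads, causal=False):
    """Eqs. (1)-(5): H is (n, d_model); each W* is (d_model, d_model)."""
    n, d_model = H.shape
    d_head = d_model // num_heads

    # Project to queries, keys, values, then split into heads: (h, n, d_head).
    def split_heads(X):
        return X.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(H @ Wq), split_heads(H @ Wk), split_heads(H @ Wv)

    # Scaled dot-product attention scores: (h, n, n).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    if causal:
        # Mask of Eq. (5): 0 where i >= j, -inf elsewhere, so each position
        # attends only to itself and its predecessors.
        mask = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)
        scores = scores + mask

    attn = softmax(scores, axis=-1)                          # row-wise softmax
    heads = attn @ V                                         # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # Concat in Eq. (1)
    return concat @ Wo

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
H = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(H, Wq, Wk, Wv, Wo, num_heads=h, causal=True)
print(out.shape)  # (4, 8)
```

Setting causal=True reproduces the masked variant used in decoder-only code LLMs; omitting it yields the bidirectional attention of encoder-style models.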

2.1.2 Position-wise Feed-Forward Networks. Within each Transformer layer, a Position-wise Feed-
Forward Network (PFFN) is leveraged following the MHSA sub-layer to refine the sequence
embeddings at each position 𝑖 in a separate and identical manner, thereby encoding more intricate
feature representations. The PFFN is composed of a pair of linear transformations, interspersed
with a ReLU activation function. Formally,
$$\mathrm{PFFN}(\mathbf{h}^{(l)}) = \mathrm{Concat}\Big(\big\{\mathrm{FFN}(\mathbf{h}_i^{(l)})^{T}\big\}_{i=1}^{n}\Big)^{T}, \tag{6}$$

$$\mathrm{FFN}(\mathbf{h}_i^{(l)}) = \mathrm{ReLU}\big(\mathbf{h}_i^{(l)}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\big)\mathbf{W}^{(2)} + \mathbf{b}^{(2)}, \tag{7}$$


[Figure 2 depicts two block diagrams: (a) Encoder-Decoder Models and (b) Decoder-only Models. Each stacks N layers of (masked) multi-head self-attention and position-wise feed-forward sub-layers with residual connections and layer normalization over token and position embeddings, ending in a linear and softmax layer that produces output probabilities.]

Fig. 2. The overview of large language models (LLMs) with encoder-decoder and decoder-only Transformer
architecture for code generation, adapted from [222].

where $\mathbf{h}^{(l)} \in \mathbb{R}^{n \times d_{model}}$ is the output of the MHSA sub-layer in the $l$-th Transformer layer, and $\mathbf{h}_i^{(l)} \in \mathbb{R}^{d_{model}}$ denotes the latent representation at each sequence position. The projection matrices $\{\mathbf{W}^{(1)}, (\mathbf{W}^{(2)})^{T}\} \in \mathbb{R}^{d_{model} \times 4d_{model}}$ and bias vectors $\mathbf{b}^{(1)} \in \mathbb{R}^{4d_{model}}$ and $\mathbf{b}^{(2)} \in \mathbb{R}^{d_{model}}$ are parameters learned during training. These parameters remain consistent across all positions but are individually initialized from layer to layer. In this context, $T$ represents the transpose operation on a matrix.
2.1.3 Residual Connection and Normalization. To alleviate the issue of vanishing or exploding
gradients resulting from network deepening, the Transformer model incorporates a residual con-
nection [84] around each of the aforementioned modules, followed by Layer Normalization [17].
For the placement of Layer Normalization operation, there are two widely used approaches: 1)
Post-Norm: Layer normalization is implemented subsequent to the element-wise residual addition,
in accordance with the vanilla Transformer [222]. 2) Pre-Norm: Layer normalization is applied to
the input of each sub-layer, as seen in models like GPT-2 [186]. Formally, it can be formulated as:
$$\text{Post-Norm}: \quad \mathbf{H}^{(l)} = \mathrm{LayerNorm}\big(\mathrm{PFFN}(\mathbf{h}^{(l)}) + \mathbf{h}^{(l)}\big), \qquad \mathbf{h}^{(l)} = \mathrm{LayerNorm}\big(\mathrm{MHSA}(\mathbf{H}^{(l-1)}) + \mathbf{H}^{(l-1)}\big), \tag{8}$$

$$\text{Pre-Norm}: \quad \mathbf{H}^{(l)} = \mathrm{PFFN}\big(\mathrm{LayerNorm}(\mathbf{h}^{(l)})\big) + \mathbf{h}^{(l)}, \qquad \mathbf{h}^{(l)} = \mathrm{MHSA}\big(\mathrm{LayerNorm}(\mathbf{H}^{(l-1)})\big) + \mathbf{H}^{(l-1)}. \tag{9}$$
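As a complement to Equations (6)-(9), the following NumPy sketch contrasts post-norm and pre-norm sub-layer composition around a position-wise feed-forward network. It is a hedged illustration under simplifying assumptions: the learnable gain and bias of LayerNorm are omitted, and the MHSA sub-layer is a stand-in (see the attention sketch in Section 2.1.1).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (last axis) to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pffn(x, W1, b1, W2, b2):
    # Eqs. (6)-(7): two linear maps with a ReLU in between,
    # applied independently to every position (row) of x.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def post_norm_block(H_prev, mhsa, ffn):
    # Eq. (8): residual addition first, then LayerNorm.
    h = layer_norm(mhsa(H_prev) + H_prev)
    return layer_norm(ffn(h) + h)

def pre_norm_block(H_prev, mhsa, ffn):
    # Eq. (9): LayerNorm on the sub-layer input, residual added afterwards.
    h = mhsa(layer_norm(H_prev)) + H_prev
    return ffn(layer_norm(h)) + h

# Usage with stand-in sub-layers (random weights, identity "attention").
rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
ffn = lambda x: pffn(x, W1, b1, W2, b2)
mhsa = lambda x: x  # placeholder; see the attention sketch above
print(pre_norm_block(H, mhsa, ffn).shape)  # (4, 8)
```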
2.1.4 Positional Encoding. Given that self-attention alone cannot discern the positional information
of each input token, the vanilla Transformer introduces an absolute positional encoding method to

supplement this positional information, known as sinusoidal position embeddings [222]. Specifically,
for a token at position $pos$, the position embedding is defined as:

$$\mathbf{p}_{pos,2i} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{10}$$

$$\mathbf{p}_{pos,2i+1} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{11}$$
where 2𝑖, 2𝑖 + 1 represent the dimensions of the position embedding, while 𝑑𝑚𝑜𝑑𝑒𝑙 denotes the model
dimension. Subsequently, each position embedding is added to the corresponding token embedding,
and the sum is fed into the Transformer. Since the inception of this method, a variety of innovative
positional encoding approaches have emerged, such as learnable embeddings [61], relative position
embeddings [199], RoPE [209], and ALiBi [183]. For more detailed descriptions of each method,
please consult [141, 272].
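As a small illustration of Equations (10)-(11), the sketch below builds a sinusoidal position-embedding matrix and adds it to a toy token-embedding matrix; it is a minimal reconstruction of the vanilla scheme, not the code of any surveyed model.

```python
import numpy as np

def sinusoidal_position_embeddings(max_len, d_model):
    # p[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # p[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # even dimensions 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    p = np.zeros((max_len, d_model))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

# Add position embeddings to token embeddings before the first Transformer layer.
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(16, 64))             # 16 tokens, d_model = 64
x = token_emb + sinusoidal_position_embeddings(16, 64)
print(x.shape)  # (16, 64)
```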

2.2 Code Generation


Large language models (LLMs) for code generation refer to the use of LLM to generate source
code from natural language descriptions, a process also known as a natural-language-to-code
task. Typically, these natural language descriptions encompass programming problem statements
(or docstrings) and may optionally include some programming context (e.g., function signatures,
assertions, etc.). Formally, these natural language (NL) descriptions can be represented as x. Given
x, the use of an LLM with model parameters 𝜃 to generate a code solution y can be denoted as
𝑃𝜃 (y | x). To verify the functional correctness of the code solution, y is subsequently executed
via a compiler or interpreter, represented as Exe(·), on a suite of unit tests. The feedback from this
execution can be denoted as Feedback(Exe(y)).
The advent of in-context learning abilities in LLMs [238] has led to the appending of exemplars to the natural language description x as demonstrations to enhance code generation performance or to constrain the generation format [131, 178]. A fixed set of $M$ exemplars is denoted as $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{M}$. Consequently, following [166], a more general formulation of LLMs for code generation with few-shot (or zero-shot) exemplars can be revised as:

$$P_{\theta}(\mathbf{y} \mid \mathbf{x}) = P_{\theta}\big(\mathbf{y} \mid \mathrm{prompt}(\mathbf{x}, \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{k})\big), \quad k \in \{0, 1, \dots, M\}, \tag{12}$$

where $\mathrm{prompt}(\mathbf{x}, \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{k})$ is a string representation of the overall input, and $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{k}$ denotes a set of $k$ exemplars randomly selected from $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{M}$. In particular, when $k = 0$, this denotes zero-shot code generation, equivalent to vanilla generation without in-context learning.
Subsequently, a variety of decoding strategies can be performed for code generation, including
deterministic-based strategies (e.g., greedy search and beam search) and sampling-based strategies
(e.g., temperature sampling, top-k sampling, and top-p (nucleus) sampling). For more detailed
descriptions of each decoding strategy, please consult [89].
$$\text{Greedy Search}: \quad \mathbf{y}^{*} = \underset{\mathbf{y}}{\arg\max}\; P_{\theta}\big(\mathbf{y} \mid \mathrm{prompt}(\mathbf{x}, \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{k})\big), \quad k \in \{0, 1, \dots, M\} \tag{13}$$

$$\text{Sampling}: \quad \mathbf{y} \sim P_{\theta}\big(\mathbf{y} \mid \mathrm{prompt}(\mathbf{x}, \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{k})\big), \quad k \in \{0, 1, \dots, M\} \tag{14}$$
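To ground this formulation, the hedged sketch below performs zero-shot (k = 0) code generation with a causal code LLM via the Hugging Face transformers library, contrasting greedy search with temperature and nucleus sampling. The checkpoint name and generation hyperparameters are illustrative assumptions, not recommendations drawn from the surveyed works.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any open code LLM with a causal-LM head would do.
model_name = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Natural language description x, given here as a signature plus docstring (zero-shot).
prompt = (
    "def is_palindrome(s: str) -> bool:\n"
    '    """Return True if s reads the same forwards and backwards."""\n'
)
inputs = tokenizer(prompt, return_tensors="pt")

# Deterministic decoding (greedy search, Eq. 13).
greedy_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Stochastic decoding (temperature + nucleus sampling, Eq. 14).
sampled_ids = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=0.8, top_p=0.95
)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
```

Few-shot generation (k > 0) simply prepends the exemplar pairs to the prompt string before tokenization.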

3 TAXONOMY
The recent surge in the development of Large Language Models (LLMs) has led to a significant
number of these models being repurposed for the code generation task through continued pre-training
or fine-tuning. This trend is particularly observable in the realm of open-source models. For instance,
Meta AI initially made the LLaMA [217] model publicly available, which was followed by the release
of Code LLaMA [196], designed specifically for code generation. Similarly, DeepSeek LLM [25]

developed and released by DeepSeek has been extended to create DeepSeek-Coder [79], a variant
tailored for code generation. The Qwen team has developed and released Code Qwen [215], building
on their original Qwen [19] model. Microsoft, on the other hand, has unveiled WizardLM [250]
and is exploring its coding-oriented counterpart, WizardCoder [154]. Google has joined the fray
by releasing Gemma [214], subsequently followed by Code Gemma [54]. Beyond simply adapting
general-purpose LLMs for code-related tasks, there has been a proliferation of models specifically
engineered for code generation. Notable examples include StarCoder [132], OctoCoder [164], and
CodeGen [169]. These models underscore the trend of LLMs being developed with a focus on code
generation.
Recognizing the importance of these developments, we propose a taxonomy that categorizes
and evaluates the latest advances in LLMs for code generation. This taxonomy, depicted in Figure
3, serves as a comprehensive reference for researchers seeking to quickly familiarize themselves
with the state-of-the-art in this dynamic field.
In the subsequent sections, we will provide an in-depth analysis of each category related to code
generation. This will encompass a definition of the problem, the challenges to be addressed, and a
comparison of the most prominent models and their performance evaluation.

4 LARGE LANGUAGE MODELS FOR CODE GENERATION


Large language models (LLMs) with Transformer architecture have revolutionized a multitude of
fields, and their application in code generation has been particularly impactful. These models follow
a comprehensive process that starts with the curation and synthesis of code data, followed by a
structured training approach that includes pre-training and fine-tuning, and the use of sophisticated
prompt engineering techniques. Recent advancements have seen the integration of repository-level
and retrieval-augmented code generation, as well as the development of autonomous coding agents.
Furthermore, the evaluation of coding abilities of LLMs has become a critical component of this
research area.
In the forthcoming sections, we will explore these dimensions of LLMs in the context of code
generation in detail. Section 4.1 will address the data curation and processing strategies employed
throughout the various stages of LLM development. Section 4.2 will discuss data synthesis methods
designed to mitigate the scarcity of high-quality data. Section 4.3 will outline the prevalent model
architectures used in LLMs for code generation. Moving to Section 4.4, we will examine the
techniques for full parameter fine-tuning and parameter-efficient fine-tuning, which are essential
for tailoring LLMs to the code generation task. Section 4.5 will shed light on enhancing code quality
through reinforcement learning, utilizing the power of feedback. Section 4.6 will delve into the
strategic use of prompts to maximize the coding capabilities of LLMs. The innovative approaches of
repository-level and retrieval-augmented code generation will be elaborated in Sections 4.7 and 4.8,
respectively. Additionally, Section 4.9 will discuss the exciting field of autonomous coding agents.
Lastly, Section 4.11 will provide insights into some of the practical applications that leverage LLMs
for code generation, demonstrating the real-world impact of these sophisticated models. Through
this comprehensive exploration, we aim to highlight the significance and potential of LLMs within
the domain of automated code generation.

4.1 Data Curation & Processing


The exceptional performance of Large Language Models (LLMs) can be attributed to their training
on large-scale and diverse datasets [264]. Meanwhile, the extensive parameters of these models
necessitate substantial data to unlock their full potential, in alignment with established scaling
laws [87, 114]. For a general-purpose LLM, amassing a large-scale corpus of natural language from
a variety of sources is imperative. Such sources include webpages, conversation data, books and


[Figure 3 presents the full taxonomy as a tree, with representative works attached to each leaf. Its top-level branches are: Data Curation & Processing (pre-training datasets, instruction-tuning datasets, and benchmarks for general, competition, data science, multilingual, reasoning, and repository-level scenarios); Data Synthesis; Model Architectures (encoder-decoder and decoder-only models); Recent Advances (pre-training tasks, instruction tuning via full-parameter and parameter-efficient fine-tuning, reinforcement learning with feedback, prompting engineering, repository-level and long-context generation, retrieval-augmented generation, and autonomous coding agents); Evaluation (metrics, human evaluation, and LLM-as-a-Judge); and Applications such as GitHub Copilot, CodeGeeX, CodeWhisperer, Codeium, CodeArts Snap, TabNine, and Replit.]

Fig. 3. Taxonomy of large language models (LLMs) for code generation.



Fig. 4. A diagram depicting the standard data preprocessing workflow utilized in the pre-training phase of
large language models (LLMs) for code generation.

news, scientific data, and code [19, 31, 49, 217, 218, 256]. These data are often crawled from the
web and must undergo meticulous and aggressive pre-processing [189, 271]. Fortunately, multiple
platforms and websites offer large-scale, open-source, and permissively licensed code corpora, such
as GitHub7 and Stack Overflow8 . Notably, the number of stars or forks of GitHub repositories has
emerged as a valuable metric for filtering high-quality code datasets. In a similar vein, the quantity
of votes on Stack Overflow can serve to discern the most relevant and superior answers.
Nonetheless, raw datasets are frequently laden with redundant, noisy data and personal infor-
mation, eliciting concerns regarding privacy leakage, which may include the names and email
addresses of repository contributors [7, 34, 123]. Consequently, it is essential to undertake rigorous
data-cleaning procedures. Typically, this process encompasses exact match deduplication, code
data filtering based on average line length and a defined threshold for the fraction of alphanumeric
characters, the removal of auto-generated files through keyword searches, and the expunction of
personal user data [118, 219]. Specifically, the standard data preprocessing workflow is depicted in
Figure 4.
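The sketch below illustrates this kind of pipeline on a list of raw code files: exact-match deduplication, filtering by average line length and alphanumeric fraction, removal of likely auto-generated files via keyword search, and simple redaction of email addresses. The thresholds, keywords, and regular expression are illustrative assumptions rather than the settings used by any particular dataset.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
AUTOGEN_KEYWORDS = ("auto-generated", "do not edit", "generated by")  # illustrative

def preprocess(files, max_avg_line_len=100, min_alnum_frac=0.25):
    """files: list of (path, content) pairs; returns cleaned (path, content) pairs."""
    seen_hashes, cleaned = set(), []
    for path, content in files:
        # 1) Exact-match deduplication on the file content.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)

        # 2) Heuristic quality filters: average line length and alphanumeric fraction.
        lines = content.splitlines() or [""]
        avg_line_len = sum(len(l) for l in lines) / len(lines)
        alnum_frac = sum(c.isalnum() for c in content) / max(len(content), 1)
        if avg_line_len > max_avg_line_len or alnum_frac < min_alnum_frac:
            continue

        # 3) Drop files whose header looks auto-generated.
        head = "\n".join(lines[:5]).lower()
        if any(k in head for k in AUTOGEN_KEYWORDS):
            continue

        # 4) Redact personal information such as email addresses.
        cleaned.append((path, EMAIL_RE.sub("<EMAIL>", content)))
    return cleaned
```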
The development of a proficient LLM for code generation necessitates the utilization of various
types of code data at different developmental stages. Therefore, we categorize code data into three
distinct classes: pre-training datasets, instruction-tuning datasets, and benchmarks for performance
evaluation. The subsequent subsections will provide a detailed illustration of code data within each
classification.

7 https://github.com
8 https://stackoverflow.com


4.1.1 Pre-training. The remarkable success of bidirectional pre-trained language models (PLMs)
such as BERT [61] and unidirectional PLMs like GPT [185] has firmly established the practice of
pre-training on large-scale unlabeled datasets to endow models with a broad spectrum of general
knowledge. Extending this principle to the realm of code generation enables Large Language
Models (LLMs) to assimilate fundamental coding principles, including the understanding of code
structure dependencies, the semantics of code identifiers, and the intrinsic logic of code sequences
[45, 76, 232, 234]. In light of this advancement, there has been a proliferation of large-scale unlabeled
code datasets proposed to serve as the foundational training ground for LLMs to develop coding
proficiency. A brief introduction of these datasets is as follows, with the statistics available in Table
1.
• CodeSearchNet [99]: CodeSearchNet corpus is a comprehensive dataset, consisting of 2
million (comment, code) pairs from open-source repositories on GitHub. It includes code
and documentation in several programming languages including Go, Java, PHP, Python,
JavaScript, and Ruby. The dataset was primarily compiled to promote research into the
problem of code retrieval using natural language.
• Google BigQuery [86]: the Google BigQuery Public Datasets program offers a full snapshot
of the content of more than 2.8 million open source GitHub repositories in BigQuery.
• The Pile [70]: the Pile is an 825 GiB diverse and open source language modeling dataset
aggregating 22 smaller, high-quality datasets including GitHub, Books3, and Wikipedia (en).
It aims to encompass text from as many modalities as possible, thereby facilitating the
development of models with broader generalization capabilities. For code generation, the
GitHub composite is specifically utilized.
• CodeParrot [219]: the CodeParrot dataset contains Python files used to train the code genera-
tion model in Chapter 10: Training Transformers from Scratch in the “NLP with Transformers
book” [219]. Created with the GitHub dataset available via Google’s BigQuery, the CodeParrot
dataset includes approximately 22 million Python files, totaling 180 GB (50 GB compressed).
• GitHub Code [219]: the GitHub Code dataset comprises 115M code files derived from GitHub,
spanning 32 programming languages and 60 extensions totaling 1TB of data. The dataset
was created from the public GitHub dataset on Google BigQuery.
• ROOTS [123]: the BigScience ROOTS Corpus is a 1.6TB dataset spanning 59 languages that
was used to train the 176B BigScience Large Open-science Open-access Multilingual (BLOOM)
language model. For the code generation task, the code subset of the ROOTS Corpus will be
specifically utilized.
• The Stack [118]: the Stack contains over 6TB of permissively licensed source code files that
cover 358 programming languages. The dataset was compiled as part of the BigCode Project,
an open scientific collaboration working on the responsible development of Large Language
Models for Code (Code LLMs).
• The Stack v2 [151]: The Stack v2, a dataset created as part of the BigCode Project, contains
over 3B files across more than 600 programming and markup languages. The dataset is
derived from the Software Heritage archive9 , the largest public archive of software source
code and accompanying development history.
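As a practical aside, most of these corpora are hosted on the Hugging Face Hub and can be streamed with the datasets library instead of being downloaded in full. The sketch below is an assumption-laden illustration: the dataset identifier matches the link in Table 1, but the subset path and the content field name should be checked against the dataset card.

```python
from datasets import load_dataset

# Stream a language subset of The Stack instead of materializing the full corpus.
# The data_dir subset path and the "content" column are assumptions; consult the dataset card.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

for i, example in enumerate(ds):
    source_code = example["content"]
    print(source_code[:200])
    if i == 2:  # inspect only a few files
        break
```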

4.1.2 Instruction Tuning. Instruction tuning refers to the process of fine-tuning large language
models (LLMs) using a collection of datasets that are structured as instructions. This method
has demonstrated a considerable improvement in model performance and an enhanced ability
to generalize to unseen tasks that the model has not previously encountered, as evidenced by
9 https://archive.softwareheritage.org


Table 1. The statistics of some commonly-used pre-training datasets for large language models (LLMs) aimed
at code generation. The column labeled ‘#PL’ indicates the number of programming languages included in
each dataset. It should be noted that in the CodeSearchNet [99] dataset, each file represents a function, and
for the Pile [70] and ROOTS [123] datasets, only the code components are considered.

Dataset Size (GB) Files (M) #PL Date Link


CodeSearchNet [99] 20 6.5 6 2022-01 https://huggingface.co/datasets/code_search_net
Google BigQuery[86] - - - 2016-06 github-on-bigquery-analyze-all-the-open-source-code
The Pile [70] 95 19 - 2022-01 https://huggingface.co/datasets/EleutherAI/pile
CodeParrot [219] 180 22 1 2021-08 https://huggingface.co/datasets/transformersbook/codeparrot
GitHub Code[219] 1,024 115 32 2022-02 https://huggingface.co/datasets/codeparrot/github-code
ROOTS [123] 163 15 13 2023-03 https://huggingface.co/bigscience-data
The Stack [118] 3,136 317 30 2022-10 https://huggingface.co/datasets/bigcode/the-stack
The Stack v2 [151] 32K 3K 619 2024-04 https://huggingface.co/datasets/bigcode/the-stack-v2

Table 2. The statistics of several representative datasets used in instruction-tuning large language models
(LLMs) for code generation. The column labeled ‘#PL’ indicates the number of programming languages
encompassed by each dataset.

Dataset Size #PL Date Link


CodeAlpaca-20K [40] 20k - 2023-03 https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
CommitPackFT [164] 2GB 277 2023-08 https://huggingface.co/datasets/bigcode/commitpackft
Evol-Instruct-Code-80k [195] 80k - 2023-07 https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
Magicoder-OSS-Instruct-75k [240] 75k Python, Shell, TypeScript, C++, Rust, PHP, Java, Swift, C# 2023-12 https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K
Self-OSS-Instruct-SC2-Exec-Filter-50k [261] 50k Python 2024-04 https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k

recent studies [51, 173]. Leveraging these benefits, instruction tuning has been expanded into coding
domains, especially code generation, which involves automatically producing the intended code
from a natural language description. The promise of instruction tuning in this area has led numerous
researchers to develop large-scale instruction-tuning datasets tailored for code generation. Below,
we provide an overview of several notable datasets, with their respective statistics detailed in Table 2
(an illustrative record format is sketched after this list).
• CodeAlpaca-20k [40]: CodeAlpaca-20k is a collection of 20K instruction-following data
generated using the data synthesis techniques termed Self-Instruct outlined in [231], with
modifications for code generation, editing, and optimization tasks instead of general tasks.
• CommitPackFT [164]: CommitPackFT is a 2GB refined version of CommitPack. It is filtered
to only include high-quality commit messages that resemble natural language instructions.
• Evol-Instruct-Code-80k [195]: Evol-Instruct-Code-80k is an open-source implementation of
Evol-Instruct-Code described in the WizardCoder paper [154], which enhances the fine-tuning
effect of pre-trained code large models by adding complex code instructions.
• Magicoder-OSS-Instruct-75k [240]: Magicoder-OSS-Instruct-75k is a dataset of 75K synthetic samples generated through OSS-Instruct with gpt-3.5-turbo-1106 and used to train both the Magicoder and Magicoder-S series models.
• Self-OSS-Instruct-SC2-Exec-Filter-50k [261]: Self-OSS-Instruct-SC2-Exec-Filter-50k is generated by StarCoder2-15B using the OSS-Instruct [240] data synthesis approach. It was subsequently used to fine-tune StarCoder2-15B without any human annotations or distilled data from large, proprietary LLMs.
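For concreteness, instruction-tuning samples for code generation typically pair an instruction (and optional input) with a target code response, which is then rendered into a single training string. The record and template below are a hypothetical illustration of this common format, not an excerpt from any of the datasets above.

```python
# A hypothetical instruction-tuning record in the common instruction/input/output format.
record = {
    "instruction": "Write a Python function that returns the n-th Fibonacci number.",
    "input": "",
    "output": (
        "def fib(n: int) -> int:\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
}

def render(sample: dict) -> str:
    # Render one record into the prompt-response string used for supervised fine-tuning;
    # the exact template varies between datasets and models.
    prompt = f"### Instruction:\n{sample['instruction']}\n"
    if sample["input"]:
        prompt += f"### Input:\n{sample['input']}\n"
    return prompt + f"### Response:\n{sample['output']}"

print(render(record))
```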

4.1.3 Benchmarks. To rigorously assess the efficacy of Large Language Models (LLMs) for code
generation, the research community has introduced a variety of high-quality benchmarks in
recent years. Building on the foundational work by [45], numerous variations of the HumanEval

dataset and additional benchmarks have emerged, aiming to evaluate a broader spectrum of code
generation capabilities in LLMs. We roughly divide these benchmarks into six distinct categories
based on their application contexts, including general-purpose, competitive programming, data
science, multilingual, logical reasoning, and repository-level. The statistics for these benchmarks
are presented in Table 3.
General
• HumanEval [45]: HumanEval comprises 164 manually scripted Python programming problems, each featuring a function signature, docstring, body, and multiple unit tests (an illustrative sketch of this problem format follows this list).
• HumanEval+ [145]: HumanEval+ extends the original HumanEval [45] benchmark by in-
creasing the scale of the test cases by 80 times. As the test cases increase, HumanEval+ can
catch significant amounts of previously undetected incorrect code synthesized by LLMs.
• HumanEvalPack [164]: expands HumanEval [45] by extending it to encompass three coding
tasks across six programming languages, namely code synthesis, code repair, and code
explanation.
• MBPP [16]: MBPP is a collection of approximately 974 Python programming problems, crowd-
sourced and designed for entry-level programmers. Each problem comes with an English
task description, a code solution, and three automated test cases.
• MBPP+ [145]: MBPP+ enhances MBPP [16] by eliminating ill-formed problems and rectifying
problems with incorrect implementations. The test scale of MBPP+ is also expanded by 35
times for test augmentation.
• CoNaLa [255]: CoNaLa contains almost 597K data samples for evaluating Python code generation. The curated part of CoNaLa is crawled from Stack Overflow, automatically filtered, and then curated by annotators. The mined part of CoNaLa is automatically extracted and comprises almost 600K examples.
• Spider [258]: Spider is a large-scale complex text-to-SQL dataset covering 138 different domains.
It has over 10K questions and 5.6K complex SQL queries on 200 databases. This dataset aims
to test a model’s ability to generalize to SQL queries, database schemas, and new domains.
• CONCODE [102]: CONCODE is a dataset with over 100K samples consisting of Java classes
from public GitHub repositories. It provides near zero-shot conditions that can test the
model’s ability to generalize to unseen natural language tokens with unseen environments.
• ODEX [236]: ODEX is an open-domain dataset focused on the execution-based generation
of Python code from natural language. It features 945 pairs of natural language queries and
their corresponding Python code, all extracted from StackOverflow forums.
• CoderEval [257]: CoderEval is a pragmatic code generation benchmark that includes 230
Python and 230 Java code generation problems. It can be used to evaluate the model perfor-
mance in generating pragmatic code beyond just generating standalone functions.
• ReCode [226]: ReCode serves as a comprehensive robustness evaluation benchmark. ReCode
applies perturbations to docstrings, function and variable names, code syntax, and code
format, thereby providing multifaceted assessments of a model’s robustness performance.
• StudentEval [18]: StudentEval is a dataset of 1,749 prompts for 48 problems, authored by 80 students who have only completed a one-semester Python programming class. Unlike many other benchmarks, it has multiple prompts per problem and multiple attempts by the same participant; each problem is also accompanied by a set of instructor-written test cases.
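The hypothetical snippet below sketches the shape of a HumanEval-style item (a prompt consisting of a function signature plus docstring, a canonical solution, and a check function of unit tests) together with a trivial harness that executes a candidate completion. It is an invented illustration of the format, not an actual benchmark problem, and real harnesses sandbox the execution.

```python
# Prompt given to the model: signature and docstring only (hypothetical problem).
PROMPT = '''def running_max(nums: list) -> list:
    """Return a list where element i is the maximum of nums[:i + 1]."""
'''

# Canonical (reference) solution, kept separate from the prompt.
CANONICAL = '''    result, best = [], float("-inf")
    for x in nums:
        best = max(best, x)
        result.append(best)
    return result
'''

# Unit tests in the usual check-function style.
CHECK = '''def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([-2, -5]) == [-2, -2]
    assert candidate([]) == []
'''

def passes(completion: str) -> bool:
    """Execute prompt + completion, then run the unit tests on the defined function."""
    env: dict = {}
    try:
        exec(PROMPT + completion, env)       # define running_max
        exec(CHECK, env)                     # define check
        env["check"](env["running_max"])     # raises AssertionError on failure
        return True
    except Exception:
        return False

print(passes(CANONICAL))  # True
```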
Competitions
• APPS [85]: The APPS benchmark is composed of 10K Python problems, spanning three levels
of difficulty: introductory, interview, and competition. Each entry in the dataset includes a

programming problem described in English, corresponding ground truth Python solutions,


and test cases defined by their inputs and outputs (or function names, if provided).
• CodeContests [136]: CodeContests is a competitive programming dataset consisting of samples from various
sources including Aizu, AtCoder, CodeChef, Codeforces, and HackerEarth. The dataset
encompasses programming problems accompanied by test cases in the form of paired inputs
and outputs, along with both correct and incorrect human solutions in multiple programming
languages.
Data Science
• DSP [38]: DSP allows for model evaluation based on real data science pedagogical notebooks.
It includes well-structured problems, along with unit tests to verify the correctness of solutions
and a Docker environment for reproducible execution.
• DS-1000 [122]: DS-1000 has 1K data science questions from seven Python libraries, namely NumPy, Pandas, TensorFlow, PyTorch, SciPy, Scikit-learn, and Matplotlib. The DS-1000 benchmark features: (1) realistic problems with diverse contexts, (2) multi-criteria evaluation metrics, and (3) defense against memorization.
• ExeDS [97]: ExeDS is a data science code generation dataset specifically designed for execution
evaluation. It contains 534 problems with execution outputs from Jupyter Notebooks, as well
as 123K examples for training and validation.
Multilingual
• MBXP [15]: MBXP is a multilingual adaptation of the original MBPP [16] dataset. It is created
using a framework that translates prompts and test cases from the original Python datasets
into the corresponding data in the targeted programming language.
• Multilingual HumanEval [15]: Multilingual HumanEval is a dataset derived from HumanEval
[45]. It is designed to assess the performance of models in a multilingual context. It helps
uncover the generalization ability of the given model on languages that are out-of-domain.
• HumanEval-X [275]: HumanEval-X is developed for evaluating the multilingual ability of code generation models with 820 hand-written data samples in C++, Java, JavaScript, and Go.
• MultiPL-E [36]: MultiPL-E is a dataset for evaluating LLMs for code generation across 18 pro-
gramming languages. It adopts the HumanEval [45] and the MBPP [16] Python benchmarks
and uses little compilers to translate them to other languages.
• xCodeEval [115]: xCodeEval is an executable multilingual multitask benchmark consisting of
25M examples covering 17 programming languages. Its tasks include code understanding,
generation, translation, and retrieval.
Reasoning
• MathQA-X [15] MathQA-X is the multilingual version of MathQA [12]. It is generated by
utilizing a conversion framework that converts samples from Python datasets into the target
language.
• MathQA-Python [16] MathQA-Python is a Python version of the MathQA benchmark[12].
The benchmark, containing more than 23K problems, is designed to assess the capability of
models to synthesize code from complex textual descriptions.
• GSM8K [53]: GSM8K is a dataset of 8.5K linguistically diverse grade school math problems. The dataset is crafted to facilitate the task of question answering on basic mathematical problems that require multi-step reasoning.
• GSM-HARD [71]: GSM-HARD is a more challenging version of the GSM8K [53] dataset. It
replaces the numbers in the GSM8K questions with larger, less common numbers, thereby
increasing the complexity and difficulty level of the problems.


Table 3. The detailed statistics of commonly-used benchmarks used in evaluating large language models
(LLMs) for code generation. The column labeled ‘#PL’ indicates the number of programming languages
included in each dataset. For the sake of brevity, we list the programming languages (PLs) for benchmarks
that support fewer than or include five PLs. For benchmarks with six or more PLs, we provide only a numerical
count of the PLs supported.

Scenario Benchmark Size #PL Date Link

General
HumanEval [45] 164 Python 2021-07 https://huggingface.co/datasets/openai_humaneval
HumanEval+ [145] 164 Python 2023-05 https://huggingface.co/datasets/evalplus/humanevalplus
HumanEvalPack [164] 164 6 2023-08 https://huggingface.co/datasets/bigcode/humanevalpack
MBPP [16] 974 Python 2021-08 https://huggingface.co/datasets/mbpp
MBPP+ [145] 378 Python 2023-05 https://huggingface.co/datasets/evalplus/mbppplus
CoNaLa [255] 596.88K Python 2018-05 https://huggingface.co/datasets/neulab/conala
Spider [258] 8,034 SQL 2018-09 https://huggingface.co/datasets/xlangai/spider
CONCODE [102] 104K Java 2018-08 https://huggingface.co/datasets/AhmedSSoliman/CONCOD
ODEX [236] 945 Python 2022-12 https://huggingface.co/datasets/neulab/odex
CoderEval [257] 460 Python, Java 2023-02 https://github.com/CoderEval/CoderEval
ReCode [226] 1,138 Python 2022-12 https://github.com/amazon-science/recode
StudentEval [18] 1,749 Python 2023-06 https://huggingface.co/datasets/wellesley-easel/StudentEval

Competitions
APPS [85] 10,000 Python 2021-05 https://huggingface.co/datasets/codeparrot/apps
CodeContests [136] 13,610 C++, Python, Java 2022-02 https://huggingface.co/datasets/deepmind/code_contests

Data Science
DSP [38] 1,119 Python 2022-01 https://github.com/microsoft/DataScienceProblems
DS-1000 [122] 1,000 Python 2022-11 https://huggingface.co/datasets/xlangai/DS-1000
ExeDS [97] 534 Python 2022-11 https://github.com/Jun-jie-Huang/ExeDS

Multilingual
MBXP [15] 12.4K 13 2022-10 https://huggingface.co/datasets/mxeval/mbxp
Multilingual HumanEval [15] 1.9K 12 2022-10 https://huggingface.co/datasets/mxeval/multi-humaneval
HumanEval-X [275] 820 Python, C++, Java, JavaScript, Go 2023-03 https://huggingface.co/datasets/THUDM/humaneval-x
MultiPL-E [36] 161 18 2022-08 https://huggingface.co/datasets/nuprl/MultiPL-E
xCodeEval [115] 5.5M 11 2023-03 https://github.com/ntunlp/xCodeEval

Reasoning
MathQA-X [15] 5.6K Python, Java, JavaScript 2022-10 https://huggingface.co/datasets/mxeval/mathqa-x
MathQA-Python [16] 23,914 Python 2021-08 https://github.com/google-research/google-research
GSM8K [53] 8.5K Python 2021-10 https://huggingface.co/datasets/gsm8k
GSM-HARD [71] 1.32K Python 2022-11 https://huggingface.co/datasets/reasoning-machines/gsm-hard

Repository
RepoEval [266] 3,573 Python, Java 2023-03 https://paperswithcode.com/dataset/repoeval
Stack-Repo [205] 200 Java 2023-06 https://huggingface.co/datasets/RepoFusion/Stack-Repo
Repobench [150] 27k Python, Java 2023-01 https://github.com/Leolty/repobench
EvoCodeBench [130] 275 Python 2024-03 https://huggingface.co/datasets/LJ0815/EvoCodeBench
SWE-bench [111] 2,294 Python 2023-10 https://huggingface.co/datasets/princeton-nlp/SWE-bench
CrossCodeEval [63] 10K Python, Java, TypeScript, C# 2023-10 https://github.com/amazon-science/cceval
SketchEval [265] 20,355 Python 2024-03 https://github.com/nl2code/codes

Repository
• RepoEval [266]: RepoEval enables the evaluation of repository-level code completion. It can
offer different levels of granularity and improved evaluation accuracy through the use of unit
tests.
• Stack-Repo [205]: Stack-Repo is a dataset of 200 Java repositories from GitHub with near-
deduplicated files. These files are augmented with three types of repository contexts: prompt
proposal contexts, BM25 Contexts (based on BM25 similarity scores), and RandomNN Con-
texts (obtained using the nearest neighbors in the representation space of an embedding
model).
• Repobench [150]: Repobench is a benchmark specifically used for evaluating repository-
level code auto-completion systems. Supporting both Python and Java, it consists of three
interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion),
and RepoBench-P (Pipeline).


• EvoCodeBench [130]: EvoCodeBench is an evolutionary code generation benchmark, constructed through a rigorous pipeline and aligned with real-world repositories. This benchmark also provides comprehensive annotations and robust evaluation metrics.
• SWE-bench [111]: SWE-bench is a dataset that tests a model’s ability to automatically solve
GitHub issues. The dataset has 2,294 Issue-Pull Request pairs from 12 popular Python reposi-
tories.
• CrossCodeEval [63]: CrossCodeEval is a diverse and multilingual scope completion dataset
covering four languages: Python, Java, TypeScript, and C#. This benchmark tests the model’s
ability to understand in-depth cross-file information and accurately complete the code.
• SketchEval [265]: SketchEval is a repository-oriented benchmark that encompasses data from
19 repositories, each varying in complexity. In addition to the dataset, SketchEval introduces
a metric, known as SketchBLEU, to measure the similarity between two repositories based
on their structures and semantics.

4.2 Data Synthesis


Numerous studies have demonstrated that high-quality datasets are integral to enhancing the
performance of large language models (LLMs) in various downstream tasks [31, 119, 159, 242, 248,
281]. For instance, the LIMA model, a 65B parameter LLaMa language model fine-tuned with a
standard supervised loss on a mere 1,000 meticulously curated prompts and responses, achieved
performance on par with, or even superior to, GPT-4 in 43% of evaluated cases. This figure rose to
58% when compared to Bard and 65% against DaVinci003, all without the use of reinforcement
learning or human preference modeling [281]. The QuRating initiative strategically selects pre-
training data embodying four key textual qualities — writing style, facts & trivia, required expertise,
and educational value — that resonate with human intuition. Training a 1.3B parameter model on
such data resulted in reduced perplexity and stronger in-context learning compared to baseline
models [242].
Despite these advancements, acquiring quality data remains a significant challenge due to issues
such as data scarcity, privacy concerns, and prohibitive costs [148, 231]. Human-generated data is
often labor-intensive and expensive to produce, and it may lack the necessary scope and detail to
navigate complex, rare, or ambiguous scenarios. As a resolution to these challenges, synthetic data
has emerged as a viable alternative. By generating artificial datasets that replicate the intricacies
of real-world information, models such as GPT-3.5-turbo [171] and GPT-4 [5] have enabled the
creation of rich datasets without the need for human annotation [82, 124, 148, 231]. This approach
is particularly beneficial in enhancing the instruction-following capabilities of LLMs, with a focus
on generating synthetic instruction-based data.
A notable example of this approach is the Self-Instruct [231] framework, which employs an off-the-
shelf language model to generate a suite of instructions, inputs, and outputs. This data is then refined
by removing invalid or redundant entries before being used to fine-tune the model. The empirical
evidence supports the efficacy of this synthetic data generation methodology. Building upon this
concept, the Alpaca [213] model, fine-tuned on 52k pieces of instruction-following data from a 7B
parameter LLaMa [217] model, exhibits performance comparable to the text-davinci-003 model.
WizardLM [250] introduced the Evol-Instruct technique, which incrementally transforms simple
instructions into more complex variants. The fine-tuned LLaMa model using this technique has
shown promising results in comparison to established proprietary LLMs such as ChatGPT [171] and
GPT-4 [5], to some extent. Moreover, Microsoft has contributed to this field with their Phi series of
models, predominantly trained on synthetic high-quality data, which includes Phi-1 (1.3B) [75]
for Python coding, Phi-1.5 (1.3B) [135] for common sense reasoning and language understanding,
Phi-2 (2.7B) [161] for advanced reasoning and language understanding, and Phi-3 (3.8B) [4] for
general purposes. These models have consistently outperformed larger counterparts across various
benchmarks, demonstrating the efficacy of synthetic data in model training.
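To make the Self-Instruct-style generate-then-filter loop described above concrete, the following minimal Python sketch shows one possible implementation; the complete() function is a placeholder for any off-the-shelf LLM call, and the crude word-overlap filter stands in for the ROUGE-based deduplication used in the original work.

import json
import random


def complete(prompt: str) -> str:
    """Placeholder for an off-the-shelf LLM call (e.g., an API request)."""
    raise NotImplementedError("Plug in an LLM of your choice here.")


def overlap(a: str, b: str) -> float:
    """Crude word-overlap similarity, standing in for ROUGE-L in Self-Instruct."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))


def self_instruct(seed_tasks, num_new=100, max_similarity=0.7):
    """Generate new (instruction, input, output) triples from a small seed pool."""
    pool = list(seed_tasks)
    while len(pool) < len(seed_tasks) + num_new:
        # Show a few in-context demonstrations sampled from the growing pool.
        demos = random.sample(pool, k=min(4, len(pool)))
        prompt = ("Propose one new coding task as JSON with keys "
                  "'instruction', 'input', 'output'.\n\nExamples:\n")
        prompt += "\n".join(json.dumps(d) for d in demos) + "\nNew task:\n"
        try:
            candidate = json.loads(complete(prompt))
        except (json.JSONDecodeError, NotImplementedError):
            break  # invalid generation (or no LLM plugged in); stop
        # Filter out near-duplicates to keep the synthetic set diverse.
        if all(overlap(candidate.get("instruction", ""), t["instruction"]) < max_similarity
               for t in pool):
            pool.append(candidate)
    return pool[len(seed_tasks):]  # return only the newly synthesized tasks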
Drawing on the successes of data synthesis for general-purpose Large Language Models (LLMs),
researchers have expanded the application of synthetic data to the realm of code generation. The
Code Alpaca model, as described in [40], has been fine-tuned on 7B and 13B LLaMA models using
a dataset of 20k instruction-following examples for code generation. This dataset was created
by text-davinci-003 and employed the Self-Instruct technique [231]. Building on this, the
WizardCoder 15B [154] utilizes the Evol-Instruct technique to create an enhanced dataset of
78k evolved code instruction examples. This dataset originates from the initial 20k instruction-
following dataset used by Code Alpaca [40], which was also generated by text-davinci-003. The
WizardCoder model, fine-tuned on the StarCoder [132] base model, achieved a 57.3% pass@1 on
the HumanEval benchmarks. This performance not only surpasses all other open-source Code
LLMs by a significant margin but also outperforms leading closed LLMs such as Anthropic’s Claude
and Google’s Bard. In a similar vein, Magicoder [240] introduces a novel data synthesis approach
termed OSS-INSTRUCT which enlightens LLMs with open-source code snippets to generate high-
quality instruction data for coding tasks. It aims to address the inherent biases often present
in synthetic data produced by LLMs. Building upon CodeLlama [196], the MagicoderS-CL-7B
model — fine-tuned with 75k synthetic instruction data using the OSS-INSTRUCT technique and
with gpt-3.5-turbo-1106 as the data generator — has outperformed the prominent ChatGPT
on the HumanEval Plus benchmark, achieving pass@1 of 66.5% versus 65.9%. In a noteworthy
development, Microsoft has introduced the phi-1 model [75], a more compact LLM of only 1.3B
parameters. Despite its smaller size, phi-1 has been trained on high-quality textbook data sourced
from the web (comprising 6 billion tokens) and supplemented with synthetic textbooks and exercises
generated with GPT-3.5 (1 billion tokens). It has achieved pass@1 of 50.6% on HumanEval and
55.5% on MBPP, setting a new state-of-the-art for Python coding performance among existing small
language models (SLMs). The latest contribution to this field is from the BigCode team, which has
presented StarCoder2-15B-instruct [261], the first entirely self-aligned code LLM trained with a
transparent and permissive pipeline. This model aligns closely with the OSS-INSTRUCT principles
established by Magicoder, generating instructions based on seed functions filtered from the Stack
v1 dataset [118] and producing responses through self-validation. Unlike Magicoder, StarCoder2-
15B-instruct employs its base model, StarCoder2-15B, as the data generator, thus avoiding reliance
on large and proprietary LLMs like GPT-3.5-turbo [171].
While synthetic data has demonstrated its potential across both small- and large-scale LMs for a
variety of general and specialized tasks, including code generation, it also poses several challenges
that must be addressed. These challenges include a lack of data diversity [242], the need to ensure
the factuality and fidelity of the information [221, 243], and the potential to amplify existing biases
or introduce new ones [23, 80].

4.3 Pre-Training
4.3.1 Model Architectures. Since the inception of the Transformer architecture for machine trans-
lation [222], it has become the de facto backbone for a multitude of large language models (LLMs)
that address a wide range of downstream tasks. The Transformer and its derivatives owe their
prominence to their exceptional ability to parallelize computation and their powerful representa-
tional capacities [256, 273]. Through innovative scaling techniques, such as Mixture-of-Experts
(MoE) [33, 200] and Depth-Up-Scaling (DUS) [117], the capacity of Transformer-based LLMs has
expanded to encompass hundreds of billions or even trillions of parameters. These scaled-up models
10 https://platform.openai.com


Table 4. The overview of large language models (LLMs) with encoder-decoder architectures for code generation (✓ indicates an open-source release).

Model | Institution | Size | Vocabulary | Context Window | Date | Open Source
PyMT5 [52] | Microsoft | 374M | 50K | 1024+1024 | 2020-10 |
PLBART [6] | UCLA | 140M | 50K | 1024+1024 | 2021-03 | ✓
CodeT5 [234] | Salesforce | 60M, 220M, 770M | 32K | 512+256 | 2021-09 | ✓
JuPyT5 [38] | Microsoft | 350M | 50K | 1024+1024 | 2022-01 |
AlphaCode [136] | DeepMind | 284M, 1.1B, 2.8B, 8.7B, 41.1B | 8K | 1536+768 | 2022-02 |
CodeRL [125] | Salesforce | 770M | 32K | 512+256 | 2022-06 | ✓
ERNIE-Code [37] | Baidu | 560M | 250K | 1024+1024 | 2022-12 | ✓
PPOCoder [204] | Virginia Tech | 770M | 32K | 512+256 | 2023-01 |
CodeT5+ [232] | Salesforce | 220M, 770M, 2B, 6B, 16B | 50K | 2048+2048 | 2023-05 | ✓
CodeFusion [207] | Microsoft | 75M | 32k | 128+128 | 2023-10 | ✓
AST-T5 [73] | UC Berkeley | 226M | 32k | 512+200/300 | 2024-01 | ✓

have exhibited a range of emergent abilities [87, 114, 238], such as instruction following [173],
in-context learning [65], and step-by-step reasoning [95, 239] that were previously unforeseen.
In the domain of code generation using LLMs, the architecture of contemporary models generally
falls into one of two categories: encoder-decoder models, such as CodeT5 [234], CodeT5+ [232],
and CodeRL [125]; or decoder-only models, such as Codex [45], StarCoder [132], Code Llama [196],
and CodeGemma [54]. These architectures are depicted in Figure 2(b) and (c), respectively. For a
comprehensive overview, Table 4 details the encoder-decoder architectures, while Table 5 focuses
on the decoder-only models utilized in code generation.
4.3.2 Pre-training Tasks. In the initial phase, language models for code generation are typically
trained from scratch using datasets consisting of manually annotated pairs of natural language
descriptions and corresponding code snippets, within a supervised learning framework. However,
manual annotation is not only laborious and time-consuming, but the efficacy of the resulting
models is also constrained by both the volume and the quality of the available annotated data. This
limitation is especially pronounced for low-resource languages, such as Swahili and Yoruba, where annotated examples are scarce [35, 43]. In light of these challenges,
there has been a shift towards an alternative training strategy that involves pre-training models on
extensive and unlabelled code corpora. This method is aimed at imbuing the models with a broad
understanding of programming knowledge, encompassing elements like identifiers, code structure,
and underlying semantics [45]. In this regard, two pre-training tasks have gained prominence
for their effectiveness, namely Causal Language Modeling (CLM), also known as unidirectional
language modeling or next-token prediction, and Denoising Autoencoding (DAE). The CLM task
can be applied to both decoder-only and encoder-decoder model architectures, while DAE tasks are
specifically designed for encoder-decoder frameworks. It should also be noted that there is a variety
of additional auxiliary pre-training tasks that can further enhance model performance. These
include Masked Identifier Prediction, Identifier Tagging, Bimodal Dual Generation [234], Text-Code
Matching, and Text-Code Contrastive Learning [232]. These tasks contribute to a more nuanced
and comprehensive pre-training process, equipping the models with the capabilities necessary to
handle a wide range of code generation scenarios.
Causal Language Modeling. In decoder-only LLMs, given a sequence of tokens x = {𝑥 1, . . . , 𝑥𝑛 },
the CLM task refers to autoregressively predicting the target tokens 𝑥𝑖 based on the preceding tokens
𝑥 <𝑖 in a sequence. The causal language modeling objective for training decoder LLMs is to minimize


Table 5. The overview of large language models (LLMs) with decoder-only architectures for code generation (✓ indicates an open-source release).

Model | Institution | Size | Vocabulary | Context Window | Date | Open Source
GPT-C [210] | Microsoft | 366M | 60K | 1024 | 2020-05 |
CodeGPT [153] | Microsoft | 124M | 50K | 1024 | 2021-02 | ✓
GPT-Neo [29] | EleutherAI | 125M, 1.3B, 2.7B | 50k | 2048 | 2021-03 | ✓
GPT-J [223] | EleutherAI | 6B | 50k | 2048 | 2021-05 | ✓
Codex [45] | OpenAI | 12M, 25M, 42M, 85M, 300M, 679M, 2.5B, 12B | - | 4096 | 2021-07 |
CodeParrot [219] | Hugging Face | 110M, 1.5B | 33k | 1024 | 2021-11 | ✓
PolyCoder [251] | CMU | 160M, 400M, 2.7B | 50k | 2048 | 2022-02 | ✓
CodeGen [169] | Salesforce | 350M, 2.7B, 6.1B, 16.1B | 51k | 2048 | 2022-03 | ✓
GPT-NeoX [28] | EleutherAI | 20B | 50k | 2048 | 2022-04 | ✓
PaLM-Coder [49] | Google | 8B, 62B, 540B | 256k | 2048 | 2022-04 |
InCoder [69] | Meta | 1.3B, 6.7B | 50k | 2049 | 2022-04 | ✓
PanGu-Coder [50] | Huawei | 317M, 2.6B | 42k | 1024 | 2022-07 |
PyCodeGPT [263] | Microsoft | 110M | 32k | 1024 | 2022-06 | ✓
CodeGeeX [275] | Tsinghua | 13B | 52k | 2048 | 2022-09 | ✓
BLOOM [126] | BigScience | 176B | 251k | - | 2022-11 | ✓
ChatGPT [171] | OpenAI | - | - | 16k | 2022-11 | ✓
SantaCoder [8] | Hugging Face | 1.1B | 49k | 2048 | 2022-12 | ✓
LLaMA [217] | Meta | 6.7B, 13.0B, 32.5B, 65.2B | 32K | 2048 | 2023-02 | ✓
GPT-4 [5] | OpenAI | - | - | 32K | 2023-03 |
CodeGen2 [168] | Salesforce | 1B, 3.7B, 7B, 16B | 51k | 2048 | 2023-05 | ✓
replit-code [193] | replit | 3B | 33k | 2048 | 2023-05 | ✓
StarCoder [132] | Hugging Face | 15.5B | 49k | 8192 | 2023-05 | ✓
WizardCoder [154] | Microsoft | 15B, 34B | 49k | 8192 | 2023-06 | ✓
phi-1 [75] | Microsoft | 1.3B | 51k | 2048 | 2023-06 | ✓
CodeGeeX2 [275] | Tsinghua | 6B | 65k | 8192 | 2023-07 | ✓
PanGu-Coder2 [201] | Huawei | 15B | 42k | 1024 | 2023-07 |
Llama 2 [218] | Meta | 7B, 13B, 70B | 32K | 4096 | 2023-07 | ✓
OctoCoder [164] | Hugging Face | 15.5B | 49k | 8192 | 2023-08 | ✓
Code Llama [196] | Meta | 7B, 13B, 34B | 32k | 16384 | 2023-08 | ✓
CodeFuse [143] | Ant Group | 350M, 13B, 34B | 101k | 4096 | 2023-09 | ✓
phi-1.5 [135] | Microsoft | 1.3B | 51k | 2048 | 2023-09 | ✓
CodeShell [247] | Peking University | 7B | 70k | 8192 | 2023-10 | ✓
Magicoder [240] | UIUC | 7B | 32k | 16384 | 2023-12 | ✓
AlphaCode 2 [10] | Google DeepMind | - | - | - | 2023-12 |
StableCode [182] | StabilityAI | 3B | 50k | 16384 | 2024-01 | ✓
WaveCoder [259] | Microsoft | 6.7B | 32k | 16384 | 2023-12 | ✓
phi-2 [161] | Microsoft | 2.7B | 51k | 2048 | 2023-12 | ✓
DeepSeek-Coder [79] | DeepSeek | 1.3B, 6.7B, 33B | 32k | 16384 | 2023-11 | ✓
StarCoder 2 [151] | Hugging Face | 15B | 49k | 16384 | 2024-02 | ✓
Claude 3 [13] | Anthropic | - | - | 200K | 2024-03 |
CodeGemma [54] | Google | 2B, 7B | 25.6k | 8192 | 2024-04 | ✓
Code-Qwen [215] | Qwen Group | 7B | 92K | 65536 | 2024-04 | ✓
Llama3 [160] | Meta | 8B, 70B | 128K | 8192 | 2024-04 | ✓
StarCoder2-Instruct [261] | Hugging Face | 15.5B | 49K | 16384 | 2024-04 | ✓

the following likelihood:


$$\mathcal{L}_{\mathrm{CLM}}^{\mathrm{Decoder\text{-}only}}(\mathbf{x}) = -\log\Big(\prod_{i=1}^{n} P_{\theta}(x_i \mid \mathbf{x}_{<i})\Big) = -\sum_{i=1}^{n} \log P_{\theta}(x_i \mid \mathbf{x}_{<i}) \qquad (15)$$

where x<𝑖 represents the sequence of preceding tokens {𝑥 1, . . . , 𝑥𝑖 −1 } before x𝑖 in the input, 𝜃
denotes the model parameters. The conditional probability 𝑃𝜃 (𝑥𝑖 | x<𝑖 ) is modeled by adding a
causal attention mask to the multi-head self-attention matrix of each Transformer block. To be
specific, causal attention masking is implemented by setting the lower triangular part of the
matrix to 0 and the remaining elements to −∞, ensuring that each token 𝑥𝑖 attends only to its
predecessors and itself. In contrast, in encoder-decoder LLMs, a pivot token 𝑥𝑘 is randomly selected from the sequence; the context before it is regarded as the source sequence x𝑖𝑛 = {𝑥 1, . . . , 𝑥𝑘 } for the encoder, and the sequence after it as the target output x𝑜𝑢𝑡 = {𝑥𝑘+1, . . . , 𝑥𝑛 } for the decoder. Formally, the causal language modeling objective for training encoder-decoder LLMs is
to minimize loss function as follows:
$$\mathcal{L}_{\mathrm{CLM}}^{\mathrm{Encoder\text{-}Decoder}}(\mathbf{x}) = -\log\Big(\prod_{i=k+1}^{n} P_{\theta}(x_i \mid \mathbf{x}_{\le k}, \mathbf{x}_{<i})\Big) = -\sum_{i=k+1}^{n} \log P_{\theta}(x_i \mid \mathbf{x}_{\le k}, \mathbf{x}_{<i}) \qquad (16)$$
where x ≤𝑘 is the source sequence input and x<𝑖 denotes the target sequence autoregressively
generated so far. During the inference phase, pre-trained LLMs that have been trained on large-
scale code corpus can generate code in a zero-shot manner without the need for fine-tuning. This
is achieved through the technique of prompt engineering, which guides the model to produce the
desired output [31, 186]. Additionally, recent studies have explored the use of few-shot learning,
also referred to as in-context learning, to enhance model performance further [131, 178].
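As a minimal, self-contained illustration of the objective in Eq. (15), the PyTorch sketch below computes the next-token prediction loss under a causal attention mask; TinyDecoder is a toy stand-in for a decoder-only Transformer, not any particular model from Table 5.

import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy decoder-only model: embedding + one causal self-attention block + LM head."""
    def __init__(self, vocab_size=100, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.embed(tokens)                          # (B, T, d)
        T = tokens.size(1)
        # Causal mask: position i may attend only to positions <= i.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        return self.lm_head(h)                          # (B, T, V)

def clm_loss(model, tokens):
    """Eq. (15): average negative log-likelihood of each token given its prefix."""
    logits = model(tokens[:, :-1])                      # predict token i from tokens < i
    targets = tokens[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

tokens = torch.randint(0, 100, (2, 16))                 # a batch of token ids
print(clm_loss(TinyDecoder(), tokens))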
Denoising Autoencoding. In addition to causal language modeling (CLM), the denoising
autoencoding (DAE) task has been extensively applied in pre-training encoder-decoder architectures
for code generation, such as PLBART [6], CodeT5 [234], and its enhanced successor, CodeT5+ [232].
Following T5 [189] and CodeT5 [234], the DAE refers to initially perturbing the source sequence
by introducing randomly masked spans of varying lengths. This corrupted sequence serves as the
input for the encoder. Subsequently, the decoder employs an autoregressive strategy to reconstruct
the masked spans, integrating sentinel tokens to facilitate the generation process. This method
has proven effective in improving the model’s ability to generate semantically and syntactically
accurate code by learning robust contextual representations [232, 234]. Formally, the denoising
autoencoding objective for training encoder-decoder LLMs is to minimize the following likelihood:
$$\mathcal{L}_{\mathrm{DAE}}^{\mathrm{Encoder\text{-}Decoder}}(\mathbf{x}) = -\sum_{i=1}^{k} \log P_{\theta}\big(\mathbf{x}_{i}^{\mathrm{masked\_spans}} \mid \mathbf{x}^{\backslash \mathrm{masked\_spans}},\, \mathbf{x}_{<i}^{\mathrm{masked\_spans}}\big) \qquad (17)$$

where $\theta$ denotes the model parameters, $\mathbf{x}^{\backslash \mathrm{masked\_spans}}$ is the noisy input with masked spans, $\mathbf{x}^{\mathrm{masked\_spans}}$ is the sequence of masked spans to be predicted by the decoder, $k$ denotes the number of tokens in $\mathbf{x}^{\mathrm{masked\_spans}}$, and $\mathbf{x}_{<i}^{\mathrm{masked\_spans}}$ is the span sequence autoregressively generated so far.
Compared with CLM, the DAE task presents a more challenging scenario, as it necessitates a deeper
understanding and capture of the intrinsic semantic relationships among token sequences by LLMs
[189].
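As a rough illustration of the span corruption underlying Eq. (17), the sketch below builds a noisy encoder input and its decoder target; the sentinel format, masking ratio, and span lengths are illustrative choices rather than the exact configuration of T5 or CodeT5.

import random

def corrupt_spans(tokens, mask_ratio=0.15, mean_span=3):
    """Build a DAE training pair: a noisy encoder input and the decoder target.

    Masked spans are replaced by sentinel tokens <extra_id_0>, <extra_id_1>, ...;
    the decoder target lists each sentinel followed by the tokens it hides.
    """
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span)):
            masked.add(i)

    encoder_input, decoder_target, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            tag = f"<extra_id_{sentinel}>"
            sentinel += 1
            encoder_input.append(tag)
            decoder_target.append(tag)
            while i < len(tokens) and i in masked:   # consume the whole masked span
                decoder_target.append(tokens[i])
                i += 1
        else:
            encoder_input.append(tokens[i])
            i += 1
    return encoder_input, decoder_target

code_tokens = "def add ( a , b ) : return a + b".split()
enc, dec = corrupt_spans(code_tokens)
print(enc)   # e.g. ['def', '<extra_id_0>', '(', 'a', ..., 'return', 'a', '+', 'b']
print(dec)   # e.g. ['<extra_id_0>', 'add']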

4.4 Instruction Tuning


After pre-training Large Language Models (LLM) on large-scale datasets, the next phase typically
involves augmenting the model’s ability to process and follow instructions, known as instruction
tuning. Instruction tuning generally refers to the supervised fine-tuning of pre-trained LLMs using
datasets comprised of structured examples framed as natural language instructions [103, 173, 237,
268]. Two exemplars of instruction data sampled from Code Alpaca [40] are demonstrated in Figure
5. It capitalizes on the heterogeneity of instruction types, positioning instruction tuning as a form
of multi-task prompted training that significantly enhances the model’s generalization to unseen
tasks [51, 173, 197, 237].
11 For more information on prompt engineering, visit https://www.promptingguide.ai


Example 1
Instruction:
Write code to create a list of all prime numbers between 2 and 100.

Input:
N/A

Output:
def find_primes(n):
    prime_list = [2]
    for number in range(2, n + 1):
        is_prime = True
        for k in range(2, number):
            if number % k == 0:
                is_prime = False
        if is_prime:
            prime_list.append(number)
    return prime_list

Example 2
Instruction:
Generate a snippet of code to extract all the URLs from the given string.

Input:
This string contains some urls such as https://www.google.com and https://www.facebook.com.

Output:
import re

string = "This string contains some urls such as https://www.google.com and https://www.facebook.com."

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)

print(urls)

Fig. 5. Two exemplars of instruction data sampled from Code Alpaca [40], used to instruction-tune pre-trained code LLMs to enhance their alignment with natural language instructions. The instruction corpus encompasses a variety of tasks, each accompanied by distinct instructions, such as prime number generation and URL extraction.

In the realm of code generation, natural language descriptions serve as the instructions guiding
the model to generate corresponding code snippets. Consequently, a line of research on instruction
tuning LLMs for code generation has garnered substantial interest across academia and industry.
To perform instruction tuning, instruction data are typically compiled from source code with
permissive licenses [99, 118, 151] (refer to Section 4.1.2) or are constructed from synthetic code data
[154, 240, 261] (refer to Section 4.2). These datasets are then utilized to fine-tune LLMs through
a supervised learning paradigm. However, the substantial computational resources required for
full parameter fine-tuning (FFT) LLM pose a notable challenge, particularly in scenarios with
constrained resources [62, 138]. To mitigate this issue, parameter-efficient fine-tuning (PEFT) has
emerged as a compelling alternative strategy, gaining increasing attention for its potential to reduce
resource consumption [62]. In the following subsection, we categorize existing works based on
their instruction-tuning strategies to provide a comprehensive and systematic review.
4.4.1 Full Parameter Fine-tuning. Full parameter fine-tuning (FFT) involves updating all parameters
within a pre-trained model, as shown in Figure 6(a). This approach is often preferred when ample
computational resources and substantial training data are available, as it typically leads to better
performance. [234] introduces an encoder-decoder pre-trained language model for code generation,
named CodeT5+. They instruction-tune this model on a dataset comprising 20k instruction samples
from Code Alpaca [40], resulting in an instruction-following model called InstructCodeT5+, which
exhibited improved capabilities in code generation. [154] leverages the Evol-Instruct data synthesis
technique from WizardLM [250] to evolve 20K code Alpaca [40] instruction samples into a 78K
code instruction dataset. This enriched dataset is then used to fine-tune the StarCoder base model,

resulting in WizardCoder, which showcases notable advancements in code generation. In a similar vein, inspired by the successes of WizardCoder [154] and RRHF [260], Pangu-Coder 2 [201] applies
the Evol-Instruct method to generate 68k high-quality instruction samples from the initial 20k Code
Alpaca [40] instruction samples. Additionally, they introduce a novel Reinforcement Learning via
Rank Responses to align Test & Teacher Feedback (RRTF), which further enhances the performance
of Pangu-Coder 2 in code generation. Diverging from synthetic instruction data generation methods,
OctoPack [164] utilizes real-world data by curating CommitPack from the natural structure of
Git commits, which inherently pair code changes with human-written instructions. This dataset,
consisting of 4 terabytes of Git commits across 350 programming languages, is employed to fine-
tune StarCoder [132] and CodeGeeX2 [275], leading to the instruction-following code models of
OctoCoder and OctoGeeX for code generation, respectively. The most recent innovation comes
from Magicoder [240], who proposes OSS-INSTRUCT, a novel data synthesis method that leverages
open-source code snippets to generate high-quality instruction data for code generation. This
approach seeks to reduce the bias often present in synthetic data generated by LLM. In line with
OSS-INSTRUCT, the BigCode team introduces StarCoder2-15B-instruct [261], which they claim to
be the first entirely self-aligned Large Language Model (LLM) for code generation, trained with
a fully permissive and transparent pipeline. Moreover, [54] harnesses open-source mathematics
datasets, such as MATH [85] and GSM8k [53], along with synthetically generated code following
the OSS-INSTRUCT [240] paradigm, to instruction-tune CodeGemma 7B, yielding exceptional
results in mathematical reasoning and code generation tasks.
4.4.2 Parameter-Efficient Fine-tuning. To mitigate the extensive computational and resource de-
mands inherent in fine-tuning large language models (LLMs), the concept of parameter-efficient
fine-tuning (PEFT) has emerged to focus on updating a minimal subset of parameters, which may
either be a selection of the model’s parameters or an array of additional parameters specifically
introduced for the tuning process [62, 138]. The categorization of these methods is depicted in
Figure 6(b), (c), and (d). A plethora of innovative PEFT approaches have been developed, among
which BitFit [262], Adapter [92], Prompt tuning [128], Prefix-tuning [134], LoRA [93], IA3 [144],
QLoRA [60], and AdaLoRA [267] are particularly noteworthy. A seminal study in this field, LoRA
[93], proposes a parameter update mechanism for a pre-trained weight matrix — such as those found
in the key or value projection matrices of a Transformer block’s multi-head self-attention layer — by
factorizing the update into two low-rank matrices. Crucially, all original model parameters remain
frozen, with only the pair of low-rank matrices being trainable. After fine-tuning, the product of
these low-rank matrices can be seamlessly incorporated into the existing weight matrix through
an element-wise addition. This process can be formally described as:
$$(\mathbf{W}_0 + \Delta\mathbf{W})\,x = \mathbf{W}_0 x + \Delta\mathbf{W} x = \mathbf{W}_0^{\mathrm{frozen}} x + \frac{\alpha}{r}\underbrace{\mathbf{B}_{up}^{\mathrm{trainable}}\mathbf{A}_{down}^{\mathrm{trainable}}}_{\Delta\mathbf{W}}\, x \qquad (18)$$

where $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ denotes a pre-trained weight matrix, and $\mathbf{B}_{up}^{\mathrm{trainable}} \in \mathbb{R}^{d\times r}$ and $\mathbf{A}_{down}^{\mathrm{trainable}} \in \mathbb{R}^{r\times k}$ are two trainable low-rank matrices, initialized with a zero matrix and a random Gaussian distribution $\mathcal{N}(0, \sigma^2)$, respectively, to ensure $\Delta\mathbf{W} = 0$ at the beginning of training. The rank $r \ll \min(d, k)$, and $\frac{\alpha}{r}$ is a scaling coefficient that balances the importance of the LoRA module, analogous to a learning rate.
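For concreteness, a minimal PyTorch sketch of the LoRA update in Eq. (18) is given below; it illustrates the mechanism only and is not a drop-in replacement for existing PEFT libraries.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A (down-projection) ~ Gaussian, B (up-projection) = 0, so delta W = 0 at init.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Wrap, e.g., a key or value projection of a Transformer block.
proj = LoRALinear(nn.Linear(64, 64))
out = proj(torch.randn(2, 10, 64))
trainable = [n for n, p in proj.named_parameters() if p.requires_grad]
print(trainable)   # ['A', 'B'] -- only the low-rank factors are updated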
Despite the advancements in PEFT methods, their application in code generation remains limited.
For instance, [108] pioneered the use of parameter-efficient instruction-tuning on a Llama 2 [218]
model with a single RTX 3090 GPU, leading to the development of a multilingual code generation
model called CodeUp. More recently, ASTRAIOS [285] conducted a thorough empirical examination
of parameter-efficient instruction tuning for code comprehension and generation tasks. This study


Fig. 6. An illustration of full parameter fine-tuning (FFT) and parameter-efficient fine-tuning (PEFT) methods.
(a) refers to the Full Fine-tuning method, which updates all parameters of the base model during fine-tuning.
(b) stands for the Specification-based PEFT method that conditionally fine-tunes a small subset of the model
parameters while freezing the rest of the model, e.g. BitFit [262]. (c) represents the Addition-based PEFT
method that fine-tunes the incremental parameters introduced into the base model or input, e.g. Adapter
[92], Prefix-tuning [134], and Prompt-tuning [128]. (d) symbolizes the Reparameterization-based method
which reparameterizes existing model parameters by low-rank transformation, e.g. LoRA [93], QLoRA [60],
and AdaLoRA [267].

yielded several perceptive observations and conclusions, contributing valuable insights to the
domain.

4.5 Reinforcement Learning with Feedback


Large language models (LLMs) have exhibited remarkable instruction-following capabilities through
instruction tuning. However, they often produce unexpected, toxic, biased, or hallucinated outputs that do not align with users’ intentions or preferences [107, 173, 235]. Con-
sequently, aligning LLMs with human preference has emerged as a pivotal area of research. A
notable work is InstructGPT [173], which further fine-tunes an instruction-tuned model utilizing
reinforcement learning with human feedback (RLHF) on a dataset where labelers have ranked
model outputs in order of quality, from best to worst. This method has been instrumental in the
development of advanced conversational language models, such as ChatGPT [171] and Bard [157].
Despite its success, acquiring high-quality human preference ranking data is a resource-intensive
process [127]. To address this, Reinforcement Learning from AI Feedback (RLAIF) [20, 127] has
been proposed to leverage powerful off-the-shelf LLMs (e.g., ChatGPT [171] and GPT-4 [5]) to
simulate human annotators by generating preference data.
Building on RLHF’s success, researchers have explored reinforcement learning with feedback to
enhance code generation in LLMs. Unlike RLHF, which relies on human feedback, this approach
employs compilers or interpreters to automatically provide feedback on code samples through code
execution on unit test cases, catalyzing the advancement of this research domain. CodeRL [125]
introduced an actor-critic reinforcement learning framework for code generation. In this setup, the
language model serves as the actor-network, while a token-level functional correctness reward
predictor acts as the critic. Generated code is assessed through unit test signals from a compiler,
which can indicate compiler errors, runtime errors, unit test failures, or passes. CompCoder [229]


enhances code compilability by employing compiler feedback, including language model fine-
tuning, compilability reinforcement, and compilability discrimination strategies. Subsequently,
PPOCoder [204] integrates pre-trained code model CodeT5 [234] with Proximal Policy Optimization
(PPO) [198]. This integration not only utilizes execution (i.e., compilers or interpreters) feedback to
assess syntactic and functional correctness but also incorporates a reward function that evaluates
the syntactic and semantic congruence between abstract syntax tree (AST) sub-trees and data flow
graph (DFG) edges in the generated code against the ground truth. Additionally, the framework
applies a KL-divergence penalty to maintain fidelity between the actively learned policy and the
referenced pre-trained model, enhancing the optimization process. More recently, RLTF [146] has
proposed an online reinforcement learning framework that provides fine-grained feedback based
on compiler error information and location, along with adaptive feedback that considers the ratio
of passed test cases.
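The execution-based feedback signal shared by these approaches can be sketched as follows: a candidate program is run against unit tests in a subprocess and the outcome is mapped to a scalar reward. The reward values and error categories below are illustrative (loosely following CodeRL's scheme), not the exact design of any cited method.

import subprocess
import sys
import tempfile

def execution_reward(code: str, unit_tests: str, timeout: float = 5.0) -> float:
    """Map execution feedback on a candidate program to a scalar reward.

    Illustrative scheme: -1.0 for a syntax error, -0.6 for a runtime error,
    failing tests, or a timeout, and +1.0 when all unit tests pass.
    """
    try:
        compile(code, "<candidate>", "exec")          # catch syntax errors statically
    except SyntaxError:
        return -1.0
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return -0.6
    return 1.0 if result.returncode == 0 else -0.6

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"
print(execution_reward(candidate, tests))             # 1.0: the candidate passes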
Despite these successes, reinforcement learning algorithms face inherent limitations such as
inefficiency, instability, extensive resource requirements, and complex hyperparameter tuning,
which can impede the performance and scalability of LLMs. To overcome these challenges, recent
studies have introduced various variants of RL methods that do not rely on PPO, including DPO
[188], RRHF [260], and sDPO [116]. In essence, these methods aim to maximize the likelihood
between the logarithm of conditional probabilities of preferred and rejected responses, which may
be produced by LLMs with varying capabilities. Inspired by RRHF [260], PanGu-Coder 2 [201]
leverages a novel framework, Reinforcement Learning via Rank Responses to align Test & Teacher
Feedback (RRTF), significantly enhancing code generation capabilities, as evidenced by pass@1 of
62.20% on the HumanEval benchmark.
Taking a step forward, the integration of more non-differentiable code features, such as coding
style [41, 158] and readability [32], into the reinforcement learning feedback for LLM-based code
generation, presents an exciting avenue for future research.

4.6 Prompt Engineering


Large-scale language models (LLMs) such as GPT-3 and its successors have been trained on large-
scale data corpora, endowing them with substantial world knowledge [31, 173, 237]. Despite this,
crafting an effective prompt to harness the full potential of LLMs remains a long-standing challenge
[147]. Recent advancements in prompt engineering have expanded the capabilities of LLMs,
enabling more sophisticated task completion and enhancing both reliability and performance.
Notable techniques include Chain-of-Thought (CoT) [239], Self-Consistency [230], Tree-of-Thought
(ToT) [253], Reasoning via Planning (RAP) [83], ReAct [254], Self-Refine [156], Reflexion [202], and
LATS [280].
Prompt engineering is particularly advantageous as it bypasses the need for additional training
and can significantly elevate performance. Consequently, numerous studies have leveraged this
technique for iterative and self-improving (refining) code generation within proprietary LLMs such
as ChatGPT and GPT-4. Figure 7 illustrates the general pipeline for self-improving code generation
with LLMs. For instance, Self-Debugging [47] involves prompting an LLM to iteratively refine a
predicted program by utilizing feedback composed of code explanations combined with execution
results, which assists in identifying and rectifying errors. When unit tests are unavailable, this
feedback can rely solely on code explanations. In parallel, SelfEvolve [110] employs a two-stage
process where LLMs first generate domain-specific knowledge for a problem, followed by a trial
code. This code is then iteratively refined through interactive prompting and execution feedback. An
empirical investigation by [170] provides a comprehensive analysis of the self-repairing capabilities
for code generation in models like Code Llama, GPT-3.5, and GPT-4, using problem sets from
HumanEval and APPS. This study yields a series of insightful observations and findings, shedding


[Figure 7 depicts a four-step loop with optional feedback. Step 1: Task (e.g., “Write a Python script to print all unique elements in a list.”); Step 2: Trajectory, in which the code LLM produces a candidate such as def unique_elements(lst): return list(set(lst)); Step 3: Evaluation, in which an executor runs the candidate against assertions like assert unique_elements([1, 2, 3, 4, 4]) == [1, 2, 3, 4] and assert unique_elements(['a', 'b', 'c', 'a', 'd']) == ['a', 'b', 'c', 'd']; Step 4: Reflection, in which the code LLM explains the failure (the built-in set() function in Python does not maintain the order of elements) and the resulting feedback is fed into the next generation round.]
Fig. 7. An illustration of the self-improving code generation pipeline using prompts for large language models
(LLMs). This process incorporates iterative self-refinement by integrating execution outcomes and includes
an optional self-reflection mechanism to enhance generation quality.

light on the self-refinement effectiveness of these LLMs. Moreover, Reflexion [202] introduces a
general approach for code generation wherein LLM-powered agents engage in verbal self-reflection
on task feedback signals, storing these reflections in an episodic memory buffer to inform and
improve decision-making in subsequent interactions. LATS [280] adopts a novel strategy, utilizing
LLMs as agents, value functions, and optimizers. It enhances decision-making by meticulously
constructing trajectories through Monte Carlo Tree Search (MCTS) algorithms, integrating external
feedback, and learning from experience. This approach has demonstrated remarkable results in
code generation, achieving a pass@1 of 94.4% on the HumanEval benchmark with GPT-4.
Distinct from the aforementioned methods, CodeT [42] and LEVER [166] prompt LLMs to
generate numerous code samples, which are then re-ranked based on execution outcomes to select
the optimal solution. Notably, these approaches do not incorporate a self-refinement step to further
improve code generation.
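To ground the discussion, a bare-bones version of the generate-execute-reflect loop of Figure 7 might look like the sketch below; llm is a placeholder callable for any chat-style model, and the prompts are illustrative rather than those used by Self-Debugging or Reflexion.

import subprocess
import sys
import tempfile

def run_tests(code: str, tests: str) -> str:
    """Execute candidate code with its tests; return '' on success, else the error text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return "Execution timed out."
    return "" if proc.returncode == 0 else proc.stderr

def self_refine(llm, task: str, tests: str, max_rounds: int = 3) -> str:
    """Iteratively generate code, execute it, and feed failures back to the model."""
    code = llm(f"Write Python code for the task:\n{task}")
    for _ in range(max_rounds):
        error = run_tests(code, tests)
        if not error:                      # all unit tests passed
            return code
        code = llm(
            f"Task: {task}\nYour code:\n{code}\n"
            f"It failed with:\n{error}\nExplain the bug and return a fixed version.")
        # In practice, one would extract only the code block from the reply.
    return code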

4.7 Repository Level & Long Context


In contemporary software engineering practices, modifications to a code repository are widespread
and encompass a range of activities, including package migration, temporary code edits, and the
resolution of GitHub issues. While large language models (LLMs) showcase impressive prowess in
function-level code generation, they often falter when grappling with the broader context inherent
to a repository, such as import dependencies, parent classes, and files bearing similar names. These
deficiencies result in suboptimal performance in repository-level code generation, as identified in
recent studies [205, 206]. The challenges faced by LLMs in this domain are primarily due to the
following factors:
• Code repositories typically contain intricate interdependencies scattered across various
files, including shared utilities, configurations, and cross-API invocations, which arise from
modular design principles [21, 266].
• Repositories are characterized by their unique structures, naming conventions, and coding
styles, which are essential for maintaining clarity and facilitating ongoing maintenance [41].
• The vast context of an entire repository often exceeds the context length limitations of LLMs,
thus hindering their ability to integrate comprehensive contextual information [21].


• LLMs may not have been adequately trained on extensive sets of repository data, such as
proprietary software or projects that are still in development [205].

Given that the scope of a typical software repository encompasses hundreds of thousands of
tokens, it is imperative to enhance the capacity of LLMs to handle extensive contexts when they
are employed for repository-level code generation. Fortunately, recent advancements in positional
encoding techniques, such as ALiBi [183] and RoPE [209], have shown promise in improving the
Transformer’s ability to generalize from shorter training sequences to longer inference sequences
[272]. This progress addresses the third challenge mentioned above to a certain degree, thereby
enabling better contextualization of coding activities within full repositories.
To further refine LLMs for repository-level code completion, several innovative approaches have
been introduced. RepoCoder [266] leverages a similarity-based retrieval system within an iterative
retrieval-generation paradigm to enrich the context and enhance code completion quality. In a
similar vein, CoCoMIC [64] employs a cross-file context finder named CCFINDER to pinpoint and
retrieve the most relevant cross-file contexts within a repository. RepoHyper [181] introduces a
semantic graph structure, termed RSG, to encapsulate the expansive context of code repositories
and uses an “Expand and Refine” retrieval method to obtain relevant code snippets. Moreover, a
framework known as RLPG [206] has been proposed to generate repository-level prompts that
integrate the repository’s structure with the relevant context across all files. However, the constant
reliance on retrieval mechanisms has raised concerns regarding efficiency and robustness, as some
retrieved contexts may prove unhelpful or harmful. In response, Repoformer [244] introduces a
selective Retrieval-Augmented Generation (RAG) framework that judiciously bypasses retrieval
when it is deemed redundant. This approach incorporates a self-supervised learning strategy that
equips a code LLM with the ability to perform a self-assessment on the utility of retrieval for
enhancing the quality of its output, thereby effectively utilizing potentially noisy retrieved contexts.
Additionally, RepoFusion [205] has been developed to train models to combine multiple relevant
contexts from a repository, aiming to produce more precise and context-aware code completions.
In a novel approach, Microsoft’s CodePlan [21] frames repository-level coding tasks as a planning
problem, generating a multi-step chain of edits (plan) where each step involves invoking an LLM on a
specific code location, considering context from the entire repository, preceding code modifications,
and task-specific instructions.
Advancing the state-of-the-art, [265] tackles the formidable challenge of NL2Repo, an endeavor
that seeks to create a complete code repository from natural language requirements. To address
this complex task, they introduce the CodeS framework, which strategically breaks down NL2Repo
into a series of manageable sub-tasks using a multi-layer sketch approach. The CodeS framework
comprises three distinct modules: 1) RepoSketcher, for creating a directory structure of the reposi-
tory based on given requirements; 2) FileSketcher, for sketching out each file within that structure;
and 3) SketchFiller, for fleshing out the specifics of each function within the file sketches [265].
Accordingly, a surge of benchmarks tailored for repository-level code generation has emerged,
such as RepoEval [266], Stack-Repo [205], Repobench [150], EvoCodeBench [130], SWE-bench
[111], CrossCodeEval [63], and SketchEval [265]. The detailed statistics and comparisons of these
benchmarks are presented in Table 3.
Despite the progress made by these methods in repository-level code generation, significant chal-
lenges remain to be addressed. Programming developers are often required to invest considerable
time in editing and debugging [24, 27, 163, 205, 220]. However, the advent of LLM-powered coding
agents, such as AutoCodeRover [270], SWE-Agent [112], and OpenDevin [172], has demonstrated
their potential to tackle complex problems, paving the way for future exploration in this field (for
more details, see Section 4.9).


[Figure 8 shows the two-stage RACG workflow. Stage 1: Retrieval, in which a query (e.g., “Create a quick-sort algorithm in Python.”) and open-source code data are encoded by an embedding model, and the most relevant code chunks are retrieved from a vector database. Stage 2: Generation, in which the prompt is combined with the retrieved context (“Please solve the above problem based on the following context: {context}”, where the context describes the quicksort algorithm) and fed to the code LLM, which produces the code solution, here a recursive quick_sort implementation.]
Fig. 8. A workflow illustration of the Retrieval-Augmented Code Generation (RACG). Upon receiving a query
(instruction), the retriever selects the relevant contexts from a large-scale vector database. Subsequently, the
retrieved contexts are merged with the query, and this combined input is fed into the generator (LLM) to
produce the target code solution.

4.8 Retrieval Augmented


Large Language Models (LLMs) have exhibited impressive capabilities but are hindered by sev-
eral critical issues such as hallucination [139, 269], obsolescence of knowledge [104], and non-
transparent [30], untraceable reasoning processes [72, 96, 239, 282]. While techniques like instruction-
tuning (see Section 4.4) and reinforcement learning with feedback (see Section 4.5) mitigate these
issues, they also introduce new challenges, such as catastrophic forgetting and the requirement for
substantial computational resources during training [81, 174].
Recently, Retrieval-Augmented Generation (RAG) has emerged as an innovative approach to
overcoming these limitations by integrating knowledge from external databases. Formally defined,
RAG denotes a model that, in response to queries, initially sources relevant information from
an extensive corpus of documents, and then leverages this retrieved information in conjunction
with the original query to enhance the response’s quality and accuracy, especially for knowledge-
intensive tasks. The RAG framework typically consists of a vector database, a retriever, a re-ranker,
and a generator. It is commonly implemented using tools such as LangChain and LLamaIndex.
By performing continuous knowledge updates of the database and the incorporation of domain-
specific data, RAG circumvents the need for re-training LLMs from scratch [72]. Consequently,
RAG has substantially advanced LLM performance across a variety of tasks [44, 129].
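At its core, the retrieval stage of such a pipeline reduces to nearest-neighbour search over embeddings, as in the hedged sketch below; embed and llm are placeholders for any embedding model and any (code) LLM, and the in-memory cosine-similarity store merely stands in for a real vector database.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class ToyVectorStore:
    """In-memory stand-in for a vector database."""
    def __init__(self, embed):
        self.embed, self.docs, self.vecs = embed, [], []

    def add(self, snippets):
        for s in snippets:
            self.docs.append(s)
            self.vecs.append(self.embed(s))

    def retrieve(self, query, top_k=3):
        q = self.embed(query)
        ranked = sorted(zip(self.docs, self.vecs), key=lambda d: -cosine(q, d[1]))
        return [doc for doc, _ in ranked[:top_k]]

def retrieval_augmented_generation(llm, store, query):
    """Stage 1: retrieve relevant code chunks. Stage 2: generate with the augmented prompt."""
    context = "\n\n".join(store.retrieve(query))
    prompt = (f"{query}\n\nPlease solve the above problem "
              f"based on the following context:\n{context}")
    return llm(prompt)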
Due to the nature of code, code LLMs are also susceptible to the aforementioned issues that
affect general-purpose LLMs. For instance, they may exhibit a hallucination phenomenon when
instructions fall outside the scope of their training data or necessitate the latest programming
packages. Given the dynamic nature of publicly available source-code libraries like PyTorch, which
undergo frequent expansion and updates, deprecated calling methods can become a significant
challenge. If Code LLMs are not updated in tandem with the latest functions and APIs, this can
introduce potential errors and safety risks. Retrieval-Augmented Code Generation (RACG) stands
12 LangChain facilitates the development of LLM-powered applications. https://www.langchain.com
13 LLamaIndex is a leading data framework for building LLM applications. https://www.llamaindex.ai


as a promising solution to these concerns. A workflow illustration of the RACG is depicted in Figure 8.
Despite its potential, the adoption of RAG for code generation remains limited. Drawing in-
spiration from the common practice among programmers of referencing related code snippets,
[149] introduced a novel retrieval-augmented mechanism with graph neural networks (GNNs),
termed HGNN, which unites the advantages of retrieving similar examples with the generalization capabilities of generative models for code summarization (the reverse process of code generation). [177] pioneered a retrieval-augmented framework named REDCODER for code gener-
ation by retrieving and integrating relevant code snippets from a source-code database, thereby
providing supplementary context for the generation process. Subsequently, a retrieval-augmented
code completion framework termed ReACC [152] is proposed to leverage both lexical copying and
semantic referencing of related code, achieving state-of-the-art performance on the CodeXGLUE
benchmark [153]. In the spirit of how programmers often consult textual resources such as code
manuals and documentation to comprehend functionalities, DocPrompting [283] explicitly utilizes
code documentation by retrieving the relevant documentation pieces based on a natural language
query and then generating the target code by blending the query with the retrieved information.
More recently, RepoCoder [266], an iterative retrieval-generation framework, is proposed for
enhancing repository-level code completion by effectively utilizing code analogies across different
files within a repository to inform and improve code suggestions. Furthermore, breaking away
from reliance on a singular source of retrieval, [208] developed a multi-faceted “knowledge soup”
that integrates web searches, documentation, execution feedback, and evolved code snippets. Then,
it incorporates an active retrieval strategy that iteratively refines the query and enriches the
knowledge soup, expanding the scope of information available for code generation.
Despite these advancements, several limitations in retrieval-augmented code generation warrant
further exploration: 1) the quality of the retrieved information significantly impacts overall perfor-
mance; 2) the effective integration of retrieved code information with the query needs optimization;
3) an over-reliance on retrieved information may lead to inadequate responses that fail to address
the query’s intent; 4) additional retrieved information necessitates larger context windows for the
LLM, resulting in increased computational demands.

4.9 Autonomous Coding Agents


The advent of large language models (LLMs) has marked the beginning of a new era of poten-
tial pathways toward artificial general intelligence (AGI), capturing significant attention in both
academia and industry [98, 225, 241, 246]. A rapidly expanding array of applications for LLM-based
autonomous agents, including AutoGPT [2], AgentGPT [1], BabyAGI [3], and AutoGen [245],
underlines the promise of this technology.
LLM-powered autonomous agents are systems endowed with sophisticated reasoning abilities,
leveraging an LLM as a central computational engine or controller. This allows them to formulate
and execute problem-solving plans through a series of tool-enabled functions or API calls. Moreover,
these agents are designed to function within a shared environment where they can communicate
and engage in cooperative, competitive, or negotiating interactions [94, 225, 245]. The typical
architecture of such an agent encompasses an LLM-based Agent, a memory module, a planning
component, and a tool utilization module, as depicted in Figure 9.
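As a schematic of this architecture, the loop below shows an LLM-driven agent alternating between tool calls and a final answer, with a simple list serving as memory; llm and the tool set are placeholders, and real frameworks (e.g., AutoGen or MetaGPT) are considerably more elaborate.

import json

def run_agent(llm, tools: dict, task: str, max_steps: int = 5) -> str:
    """Minimal plan-act loop: the LLM either calls a named tool or returns an answer."""
    memory = [f"Task: {task}"]                      # short-term memory of the trajectory
    for _ in range(max_steps):
        prompt = ("\n".join(memory) +
                  "\nRespond with JSON: {\"action\": <tool name or 'finish'>, \"input\": <string>}"
                  f"\nAvailable tools: {list(tools)}")
        decision = json.loads(llm(prompt))          # planning step
        if decision["action"] == "finish":
            return decision["input"]
        observation = tools[decision["action"]](decision["input"])   # act via a tool
        memory.append(f"Action: {decision}\nObservation: {observation}")
    return memory[-1]                               # give up after max_steps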
In the realm of automated code generation, LLM-powered autonomous agents have demon-
strated remarkable proficiency. For instance, AgentCoder [94] achieved a groundbreaking pass@1
of 96.3% on the HumanEval benchmark, moving a step closer to the future of automated soft-
ware development [100]. The innovative meta-programming framework termed MetaGPT [90]
integrates human workflow efficiencies into LLM-based multi-agent collaboration. Furthermore,


Fig. 9. The general architecture of an LLM-powered autonomous agent system, adapted from [241]. Planning:
The agent decomposes large tasks into smaller, manageable sub-goals or engages in self-criticism and self-
reflection on past actions to learn from mistakes and improve future performance. Memory: This component
enables the agent to store and retrieve past information. Tools: The agent is trained to invoke external
functions or APIs. Action: The agent executes actions, with or without the use of tools, to interact with the
environment. The gray dashed lines represent the data flow within the system.

[94] introduces AgentCoder, a multi-agent framework composed of three specialized agents, each
with distinct roles and capabilities. These roles include a programmer agent responsible for code
generation, a test designer agent tasked with generating unit test cases, and a test executor agent
that executes the code and provides feedback. This division of labor within AgentCoder promotes
more efficient and effective code generation. CodeAct [228] distinguishes itself by utilizing exe-
cutable Python code to consolidate LLM agent actions within a unified action space, in contrast
to the generation of JSON or textual formats. Additionally, AutoCodeRover [270] is proposed to
autonomously resolve GitHub issues for program enhancement.
To address the complexity of tasks within software engineering, two innovative autonomous AI
software engineers, Devin [56] and OpenDevin [172], have been released and rapidly garnered
considerable interest within the software engineering (SE) and artificial general intelligence (AGI)
community. Subsequently, an autonomous system, SWE-agent [112], leverages a language model
to interact with a computer to address software engineering tasks, successfully resolving 12.5% of
issues on the SWE-bench benchmark [111]. L2MAC [88] has been introduced as the first practical,
LLM-based, multi-agent, general-purpose stored-program automatic computer that utilizes a von
Neumann architecture, designed specifically for the generation of long and consistent outputs.
At the time of writing this survey, OpenDevin has enhanced CodeAct with bash command-based
tools, leading to the release of OpenDevin CodeAct 1.0 [249], which sets a new state-of-the-art
performance on the SWE-Bench Lite benchmark [111].
Despite these remarkable advancements, the journey toward fully realized AI software engineers
employing LLM-powered autonomous agents is far from complete [225, 246]. Critical aspects
such as prompt design, context length, agent count, and toolsets call for further refinement and
optimization, especially as problem complexities escalate [100].

4.10 Evaluation
14 https://www.cognition.ai/introducing-devin
15 https://github.com/OpenDevin/OpenDevin


Despite the impressive capabilities of large language models (LLMs), they exhibit a range of behaviors that are both beneficial and potentially risky. These behaviors can enhance performance across various downstream tasks but may also introduce reliability and trustworthiness concerns in LLM deployment [39, 45, 251]. Consequently, it is imperative to develop precise evaluation approaches to discern the qualitative and quantitative differences between models, thereby encouraging further advancements in LLM capabilities.
Evaluation strategies for LLMs in code generation mirror those for general-purpose LLMs and can be divided into three principal categories: metrics-based, human-centered, and LLM-based approaches. Detailed benchmarks for these evaluation strategies are presented in Section 4.1.3 and summarized in Table 3. Subsequent subsections will provide a thorough analysis of each approach.

Table 6. The performance comparison of LLMs for code generation on the HumanEval [45] benchmark, measured by Pass@{1, 10, 100}. For models with various sizes, we report only the largest size version of each model.

Model | Size | pass@1 | pass@10 | pass@100
GPT-4 [5] | - | 84.1 | - | -
GPT-3.5-Turbo [171] | - | 76.2 | - | -
Claude-3-Opus [13] | - | 82.9 | - | -
Claude-3-Haiku [13] | - | 76.8 | - | -
Claude-3-Sonnet [13] | - | 70.7 | - | -
StarCoder2-Instruct [261] | 15.5B | 72.6 | - | -
Llama3 [160] | 70B | 81.7 | - | -
CodeGemma [54] | 7B | 44.5 | - | -
StarCoder 2 [151] | 15B | 46.3 | - | -
phi-2 [161] | 2.7B | 49.4 | - | -
WaveCoder [259] | 6.7B | 75 | - | -
StableCode [182] | 3B | 29.3 | - | -
CodeShell [247] | 7B | 34.32 | - | -
CodeQwen [215] | 14B | 45.1 | - | -
DeepSeek-Coder [79] | 33B | 56.1 | - | -
replit-code [193] | 3B | 20.12 | - | -
Phi-1.5 [135] | 1.3B | 41.4 | - | -
PanGu-Coder2 [201] | 15B | 61.64 | 79.55 | 91.75
WizardCoder [154] | 15B | 57.3 | 73.2 | 90.46
CodeFuse [143] | 34B | 74.4 | - | -
Phi-1 [75] | 1.3B | 50.6 | - | -
Code Llama [196] | 34B | 48.8 | 76.8 | 93.0
OctoCoder [164] | 15.5B | 46.2 | - | -
PaLM-Coder [49] | 540B | 36 | - | 88.4
CodeGeeX2 [275] | 6B | 35.9 | 62.6 | 88.3
InstructCodeT5+ [232] | 16B | 35.0 | 54.5 | 77.9
CodeGen-NL [169] | 16.1B | 14.24 | 23.46 | 38.33
CodeGen-Multi [169] | 16.1B | 18.32 | 32.07 | 50.8
CodeGen-Mono [169] | 16.1B | 29.28 | 49.86 | 75
StarCoder [132] | 15B | 33.60 | 45.78 | 79.82
CodeT5+ [234] | 16B | 30.9 | 51.6 | 76.7
LLaMA2 [218] | 70B | 30.5 | 59.4 | 87.0
Codex [45] | 12B | 28.81 | 46.81 | 72.31
PaLM [49] | 540B | 26.2 | - | 76.2
PanGu-Coder [50] | 2.6B | 23.78 | 35.36 | 51.24
LLaMA [217] | 65B | 23.7 | - | 79.3
CodeGeeX [275] | 13B | 22.89 | 39.57 | 60.92
Replit [192] | 3B | 21.9 | - | -
CodeGen2 [168] | 16B | 20.46 | 36.5 | 56.71
SantaCoder [8] | 1.1B | 18 | 29 | 49
AlphaCode [136] | 1.1B | 17.1 | 28.2 | 45.3
BLOOM [126] | 176B | 15.52 | 32.20 | 55.45
GPT-NeoX [28] | 20B | 15.4 | 25.6 | 41.2
InCoder [69] | 6.7B | 15.2 | 27.8 | 47.0
GPT-J [223] | 6B | 11.62 | 15.74 | 27.74
PyCodeGPT [263] | 110M | 8.33 | 13.36 | 19.13
GPT-Neo [29] | 2.7B | 6.41 | 11.27 | 21.37
PolyCoder [251] | 2.7B | 5.59 | 9.84 | 17.68
JuPyT5 [38] | 300M | 5.4 | 15.46 | 25.60
CodeParrot [219] | 1.5B | 3.99 | 8.69 | 17.88

4.10.1 Metrics. The pursuit of effective and reliable automatic evaluation metrics for generated content is a long-standing challenge within the field of natural language processing (NLP) [46, 140, 175]. Early works directly leveraged token-matching-based metrics, such as Exact Match, BLEU [175], ROUGE [140], and METEOR [22], which are prevalent in NLP text generation, to assess the quality of code generation.
While these metrics offer a rapid and cost-effective approach for assessing the quality of generated code, they often fall short of capturing the syntactical and functional correctness, as well as the semantic features, of the code. To mitigate this limitation, CodeBLEU [191] was introduced, enhancing the traditional BLEU metric [175] by incorporating syntactic information through abstract syntax trees (AST) and semantic understanding via data-flow graphs (DFG). Despite these improvements, the metric does not fully resolve issues pertaining to execution errors or discrepancies in the execution results of the generated code. In light of these challenges, execution-based metrics have gained prominence for evaluating code generation, including pass@k [45], n@k [136], test case average [85], execution accuracy [190], and pass@t [170].
these challenges, execution-based metrics have


In particular, pass@k, serving as a principal evaluation metric, assesses the probability
that at least one out of 𝑘 code samples generated by a model will pass all unit tests. An unbiased
estimator for pass@k introduced by [45] is defined as:

$$\text{pass@}k := \mathbb{E}_{\text{task}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \qquad (19)$$

where 𝑛 is the total number of sampled candidate code solutions, 𝑘 is the number of randomly
selected code solutions from these candidates for each programming problem, with 𝑛 ≥ 𝑘, and 𝑐 is the count of correct samples among the 𝑛 generated candidates. Tables 6 and 7 illustrate the performance of
contemporary large language models (LLMs) for code generation, measured by the pass@k metric
across different values of 𝑘 ∈ {1, 10, 100} on the HumanEval and MBPP benchmarks, respectively.
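In practice, Eq. (19) is usually computed with the numerically stable product form below, as popularized by the HumanEval reference implementation; this is a small self-contained version rather than the benchmark's exact evaluation script.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task: n samples, c of them correct.

    Computes 1 - C(n-c, k) / C(n, k) as a running product to avoid huge binomials.
    """
    if n - c < k:
        return 1.0   # every size-k subset necessarily contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per task, with 12 and 0 correct samples respectively;
# the final score averages the per-task estimates (the expectation over tasks).
print(np.mean([pass_at_k(200, 12, 10), pass_at_k(200, 0, 10)]))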
Nevertheless, these execution-based methods are heavily dependent on the quality of unit tests
and are limited to evaluating executable code [264]. Consequently, when unit tests are unavailable,
token-matching-based metrics are often employed as an alternative for evaluation. Furthermore, in
scenarios lacking a ground truth label, unsupervised metrics such as perplexity (PPL) [105] can
serve as evaluative tools. Perplexity quantifies an LLM’s uncertainty in predicting new content,
thus providing an indirect measure of the model’s generalization capabilities and the quality of the
generated code.
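Concretely, perplexity is the exponentiated average negative log-likelihood of the evaluated tokens; a minimal sketch, assuming per-token log-probabilities are available from the model, is shown below.

import math

def perplexity(token_logprobs):
    """PPL = exp(-1/N * sum log p(x_i | x_<i)), from per-token log-probabilities."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

print(perplexity([-0.3, -1.2, -0.8, -0.1]))   # lower values indicate less uncertainty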
Taken together, while the aforementioned methods primarily focus on the functional correctness
of code, they do not provide a holistic evaluation that encompasses other critical dimensions such
as code vulnerability [165], maintainability [14], readability [32], complexity and efficiency [180],
stylistic consistency [158], and execution stability [187]. A comprehensive evaluation framework
that integrates these aspects remains an open area for future research and development in the field
of code generation assessment.

4.10.2 Human Evaluation. Given the intrinsic characteristics of code, the aforementioned automatic
evaluation metrics are inherently limited in their capacity to fully assess code quality. For instance,
metrics specifically designed to measure code style consistency are challenging to develop and
often fail to capture this aspect adequately [41]. When it comes to repository-level code generation,
the evaluation of overall code quality is substantially complicated due to the larger scale of the
task, which involves cross-file designs and intricate internal as well as external dependencies, as
discussed by [21, 205].
To overcome these challenges, conducting human evaluations becomes necessary, as it yields
relatively robust and reliable results. Human assessments also offer greater adaptability across
various tasks, enabling the simplification of complex and multi-step evaluations. Moreover, human
evaluations are essential for demonstrating the effectiveness of certain token-matching-based
metrics, such as CodeBLEU [191]. These studies typically conduct experiments to evaluate the
correlation coefficient between proposed metrics and quality scores assigned by actual users,
demonstrating their superiority over existing metrics.
Moreover, in an effort to better align large language models (LLMs) with human preferences and intentions, InstructGPT [173] employs human-written prompts and demonstrations, together with rankings of model outputs, to fine-tune LLMs using reinforcement learning from human feedback
(RLHF). Although similar alignment learning techniques have been applied to code generation, the
feedback in this domain typically comes from a compiler or interpreter, which offers execution
feedback, rather than from human evaluators. Notable examples include CodeRL [125], PPOCoder


[204], RLTF [146], and PanGu-Coder2 [201]. Further information on this topic is available in Section
4.5.
Table 7. The performance comparison of LLMs for code generation on the MBPP [16] benchmark, measured by Pass@{1, 10, 100}. For models with various sizes, we report only the largest size version of each model.

Model | Size | pass@1 | pass@10 | pass@100
GPT-3.5-Turbo [171] | - | 52.2 | - | -
Claude-3-Opus [13] | - | 89.4 | - | -
Claude-3-Haiku [13] | - | 80.2 | - | -
Claude-3-Sonnet [13] | - | 83.6 | - | -
StarCoder2-Instruct [261] | 15.5B | 78 | - | -
CodeGemma [54] | 7B | 65.1 | - | -
StarCoder 2 [151] | 15B | 50.6 | - | -
phi-2 [161] | 2.7B | 64 | - | -
WaveCoder [259] | 6.7B | 74.9 | - | -
CodeFuse [143] | 34B | 61.0 | - | -
CodeQwen [215] | 14B | 51.4 | - | -
DeepSeek Coder [79] | 33B | 66.0 | - | -
Phi-1.5 [135] | 1.3B | 43.5 | - | -
WizardCoder [154] | 16B | 51.8 | - | -
StarCoder [132] | 5.5B | 52.7 | - | -
SantaCoder [8] | 1.1B | 3.65 | 21.33 | 41.92
PyCodeGPT [263] | 110M | 9.39 | 28.37 | 48.71
PolyCoder [251] | 2.7B | 4.39 | 17.99 | 38.17
phi-1 [75] | 1.3B | 55.5 | - | -
PaLM-Coder [49] | 540B | 47 | - | -
PaLM [49] | 540B | 36.8 | - | -
LLaMA [217] | 65B | 37.7 | - | -
LLaMA 2 [218] | 70B | 45.4 | 66.2 | 83.1
CodeT5+ [234] | 16B | 56.6 | - | -
InCoder [69] | 6.7B | 21.3 | 46.5 | 66.2
GPT-Neo [29] | 2.7B | 5.89 | 23.09 | 44.26
GPT-J [223] | 6B | 11.30 | 35.62 | 53.63
CodeT5 [234] | 770M | 15.78 | 38.63 | 50.35
CodeParrot [219] | 1.5B | 1.29 | 8.66 | 27.17
Code Llama [196] | 34B | 55 | 76.2 | 86.6
CodeGen-NL [169] | 16.1B | 10.92 | 38.43 | 62.76
CodeGen-Multi [169] | 16.1B | 20.94 | 51.61 | 70.02
CodeGen-Mono [169] | 16.1B | 35.28 | 67.32 | 80.09
CodeGeeX [275] | 13B | 24.4 | 48 | -
BLOOM [126] | 1.7B | 3.16 | 14.23 | 31.38
PanGu-Coder [50] | 2.6B | 23.0 | 43.60 | 59.64
CodeGeeX2 [275] | 6B | 24.37 | 47.95 | -

Nonetheless, human evaluations are not without drawbacks, as they can be prone to certain issues that may compromise their accuracy and consistency. For instance, 1) personalized tastes and varying levels of expertise among human evaluators can introduce biases and inconsistencies into the evaluation process; 2) conducting comprehensive and reliable human evaluations often necessitates a substantial number of evaluators, leading to significant expense and time consumption; 3) the reproducibility of human evaluations is often limited, which presents challenges in extending previous evaluation outcomes or monitoring the progress of LLMs, as highlighted by [273].

4.10.3 LLM-as-a-Judge. The powerful instruction-following capabilities of large language models (LLMs) have stimulated researchers to innovatively investigate the potential of LLM-based evaluations. LLM-as-a-Judge [274] refers to the application of advanced proprietary LLMs (e.g., GPT-4, Gemini, and Claude 3) as proxies for human evaluators. This involves designing prompts with specific requirements to guide LLMs in conducting evaluations, as demonstrated by AlpacaEval [133] and MT-bench [274]. This method reduces reliance on human participation, thereby facilitating more efficient and scalable evaluations. Moreover, LLMs can offer insightful explanations for the assigned rating scores, thereby augmenting the interpretability of evaluations [273].

Nevertheless, the use of LLM-based evaluation for code generation remains relatively underexplored compared with general-purpose LLMs. A recent work [284] introduces the ICE-Score evaluation metric, which instructs LLMs to assess code. This approach attains superior correlations with functional correctness and human preferences, thereby eliminating the requirement for test oracles or references. As the capabilities of LLMs continue to improve, we anticipate seeing more research in this direction.
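As a minimal illustration of this prompt-based setup, the sketch below asks a judge model to rate a candidate solution through a hypothetical query_llm callable; the rubric, scale, and parsing logic are assumptions for exposition and do not reproduce the protocol of any specific work such as ICE-Score [284].

```python
import re

JUDGE_PROMPT = (
    "You are a strict code reviewer.\n"
    "Problem description:\n{problem}\n\n"
    "Candidate solution:\n{code}\n\n"
    "Rate the functional correctness and usefulness of the solution on a 1-5 scale.\n"
    'Reply with a single line of the form "Score: <number>", then a brief justification.'
)

def judge_code(problem: str, code: str, query_llm) -> int:
    """Ask a judge LLM to rate a candidate solution.

    query_llm is a hypothetical callable that sends a prompt string to a
    proprietary judge model (e.g., through its API) and returns its text reply.
    """
    reply = query_llm(JUDGE_PROMPT.format(problem=problem, code=code))
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else 0  # 0 marks an unparsable reply
```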
Despite their scalability and explainability, the effectiveness of LLM-based evaluation is con-
strained by the inherent limitations of the chosen LLM. Several studies have shown that most LLMs,
including GPT-4, suffer from several issues, including position, verbosity, and self-enhancement
biases, as well as restricted reasoning ability [274]. Specifically, position bias refers to the tendency


of large language models (LLMs) to disproportionately favor responses that are presented in certain
positions, which can skew the perceived quality of answers based on their order of presentation.
Meanwhile, verbosity bias describes the inclination of LLMs to prefer lengthier responses, even
when these are not necessarily of higher quality compared to more concise ones. Self-enhancement
bias, on the other hand, is observed when LLMs consistently overvalue the quality of the text they
generate [273, 274]. Moreover, due to their inherent limitations in tackling complex reasoning
challenges, LLMs may not be entirely reliable as evaluators for tasks that require intensive rea-
soning, such as those involving mathematical problem-solving. However, these shortcomings can
be partially addressed through the application of deliberate prompt engineering and fine-tuning
techniques, as suggested by [274].

4.11 Applications
Code LLMs have been integrated with development tools and platforms, such as integrated de-
velopment environments (IDEs) and version control systems, improving programming efficiency
substantially. In this section, we will briefly introduce several widely used applications as coding
assistants. The statistics of these applications are provided in Table 8.
GitHub Copilot. GitHub Copilot, powered by OpenAI’s Codex, is an AI pair programmer that helps developers write better code faster. Copilot suggests whole lines or blocks of code as the developer types, based on the context provided by the existing code and comments. It is trained on a dataset that includes
a significant portion of the public code available on GitHub, which enables it to understand a wide
range of programming languages and coding styles. Copilot not only improves productivity but
also serves as a learning tool by providing programmers with examples of how certain functions
can be implemented or how specific problems can be solved.
CodeGeeX. CodeGeeX stands out as a multifaceted programming assistant, proficient in code
completion, comment generation, code translation, and developer interactions. Its underlying code
generation LLM has been refined with extensive training on vast amounts of code data, exhibiting
superior performance on benchmarks like HumanEval, HumanEval-X, and DS1000. Renowned for
supporting multilingual code generation, CodeGeeX plays a pivotal role in enhancing the efficiency
of code development.
CodeWhisperer. Amazon’s CodeWhisperer is a versatile, machine-learning-driven code generator that offers on-the-fly code recommendations. Tailored to a developer’s coding patterns and comments, CodeWhisperer provides personalized suggestions that range from succinct comments to complex functions, all aimed at streamlining the coding workflow.
Codeium. Codeium is an AI-accelerated coding toolkit that offers a suite of functions, including
code completion, explanation, translation, search, and user chatting. Compatible with over 70
programming languages, Codeium delivers fast and cutting-edge solutions to coding challenges,
simplifying the development process for its users.
CodeArts Snap. Huawei’s CodeArts Snap is capable of generating comprehensive function-level
code from both Chinese and English descriptions. This tool not only reduces the monotony of
manual coding but also efficiently generates test code, in addition to providing automatic code
analysis and repair services.
Tabnine. Tabnine is an AI coding assistant that empowers development teams to leverage
AI for streamlining the software development lifecycle while maintaining strict standards for
privacy, security, and compliance. With a focus on enhancing coding efficiency, code quality, and
developer satisfaction, Tabnine offers AI-driven automation that is tailored to the needs of each team. Supporting over one million developers worldwide, Tabnine is applicable across various
industries.

Table 8. The overview of code assistant applications powered by large language models (LLMs). The columns labeled ‘PLs’ and ‘IDEs’ indicate programming languages and integrated development environments, respectively [264].

Institution | Products | Model | Supported Features | Supported PLs | Supported IDEs
GitHub & OpenAI | GitHub Copilot [45] | Codex | Code Completions, Code Generation, Coding Questions Answering, Code Refactoring, Code Issues Fix, Unit Test Cases Generation, Code Documentation Generation | Java, Python, JavaScript, TypeScript, Perl, R, PowerShell, Rust, SQL, CSS, Ruby, Julia, C#, PHP, Swift, C++, Go, HTML, JSON, SCSS, .NET, Less, T-SQL, Markdown | Visual Studio, VS Code, Neovim, JetBrains IDE
Zhipu AI | CodeGeeX [275] | CodeGeeX | Code Generation, Code Translation, Code Completion, Code Interpretation, Code Bugs Fix, Comment Generation, AI Chatbot | PHP, Go, C, C#, C++, Rust, Perl, CSS, Java, Python, JavaScript, TypeScript, Objective C++, Objective C, Pascal, HTML, SQL, Kotlin, R, Shell, Cuda, Fortran, Tex, Lean, Scala | Clion, RubyMine, AppCode, Aqua, IntelliJ IDEA, VS Code, PyCharm, Android Studio, WebStorm, Rider, GoLand, DataGrip, DataSpell
Amazon | CodeWhisperer [11] | − | Code Completion, Code Explanation, Code Translation, Code Security Identification, Code Suggestion | Java, Python, TypeScript, JavaScript, C# | JetBrains IDE, VS Code, AWS Cloud9, AWS Lambda
Codeium | Codeium [55] | − | Code Completion, Bug Detection, Code Suggestions, AI Chatbot, Test Type Generation, Test Plan Creation, Codebase Search | More than 70 languages in total, including but not limited to: C, C#, C++, Dart, CSS, Go, Elixir, HTML, Haskell, Julia, Java, JavaScript, Lisp, Kotlin, Lua, Objective-C, Perl, Pascal, PHP, Protobuf, R, Python, Ruby, Scala, Rust, Swift, SQL, TS, Vue | JetBrains, VSCode, Visual Studio, Colab, Jupyter, Deepnote, Notebooks, Databricks, Chrome, Vim, Neovim, Eclipse, Emacs, VSCode Web IDEs, Sublime Text
Huawei | CodeArts Snap [201] | PanGu-Coder | Code Generation, Code Explanation, Research and Development Knowledge Question and Answer, Code Comment, Code Debug, Unit Test Case Generation | Java, Python | PyCharm, VS Code, IntelliJ
Tabnine | TabNine [212] | − | Code Generation, Code Completion, Code Explanation, Bug Fix, Code Recommendation, Code Refactoring, Code Test Generation, Docstring Generation | Python, Javascript, Java, TypeScript, HTML, Haskell, Matlab, Kotlin, Sass, Go, PHP, Ruby, C, C#, C++, Swift, Rust, CSS, Perl, Angular, Dart, React, Objective C, NodeJS, Scala | Sublime, PyCharm, Neovim, Rider, VS Code, IntelliJ IDE, Visual Studio, PhpStorm, Vim, RubyMine, DataGrip, Android Studio, WebStorm, Emacs, Clion, Jupyter Notebook, JupyterLab, Eclipse, GoLand, AppCode
Replit | Replit [192] | replit-code | Code Completion, Code Editing, Code Generation, Code Explanation, Code Suggestion, Code Test Generation | C#, Bash, C, CSS, C++, Java, Go, HTML, JavaScript, Perl, PHP, Ruby, Python, R, SQL, Rust | −

Replit. Replit is a multifunctional platform that caters to a diverse array of software development
needs. As a free online IDE, it facilitates code collaboration and cloud services, and
fosters a thriving developer community. Replit also enables users to compile and execute code in
more than 50 programming languages directly within a web browser, eliminating the need for local
software installations.

5 CHALLENGES & OPPORTUNITIES


According to our investigation, LLMs have revolutionized the paradigm of code generation and achieved remarkable performance. Despite this promising progress, numerous challenges still need to be addressed, and they mainly stem from the gap between academia and practical development. For example, in academia the HumanEval benchmark has been established as a de facto standard for evaluating the coding proficiency of LLMs, yet many works have shown that evaluation on HumanEval does not reflect the scenarios of practical development [63, 67, 111, 145]. At the same time, these serious challenges offer substantial opportunities for further research and applications. In this section, we pinpoint critical challenges and identify promising opportunities, aiming to bridge the research-practicality divide.
Enhancing complex code generation at repository and software scale. Practical development scenarios often involve a large number of complex programming problems of varying difficulty levels. While LLMs have shown proficiency in generating function-level code snippets, they often struggle with complex, unseen problems at the repository and software level that are commonplace in real-world software development. Addressing such problems requires problem-solving skills in LLMs that go beyond function-level code generation. For example, AlphaCode [136] achieved only an average ranking in the top 54.3% in programming competitions, where solving problems requires an understanding of algorithms and complex natural language. Likewise, [111] argues that existing LLMs cannot resolve real-world GitHub issues well, since the best-performing model, Claude 2, solves a mere 1.96% of the issues. This poor performance is mainly attributed to weak reasoning capabilities [95], complex internal and external dependencies [21], and the context length limitation of LLMs [21]. Therefore, the pursuit of models that can handle repository- and software-level code generation opens up new avenues for automation in software development and makes programming more productive and accessible.
Innovating model architectures tuned to code structures. Due to their scalability and effec-
tiveness, Transformer-based LLM architectures have become dominant for code generation. Nevertheless, they might not be optimally designed to capture the inherent structure and
syntax of programming languages (PLs) [76, 77, 120, 155]. Code has a highly structured nature,
with a syntax that is more rigid than natural language. This presents a unique challenge for LLMs,
which are often derived from models that were originally designed for natural language processing
(NLP). The development of novel model architectures that inherently understand and integrate the
structural properties of code represents a significant opportunity to improve code generation and
comprehension. Innovations such as tree-based neural networks [162], which mirror the abstract
syntax tree (AST) representation of code, can offer a more natural way for models to learn and
generate programming languages. Additionally, leveraging techniques from the compiler theory,
such as intermediate representations (IR) [137], could enable models to operate on a more abstract
and generalizable level, making them effective across multiple programming languages [179]. By
exploring architectures beyond the traditional sequential models, researchers can unlock new
potentials in code generation.
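As a simple illustration of the structural signal such architectures aim to exploit, the sketch below uses Python's standard ast module to expose the abstract syntax tree of a snippet; the pre-order linearization at the end is only one possible way to feed structure to a model, not the input format of any specific architecture.

```python
import ast

source = "def square(x):\n    return x * x\n"
tree = ast.parse(source)

# Nested textual view of the abstract syntax tree (AST).
print(ast.dump(tree, indent=2))

# One possible structural encoding: the sequence of node types visited by ast.walk.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)  # e.g., starts with ['Module', 'FunctionDef', ...]
```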
Curating high-quality code data for pre-training and fine-tuning of LLMs. The efficacy
of LLMs largely depends on the quality and diversity of code datasets used during pre-training and


fine-tuning phases [119, 242, 281]. Currently, there is a scarcity of large, high-quality datasets that
encompass a wide range of programming tasks, styles, and languages. This limitation constrains the
ability of LLMs to generalize across unseen programming tasks, different coding environments, and
real-world software development scenarios. The development of more sophisticated data acquisition
techniques, such as automated mining of code repositories [142], advanced filtering algorithms, and
code data synthesis [148] (see Section 4.2), can lead to the creation of richer datasets. Collaborations
with industry partners (e.g., GitHub) could also facilitate access to proprietary codebases, thereby
enhancing the practical relevance of the training material. Furthermore, the adoption of open-source
models for dataset sharing can accelerate the collective effort to improve the breadth and depth of
code data available for LLM research.
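As a simple illustration of the filtering step mentioned above, the heuristics below (maximum line length, average line length, and alphanumeric fraction) are in the spirit of rules commonly reported for code pre-training corpora; the thresholds are assumptions for this sketch rather than the settings of any particular dataset.

```python
def keep_code_file(text: str,
                   max_line_len: int = 1000,
                   max_avg_line_len: int = 100,
                   min_alnum_frac: float = 0.25) -> bool:
    """Return True if a source file passes simple quality heuristics."""
    lines = text.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False  # likely minified or auto-generated code
    if sum(len(line) for line in lines) / len(lines) > max_avg_line_len:
        return False
    alnum_frac = sum(ch.isalnum() for ch in text) / max(len(text), 1)
    return alnum_frac >= min_alnum_frac  # drop files that are mostly data blobs

corpus = {"a.py": "def f(x):\n    return x + 1\n", "blob.py": "x" * 5000}
filtered = {path: src for path, src in corpus.items() if keep_code_file(src)}
print(sorted(filtered))  # ['a.py']
```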
Developing comprehensive benchmarks and metrics for coding proficiency evaluation
in LLMs. Current benchmarks like HumanEval may not capture the full spectrum of coding
skills required for practical software development [167]. Additionally, metrics often focus on
syntactic correctness or functional accuracy, neglecting aspects such as code efficiency [180],
style [41], readability [32], or maintainability [14]. The design of comprehensive benchmarks that
simulate real-world software development challenges could provide a more accurate assessment
of LLMs’ coding capabilities. These benchmarks should include diverse programming tasks of
varying difficulty levels, such as debugging [279], refactoring [203], and optimization [101], and
should be complemented by metrics that evaluate qualitative aspects of code. The establishment of
community-driven benchmarking platforms could facilitate continuous evaluation and comparison
of LLMs for code generation across the industry and academia.
Support for low-resource, low-level, and domain-specific programming languages. LLMs
are predominantly trained on popular high-level programming languages, leaving low-resource, low-
level, and domain-specific languages underrepresented. This lack of focus restricts the applicability
of LLMs in certain specialized fields and systems programming [216]. Intensifying research on
transfer learning and meta-learning approaches may enable LLMs to leverage knowledge from
high-resource languages to enhance their performance on less common ones [35, 43]. Additionally,
partnerships with domain experts can guide the creation of targeted datasets and fine-tuning
strategies to better serve niche markets. The development of LLMs with a capacity for multilingual
code generation also presents a significant opportunity for broadening the scope of applications.
Continuous learning for LLMs to keep pace with evolving coding knowledge. The
software development landscape is continuously evolving, with new languages, frameworks, and
best practices emerging regularly. LLMs risk becoming outdated if they cannot adapt to these
changes and incorporate the latest programming knowledge [104, 227]. While retrieval-augmented code generation offers a partial solution to these issues, its effectiveness is inherently constrained by the quality of the retrieved context [152, 266, 283]. Therefore,
establishing mechanisms for continuous learning and updating of LLMs can help maintain their
relevance over time. This could involve real-time monitoring of code repositories to identify trends
and innovations, as well as the creation of incremental learning systems that can assimilate new
information without forgetting previously acquired knowledge. Engaging the LLMs in active
learning scenarios where they interact with human developers may also foster ongoing knowledge
acquisition.
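A minimal sketch of the retrieval-augmented setup referred to here is shown below; it uses a crude bag-of-words overlap over an in-memory snippet store purely for illustration, whereas practical systems rely on dense retrievers and large indexed corpora [129, 152].

```python
from collections import Counter

def overlap(query: str, doc: str) -> int:
    """Crude bag-of-words overlap used as a stand-in for a real retriever."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def build_prompt(task: str, snippet_store: list[str], top_k: int = 2) -> str:
    """Prepend the most relevant (and most recent) snippets to the prompt."""
    retrieved = sorted(snippet_store, key=lambda s: overlap(task, s), reverse=True)[:top_k]
    context = "\n\n".join(retrieved)
    return f"# Relevant reference snippets:\n{context}\n\n# Task:\n{task}\n"

snippets = [
    "def read_json(path): ...  # illustrative snippet using a newer API",
    "def connect_db(url): ...  # illustrative snippet from a deprecated client",
]
prompt = build_prompt("write a function that reads a JSON config file", snippets)
# `prompt` would then be sent to the code LLM instead of the bare task description.
```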
Ensuring code safety and aligning LLM outputs with human coding preferences. Ensuring
the safety and security of code generated by LLMs is a paramount concern, as is their ability to
align with human preferences and ethical standards. Current models may inadvertently introduce
vulnerabilities or generate code that does not adhere to desired norms [45, 252]. Research into
the integration of formal verification tools within the LLM pipeline can enhance the safety of the


produced code. Additionally, developing frameworks for alignment learning that capture and reflect
human ethical preferences can ensure that the code generation process aligns with societal values
[173, 184]. Transparent and explainable AI methodologies can also contribute to building trust in
the LLM-generated code by making the decision-making process more accessible to developers.

6 CONCLUSION
In this survey, we provide a systematic literature review, serving as a valuable reference for
researchers investigating the cutting-edge progress in LLMs for code generation. We offer a thorough introduction to and analysis of data curation, the latest advances, performance evaluation, and real-world applications. In addition, we present a historical overview of the evolution of LLMs
for code generation in recent years and offer an empirical comparison using the widely recognized
HumanEval and MBPP benchmarks to highlight the progressive enhancements in LLM capabilities
for code generation. Critical challenges and promising opportunities regarding the gap between
academia and practical development are also identified for future investigation. Furthermore, we
have established a dedicated resource website to continuously document and disseminate the most
recent advances in the field. We hope this survey can contribute to a comprehensive and systematic
overview of LLMs for code generation and promote their thriving evolution. We optimistically believe that LLMs will ultimately change all aspects of coding and automatically write safe, helpful, accurate,
trustworthy, and controllable code, like professional programmers, and even solve coding problems
that currently cannot be solved by humans.

REFERENCES
[1] 2023. AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser. https://github.com/
reworkd/AgentGPT.
[2] 2023. AutoGPT is the vision of accessible AI for everyone, to use and to build on. https://github.com/Significant-
Gravitas/AutoGPT.
[3] 2023. BabyAGI. https://github.com/yoheinakajima/babyagi.
[4] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach,
Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model
locally on your phone. arXiv preprint arXiv:2404.14219 (2024).
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[6] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program
understanding and generation. arXiv preprint arXiv:2103.06333 (2021).
[7] Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. 2024. Traces of Memorisation in Large Language Models for
Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12.
[8] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas
Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint
arXiv:2301.03988 (2023).
[9] Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd acm sigsoft
international symposium on foundations of software engineering. 472–483.
[10] Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-
media/AlphaCode2/AlphaCode2_Tech_Report.pdf.
[11] Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-
cwspr.html.
[12] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019.
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). 2357–2367.
[13] Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/
de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.


[14] Luca Ardito, Riccardo Coppola, Luca Barbato, and Diego Verga. 2020. A tool-based perspective on software code
maintainability metrics: a systematic literature review. Scientific Programming 2020 (2020), 1–26.
[15] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad,
Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual evaluation of code generation models. arXiv preprint
arXiv:2210.14868 (2022).
[16] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie
Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732
(2021).
[17] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450
(2016).
[18] Hannah McLean Babe, Sydney Nguyen, Yangtian Zi, Arjun Guha, Molly Q Feldman, and Carolyn Jane Anderson. 2023.
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code. arXiv:2306.04556 [cs.LG]
[19] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang,
et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
[20] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna
Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv
preprint arXiv:2212.08073 (2022).
[21] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok,
Shashank Shet, et al. 2023. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499
(2023).
[22] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation
with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine
translation and/or summarization. 65–72.
[23] Enrico Barbierato, Marco L Della Vedova, Daniele Tessera, Daniele Toti, and Nicola Vanoli. 2022. A methodology for
controlling bias and fairness in synthetic data generation. Applied Sciences 12, 9 (2022), 4619.
[24] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with
code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
[25] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi
Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint
arXiv:2401.02954 (2024).
[26] Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Xuanhua Shi,
and Hai Jin. 2024. Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler
Feedback. arXiv preprint arXiv:2403.16792 (2024).
[27] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and
Idan Gazit. 2022. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools.
Queue 20, 6 (2022), 35–57.
[28] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy,
Kyle McDonell, Jason Phang, et al. 2022. Gpt-neox-20b: An open-source autoregressive language model. arXiv
preprint arXiv:2204.06745 (2022).
[29] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive
Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
[30] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein,
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models.
arXiv preprint arXiv:2108.07258 (2021).
[31] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[32] Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software
engineering 36, 4 (2009), 546–558.
[33] Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-connected Expert
Parallelism for Accelerating Mixture-of-Experts. arXiv preprint arXiv:2404.05019 (2024).
[34] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts,
Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th
USENIX Security Symposium (USENIX Security 21). 2633–2650.
[35] Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg,
Abhinav Jangda, and Arjun Guha. 2023. Knowledge Transfer from High-Resource to Low-Resource Programming


Languages for Code LLMs. arXiv preprint arXiv:2308.09895 (2023).


[36] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho
Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2022. A scalable and extensible approach to
benchmarking nl2code for 18 programming languages. arXiv preprint arXiv:2208.08227 (2022).
[37] Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu. 2022. ERNIE-Code: Beyond english-centric
cross-lingual pretraining for programming languages. arXiv preprint arXiv:2212.06742 (2022).
[38] Shubham Chandel, Colin B Clement, Guillermo Serrato, and Neel Sundaresan. 2022. Training and evaluating a jupyter
notebook data science assistant. arXiv preprint arXiv:2201.12901 (2022).
[39] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang,
Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems
and Technology 15, 3 (2024), 1–45.
[40] Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.
com/sahil280114/codealpaca.
[41] Binger Chen and Ziawasch Abedjan. 2023. DUETCS: Code Style Transfer through Generation and Retrieval. In 2023
IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2362–2373.
[42] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. Codet:
Code generation with generated tests. arXiv preprint arXiv:2207.10397 (2022).
[43] Fuxiang Chen, Fatemeh H Fard, David Lo, and Timofey Bryksin. 2022. On the transferability of pre-trained language
models for low-resource programming languages. In Proceedings of the 30th IEEE/ACM International Conference on
Program Comprehension. 401–412.
[44] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented
generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17754–17762.
[45] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[46] Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. 1998. Evaluation metrics for language models. (1998).
[47] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug.
arXiv preprint arXiv:2304.05128 (2023).
[48] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. Advances in
neural information processing systems 31 (2018).
[49] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways.
Journal of Machine Learning Research 24, 240 (2023), 1–113.
[50] Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng
Xiao, Bo Shen, Lin Li, et al. 2022. Pangu-coder: Program synthesis with function-level language modeling. arXiv
preprint arXiv:2207.11280 (2022).
[51] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa
Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine
Learning Research 25, 70 (2024), 1–53.
[52] Colin B Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5:
multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150
(2020).
[53] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry
Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168 (2021).
[54] CodeGemma Team, Ale Jakse Hartman, Andrea Hu, Christopher A. Choquette-Choo, Heri Zhao, Jane Fine, Jeffrey
Hui, Jingyue Shen, Joe Kelley, Joshua Howland, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Nam Nguyen, Paul Michel,
Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Siqi Zuo, Tris Warkentin, Zhitao Gong, et al. 2024. CodeGemma: Open Code Models Based on Gemma. (2024). https://goo.gle/codegemma
[55] Codeium. 2023. Free, ultrafast Copilot alternative for Vim and Neovim. https://github.com/Exafunction/codeium.vim.
[56] Cognition. 2024. Introducing Devin, the first AI software engineer. https://www.cognition.ai/introducing-devin.
[57] Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing tree-substitution grammars. The Journal of
Machine Learning Research 11 (2010), 3053–3096.
[58] Cognitive Computations. 2023. oa_leet10k. https://huggingface.co/datasets/cognitivecomputations/oa_leet10k.
[59] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and
Algorithms for the Construction and Analysis of Systems. Springer, 337–340.


[60] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized
llms. Advances in Neural Information Processing Systems 36 (2024).
[61] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[62] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min
Chan, Weize Chen, et al. 2022. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained
language models. arXiv preprint arXiv:2203.06904 (2022).
[63] Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh
Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file
code completion. Advances in Neural Information Processing Systems 36 (2024).
[64] Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia,
Dan Roth, and Bing Xiang. 2022. Cocomic: Code completion by jointly modeling in-file and cross-file context. arXiv
preprint arXiv:2212.10007 (2022).
[65] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022.
A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022).
[66] Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Junjie Shan, Caishuang Huang, Wei Shen, Xiaoran Fan,
Zhiheng Xi, et al. 2024. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback.
arXiv preprint arXiv:2402.01391 (2024).
[67] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng,
and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM
46th International Conference on Software Engineering. 1–13.
[68] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint
arXiv:2002.08155 (2020).
[69] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke
Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. arXiv preprint
arXiv:2204.05999 (2022).
[70] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish
Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint
arXiv:2101.00027 (2020).
[71] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023.
Pal: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
[72] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023.
Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
[73] Linyuan Gong, Mostafa Elhoushi, and Alvin Cheung. 2024. AST-T5: Structure-Aware Pretraining for Code Generation
and Understanding. arXiv preprint arXiv:2401.03003 (2024).
[74] Sumit Gulwani. 2010. Dimensions in program synthesis. In Proceedings of the 12th international ACM SIGPLAN
symposium on Principles and practice of declarative programming. 13–24.
[75] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan
Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all you need. arXiv preprint
arXiv:2306.11644 (2023).
[76] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal
Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). 7212–7225.
[77] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svy-
atkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint
arXiv:2009.08366 (2020).
[78] Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. Longcoder: A long-range pre-trained language
model for code completion. In International Conference on Machine Learning. PMLR, 12098–12107.
[79] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li,
et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence.
arXiv preprint arXiv:2401.14196 (2024).
[80] Aman Gupta, Deepak Bhatt, and Anubha Pandey. 2021. Transitioning from Real to Synthetic data: Quantifying the
bias in model. arXiv preprint arXiv:2105.04144 (2021).
[81] Aman Gupta, Anup Shirgaonkar, Angels de Luis Balaguer, Bruno Silva, Daniel Holstein, Dawei Li, Jennifer Marsman,
Leonardo O Nunes, Mahsa Rouzbahman, Morris Sharp, et al. 2024. RAG vs Fine-tuning: Pipelines, Tradeoffs, and a
Case Study on Agriculture. arXiv preprint arXiv:2401.08406 (2024).


[82] Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2023. Evaluating large language models in generating synthetic
hci research data: a case study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
1–19.
[83] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning
with language model is planning with world model. arXiv preprint arXiv:2305.14992 (2023).
[84] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[85] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874
(2021).
[86] Felipe Hoffa. 2016. GitHub on BigQuery: Analyze all the open source code. URL: https:// cloud.google.com/ blog/ topics/
public-datasets/ github-on-bigquery-analyze-all-the-open-source-code (2016).
[87] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las
Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language
models. arXiv preprint arXiv:2203.15556 (2022).
[88] Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. 2023. L2MAC: Large Language Model Automatic
Computer for Unbounded Code Generation. In The Twelfth International Conference on Learning Representations.
[89] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration.
arXiv preprint arXiv:1904.09751 (2019).
[90] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing
Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework.
arXiv preprint arXiv:2308.00352 (2023).
[91] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang.
2024. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs.SE]
[92] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,
Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International conference on
machine learning. PMLR, 2790–2799.
[93] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[94] Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code
Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023).
[95] Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint
arXiv:2212.10403 (2022).
[96] Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In 61st
Annual Meeting of the Association for Computational Linguistics, ACL 2023. Association for Computational Linguistics
(ACL), 1049–1065.
[97] Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui, Jeevana Priya Inala, Colin Clement, Nan
Duan, and Jianfeng Gao. 2022. Execution-based evaluation for data science code generation models. arXiv preprint
arXiv:2211.09374 (2022).
[98] Qiuyuan Huang, Naoki Wake, Bidipta Sarkar, Zane Durante, Ran Gong, Rohan Taori, Yusuke Noda, Demetri Ter-
zopoulos, Noboru Kuno, Ade Famoti, et al. 2024. Position Paper: Agent AI Towards a Holistic Intelligence. arXiv
preprint arXiv:2403.00833 (2024).
[99] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet
challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[100] Yoichi Ishibashi and Yoshimasa Nishimura. 2024. Self-Organized Agents: A LLM Multi-Agent Framework toward
Ultra Large-Scale Code Generation and Optimization. arXiv preprint arXiv:2404.02183 (2024).
[101] Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F Henriques, and
Anthony Hu. 2024. LangProp: A code optimization framework using Language Models applied to driving. arXiv
preprint arXiv:2401.10314 (2024).
[102] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Pro-
grammatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
1643–1652.
[103] Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu
Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through
the lens of generalization. arXiv preprint arXiv:2212.12017 (2022).
[104] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Jungkyu Choi, and Minjoon
Seo. 2022. Towards Continual Knowledge Learning of Language Models. In 10th International Conference on Learning


Representations, ICLR 2022. International Conference on Learning Representations.


[105] Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity—a measure of the difficulty of speech
recognition tasks. The Journal of the Acoustical Society of America 62, S1 (1977), S63–S63.
[106] Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. 2010. Oracle-guided component-based program
synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. 215–224.
[107] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi
Zhou, Zhaowei Zhang, et al. 2023. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852 (2023).
[108] Juyong Jiang and Sunghun Kim. 2023. CodeUp: A Multilingual Code Generation Llama2 Model with Parameter-
Efficient Instruction-Tuning. https://github.com/juyongjiang/CodeUp.
[109] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. Cure: Code-aware neural machine translation for automatic program
repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1161–1173.
[110] Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. Selfevolve: A code evolution framework via large language models.
arXiv preprint arXiv:2306.02907 (2023).
[111] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2023.
SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on
Learning Representations.
[112] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. (2024). https://swe-agent.com/
[113] Aravind Joshi and Owen Rambow. 2003. A formalism for dependency grammar based on tree adjoining grammar. In
Proceedings of the Conference on Meaning-text Theory. MTT Paris, France, 207–216.
[114] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford,
Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[115] Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty.
2023. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and
retrieval. arXiv preprint arXiv:2303.03004 (2023).
[116] Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. 2024. sDPO:
Don’t Use Your Data All at Once. arXiv preprint arXiv:2403.19270 (2024).
[117] Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim,
Hyeonju Lee, Jihoo Kim, et al. 2023. Solar 10.7 b: Scaling large language models with simple yet effective depth
up-scaling. arXiv preprint arXiv:2312.15166 (2023).
[118] Denis Kocetkov, Raymond Li, LI Jia, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Muñoz Ferrandis,
Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, et al. 2022. The Stack: 3 TB of permissively licensed source code.
Transactions on Machine Learning Research (2022).
[119] Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum,
Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. Openassistant conversations-democratizing large language
model alignment. Advances in Neural Information Processing Systems 36 (2024).
[120] Bonan Kou, Shengmai Chen, Zhijie Wang, Lei Ma, and Tianyi Zhang. 2023. Is model attention aligned with human
attention? an empirical study on large language models for code generation. arXiv preprint arXiv:2306.01220 (2023).
[121] Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of
programming languages. arXiv preprint arXiv:2006.03511 (2020).
[122] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida
Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International
Conference on Machine Learning. PMLR, 18319–18345.
[123] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao,
Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. The bigscience
roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems 35 (2022),
31809–31826.
[124] Moritz Laurer. 2024. Synthetic data: save money, time and carbon with open source. https://huggingface.co/blog/
synthetic-data-save-costs.
[125] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering
code generation through pretrained models and deep reinforcement learning. Advances in Neural Information
Processing Systems 35 (2022), 21314–21328.
[126] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexan-
dra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual
language model. (2023).


[127] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and
Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint
arXiv:2309.00267 (2023).
[128] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning.
arXiv preprint arXiv:2104.08691 (2021).
[129] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler,
Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp
tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[130] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation
Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
[131] Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2023. Towards enhancing in-context learning for code generation.
arXiv preprint arXiv:2303.17780 (2023).
[132] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone,
Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161
(2023).
[133] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B.
Hashimoto. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-
lab/alpaca_eval.
[134] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint
arXiv:2101.00190 (2021).
[135] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are
all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023).
[136] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling,
Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624
(2022), 1092–1097.
[137] Zongjie Li, Pingchuan Ma, Huaijin Wang, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2022. Unleashing the power of
compiler intermediate representation to enhance neural program embeddings. In Proceedings of the 44th International
Conference on Software Engineering. 2253–2265.
[138] Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient
fine-tuning. arXiv preprint arXiv:2303.15647 (2023).
[139] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak
Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2023. Holistic Evaluation of Language Models. Transactions on Machine
Learning Research (2023).
[140] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.
74–81.
[141] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of transformers. AI open 3 (2022),
111–132.
[142] Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. 2007. Mining internet-scale software
repositories. Advances in neural information processing systems 20 (2007).
[143] Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen,
Hailian Zhou, et al. 2023. Mftcoder: Boosting code llms with multitask fine-tuning. arXiv preprint arXiv:2311.02303
(2023).
[144] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel.
2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural
Information Processing Systems 35 (2022), 1950–1965.
[145] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really
correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing
Systems 36 (2024).
[146] Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023. Rltf: Reinforcement learning
from unit test feedback. arXiv preprint arXiv:2307.04349 (2023).
[147] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt,
and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023),
1–35.
[148] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang,
Denny Zhou, et al. 2024. Best Practices and Lessons Learned on Synthetic Data for Language Models. arXiv preprint
arXiv:2404.07503 (2024).

1:44 Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim

[149] Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2020. Retrieval-Augmented Generation for Code
Summarization via Hybrid GNN. In International Conference on Learning Representations.
[150] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-
completion systems. arXiv preprint arXiv:2306.03091 (2023).
[151] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang,
Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv
preprint arXiv:2402.19173 (2024).
[152] Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A Retrieval-
Augmented Code Completion Framework. In Proceedings of the 60th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). 6227–6240.
[153] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain,
Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and
generation. arXiv preprint arXiv:2102.04664 (2021).
[154] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin,
and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In The Twelfth
International Conference on Learning Representations.
[155] Wei Ma, Mengjie Zhao, Xiaofei Xie, Qiang Hu, Shangqing Liu, Jie Zhang, Wenhan Wang, and Yang Liu. 2022. Are
Code Pre-trained Models Powerful to Learn Code Syntax and Semantics? arXiv preprint arXiv:2212.10017 (2022).
[156] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri,
Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in
Neural Information Processing Systems 36 (2024).
[157] James Manyika and Sissie Hsiao. 2023. An overview of Bard: an early experiment with generative AI. AI. Google
Static Documents 2 (2023).
[158] Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, and Egor Bulychev. 2019. STYLE-ANALYZER:
fixing code style inconsistencies with interpretable unsupervised algorithms. In 2019 IEEE/ACM 16th International
Conference on Mining Software Repositories (MSR). IEEE, 468–478.
[159] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards
zero-shot language understanding. Advances in Neural Information Processing Systems 35 (2022), 462–477.
[160] Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-
llama-3/.
[161] Mojan Javaheripi and Sébastien Bubeck. 2023. Phi-2: The surprising power of small language models. https://www.
microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models.
[162] Lili Mou, Ge Li, Zhi Jin, Lu Zhang, and Tao Wang. 2014. TBCNN: A tree-based convolutional neural network for
programming language processing. arXiv preprint arXiv:1409.5718 (2014).
[163] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2022. Reading between the lines: Modeling user
behavior and costs in AI-assisted programming. arXiv preprint arXiv:2210.14306 (2022).
[164] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru
Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models.
arXiv preprint arXiv:2308.07124 (2023).
[165] Antonio Nappa, Richard Johnson, Leyla Bilge, Juan Caballero, and Tudor Dumitras. 2015. The attack of the clones: A
study of the impact of shared code on vulnerability patching. In 2015 IEEE symposium on security and privacy. IEEE,
692–708.
[166] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever:
Learning to verify language-to-code generation with execution. In International Conference on Machine Learning.
PMLR, 26106–26128.
[167] Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz,
Caiming Xiong, et al. 2023. L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language
Models. arXiv preprint arXiv:2309.17446 (2023).
[168] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. Codegen2: Lessons for
training llms on programming and natural languages. arXiv preprint arXiv:2305.02309 (2023).
[169] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022.
Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474
(2022).
[170] Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Is Self-Repair
a Silver Bullet for Code Generation?. In The Twelfth International Conference on Learning Representations.
[171] OpenAI. 2022. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt.
[172] OpenDevin. 2024. OpenDevin: Code Less, Make More. https://github.com/OpenDevin/OpenDevin.
[173] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
Advances in neural information processing systems 35 (2022), 27730–27744.
[174] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2023. Fine-tuning or retrieval? comparing
knowledge injection in llms. arXiv preprint arXiv:2312.05934 (2023).
[175] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation
of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics.
311–318.
[176] Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl Barr, and Sergey Mechtaev.
2024. The Fact Selection Problem in LLM-Based Program Repair. arXiv preprint arXiv:2404.05520 (2024).
[177] Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented
Code Generation and Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021.
2719–2734.
[178] Arkil Patel, Siva Reddy, Dzmitry Bahdanau, and Pradeep Dasigi. 2023. Evaluating In-Context Learning of Libraries
for Code Generation. arXiv preprint arXiv:2311.09635 (2023).
[179] Indraneil Paul, Jun Luo, Goran Glavaš, and Iryna Gurevych. 2024. IRCoder: Intermediate Representations Make
Language Models Robust Multilingual Code Generators. arXiv preprint arXiv:2403.03894 (2024).
[180] Norman Peitek, Sven Apel, Chris Parnin, André Brechmann, and Janet Siegmund. 2021. Program comprehension and
code complexity metrics: An fmri study. In 2021 IEEE/ACM 43rd International Conference on Software Engineering
(ICSE). IEEE, 524–536.
[181] Huy N Phan, Hoang N Phan, Tien N Nguyen, and Nghi DQ Bui. 2024. RepoHyper: Better Context Retrieval Is All You
Need for Repository-Level Code Completion. arXiv preprint arXiv:2403.06095 (2024).
[182] Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, Ashish Datta, Maksym Zhu-
ravinskyi, Dakota Mahan, Marco Bellagente, Carlos Riquelme, et al. 2024. Stable Code Technical Report. arXiv
preprint arXiv:2404.01226 (2024).
[183] Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input
length extrapolation. arXiv preprint arXiv:2108.12409 (2021).
[184] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning
aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693
(2023).
[185] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by
generative pre-training. (2018).
[186] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are
unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[187] Steven Raemaekers, Arie Van Deursen, and Joost Visser. 2012. Measuring software library stability through historical
version analysis. In 2012 28th IEEE international conference on software maintenance (ICSM). IEEE, 378–387.
[188] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct
preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing
Systems 36 (2024).
[189] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine
learning research 21, 140 (2020), 1–67.
[190] Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-sql capabilities of large
language models. arXiv preprint arXiv:2204.00498 (2022).
[191] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco,
and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297
(2020).
[192] Replit. 2016. Idea to software, fast. https://replit.com.
[193] Replit. 2023. replit-code-v1-3b. https://huggingface.co/replit/replit-code-v1-3b.
[194] Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code Generation with AlphaCodium: From Prompt Engineering
to Flow Engineering. arXiv preprint arXiv:2401.08500 (2024).
[195] Nick Roshdieh. 2023. Evol-Instruct-Code-80k. https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1.
[196] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal
Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950
(2023).
[197] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud
Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization.
In ICLR 2022-Tenth International Conference on Learning Representations.
[198] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347 (2017).
[199] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv
preprint arXiv:1803.02155 (2018).
[200] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017.
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538
(2017).
[201] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang
Zhao, et al. 2023. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint
arXiv:2307.14936 (2023).
[202] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language
agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36 (2024).
[203] Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, and Yutaka Watanobe. 2023. Refactoring Programs
Using Large Language Models with Few-Shot Examples. arXiv preprint arXiv:2311.11690 (2023).
[204] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. 2023. Execution-based code generation using
deep reinforcement learning. arXiv preprint arXiv:2301.13816 (2023).
[205] Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, and Torsten Scholak. 2023. RepoFusion:
Training Code Models to Understand Your Repository. arXiv preprint arXiv:2306.10998 (2023).
[206] Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-level prompt generation for large language
models of code. In International Conference on Machine Learning. PMLR, 31693–31715.
[207] Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, and Gust Verbruggen. 2023. Codefusion: A
pre-trained diffusion model for code generation. arXiv preprint arXiv:2310.17680 (2023).
[208] Hongjin Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, and Tao Yu. 2024. ARKS: Active
Retrieval in Knowledge Soup for Code Generation. arXiv preprint arXiv:2402.12317 (2024).
[209] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer
with rotary position embedding. Neurocomputing 568 (2024), 127063.
[210] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation
using transformer. In Proceedings of the 28th ACM joint meeting on European software engineering conference and
symposium on the foundations of software engineering. 1433–1443.
[211] Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, and Gabriel Synnaeve. 2022.
Code translation with compiler representations. In Proceedings of the Eleventh International Conference on Learning
Representations: ICLR.
[212] TabNine. 2018. AI Code Completions. https://github.com/codota/TabNine.
[213] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B.
Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_
alpaca.
[214] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre,
Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and
technology. arXiv preprint arXiv:2403.08295 (2024).
[215] Qwen Team. 2024. Code with CodeQwen1.5. https://qwenlm.github.io/blog/codeqwen1.5.
[216] Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt,
and Siddharth Garg. 2023. Benchmarking large language models for automated verilog rtl code generation. In 2023
Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6.
[217] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971 (2023).
[218] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288 (2023).
[219] Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. 2022. Natural language processing with transformers. O’Reilly
Media, Inc.
[220] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability
of code generation tools powered by large language models. In Chi conference on human factors in computing systems
extended abstracts. 1–7.
[221] Boris Van Breugel, Zhaozhi Qian, and Mihaela Van Der Schaar. 2023. Synthetic data, real errors: how (not) to publish
and use synthetic data. In International Conference on Machine Learning. PMLR, 34793–34808.
[222] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[223] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https:
//github.com/kingoflolz/mesh-transformer-jax.
[224] Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024. Teaching Code LLMs to
Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024).
[225] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen,
Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18,
6 (2024), 1–26.
[226] Shiqi Wang, Li Zheng, Haifeng Qian, Chenghao Yang, Zijian Wang, Varun Kumar, Mingyue Shang, Samson Tan,
Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. 2022.
ReCode: Robustness Evaluation of Code Generation Models. arXiv preprint arXiv:2212.10264 (2022).
[227] Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. 2023. Knowledge editing for large language
models: A survey. arXiv preprint arXiv:2310.16218 (2023).
[228] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code
actions elicit better llm agents. arXiv preprint arXiv:2402.01030 (2024).
[229] Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu.
2022. Compilable Neural Code Generation with Compiler Feedback. In Findings of the Association for Computational
Linguistics: ACL 2022. 9–19.
[230] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny
Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171
(2022).
[231] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi.
2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In The 61st Annual Meeting Of The
Association For Computational Linguistics.
[232] Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi. 2023. CodeT5+: Open Code Large
Language Models for Code Understanding and Generation. In Proceedings of the 2023 Conference on Empirical Methods
in Natural Language Processing. 1069–1088.
[233] Yanlin Wang and Hui Li. 2021. Code completion by modeling flattened abstract syntax trees as graphs. In Proceedings
of the AAAI conference on artificial intelligence, Vol. 35. 14015–14023.
[234] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-
Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing. 8696–8708.
[235] Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and
Qun Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023).
[236] Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022. Execution-based evaluation for open-domain
code generation. arXiv preprint arXiv:2212.10481 (2022).
[237] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and
Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).
[238] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma,
Denny Zhou, Donald Metzler, et al. 2022. Emergent Abilities of Large Language Models. Transactions on Machine
Learning Research (2022).
[239] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing
systems 35 (2022), 24824–24837.
[240] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need.
arXiv preprint arXiv:2312.02120 (2023).
[241] Lilian Weng. 2023. LLM-powered Autonomous Agents. lilianweng.github.io (Jun 2023). https://lilianweng.github.io/
posts/2023-06-23-agent/
[242] Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for
Training Language Models. arXiv preprint arXiv:2402.09739 (2024).
[243] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. 2021.
Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF international
conference on computer vision. 3681–3691.
[244] Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. 2024. Repoformer: Selective
Retrieval for Repository-Level Code Completion. arXiv preprint arXiv:2403.10059 (2024).
[245] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang,
and Chi Wang. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv
preprint arXiv:2308.08155 (2023).
[246] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin,
Enyu Zhou, et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint
arXiv:2309.07864 (2023).
[247] Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, and Wei Ye. 2024. CodeShell Technical Report.
arXiv preprint arXiv:2403.15747 (2024).
[248] Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. 2023. Data selection for language models via
importance resampling. Advances in Neural Information Processing Systems 36 (2023), 34201–34227.
[249] Xingyao Wang, Bowen Li, and Graham Neubig. 2024. Introducing OpenDevin CodeAct 1.0, a new State-of-the-art in
Coding Agents. https://www.cognition.ai/introducing-devin.
[250] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023.
Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[251] Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large
language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming.
1–10.
[252] Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, security, privacy,
explainability, efficiency, and usability of large language models for code. arXiv preprint arXiv:2403.07506 (2024).
[253] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of
thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems
36 (2024).
[254] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct:
Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations
(ICLR).
[255] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned
code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining
software repositories. 476–486.
[256] Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim,
Munhyong Kim, Sungju Kim, et al. 2024. HyperCLOVA X Technical Report. arXiv preprint arXiv:2404.01954 (2024).
[257] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao
Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings
of the 46th IEEE/ACM International Conference on Software Engineering. 1–12.
[258] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle
Roman, et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing
and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
3911–3921.
[259] Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2023.
Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint
arXiv:2312.14187 (2023).
[260] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to
align language models with human feedback without tears. arXiv preprint arXiv:2304.05302 (2023).
[261] Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Harm de Vries, Leandro von Werra, Arjun Guha,
and Lingming Zhang. 2024. StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation.
https://github.com/bigcode-project/starcoder2-self-align.
[262] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. Bitfit: Simple parameter-efficient fine-tuning for
transformer-based masked language-models. arXiv preprint arXiv:2106.10199 (2021).
[263] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-
Guang Lou. 2022. CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint
arXiv:2206.06888 (2022).
[264] Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023.
Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). 7443–7464.
[265] Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan,
et al. 2024. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. arXiv preprint arXiv:2403.16443
(2024).
[266] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen.
2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing. 2471–2484.
[267] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. 2023.
Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning
Representations.
[268] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang,
Fei Wu, et al. 2023. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
[269] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong
Chen, et al. 2023. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint
arXiv:2309.01219 (2023).
[270] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program
Improvement. arXiv preprint arXiv:2404.05427 (2024).
[271] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023. Unifying the
perspectives of nlp and software engineering: A survey on language models for code. arXiv preprint arXiv:2311.07989
(2023).
[272] Liang Zhao, Xiaocheng Feng, Xiachong Feng, Bin Qin, and Ting Liu. 2023. Length Extrapolation of Transformers: A
Survey from the Perspective of Position Encoding. arXiv preprint arXiv:2312.17044 (2023).
[273] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie
Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[274] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li,
Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural
Information Processing Systems 36 (2024).
[275] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li,
et al. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684.
[276] Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. 2024.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv preprint arXiv:2402.14658
(2024).
[277] Wenqing Zheng, SP Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, and Zhangyang Wang. 2023.
Outline, then details: Syntactically guided coarse-to-fine code generation. In International Conference on Machine
Learning. PMLR, 42403–42419.
[278] Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023. A survey of
large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023).
[279] Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Language Model Debugger via Verifying Runtime
Execution Step-by-step. arXiv preprint arXiv:2402.16906 (2024).
[280] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree
search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406 (2023).
[281] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili
Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems 36 (2024).
[282] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui,
Olivier Bousquet, Quoc V Le, et al. 2022. Least-to-Most Prompting Enables Complex Reasoning in Large Language
Models. In The Eleventh International Conference on Learning Representations.
[283] Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2022. DocPrompting: Generating Code by
Retrieving the Docs. In The Eleventh International Conference on Learning Representations.
[284] Terry Yue Zhuo. 2024. ICE-Score: Instructing Large Language Models to Evaluate Code. In Findings of the Association
for Computational Linguistics: EACL 2024. 2232–2242.
[285] Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, and Niklas
Muennighoff. 2024. Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models. arXiv preprint
arXiv:2401.00788 (2024).