
A Survey of Large Language Models


Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored how to master language intelligence
by machines. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a
significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major
approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving
from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP)
tasks. Since researchers found that model scaling can lead to improved model capacity, they have further investigated the scaling
effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these
enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-
context learning) that are not present in small-scale language models (e.g., BERT). To distinguish language models at different
parameter scales, the research community has coined the term large language models (LLM) for PLMs of significant size (e.g.,
containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia
and industry, and a remarkable progress is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has
attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI
community, which would revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this
survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular,
we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we
also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides
an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

1 INTRODUCTION
“The limits of my language mean the limits of my world.”
—Ludwig Wittgenstein

LANGUAGE is a prominent ability in human beings to express and communicate, which develops in early childhood and evolves over a lifetime [3, 4]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal, to enable machines to read, write, and communicate like humans [5].

Technically, language modeling (LM) is one of the major approaches to advancing language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, which can be divided into four major development stages:

• Statistical language models (SLM). SLMs [6–9] are developed based on statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [10, 11] and natural language processing (NLP) [12–14]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation [15] and Good–Turing estimation [16] have been introduced to alleviate the data sparsity problem.

• Neural language models (NLM). NLMs [1, 17, 18] characterize the probability of word sequences by neural networks, e.g., multi-layer perceptron (MLP) and recurrent neural networks (RNNs). As a remarkable contribution, the work in [1] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors).

• Version: v13 (major update on November 23, 2023).
• GitHub link: https://github.com/RUCAIBox/LLMSurvey
• Chinese version link: https://github.com/RUCAIBox/LLMSurvey/blob/main/assets/LLM Survey Chinese.pdf
• * K. Zhou and J. Li contribute equally to this work.
• The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: batmanfly@gmail.com
• The authors of this survey paper reserve all the copyrights of the figures/tables, and any use of these materials for publication purpose must be officially granted by the survey authors.


 

[Figure 1: two line charts of the cumulative number of arXiv papers over time; landmark models such as T5, BERT, GPT-3, Codex, InstructGPT, ChatGPT, GPT-4, and LLaMA are labeled on the curves.]
(a) Query=”Language Model” (b) Query=”Large Language Model”

Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018)
and “large language model” (since October 2019), respectively. The statistics are calculated using exact match by querying
the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because “language
models” have been explored at an earlier time. We label the points corresponding to important landmarks in the research
progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers
that contain “large language model” in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).

[Figure 2: the four generations of language models arranged by task solving capacity:
• Statistical LM (1990s): n-gram models; probability estimation with statistical methods; assists in specific tasks.
• Neural LM (2013): Word2vec (NPLM), NLPS; static word representations learned via neural context modeling; task-specific feature learner that solves typical NLP tasks.
• Pre-trained LM (2018): ELMO → BERT → GPT-1/2; context-aware representations with the pre-training + fine-tuning paradigm; task-agnostic NLP task solver that solves various NLP tasks.
• LLM (2020): GPT-3/4, ChatGPT, Claude; scaling language models with prompt-based completion; general-purpose, transferable task solver for various real-world tasks.]

Fig. 2: An evolution process of the four generations of language models (LM) from the perspective of task solving capacity.
Note that the time period for each stage may not be very accurate, and we set the time mainly according to the publish
date of the most representative studies at each stage. For neural language models, we abbreviate the paper titles of
two representative studies to name the two approaches: NPLM [1] (“A neural probabilistic language model”) and NLPS [2]
(“Natural language processing (almost) from scratch”). Due to the space limitation, we don’t list all representative studies in
this figure.

By extending the idea of learning effective features for text data, a general neural network approach was developed to build a unified, end-to-end solution for various NLP tasks [2]. Furthermore, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.

• Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Furthermore, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up works, which set the “pre-training and fine-tuning” learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, it often requires fine-tuning the PLM for adapting to different downstream tasks.

• Large language models (LLM). Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks

(i.e., following the scaling law [30]). A number of studies have explored the performance limit by training an ever larger PLM (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community coins the term “large language models (LLM)”1 for these large-sized PLMs [32–35], which attract increasing research attention (see Figure 1). A remarkable application of LLMs is ChatGPT2, which adapts the LLMs from the GPT series for dialogue and presents an amazing conversation ability with humans. In Figure 1, we can observe a sharp increase of the arXiv papers related to LLMs after the release of ChatGPT.

As discussed before, language model is not a new technical concept specially for LLMs, but has evolved with the advance of artificial intelligence over the decades. Early language models mainly aim to model and generate text data, while latest language models (e.g., GPT-4) focus on complex task solving. From language modeling to task solving, it is an important leap in scientific thinking, which is the key to understanding the development of language models in the research history. From the perspective of task solving, the four generations of language models have exhibited different levels of model capacity. In Figure 2, we describe the evolution process of language models in terms of task solving capacity. At first, statistical language models mainly assisted in some specific tasks (e.g., retrieval or speech tasks), in which the predicted or estimated probabilities can enhance the performance of task-specific approaches. Subsequently, neural language models focused on learning task-agnostic representations (e.g., features), aiming to reduce the effort of human feature engineering. Furthermore, pre-trained language models learned context-aware representations that can be optimized according to downstream tasks. For the latest generation of language models, LLMs are enhanced by exploring the scaling effect on model capacity, and can be considered as general-purpose task solvers. To summarize, in the evolution process, the task scope that can be solved by language models has been greatly extended, and the task performance attained by language models has been significantly enhanced.

In the existing literature, PLMs have been widely discussed and surveyed [36–39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.

Nowadays, LLMs are exerting a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing3 presents an initial attempt that enhances the search results based on LLMs. In the field of CV, researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 [46] has supported multimodal input by integrating visual information. This new wave of technology would potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs instead of smaller PLMs. As a more general issue, there lacks a deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the “secrets” of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand for computational resources, it is very costly to carry out repetitive, ablating studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite the capacities, LLMs are also likely to produce toxic, fictitious, or harmful content. It requires effective and efficient control approaches to eliminate the potential risks of using LLMs [46].

1. Note that a LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://openai.com/blog/chatgpt/
3. https://www.bing.com/new

Faced with both opportunities and challenges, the research and development of LLMs needs more attention. In order to provide a basic understanding of LLMs, this survey conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks) and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evolution of GPT-series models, followed by the summarization of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Section 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW

In this section, we present an overview of the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters4, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques.

Formulation of Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models. However, LLMs significantly extend the model size, data size, and total compute (by orders of magnitude). Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].

• KM scaling law5. In 2020, Kaplan et al. [30] (the OpenAI team) firstly proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models. Given a compute budget c, they empirically presented three basic formulas for the scaling law6:

L(N) = (Nc / N)^{αN},  αN ∼ 0.076,  Nc ∼ 8.8 × 10^13,        (1)
L(D) = (Dc / D)^{αD},  αD ∼ 0.095,  Dc ∼ 5.4 × 10^13,
L(C) = (Cc / C)^{αC},  αC ∼ 0.050,  Cc ∼ 3.1 × 10^8,

where L(·) denotes the cross entropy loss in nats. A follow-up study [58] from OpenAI has shown that the language modeling loss can be decomposed into two parts, namely irreducible loss (the entropy of the true data distribution) and reducible loss (an estimate of the KL divergence between the true and model distributions). The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768M to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors). They showed that the model performance has a strong dependence relation on the three factors.

• Chinchilla scaling law. As another representative study, Hoffmann et al. [34] (the Google DeepMind team) proposed an alternative form of scaling law to instruct the compute-optimal training for LLMs. They conducted rigorous experiments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law yet with different coefficients as below [34]:

L(N, D) = E + A / N^α + B / D^β,        (2)

where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the loss L(N, D) under the constraint C ≈ 6ND, they showed that the optimal allocation of compute budget to model size and data size can be derived as follows:

Nopt(C) = G · (C/6)^a,  Dopt(C) = G^{-1} · (C/6)^b,        (3)

where a = β/(α+β), b = α/(α+β), and G is a scaling coefficient that can be computed from A, B, α and β.

4. In existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs, and mainly focus on discussing language models with a model size larger than 10B.
5. Since there was not a model trained following this law in the original paper, we took the last names of the two co-first authors to name this scaling law.
6. Here, Nc, Dc and Cc are measured in the number of non-embedding parameters, the number of training tokens and the number of PF-days, respectively. According to the original paper [30], Cc and C should be denoted by Cc^min and C^min, corresponding to the optimal use of compute. We use the simplified notations for ease of discussion.

As analyzed in [34], given an increase in compute budget, the KM scaling law favors a larger budget allocation in model size than in data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation (3).
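To make the two formulations concrete, the following minimal Python sketch (illustrative only, not taken from this survey) evaluates the KM power law and the Chinchilla compute-optimal allocation using the fitted constants quoted above; the closed form of G, namely G = (αA/(βB))^{1/(α+β)}, follows Hoffmann et al. [34].

```python
# Illustrative sketch only: evaluating the scaling laws quoted above.
# Constants are the fitted values reported in [30] and [34]; the formula for G
# follows Hoffmann et al. [34].

def km_loss_model_size(N, N_c=8.8e13, alpha_N=0.076):
    """KM scaling law, Eq. (1): loss as a function of non-embedding parameters N."""
    return (N_c / N) ** alpha_N

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla parametric loss, Eq. (2)."""
    return E + A / N ** alpha + B / D ** beta

def chinchilla_optimal_allocation(C, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Compute-optimal (N_opt, D_opt) for a FLOP budget C, assuming C ~ 6*N*D (Eq. (3))."""
    a = beta / (alpha + beta)          # exponent for model size
    b = alpha / (alpha + beta)         # exponent for data size
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return G * (C / 6) ** a, (1.0 / G) * (C / 6) ** b

if __name__ == "__main__":
    N_opt, D_opt = chinchilla_optimal_allocation(1e23)   # a hypothetical 1e23 FLOP budget
    print(f"N_opt ~ {N_opt:.2e} parameters, D_opt ~ {D_opt:.2e} tokens")
    print(f"predicted loss ~ {chinchilla_loss(N_opt, D_opt):.3f} nats")
```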
Discussion on Scaling Laws. After introducing the formulations, we continue to discuss scaling laws in the following two aspects, to enhance their understanding:

• Predictable scaling. In practice, scaling laws can be used to instruct the training of LLMs, and it has been proven feasible to reliably estimate the performance of larger models based on that of smaller models, called predictable scaling [46]. The benefits of predictable scaling for training LLMs are mainly twofold. Firstly, for large models, it is infeasible to rigorously examine various training tricks or variants, and it would be very helpful if experiences gained from small models could also apply to large models. For instance, small proxy models can be trained to find the optimal schedule of the data mixture for large models [59]. Secondly, the training of large-scale models takes a long time, often suffering from issues such as training loss spikes, and scaling laws can be employed to monitor the training status of LLMs, e.g., identifying abnormal performance at an early time. Although scaling laws characterize a smooth trend of performance increase (or loss decrease), they also indicate that diminishing returns7 might occur as models scale up. An empirical study [58] from the OpenAI team has shown that representation quality or semantic content can still effectively improve even when approaching the point of diminishing returns (i.e., approaching the irreducible loss) [58]. This finding suggests that training large models is promising for improving the performance of downstream tasks. To further explore the scaling effect, a potential issue is that the amount of available data for training LLMs is actually limited. With the ever-increasing model scale, the public text data would soon be “exhausted” for LLMs [60]. Thus, it will be meaningful to study how scaling laws apply to a data-constrained regime [61], where data repetition or augmentation might be useful to alleviate data scarcity.

• Task-level predictability. Existing research on scaling laws is mostly conducted in terms of language modeling loss (e.g., per-token cross-entropy loss in nats [30]), while in practice we are more concerned about the performance of LLMs on actual tasks. Thus, a basic problem is how the decrease of language modeling loss translates into the improvement of task performance [58]. Intuitively, a model with a smaller language modeling loss tends to yield better performance on downstream tasks, since language modeling loss can be considered as a general measure of the overall model capacity. GPT-4 [46] has reported that some capabilities (e.g., coding ability) can be accurately predicted via scaling laws. Despite that, readers should be aware that a direct decrease in language modeling loss does not always indicate an improvement of model performance on downstream tasks. Specially, the phenomenon of inverse scaling would occur for some tasks, where task performance surprisingly becomes worse as the language modeling loss decreases [62]. Overall, it is more difficult to explore and characterize task-level scaling laws, since they might also depend on task-related information (task metric, task difficulty, etc.). Furthermore, some capacities (e.g., in-context learning [55]) are unpredictable according to the scaling law, and can be observed only when the model size exceeds a certain level (as discussed below).

Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs. The literature further introduces a notable characteristic of when emergent abilities occur [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 63]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 64], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such abilities8.

• In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of input text, without requiring additional training or gradient update9. Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on arithmetic tasks (e.g., 3-digit addition and subtraction) for the 13B GPT-3, but even the 175B GPT-3 cannot work well on the Persian QA task [31].

• Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in [67], instruction-tuned LaMDA-PT [68] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [69] found that a model size of at least 62B is required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).

7. https://en.wikipedia.org/wiki/Diminishing_returns
8. It is difficult to accurately examine the critical size for emergent abilities of LLMs (i.e., the minimum size to possess an ability), since it might vary for different models or tasks. Also, existing studies often test emergent abilities on very limited model sizes for a specific LLM. For example, PaLM is often tested with three sizes of 8B, 62B and 540B. It is unclear about the model performance of the untested sizes.
9. In a recent study [65], it also shows that in-context learning implicitly performs meta-optimization through the attention mechanism.

• Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks by utilizing the prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over standard prompting becomes more evident when the model size exceeds 100B. Furthermore, the performance improvement with CoT prompting also seems to vary across tasks, e.g., GSM8K > MAWPS > SWAMP for PaLM [33]. An illustrative prompt-level sketch of ICL and CoT prompting is given below.
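The following minimal Python sketch (an illustration, not from the survey; the demonstration examples and prompt wording are invented) shows that a few-shot ICL prompt and a few-shot CoT prompt differ only in whether the demonstrations include intermediate reasoning steps.

```python
# Illustrative sketch: building few-shot prompts for in-context learning (ICL)
# and chain-of-thought (CoT) prompting. The demonstrations below are invented
# examples; in practice they would be drawn from the target task.

icl_demos = [
    ("Q: 23 + 48 = ?", "A: 71"),
    ("Q: 15 + 37 = ?", "A: 52"),
]

cot_demos = [
    ("Q: 23 + 48 = ?",
     "A: 23 + 48 = 23 + 40 + 8 = 63 + 8 = 71. The answer is 71."),
    ("Q: 15 + 37 = ?",
     "A: 15 + 37 = 15 + 30 + 7 = 45 + 7 = 52. The answer is 52."),
]

def build_prompt(demos, test_question, instruction="Answer the arithmetic question."):
    """Concatenate an instruction, demonstrations, and the test question into one
    text sequence; the model 'solves' the task by completing the sequence,
    with no gradient update."""
    lines = [instruction]
    for q, a in demos:
        lines += [q, a]
    lines += [test_question, "A:"]
    return "\n".join(lines)

test_q = "Q: 64 + 29 = ?"
print(build_prompt(icl_demos, test_q))   # standard few-shot prompt
print(build_prompt(cot_demos, test_q))   # few-shot CoT prompt: demos contain reasoning steps
```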
How Emergent Abilities Relate to Scaling Laws. In the existing literature [30, 31, 34], scaling laws and emergent abilities provide two perspectives to understand the advantage of large models over small models. In general, scaling laws (often measured by language modeling loss) describe a predictable performance relation with the potential effect of diminishing returns, while emergent abilities (often measured by task performance) are unpredictable but very profitable once such abilities actually emerge. Since the two perspectives reflect different performance trends (continuous improvement vs. sharp performance leap), they might lead to misaligned findings or observations. There are also extensive debates on the rationality of emergent abilities. A popular speculation is that emergent abilities might be partially attributed to the evaluation setting for special tasks (e.g., the discontinuous evaluation metrics) [70, 71]: when evaluation metrics are altered accordingly, the sharpness of the emergent ability curve would disappear. However, the performance of LLMs on most tasks is perceived by users naturally in a discontinuous way. For instance, end users prefer reliable code generated by LLMs that can successfully pass the test case, but are less interested in selecting the better of two failed programs with fewer errors. More recently, a study [72] proposes a new evaluation setting that can enlarge the resolution of task metrics, making task performance more predictable. Despite these efforts, more fundamental research (e.g., grokking10) about the working mechanism of LLMs is still needed to understand the emergence of certain abilities. The subtle relation between scaling laws and emergent abilities can be explained by analogy with the ability acquisition of humans11. Take the speaking ability as an example. For children, language development (especially for infants) can also be considered as a multi-level process where “emergent abilities” occur. Specially, the language ability would remain relatively stable within a time interval, and a qualitative change only occurs when evolving into another ability level (e.g., from speaking simple words to speaking simple sentences). Such a learning process is essentially not smooth and stable (i.e., language ability does not develop at a constant rate over time), though a child actually grows every day. It is interesting that young parents would often be surprised by the unexpected progress of the speaking ability exhibited by their babies.

10. Grokking refers to “a pattern in the data, improving generalization performance from random chance level to perfect generalization”, quoted from the original paper [73].
11. This explanation is only for ease of understanding, and there is no direct evidence to connect the two points.

Key Techniques for LLMs. It has been a long way for LLMs to evolve into the current state: general and capable learners. In the development process, a number of important techniques have been proposed, which largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows.

• Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. Since compute budget is usually limited, scaling laws can be further employed to conduct a more compute-efficient allocation of the compute resources. For example, Chinchilla (with more training tokens) outperforms its counterpart model Gopher (with a larger model size) by increasing the data scale with the same compute budget [34]. In addition, data scaling should be accompanied by a careful cleaning process, since the quality of pre-training data plays a key role in the model capacity.

• Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [74] and Megatron-LM [75–77]. Also, optimization tricks are important for training stability and model performance, e.g., restarting to overcome training loss spikes [56] and mixed precision training [78]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models from much smaller models.

• Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful to solve complex reasoning tasks by including intermediate reasoning steps. Furthermore, we can perform instruction tuning on LLMs with task descriptions expressed in natural language, for improving the generalizability of LLMs on unseen tasks. These eliciting techniques mainly correspond to the emergent abilities of LLMs, which may not show the same effect on small language models.

• Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless.

For this purpose, InstructGPT [66] designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [66, 79]. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a similar technique to InstructGPT, and shows a strong alignment capacity in producing high-quality, harmless responses, e.g., refusing to answer insulting questions.

• Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on tasks that are not best expressed in the form of text (e.g., numerical computation). In addition, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [80, 81]. For example, LLMs can utilize a calculator for accurate computation [80] and employ search engines to retrieve unknown information [81]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps)12, which are by analogy the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs. A minimal sketch of this tool-calling pattern is given below.
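As an illustration of the tool-manipulation idea (a hypothetical sketch, not an interface described in the survey), a calculator can be exposed to the model through a simple text convention: the model is prompted to emit a call such as Calculator[23*47], the application executes it, and the result is appended to the context before generation continues. The name call_llm below is a placeholder for any text-completion interface.

```python
# Hypothetical sketch of tool use: the model emits "Calculator[<expression>]",
# the application evaluates the expression and feeds the result back as text.
import re

def safe_calculator(expression: str) -> str:
    # Restrict evaluation to digits and basic arithmetic operators.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        return "error: unsupported expression"
    return str(eval(expression, {"__builtins__": {}}, {}))

def answer_with_calculator(question: str, call_llm) -> str:
    prompt = (
        "You may use a calculator by writing Calculator[expression].\n"
        f"Question: {question}\nAnswer:"
    )
    draft = call_llm(prompt)                      # e.g., "Calculator[23*47]"
    match = re.search(r"Calculator\[(.+?)\]", draft)
    if match:
        result = safe_calculator(match.group(1))  # run the tool outside the model
        draft = call_llm(prompt + draft + f"\nCalculator result: {result}\n")
    return draft
```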
In addition, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. Currently, we limit our discussion to the major technical approaches and key findings for developing LLMs.

2.2 Technical Evolution of GPT-series Models

Due to its excellent capacity in communicating with humans, ChatGPT has ignited the excitement of the AI community since its release. ChatGPT is developed based on the powerful GPT model with specially optimized conversation capacities. Considering the ever-growing interest in ChatGPT and GPT models, we add a special discussion about the technical evolution of the GPT-series models, to briefly summarize how they have been developed in the past years. Meanwhile, we draw a schematic diagram depicting the technological evolution of the GPT-series models in Figure 4. The basic principle underlying GPT models is to compress the world knowledge into the decoder-only Transformer model by language modeling, such that it can recover (or memorize) the semantics of world knowledge and serve as a general-purpose task solver. Two key points to the success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models. Overall, the research of OpenAI on LLMs can be roughly divided into the following stages13.

Early Explorations. According to one interview with Ilya Sutskever14 (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN) [121]. With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 [122] and GPT-2 [26], which can be considered the foundation of the more powerful models that followed, i.e., GPT-3 and GPT-4.

• GPT-1. In 2017, the Transformer model [22] was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture. They released the first GPT model in 2018, i.e., GPT-1 [122], and coined the abbreviation GPT as the model name, standing for Generative Pre-Training. GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pretraining and supervised fine-tuning. GPT-1 has set up the core architecture for the GPT-series models and established the underlying principle to model natural language text, i.e., predicting the next word.

• GPT-2. Following a similar architecture to GPT-1, GPT-2 [26] increased the parameter scale to 1.5B and was trained with a large webpage dataset, WebText. As claimed in the paper of GPT-2, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data. To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input, task) (similar approaches have been adopted in [123]), which predicts the output conditioned on the input and task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information (see the sketch below). In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text. Further, they introduced a more formal claim for this idea: “Since the (task-specific) supervised objective is the same as the unsupervised (language modeling) objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (for various tasks)” [26]15. A basic understanding of this claim is that each (NLP) task can be considered as a word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable of solving various tasks, if it were trained to have sufficient capacity to recover the world text. These early discussions in GPT-2’s paper were echoed in the interview of Ilya Sutskever by Jensen Huang: “What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world...the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process...”16.
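To make the p(output|input, task) formulation concrete, the short sketch below (an illustration, not code from [26]; the example phrasings are invented) serializes the task description, input, and output into a single text sequence, so that answering any of the tasks reduces to next-word prediction.

```python
# Illustrative sketch: casting (task, input, output) triples as plain text so that
# a language model solves every task by predicting the next words of the sequence.

def to_text(task: str, input_text: str, output_text: str = "") -> str:
    """Serialize a task instance; leaving output_text empty yields an inference prompt."""
    return f"{task}: {input_text} => {output_text}".rstrip()

training_examples = [
    to_text("translate English to French", "How are you?", "Comment allez-vous ?"),
    to_text("summarize", "The cat sat on the mat all afternoon.", "A cat rested on a mat."),
    to_text("answer the question", "What is the capital of France?", "Paris"),
]

# At inference time, the model conditions on the task and input and completes the output:
prompt = to_text("translate English to French", "Good morning")
print(prompt)   # "translate English to French: Good morning =>"
```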

Capacity Leap. Although GPT-2 is intended to be an “unsupervised multitask learner”, its overall performance is inferior to supervised fine-tuned state-of-the-art methods. Because it has a relatively small model size, it has been widely fine-tuned in downstream tasks, especially dialog tasks [124, 125].

12. https://openai.com/blog/chatgpt-plugins
13. Note that the discussion of this part can be somewhat subjective. The overall viewpoints and summaries are made based on the understanding of the survey authors by reading the papers, blog articles, interview reports and APIs released by OpenAI.
14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai
15. To better understand this sentence, we put some explanation words in parentheses.
16. https://lifearchitect.ai/ilya/

TABLE 1: Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the
capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs.
In this table, we only include LLMs with a public paper about the technical details. Here, “Release Time” indicates the
date when the corresponding paper was officially released. “Publicly Available” means that the model checkpoints can be
publicly accessible while “Closed Source” means the opposite. “Adaptation” indicates whether the model has been with
subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback.
“Evaluation” indicates whether the model has been evaluated with corresponding abilities in their original paper: ICL
denotes in-context learning and CoT denotes chain-of-thought. “*” denotes the largest publicly available version.

Columns (in row order): Model | Release Time | Size (B) | Base Model | IT | RLHF | Pre-train Data Scale | Latest Data Timestamp | Hardware (GPUs / TPUs) | Training Time | ICL | CoT

Publicly Available models:
T5 [82] Oct-2019 11 - - - 1T tokens Apr-2019 1024 TPU v3 - ✓ -
mT5 [83] Oct-2020 13 - - - 1T tokens - - - ✓ -
PanGu-α [84] Apr-2021 13* - - - 1.1TB - 2048 Ascend 910 - ✓ -
CPM-2 [85] Jun-2021 198 - - - 2.6TB - - - - -
T0 [28] Oct-2021 11 T5 ✓ - - - 512 TPU v3 27 h ✓ -
CodeGen [86] Mar-2022 16 - - - 577B tokens - - - ✓ -
GPT-NeoX-20B [87] Apr-2022 20 - - - 825GB - 96 40G A100 - ✓ -
Tk-Instruct [88] Apr-2022 11 T5 ✓ - - - 256 TPU v3 4h ✓ -
UL2 [89] May-2022 20 - - - 1T tokens Apr-2019 512 TPU v4 - ✓ ✓
OPT [90] May-2022 175 - - - 180B tokens - 992 80G A100 - ✓ -
NLLB [91] Jul-2022 54.5 - - - - - - - ✓ -
CodeGeeX [92] Sep-2022 13 - - - 850B tokens - 1536 Ascend 910 60 d ✓ -
GLM [93] Oct-2022 130 - - - 400B tokens - 768 40G A100 60 d ✓ -
Flan-T5 [69] Oct-2022 11 T5 ✓ - - - - - ✓ ✓
BLOOM [78] Nov-2022 176 - - - 366B tokens - 384 80G A100 105 d ✓ -
mT0 [94] Nov-2022 13 mT5 ✓ - - - - - ✓ -
Galactica [35] Nov-2022 120 - - - 106B tokens - - - ✓ ✓
BLOOMZ [94] Nov-2022 176 BLOOM ✓ - - - - - ✓ -
OPT-IML [95] Dec-2022 175 OPT ✓ - - - 128 40G A100 - ✓ ✓
LLaMA [57] Feb-2023 65 - - - 1.4T tokens - 2048 80G A100 21 d ✓ -
Pythia [96] Apr-2023 12 - - - 300B tokens - 256 40G A100 - ✓ -
CodeGen2 [97] May-2023 16 - - - 400B tokens - - - ✓ -
StarCoder [98] May-2023 15.5 - - - 1T tokens - 512 40G A100 - ✓ ✓
LLaMA2 [99] Jul-2023 70 - ✓ ✓ 2T tokens - 2000 80G A100 - ✓ -
Baichuan2 [100] Sep-2023 13 - ✓ ✓ 2.6T tokens - 1024 A800 - ✓ -
QWEN [101] Sep-2023 14 - ✓ ✓ 3T tokens - - - ✓ -
FLM [102] Sep-2023 101 - ✓ - 311B tokens - 192 A800 22 d ✓ -
Skywork [103] Oct-2023 13 - - - 3.2T tokens - 512 80G A800 - ✓ -

Closed Source models:
GPT-3 [55] May-2020 175 - - - 300B tokens - - - ✓ -


GShard [104] Jun-2020 600 - - - 1T tokens - 2048 TPU v3 4d - -
Codex [105] Jul-2021 12 GPT-3 - - 100B tokens May-2020 - - ✓ -
ERNIE 3.0 [106] Jul-2021 10 - - - 375B tokens - 384 V100 - ✓ -
Jurassic-1 [107] Aug-2021 178 - - - 300B tokens - 800 GPU - ✓ -
HyperCLOVA [108] Sep-2021 82 - - - 300B tokens - 1024 A100 13.4 d ✓ -
FLAN [67] Sep-2021 137 LaMDA-PT ✓ - - - 128 TPU v3 60 h ✓ -
Yuan 1.0 [109] Oct-2021 245 - - - 180B tokens - 2128 GPU - ✓ -
Anthropic [110] Dec-2021 52 - - - 400B tokens - - - ✓ -
WebGPT [81] Dec-2021 175 GPT-3 - ✓ - - - - ✓ -
Gopher [64] Dec-2021 280 - - - 300B tokens - 4096 TPU v3 920 h ✓ -
ERNIE 3.0 Titan [111] Dec-2021 260 - - - - - - - ✓ -
GLaM [112] Dec-2021 1200 - - - 280B tokens - 1024 TPU v4 574 h ✓ -
LaMDA [68] Jan-2022 137 - - - 768B tokens - 1024 TPU v3 57.7 d - -
MT-NLG [113] Jan-2022 530 - - - 270B tokens - 4480 80G A100 - ✓ -
AlphaCode [114] Feb-2022 41 - - - 967B tokens Jul-2021 - - - -
InstructGPT [66] Mar-2022 175 GPT-3 ✓ ✓ - - - - ✓ -
Chinchilla [34] Mar-2022 70 - - - 1.4T tokens - - - ✓ -
PaLM [56] Apr-2022 540 - - - 780B tokens - 6144 TPU v4 - ✓ ✓
AlexaTM [115] Aug-2022 20 - - - 1.3T tokens - 128 A100 120 d ✓ ✓
Sparrow [116] Sep-2022 70 - - ✓ - - 64 TPU v3 - ✓ -
WeLM [117] Sep-2022 10 - - - 300B tokens - 128 A100 40G 24 d ✓ -
U-PaLM [118] Oct-2022 540 PaLM - - - - 512 TPU v4 5d ✓ ✓
Flan-PaLM [69] Oct-2022 540 PaLM ✓ - - - 512 TPU v4 37 h ✓ ✓
Flan-U-PaLM [69] Oct-2022 540 U-PaLM ✓ - - - - - ✓ ✓
GPT-4 [46] Mar-2023 - - ✓ ✓ - - - - ✓ ✓
PanGu-Σ [119] Mar-2023 1085 PanGu-α - - 329B tokens - 512 Ascend 910 100 d ✓ -
PaLM2 [120] May-2023 16 - ✓ - 100B tokens - - - ✓ ✓

[Figure 3: a timeline (2019 to late 2023) placing existing LLMs by release date, including T5, GShard, GPT-3, mT5, PanGu-α, CPM-2, Codex, Jurassic-1, FLAN, LaMDA, T0, Anthropic, HyperCLOVA, WebGPT, Gopher, ERNIE 3.0 Titan, GLaM, MT-NLG, InstructGPT, Chinchilla, PaLM, OPT, UL2, GPT-NeoX-20B, BLOOM, GLM, Flan-T5, Galactica, OPT-IML, ChatGPT, LLaMA, GPT-4, Vicuna, PaLM2, LLaMA2, Baichuan2, QWEN, Falcon, Grok-1, and others; models with publicly available checkpoints are marked in yellow.]

Fig. 3: A timeline of existing large language models (having a size larger than 10B) in recent years. The timeline was
established mainly according to the release date (e.g., the submission date to arXiv) of the technical paper for a model. If
there was not a corresponding paper, we set the date of a model as the earliest time of its public release or announcement.
We mark the LLMs with publicly available model checkpoints in yellow color. Due to the space limit of the figure, we only
include the LLMs with publicly reported evaluation results.

[Figure 4: the main GPT line runs GPT-1 (2018.06; decoder-only architecture, generative pre-training) → GPT-2 (2019.02; unsupervised multitask learner, scaling the model size) → GPT-3 (2020.05; in-context learning, exploring scaling limits) → +code → Codex (2021.07; code pre-training) → GPT-3.5 (2022.03) → GPT-4 (2023.03; strong reasoning ability, multimodal ability) → GPT-4 Turbo (2023.09; longer context window) and GPT-4 Turbo with vision (2023.09). A parallel branch runs code-davinci-002 (2022.03; capable code model) → +instruction → text-davinci-002 (2022.03; instruction following) → +RLHF → text-davinci-003 (2022.09; human alignment) → +chat → gpt-3.5-turbo (2023.03; excellent comprehensive ability), leading to ChatGPT.]

Fig. 4: A brief illustration for the technical evolution of GPT-series models. We plot this figure mainly based on the papers,
blog articles and official APIs from OpenAI. Here, solid lines denote that there exists an explicit evidence (e.g., the official
statement that a new model is developed based on a base model) on the evolution path between two models, while dashed
lines denote a relatively weaker evolution relation.

Based on GPT-2, GPT-3 demonstrates a key capacity leap by scaling the (nearly identical) generative pre-training architecture.

• GPT-3. GPT-3 [55] was released in 2020, which scaled the model parameters to an ever larger size of 175B. In the GPT-3 paper, it formally introduced the concept of in-context learning (ICL)17, which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text. With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can also be formatted as a text sequence, given the task description and demonstrations. GPT-3 not only demonstrates very excellent performance in a variety of NLP tasks, but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation. Although the GPT-3 paper does not explicitly discuss the emergent abilities of LLMs, we can observe a large performance leap that might transcend the basic scaling law [30], e.g., larger models have significantly stronger ICL ability (illustrated in the original Figure 1.2 of the GPT-3 paper [55]). Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity.

17. GPT-2 essentially used ICL for unsupervised task learning, though it wasn’t called ICL at that time.

Capacity Enhancement. Due to the strong capacities, GPT-3 has been the base model to develop even more capable

LLMs for OpenAI. Overall, OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., training on code data and alignment with human preference, which are detailed as follows.

• Training on code data. A major limitation of the original GPT-3 model (pre-trained on plain text) lies in the lack of reasoning ability on complex tasks, e.g., completing code and solving math problems. To enhance this ability, Codex [105] was introduced by OpenAI in July 2021, which was a GPT model fine-tuned on a large corpus of GitHub code. It demonstrated that Codex can solve very difficult programming problems, and also leads to a significant performance improvement in solving math problems [126]. Further, a contrastive approach [127] to training text and code embeddings was reported in January 2022, which was shown to improve a series of related tasks (i.e., linear-probe classification, text search and code search). Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice to improve the model capacity of GPT models, especially the reasoning ability. Furthermore, there is also a speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs [47], while it is still worth further investigation with more thorough verification.

• Human alignment. The related research of human alignment can be dated back to the year 2017 (or earlier) for OpenAI: a blog article entitled “learning from human preferences”18 was posted on the OpenAI blog describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans [79] (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 12). Shortly after the release of this RL paper [79], the paper of Proximal Policy Optimization (PPO) [128] was published in July 2017, which has now become the foundational RL algorithm for learning from human preferences [66]. Later in January 2020, GPT-2 was fine-tuned using the aforementioned RL algorithms [79, 128], which leveraged human preferences to improve the capacities of GPT-2 on NLP tasks. In the same year, another work [129] trained a summarization model for optimizing human preferences in a similar way. Based on these prior works, InstructGPT [66] was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm (outlined in the sketch below). Note that the wording “instruction tuning” has seldom been used in OpenAI’s papers and documentation; it is substituted by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm [66]). In addition to improving the instruction following capacity, the RLHF algorithm is particularly useful to mitigate the issue of generating harmful or toxic content for LLMs, which is key to the safe deployment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article [130], which has summarized three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”.

18. https://openai.com/research/learning-from-human-preferences
These enhancement techniques lead to the improved mitigation of risks related to visually augmented inputs.
Specially, GPT-4V exhibited strong vision capacities in var-
18. https://openai.com/research/learning-from-human-preferences ious application scenarios, showing the great potential as
a powerful multimodal learning system. More recently, in November 2023, OpenAI released an upgraded generation of the GPT-4 model at DevDay, named GPT-4 Turbo, with a series of technical improvements. GPT-4 Turbo features improved model capacity (more capable than GPT-4), an extended knowledge source (up to April 2023), a long context window (up to 128k tokens), optimized model performance (cheaper price), and other useful functionality updates (function calling, reproducible outputs, etc.). At the same time, the Assistants API was launched to ease the rapid development of agent-like assistants. With this API, developers can easily create goal-oriented assistants within their applications, by leveraging specific instructions, extra knowledge and tool use. Furthermore, multimodal capacities (see, hear, and speak) were also enhanced in this new release, supported by GPT-4 Turbo with vision, DALL·E 3, text-to-speech (TTS), and voice samples. These improvements have greatly extended the capacity scope and enhanced the task performance of GPT models. More importantly, the application ecosystem will be greatly strengthened with the technology upgrade in improved models, APIs, and functionalities.

Despite the huge progress, there are still limitations with these superior LLMs, e.g., generating hallucinations with factual errors or potentially risky responses within some specific contexts [46]. More limitations or issues of LLMs will be discussed in Section 7. It poses long-standing research challenges to develop more capable, safer LLMs. From the perspective of engineering, OpenAI has adopted an iterative deployment strategy [134] to develop the models and products by following a five-stage development and deployment life-cycle, which aims to effectively reduce the potential risks of using the models. In the following, we will dive into the technical details in order to provide a specific understanding of how they have been developed.

3 RESOURCES OF LLMS

It is by no means an easy job to develop or reproduce LLMs, considering the challenging technical issues and huge demands of computation resources. A feasible way is to learn from the experiences of existing LLMs and reuse publicly available resources for incremental development or experimental study. In this section, we briefly summarize the publicly available resources for developing LLMs, including model checkpoints (or APIs), corpora and libraries.

3.1 Publicly Available Model Checkpoints or APIs

Given the huge cost of model pre-training, well-trained model checkpoints are critical to the study and development of LLMs for the research community. Since the parameter scale is a key factor to consider for using LLMs, we categorize these public models into two scale levels (i.e., tens of billions of parameters and hundreds of billions of parameters), which is useful for users to identify suitable resources according to their resource budget. In addition, for inference, we can directly employ public APIs to perform our tasks, without running the model locally. Next, we introduce the publicly available model checkpoints and APIs.

Models with Tens of Billions of Parameters. Most of the models in this category have a parameter scale ranging from 10B to 20B, except LLaMA [57] (up to 65B parameters), LLaMA2 [99] (up to 70B parameters in the largest version), NLLB [91] (containing 54.5B parameters in the largest version), and Falcon [135] (containing 40B parameters in the largest version). Other models within this range include mT5 [83], PanGu-α [84], T0 [28], GPT-NeoX-20B [87], CodeGen [86], UL2 [89], Flan-T5 [69], and mT0 [94]. Among them, Flan-T5 (11B version) can serve as a premier model for research on instruction tuning, since it explores instruction tuning from three aspects [69]: increasing the number of tasks, scaling the model size, and fine-tuning with chain-of-thought prompting data. Besides, CodeGen (11B version), as an autoregressive language model designed for generating code, can be considered a good candidate for exploring code generation ability. It also introduces a new benchmark, MTPB [86], specially for multi-turn program synthesis, which is composed of 115 expert-generated problems. To solve these problems, LLMs need to acquire sufficient programming knowledge (e.g., math, array operations, and algorithms). More recently, CodeGen2 [97] has been released to explore the impact of choices in model architecture, learning algorithms, and data distributions on the model. As another LLM specialized in coding abilities, StarCoder [98] has also achieved excellent results. As for multilingual tasks, mT0 (13B version) might be a good candidate model, which has been fine-tuned on multilingual tasks with multilingual prompts. Furthermore, PanGu-α [84] shows good performance in Chinese downstream tasks in zero-shot or few-shot settings; it is developed based on the deep learning framework MindSpore [136]. Note that PanGu-α [84] holds multiple versions of models (up to 200B parameters), while the largest public version has 13B parameters. As a popular LLM, LLaMA (65B version) [57], which contains approximately five times as many parameters as other models, has exhibited superior performance in tasks related to instruction following. Compared to LLaMA, LLaMA2 [99] has explored reinforcement learning from human feedback (RLHF) more thoroughly and developed a chat-oriented version called LLaMA2-Chat, which generally outperforms existing open-source models across a range of helpfulness and safety benchmarks. Due to its openness and effectiveness, LLaMA has attracted significant attention from the research community, and many efforts [137–140] have been devoted to fine-tuning or continually pre-training its different model versions for implementing new models or tools. More recently, Falcon [135], as another open-source LLM, has also achieved excellent performance on open benchmarks. It is featured by a more careful data cleaning process to prepare the pre-training data (with a publicly shared dataset, RefinedWeb [141]). Typically, pre-training models at this scale requires hundreds or even thousands of GPUs or TPUs. For instance, GPT-NeoX-20B uses 12 supermicro servers, each equipped with 8 NVIDIA A100-SXM4-40GB GPUs, while LLaMA utilizes 2,048 A100-80G GPUs, as reported in their original publications. To accurately estimate the computation resources needed, it is suggested to use metrics measuring the number of involved computations, such as FLOPS (i.e., FLoating point number Operations Per Second) [30].
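For a rough a priori estimate of this cost, a commonly used rule of thumb approximates the training compute of a decoder-only Transformer as roughly 6·N·D floating point operations for N parameters and D training tokens. The sketch below is illustrative only; the peak-throughput and utilization figures are assumptions rather than values reported for the models discussed above.

```python
# Back-of-the-envelope training cost estimate (a sketch, not taken from any surveyed paper).
# Assumes the common approximation C ≈ 6 * N * D FLOPs for decoder-only Transformers,
# where N is the parameter count and D is the number of training tokens.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens

def gpu_days(total_flops: float, peak_flops: float = 312e12, utilization: float = 0.4) -> float:
    """Convert FLOPs to single-GPU days, assuming an A100-class peak of ~312 TFLOPS (BF16)
    and ~40% model FLOPs utilization; both figures are illustrative assumptions."""
    seconds = total_flops / (peak_flops * utilization)
    return seconds / 86400

if __name__ == "__main__":
    # Example: a 13B-parameter model trained on 1.0T tokens (hypothetical configuration).
    c = training_flops(13e9, 1.0e12)
    print(f"~{c:.2e} FLOPs, ~{gpu_days(c):,.0f} GPU-days on one A100-class GPU")
```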
[Figure 5 omitted: the graph organizes LLaMA variants by adaptation method (continued pre-training, instruction tuning with task/chat/synthetic data, parameter-efficient vs. full-parameter fine-tuning, RLHF) and by application area (math, finance, medicine, law, bilingualism, education, and multimodal models).]

Fig. 5: An evolutionary graph of the research work conducted on LLaMA. Due to the huge number, we cannot include all the LLaMA variants in this figure, even much excellent work. To support incremental update, we share the source file of this figure, and welcome the readers to include the desired models by submitting the pull requests on our GitHub page.
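Most of the open checkpoints discussed in this subsection are distributed through public model hubs and can be loaded with the Transformers library introduced in Section 3.4. A minimal sketch follows; the model identifier is an illustrative placeholder rather than a recommendation, and the prompt is arbitrary.

```python
# A minimal sketch of loading a publicly released checkpoint with Hugging Face Transformers.
# The model identifier below is illustrative; substitute the checkpoint you actually intend to use.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"  # illustrative instruction-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Answer the question: what is instruction tuning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```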
Models with Hundreds of Billions of Parameters. For models in this category, only a handful have been publicly released. For example, OPT [90], OPT-IML [95], BLOOM [78], and BLOOMZ [94] have nearly the same number of parameters as GPT-3 (175B version), while GLM [93] and Galactica [35] have 130B and 120B parameters, respectively. Among them, OPT (175B version), with its instruction-tuned version OPT-IML, has been specially motivated for open sharing, which aims to enable researchers to carry out reproducible research at scale. For research on cross-lingual generalization, BLOOM (176B version) and BLOOMZ (176B version) can be used as base models, due to their competence in multilingual language modeling tasks. As a bilingual LLM, GLM has also provided a popular small-sized Chinese chat model, ChatGLM2-6B (an updated version of ChatGLM-6B), which features many improvements in efficiency and capacity (e.g., quantization, 32K-length context, fast inference rate). Models at this scale typically require thousands of GPUs or TPUs to train. For instance, OPT (175B version) used 992 A100-80GB GPUs, while GLM (130B version) used a cluster of 96 NVIDIA DGX-A100 (8x40G) GPU nodes.

LLaMA Model Family. The collection of LLaMA models [57] was introduced by Meta AI in February 2023, consisting of four sizes (7B, 13B, 30B and 65B). Since its release, LLaMA has attracted extensive attention from both the research and industry communities. LLaMA models have achieved excellent performance on various open benchmarks and have become the most popular open language models thus far. A large number of researchers have extended LLaMA models by either instruction tuning or continual pre-training. In particular, instruction tuning LLaMA has become a major approach to developing customized or specialized models, due to its relatively low computational cost. To effectively adapt LLaMA models to non-English languages, it is often necessary to extend the original vocabulary (trained mainly on an English corpus) or fine-tune the model with instructions or data in the target language. Among these extended models, Stanford Alpaca [142] is the first open instruction-following model fine-tuned based on LLaMA (7B). It is trained on 52K instruction-following demonstrations generated via self-instruct [143] using text-davinci-003. The instruction data, named Alpaca-52K, and the training code have been extensively adopted in subsequent work, such as Alpaca-LoRA [144] (a reproduction of Stanford Alpaca using LoRA [145]), Koala [146], and BELLE [147]. In addition, Vicuna [138] is another popular LLaMA variant, trained on user-shared conversations collected from ShareGPT [148]. Due to the excellent performance and availability of the LLaMA model family, many multimodal models incorporate them as the base language models, to achieve strong language understanding and generation abilities. Compared with other variants, Vicuna is the preferred choice in multimodal
language models, which has led to the emergence of a variety of popular models, including LLaVA [149], MiniGPT-4 [150], InstructBLIP [151], and PandaGPT [152]. The release of LLaMA has greatly advanced the research progress of LLMs. To summarize the research work conducted on LLaMA, we present a brief evolutionary graph in Figure 5.

Public API of LLMs. Instead of directly using the model copies, APIs provide a more convenient way for common users to use LLMs, without the need of running the model locally. As a representative interface for using LLMs, the APIs for the GPT-series models [46, 55, 66, 105] have been widely used in both academia and industry19. OpenAI has provided seven major interfaces to the models in the GPT-3 series: ada, babbage, curie, davinci (the most powerful version in the GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001. Among them, the first four interfaces can be further fine-tuned on the host server of OpenAI. In particular, babbage, curie, and davinci correspond to the GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, respectively [55]. In addition, there are also two APIs related to Codex [105], called code-cushman-001 (a powerful and multilingual version of the Codex (12B) [105]) and code-davinci-002. Further, the GPT-3.5 series includes one base model, code-davinci-002, and three enhanced versions, namely text-davinci-002, text-davinci-003, and gpt-3.5-turbo. As more powerful alternatives, in 2023, OpenAI has released the model interfaces for the GPT-4 series, including gpt-4, gpt-4-32k, gpt-4-1106-preview (i.e., GPT-4 Turbo) and gpt-4-vision-preview (i.e., GPT-4 Turbo with vision, a multimodal model). It is worth noting that OpenAI has been maintaining and upgrading these model interfaces (gpt-3.5-turbo, gpt-4, gpt-4-32k), so the API name will actually point to the latest version. Currently, ChatGPT can be powered by either GPT-3.5 or GPT-4 models. Overall, one can select a suitable model interface based on the specific application scenarios and response requirements. The detailed usage can be found on their project websites20.

19. https://platform.openai.com/docs/api-reference/introduction
20. https://platform.openai.com/docs/models/overview

3.2 Commonly Used Corpora for Pre-training

In contrast to earlier PLMs, LLMs, which consist of a significantly larger number of parameters, require a higher volume of training data that covers a broad range of content. For this need, increasingly more accessible training datasets have been released for research. In this section, we will briefly summarize several widely used corpora for training LLMs. Based on their content types, we categorize these corpora into six groups: Books, CommonCrawl, Reddit links, Wikipedia, Code, and others.

TABLE 2: Statistics of commonly-used data sources.

Corpora | Size | Source | Latest Update Time
BookCorpus [153] | 5GB | Books | Dec-2015
Gutenberg [154] | - | Books | Dec-2021
C4 [82] | 800GB | CommonCrawl | Apr-2019
CC-Stories-R [155] | 31GB | CommonCrawl | Sep-2019
CC-NEWS [27] | 78GB | CommonCrawl | Feb-2019
REALNEWs [156] | 120GB | CommonCrawl | Apr-2019
OpenWebText [157] | 38GB | Reddit links | Mar-2023
PushShift.io [158] | 2TB | Reddit links | Mar-2023
Wikipedia [159] | 21GB | Wikipedia | Mar-2023
BigQuery [160] | - | Code | Mar-2023
the Pile [161] | 800GB | Other | Dec-2020
ROOTS [162] | 1.6TB | Other | Jun-2022

Books. BookCorpus [153] is a commonly used dataset in previous small-scale models (e.g., GPT [122] and GPT-2 [26]), consisting of over 11,000 books covering a wide range of topics and genres (e.g., novels and biographies). Another large-scale book corpus is Project Gutenberg [154], consisting of over 70,000 literary books including novels, essays, poetry, drama, history, science, philosophy, and other types of works in the public domain. It is currently one of the largest open-source book collections, and it is used in the training of MT-NLG [113] and LLaMA [57]. As for Books1 [55] and Books2 [55] used in GPT-3 [55], they are much larger than BookCorpus but have not been publicly released so far.

CommonCrawl. CommonCrawl [163] is one of the largest open-source web crawling databases, containing a petabyte-scale data volume, which has been widely used as training data for existing LLMs. As the whole dataset is very large, existing studies mainly extract subsets of web pages from it within a specific period. However, due to the widespread existence of noisy and low-quality information in web data, it is necessary to perform data preprocessing before usage. Based on CommonCrawl, there are four filtered datasets that are commonly used in existing work: C4 [82], CC-Stories [155], CC-News [27], and RealNews [156]. The Colossal Clean Crawled Corpus (C4) includes five variants21, namely en (806G), en.noclean (6T), realnewslike (36G), webtextlike (17G), and multilingual (38T). The en version has been utilized for pre-training T5 [82], LaMDA [68], Gopher [64], and UL2 [89]. The multilingual C4, also called mC4, has been used in mT5 [83]. CC-Stories (31G) is composed of a subset of CommonCrawl data, in which the contents are made in a story-like way. Because the original source of CC-Stories is no longer available, we include a reproduction version, CC-Stories-R [164], in Table 2. Moreover, two news corpora extracted from CommonCrawl, i.e., REALNEWS (120G) and CC-News (76G), are also commonly used as pre-training data.

21. https://www.tensorflow.org/datasets/catalog/c4

Reddit Links. Reddit is a social media platform that enables users to submit links and text posts, which can be voted on by others through "upvotes" or "downvotes". Highly upvoted posts are often considered useful, and can be utilized to create high-quality datasets. WebText [26] is a well-known corpus composed of highly upvoted links from Reddit, but it is not publicly available. As a surrogate, there is a readily accessible open-source alternative called OpenWebText [157]. Another corpus extracted from Reddit is PushShift.io [158], a real-time updated dataset that consists of historical data from Reddit since its creation day. Pushshift provides not only monthly data dumps but also useful utility tools to support users in searching, summarizing, and conducting
preliminary investigations on the entire dataset. This makes it easy for users to collect and process Reddit data.

Wikipedia. Wikipedia [159] is an online encyclopedia containing a large volume of high-quality articles on diverse topics. Most of these articles are composed in an expository style of writing (with supporting references), covering a wide range of languages and fields. Typically, the English-only filtered versions of Wikipedia are widely used in most LLMs (e.g., GPT-3 [55], LaMDA [68], and LLaMA [57]). Wikipedia is available in multiple languages, so it can also be used in multilingual settings.

Code. To collect code data, existing work mainly crawls open-source licensed code from the Internet. Two major sources are public code repositories under open-source licenses (e.g., GitHub) and code-related question-answering platforms (e.g., StackOverflow). Google has publicly released the BigQuery dataset [160], which includes a substantial number of open-source licensed code snippets in various programming languages, serving as a representative code dataset. CodeGen has utilized BIGQUERY [86], a subset of the BigQuery dataset, for training the multilingual version of CodeGen (CodeGen-Multi).

Others. The Pile [161] is a large-scale, diverse, and open-source text dataset consisting of over 800GB of data from multiple sources, including books, websites, code, scientific papers, and social media platforms. It is constructed from 22 diverse high-quality subsets. The Pile dataset is widely used in models with different parameter scales, such as GPT-J (6B) [165], CodeGen (16B) [86], and Megatron-Turing NLG (530B) [113]. ROOTS [162] is composed of various smaller datasets (1.61 TB of text in total) and covers 59 different languages (containing natural languages and programming languages), and has been used for training BLOOM [78].

In practice, pre-training LLMs commonly requires a mixture of different data sources (see Figure 6), instead of a single corpus. Therefore, existing studies commonly mix several ready-made datasets (e.g., C4, OpenWebText, and the Pile), and then perform further processing to obtain the pre-training corpus. Furthermore, to train LLMs that are adapted to specific applications, it is also important to extract data from relevant sources (e.g., Wikipedia and BigQuery) for enriching the corresponding information in the pre-training data. To provide a quick reference to the data sources used in existing LLMs, we present the pre-training corpora of three representative LLMs:

• GPT-3 (175B) [55] was trained on a mixed dataset of 300B tokens, including CommonCrawl [163], WebText2 [55], Books1 [55], Books2 [55], and Wikipedia [159].

• PaLM (540B) [56] uses a pre-training dataset of 780B tokens, which is sourced from social media conversations, filtered webpages, books, Github, multilingual Wikipedia, and news.

• LLaMA [57] extracts training data from various sources, including CommonCrawl, C4 [82], Github, Wikipedia, books, ArXiv, and StackExchange. The training data size for LLaMA (7B) and LLaMA (13B) is 1.0T tokens, while 1.4T tokens are used for LLaMA (30B) and LLaMA (65B).

TABLE 3: A detailed list of available collections for instruction tuning.

Categories | Collections | Time | #Examples
Task | Nat. Inst. [166] | Apr-2021 | 193K
Task | FLAN [67] | Sep-2021 | 4.4M
Task | P3 [167] | Oct-2021 | 12.1M
Task | Super Nat. Inst. [88] | Apr-2022 | 5M
Task | MVPCorpus [168] | Jun-2022 | 41M
Task | xP3 [94] | Nov-2022 | 81M
Task | OIG [169] | Mar-2023 | 43M
Chat | HH-RLHF [170] | Apr-2022 | 160K
Chat | HC3 [171] | Jan-2023 | 87K
Chat | ShareGPT [148] | Mar-2023 | 90K
Chat | Dolly [172] | Apr-2023 | 15K
Chat | OpenAssistant [173] | Apr-2023 | 161K
Synthetic | Self-Instruct [143] | Dec-2022 | 82K
Synthetic | Alpaca [137] | Mar-2023 | 52K
Synthetic | Guanaco [174] | Mar-2023 | 535K
Synthetic | Baize [175] | Apr-2023 | 158K
Synthetic | BELLE [176] | Apr-2023 | 1.5M

TABLE 4: A list of available collections for alignment.

Dataset | Release Time | #Examples
Summarize from Feedback [129] | Sep-2020 | 193K
SHP [177] | Oct-2021 | 385K
WebGPT Comparisons [81] | Dec-2021 | 19K
Stack Exchange Preferences [178] | Dec-2021 | 10M
HH-RLHF [170] | Apr-2022 | 169K
Sandbox Alignment Data [179] | May-2023 | 169K
CValues [180] | Jul-2023 | 145K
PKU-SafeRLHF [181] | Oct-2023 | 330K

3.3 Commonly Used Datasets for Fine-tuning

After pre-training, LLMs require further fine-tuning to enhance the model capacity, which often involves two major steps, namely instruction tuning (supervised fine-tuning) and alignment tuning. In this section, we mainly focus on discussing the related available datasets for the two kinds of tuning approaches, and more algorithm details can be found in Section 5.

3.3.1 Instruction Tuning Datasets

After pre-training, instruction tuning (a.k.a., supervised fine-tuning) is an important method to enhance or unlock specific abilities of LLMs (e.g., instruction following). In this part, we introduce several widely used datasets for instruction tuning, and categorize them into three main types based on the construction method of formatted instruction instances, namely NLP task datasets, daily chat datasets and synthetic datasets. We show their details in Table 3.

NLP Task Datasets. This kind of dataset is formatted based on collected NLP task datasets (e.g., text classification and summarization) with corresponding natural language task descriptions. In this category, P3 [182] and FLAN [67, 183] are two widely used datasets for instruction tuning.

• P3 [182] is composed of 170 English NLP datasets and 2,052 English prompt templates, where the input and output of each data example have been formatted with specific prompt templates for composing the training instance.
• FLAN [67] consists of 62 widely used NLP benchmarks in its original version. Recently, FLAN-v2 [183] has also been proposed, which expands FLAN by mixing additional instruction datasets, including Muffin [67], NIV2 [88], T0-SF [28], and CoT [184–186]. Muffin contains 62 tasks from the original FLAN and an additional 26 tasks, including conversation and code synthesis tasks. T0-SF is extracted from T0 [28] while ensuring no overlap with Muffin. NIV2 refers to the Natural-Instructions v2 dataset [88], and CoT [184–186] is a combination of nine reasoning tasks with corresponding chain-of-thought prompts and outputs.

Daily Chat Datasets. This kind of dataset is constructed from real user conversations, where queries are posed by humans and responses are mainly generated by human labelers or LLMs (e.g., ChatGPT, GPT-4). The conversation types include open-ended generation, question answering, brainstorming, and chatting. In this category, ShareGPT [148], OpenAssistant [173] and Dolly [172] are three commonly used datasets for LLM fine-tuning.

• ShareGPT [148] is collected from a data collection platform where users can upload their conversations with ChatGPT or GPT-4 through the ShareGPT API. Currently, this dataset consists of approximately 90,000 conversations, including real instructions or inquiries from humans and responses from ChatGPT.

• OpenAssistant [173] is a multilingual corpus containing 66,497 real-world conversation trees between humans and an AI assistant. Each conversation tree consists of multiple nodes, and each node represents the information generated by one role in the dialogue. It spans 35 languages and includes 461,292 manually annotated quality ratings of responses.

• Dolly [172] is an English dataset comprising 15,000 human-generated data instances (prompt-response pairs) from Databricks. This dataset covers the seven domains outlined in InstructGPT [66], including brainstorming, classification, closed-book question answering, generation, information extraction, open-book question answering, and summarization.

Synthetic Datasets. This kind of dataset is typically constructed by instructing LLMs, based on pre-defined guidance rules or methods. In this category, Self-Instruct-52K [143], Alpaca [142] and Baize [175] are three commonly used synthetic datasets for LLMs.

• Self-Instruct-52K [143] is an instruction dataset generated through the self-instruct [143] method, consisting of 82,000 instances with 52,000 instructions. Concretely, the authors construct 175 seed instances, and then iteratively prompt the LLM [55] to synthesize additional instructions, using 8 randomly selected instructions as references each time. Subsequently, the LLM is further instructed to generate instance inputs and their corresponding outputs based on the synthetic instructions, finally yielding the Self-Instruct-52K dataset.

• Alpaca [142] is also a synthetic dataset based on the self-instruct [143] method. It utilizes the text-davinci-003 model on the 175 seed instances from Self-Instruct-52K to obtain 52,000 new instructions and corresponding inputs and outputs. Moreover, 60% of the examples are pure instructions without the input part in the final dataset.

• Baize [175] is an English multi-turn conversation corpus constructed using ChatGPT, comprising 111.5K instances. To create Baize, a method called "self-chat" [175] is proposed, where ChatGPT takes on the roles of both the user and the AI assistant in turns, generating information in a conversational format.

3.3.2 Alignment Datasets

Apart from instruction tuning, it is important to construct high-quality datasets for aligning LLMs with human values and preferences (e.g., helpfulness, honesty, and harmlessness). In this section, we introduce several widely used datasets for alignment tuning, including HH-RLHF [170], SHP [177], PKU-SafeRLHF [181], Stack Exchange Preferences [178] and Sandbox Alignment Data [179]. We show their details in Table 4.

• HH-RLHF [170] consists of around 169K instances, and can be divided into two parts that focus on the helpfulness and harmlessness of LLMs, respectively. Each instance is an open-ended conversation between a crowdworker and a chat model, about seeking assistance, advice, or task completion. The chat model provides two responses to each user query, and the more helpful or the more harmful response is chosen as the annotation.

• SHP [177] focuses on the helpfulness of responses. It comprises 385K collective human preferences over responses to questions/instructions across 18 diverse subject areas, spanning topics from cooking to legal advice. Each instance is a Reddit post containing a question or instruction and a pair of top-level comments, one of which is deemed more preferable by Reddit users and the other deemed less helpful. Different from HH-RLHF [170], the data in SHP consists of naturally occurring, human-written responses.

• PKU-SafeRLHF [181] encompasses more than 330K instances of expert comparison data, concentrating on helpfulness and harmlessness. Each instance in the dataset includes a question and two responses, accompanied by safety labels for each response and two preference annotations between the two responses according to helpfulness and harmlessness. The harmlessness of a response indicates its classification as risk-neutral across all 14 harm categories, while the helpfulness of a response is evaluated based on its effectiveness in addressing the question.

• Stack Exchange Preferences [178] focuses on the helpfulness of answers. It comprises about 10M questions and answers from Stack Overflow. Each instance consists of a question and more than two corresponding answers. Each answer is annotated with a score calculated based on its votes and a label denoting whether it is selected.

• Sandbox Alignment Data [179] is an alignment dataset containing feedback from LLMs rather than humans. It comes from a virtual interaction environment called SANDBOX, where the model simulates social interactions with other models and revises responses according to the feedback from other models. The dataset contains 169K instances, and each instance consists of a societal query, several responses, and corresponding ratings from other models.
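To make the structure of such preference data concrete, the sketch below shows the kind of record these collections typically reduce to when used for reward model training, together with the standard pairwise objective. The field names and values are illustrative and do not reproduce the exact schema of any specific dataset listed in Table 4.

```python
# Illustrative structure of a single preference instance used for alignment tuning.
# Field names are hypothetical; HH-RLHF, SHP, etc. each define their own schemas.
preference_example = {
    "prompt": "How can I improve the readability of my Python code?",
    "chosen": "Use descriptive names, keep functions small, and follow PEP 8 conventions...",
    "rejected": "Just write whatever works.",
}

# A reward model scores both responses and is commonly trained with a pairwise
# (Bradley-Terry style) objective: loss = -log(sigmoid(r_chosen - r_rejected)).
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Loss is small when the chosen response receives a higher reward score."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(pairwise_loss(1.3, 0.2))
```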
3.4 Library Resource

In this part, we briefly introduce a series of available libraries for developing LLMs.

• Transformers [187] is an open-source Python library for building models with the Transformer architecture, which is developed and maintained by Hugging Face. It has a simple and user-friendly API, making it easy to use and customize various pre-trained models. It is a powerful library with a large and active community of users and developers who regularly update and improve the models and algorithms.

• DeepSpeed [74] is a deep learning optimization library (compatible with PyTorch) developed by Microsoft, which has been used to train a number of LLMs, such as MT-NLG [113] and BLOOM [78]. It provides support for various optimization techniques for distributed training, such as memory optimization (the ZeRO technique, gradient checkpointing) and pipeline parallelism.

• Megatron-LM [75–77] is a deep learning library developed by NVIDIA for training large-scale language models. It also provides rich optimization techniques for distributed training, including model and data parallelism, mixed-precision training, and FlashAttention. These optimization techniques can largely improve training efficiency and speed, enabling efficient distributed training across GPUs.

• JAX [188] is a Python library for high-performance machine learning algorithms developed by Google, allowing users to easily perform computations on arrays with hardware acceleration (e.g., GPU or TPU). It enables efficient computation on various devices and also supports several featured functions, such as automatic differentiation and just-in-time compilation.

• Colossal-AI [189] is a deep learning library developed by HPC-AI Tech for training large-scale AI models. It is implemented based on PyTorch and supports a rich collection of parallel training strategies. Furthermore, it can also optimize heterogeneous memory management with methods proposed by PatrickStar [190]. Recently, a ChatGPT-like model called ColossalChat [140] has been publicly released in two versions (7B and 13B), which are developed using Colossal-AI based on LLaMA [57].

• BMTrain [191] is an efficient library developed by OpenBMB for training models with large-scale parameters in a distributed manner, which emphasizes code simplicity, low resource usage, and high availability. BMTrain has already incorporated several common LLMs (e.g., Flan-T5 [69] and GLM [93]) into its ModelCenter, where developers can use these models directly.

• FastMoE [192] is a specialized training library for MoE (i.e., mixture-of-experts) models. It is developed based on PyTorch, prioritizing both efficiency and user-friendliness in its design. FastMoE simplifies the process of converting Transformer models to MoE models and supports both data parallelism and model parallelism during training.

• vLLM [193] is a fast, memory-efficient, and easy-to-use library for LLM inference and serving. To enable fast inference, it is specially optimized with high serving throughput, effective attention memory management using PagedAttention [193], continuous batching, and optimized CUDA kernels. Furthermore, vLLM also supports various decoding algorithms, tensor parallelism and streaming outputs. To ease integration with other systems, vLLM is friendly to the use of HuggingFace models, and also provides OpenAI-compatible API servers.

• DeepSpeed-MII [194] is also a memory-efficient Python library developed by DeepSpeed [74]. It aims to democratize LLM inference by prioritizing high throughput, low latency, and cost-effectiveness. DeepSpeed-MII achieves accelerated text generation inference by leveraging four essential technologies: blocked KV caching, continuous batching, dynamic SplitFuse, and high-performance CUDA kernels. It currently supports over 13,000 models across three popular model architectures, such as LLaMA [57], Mistral [195], and OPT [90].

• DeepSpeed-Chat [196] is a fast, cost-effective, and easy-to-use system framework that enables the integration of the complete RLHF process during model training. It features three major functionalities: (1) it simplifies the training and inference process for ChatGPT-like models, enabling the use of a simple script to implement multiple training or inference steps; (2) it replicates the training mode of InstructGPT [66] and provides a complete pipeline for three training steps (i.e., SFT, reward model fine-tuning, and RLHF); (3) it integrates the training engine and inference engine of DeepSpeed into a unified hybrid engine (DeepSpeed HE) for RLHF training, which enables seamless switching between training and inference modes and leverages various optimizations from DeepSpeed Inference.

In addition to the above library resources, existing deep learning frameworks (e.g., PyTorch [197], TensorFlow [198], MXNet [199], PaddlePaddle [200], MindSpore [136] and OneFlow [201]) also provide support for parallel algorithms, which are commonly used for training large-scale models.

4 PRE-TRAINING

Pre-training establishes the basis of the abilities of LLMs. By pre-training on large-scale corpora, LLMs can acquire essential language understanding and generation skills [55, 56]. In this process, the scale and quality of the pre-training corpus are critical for LLMs to attain powerful capabilities. Furthermore, to effectively pre-train LLMs, model architectures, acceleration methods, and optimization techniques need to be well designed. In what follows, we first discuss the data collection and processing in Section 4.1, then introduce the commonly used model architectures in Section 4.2, and finally present the training techniques to stably and efficiently optimize LLMs in Section 4.3.

4.1 Data Collection and Preparation

Compared with small-scale language models, LLMs have a stronger demand for high-quality data for model pre-training, and their model capacities largely rely on the pre-training corpus and how it has been preprocessed. In this part, we discuss the collection and processing of pre-training data, including data sources, preprocessing methods, and important analysis of how pre-training data affects the performance of LLMs.
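As a brief illustration of the inference-oriented libraries introduced in Section 3.4, the following sketch performs offline batched generation with vLLM; the model identifier and sampling settings are illustrative placeholders rather than recommended configurations.

```python
# A minimal sketch of offline batched generation with vLLM.
# The model identifier is illustrative; replace it with a checkpoint you have access to.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-1.3b")  # illustrative open checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Large language models are",
    "The key idea of PagedAttention is",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```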
[Figure 6 omitted: pie charts of the pre-training data composition for T5 (11B), Falcon (40B), LLaMA (65B), GPT-3 (175B), MT-NLG (530B), Gopher (280B), Chinchilla (70B), GLaM (1200B), PaLM (540B), LaMDA (137B), Galactica (120B), GPT-NeoX (20B), CodeGen (16B), and AlphaCode (41B), with sources grouped into webpages, conversation data, books and news, scientific data, and code.]

Fig. 6: Ratios of various data sources in the pre-training data for existing LLMs.
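In practice, per-source ratios such as those summarized in Figure 6 are typically realized by sampling training examples from the candidate sources in proportion to fixed weights. The sketch below uses illustrative weights rather than the mixture of any particular model.

```python
# A sketch of sampling pre-training examples according to per-source mixture weights.
# Sources and weights are illustrative placeholders, not a reported configuration.
import random

mixture = {"webpages": 0.82, "code": 0.065, "books": 0.045, "scientific": 0.025, "other": 0.045}
sources, weights = zip(*mixture.items())

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example in proportion to its weight."""
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(10)])
```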
4.1.1 Data Source

To develop a capable LLM, it is key to collect a large amount of natural language corpus from various data sources. Existing LLMs mainly leverage a mixture of diverse public textual datasets as the pre-training corpus. Figure 6 shows the distribution of the sources of pre-training data for a number of representative LLMs.

The sources of pre-training corpora can be broadly categorized into two types: general data and specialized data. General data, such as webpages, books, and conversational text, is utilized by most LLMs [55, 56, 90] due to its large, diverse, and accessible nature, which can enhance the language modeling and generalization abilities of LLMs. In light of the impressive generalization capabilities exhibited by LLMs, there are also studies that extend their pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, endowing LLMs with specific task-solving capabilities [35, 56, 86]. In what follows, we describe these two types of pre-training data sources and their effects on LLMs. For a detailed introduction to the commonly used corpora, one can refer to Section 3.2.

General Text Data. As we can see in Figure 6, the vast majority of LLMs adopt general-purpose pre-training data, such as webpages, books, and conversational text, which provides rich text sources on a variety of topics. Next, we briefly summarize three important kinds of general data.

• Webpages. Owing to the proliferation of the Internet, various types of data have been created, which enables LLMs to gain diverse linguistic knowledge and enhance their generalization capabilities [26, 82]. For convenient use of these data resources, a large amount of data is crawled from the web in previous work, such as CommonCrawl [163]. However, the crawled web data tends to contain both high-quality text, such as Wikipedia, and low-quality text, such as spam mail; thus it is important to filter and process webpages to improve the data quality.

• Conversation text. Conversation data can enhance the conversational competence of LLMs [90] and potentially improve their performance on a range of question-answering tasks [56]. Researchers can utilize subsets of public conversation corpora (e.g., the PushShift.io Reddit corpus) [158, 202] or collect conversation data from online social media. Since online conversational data often involves discussions among multiple participants, an effective processing way is to transform a conversation into a tree structure, where each utterance is linked to the one it responds to. In this way, the multi-party conversation tree can be divided into multiple sub-conversations, which can be collected in the pre-training corpus. Furthermore, a potential risk is that the excessive integration of dialogue data into LLMs may result in a side effect [90]: declarative instructions and direct interrogatives are erroneously perceived as the beginning of conversations, thus leading to a decline in the efficacy of the instructions.

• Books. Compared to other corpora, books provide an important source of formal long texts, which are potentially beneficial for LLMs to learn linguistic knowledge, model long-term dependencies, and generate narrative and coherent texts. To obtain open-source book data, existing studies usually adopt the Books3 and Bookcorpus2 datasets, which are available in the Pile dataset [161].

Specialized Text Data. Specialized datasets are useful to improve the specific capabilities of LLMs on downstream tasks. Next, we introduce three kinds of specialized data.

• Multilingual text. In addition to the text in the target language, integrating a multilingual corpus can enhance the multilingual abilities of language understanding and generation. For example, BLOOM [78] and PaLM [56] have curated multilingual data covering 46 and 122 languages, respectively, within their pre-training corpora. FLM [102] mixes Chinese and English corpora in nearly equal proportions. These models demonstrate impressive performance in multilingual tasks, such as translation, multilingual summarization, and multilingual question answering, and achieve comparable or superior performance to state-of-the-art models that are fine-tuned on the corpus in the target language(s).
• Scientific text. The continued human exploration of science is reflected in the ever-growing body of scientific publications. In order to enhance the understanding of scientific knowledge for LLMs [35, 203], it is useful to incorporate a scientific corpus for model pre-training [35, 203]. By pre-training on a vast amount of scientific text, LLMs can achieve impressive performance in scientific and reasoning tasks [204]. To construct the scientific corpus, existing efforts mainly collect arXiv papers, scientific textbooks, math webpages, and other related scientific resources. Due to the complex nature of data in scientific fields, such as mathematical symbols and protein sequences, specific tokenization and preprocessing techniques are usually required to transform these different formats of data into a unified form that can be processed by language models.

• Code. Program synthesis has been widely studied in the research community [105, 205–208], especially the use of PLMs trained on code [165, 209]. However, it remains challenging for these PLMs (e.g., GPT-J [165]) to generate high-quality and accurate programs. Recent studies [105, 208] have found that training LLMs on a vast code corpus can lead to a substantial improvement in the quality of the synthesized programs. The generated programs can successfully pass expert-designed unit-test cases [105] or solve competitive programming questions [114]. In general, two types of code corpora are commonly used for pre-training LLMs. The first source is programming question-answering communities like Stack Exchange [210]. The second source is public software repositories such as GitHub [86, 105, 208], where code data (including comments and docstrings) are collected for utilization. Compared to natural language text, code is in the format of a programming language, corresponding to long-range dependencies and accurate execution logic [211]. A recent study [47] also speculates that training on code might be a source of complex reasoning abilities (e.g., chain-of-thought ability [33]). Furthermore, it has been shown that formatting reasoning tasks into code can help LLMs generate more accurate results [211].

4.1.2 Data Preprocessing

After collecting a large amount of text data, it is essential to preprocess the data for constructing the pre-training corpus, especially removing noisy, redundant, irrelevant, and potentially toxic data [56, 64, 212], which may largely affect the capacity and performance of LLMs. To facilitate data processing, a recent study [213] proposes a useful data processing system for LLMs, named Data-Juicer, which provides over 50 processing operators and tools. In this part, we review the detailed data preprocessing strategies to improve the quality of the collected data [64, 78, 112]. A typical pipeline for preprocessing the pre-training data of LLMs is illustrated in Figure 7.

Quality Filtering. To remove low-quality data from the collected corpus, existing work generally adopts two approaches: (1) classifier-based, and (2) heuristic-based. The former approach trains a selection classifier based on high-quality texts and leverages it to identify and filter out low-quality data. Typically, these methods [55, 56, 112] train a binary classifier with well-curated data (e.g., Wikipedia pages) as positive instances and sampled candidate data as negative instances, and predict a score that measures the quality of each data example. However, several studies [64, 112] find that a classifier-based approach may result in the unintentional removal of high-quality texts in dialectal, colloquial, and sociolectal languages, which potentially leads to bias in the pre-training corpus and diminishes the corpus diversity. As the second approach, several studies, such as BLOOM [78] and Gopher [64], employ heuristic-based approaches to eliminate low-quality texts through a set of well-designed rules, which can be summarized as follows:

• Language based filtering. If an LLM would be mainly used for tasks in certain languages, the text in other languages can be filtered.

• Metric based filtering. Evaluation metrics about the generated texts, e.g., perplexity, can be employed to detect and remove unnatural sentences.

• Statistic based filtering. Statistical features of a corpus, e.g., the punctuation distribution, symbol-to-word ratio, and sentence length, can be utilized to measure the text quality and filter out low-quality data.

• Keyword based filtering. Based on a specific keyword set, noisy or unuseful elements in the text, such as HTML tags, hyperlinks, boilerplate, and offensive words, can be identified and removed.

De-duplication. Existing work [214] has found that duplicate data in a corpus would reduce the diversity of language models, which may cause the training process to become unstable and thus affect the model performance. Therefore, it is necessary to de-duplicate the pre-training corpus. Specifically, de-duplication can be performed at different granularities, including sentence-level, document-level, and dataset-level de-duplication. First, low-quality sentences that contain repeated words and phrases should be removed, as they may introduce repetitive patterns in language modeling [215]. At the document level, existing studies mostly rely on the overlap ratio of surface features (e.g., word and n-gram overlap) between documents to detect and remove duplicate documents containing similar contents [57, 64, 78, 216]. Furthermore, to avoid the dataset contamination problem, it is also crucial to prevent overlap between the training and evaluation sets [56], by removing possible duplicate texts from the training set. It has been shown that the three levels of de-duplication are useful to improve the training of LLMs [56, 217], and they should be jointly used in practice.

Privacy Reduction. The majority of pre-training text data is obtained from web sources, including user-generated content involving sensitive or personal information, which may increase the risk of privacy breaches [218]. Thus, it is necessary to remove the personally identifiable information (PII) from the pre-training corpus. One direct and effective approach is to employ rule-based methods, such as keyword spotting, to detect and remove PII such as names, addresses, and phone numbers [162]. Furthermore, researchers also find that the vulnerability of LLMs under privacy attacks can be attributed to the presence of duplicate PII data in the pre-training corpus [219]. Therefore, de-duplication can also
Fig. 7: An illustration of a typical data preprocessing pipeline for pre-training large language models.
reduce privacy risks to some extent.

Tokenization. Tokenization is also a crucial step for data preprocessing. It aims to segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs. In traditional NLP research (e.g., sequence labeling with conditional random fields [220]), word-based tokenization is the predominant approach, which is more aligned with human language cognition. However, word-based tokenization can yield different segmentation results for the same input in some languages (e.g., Chinese word segmentation), generate a huge word vocabulary containing many low-frequency words, and also suffer from the "out-of-vocabulary" issue. Thus, several neural network models employ characters as the minimum unit to derive the word representation (e.g., a CNN word encoder in ELMo [21]). Recently, subword tokenizers have been widely used in Transformer-based language models, typically including Byte-Pair Encoding tokenization, WordPiece tokenization and Unigram tokenization. HuggingFace maintains an excellent online NLP course on tokenizers22 with running examples, and we refer beginners to this course. Next, we briefly describe the three representative tokenization methods.

22. https://huggingface.co/learn/nlp-course/chapter6

• Byte-Pair Encoding (BPE) tokenization. BPE was originally proposed as a general data compression algorithm in 1994 [221], and was later adapted to NLP for tokenization [222]. It starts with a set of basic symbols (e.g., the alphabet and boundary characters), and iteratively combines frequent pairs of two consecutive tokens in the corpus into new tokens (called a merge). For each merge, the selection criterion is based on the co-occurrence frequency of two contiguous tokens: the most frequent pair is selected. The merge process continues until the vocabulary reaches a predefined size. Further, Byte-level BPE has been used to improve the tokenization quality of multilingual corpora (e.g., text containing non-ASCII characters) by considering bytes as the basic symbols for merging. Representative language models with this tokenization approach include GPT-2, BART, and LLaMA.

• WordPiece tokenization. WordPiece was a Google internal subword tokenization algorithm. It was originally proposed by Google in developing voice search systems [223]. Then, it was used in the neural machine translation system in 2016 [224], and was adopted as the word tokenizer for BERT in 2018 [23]. WordPiece is very similar in spirit to BPE in that it iteratively merges consecutive tokens, but it takes a slightly different selection criterion for the merge. To conduct the merge, it first trains a language model and employs it to score all possible pairs. Then, at each merge, it selects the pair that leads to the largest increase in the likelihood of the training data. Since Google hasn't released the official implementation of the WordPiece algorithm, HuggingFace gives a more intuitive selection measure in its online NLP course: a pair is scored by dividing its co-occurrence count by the product of the occurrence counts of the two tokens in the pair, based on the training corpus.

• Unigram tokenization. Unlike BPE and WordPiece, Unigram tokenization [225] starts with a sufficiently large set of possible substrings or subtokens for a corpus, and iteratively removes tokens from the current vocabulary until the expected vocabulary size is reached. As the selection criterion, it estimates how much the likelihood of the training corpus would change if a given token were removed from the current vocabulary, and drops the tokens whose removal affects the likelihood least. This step is conducted based on a trained unigram language model. To estimate the unigram language model, it adopts an expectation–maximization (EM) algorithm: at each iteration, we first find the currently optimal tokenization of words based on the old language model, and then re-estimate the probabilities of unigrams to update the language model. During this procedure, dynamic programming algorithms (i.e., the Viterbi algorithm) are used to efficiently find the optimal way of decomposing a word given the language model. Representative models that adopt this tokenization approach include T5 and mBART.

Although it is expedient to leverage an existing tokenizer (e.g., OPT [90] and GPT-3 [55] utilize the tokenizer of GPT-2 [26]), using a tokenizer specially designed for the pre-training corpus can be highly beneficial [78], especially for a corpus that consists of diverse domains, languages, and formats. Therefore, recent LLMs often train customized tokenizers specially for the pre-training corpus with the SentencePiece library [226], which includes Byte-level BPE and Unigram tokenization. A note is that normalization techniques in BPE, such as NFKC [227], may degrade the tokenization performance [34, 64, 78]. When extending existing LLMs (i.e., continual pre-training or instruction tuning), we should also be aware of the potential side effects of customized tokenizers. For example, LLaMA trains its BPE tokenizer on a pre-training corpus mainly consisting of English texts, and the derived vocabulary might be less capable of processing non-English data, e.g., taking longer inference latency to generate Chinese texts.
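As noted above, recent LLMs often train customized tokenizers with the SentencePiece library. The following sketch trains and applies a BPE tokenizer on a raw-text corpus; the file name, vocabulary size, and other settings are illustrative placeholders rather than the configuration of any particular model.

```python
# A minimal sketch of training a customized subword tokenizer with SentencePiece.
# "corpus.txt" and the hyperparameters below are illustrative placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # one sentence (or document) per line
    model_prefix="custom_bpe",   # writes custom_bpe.model / custom_bpe.vocab
    vocab_size=32000,
    model_type="bpe",            # "unigram" is also supported
    character_coverage=0.9995,   # helpful for languages with large character sets
)

sp = spm.SentencePieceProcessor(model_file="custom_bpe.model")
print(sp.encode("Alice is writing a paper about LLMs.", out_type=str))
```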
[Figure 8 is a schematic of the data scheduling pipeline: several data sources (1, 2, 3, 4, . . .) are combined through a data mixture and ordered into a data curriculum across training stages (Stage 1, Stage 2, . . .).]

Fig. 8: An illustration of data scheduling for pre-training LLMs.
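As a concrete companion to Figure 8, the sketch below samples pre-training documents from several sources according to stage-dependent mixture weights. The source names, weights, and two-stage schedule are hypothetical values chosen only to illustrate how a data mixture and a simple data curriculum could be wired together; they are not the settings of any specific LLM.

```python
import random

# Hypothetical data curriculum: per-stage mixture weights over data sources.
# Stage 1 is general-purpose; stage 2 upweights code to specialize the model.
CURRICULUM = [
    {"webpages": 0.82, "code": 0.07, "books": 0.06, "scientific": 0.05},  # stage 1
    {"webpages": 0.40, "code": 0.45, "books": 0.10, "scientific": 0.05},  # stage 2
]

def infinite_stream(name):
    """Stand-in for a shuffled document stream of one source."""
    i = 0
    while True:
        yield f"<{name} doc {i}>"
        i += 1

def sample_batch(source_streams, weights, batch_size, rng=random):
    """Pick a source according to the mixture weights, then take the next
    document from that source's stream, repeated batch_size times."""
    sources = list(weights)
    probs = [weights[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        src = rng.choices(sources, weights=probs, k=1)[0]
        batch.append(next(source_streams[src]))
    return batch

streams = {s: infinite_stream(s) for s in CURRICULUM[0]}
for stage, weights in enumerate(CURRICULUM, start=1):
    print(f"stage {stage}:", sample_batch(streams, weights, batch_size=8))
```

Under this view, upsampling or downsampling a source simply means setting its weight above or below its natural share of the raw corpus.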
Discussion on Effect of Data Quality. For pre-training, the quality of pre-training data is vital to the model capacities of LLMs. Existing work has shown that pre-training on a low-quality corpus, such as noisy, toxic, or duplicate data, largely hurts model performance [64, 214, 216, 219]. Recent studies, such as T5 [82], GLaM [112], and Gopher [64], have investigated the influence of data quality on the capacities of LLMs. By comparing models trained on filtered and unfiltered corpora, they reach the similar conclusion that pre-training LLMs on cleaned data improves performance. More specifically, duplicated data may result in "double descent" (referring to the phenomenon of performance initially deteriorating and subsequently improving) [214, 228], or may even overwhelm the training process [214]. In addition, it has been shown that duplicate data degrades the ability of LLMs to copy from the context, which might further affect the generalization capacity of LLMs when using in-context learning [214]. Therefore, as suggested in [56, 64, 78, 212], it is essential to apply preprocessing methods such as quality filtering, toxicity filtering, and deduplication to carefully clean the pre-training corpus (as illustrated in Section 4.1.2), so as to improve the stability of the training process and avoid degrading model performance.

4.1.3 Data Scheduling

After data preprocessing, it is essential to design suitable strategies to schedule these multi-source data for pre-training a capable LLM. Generally, two key aspects should be paid close attention for data scheduling: the proportion of each data source (data mixture), and the order in which each data source is scheduled for training (data curriculum). Next, we discuss the two aspects in detail. An illustration of data scheduling is presented in Figure 8.

Data Mixture. Since each kind of data source is closely related to the development of certain capacities of LLMs (referring to the discussions in Section 4.1), it is important to set a suitable distribution for mixing these data. The data mixture is generally set at a global level (i.e., the distribution of the entire pre-training data), and can also be set locally to varied proportions at different training stages. During pre-training, data samples from different sources are selected according to the mixture proportions: more data will be sampled from a data source with a larger weight. Typically, existing LLMs such as LLaMA [57] may employ upsampling or downsampling on the full data of each source to create specific data mixtures as pre-training data.

As Figure 6 illustrates, existing LLMs use different data mixtures to construct the pre-training data. As a representative model, the pre-training data of LLaMA [57] mainly consists of webpages (over 80%), alongside 6.5% of code-heavy data from GitHub and StackExchange, 4.5% from books, and 2.5% of scientific data sourced from arXiv, which has become an important reference for training general-purpose LLMs. Furthermore, special data mixtures can be used to serve different purposes. For example, Falcon [141] is trained on pure webpages, and CodeGen [86] largely increases the amount of code data. In practice, the data mixture is often determined empirically, and we summarize several common strategies for finding an effective data mixture as follows:

• Increasing the diversity of data sources. Recent studies have empirically shown that training on excessive data about a certain domain degrades the generalization capability of LLMs on other domains [35, 64]. In contrast, increasing the data source heterogeneity (e.g., including diverse data sources) is critical for improving the downstream performance of LLMs [212, 229, 230]. To further examine the effect of different data sources, some studies have conducted ablation experiments by removing each data source one by one and pre-training LLMs on the specially curated datasets [212]. It has been shown that dropping data sources with high heterogeneity (e.g., webpages) impacts LLMs' abilities more severely than dropping sources with low heterogeneity (e.g., academic corpora).

• Optimizing data mixtures. In addition to manually setting the data mixture, several studies have proposed to optimize the data mixture for improving model pre-training [59, 231]. Given the target downstream tasks, one can select pre-training data that either has higher proximity in the feature space [231] or provides positive influences on downstream task performance [232]. Further, to reduce the reliance on target tasks, DoReMi [59] first trains a small reference model using given initial domain weights, and then trains another small proxy model, upweighting the domains on which the greatest discrepancies in likelihood between the two models are observed. Finally, the learned domain weights of the proxy model are applied to train a much larger LLM. In a simpler way, one can train several small language models with different data mixtures, and select the data mixture that leads to the most desirable performance. However, an assumption made in this approach is that, when trained in a similar way, small models will resemble large models in their abilities or behaviors, which may not always hold in practice.

• Specializing the targeted abilities. The model capacities of LLMs heavily rely on data selection and mixture, and one can boost the proportion of specific data sources to enhance certain model abilities [64, 212]. For example, the mathematical reasoning and coding abilities can be specially enhanced by training with more mathematical texts and code data, respectively. Furthermore, experimental results on the LAMBADA dataset [233] show that increasing the proportion of books data can improve the model capacity in capturing long-term dependencies in text, and increasing the proportion of the C4 dataset [82] leads to performance improvement on the C4 validation dataset [64]. Generally, it is important to identify more implicit relations between
data sources and model abilities. To enhance specific skills such as mathematics and coding in LLMs, or to develop specialized LLMs, a practical way is to employ a multi-stage training approach, e.g., general and skill-specific data can be scheduled at two consecutive stages. This approach of training LLMs on varying sources or proportions of data across multiple stages is also known as "data curriculum", which is introduced below.

Data Curriculum. After preparing the data mixture, it is important to schedule the order in which specific data is presented to LLMs for pre-training. It has been shown that, in some cases, to learn a certain skill, learning in a skill-set sequence (e.g., basic skills → target skill) outperforms direct learning from a corpus focused solely on the target skill [234, 235]. Following the idea of curriculum learning [236], data curriculum has been proposed and widely used in model pre-training [234, 235, 237, 238]. It aims to organize different parts of the pre-training data for LLMs in a specific order, e.g., starting with easy/general examples and progressively introducing more challenging/specialized ones. More generally, it can broadly refer to the adaptive adjustment of data proportions for different sources during pre-training. Existing work on data curriculum mainly focuses on continual pre-training, such as specialized coding LLMs (e.g., CodeLLaMA [235]) or long context LLMs (e.g., LongLLaMA [238]). However, the literature still lacks a detailed report on data curriculum for general-purpose LLMs (e.g., LLaMA). To determine a data curriculum, a practical approach is to monitor the development of key abilities of LLMs based on specially constructed evaluation benchmarks, and then adaptively adjust the data mixture during pre-training. Next, we take three common abilities as examples to introduce how the concept of data curriculum23 applies in continual pre-training.

• Coding. To improve the coding ability of LLMs, CodeLLaMA [235] is developed based on LLaMA 2 [99] (2T general tokens → 500B code-heavy tokens), aiming to improve the code generation ability while retaining natural language understanding skills. CodeLLaMA also provides a version that is further specialized to a certain programming language, namely CodeLLaMA-Python (2T general tokens → 500B code-heavy tokens → 100B Python-heavy tokens).

• Mathematics. Llemma [239] is proposed to enhance the mathematical capacities of general-purpose LLMs. It is developed based on CodeLLaMA. Although CodeLLaMA [235] mainly focuses on the coding ability, experiments have shown that it performs better than its base model LLaMA 2 on mathematics benchmarks [239]. Based on CodeLLaMA, Llemma is continually trained on mixtures of scientific papers and web data containing mathematical text and code (2T general tokens → 500B code-heavy tokens → 50∼200B math-heavy tokens). Note that the pre-training data of Llemma also contains 5% general domain data as a form of regularization.

• Long context. Long context modeling is an important ability for LLMs, and many studies have explored extending the context windows of LLMs via continual training [235, 238]. With modifications on position embeddings (i.e., position interpolation) of RoPE-based LLMs [57, 99, 240], CodeLLaMA further extends the context window of LLaMA 2 (2.5T tokens with a 4K context window → 20B tokens with a 16K context window). LongLLaMA [238] also achieves a longer context window with the help of external memory and a unique training objective (1T tokens with a 2K context window → 10B tokens with an 8K context window).

23. We utilize the symbol "→" to represent the data order in data curriculum. For example, "2T webpage tokens → 500B code tokens" means that the LLM is first trained with 2T webpage tokens and subsequently with 500B code tokens.

4.1.4 Summary of Data Preparation

In this part, we summarize the general procedure and key points for preparing pre-training data for LLMs, detailed in the following three aspects.

• Data collection. It is suggested to include diverse data sources in the pre-training data. Although Falcon [141] shows that webpages alone can be employed to train powerful LLMs, a more typical approach is to also incorporate diverse high-quality text like code, books, and scientific papers. If an LLM is to be specialized in a certain skill, the proportion of the corresponding data source should be increased accordingly. For example, Gopher [64] and Chinchilla [34] are trained with approximately 40% of data from books, while PaLM [44] and LaMDA [68] use approximately 50% conversational data.

• Data cleaning. After data collection, it is crucial to clean the raw corpus to enhance its quality as much as possible. First, deduplication is commonly used in existing work [99, 141, 229]. Second, low-quality text, toxic content, and data with privacy concerns should be removed at different granularities (e.g., document, passage, or sentence). In practice, both heuristic and classifier-based methods can be employed for quality and toxicity filtering (e.g., CCNet [241], fastText [242], and Data-Juicer [243]). Third, with the cleaned data, one can further unify or specify the format of the pre-training data, and perform tokenization by training the tokenizer on the filtered and deduplicated corpus with libraries like SentencePiece [226].

• Data scheduling. With the preprocessed data, the next step is to determine the data mixture and the specific order of data for pre-training LLMs. To determine both settings, a practical way is to first train several small language models with multiple candidate plans and then select a good plan among them [59]. Overall, it is more difficult to find a suitable data curriculum. In practice, one can monitor the performance of intermediate model checkpoints on specific evaluation benchmarks, and dynamically tune the data mixture and distribution during pre-training. In this process, it is also useful to explore the potential relations between data sources and model abilities to guide the design of the data curriculum.

4.2 Architecture

In this section, we review the architecture design of LLMs, i.e., mainstream architecture, pre-training objectives, and detailed configuration. Table 5 presents the model cards of several representative LLMs with public details.

4.2.1 Typical Architectures

Due to the excellent parallelizability and capacity, the Transformer architecture [22] has become the de facto backbone to
TABLE 5: Model cards of several selected LLMs with public configuration details. Here, PE denotes position embedding,
#L denotes the number of layers, #H denotes the number of attention heads, dmodel denotes the size of hidden states, and
MCL denotes the maximum context length during training.

Model Category Size Normalization PE Activation Bias #L #H dmodel MCL


GPT3 [55] Causal decoder 175B Pre LayerNorm Learned GeLU ✓ 96 96 12288 2048
PanGU- α [84] Causal decoder 207B Pre LayerNorm Learned GeLU ✓ 64 128 16384 1024
OPT [90] Causal decoder 175B Pre LayerNorm Learned ReLU ✓ 96 96 12288 2048
PaLM [56] Causal decoder 540B Pre LayerNorm RoPE SwiGLU × 118 48 18432 2048
BLOOM [78] Causal decoder 176B Pre LayerNorm ALiBi GeLU ✓ 70 112 14336 2048
MT-NLG [113] Causal decoder 530B - - - - 105 128 20480 2048
Gopher [64] Causal decoder 280B Pre RMSNorm Relative - - 80 128 16384 2048
Chinchilla [34] Causal decoder 70B Pre RMSNorm Relative - - 80 64 8192 -
Galactica [35] Causal decoder 120B Pre LayerNorm Learned GeLU × 96 80 10240 2048
LaMDA [68] Causal decoder 137B - Relative GeGLU - 64 128 8192 -
Jurassic-1 [107] Causal decoder 178B Pre LayerNorm Learned GeLU ✓ 76 96 13824 2048
LLaMA [57] Causal decoder 65B Pre RMSNorm RoPE SwiGLU × 80 64 8192 2048
LLaMA 2 [99] Causal decoder 70B Pre RMSNorm RoPE SwiGLU × 80 64 8192 4096
Falcon [141] Causal decoder 40B Pre LayerNorm RoPE GeLU × 60 64 8192 2048
GLM-130B [93] Prefix decoder 130B Post DeepNorm RoPE GeGLU ✓ 70 96 12288 2048
T5 [82] Encoder-decoder 11B Pre RMSNorm Relative ReLU × 24 128 1024 512
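As a quick sanity check on the configurations in Table 5, the snippet below estimates the parameter count of a dense decoder-only model from #L and dmodel using the common rough rule of about 12 · #L · dmodel² non-embedding parameters (roughly 4dmodel² for the attention projections plus 8dmodel² for the feed-forward network per layer). This is only an approximation that ignores embeddings, biases, and architecture-specific choices such as SwiGLU widths, and it is not how the reported sizes were computed.

```python
def approx_params(num_layers, d_model, vocab_size=0):
    """Rough parameter count of a dense decoder-only Transformer:
    ~4*d^2 (attention) + ~8*d^2 (FFN) per layer, plus an optional
    input embedding matrix of vocab_size * d."""
    return num_layers * 12 * d_model ** 2 + vocab_size * d_model

# (#L, d_model) taken from Table 5; reported sizes shown for comparison.
for name, layers, d_model, reported in [
    ("GPT-3", 96, 12288, "175B"),
    ("LLaMA", 80, 8192, "65B"),
    ("Gopher", 80, 16384, "280B"),
]:
    est = approx_params(layers, d_model)
    print(f"{name}: ~{est / 1e9:.0f}B estimated vs {reported} reported")
```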

[Figure 9 (diagram not reproducible in text): three panels, Causal Decoder, Prefix Decoder, and Encoder-Decoder, each showing the attention pattern over the example sentence "A Survey of Large Language Models".]
Fig. 9: A comparison of the attention patterns in three mainstream architectures. Here, the blue, green, yellow and grey
rounded rectangles indicate the attention between prefix tokens, attention between prefix and target tokens, attention
between target tokens, and masked attention respectively.
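To complement Figure 9, the snippet below builds the corresponding attention masks for a toy sequence. It is a minimal NumPy sketch; the sequence length and prefix length are arbitrary illustrative values rather than settings from any particular model.

```python
import numpy as np

def causal_mask(n):
    """Causal decoder: each token attends to itself and all past tokens."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_mask(n, prefix_len):
    """Prefix (non-causal) decoder: bidirectional attention within the prefix,
    causal attention over the generated tokens."""
    mask = np.tril(np.ones((n, n), dtype=bool))
    mask[:prefix_len, :prefix_len] = True  # prefix tokens fully see each other
    return mask

def encoder_decoder_masks(src_len, tgt_len):
    """Encoder-decoder: bidirectional encoder self-attention, causal decoder
    self-attention, and full cross-attention from target to source tokens."""
    enc = np.ones((src_len, src_len), dtype=bool)
    dec = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    cross = np.ones((tgt_len, src_len), dtype=bool)
    return enc, dec, cross

# Toy example: a 6-token sequence ("A Survey of Large Language Models"),
# with a 3-token prefix for the prefix decoder.
print(causal_mask(6).astype(int))
print(prefix_mask(6, prefix_len=3).astype(int))
```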

develop various LLMs, making it possible to scale language models to hundreds or thousands of billions of parameters. In general, the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder, as shown in Figure 9.

Encoder-decoder Architecture. The vanilla Transformer model is built on the encoder-decoder architecture [22], which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. The encoder adopts stacked multi-head self-attention layers to encode the input sequence and generate its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence. Encoder-decoder PLMs (e.g., T5 [82] and BART [24]) have shown effectiveness on a variety of NLP tasks. So far, only a small number of LLMs are built on the encoder-decoder architecture, e.g., Flan-T5 [69]. We leave a detailed discussion about the architecture selection in Section 4.2.6.

Causal Decoder Architecture. The causal decoder architecture incorporates a unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself. The input and output tokens are processed in the same fashion through the decoder. As representative language models of this architecture, the GPT-series models [26, 55, 122] are developed based on the causal-decoder architecture. In particular, GPT-3 [55] has successfully demonstrated the effectiveness of this architecture, also showing the amazing in-context learning capability of LLMs. Interestingly, GPT-1 [122] and GPT-2 [26] do not exhibit such superior abilities as those in GPT-3, and it seems that scaling plays an important role in increasing the model capacity of this architecture. So far, causal decoders have been widely adopted as the architecture of LLMs by various existing models, such as OPT [90], BLOOM [78], and Gopher [64]. Note that both the causal decoder and the prefix decoder discussed next belong
to decoder-only architectures. When mentioning the "decoder-only architecture", existing literature mainly refers to the causal decoder architecture, unless specified otherwise.

Prefix Decoder Architecture. The prefix decoder architecture (a.k.a., non-causal decoder [244]) revises the masking mechanism of causal decoders to enable bidirectional attention over the prefix tokens [245] and unidirectional attention only on generated tokens. In this way, like the encoder-decoder architecture, prefix decoders can bidirectionally encode the prefix sequence and autoregressively predict the output tokens one by one, where the same parameters are shared during encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train causal decoders and then convert them into prefix decoders to accelerate convergence [29]; e.g., U-PaLM [118] is derived from PaLM [56]. Existing representative LLMs based on prefix decoders include GLM-130B [93] and U-PaLM [118].

Mixture-of-Experts. For the above three types of architectures, we can further extend them via mixture-of-experts (MoE) scaling, in which a subset of the neural network weights is sparsely activated for each input, e.g., Switch Transformer [25] and GLaM [112]. The major merit is that MoE is a flexible way to scale up the model parameters while maintaining a constant computational cost [25]. It has been shown that substantial performance improvements can be observed by increasing either the number of experts or the total parameter size [246]. Despite the merits, training large MoE models may suffer from instability issues due to the complex, hard-switching nature of the routing operation. To enhance the training stability of MoE-based language models, techniques such as selectively using high-precision tensors in the routing module or initializing the model with a smaller range have been introduced [25]. More recently, there is widespread speculation that GPT-4 has been developed based on the MoE architecture, but without official verification.

Emergent Architectures. Conventional Transformer architectures typically suffer from quadratic computational complexity. Because of this, efficiency has become an important issue when training and making inference with long inputs. To improve efficiency, some studies aim to devise new architectures for language modeling, including parameterized state space models (e.g., S4 [247], GSS [248], and H3 [249]), long convolutions like Hyena [250], and Transformer-like architectures that incorporate recursive update mechanisms (e.g., RWKV [251] and RetNet [252]). The key merits of these new architectures are twofold. First, these models can generate outputs recursively like RNNs, meaning that they only need to refer to the single previous state during decoding. This makes the decoding process more efficient, as it eliminates the need to revisit all previous states as in conventional Transformers. Second, these models have the capacity to encode an entire sentence in parallel like Transformers, in contrast to conventional RNNs, which have to encode sentences on a token-by-token basis. Thus, they can benefit from the parallelism of GPUs with techniques such as Parallel Scan [253, 254], FFT [250, 251], and Chunkwise Recurrent [252]. These techniques enable models with these new architectures to be trained in a highly parallel and efficient manner.

4.2.2 Detailed Configuration

Since the launch of the Transformer [22], various improvements have been proposed to enhance its training stability, performance, and computational efficiency. In this part, we discuss the corresponding configurations for four major parts of the Transformer, including normalization, position embeddings, activation functions, and attention and bias. To make this survey more self-contained, we present the detailed formulations for these configurations in Table 6.

Normalization Methods. Training instability is a challenging issue for pre-training LLMs. To alleviate this issue, normalization is a widely adopted strategy to stabilize the training of neural networks. In the vanilla Transformer [22], LayerNorm [256] is employed. Recently, several advanced normalization techniques have been proposed as alternatives to LayerNorm, e.g., RMSNorm and DeepNorm.

• LayerNorm. In early research, BatchNorm [265] was a commonly used normalization method. However, it is difficult to apply to sequence data of variable lengths and to small-batch data. Thus, LayerNorm [256] was introduced to conduct layerwise normalization. Specifically, the mean and variance over all activations per layer are calculated to re-center and re-scale the activations.

• RMSNorm. To improve the training speed of LayerNorm (LN), RMSNorm [257] is proposed by re-scaling the activations with only the root mean square (RMS) of the summed activations, instead of the mean and variance. Related research has demonstrated its superiority in training speed and performance on the Transformer [266]. Representative models that adopt RMSNorm include Gopher [64] and Chinchilla [34].

• DeepNorm. DeepNorm is proposed by Microsoft [258] to stabilize the training of deep Transformers. With DeepNorm as residual connections, Transformers can be scaled up to 1,000 layers [258], which has shown the advantages of stability and good performance. It has been adopted by GLM-130B [93].

Normalization Position. In addition to the normalization method, the normalization position also plays a crucial role in LLMs. There are generally three choices for the normalization position, i.e., post-LN, pre-LN, and sandwich-LN.

• Post-LN. Post-LN is used in the vanilla Transformer [22], where it is placed between residual blocks. However, existing work has found that the training of Transformers with post-LN tends to be unstable due to the large gradients near the output layer [267]. Thus, post-LN is rarely employed in existing LLMs except when combined with other strategies (e.g., combining post-LN with pre-LN in GLM-130B [93]).

• Pre-LN. Different from post-LN, pre-LN [268] is applied before each sub-layer, and an additional LN is placed before the final prediction. Compared with post-LN, Transformers with pre-LN are more stable in training. However, it performs worse than the variants with post-LN [269]. Despite the decrease in performance, most LLMs still adopt pre-LN due to the training stability. However, one excep-
TABLE 6: Detailed formulations for the network configurations. Here, Sublayer denotes a FFN or a self-attention module
in a Transformer layer, d denotes the size of hidden states, pi denotes position embedding at position i, Aij denotes the
attention score between a query and a key, ri−j denotes a learnable scalar based on the offset between the query and the
key, and RΘ,t denotes a rotary matrix with rotation degree t · Θ.

Configuration | Method | Equation

Normalization position:
  Post Norm [22]: Norm(x + Sublayer(x))
  Pre Norm [26]: x + Sublayer(Norm(x))
  Sandwich Norm [255]: x + Norm(Sublayer(Norm(x)))

Normalization method:
  LayerNorm [256]: ((x − µ)/σ) · γ + β, where µ = (1/d) Σ_{i=1}^{d} x_i and σ = sqrt((1/d) Σ_{i=1}^{d} (x_i − µ)²)
  RMSNorm [257]: (x / RMS(x)) · γ, where RMS(x) = sqrt((1/d) Σ_{i=1}^{d} x_i²)
  DeepNorm [258]: LayerNorm(α · x + Sublayer(x))

Activation function:
  ReLU [259]: ReLU(x) = max(x, 0)
  GeLU [260]: GeLU(x) = 0.5x ⊗ [1 + erf(x/√2)], where erf(x) = (2/√π) ∫_0^x e^{−t²} dt
  Swish [261]: Swish(x) = x ⊗ sigmoid(x)
  SwiGLU [262]: SwiGLU(x₁, x₂) = Swish(x₁) ⊗ x₂
  GeGLU [262]: GeGLU(x₁, x₂) = GeLU(x₁) ⊗ x₂

Position embedding:
  Absolute [22]: x_i = x_i + p_i
  Relative [82]: A_ij = W_q x_i x_j^T W_k^T + r_{i−j}
  RoPE [263]: A_ij = W_q x_i R_{Θ,i−j} x_j^T W_k^T = (W_q x_i R_{Θ,i})(W_k x_j R_{Θ,j})^T
  ALiBi [264]: A_ij = W_q x_i x_j^T W_k^T − m(i − j)
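As a concrete counterpart to the normalization rows of Table 6, the sketch below implements LayerNorm and RMSNorm over the last dimension exactly as written in the equations above. It is a minimal NumPy illustration (with an added epsilon for numerical stability), not the implementation used by any specific LLM.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: re-center and re-scale with the per-layer mean and std."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((x - mu) ** 2).mean(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: re-scale with the root mean square only (no re-centering),
    which saves the mean computation relative to LayerNorm."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

d = 8
x = np.random.randn(2, d)                 # a toy batch of hidden states
gamma, beta = np.ones(d), np.zeros(d)
print(layer_norm(x, gamma, beta).std(axis=-1))  # close to 1 after normalization
print(rms_norm(x, gamma)[0])
```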

tion is that pre-LN has been found unstable in GLM when training models with more than 100B parameters [93].

• Sandwich-LN. Based on pre-LN, Sandwich-LN [255] adds an extra LN before the residual connections to avoid value explosion in the Transformer layer outputs. However, it has been found that Sandwich-LN sometimes fails to stabilize the training of LLMs and may lead to a collapse of training [93].

Activation Functions. To obtain good performance, activation functions also need to be properly set in the feed-forward networks. In existing LLMs, GeLU activations [270] are widely used. Specially, in the latest LLMs (e.g., PaLM and LaMDA), variants of GLU activation [262, 271] have also been utilized, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice [266]. However, compared with GeLU, they require extra parameters (about 50%) in the feed-forward networks [272].

Position Embeddings. Since the self-attention modules in the Transformer are permutation equivariant, position embeddings (PE) are employed to inject absolute or relative position information for modeling sequences.

• Absolute position embedding. In the vanilla Transformer [22], absolute position embeddings are employed. At the bottom of the encoder and the decoder, the absolute positional embeddings are added to the input embeddings. There are two variants of absolute position embeddings proposed in the vanilla Transformer [22], i.e., sinusoidal and learned position embeddings, where the latter is commonly used in existing pre-trained language models.

• Relative position embedding. Unlike absolute position embeddings, relative position embeddings are generated according to the offsets between keys and queries [273]. A popular variant of relative PE was introduced in Transformer-XL [274, 275], where the calculation of attention scores between keys and queries is modified to introduce learnable embeddings corresponding to relative positions. T5 [82] further simplified relative positional embeddings, which was subsequently adopted by Gopher [64]. Specifically, it adds learnable scalars to the attention scores, where the scalars are calculated based on the distances between the positions of the query and the key. Compared with absolute PE, Transformers with relative position embeddings can generalize to sequences longer than those seen during training, i.e., extrapolation [264].

• Rotary Position Embedding. Rotary position embedding (RoPE) [263] sets specific rotatory matrices based on the absolute position of each key or query. The scores between keys and queries can then be computed with relative position information (Table 6). RoPE combines each consecutive pair of elements in the query and key vectors as a dimension, so there are d/2 dimensions for an original d-length embedding. For each dimension i ∈ {1, . . . , d/2}, the pair of involved elements rotates by the angle t · θi, where t denotes the position index and θi is the basis of the dimension. Following sinusoidal position embeddings [22], RoPE defines the basis θi as an exponentiation of the base b (set to 10000 by default):

Θ = {θi = b^{−2(i−1)/d} | i ∈ {1, 2, . . . , d/2}}. (4)

Furthermore, a recent study [276] defines the distance required to rotate one full cycle (2π) for each dimension as its wavelength:

λi = 2πb^{2(i−1)/d} = 2π/θi. (5)

Due to the excellent performance and the long-term decay property, RoPE is widely adopted in the latest LLMs, e.g., PaLM [56] and LLaMA [57]. Based on RoPE, xPos [277] further improves the translation invariance and length extrapolation of the Transformer. At each dimension of the rotation angle vector, xPos adds a special exponential decay that is smaller when the basis is larger. This can alleviate the unstable phenomenon during training as the distance increases.

• ALiBi. ALiBi [264] is proposed to improve the extrapolation of the Transformer. Similar to relative position embedding, it biases attention scores with a penalty based on the
distances between keys and queries. Different from the rela- • PagedAttention. It has been observed when LLM are
tive positional embedding methods like T5 [82], the penalty deployed on servers, GPU memory is largely occupied by
scores in ALiBi are pre-defined without any trainable pa- cached attention key and value tensors (called KV cache).
rameters. Empirical results in [264] have shown that ALiBi The major reason is that the input lengths are often varied,
has a better extrapolation performance on sequences that are leading to fragmentation and over-reservation issues. In-
longer than those for training than several popular position spired by the classic paging technique in operating systems,
embedding methods such as sinusoidal PE [22], RoPE [263], PagedAttention has been proposed to improve the memory
and T5 bias [82]. In addition, it has been shown that ALiBi efficiency and throughput of deployed LLMs [285]. In detail,
can also improve training stability in BLOOM [78]. PagedAttention partitions each sequence into subsequences,
and the corresponding KV caches of these subsequences are
Attention. Attention mechanism is a critical component of allocated into non-contiguous physical blocks. The paging
Transformer. It allows the tokens across the sequence to technique increases the GPU utilization and enables efficient
interact with each other and compute the representations memory sharing in parallel sampling.
of the input and output sequence. To put all these discussions together, we summarize the
• Full attention. In the vanilla Transformer [22], the atten- suggestions from existing literature for detailed configura-
tion mechanism is conducted in a pairwise way, considering tion. For stronger generalization and training stability, it is
the relations between all token pairs in a sequence. It adopts suggested to choose the pre RMSNorm for layer normaliza-
scaled dot-product attention, in which the hidden states tion, and SwiGLU or GeGLU as the activation function. In
are mapped into queries, keys, and values. Additionally, addition, LN may not be used immediately after embedding
Transformer uses multi-head attention instead of single layers, which is likely to incur performance degradation. As
attention, projecting the queries, keys, and values with for position embeddings, RoPE or ALiBi is a better choice
different projections in different heads. The concatenation since it performs better on long sequences.
of the output of each head is taken as the final output.
• Sparse attention. A crucial challenge of full attention
4.2.3 Pre-training Tasks
is the quadratic computational complexity, which becomes
a burden when dealing with long sequences. Therefore, Pre-training plays a key role that encodes general knowl-
various efficient Transformer variants are proposed to re- edge from large-scale corpus into the massive model param-
duce the computational complexity of the attention mecha- eters. For training LLMs, there are two commonly used pre-
nism [278, 279]. For instance, locally banded sparse attention training tasks, namely language modeling and denoising
(i.e., Factorized Attention [280] has been adopted in GPT- autoencoding.
3 [55]. Instead of the whole sequence, each query can only
Language Modeling. The language modeling task (LM) is
attend to a subset of tokens based on the positions.
the most commonly used objective to pre-train decoder-only
• Multi-query/grouped-query attention. Multi-query atten-
LLMs, e.g., GPT3 [55] and PaLM [56]. Given a sequence of
tion refers to the attention variant where different heads
tokens x = {x1 , . . . , xn }, the LM task aims to autoregres-
share the same linear transformation matrices on the keys
sively predict the target tokens xi based on the preceding
and values [281]. It achieves higher inference speed with
tokens x<i in a sequence. A general training objective is to
only a minor sacrifice in model quality. Representative
maximize the following likelihood:
models with multi-query attention include PaLM [56] and
StarCoder [98]. To make a trade-off between multi-query n
X
attention and multi-head attention, grouped-query attention LLM (x) = log P (xi |x<i ). (6)
(GQA) [282] has been explored. In GQA, heads are assigned i=1
into different groups, and those heads that belong to the
Since most language tasks can be cast as the prediction
same group will share the same transformation matrices.
problem based on the input, these decoder-only LLMs might
Specially, GQA has been adopted and empirically tested in
be potentially advantageous to implicitly learn how to ac-
the recently released LLaMA 2 model [99].
complish these tasks in a unified LM way. Some studies
• FlashAttention. Different from most existing approx- have also revealed that decoder-only LLMs can be naturally
imate attention methods that trade-off model quality to transferred to certain tasks by autoregressively predicting
improve the computing efficiency, FlashAttention [283] pro- the next tokens [26, 55], without fine-tuning. An important
poses to optimize the speed and memory consumption of variant of LM is the prefix language modeling task, which is
attention modules on GPUs from an IO-aware perspective. designed for pre-training models with the prefix decoder
There exist different levels of memory on modern GPUs, architecture. The tokens within a randomly selected prefix
e.g., SRAM with a fast IO and HBM with a relatively would not be used in computing the loss of prefix language
slow IO. FlashAttention organizes the input into blocks and modeling. With the same amount of tokens seen during pre-
introduces necessary recomputation, both to make better training, prefix language modeling performs slightly worse
use of the fast memory SRAM. Implemented as a fused than language modeling, since fewer tokens in the sequence
kernel in CUDA, FlashAttention has been integrated into are involved for model pre-training [29].
PyTorch [197], DeepSpeed [74], and Megatron-LM [75]. The
updated version FlashAttention-2 [284] further optimizes Denoising Autoencoding. In addition to conventional
the work partitioning of GPU thread blocks and warps, lead- LM, the denoising autoencoding task (DAE) has also been
ing to around 2× speedup when compared to the original widely used to pre-train language models [24, 82]. The
FlashAttention. inputs x\x̃ for DAE task are corrupted text with randomly
I am sleepy. I start a pot of including T5 bias [82], ALiBi [264], xPos [277] and even
NoPE [287]. However, as one of the mainstream position
coffee 0.661 strong 0.008 soup 0.005
embedding methods, RoPE exhibits limited extrapolation
water 0.119 black 0.008 ... ...
ability in empirical studies [240]. In the following, we dis-
tea 0.057 hot 0.007 happy 4.3e-6
cuss several methods that can scale RoPE to longer texts.
rice 0.017 oat 0.006 Boh 4.3e-6
chai 0.012 beans 0.006 ... ...
• Direct model fine-tuning. To adapt LLMs to a long con-
text window, a straightforward approach is to directly fine-
Fig. 10: The probability distribution over the vocabulary in tune the models on long texts with the desired length. The
descending order for the next token of the context “I am context extension can be scheduled with increased lengths
sleepy. I start a pot of ”. For ease of discussion, this example is in a multi-stage approach (e.g., 2K → 8K → 32K). To conduct
given in word units instead of subword units. effective extension, it needs specially prepared long texts
for training. Specially, some recent study has shown that
the quality is more important than the lengths of training
replaced spans. Then, the language models are trained to re- text in long context models [288]. However, a recent study
cover the replaced tokens x̃. Formally, the training objective has highlighted that the fine-tuning approach tends to be
of DAE is denoted as follows: inherently slow when adapting LLMs for long texts [240].
• Position interpolation. This method downscales the po-
LDAE (x) = log P (x̃|x\x̃ ). (7) sition indices within the original context window, to avoid
out-of-distribution rotation angles during pre-training [240,
However, the DAE task seems to be more complicated
289]. To be more specific, this approach multiplies all posi-
in implementation than LM task. As a result, it has not
tion indices by a coefficient L/L′ (L < L′ ), where L and
been widely used to pre-train large language models. Exist-
L′ represent the original and target context window length,
ing LLMs that take DAE as pre-training objectives include
respectively. Experimental results [240] have shown that
T5 [82] and GLM-130B [93]. These models are mainly trained
this method can extend the context window effectively and
to recover the replaced spans in an autoregressive way.
efficiently, compared to the above approach of direct model
Mixture-of-Denoisers. Mixture-of-Denoisers (MoD) [89], fine-tuning. However, it is worth noting that this technique
also known as UL2 loss, was introduced as a unified ob- may have an adverse impact on the model’s performance
jective for pre-training language models. MoD regards both when handling shorter texts[240, 290].
LM and DAE objectives as different types of denoising tasks, • Position truncation. To mitigate the challenges posed
namely S-denoiser (LM), R-denoiser (DAE, short span and by out-of-distribution rotation angles, another practical ap-
low corruption), and X-denoiser (DAE, long span or high proach is to truncate longer relative positions to satisfy the
corruption). Among the three denoising tasks, S-denoiser requirement of the maximum training length. Specifically,
is similar to the conventional LM objective (Equation (6)), ReRoPE and LeakyReRoPE [291] introduce a pre-defined
while R-denoiser and X-denoiser are similar to DAE ob- window length, which is smaller than the maximum train-
jectives (Equation (7)) but differ from each other in the ing length. Position indices within this pre-defined window
lengths of spans and ratio of corrupted text. For input sen- are retained, while those indices beyond the window are
tences started with different special tokens (i.e., {[R], [S], either truncated to the pre-defined window length or in-
[X]}), the model will be optimized using the corresponding terpolated to align with the maximum training length. This
denoisers. MoD has been applied in the latest PaLM 2 strategy can reserve local position relationships and enhance
model [120]. the extrapolation capacity. However, this approach needs
to compute the attention matrices twice, accommodating
4.2.4 Long Context Modeling additional computational budget.
In real applications, there is an increasing demand for long • Base modification. LLMs are usually trained with a pre-
context modeling capacities of LLMs, such as PDF pro- set maximum training length, e.g., 4096 in Llama 2 [99].
cessing and story writing [286]. Many closed-source LLMs However, wavelengths in certain dimensions of RoPE may
provide professional support for long text processing. For exceed the training length for longer text [276], so that
instance, OpenAI releases GPT-4 Turbo with a 128K context language models have not undergone sufficient training
window, and Anthropic releases Claude 2.1 with a 200K (i.e., a complete rotation cycle) on these dimensions. Thus,
context window. To enhance the long context modeling when we adapt LLMs to longer texts, the rotation angles
abilities, there are generally two feasible directions, namely for certain dimensions would be never seen in the training
scaling position embeddings and adapting context window. phase [292]. Given a fixed rotation angle t·θi , a smaller basis
Next, we introduce the two parts in detail. θi allows for a greater distance t, i.e., enabling the modeling
of longer texts [235, 276, 288]. According to the formula
Scaling Position Embeddings. Transformer-based LLMs θi = b−2(i−1)/d in Equation 4, decreasing the basis can be
can learn effective position embeddings within the maxi- achieved by increasing the value of the base. In addition,
mum training length. Thus, when adapting LLMs to lan- decreasing the base can also help re-scale the wavelengths
guage tasks beyond the maximum training length, it is of all dimensions below the training length, while it often
necessary to scale to larger position indices. Some specific needs continual pre-training to adapt the LLMs to long
position embeddings have been shown to possess a certain context windows [292]. A recent study [292] has empirically
degree of ability to generalize to text beyond the training compared these two base modification methods, and shown
length, which is formally termed extrapolation capability, that decreasing the base demonstrates a better extrapolation
capacity beyond the training length, while increasing the tentions and other efficient architectures, aiming to alleviate
base performs better within the training length. high computational cost for modeling long texts. These
• Basis truncation. Similar to the base modification, the studies have been extensively discussed in Section 4.2.1
truncation of the basis also concentrates on dealing with and Section 4.2.2. Furthermore, context compression and
the singular dimensions with wavelengths exceeding the prompting techniques (e.g., iterative reasoning [303]) have
training length [293]. According to the definition λi = 2π/θi also been proven to be a viable strategy for handling long
in Equation 5, the dimension with a large wavelength λi text tasks [303–306], without the need of model adaption.
has a small basis θi accordingly. Based on this observation,
this approach first defines a basis range [a, c]. Given the 4.2.5 Decoding Strategy
basis range, the value of basis is modified according to the After the LLMs have been pre-trained, it is essential to em-
following ways: (1) when θi ≥ c, the value is retained, ploy a specific decoding strategy to generate the appropriate
(2) when θi ≤ a, the value is set to zero, and (3) when output from the LLMs.
a < θi < c, the value is truncated to a fixed small
value. Via basis truncation, the out-of-distribution rotation Background. We start the discussion with the prevalent
angles can be avoided at larger position indices. However, decoder-only architecture, and introduce the auto-regressive
this approach does not perform very well at long context decoding mechanism. Since such LLMs are pre-trained
tasks [293]. based on the language modeling task (Equation 6), a basic
decoding method is greedy search that predicts the most
Adapting Context Window. Since Transformer-based LLMs likely token at each step based on the previously generated
have limited context windows, they can not directly inte- tokens, formally modeled as:
grate or utilize the entire information of the long sequences
exceeding the context window. To alleviate the limitation, xi = arg maxP (x|x<i ), (8)
several methods adapting LLMs to long context have been x

proposed, as discussed below. where xi is the token with the highest probability at i-
• Parallel context window. Inspired by fusion-in- th step of generation conditioned on the context x<i . For
decoder [294], parallel context window methods [295, 296] instance in Figure 10, when predicting the next token of
adopt a divide-and-conquer strategy to process input text. the sentence “I am sleepy. I start a pot of”, greedy search
Specially, it divides the input text into multiple segments, selects the token “coffee” which has the highest probability
each independently encoded with shared position embed- at the current step. Greedy search can achieve satisfactory
dings. In the generation stage, the attention masks are mod- results in text generation tasks (e.g., machine translation
ified to make that subsequent tokens can access to previous and text summarization), in which the output is highly
tokens in each segment. Nevertheless, this method cannot dependent on the input [307]. However, in terms of open-
distinguish the order of different segments, constraining the ended generation tasks (e.g., story generation and dialog),
model capacity on certain tasks. greedy search sometimes tends to generate awkward and
• Λ-shaped context window. Some prior work has revealed repetitive sentences [308].
that LLMs tend to allocate greater attention weights to As another alternative decoding strategy, sampling-
the starting and nearest tokens among all previous to- based methods are proposed to randomly select the next
kens [297, 298], so called the “lost in the middle” phe- token based on the probability distribution to enhance the
nomenon [299]. Based on this observation, LM-Infinite [300] randomness and diversity during generation:
and StreamingLLM [298] propose to employ a “Λ-shaped”
attention mask, which selectively preserves the initial tokens xi ∼ P (x|x<i ). (9)
and the nearest tokens that each query can attend to and
For the example in Figure 10, sampling-based methods will
then discards any tokens beyond this scope. Experiments
sample the word “coffee” with higher probability while
demonstrate that this method can facilitate extra-long text
also retaining the possibilities of selecting the rest words,
generation with a fixed memory [298]. However, it may
“water”, “tea”, “rice”, etc.
struggle to model the long-range dependency in prompts,
Not limited to the decoder-only architecture, these two
since it cannot effectively utilize the information from the
decoding methods can be generally applied to encoder-
discarded tokens [298].
decoder models and prefix decoder models in a similar way.
• External memory. It has been shown that a relatively
small subset of tokens can effectively capture the majority Improvement for Greedy Search. Selecting the token with
of attention patterns in a Transformer [301], i.e., the top- the highest probability at each step may result in overlook-
k attention keys can well approximate the original full ing a sentence with a higher overall probability but a lower
attention. Therefore, a number of studies propose to store local estimation. Next, we introduce several improvement
the past keys in external memory and utilize a k -NN strategies to alleviate this issue.
search method to retrieve the k most relevant tokens for • Beam search. Beam search [309] retains the sentences
generation [238, 301, 302]. For a decoder model, it typically with the n (beam size) highest probabilities at each step
employs one certain layer to access these top-k external during the decoding process, and finally selects the gener-
tokens, while still adopts the normal context window in the ated response with the top probability. Typically, the beam
rest layers [238, 302]. size is configured within the range of 3 to 6. However,
In addition to the studies based on vanilla Transformer, opting for a larger beam size might result in a decline in
there are a surge of Transformer variants with efficient at- performance [310].
• Length penalty. Since beam search favours shorter sen- thereby amplifying the impact of important tokens. Based
tences, imposing length penalty (a.k.a., length normaliza- on this contrastive idea, DoLa [319] further extends this
tion) is a commonly used technique [311] to overcome this approach to contrasting the logits across different layers of
issue, which normalizes the sentence probability according a single LLM, as higher layers tend to assign more weight
to the sentence length (divided by an exponential power α to important tokens.
of the length).
Besides, some researchers [312] propose to penalize the Memory Wall
generation of previously generated tokens or n-grams to
alleviate the issue of repetitive generation. In addition, When generating a new token, the most time-
diverse beam search [313] can be leveraged to produce a consuming steps revolve around data transfer and
set of diverse outputs based on the same input. weight computation. A main issue is the significant
amount of time overwhelmed by data transfer, of-
Improvement for Random Sampling. Sampling-based ten referred to as the memory wall issue.
methods sample the token over the whole vocabulary, which
may select wrong or irrelevant tokens (e.g., “happy” and To address this issue, researchers formally quantify
“Boh” in Figure 10) based on the context. To improve the data transfer from GPU memory to GPU caches
generation quality, several strategies have been proposed using the number of bytes in I/O, and they assess
for mitigating or preventing the selection of words with weight computation by measuring the number of
exceedingly low probabilities. FLOPs [320]. Specifically, let b, s, n, d, and h denote
• Temperature sampling. To modulate the randomness of the batch size, sequence length, number of attention
sampling, a practical method is to adjust the temperature heads, hidden size of each head, and overall hidden
coefficient of the softmax function for computing the proba- size (h = n · d), respectively. During the layer-
bility of the j -th token over the vocabulary: wise multi-head self-attention calculation in causal
decoder, the I/O bytes and FLOPs at each decoding
exp (lj /t) step can be expressed as 8bsn + 4bsnd + 4bnd and
P (xj |x<i ) = P , (10)
j ′ exp (lj ′ /t)
8bsnd, respectively [320].
where lj ′ is the logits of each word and t is the temperature Arithmetic intensity is further defined as the ratio of
coefficient. Reducing the temperature t increases the chance FLOPs to I/O bytes:
of selecting words with high probabilities while decreases FLOPs 2
the chances of selecting words with low probabilities. When intensity = = (11)
I/O bytes 1 + d2 + 1
s
t is set to 1, it becomes the default random sampling; when
t is approaching 0, it is equivalent to greedy search. In Let’s consider LLaMA 13B (d = 128) with a se-
addition, when t goes to infinity, it degenerates to uniform quence length of 1024 (s = 1024) as an example.
sampling. The calculated arithmetic intensity is 1.97. How-
• Top-k sampling. Different from temperature sampling, ever, the A100 80G GPU can perform 312 TFLOPs
top-k sampling directly truncates the tokens with lower and transfer 2 TB of data in one second, i.e., its ideal
probability and only samples from the tokens with the top arithmetic intensity is 156. This indicates that the
k highest probabilities [314]. For example in Figure 10, top- bottleneck in attention calculation lies in the process
5 sampling will sample from the words “coffee”, “water”, of data transfer (i.e., excessive I/O loading).
“tea”, “rice”, and “chai” from their re-scaled probabilities.
• Top-p sampling. Since top-k sampling does not consider
the overall possibility distribution, a constant value of k may Decoding Efficiency Issues. In this part, we briefly ana-
be not be suitable for different contexts. Therefore, top-p lyze the decoding efficiency issues of LLMs. Overall, the
sampling (a.k.a., nucleus sampling) is proposed by sampling decoding process of LLMs can be divided into two stages
from the smallest set having a cumulative probability above for overhead analysis: (1) the prefill stage, which computes
(or equal to) p [308]. In practice, the smallest set can be con- the hidden states of the input sequence, and (2) the incre-
structed by gradually adding tokens from the vocabulary mental decoding stage, which generates a token and updates
sorted in descending order of generative probability, until hidden states in an auto-regressive manner [321]. As shown
their cumulative value exceeds p. in the above memory wall box, the arithmetic intensity of
Recently, researchers have also explored other sampling the incremental decoding stage is only 1.97, which is far
strategies for LLMs. For instance, η -sampling [315] further from the expected value of 156 (calculated according to
improves top-p sampling by introducing a dynamic thresh- the standard configuration of A100 80GB GPU). In contrast,
old based on the probability distribution. Furthermore, con- the arithmetic intensity of the prefill stage achieves 113.78
trastive search [316] and typical sampling [317] can be utilized for LLaMA-13B. Consequently, existing work mainly inves-
to improve the generation coherence during decoding. Since tigates how to enhance the efficiency of the incremental
it has been found that large models tend to assign higher decoding algorithm, which can be categorized into two
probability to important tokens compared to small models, main approaches:
contrastive decoding [318] utilizes a larger LM (e.g., OPT- • Reducing data transfer mainly focuses on optimizing
13B) and a smaller LM (e.g., OPT-125M) to measure their GPU memory access, thereby increasing the arithmetic in-
log-likelihood differences. Subsequently, tokens are sampled tensity. As introduced in Section 4.2.2, KV cache can avoid
based on the delta value of the probability distribution, redundant computation of previous tokens and PagedAt-
tention allocates KV caches into continuous blocks to reduce Why does Predicting the Next Word Works?
memory fragmentation. Furthermore, Flash-Decoding [322]
speeds up attention computation by loading the keys and The essence of decoder-only architecture is to
values in parallel, especially effective for long text gen- accurately predict the next word for reconstructing
eration. As another alternative approach, multi-query and the pre-training data. Till now, there has been no
grouped-query attention can reduce the GPU memory band- formal study that theoretically demonstrates its
width overhead by sharing KV parameters (loading fewer advantage over other architectures. An interesting
weights). explanation was from Ilya Sutskever during the
• Decoding strategy optimization aims to improve the se- interview held by Jensen Huanga . The original
quential nature of the auto-regressive generation manner in transcript from the interview was copied belowb :
different ways. As a representative study, speculative decod-
ing [323, 324] first leverages a compact but efficient model Say you read a detective novel. It’s
(e.g., a n-gram model or a small PLM) to generate short like complicated plot, a storyline,
segments and then utilizes the LLM to verify and correct different characters, lots of events,
these drafts. It can lead to a notable 2× to 3× speedup mysteries like clues, it’s unclear.
without compromising the generation quality. Researchers Then, let’s say that at the last
further suggest several variants to improve the efficiency of page of the book, the detective has
this approach, such as a learning-based method to combine gathered all the clues, gathered
several small models [325] and a stage-wise acceleration all the people and saying, "okay,
which employs a more smaller LM to accelerate the small I’m going to reveal the identity of
LM first [326]. In addition, token-level early-exit techniques whoever committed the crime and that
have been proposed enabling the generation of a token at person’s name is". Predict that word.
lower Transformer layers, rather than passing through all ...
the layers [327]. It can attain greater speedup, but at the cost Now, there are many different words.
of sacrificing generation quality. But predicting those words better and
better, the understanding of the text
Practical Settings. In practice, existing libraries (e.g., Transformers [187]) and public APIs of LLMs (e.g., OpenAI) have supported various decoding strategies to serve different scenarios of text generation. Next, we present the decoding settings of several representative LLMs:
• T5 [82] utilizes greedy search as the default setting and applies beam search (beam size of 4) with a length penalty of 0.6 for translation and summarization tasks.
• GPT-3 [55] employs beam search with a beam size of 4 and a length penalty of 0.6 for all generation tasks.
• Alpaca [142] utilizes sampling-based strategies with top-k (k = 50), top-p (p = 0.9), and a temperature of 0.7 for open-ended generation.
• LLaMA [57] applies diverse decoding strategies tailored to specific tasks. For instance, it employs greedy search for question answering tasks, while it utilizes a sampling strategy with temperature settings of 0.1 (pass@1) and 0.8 (pass@100) for code generation.
• OpenAI API supports several basic decoding strategies, including greedy search (by setting temperature to 0), beam search (with the setting best_of), temperature sampling (with the setting temperature), and nucleus sampling (with the setting top_p). It also introduces the parameters presence_penalty and frequency_penalty to control the repetition degree of generation. According to OpenAI's documentation, their APIs would produce different outputs even if the input and the hyper-parameters are the same. Setting temperature to 0 can yield more deterministic outputs, albeit with a slight chance of variability.
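As an illustration of these parameters, the sketch below shows one way to set them with the openai Python client. The client interface and the model name are assumptions that may differ across library versions, so the official documentation should be treated as authoritative.

```python
# Minimal illustration of the decoding parameters named above (openai>=1.x client).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",          # illustrative model name
    messages=[{"role": "user", "content": "Write a two-sentence product blurb."}],
    temperature=0.7,                # 0 would give near-deterministic, greedy-like decoding
    top_p=0.9,                      # nucleus sampling threshold
    presence_penalty=0.5,           # discourage reusing topics already mentioned
    frequency_penalty=0.5,          # discourage repeating the same tokens
)
print(response.choices[0].message.content)
```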
4.2.6 Summary and Discussion
The choice of architecture and pre-training tasks may incur different inductive biases for LLMs, which would lead to different model capacities. In this part, we discuss one open issue about the architecture choice for LLMs.

… formal study that theoretically demonstrates its advantage over other architectures. An interesting explanation was from Ilya Sutskever during the interview held by Jensen Huang [a]. The original transcript from the interview was copied below [b]:

    "Say you read a detective novel. It's like complicated plot, a storyline, different characters, lots of events, mysteries like clues, it's unclear. Then, let's say that at the last page of the book, the detective has gathered all the clues, gathered all the people and saying, 'okay, I'm going to reveal the identity of whoever committed the crime and that person's name is'. Predict that word. ... Now, there are many different words. But predicting those words better and better, the understanding of the text keeps on increasing. GPT-4 predicts the next word better."

a. https://www.nvidia.com/en-us/on-demand/session/gtcspring23-S52092/
b. https://lifearchitect.ai/ilya/

Architecture Choice. In earlier literature on pre-trained language models, there are lots of discussions on the effects of different architectures [29, 89]. However, most LLMs are developed based on the causal decoder architecture, and there still lacks a theoretical analysis of its advantage over the other alternatives. Next, we briefly summarize existing discussions on this issue.
• By pre-training with the LM objective, it seems that the causal decoder architecture can achieve a superior zero-shot and few-shot generalization capacity. Existing research has shown that, without multi-task fine-tuning, the causal decoder has better zero-shot performance than other architectures [29]. The success of GPT-3 [55] has demonstrated that the large causal decoder model can be a good few-shot learner. In addition, instruction tuning and alignment tuning discussed in Section 5 have been proven to further enhance the capability of large causal decoder models [66, 67, 69].
• Scaling law has been widely observed in causal decoders. By scaling the model size, the dataset size, and the total computation, the performance of causal decoders can be substantially improved [30, 55]. Thus, it has become an important strategy to increase the model capacity of the causal decoder via scaling. However, more detailed investigation on encoder-decoder models is still lacking, and more efforts are needed to investigate the performance of encoder-decoder models at a large scale.
More research efforts on architectures and pre-training objectives are needed to analyze how the choices of architecture and pre-training tasks affect the capacity of LLMs, especially for encoder-decoder architectures. Despite the effectiveness of the decoder-only architecture, it is also suggested to make more diverse explorations on architecture design. Besides the major architecture, the detailed configuration of LLMs is also worth attention, which has been discussed in Section 4.2.2.

4.3 Model Training
In this part, we review the important settings, techniques, or tricks for training LLMs.

4.3.1 Optimization Setting
For parameter optimization of LLMs, we present the commonly used settings for batch training, learning rate, optimizer, and training stability.

Batch Training. For language model pre-training, existing work generally sets the batch size to a large number (e.g., 2,048 examples or 4M tokens) to improve the training stability and throughput. LLMs such as GPT-3 and PaLM have introduced a new strategy that dynamically increases the batch size during training, ultimately reaching a million scale. Specifically, the batch size of GPT-3 is gradually increased from 32K to 3.2M tokens. Empirical results have demonstrated that the dynamic schedule of batch size can effectively stabilize the training process of LLMs [56].

Learning Rate. Existing LLMs usually adopt a similar learning rate schedule with warm-up and decay strategies during pre-training. Specifically, in the initial 0.1% to 0.5% of the training steps, a linear warm-up schedule is employed for gradually increasing the learning rate to the maximum value, which ranges from approximately 5 × 10^-5 to 1 × 10^-4 (e.g., 6 × 10^-5 for GPT-3). Then, a cosine decay strategy is adopted in the subsequent steps, gradually reducing the learning rate to approximately 10% of its maximum value, until the convergence of the training loss.

Optimizer. The Adam optimizer [328] and AdamW optimizer [329] are widely utilized for training LLMs (e.g., GPT-3), which are based on adaptive estimates of lower-order moments for first-order gradient-based optimization. Commonly, their hyper-parameters are set as follows: β1 = 0.9, β2 = 0.95, and ε = 10^-8. Meanwhile, the Adafactor optimizer [330] has also been utilized in training LLMs (e.g., PaLM and T5), which is a variant of the Adam optimizer specially designed for conserving GPU memory during training. The hyper-parameters of the Adafactor optimizer are set as: β1 = 0.9 and β2 = 1.0 − k^-0.8, where k denotes the number of training steps.

Stabilizing the Training. During the pre-training of LLMs, the training instability issue often arises, which may cause model collapse. To address this issue, weight decay and gradient clipping have been widely utilized, where existing studies [55, 78, 90, 93, 113] commonly set the threshold of gradient clipping to 1.0 and the weight decay rate to 0.1. However, with the scaling of LLMs, the training loss spike is also more likely to occur, leading to unstable training. To mitigate this problem, PaLM [56] and OPT [90] use a simple strategy that restarts the training process from an earlier checkpoint before the occurrence of the spike and skips over the data that may have caused the problem. Further, GLM [93] finds that the abnormal gradients of the embedding layer usually lead to spikes, and proposes to shrink the embedding layer gradients to alleviate it.
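Putting the above settings together, the following is a minimal sketch of a GPT-3-style optimization setup in PyTorch: AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.1, linear warm-up followed by cosine decay to 10% of the peak learning rate, and gradient clipping at 1.0. The model, step counts, and dummy loss are placeholders rather than values from any specific system.

```python
import math
import torch

model = torch.nn.Linear(1024, 1024)            # stand-in for a large Transformer
peak_lr, min_lr_ratio = 6e-5, 0.1
total_steps, warmup_steps = 100_000, 500       # warm-up over roughly 0.5% of training

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr,
    betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)

def lr_lambda(step):
    # Linear warm-up, then cosine decay down to 10% of the peak learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(2_000):                      # short demo loop; real runs use ~total_steps
    loss = model(torch.randn(8, 1024)).pow(2).mean()          # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```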
4.3.2 Scalable Training Techniques
As the model and data sizes increase, it has become challenging to efficiently train LLMs under limited computational resources. In particular, two primary technical issues need to be resolved, i.e., increasing training throughput and loading larger models into GPU memory. In this part, we review several widely used approaches in existing work to address the above two challenges, namely 3D parallelism [75, 331, 332], ZeRO [333], and mixed precision training [334], and also give general suggestions about how to utilize them for training.

3D Parallelism. 3D parallelism is actually a combination of three commonly used parallel training techniques, namely data parallelism, pipeline parallelism [331, 332], and tensor parallelism [75]^24. We next introduce the three parallel training techniques.
• Data parallelism. Data parallelism is one of the most fundamental approaches to improving the training throughput. It replicates the model parameters and optimizer states across multiple GPUs and then distributes the whole training corpus into these GPUs. In this way, each GPU only needs to process its assigned data, and performs the forward and backward propagation to obtain the gradients. The computed gradients on different GPUs will be further aggregated to obtain the gradients of the entire batch for updating the models on all GPUs. Since the calculations of gradients are independently performed on different GPUs, the data parallelism mechanism is highly scalable, so that increasing the number of GPUs improves the training throughput. Furthermore, this technique is simple to implement, and most existing popular deep learning libraries have already implemented data parallelism, such as TensorFlow and PyTorch.
• Pipeline parallelism. Pipeline parallelism aims to distribute the different layers of a LLM across multiple GPUs. Especially, in the case of a Transformer model, pipeline parallelism loads consecutive layers onto the same GPU, to reduce the cost of transmitting the computed hidden states or gradients between GPUs. However, a naive implementation of pipeline parallelism may result in a lower GPU utilization rate, as each GPU has to wait for the previous one to complete the computation, leading to the unnecessary cost of bubbles overhead [331]. To reduce these bubbles in pipeline parallelism, GPipe [331] and PipeDream [332] propose the techniques of padding multiple batches of data and asynchronous gradient update to improve the pipeline efficiency.

24. Model parallelism is a broader term that includes tensor parallelism and pipeline parallelism in some work [75].
TABLE 7: Detailed optimization settings of several existing LLMs.

Model | Batch Size (#tokens) | Learning Rate | Warmup | Decay Method | Optimizer | Precision Type | Weight Decay | Grad Clip | Dropout
GPT3 (175B) 32K→3.2M 6 × 10−5 yes cosine decay to 10% Adam FP16 0.1 1.0 -
PanGu-α (200B) - 2 × 10−5 - - Adam - 0.1 - -
OPT (175B) 2M 1.2 × 10−4 yes manual decay AdamW FP16 0.1 - 0.1
PaLM (540B) 1M→4M 1 × 10−2 no inverse square root Adafactor BF16 lr2 1.0 0.1
BLOOM (176B) 4M 6 × 10−5 yes cosine decay to 10% Adam BF16 0.1 1.0 0.0
MT-NLG (530B) 64 K→3.75M 5 × 10−5 yes cosine decay to 10% Adam BF16 0.1 1.0 -
Gopher (280B) 3M→6M 4 × 10−5 yes cosine decay to 10% Adam BF16 - 1.0 -
Chinchilla (70B) 1.5M→3M 1 × 10−4 yes cosine decay to 10% AdamW BF16 - - -
Galactica (120B) 2M 7 × 10−6 yes linear decay to 10% AdamW - 0.1 1.0 0.1
LaMDA (137B) 256K - - - - BF16 - - -
Jurassic-1 (178B) 32 K→3.2M 6 × 10−5 yes - - - - - -
LLaMA (65B) 4M 1.5 × 10−4 yes cosine decay to 10% AdamW - 0.1 1.0 -
LLaMA 2 (70B) 4M 1.5 × 10−4 yes cosine decay to 10% AdamW - 0.1 1.0 -
Falcon (40B) 2M 1.85 × 10−4 yes cosine decay to 10% AdamW BF16 0.1 - -
GLM (130B) 0.4M→8.25M 8 × 10−5 yes cosine decay to 10% AdamW FP16 0.1 1.0 0.1
T5 (11B) 64K 1 × 10−2 no inverse square root AdaFactor - - - 0.1
ERNIE 3.0 Titan (260B) - 1 × 10−4 - - Adam FP16 0.1 1.0 -
PanGu-Σ (1.085T) 0.5M 2 × 10−5 yes - Adam FP16 - - -

• Tensor parallelism. Tensor parallelism is also a commonly used technique that aims to decompose the LLM for multi-GPU loading. Unlike pipeline parallelism, tensor parallelism focuses on decomposing the tensors (the parameter matrices) of LLMs. For a matrix multiplication operation Y = XA in the LLM, the parameter matrix A can be split into two submatrices, A1 and A2, by column, which can be expressed as Y = [XA1, XA2]. By placing matrices A1 and A2 on different GPUs, the matrix multiplication operation would be invoked at two GPUs in parallel, and the final result can be obtained by combining the outputs from the two GPUs through cross-GPU communication. Currently, tensor parallelism has been supported in several open-source libraries, e.g., Megatron-LM [75], and can be extended to higher-dimensional tensors. Also, Colossal-AI has implemented tensor parallelism for higher-dimensional tensors [335–337] and proposed sequence parallelism [338] especially for sequence data, which can further decompose the attention operation of the Transformer model.

ZeRO. The ZeRO [333] technique, proposed by the DeepSpeed [74] library, focuses on the issue of memory redundancy in data parallelism. As mentioned before, data parallelism requires each GPU to store the same copy of a LLM, including model parameters, model gradients, and optimizer parameters. However, not all of the above data needs to be retained on each GPU, which would cause a memory redundancy problem. To resolve it, the ZeRO technique aims to retain only a fraction of the data on each GPU, while the rest can be retrieved from other GPUs when required. Specifically, ZeRO provides three solutions, depending on how the three parts of the data are stored, namely optimizer state partitioning, gradient partitioning, and parameter partitioning. Empirical results indicate that the first two solutions do not increase the communication overhead, and the third solution increases the communication overhead by about 50% but saves memory proportional to the number of GPUs. PyTorch has implemented a similar technique to ZeRO, called FSDP [339].

Mixed Precision Training. In previous PLMs (e.g., BERT [23]), 32-bit floating-point numbers, also known as FP32, have been predominantly used for pre-training. In recent years, to pre-train extremely large language models, some studies [334] have started to utilize 16-bit floating-point numbers (FP16), which reduces memory usage and communication overhead. Additionally, as popular NVIDIA GPUs (e.g., A100) have twice the amount of FP16 computation units as FP32, the computational efficiency of FP16 can be further improved. However, existing work has found that FP16 may lead to the loss of computational accuracy [64, 78], which affects the final model performance. To alleviate it, an alternative called Brain Floating Point (BF16) has been used for training, which allocates more exponent bits and fewer significand bits than FP16. For pre-training, BF16 generally performs better than FP16 on representation accuracy [78].

Overall Training Suggestion. In practice, the above training techniques, especially 3D parallelism, are often jointly used to improve the training throughput and large model loading. For instance, researchers have incorporated 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism, enabling the training of BLOOM [78] on 384 A100 GPUs. Currently, open-source libraries like DeepSpeed [74], Colossal-AI [189], and Alpa [340] can well support the three parallel training methods. To reduce the memory redundancy, ZeRO, FSDP, and activation recomputation techniques [77, 341] can also be employed for training LLMs, which have already been integrated into DeepSpeed, PyTorch, and Megatron-LM. In addition, the mixed precision training technique such as BF16 can also be leveraged to improve the training efficiency and reduce GPU memory usage, while it requires necessary support on hardware (e.g., A100 GPU). Because training large models is a time-intensive process, it would be useful to forecast the model performance and detect abnormal issues at an early stage. For this purpose, GPT-4 [46] has recently introduced a new mechanism called predictable scaling built on a deep learning stack, enabling the performance prediction of large models with a much smaller model, which might be quite useful for developing LLMs. In practice, one can further leverage the supporting training techniques of mainstream deep learning frameworks. For instance, PyTorch supports the data parallel training algorithm FSDP [339] (i.e., fully sharded data parallel), which allows for partial offloading of training computations to CPUs if desired.
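To make the column-wise parameter split used by tensor parallelism (Y = [XA1, XA2] above) concrete, the following single-device sketch simply verifies the algebra; in a real system the two halves would live on different GPUs and their partial results would be combined through cross-GPU communication.

```python
import torch

# Verify the column-wise split Y = [X A1, X A2] behind tensor parallelism.
X = torch.randn(4, 8)           # activations
A = torch.randn(8, 6)           # parameter matrix of a linear layer
A1, A2 = A[:, :3], A[:, 3:]     # split A by columns across two "devices"

Y_parallel = torch.cat([X @ A1, X @ A2], dim=-1)   # concatenate the partial outputs
Y_full = X @ A
print(torch.allclose(Y_parallel, Y_full, atol=1e-6))   # True
```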
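Following the overall training suggestion above, this is a minimal, hypothetical sketch of wrapping a model with PyTorch's FSDP using BF16 mixed precision and optional CPU offloading. It assumes a BF16-capable GPU (e.g., A100) and an already-initialized distributed process group, and it is not a complete training script.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, CPUOffload

model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,    # parameters are cast to BF16 for compute
    reduce_dtype=torch.bfloat16,   # gradients are reduced in BF16
    buffer_dtype=torch.bfloat16,
)

fsdp_model = FSDP(
    model,
    mixed_precision=bf16_policy,
    cpu_offload=CPUOffload(offload_params=True),  # offload sharded params to CPU memory
)
```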
5 ADAPTATION OF LLMS
After pre-training, LLMs can acquire the general abilities for solving various tasks. However, an increasing number of studies have shown that LLMs' abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning. The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences. Further, we will also discuss efficient tuning and quantization for model adaptation in resource-limited settings. In what follows, we will introduce the four parts in detail.

5.1 Instruction Tuning
In essence, instruction tuning is the approach to fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [67], which is highly related to supervised fine-tuning [66] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 67, 69], even in a multilingual setting [94].
A recent survey [342] presents a systematic overview of the research on instruction tuning. In comparison to that, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. In addition, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [66] and GPT-4 [46].

5.1.1 Formatted Instance Construction
Generally, an instruction-formatted instance consists of a task description (called an instruction), an optional input, the corresponding output, and a small number of demonstrations (optional). As important public resources, existing studies have released a large number of labeled data formatted in natural language (see the list of available resources in Table 3) as introduced in Section 3.3.1. Next, we introduce three major methods for constructing formatted instances (see an illustration in Figure 11) and then discuss several key factors for instance construction.

Formatting NLP Task Datasets. Before instruction tuning was proposed, several early studies [168, 343, 344] collected instances from a diverse range of traditional NLP tasks (e.g., text summarization, text classification, and translation) to create supervised multi-task training datasets. As a major source of instruction tuning instances, it is convenient to format these multi-task training datasets with natural language task descriptions. Specifically, recent work [28, 66, 67, 88] augments the labeled datasets with human-written task descriptions, which instructs LLMs to understand the tasks by explaining the task goal. For example, in Figure 11(a), a task description "Please answer this question" is added for each example in the question-answering task. After instruction tuning, LLMs can generalize well to other unseen tasks by following their task descriptions [28, 67, 69]. In particular, it has been shown that instructions are the crucial factor in the task generalization ability of LLMs [67]: fine-tuning the model on labeled datasets with the task descriptions removed results in a dramatic drop in model performance. To better generate labeled instances for instruction tuning, a crowd-sourcing platform, PromptSource [167], has been proposed to effectively create, share, and verify the task descriptions for different datasets. To enrich the training instances, several studies [28, 168, 345] also try to invert the input-output pairs of existing instances with specially designed task descriptions for instruction tuning. For instance, given a question-answer pair, we can create a new instance by predicting the answer-conditioned question (e.g., "Please generate a question based on the answer:").

Formatting Daily Chat Data. Although a large number of training instances have been formatted with instructions, they mainly come from public NLP datasets, either lacking instruction diversity or mismatching with real human needs [66]. To overcome this issue, InstructGPT [66] proposes to take the queries that real users have submitted to the OpenAI API as the task descriptions. Additionally, to enrich the task diversity, human labelers are also asked to compose instructions for real-life tasks, including open-ended generation, open question answering, brainstorming, and chatting. Then, they let another group of labelers directly answer these instructions as the output. Finally, they pair one instruction (i.e., the collected user query) and the expected output (i.e., the human-written answer) as a training instance. Note that InstructGPT also employs these real-world tasks formatted in natural language for alignment tuning (discussed in Section 5.2). Further, GPT-4 [46] has designed potentially high-risk instructions and guided the model to reject these instructions through supervised fine-tuning for safety concerns. Considering the absence of high-quality public chat data, several studies have also collected users' chat requests as input data, and then utilized ChatGPT or GPT-4 to generate responses as output data. A notable example of such a dataset is the conversational data from ShareGPT [148]. Additionally, Dolly [172] and OpenAssistant [173] have further released their conversation data, which has been carefully labeled by human annotators to attain a high level of quality.

Formatting Synthetic Data. To reduce the burden of human annotation or manual collection, several semi-automated approaches [143] have been proposed for constructing instances by feeding existing instances into LLMs to synthesize diverse task descriptions and instances. As illustrated in Figure 11(c), the Self-Instruct method only needs 175 instances as the initial task pool. Then, a few instances are randomly selected from the pool as demonstrations to prompt a LLM to generate new instructions and corresponding input-output pairs. After the quality and diversity filtering, newly generated instances would be added into the task pool.
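As a concrete, hypothetical illustration of the instance structure described at the beginning of Section 5.1.1 (task description, optional input, output, and optional demonstrations), the snippet below assembles such an instance into a single prompt/target pair; the field names and the flattening scheme are illustrative rather than a fixed standard.

```python
# A hypothetical instruction-formatted instance (field names are illustrative).
instance = {
    "instruction": "Please answer this question.",
    "demonstrations": [
        {"input": "Q: What is the capital of France?", "output": "A: Paris."},
    ],
    "input": "Q: What is the capital of China?",
    "output": "A: Beijing.",
}

# For sequence-to-sequence style instruction tuning, the instance is typically
# flattened into a prompt (model input) and a target (expected output).
prompt = instance["instruction"] + "\n" + "\n".join(
    d["input"] + "\n" + d["output"] for d in instance["demonstrations"]
) + "\n" + instance["input"]
target = instance["output"]
```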
Fig. 11: An illustration of instance formatting and three different methods for constructing the instruction-formatted instances: (a) Formatting Task Datasets, (b) Formatting Daily Chat Data, and (c) Formatting Synthetic Data.

Hence, the synthetic method is an effective and economical way to generate large-scale instruction data for LLMs. However, the instances generated by the Self-Instruct method might be simplistic or lack diversity. To improve the quality of synthetic instructions, WizardLM [346] introduces Evol-Instruct by proposing in-depth and in-breadth evolving to enrich the complexity and diversity of the instances. Furthermore, Self-Align [347] establishes multiple human-aligned principles to filter the synthesized instances. It then employs these instances to train a LLM in order to yield more aligned instances. To enhance the quality of the instance output, researchers directly adopt human-written texts as the output and synthesize corresponding instructions using ICL examples [348].

Key Factors for Instance Construction. The quality of instruction instances has an important impact on the performance of the model. Here, we discuss some essential factors for instance construction.
• Scaling the instructions. It has been widely shown that scaling the number of tasks can largely enhance the generalization ability of LLMs [28, 67, 88]. As the number of tasks increases, the model performance initially shows a continuous growth pattern, while the gain becomes negligible when it reaches a certain level [69, 88]. A plausible speculation is that a certain number of representative tasks can provide relatively sufficient knowledge and adding more tasks may not bring additional gains [69]. Also, it is beneficial to enhance the diversity of the task descriptions in several aspects, such as length, structure, and creativity [28]. As for the number of instances per task, it has been found that a small number of instances can usually saturate the generalization performance of the model on a specific task [67, 69]. Specially, several recent works [349, 350] have explored the effect of fine-tuning with a small amount of high-quality instruction data (e.g., one or a few thousand instances), showing very promising results on the evaluation tasks. In contrast, another line of studies continues to explore the scaling effect of instruction data [351, 352]. For example, Orca [351] scales up the synthesized instances to 5 million with step-by-step explanations, and it achieves superior performance across a wide range of tasks compared to the methods tuned with instruction data.
• Formatting design. As an important factor, the design of the natural language format also highly impacts the generalization performance of LLMs [88]. Typically, we can add task descriptions and optional demonstrations to the input-output pairs of existing datasets, where the task description is the most critical part for LLMs to understand the task [88]. Further, it can lead to substantial improvements by using an appropriate number of exemplars as demonstrations [69], which also alleviates the model sensitivity to instruction engineering [67, 69]. However, incorporating other components (e.g., things to avoid, reasons, and suggestions) into instructions may have a negligible or even adverse effect on the performance of LLMs [88, 166]. Recently, to elicit the step-by-step reasoning ability of LLMs, some work [69] proposes to include chain-of-thought (CoT) examples for some reasoning datasets, such as arithmetic reasoning. It has been shown that fine-tuning LLMs with both CoT and non-CoT examples can lead to a good performance across various reasoning tasks, including those that require multi-hop reasoning ability (e.g., commonsense question answering and arithmetic reasoning) as well as those without the need for such a reasoning way (e.g., sentiment analysis and extractive question answering) [69, 95].
To summarize, diversity and quality of instructions seem to be more important than the number of instances [349], since the well-performing InstructGPT [66] and LLaMA-2-Chat [99] utilize fewer but more diverse instructions (or instances) than the Flan-series LLMs [67, 69]. However, a large amount of training data may compensate for the absence of high-quality data [351]. Further, it is more useful to invite labelers to compose human-need tasks than to use dataset-specific tasks. However, there still lack general guidelines to annotate human-need instances, making the task composition somewhat heuristic. To reduce human efforts, we can either reuse existing formatted datasets (Table 3) or automatically construct the instructions using existing LLMs [143].
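The automatic construction route can be pictured with the following schematic, Self-Instruct-style generation loop; llm_generate and passes_quality_filter are hypothetical helpers standing in for an LLM API call and the quality/diversity filtering step, and the prompt format is only illustrative.

```python
import random

def self_instruct(seed_tasks, llm_generate, passes_quality_filter,
                  target_size=52_000, num_demos=4):
    """Grow an instruction pool from a small set of human-written seed tasks."""
    task_pool = list(seed_tasks)                  # e.g., 175 human-written seed instances
    while len(task_pool) < target_size:
        demos = random.sample(task_pool, num_demos)
        prompt = "Come up with a new task.\n\n" + "\n\n".join(
            f"Instruction: {t['instruction']}\nInput: {t['input']}\nOutput: {t['output']}"
            for t in demos
        ) + "\n\nInstruction:"
        candidate = llm_generate(prompt)          # returns a dict with the same keys
        if passes_quality_filter(candidate, task_pool):
            task_pool.append(candidate)           # keep only novel, well-formed instances
    return task_pool
```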
We conduct a preliminary experiment to show the effectiveness of different construction methods in Section 5.1.4.

5.1.2 Instruction Tuning Strategies
Unlike pre-training, instruction tuning is often more efficient since only a moderate number of instances are used for training. Since instruction tuning can be considered as a supervised training process, its optimization differs from pre-training in several aspects [69], such as the training objective (i.e., sequence-to-sequence loss) and optimization configuration (e.g., smaller batch size and learning rate), which require special attention in practice. In addition to these optimization configurations, there are also four important aspects to consider for instruction tuning:

Balancing the Data Distribution. Since instruction tuning involves a mixture of different tasks, it is important to balance the proportion of different tasks during fine-tuning. A widely used method is the examples-proportional mixing strategy [82], i.e., combining all the datasets and sampling each instance equally from the mixed datasets. Furthermore, increasing the sampling ratio of high-quality collections (e.g., FLAN [67] and P3 [167]) can generally lead to performance improvement according to recent findings [69, 95]. Further, it is common to set a maximum cap to control the maximum number of examples that a dataset can contain during instruction tuning [82], which is set to prevent larger datasets from overwhelming the entire distribution [82, 95]. In practice, the maximum cap is typically set to several thousands or tens of thousands according to different datasets [67, 69]. Recently, it has been empirically found that existing instruction datasets (Table 3) mainly focus on enhancing LLMs' capabilities in certain aspects, and a single dataset alone cannot lead to a comprehensive enhancement in model capacity [353]. Therefore, it is often suggested to use a mixture of existing instruction datasets to achieve a balanced improvement in different capacities, including NLP task data (e.g., FLAN v2 [292]), chat data (e.g., ShareGPT [148]), and synthetic data (e.g., GPT4-Alpaca [354]).

Combining Instruction Tuning and Pre-Training. To make the tuning process more effective and stable, OPT-IML [95] incorporates pre-training data during instruction tuning, which can be regarded as regularization for model tuning. Further, instead of using a separate two-stage process (pre-training then instruction tuning), some studies attempt to train a model from scratch with a mixture of pre-training data (i.e., plain texts) and instruction tuning data (i.e., formatted datasets) using multi-task learning [82]. Specifically, GLM-130B [93] and Galactica [35] integrate instruction-formatted datasets as a small proportion of the pre-training corpora to pre-train LLMs, which potentially achieves the advantages of pre-training and instruction tuning at the same time.

Multi-stage Instruction Tuning. For instruction tuning, there are two kinds of important instruction data, namely task-formatted instructions and daily chat instructions. Generally, the former has a significantly larger volume than the latter. It is important to balance the training with the two kinds of instruction data. In addition to carefully mixing different instruction data, we can also adopt a multi-stage instruction tuning strategy [352], where LLMs are first fine-tuned with large-scale task-formatted instructions and subsequently fine-tuned on daily chat ones. To avoid the capacity forgetting issue, it is also useful to add an amount of task-formatted instructions at the second stage. Actually, such a multi-stage tuning strategy can also be applied to other settings for instruction tuning. For example, we can schedule different fine-tuning stages with progressively increased levels of difficulty and complexity, and gradually improve the capacities of LLMs to follow complex instructions.

Other Practical Tricks. In practice, there are also several useful strategies and tricks that are helpful to improve the fine-tuning performance of LLMs. We list several representative ones as follows:
• Efficient training for multi-turn chat data. Given a multi-turn chat example (the conversation between a user and a chatbot), a straightforward fine-tuning way is to split it into multiple context-response pairs for training: a LLM is fine-tuned to generate the response based on the corresponding context for all splits (i.e., at each utterance from the user). In such a fine-tuning way, it is apparent that there exist overlapping utterances in the split examples from a conversation. To save the training cost, Vicuna [138] has adopted an efficient way that feeds the whole conversation into the LLM, but relies on a loss mask that only computes the loss on the responses of the chatbot for training (see the illustrative sketch after this list). It can significantly reduce the compute costs derived from the overlapped utterances.
• Establishing self-identification for LLM. To deploy LLMs for real-world applications, it is necessary to establish their identity and make LLMs aware of such identity information, such as name, developer and affiliation. A practical way is to create identity-related instructions for fine-tuning the LLM. It is also feasible to prefix the input with the self-identification prompt, e.g., "The following is a conversation between a human and an AI assistant called CHATBOTNAME, developed by DEVELOPER.", where CHATBOTNAME and DEVELOPER refer to the name and developer of the chatbot, respectively.
In addition to the above practical strategies and tricks, existing work has also used other tricks, e.g., concatenating multiple examples into a single sequence to approach the max length [355].
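The loss-mask trick for multi-turn chat data can be sketched as follows. Here, tokenize is a hypothetical tokenizer returning token ids, and the use of -100 as the ignored label index follows the PyTorch cross-entropy convention rather than any particular codebase.

```python
import torch

def build_example(turns, tokenize):
    """Feed the whole conversation once; compute the loss only on chatbot tokens."""
    input_ids, labels = [], []
    for role, text in turns:                 # turns: [("user", ...), ("assistant", ...), ...]
        ids = tokenize(text)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)               # learn to produce the response tokens
        else:
            labels.extend([-100] * len(ids)) # -100 is ignored by cross-entropy
    return torch.tensor(input_ids), torch.tensor(labels)

# During training, the masked labels restrict the loss to response tokens, e.g.:
#   loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```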
5.1.3 The Effect of Instruction Tuning
In this part, we discuss the effect of instruction tuning on LLMs in three major aspects.

Performance Improvement. Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [69]. Recent studies have experimented with language models at multiple scales (ranging from 77M to 540B), showing that models of different scales can all benefit from instruction tuning [69, 345], yielding improved performance as the parameter scale increases [94]. Further, smaller models with instruction tuning can even perform better than larger models without fine-tuning [28, 69]. Besides the model scale, instruction tuning demonstrates consistent improvements in various model architectures, pre-training
TABLE 8: Basic statistics of the required number of GPUs, tuning time, batch size (denoted as BS) per device (full tuning and LoRA tuning), and inference rate (the number of generated tokens per second). Our experiments are conducted on two Linux servers having 8 A800-80G SXM4 GPUs with 6 NVSwitch and 8 3090-24G GPUs, respectively. The major difference between A800 and A100 lies in the NVLink interconnect speed. Thus, our estimations about training and inference efficiency would be slightly improved for A100, while the rest of the memory consumption would remain the same. For full tuning experiments, we use data parallel training, ZeRO Stage 3, BF16, and gradient checkpointing. Additionally, the LoRA tuning can be executed on one 80G GPU utilizing INT8 quantization with the rank set to 16. All the experiments are conducted on the Alpaca-52K dataset by training LLaMA models for three epochs. The max sequence length for both training settings is set to 512. The inference experiments are performed with the batch size set to 1.

Models | A800 Full Training (#GPU, BS, Time) | A800 LoRA Training (#GPU, BS, Time) | A800 Inference 16-bit (#GPU, #Token/s) | 3090 Inference 16-bit (#GPU, #Token/s) | 3090 Inference 8-bit (#GPU, #Token/s)
LLaMA (7B) 2 8 3.0h 1 80 3.5h 1 36.6 1 24.3 1 7.5
LLaMA (13B) 4 8 3.1h 1 48 5.1h 1 26.8 2 9.9 1 4.5
LLaMA (30B) 8 4 6.1h 1 24 14.3h 1 17.7 4 3.8 2 2.6
LLaMA (65B) 16 2 11.2h 1 4 60.6h 2 8.8 8 2.0 4 1.5

objectives, and model adaptation methods [69]. In practice, instruction tuning offers a general approach to enhancing the abilities of existing language models [69] (including small-sized PLMs). Also, it is much less costly than pre-training, since the amount of instruction data required by LLMs is significantly smaller than the pre-training data.

Task Generalization. Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered as an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [69]. A large number of studies have confirmed the effectiveness of instruction tuning to achieve superior performance on both seen and unseen tasks [95, 345]. Also, instruction tuning has been shown to be useful in alleviating several weaknesses of LLMs (e.g., repetitive generation or complementing the input without accomplishing a certain task) [66, 69], leading to a superior capacity to solve real-world tasks for LLMs. Furthermore, LLMs trained with instruction tuning can generalize to related tasks across languages. For example, BLOOMZ-P3 [94] is fine-tuned based on BLOOM [78] using the English-only task collection P3 [167]. Interestingly, BLOOMZ-P3 can achieve a more than 50% improvement in multilingual sentence completion tasks compared to BLOOM, which shows that instruction tuning can help LLMs acquire general task skills from English-only datasets and transfer such skills into other languages [94]. In addition, it has been found that using English-only instructions can produce satisfactory results on multilingual tasks [94], which helps reduce the effort of instruction engineering for a specific language.

Domain Specialization. Existing LLMs have showcased superior capabilities in traditional NLP tasks (e.g., generation and reasoning) and daily questions. However, they may still lack domain knowledge to accomplish specific tasks, such as medicine, law, and finance (see Section 8 for a detailed discussion of LLMs in different applications). Instruction tuning is an effective approach to adapting existing general LLMs to be domain-specific experts. For instance, researchers propose to fine-tune Flan-PaLM [69] using medical datasets to create Med-PaLM [356], a medical knowledge assistant that achieves performance levels comparable to those of expert clinicians. Furthermore, a recent study [357] fine-tunes FLAN-T5 to support e-commerce recommender systems with natural language instructions, showing strong performance in a variety of recommendation tasks. There are also several open-sourced medical models instruction-tuned based on LLaMA [57], such as BenTsao [358]. Also, researchers explore instruction tuning on law [359], finance [360], and arithmetic computation [361].

5.1.4 Empirical Analysis for Instruction Tuning
Fine-tuning LLMs with different instruction sets tends to lead to model variants with varied performance on downstream tasks. In this section, we will explore the effect of different types of instructions in fine-tuning LLMs (i.e., LLaMA (7B) and LLaMA (13B)^25), as well as examine the usefulness of several instruction improvement strategies.

Instruction Datasets. According to the discussion in Section 5.1.1, we mainly consider three common kinds of instructions as follows:
• Task-specific instructions. For the first type of instructions, we adopt the most commonly-used multi-task instruction dataset, FLAN-T5 [69], which contains 1,836 tasks and over 15M instructions by combining four data mixtures from prior work.
• Daily chat instructions. This type of instruction consists of conversations posed by users about daily life, which are more closely related to real-life scenarios. We adopt the ShareGPT instruction set, consisting of 63K real-user instructions. It has been used as the core instruction set for Vicuna.
• Synthetic instructions. In addition to reusing existing instructions, we can also automatically synthesize massive instructions using LLMs. We adopt the popular synthetic instruction dataset Self-Instruct-52K [143], consisting of 52K instructions paired with about 82K instance inputs and outputs. These generated instructions have a similar data distribution as the human-written seed tasks (e.g., grammar checking, brainstorming).

25. Due to the limit of computational resources, we cannot conduct large-scale experiments on larger LLaMA variants right now, which would be scheduled in a future version.
TABLE 9: Results of instruction-tuning experiments (all in a single-turn conversation) based on the LLaMA (7B) and LLaMA
(13B) model under the chat and QA setting. We employ four instruction improvement strategies on the Self-Instruct-52K
dataset, i.e., enhancing the complexity (w/ complexity), increasing the diversity (w/ diversity), balancing the difficulty (w/
difficulty), and scaling the instruction number (w/ scaling). ∗ Since we select the LLaMA (7B)/(13B) model fine-tuned on
Self-Instruct-52K as the baseline, we omit the win rate of the fine-tuned model with Self-Instruct-52K against itself.

Models | Dataset Mixtures | Instruction Numbers | Lexical Diversity | Chat (AlpacaFarm) | QA (MMLU) | QA (BBH3k)
LLaMA (7B) ① FLAN-T5 80,000 48.48 23.77 38.58 32.79
② ShareGPT 63,184 77.31 81.30 38.11 27.71
③ Self-Instruct-52K 82,439 25.92 /∗ 37.52 29.81
②+③ 145,623 48.22 71.36 41.26 28.36
①+②+③ 225,623 48.28 70.00 43.69 29.69
③ Self-Instruct-52K 82,439 25.92 /∗ 37.52 29.81
w/ complexity 70,000 70.43 76.96 39.73 33.25
w/ diversity 70,000 75.59 81.55 38.01 30.03
w/ difficulty 70,000 73.48 79.15 32.55 31.25
w/ scaling 220,000 57.78 51.13 33.81 26.63
LLaMA (13B) ① FLAN-T5 80,000 48.48 22.12 34.12 34.05
② ShareGPT 63,184 77.31 77.13 47.49 33.82
③ Self-Instruct-52K 82,439 25.92 /∗ 36.73 25.43
②+③ 145,623 48.22 72.85 41.16 29.49
①+②+③ 225,623 48.28 69.49 43.50 31.16
③ Self-Instruct-52K 82,439 25.92 /∗ 36.73 25.43
w/ complexity 70,000 70.43 77.94 46.89 35.75
w/ diversity 70,000 75.59 78.92 44.97 36.40
w/ difficulty 70,000 73.48 80.45 43.15 34.59
w/ scaling 220,000 57.78 58.12 38.07 27.28

As the original FLAN-T5 dataset is very large (i.e., over 15M), we randomly sample 80,000 instructions from it for conducting a fair comparison with other instruction datasets (i.e., ShareGPT and Self-Instruct-52K) at a similar scale. In our experiments, we test on each individual instruction set to explore their own effects and also examine their combinatorial effects on model performance.

Improvement Strategies. Although real-world instructions from human users are more suitable for fine-tuning LLMs, it is difficult to collect them at a large scale. As alternatives to human-generated instructions, most existing research mainly adopts synthetic instructions generated by LLMs. However, there are some potential problems with synthetic instructions, such as poor topic diversity and uneven instruction difficulty (either too simple or too difficult). Thus, it is necessary to improve the quality of the synthetic instructions. Next, we summarize four major improvement strategies widely used in existing work as follows:
• Enhancing the instruction complexity. As discussed in existing work [346], enhancing the complexity of instructions can improve the model capacity of LLMs in following complex instructions, e.g., including more task demands or requiring more reasoning steps. To validate this strategy, we follow WizardLM [346] by gradually increasing the complexity levels, e.g., adding constraints, increasing reasoning steps, and complicating the input. We leverage the publicly released WizardLM-70K instructions [346] as the complexity-enhanced instruction dataset, which has been generated via the above enhancement approach based on the Self-Instruct-52K dataset [346].
• Increasing the topic diversity. In addition to the complexity, improving the topic diversity of the instruction dataset can help elicit different abilities of LLMs on diverse tasks in the real world [347]. However, it is difficult to directly control the self-instruct process for generating diverse instructions. Following YuLan-Chat [352], we employ ChatGPT to rewrite the instructions from the Self-Instruct-52K dataset, adapting them into 293 topics via specific prompts. Finally, we obtain 70K instructions as the diversity-increased dataset.
• Scaling the instruction number. In addition to the above aspects, the number of instructions is also an important factor that may affect the model performance. Specially, using more instructions can extend the task knowledge and improve the instruction following ability of LLMs [69]. To examine this strategy, we sample new instructions from the synthesized instruction set released by the MOSS project [362], as they are also synthesized using the same self-instruct method [143]. We mix them with the Self-Instruct-52K dataset to compose a larger one containing 220K instructions.
• Balancing the instruction difficulty. As the synthetic instructions tend to contain too easy or too hard ones, they are likely to result in training instability or even overfitting for LLMs. To explore the potential effects, we leverage the perplexity score of LLMs to estimate the difficulty of instructions and remove too easy or too hard instructions. To generate the same scale of instructions for fair comparison, we adopt a LLaMA (7B) model to compute the perplexity for the 220K instructions from the large instruction dataset, and then keep the 70K instructions with moderate perplexity scores as the difficulty-balanced dataset.
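The difficulty-balancing step described above can be sketched as follows. Here, model and tokenizer are assumed to be a Hugging Face causal LM (e.g., a LLaMA 7B checkpoint) and its tokenizer, and keeping the middle of the perplexity-sorted list is one simple way to realize "moderate" difficulty, not necessarily the exact filtering rule used in the experiments above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokenizer, text):
    """Perplexity of a single instruction under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
    return nll.exp().item()

def balance_difficulty(instructions, model, tokenizer, keep=70_000):
    """Drop the easiest and hardest tails, keeping instructions of moderate perplexity."""
    scored = sorted(instructions, key=lambda x: perplexity(model, tokenizer, x))
    start = (len(scored) - keep) // 2
    return scored[start:start + keep]
```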
Experimental Setup. To conduct the experiments on the effect of instruction data, we leverage these new instruction datasets for tuning LLaMA, a popular LLM backbone that has been widely used for instruction tuning. We use the code from YuLan-Chat [352] for our experiments, and train LLaMA 7B and 13B on a server of 8 A800-80G GPUs. All the hyper-parameter settings remain the same as Stanford Alpaca. To better evaluate the instruction following ability of fine-tuned models, we consider two settings, namely the Chat setting and the QA setting. The chat setting mainly utilizes user instructions and queries from daily chat, whereas the QA setting mainly employs question answering examples from existing NLP datasets. The evaluation on the chat setting is conducted based on the AlpacaFarm evaluation set [363]. Instead of using a full pairwise comparison, we select the LLaMA 7B and 13B models fine-tuned on Self-Instruct-52K as the reference baselines, and then compare them with the other fine-tuned LLaMA 7B and 13B models using different instructions, respectively. Since our focus is to examine the usefulness of different strategies to generate the instructions, the model fine-tuned on Self-Instruct-52K can serve as a good reference. Following AlpacaFarm [363], for each comparison, we employ ChatGPT to automatically annotate which response from the two compared models is the best for the user query, and report the win rate (%) as the evaluation metric. For the QA setting, we select two benchmarks, MMLU [364] and BBH [365], and evaluate the accuracy based on their default settings by using heuristic rules to parse the answers from these LLMs.
For both instruction tuning and evaluation, we adopt the following prompt: "The following is a conversation between a human and an AI assistant. The AI assistant gives helpful, detailed, and polite answers to the user's questions.\n[|Human|]:{input}\n[|AI|]:". To reproduce our results, we release the code and data at the link: https://github.com/RUCAIBox/LLMSurvey/tree/main/Experiments.

Results and Analysis. The results using different instruction datasets based on 7B and 13B LLaMA are in Table 9. Next, we summarize and analyze our findings in detail.
• Task-formatted instructions are more proper for the QA setting, but may not be useful for the chat setting. By comparing the performance of instruction tuning using FLAN-T5 with that of ShareGPT and Self-Instruct-52K, we can observe that FLAN-T5 mostly achieves a better performance on QA benchmarks while it underperforms ShareGPT on the chat setting. The reason is that FLAN-T5 is composed of a mixture of instructions and examples from existing NLP tasks, e.g., translation and reading comprehension. As a result, LLaMA fine-tuned with FLAN-T5 performs better on QA tasks, but poorly on user queries. In contrast, ShareGPT consists of real-world human-ChatGPT conversations, which is able to better elicit LLaMA to follow user instructions in daily life, while it may not be suitable for accomplishing the QA tasks.
• A mixture of different kinds of instructions is helpful to improve the comprehensive abilities of LLMs. After mixing the three kinds of instructions for fine-tuning, we can see that the derived LLaMA variant (with FLAN-T5, ShareGPT and Self-Instruct-52K) performs well in both task settings. In MMLU, the performance of LLaMA (7B) can surpass the ones using an individual instruction set by a large margin, i.e., 43.69 vs. 38.58 (FLAN-T5). It shows that mixing multiple sources of instruction datasets is helpful to improve the performance of instruction-tuned LLMs, as it scales the instruction number and increases the diversity.
• Enhancing the complexity and diversity of instructions leads to an improved model performance. By increasing the complexity and diversity of the Self-Instruct-52K dataset respectively, the chat and QA performance of LLaMA can be consistently improved, e.g., from 37.52 to 39.73 in MMLU for LLaMA (7B). It demonstrates that both strategies are useful to improve the instruction following ability of LLMs. Further, we can see that improving the complexity yields a larger performance improvement on QA tasks. The reason is that the QA tasks mostly consist of difficult questions for evaluating LLMs, which can be better solved by LLMs that have learned complex instructions at the fine-tuning stage.
• Simply increasing the number of instructions may not be useful, and balancing the difficulty is not always helpful. As the results shown in Table 9, balancing the difficulty and increasing the number of fine-tuning instructions are not very helpful in our experiments. Especially for scaling the instruction number, it even hurts the performance, e.g., a decrease from 29.81 to 26.63 in BBH for LLaMA (7B). It shows that simply scaling the number of synthesized instructions without quality control may not be effective in improving the performance. Furthermore, fine-tuning with the instructions of moderate difficulty also performs well in the chat setting, while slightly decreasing the performance in the QA setting. A possible reason is that we filter complex and hard instructions with large perplexity scores, hurting the model performance in answering complex questions.
• A larger model scale leads to a better instruction following performance. By comparing the performance of the LLaMA (7B) and LLaMA (13B) models fine-tuned with the same set of instruction data, we can see that LLaMA (13B) mostly achieves a better performance. It indicates that scaling the model size is helpful for improving the instruction following capability. Besides, we can see that the QA performance has been improved a lot, e.g., from 38.11 to 47.49 in MMLU. This is likely because the larger models generally have better knowledge utilization and reasoning capability [33, 55], which can accurately answer more complex questions.

Instruction Tuning Suggestions
To conduct instruction tuning on LLMs, one can prepare the computational resources according to the basic statistics about the required number of GPUs and tuning time in Table 8. After setting up the development environment, we recommend beginners to follow the code of the Alpaca repository [137] for instruction tuning. Subsequently, one should select the base model and construct the instruction datasets as we discuss in this section. When computational resources for training are constrained, users can utilize LoRA for parameter-efficient tuning (see Section 5.3). As for inference, users can further use quantization methods to deploy LLMs on fewer or smaller GPUs (see Section 5.4).

5.2 Alignment Tuning
This part first presents the background of alignment with its definition and criteria, then focuses on the collection of human feedback data for aligning LLMs, and finally
discusses the key technique of reinforcement learning from human feedback (RLHF) for alignment tuning.

5.2.1 Background and Criteria for Alignment
Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 67, 90]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [66, 366]. For LLMs, the language modeling objective pre-trains the model parameters by word prediction while lacking the consideration of human values or preferences. To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [66, 367]. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), such an alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). It has been shown that alignment might harm the general abilities of LLMs to some extent, which is called the alignment tax in the related literature [368].

Alignment Criteria. Recently, there has been increasing attention on developing multifarious criteria to regulate the behaviors of LLMs. Here, we take three representative alignment criteria (i.e., helpful, honest, and harmless) as examples for discussion, which have been widely adopted in existing literature [66, 368]. In addition, there are other alignment criteria for LLMs from different perspectives, including behavior, intent, incentive, and inner aspects [366], which are essentially similar (or at least with similar alignment techniques) to the above three criteria. It is also feasible to modify the three criteria according to specific needs, e.g., substituting honesty with correctness [116]. Next, we give brief explanations of the three representative alignment criteria:
• Helpfulness. To be helpful, the LLM should demonstrate a clear attempt to assist users in solving their tasks or answering questions in as concise and efficient a manner as possible. At a higher level, when further clarification is needed, the LLM should demonstrate the capability of eliciting additional relevant information through pertinent inquiries and exhibit suitable levels of sensitivity, perceptiveness, and prudence [368]. Realizing the alignment of helpful behavior is challenging for LLMs since it is difficult to precisely define and measure the intention of users [366].
• Honesty. At a basic level, a LLM aligned to be honest should present accurate content to users instead of fabricating information. Additionally, it is crucial for the LLM to convey appropriate degrees of uncertainty in its output, in order to avoid any form of deception or misrepresentation of information. This requires the model to know about its capabilities and levels of knowledge (e.g., "known unknowns"). According to the discussion in [368], honesty is a more objective criterion compared to helpfulness and harmlessness, hence honesty alignment could potentially be developed with less reliance on human efforts.
• Harmlessness. To be harmless, it requires that the language produced by the model should not be offensive or discriminatory. To the best of its abilities, the model should be capable of detecting covert endeavors aimed at soliciting requests for malicious purposes. Ideally, when the model is induced to conduct a dangerous action (e.g., committing a crime), the LLM should politely refuse. Nonetheless, what behaviors are deemed harmful, and to what extent, varies among individuals or societies [368] and depends highly on who is using the LLM, the type of the posed question, and the context (e.g., time) in which the LLM is being used.
As we can see, these criteria are quite subjective, and are developed based on human cognition. Thus, it is difficult to directly formulate them as optimization objectives for LLMs. In existing work, there are many ways to fulfill these criteria when aligning LLMs. A promising technique is red teaming [369], which involves using manual or automated means to probe LLMs in an adversarial way to generate harmful outputs and then updating LLMs to prevent such outputs.

5.2.2 Collecting Human Feedback
During the pre-training stage, LLMs are trained using the language modeling objective on a large-scale corpus. However, it cannot take into account the subjective and qualitative evaluations of LLM outputs by humans (called human feedback in this survey). High-quality human feedback is extremely important for aligning LLMs with human preferences and values. In this part, we discuss how to select a team of human labelers for feedback data collection.

Human Labeler Selection. In existing work, the dominant method for generating human feedback data is human annotation [66, 116, 367]. This highlights the critical role of selecting appropriate human labelers. To provide high-quality feedback, human labelers are supposed to have a qualified level of education and excellent proficiency in English. For example, Sparrow [116] requires human labelers to be UK-based native English speakers who have obtained at least an undergraduate-level educational qualification. Even then, several studies [367] have found that there still exists a mismatch between the intentions of researchers and human labelers, which may lead to low-quality human feedback and cause LLMs to produce unexpected output. To address this issue, InstructGPT [66] further conducts a screening process to filter labelers by assessing the agreement between human labelers and researchers. Specifically, researchers first label a small amount of data and then measure the agreement between themselves and human labelers. The labelers with the highest agreement will be selected to proceed with the subsequent annotation work. In some other work [370], "super raters" are used to ensure the high quality of human feedback. Researchers evaluate the performance of human labelers and select a group of well-performing human labelers (e.g., with high agreement) as super raters. The super raters will be given priority to collaborate with the researchers in the subsequent study. When human labelers annotate the output of LLMs, it is helpful to specify detailed instructions and provide instant guidance for human labelers, which can further regulate their annotation.

Human Feedback Collection. In existing work, there are mainly three kinds of approaches to collecting feedback and preference data from human labelers.
39

• Ranking-based approach. In early work [367], human Supervised Fine-tuning


labelers often evaluate model-generated outputs in a coarse- Prompts Training with demonstration data
grained manner (i.e., only selecting the best) without taking
into account more fine-grained alignment criteria. Nonethe-
Human
Annotator
Demonstrations Pre-trained LM
🔥
less, different labelers may hold diverse opinions on the
selection of the best candidate output, and this method Demonstration Data
disregards the unselected samples, which may lead to inac-
curate or incomplete human feedback. To address this issue, Reward Model Training

subsequent studies [116] introduce the Elo rating system


🔥
to derive the preference ranking by comparing candidate
outputs. The ranking of outputs serves as the training signal
Prompts LM Outputs Reward
Model
Pre-trained LM
🧊
that guides the model to prefer certain outputs over others,
thus inducing outputs that are more reliable and safer. Ranking Human Feedback Training with feedback data
• Question-based approach. Further, human labelers can
provide more detailed feedback by answering certain ques- RL Fine-tuning

tions designed by researchers [81], covering the alignment 🧊


criteria as well as additional constraints for LLMs. Specially,
in WebGPT [81], to assist the model in filtering and utiliz-
Prompts
Reward
Model
Aligned LM
🔥
ing relevant information from retrieved documents, human
labelers are required to answer questions with multiple LM Outputs 😊/😞
Reward
Training with RL algorithm (PPO)
options about whether the retrieved documents are useful
for answering the given input. Fig. 12: The workflow of the RLHF algorithm.
• Rule-based approach. Many studies also develop rule-
based methods to provide more detailed human feedback.
As a typical case, Sparrow [116] not only selects the response a generative model that is initialized with existing pre-
that labelers consider the best but also uses a series of trained LM parameters. For example, OpenAI uses 175B
rules to test whether model-generated responses meet the GPT-3 for its first popular RLHF model, InstructGPT [66],
alignment criteria of being helpful, correct, and harmless. and DeepMind uses the 280 billion parameter model Go-
In this way, two kinds of human feedback data can be ob- pher [64] for its GopherCite model [370]. Further, the reward
tained: (1) the response preference feedback is obtained by model (RM) provides (learned) guidance signals that reflect
comparing the quality of model-generated output in pairs, human preferences for the text generated by the LM, usually
and (2) the rule violation feedback is obtained by collecting in the form of a scalar value. The reward model can take on
the assessment from human labelers (i.e., a score indicating two forms: a fine-tuned LM or a LM trained de novo using
to what extent the generated output has violated the rules). human preference data. Existing work typically employs
Furthermore, GPT-4 [46] utilizes a set of zero-shot classifiers reward models having a parameter scale different from that
(based on GPT-4 itself) as rule-based reward models, which of the aligned LM [66, 370]. For example, OpenAI uses 6B
can automatically determine whether the model-generated GPT-3 and DeepMind uses 7B Gopher as the reward model,
outputs violate a set of human-written rules. respectively. Finally, to optimize the pre-trained LM using
In the following, we focus on a well-known technique, the signal from the reward model, a specific RL algorithm
reinforcement learning from human feedback (RLHF), is designed for large-scale model tuning. Specifically, Prox-
which has been widely used in the recent powerful LLMs imal Policy Optimization (PPO) [128] is a widely used RL
such as ChatGPT. As discussed below, the alignment criteria algorithm for alignment in existing work [66, 116, 370].
introduced in Section 5.2.1 can be fulfilled by learning from
human feedback on the responses of LLMs to users’ queries. Key Steps for RLHF. Figure 12 illustrates the overall three-
step process of RLHF [66] as introduced below.
5.2.3 Reinforcement Learning from Human Feedback • Supervised fine-tuning. To make the LM initially perform
desired behaviors, it usually needs to collect a supervised
To align LLMs with human values, reinforcement learning
dataset containing input prompts (instruction) and desired
from human feedback (RLHF) [79, 367] has been proposed
outputs for fine-tuning the LM. These prompts and outputs
to fine-tune LLMs with the collected human feedback data,
can be written by human labelers for some specific tasks
which is useful to improve the alignment criteria (e.g.,
while ensuring the diversity of tasks. For example, Instruct-
helpfulness, honesty, and harmlessness). RLHF employs
GPT [66] asks human labelers to compose prompts (e.g.,
reinforcement learning (RL) algorithms (e.g., Proximal Pol-
“List five ideas for how to regain enthusiasm for my career”) and
icy Optimization (PPO) [128]) to adapt LLMs to human
desired outputs for several generative tasks such as open
feedback by learning a reward model. Such an approach
QA, brainstorming, chatting, and rewriting. Note that the
incorporates humans in the training loop for developing
first step is optional in specific settings or scenarios.
well-aligned LLMs, as exemplified by InstructGPT [66].
• Reward model training. The second step is to train the
RLHF System. The RLHF system mainly comprises three RM using human feedback data. Specifically, we employ
key components: a pre-trained LM to be aligned, a reward the LM to generate a certain number of output texts using
model learning from human feedback, and a RL algorithm sampled prompts (from either the supervised dataset or
training the LM. Specifically, the pre-trained LM is typically the human-generated prompt) as input. We then invite
40

MHA Adapter FFN Adapter Prefix Layer #N Layer #N Wdown


Layer #N
… … … LoRA …
MHA Adapter FFN Adapter Prefix Layer #1 Layer #1 Wdown
Layer #1

Input Input Prompt Input Input


(a) Adapter Tuning (b) Prefix Tuning (c) Prompt Tuning (d) Low-Rank Adapation

Fig. 13: An illustration of four different parameter-efficient fine-tuning methods. MHA and FFN denote the multi-head
attention and feed-forward networks in the Transformer layer, respectively.

human labelers to annotate the preference for these pairs. model size), since large reward models generally perform
The annotation process can be conducted in multiple forms, better in judging the quality of the LLM generated outputs.
and a common approach is to annotate by ranking the In LLaMa 2 [99], pretrained chat model checkpoints are
generated candidate texts, which can reduce the inconsis- used to initialize the reward model, they argue that such an
tency among annotators. Then, the RM is trained to predict approach can effectively reduce the information mismatch
the human-preferred output. In InstructGPT, labelers rank between the model to be aligned and the reward model
model-generated outputs from best to worst, and the RM by sharing the same pre-training knowledge. Whereas, it is
(i.e., 6B GPT-3) is trained to predict the ranking. Note that, in common to encounter the overfitting problem when train-
recent work [371], the annotation of preference on response ing large-scale reward models. As a simple yet effective
pairs has been conducted by an AI agent (usually an aligned solution, existing work [374, 375] has introduced the LM
LLM) instead of humans, which is called “reinforcement loss on the preferred response of the input prompt from
learning from AI feedback (RLAIF)”. LLMs trained with typical the human-annotated alignment dataset as a regularizer,
RLHF algorithms tend to generate harmless responses with which alleviates the overfitting of the reward model on the
less helpfulness, which is called evasion problem [371]. To binary classification task. In addition, as there are multiple
guarantee both the harmlessness and helpfulness, RLAIF criteria for alignment (e.g., helpfulness and honesty), it is
generates the AI feedback based on pre-set alignment prin- often difficult to train a single reward model that can satisfy
ciples in instructions [371, 372], which can also reduce the all the alignment criteria. Therefore, it is useful to train
efforts of human annotation. multiple reward models that focus on different alignment
• RL fine-tuning. At this step, aligning (i.e., fine-tuning) criteria [99], and compute the final reward based on the
the LM is formalized as an RL problem. In this setting, produced ones from them via special combination strategies
the pre-trained LM acts as the policy that takes as input (e.g., mean pooling and weighted sum). Such a way enables
a prompt and returns an output text, the action space of more flexible rules or standards on multiple criteria, e.g.,
it is the vocabulary, the state is the currently generated relaxing the requirement on helpfulness while posing more
token sequence, and the reward is provided by the RM. To strict limits on harmfulness.
avoid eviating significantly from the initial (before tuning) • Effective RL training. As the RL training process tends to
LM, a penalty term is commonly incorporated into the be unstable and hyper-parameter sensitive, it is suggested
reward function. For example, InstructGPT optimizes the that the language model should be well supervised fine-
LM against the RM using the PPO algorithm. For each input tuned before RL training, so as to reaching a good model
prompt, InstructGPT calculates the KL divergence between capacity. A commonly-used way is to fine-tune the LLM
the generated results from the current LM and the initial on its best outputs of the prompts (referred to as rejec-
LM as the penalty. It is noted that the second and final steps tion sampling or best-of-N ) from the alignment dataset until
can be iterated in multiple turns for better aligning LLMs. convergence before RL. Given a prompt, the LLM would
Due to the instability of the RL algorithm, recent work [373] first produce N outputs via the sampling algorithm, and
replaces the RL tuning with another supervised fine-tuning then the best candidate from the model will be selected
by reusing the best ranked samples with higher rewards. by the reward model for learning. After fine-tuning the
Practical Strategies for RLHF. Although RLHF is promising LLM on the best samples until convergence, the RL process
to effectively improve the alignment of LLMs with humans, will be performed to further improve the performance.
it is practically challenging for researchers to successfully LLaMA 2 [99] has successively trained five versions of RLHF
implement it. In this part, we focus on discussing several models, where the LLM has been progressively improved
useful strategies and tricks for improving the effectiveness with the improvement of the reward models. In this way,
and efficiency of RLHF. Concretely, we focus on the effective the collected prompts and annotations of human preference
training of reward models, efficient and effective RL train- data can better reflect the issues of the current model check-
ing, respectively. point, thus making special tuning to address these issues. In
• Effective reward model training. Despite that InstructGPT addition, LLaMA 2 also adds samples from prior iterations
used a small reward model (6B GPT model), increasing into the subsequent ones, to alleviate the possible capacity
work [99] has shown it is often more effective to use a regression issue during iterative optimization.
large reward model (e.g., equal or greater than the original • Efficient RL training. As the RL training requires to
41

iterate the inference process of both the LLM and reward reward model, and the reference model at the same time,
models, it would greatly increase the total memory and which is tedious in algorithmic procedure and memory-
computation cost, especially for larger reward models and consuming in practice. Besides, the commonly-used PPO
LLMs. As a practical trick, we can deploy the reward model algorithm in RLHF is rather complex and often sensitive
on a separate server, and invoke the corresponding API to hyper-parameters. As an alternative, increasing studies
to work with the LLM on its own server. In addition, as explore to directly optimize LLMs to adhere to human pref-
RLHF requires the LLM to generate multiple candidate erences, using supervised fine-tuning without reinforcement
outputs, instead of calling the sample decoding procedure learning [349].
for multiple times, it is more efficient to utilize the beam
search decoding algorithm26 . It only needs to perform one- Overview. The basic idea of non-RL alignment approaches
pass decoding for response generation, meanwhile such a is to directly fine-tune LLMs with supervised learning on
strategy can also enhance the diversity of the generated high-quality alignment dataset. It basically assumes that re-
candidate responses. sponse feedback or golden rules to avert unsafe behaviors
Process-Supervised RLHF. In existing literature of have been injected or included in the specially curated align-
RLHF [376], the supervision signals for RL training can be ment dataset, so that LLMs can directly learn aligned behav-
generally classified into two distinct categories: outcome- iors from these demonstration data via suitable fine-tuning
supervision signals and process-supervision signals. The strategies. Thus, to implement this approach, two key issues
outcome-supervised RLHF employs a quantitative score to are the construction of alignment dataset and the design of
assess the quality of the whole text generated by LLMs. fine-tuning loss. For the first issue, the alignment dataset
In contrast, process-supervised RLHF offers an evalua- can be automatically constructed by an aligned LLMs ac-
tion of each individual component (e.g., sentence, word, cording to human-written safety principles [347] or refining
or reasoning step) within the generated content, which existing examples using edits operations [383]. In addition,
can provide fine-grained supervision signals to guide the we can also reuse existing reward models to select high-
training, helping LLMs refine the undesired generation rated responses from existing human feedback data [373].
contents [376, 377]. OpenAI has proposed a fine-grained For the second issue, non-RL alignment approaches mainly
annotation dataset named PRM800k [377] consisting of fine-tune LLMs in a supervised learning way (the same
12K process-annotated mathematical problems (i.e., MATH as the original instruction tuning loss) on a high-quality
dataset [378]) and 75K solutions generated by LLMs of alignment dataset, meanwhile auxiliary learning objectives
these problems, where each reasoning step of mathemat- can be used to enhance the alignment performance, e.g.,
ical problems is labeled as positive, negative or neutral in ranking responses or contrasting instruction-response pairs.
PRM800k. This fine-grained dataset has been utilized in
existing work [377, 379] to train the process-supervised re- Alignment Data Collection. The construction of alignment
ward models (PRM), and the probability from the prediction data is important to effectively align the behaviors of LLMs
of each label can be considered as the supervision signals with human preferences. To collect high-quality alignment
during RLHF procedure. To effectively leverage process- data, some work tries to reuse existing reward models to
supervision signals from PRMs, existing work [376] has select high-rated responses, and others explore to leverage
utilized expert iteration [380, 381], an effective RL algo- powerful LLMs (e.g., ChatGPT) or build a simulated envi-
rithm to improve the base policy via learning from expert ronment to generate synthetic alignment examples. Next,
policy. Typically, expert iteration contains two main stages: we will discuss these three lines of research.
policy improvement and distillation [376]. In the policy
• Reward model based approaches. The reward model in
improvement stage, expert policy processes the systematic
RLHF has been trained to measure the alignment degree
search procedure to produce the samples. PRMs provide
on the responses of LLMs. It is straightforward to leverage
process-supervision signals to guide expert policy in the
existing reward models to select high-quality responses as
search procedure and enhance the quality of samples. Subse-
alignment data for subsequent fine-tuning. Based on this
quently, during the distillation stage, the samples generated
idea, RAFT [373] adopts reward models trained on human
by expert policy in the first stage are utilized to improve
preference data to rank the responses of LLMs and collect
the base policy through supervised fine-tuning. In addition
those with higher rewards for supervised fine-tuning. In
to expert iteration, PRMs can also be utilized to re-rank the
addition, the reward model can be also used to score model
candidates of the final answers generated by LLMs [377] or
responses and assign them to different quality groups.
to select better intermediate reasoning steps during step by
Quark [384] sorts the responses of LLMs into different quan-
step reasoning [379, 382].
tiles based on the reward scores. Each quantile is attached
with a special reward token to represent the reward level
5.2.4 Alignment without RLHF
of the quantile. Conditioned on the highest-reward tokens,
Although RLHF has achieved great success in aligning the LLMs are subsequently prompted to generate high-quality
behaviors of LLMs with human values and preferences, it responses. Given an initial answer and the corresponding
also suffers from notable limitations. First, RLHF needs to human feedback, ILF [385] first adopts LLMs to generate
train multiple LMs including the model being aligned, the refined answers, then utilizes the reward model to select
the answer that best matches the feedback for further
26. https://huggingface.co/docs/transformers/v4.31.0/en/main
classes/text generation#transformers.GenerationMixin.group beam training. As valuable resources for aligning LLMs, several
search reward models have been released, including DeBERTa-
42

base/large/xxlarge from OpenAssistant27 , Moss-7B from sponse, the primary training loss is still the traditional cross-
Fudan28 , and Flan-T5-xl from Stanford29 . entropy loss for sequence-to-sequence learning. Based on
• LLM based generative approaches. Reward models help this loss, many studies propose a number of improvement
to select aligned data from model responses. However, variants for enhancing the supervised alignment tuning.
training reward models itself necessitates substantial high- For example, CoH [388] constructs the training data by
quality human-labeled data, which is typically expensive prepending “A helpful answer:” and “An unhelpful answer:”
and in short supply. In addition, although existing reward to the annotated good and bad responses, respectively, and
models can be reused, they might not be able to accurately only compute losses for those response tokens with special
capture the nonalignment behaviors in another separately masking. Quark [384] sorts model responses into different
trained LLM. Therefore, some work explores leveraging quantiles with varying alignment quality, it prepends a
powerful LLMs to automatically generate human-aligned special reward token to each model response to represent
data. As a representative work, constitutional AI [371] pro- the reward level of the response. Further, to enable the
poses that human supervision comes from a set of principles preference modeling via the maximum likelihood objective,
(i.e., natural language instructions) governing AI behaviors. DPO [389] first reparameterizes the response rewards using
Based on these principles, LLMs will critique their own the policy model (i.e., the language model being optimized),
harmful responses and revise them repeatedly into finally and then the original reward modelling objective can be
aligned responses. Similarly, Self-Align [347] first adopts reformulated only based on the policy model. In this way,
self-instruct [143] to generate instructions focusing on cov- DPO removes the explicit reward modeling step, and opti-
ering diverse topics. Then, the model is also prompted mizing the new learning objective only involving the policy
with multiple human-written principles that describe the model is equivalent to optimizing the rewards. Furthermore,
rules of expected model behaviors (also with several in- FIGA [386] designs a fine-grained contrastive loss that aims
context exemplars), to generate helpful, ethical, and reliable to encourage desirable tokens, penalize undesirable ones,
responses as alignment data. To mitigate the limit that the and disregard trivial tokens.
original SFT method can only learn from positive responses, • Auxiliary optimization objectives. Besides the primary
FIGA [386] develops an improved supervised alignment cross-entropy loss, several studies propose auxiliary train-
approach, where both negative (the original output of low ing loss to enhance the learning from the alignment data.
quality) and positive (the refined output by LLMs) re- First, since the responses of each instruction can be scored
sponses are leveraged in a contrastive way, to enable LLMs by the reward model, the ranking loss can be used to train
to deeply understand what fine-grained revisions actually the model to preserve the ranking order of these responses.
lead to good response. For example, RRHF [390] samples responses from multi-
• LLM based interactive approaches. Most existing ap- ple sources, including model-generated responses, such as
proaches train LLMs in isolation, where LLMs are not those derived from the model itself, ChatGPT, and GPT-4,
present in actual environments to improve themselves as well as human-written responses, spanning both high-
through external feedback signals. As a comparison, hu- quality and low-quality instances. To align with the scores
mans learn social norms and values from interactions with from reward models, it further optimizes the ranking loss
others in social environments [387]. To mimic such a learn- by encouraging the model to have a higher conditional log
ing approach, Stable Alignment [179] builds a simulated probability for the response with a higher ranking. SLiC-
interaction environment consisting of a number of LLM HF [391] proposes to assess the similarity between model
agents, where AI agents keep interacting with and each outputs and human preference via the distance in the latent
other, receiving feedback on improvement. Once a central space, and introduces specific calibration and regularization
agent receives an instruction, it produces a response and loss to calibrate the candidate sequences based on human-
shares it with nearby agents. These critic agents generate preference data. Second, to enhance the relatedness be-
feedback comprising ratings about the response and re- tween the response and the instruction, some work adopts
vision suggestions. Then the central agent would revise contrastive learning to push up the probability of correct
the original response following these suggestions. Such instruction-response pairs while pushing down incorrect
an alignment approach can be also extended to real-world instruction-response pairs. Specifically, for an output re-
environment with humans. sponse, the proposed approach in [392] contrasts the target
instruction to the other irrelevant instructions. By doing so,
Supervised Alignment Tuning. After obtaining alignment
it can enable the model to learn the right correlation between
data, it is also key to design suitable fine-tuning strategies
instructions and responses.
for direct alignment. A straightforward approach is to op-
timize LLMs using the conventional sequence-to-sequence 5.2.5 Remarks on SFT and RLHF
objective based on the alignment data. In addition to the
As discussed in Section 5.1, instruction tuning is the process
conventional optimization objective, several studies further
of training pre-trained language models with formatted
explore auxiliary losses that enhance the learning from the
demonstration data (instructions paired with desired out-
alignment data.
puts). At early exploration, instruction data was mainly col-
• Primary training objective. Since the alignment data
lected from NLP tasks [67], while it has been now extended
typically consists of an input instruction and an output re-
to more diverse supervision data that pairs input and
27. https://huggingface.co/OpenAssistant output texts (e.g., the utterances of open-ended dialogues).
28. https://github.com/OpenLMLab/MOSS-RLHF Training with such paired texts is also called supervised fine-
29. https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl tuning (SFT) in the context of LLMs [66]. In this part, we
43

mainly use the abbreviation SFT for discussion but not lucinated texts, thus likely affecting the factual accuracy
instruction tuning, due to the simplicity and popularity. of LLMs. Furthermore, as a behavior cloning method, SFT
Since SFT and RLHF are two major adaptation tuning aims to imitate the behaviors (without explorations) of the
methods for LLMs, it is important to understand the con- experts who construct the demonstration data. However,
nections and difference between them. Next, we make some there often exist variations among different annotators on
discussions on this issue30 . the writing styles, quality, and preferences of demonstration
data, which tends to affect the learning performance of SFT.
Overall Comparison with RL Formulation. Following the Thus, high-quality instruction data (but not the quantity) is
discussion in Section 5.2.3 (the part related to RL training), the primary factor for effective training of LLMs during the
the text generation problem can be formulated as a decision- SFT stage [99].
making process based on RL. Taking a prompt as input,
the task of a LLM is to generate a text completion that Pros and Cons of RLHF. RLHF was early explored in the
appropriately responds to the prompt. This task would be literature of deep RL [79], then borrowed to improve the
completed step by step. At each step, an agent (i.e., LLM) capacity of language models (e.g., summarization [129]),
will perform an action (i.e., generating a token) according and subsequently adopted as the fundamental technique to
to the policy (i.e., the generative probability distribution of develop InstructGPT [66]. Recently, increasing evidence [99,
LLM) conditioned on the current state (currently generated 371] has demonstrated the effectiveness of RLHF in miti-
token sequence and other available context information). gating the harmful responses and enhancing the model ca-
It is expected that a high-quality output text would be pacity. Specially, LLaMA 2 has demonstrated that RLHF can
produced by the LLM, which can earn a large reward score improve both the helpfulness and harmlessness scores [99],
based on the entire response. Overall, RLHF and SFT can be and attributed this to a better human-LLM synergy for data
considered as two different training approaches to optimiz- annotation. They explain this reason in two major aspects
ing the above decision making process for LLMs. Specially, as follows. First, since human annotators mainly provide
RLHF firstly learns the reward model, and then employs preference annotations for RLHF, it can largely alleviate the
it to improve the LLM with RL training (e.g., PPO). As a discrepancies of annotators as that in SFT. Secondly, pref-
comparison, SFT adopts a teacher-forcing approach, which erence annotation is much easier than writing the demon-
directly optimizes the likelihood of a demonstration output. stration data, and annotators can even judge the quality of
Such a token-level training way essentially does behavior more superior generations than those they create, making it
cloning (a special algorithm of imitation learning [393]): it possible to explore a broader state space beyond what can
utilizes the expert’s action (i.e., the target token at each step) be demonstrated by human annotators. Another key point
as the supervision label and directly learns to imitate the is that RLHF essentially encourages LLMs to learn correct
demonstrations from experts without specifying a reward policies by contrasting the self-generated responses (dis-
model as in typical RL algorithms. To learn the desired criminating between good and bad responses). It no longer
policies, SFT adopts a “local” optimization way (i.e., token- forces the model to imitate external demonstration data,
level loss) based on demonstration data, while RLHF takes a and thus can mitigate the hallucination issues with SFT as
“global” optimization way (i.e., text-level loss) by involving discussed above31 . Actually, RLHF has been demonstrated
human preference. More theoretical analysis about imitation to be an important approach to reduce the hallucination
learning and reinforcement learning can be referred to the behaviors in GPT-4 [46]. However, RLHF inherits the draw-
related RL literature [393, 394]. backs of classic RL algorithms, e.g., sample inefficiency and
training instability. When adapted to LLMs, RLHF further
Pros and Cons of SFT. SFT has been shown to be an
relies on a strong SFT model as initial model checkpoint for
effective approach to boosting the performance of LLMs
efficiently achieving good performance. In addition, human
on various benchmarks [67, 69, 137, 138], which can largely
annotators are involved in a complex iterative optimization
enhance the task generalization ability and flexibly endow
process, in which a number of important details (e.g., the
specific functions (e.g., establishing the chatbot’s identity).
prompt selection, the schedule of reward model training and
More discussions about the usefulness of SFT can be found
PPO training, and the settings of hyper-parameters) have
in Section 5.1.3. It has been widely recognized that SFT
important impact on the whole model performance.
mainly unlocks the abilities but not inject new abilities into
LLMs. Thus, it might become problematic when one tries Overall, SFT is particularly useful to increase the model
to stimulate the non-endogenous abilities of LLMs via SFT. capacity of pre-trained model checkpoints right after pre-
As a concrete scenario, it would potentially advocate the training, while RLHF is promising to further improve the
hallucination behaviors when demonstration data is beyond model capacity of SFT models. However, RLHF has been
the knowledge or ability scope of LLMs, e.g., training a LLM difficult to implement, and far from well explored (ac-
to answer questions about its unknown facts. An interesting cording to public literature), and more improvements (e.g.,
viewpoint from John Schulman’s talk on RLHF [395] is that efficient and reliable annotation [371] and simplified opti-
distilling superior models to train less capable models (e.g., mization [389]) are still needed for further research.
prompting GPT-4 to generate the response as fine-tuning
data) might increase the possibilities of generating the hal-
31. In RLHF, it seems to be also important that reward models
30. This part would be somehow subjective, mainly based on the au- should be aware of the knowledge or ability of a LLM to be aligned.
thors’ opinions and experiences. Comments or corrections are welcome For example, LLaMA 2 adopts pre-trained chat model checkpoints to
to enhance this part. initialize reward models [99].
44

5.3 Parameter-Efficient Model Adaptation Prompt Tuning. Different from prefix tuning, prompt tun-
ing [397, 402] mainly focuses on incorporating trainable
In the above, we have discussed the approaches of instruc- prompt vectors at the input layer32 . Based on the discrete
tion tuning and alignment tuning to adapt LLMs according prompting methods [404, 405], it augments the input text
to specific goals. Since LLMs consist of a huge amount of by including a group of soft prompt tokens (either in a
model parameters, it would be costly to perform the full- free form [402] or a prefix form [397]), and then takes
parameter tuning. In this section, we will discuss how to the prompt-augmented input to solve specific downstream
conduct efficient tuning on LLMs. We first review several tasks. In implementation, task-specific prompt embeddings
representative parameter-efficient fine-tuning methods for are combined with the input text embeddings, which are
Transformer language models, and then summarize existing subsequently fed into language models. P-tuning [402] has
work on parameter-efficient fine-tuned LLMs. proposed a free form to combine the context, prompt and
target tokens, which can be applied to the architectures for
5.3.1 Parameter-Efficient Fine-Tuning Methods both natural language understanding and generation. They
further learn the representations of soft prompt tokens by a
In existing literature, parameter-efficient fine-tuning [145, bidirectional LSTM. Another representative approach [397]
396, 397] has been an important topic that aims to reduce named prompt tuning directly prepends prefix prompts to
the number of trainable parameters while retaining a good the input. During training, only the prompt embeddings
performance as possible. In what follows, we briefly re- would be learned according to task-specific supervisions.
view four parameter-efficient fine-tuning methods for Trans- Since this method only includes a small number of trainable
former language models, including adapter tuning, prefix parameters at the input layer, it has been found that the
tuning, prompt tuning and LoRA. The illustration of these performance highly relies on the model capacity of the
four methods are shown in Figure 13. underlying language models [397].
Adapter Tuning. Adapter tuning incorporates small neural Low-Rank Adaptation (LoRA). LoRA [145] imposes the
network modules (called adapter) into the Transformer mod- low-rank constraint for approximating the update matrix at
els [398]. To implement the adapter module, a bottleneck each dense layer, so as to reduce the trainable parameters
architecture has been proposed in [398, 399], which first for adapting to downstream tasks. Consider the case of
compresses the original feature vector into a smaller di- optimizing a parameter matrix W. The update process can
mension (followed by a nonlinear transformation) and then be written in a general form as: W ← W + ∆W. The basic
recovers it to the original dimension. The adapter modules idea of LoRA is to freeze the original matrix W ∈ Rm×n
would be integrated into each Transformer layer, typically while approximating the parameter update ∆W by low-
using a serial insertion after each of the two core parts (i.e., rank decomposition matrices, i.e., ∆W = A · B⊤ , where
attention layer and feed-forward layer) of a Transformer A ∈ Rm×k and B ∈ Rn×k are the trainable parameters for
layer. Alternatively, parallel adapters [400] can be also used task adaptation and k ≪ min(m, n) is the reduced rank. The
in Transformer layers, where it places two adapter modules major merit of LoRA is that it can largely save the memory
in parallel with the attention layer and feed-forward layer and storage usage (e.g., VRAM). Further, one can only keep
accordingly. During fine-tuning, the adapter modules would a single large model copy, while maintaining a number of
be optimized according to the specific task goals, while the task-specific low-rank decomposition matrices for adapting
parameters of the original language model are frozen in this to different downstream tasks. Further, several studies have
process. In this way, we can effectively reduce the number also discussed how to set the rank in a more principled
of trainable parameters during fine-tuning. approach, e.g., importance score based allocation [406] and
search-free optimal rank selection [407].
Prefix Tuning. Prefix tuning [396] prepends a sequence of
Besides the above methods, there is extensive research
prefixes, which are a set of trainable continuous vectors, to
on efficient tuning of Transformer language models. How-
each Transformer layer in language models. These prefix
ever, a more comprehensive discussion of efficient tuning is
vectors are task-specific, which can be considered as virtual
beyond the scope of this article, which can be found in the
token embeddings. To optimize the prefix vectors, a repa-
related papers on this topic [400, 408].
rameterization trick [396] has been proposed by learning a
MLP function that maps a smaller matrix to the parameter 5.3.2 Parameter-Efficient Fine-Tuning on LLMs
matrix of prefixes, instead of directly optimizing the pre- With the rising of LLMs, efficient tuning has attracted
fixes. It has been shown that this trick is useful for stable increasing research attention for developing a more
training. After optimization, the mapping function would lightweight adaptation approach in downstream tasks.
be discarded, and only the derived prefix vectors are kept In particular, LoRA [145] has been widely applied
to enhance task-specific performance. Since only the prefix to open-source LLMs (e.g., LLaMA and BLOOM) for
parameters would be trained, it can lead to a parameter-
efficient model optimization. Similar to prefix tuning, p- 32. Here, prompt tuning denotes a category of related efficient tuning
methods exemplified by the work [397, 402, 403], instead of a spe-
tuning v2 [401] incorporates layer-wise prompt vectors into cific method as used in [397]. Indeed, the prefix based tuning meth-
the Transformer architecture specially for natural language ods [396, 401] can be also considered as prompting methods, which
understanding, which also utilizes multi-task learning for are called deep prompting tuning in [401]. In this survey, prompt tuning
jointly optimizing shared prompts. It has been shown to specially refer to the methods that only include the prompt tokens at
the input layer, in the context of LLMs. We assign p-tuning v2 [401] to
be useful in improving the model performance of different the category of prefix tuning, because it incorporates layerwise prompts
parameter scales on natural language understanding tasks. in langauge models.
45

parameter-efficient fine-tuning. Among these research at- In neural network compression, quantization often refers
tempts, LLaMA and its variants have gained much atten- to the mapping process from floating-point numbers to
tion for parameter-efficient tuning. For example, Alpaca- integers [412], especially the 8-bit integer quantization (i.e.,
LoRA [144] has been trained using LoRA as a lightweight INT8 quantization). For neural network models, there are
tuned version of Alpaca [142] (a fine-tuned 7B LLaMA typically two kinds of data to be quantized, namely weights
model with 52K human demonstrations of instruction fol- (model parameters) and activations (hidden activations),
lowing). There are extensive explorations of Alpaca-LoRA which are originally represented in floating-point num-
ranging in different languages or model sizes, which can bers. To illustrate the essential idea of model quantization,
be found in the collection page33 . A recent study LLaMA- we introduce a simple yet popular quantization function:
Adapter [409] inserts learnable prompt vectors into each xq = R(x/S)−Z , which transforms a floating number x into
Transformer layer, in which zero-initialized attention has a quantized value xq . In this function, S and Z denote the
been proposed to improve the training by mitigating the scaling factor (involving two parameters α and β that deter-
influence of under-fitted prompt vectors. They also extend mine the clipping range) and zero-point factor (determining
this approach to a multi-modal setting, e.g., visual question symmetric or asymmetric quantization), respectively, and
answering. R(·) denotes the rounding operation that maps a scaled
Further, an empirical study [399] has been conducted floating value to an approximate integer.
to examine the effect of different tuning methods on lan- As the reverse process, dequantization recovers the orig-
guage models. They compare four efficient tuning methods inal value from the quantized value accordingly: x̃ =
including serial adapter tuning [398], parallel adapter tun- S · (xq + Z). The quantization error is calculated as the
ing [400, 410], and LoRA [145], on three open-source LLMs, numerical difference between the original value x and the
namely GPT-J (6B), BLOOM (7.1B) and LLaMA (7B), for recovered value x̃. The range parameters α and β have a
evaluation. Based on the experimental results on six math large impact on the quantization performance, which often
reasoning datasets, they show that these efficient-tuning need to be calibrated according to real data distributions, in
methods under-perform the reference baseline GPT-3.5 on either a static (offline) or dynamic way (runtime).
difficult tasks, while achieving a comparable performance For more details, we refer to the readers to the excel-
on simple tasks. Overall, LoRA performs relatively well lent survey [412] about quantization methods on neural
among these comparison methods, using significantly fewer networks.
trainable parameters.
As an important resource, the library PEFT [411] (stand- 5.4.2 Quantization Methods for LLMs
ing for parameter-efficient fine-tuning) has been released on There are generally two major model quantization ap-
GitHub34 . It has included several widely used efficient tun- proaches, namely quantization-aware training (QAT) (requir-
ing methods, including LoRA [145]/AdaLoRA [406], prefix- ing additional full model retraining) and post-training quanti-
tuning [396, 401], P-Tuning [402], and prompt-tuning [397]. zation (PTQ) (requires no model retraining). Compared with
Further, it supports a number of language models such as small-sized language models, two major differences need
GPT-2 and LLaMA, and also covers several representative to be considered when designing or selecting quantization
vision Transformer models (e.g., ViT and Swin Transformer). methods for LLMs. Firstly, LLMs consist of a huge number
As discussed in Section 5.3.1, there have been a large of parameters, and thus PTQ methods are more preferred
number of efficient tuning methods proposed in the existing due to a much lower computational cost than QAT methods.
literature. However, most of these approaches are tested Secondly, LLMs exhibit very different activation patterns
on small-sized pre-trained language models, instead of the (i.e., large outlier features), and it becomes more difficult
LLMs. So far, there still lacks a thorough investigation on to quantize LLMs, especially hidden activations. Next, we
the effect of different efficient tuning methods on large-sized will briefly review several representative PTQ methods35 for
language models at different settings or tasks. LLMs.
Post-Training Quantization (PTQ). We first introduce the
5.4 Memory-Efficient Model Adaptation PTQ methods for LLMs.
Due to the huge number of model parameters, LLMs take a • Mixed-precision decomposition. As observed in [413],
significant memory footprint for inference, making it very extreme large values occur in hidden activations (called
costly to be deployed in real-world applications. In this the emergence of outliers) when the model size reaches 6.7B
section, we discuss how to reduce the memory footprint parameters or above. Interestingly, these outliers are mainly
of LLMs via a popular model compression approach (i.e., distributed in some specific feature dimensions at Trans-
model quantization), so that large-sized LLMs can be used former layers. Based on this finding, a vector-wise quan-
in resource-limited settings, which also likely reduces the tization approach, called LLM.int8(), has been proposed in
inference latency. [413], which separates the feature dimensions with outliers
and the rest dimensions in matrix multiplication. Then,
5.4.1 Background for Quantization the calculations for the two parts are performed with 16-
bit floating numbers and 8-bit integers, respectively, so as to
In this part, we present a general introduction of quantiza-
recover these outliers in a high precision.
tion techniques for neural networks.
35. Since we mainly focus on discussing quantization methods in the
33. https://github.com/tloen/alpaca-lora context of LLMs, the line of quantization work on small-sized language
34. https://github.com/huggingface/peft models (e.g., BERT) has not been included in this survey.
46

• Fine-grained quantization. For Transformer models, study [420] explores the effect of QAT methods by applying
weights and activations are usually represented in the a data-free distillation method to compress the weights,
form of tensors. A straightforward approach is to use activations as well as key-value cache. By conducting exten-
coarse-grained quantization parameters for the whole ten- sive experiments based on LLaMA, they show promising
sor (i.e., per-tensor quantization) [414]. However, it usu- results with 4-bit quantization on both weights and key-
ally leads to inaccurate reconstruction results. Thus, fine- value cache, but not on 4-bit activation quantization, which
grained methods are proposed to reduce the quantization still needs more exploration.
error. ZeroQuant [415] adopts a token-wise quantization
approach with dynamic calibration for compressing acti- 5.4.3 Empirical Analysis and Findings
vations. Whereas for weights (easier to be quantized), it Quantization has currently become a common technique
uses a group-wise quantization. In practice, a group size to reduce the memory footprint and latency of LLMs in
of 128 [415, 416] is commonly used for model quantization. deployment. In particular, it is important to understand
• Balancing the quantization difficulty. Considering that what level of precision (e.g., INT8 or INT4) can be applied
weights are easier to be quantized than activations, to quantize different parts of LLMs (e.g., weights or acti-
SmoothQuant [414] proposes to migrate the difficulty from vations), while retaining a high accuracy. In this part, we
activations to weights. Specially, they incorporate a scaling first summarize the major findings about the quantization of
transformation to balance the difficulty between weights LLMs in existing literature, and then present some empirical
and activations in a linear layer: Y = (Xdiag(s)−1 ) · analysis with quantization experiments.
(diag(s)W). By introducing an mathematically equivalent
transformation, this formula controls the quantization diffi- Important Findings from Existing Work. Recently, a very
culty through the scaling factor s. To set s, it incorporates comprehensive evaluation [421] has been conducted about
a migration strength parameter α to balance the difficulties, the impact of multiple factors (e.g., model size and sensi-
where each entry sj = max(xj )α / max(wj )(1−α) is deter- tivity) on the post-training quantization methods. Another
mined by the migration strength. study [422] examines the scaling law of k -bit quantiza-
• Layerwise quantization. This approach finds optimal tion in inference performance. In addition to the overall
quantized weights that minimize a layerwise reconstruction performance, the study [423] specifically focuses on the
loss: arg minW 2
c ∥ WX− WX ∥2 . To efficiently optimize this
c potential impact of quantification on emergent capabilities,
objective, GPTQ [417] improves the original optimal brain as well as the levels of performance that can be achieved
quantization (OBQ) [418] method by fixing the quantiza- across various levels of bit precision. Also, prior work (e.g.,
tion order of weights for all rows. Further, with specially LLM.int8() [424], GPTQ [417], QLoRA [419], and GLM [93])
designed methods (i.e., lazy batch-updates and Cholesky has also extensively examined the performance of quanti-
reformulation), GPTQ is feasible to quantize very large zation methods in various settings. Next, we summarize
models (e.g., 175B OPT) in 3 or 4 bit precision. More recently, several important findings from these studies, which will
AWQ [416] further simplifies the optimization form by be useful for those who may not want to delve into the
incorporating activation-aware scaling for weights, which technical details of quantization methods.
resembles the idea of SmoothQuant [414]: weights corre- • INT8 weight quantization can often yield very good re-
sponding to outlier activations are more important to be sults on LLMs, while the performance of lower precision weight
precisely quantized. It does not directly optimize the recon- quantization depends on specific methods [414, 416, 417, 421]. In
struction loss, but instead performs simple hyper-parameter most cases, INT8 weight quantization can be effectively ap-
search to achieve the minimal loss on calibration data. plied to reduce the memory footprint without performance
These strategies in the above methods can be jointly degradation. While for INT4 (or INT3) weight quantization,
used to improve the quantization performance. In order to existing methods rely on specific strategies to reduce the
achieve high-efficiency implementation, quantization meth- performance degradation, e.g., layerwise method [415, 417],
ods also rely on hardware- or system-level support (e.g., ef- activation-aware scaling [416] and low-rank adapter tun-
ficient GPU kernels or hardware-friendly group partition). ing [419]. Interestingly, LLMs seem to be less sensitive
to low-bit weight quantization than small-sized language
Other Quantization Methods. In the above, we mainly fo- models [421]. In practice, with the same memory cost, it
cus on PTQ methods, and next introduce two recent studies is suggested to use a larger language model with a lower
that explore efficient fine-tuning methods or QAT methods quantization precision rather than a smaller language model
for quanitizing LLMs. with a higher quantization precision. For example, a 4-bit
• Efficient fine-tuning enhanced quantization. For post- 60GB LLM is demonstrated to have better performance than
training quantization, direct low-bit quantization (e.g., INT4 a 8-bit 30GB LLM [422]. Moreover, focusing on emergent
quantization) often results in large performance degrada- capabilities, the study [423] finds that in-context learning,
tion. To overcome this challenge, QLoRA [419] incorporates step-by-step reasoning, and instruction following all seem
additional small tunable adapters (16-bit precision) into the to be seldom affected with 4-bit weight quantization. This
quantized models, to achieve an efficient, high-precision result suggests that INT4 quantization exhibits a favorable
model fine-tuning. It combines the merits of LoRA (See trade-off in terms of both total bits and performance of
Section 5.3.1) and quantization methods. The experiment emergent abilities.
results show that 4-bit quantized models can achieve the • Activations are more difficult to be quantized than
full 16-bit fine-tuning performance by QLoRA. weights [413, 414, 421]. It has been found that large outliers
• Quantization-aware training (QAT) for LLMs. A recent would occur for Transformer language models having a
47

size of 6.7B or above [413]. This issue has been one of models of varied sizes based on the GPTQ algorithm [417].
the most fundamental difficulties to quantize LLMs. To Also, it provides a comparison with bitsandbytes in both
overcome this issue, various methods, e.g., mixed-precision memory and performance (PPL) on the project website.
decomposition [413], fine-grained quantization [413, 425] • AutoGPTQ38 is a quantization package developed
and difficulty migration [414], can be applied to alleviate the based on the GPTQ algorithm [417], which supports INT4
influence of outlier values. Since large outliers mainly exist quantization for LLMs. It includes a number of quantized
in the activations of LLMs, small language models are more models in the library, and supports LoRA by integrating
resistant to activation quantization [421, 423]. In practice, with HuggingFace PEFT library.
high-quality INT8 activation quantization is still a difficult • llama.cpp39 makes it feasible to run quantized LLaMA
task, though several methods can attain satisfying results. models on a MacBook device. It supports INT4, INT5 and
Further, lower precision activation quantization has still not INT8 quantization, which is developed in efficient C/C++
been successfully explored, even for QAT methods [420]. implementation. It also supports a number of LLaMA based
• Efficient fine-tuning enhanced quantization is a good op- models, such as Alpaca and Vicuna.
tion to enhance the performance of quantized LLMs [145, 419].
The benefits of efficient fune-tuning methods in quanti- Quantized LLMs. Compared with original models, quan-
zation can be twofold. Firstly, it can directly compensate tized language models take a smaller memory footprint,
the performance degradation suffered from low-bit quantization [421, 423], by increasing the fitting capacity through updating high-precision adapters. Secondly, it is flexible to support task-specific or goal-specific fine-tuning of LLMs in a lightweight way [419], e.g., instruction tuning or chat-oriented tuning, by only tuning the small adapters. Overall, it makes a good trade-off between effectiveness and training cost, which provides a promising approach to enhancing the performance of quantized LLMs.

Empirical Analysis on Quantization Experiments. To further help readers understand the impact of quantization on LLMs, we also conduct a group of experiments to investigate the inference performance of quantized models here. Specifically, we focus on the fine-tuned LLaMA models (i.e., 7B and 13B) using popular SFT datasets, including FLAN-v2 [69], Alpaca-52K [137] and ShareGPT [148]. For evaluation, we utilize the same tasks in Table 9, and follow the quantization settings in the study [423] examining the performance of quantized language models at three precision levels: 4-bit, 8-bit and 16-bit. The results are summarized in Table 10. As can be observed from Table 10, the results obtained with 8-bit and 4-bit weight quantization are close to the performance of 16-bit models while significantly reducing memory consumption. In practice, it is recommended to first examine the performance of 4-bit weight quantization for LLMs if reducing memory usage is a critical consideration for deployment.

5.4.4 Open-source Libraries and Quantized LLMs
In this part, we briefly introduce the available open-source quantization libraries and quantized LLMs.

Quantization Libraries. Next, we introduce several major quantization libraries for LLMs, including:
• Bitsandbytes36 is developed based on the methods introduced in the papers of LLM.int8() [413] and 8-bit optimizers [426]. It focuses on the quantization of both activations and weights for LLMs, including the support of 8-bit and 4-bit (NF4, FP4) matrix multiplication for efficient inference, as well as an 8-bit optimizer for efficient training.
• GPTQ-for-LLaMA37 is developed specially for quantizing LLaMA models. It enables 4-bit quantization of LLaMA models.

Quantized LLMs. Quantized language models take a smaller memory footprint and likely have a faster inference speed [93, 413, 427]. Recently, a number of quantized model copies of several publicly available language models have been released on HuggingFace, including BLOOM, GPT-J, and ChatGLM. In particular, GPTQ [417] has been widely used to quantize generative language models, leading to various quantized variants for LLaMA and OPT. Further, it has also been applied to quantize instruction-tuned models, such as Vicuna and WizardLM. Due to the large number of quantized LLMs, we do not directly incorporate the corresponding links of these models. The readers can easily find them by searching on HuggingFace.

6 UTILIZATION
After pre-training or adaptation tuning, a major approach to using LLMs is to design suitable prompting strategies for solving various tasks. In existing literature, task-specific prompts can be effectively learned through manual creation and automatic optimization. A representative prompting method is in-context learning [50, 55], which formulates the task description and/or demonstrations in the form of natural language text. In addition, chain-of-thought prompting [33] can be employed to enhance in-context learning by involving a series of intermediate reasoning steps in prompts. Furthermore, planning [439] is proposed for solving complex tasks, which first breaks them down into smaller sub-tasks and then generates a plan of action to solve these sub-tasks one by one. We summarize representative work for these prompting approaches in Table 11. Next, we will elaborate on the details of the four techniques.

6.1 Prompting
As discussed in previous work [36], prompting is the major approach to utilizing LLMs for solving various tasks. Since the quality of prompts will largely influence the performance of LLMs in specific tasks, there have been a series of studies proposed to generate suitable task prompts through manual creation or automatic optimization, which will be introduced in this section.

36. https://github.com/TimDettmers/bitsandbytes
37. https://github.com/qwopqwop200/GPTQ-for-LLaMa
38. https://github.com/PanQiWei/AutoGPTQ
39. https://github.com/ggerganov/llama.cpp

TABLE 10: Evaluation results for quantized LLaMA models (7B and 13B). We employ existing model checkpoints provided by [353] for quantization experiments, which have been fine-tuned on FLAN-v2, Alpaca-52K, and ShareGPT, respectively. Specifically, we report the performance on AlpacaFarm, MMLU, and BBH, as well as the memory usage of the loaded model (Mem.). For quantization, we employ bitsandbytes to quantize the 16-bit models to 8/4 bits by specifying the arguments load_in_8bit and load_in_4bit when loading the weights. It is worth noting that we select text-davinci-003 as the baseline model for the AlpacaFarm dataset.

Models       SFT Dataset  | 16-bit: AlpacaFarm  MMLU   BBH    Mem.(GiB) | 8-bit: AlpacaFarm  MMLU   BBH    Mem.(GiB) | 4-bit: AlpacaFarm  MMLU   BBH    Mem.(GiB)
LLaMA (7B)   FLAN-v2      |          6.65       47.34  35.05  12.58     |        6.15       47.02  35.17  6.65      |        7.83       46.23  34.77  3.94
LLaMA (7B)   Alpaca-52K   |         32.55       40.87  33.66  12.58     |       33.60       39.98  34.38  6.65      |       29.57       39.24  32.80  3.94
LLaMA (7B)   ShareGPT     |         72.05       41.30  32.90  12.58     |       72.86       39.34  32.71  6.65      |       70.31       40.08  32.11  3.94
LLaMA (13B)  FLAN-v2      |          8.14       51.67  41.46  24.40     |        7.64       51.02  41.25  12.53     |        7.52       50.48  40.68  7.34
LLaMA (13B)  Alpaca-52K   |         33.60       47.63  36.10  24.40     |       31.43       47.04  35.98  12.53     |       30.87       46.20  36.16  7.34
LLaMA (13B)  ShareGPT     |         75.59       47.58  38.00  24.40     |       73.79       47.71  38.31  12.53     |       71.99       45.77  36.97  7.34
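The 8-bit and 4-bit results in Table 10 are obtained by loading the 16-bit checkpoints with the bitsandbytes integration and the load_in_8bit/load_in_4bit arguments mentioned in the caption. As a minimal sketch (ours, not the authors' exact evaluation script; the checkpoint path is a placeholder and the exact arguments may vary across library versions), such a quantized model can be loaded and queried as follows:

# Minimal sketch (assumed setup) of loading a fine-tuned LLaMA checkpoint with
# bitsandbytes quantization through the HuggingFace Transformers API.
# Requires the transformers, accelerate, and bitsandbytes packages.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/llama-7b-sft"  # hypothetical local path to a fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",    # dispatch layers across available devices
    load_in_8bit=True,    # use load_in_4bit=True for 4-bit (NF4/FP4) loading instead
)

inputs = tokenizer("Question: What is model quantization? Short Answer:", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))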

TABLE 11: Typical LLM utilization methods and their key points for ICL, CoT, and planning. Note that the key points only
highlight the most important technical contribution.

Approach                    Representative Work                    Key Point
In-context                  KATE [428]                             Demonstration selection (similar; k-NN)
Learning (ICL)              EPR [429]                              Demonstration selection (dense retrieval; contrastive learning)
                            SG-ICL [430]                           Demonstration selection (LLM as the demonstration generator)
                            APE [431]                              Demonstration format (automatic generation & selection)
                            Structured Prompting [296]             Demonstration format (grouped context encoding; rescaled attention)
                            GlobalE & LocalE [432]                 Demonstration order (entropy-based metric; probing set generation with LLM)
Chain-of-thought            Complex CoT [433]                      Demonstration (complexity-based selection)
Prompting (CoT)             Auto-CoT [434]                         Demonstration (automatic generation)
                            Selection-Inference [435]              Generation (alternate between selection and inference)
                            Self-consistency [436]                 Generation (diverse paths; self-ensemble)
                            DIVERSE [437]                          Generation (diverse paths); Verification (step-wise voting)
                            Rationale-augmented ensembles [438]    Generation (rationale sampling)
Planning                    Least-to-most prompting [439]          Plan generation (text-based; problem decomposition)
                            DECOMP [440]                           Plan generation (text-based; problem decomposition)
                            PS [441]                               Plan generation (text-based)
                            Faithful CoT [442]                     Plan generation (code-based)
                            PAL [443]                              Plan generation (code-based; Python)
                            HuggingGPT [444]                       Plan generation (code-based; models from HuggingFace)
                            AdaPlanner [445]                       Plan refinement (skill memory)
                            TIP [446]                              Feedback acquisition (visual perception)
                            RAP [447]                              Feedback acquisition (LLM as the world model); Plan refinement (Monte Carlo Tree Search)
                            ChatCoT [448]                          Feedback acquisition (tool); Plan refinement (conversation between LLM and tools)
                            ReAct [449]                            Feedback acquisition (tool); Plan refinement (synergizing reasoning and acting)
                            Reflexion [450]                        Feedback acquisition (text-based self-reflection); Plan refinement (dynamic memory)
                            Tree of Thoughts [451]                 Feedback acquisition (vote comparison); Plan refinement (tree-based search)

6.1.1 Prompt Creation
The process of manually creating a suitable prompt is also called prompt engineering [452, 453]. A well-designed prompt is very helpful to elicit the abilities of LLMs for accomplishing specific tasks. In this part, we will first introduce the key components of prompts and discuss several principles for prompt design. Then, we evaluate ChatGPT with different prompts to show the results on several representative tasks. We are aware that there have been several existing papers [453, 454] and websites [455–457] that present suggestions and guidelines to design good prompts. As a comparison, we mainly aim to discuss the key factors (ingredients and principles) that are useful for prompt creation, and provide experimental results and analysis on popular tasks as a reference for beginners.

Key Ingredients. Typically, there are four key ingredients that depict the functionality of a prompt for eliciting the abilities of LLMs to complete the tasks, including task description, input data, contextual information, and prompt style. To have an intuitive understanding of our discussion, we also present three prompt examples for question answering, meta-review generation, and text-to-SQL in Table 13.
• Task description. A task description is typically a specific instruction that LLMs are expected to follow. In general, one should clearly describe the task goal in natural language. For tasks with special input or output formats, detailed clarifications are often needed, and one can further utilize keywords to highlight the special settings for better guiding LLMs in task completion.
• Input data. In common cases, it is straightforward to describe input data (e.g., an instance to be responded to by LLMs) in natural language. For special input data, such as knowledge graphs and tables, it is necessary to apply an appropriate and convenient way to make them readable for LLMs. For structured data, linearization is commonly used to transform the original records (e.g., knowledge triples) into sequences [458] due to its simplicity. Further, programming language (e.g., executable code) has also been utilized to formulate the structured data, which can also support using external tools (e.g., a program executor) to produce precise results [459, 460].

• Contextual information. In addition to the task description and input data, contextual or background information is also essential for specific tasks. For example, retrieved documents are highly useful for open-domain question answering as supporting evidence. Both the quality of the retrieved documents and their relevance to the question have an impact on the generated answers [461]. Thus, such information needs to be included in a proper prompt pattern or expression format. Furthermore, in-context task exemplars are also helpful for eliciting LLMs to accomplish a complex task, as they can better depict the task goal, the special output formats, and the mapping relation between input and output.
• Prompt style. For different LLMs, it is important to design a suitable prompt style for eliciting their abilities to solve specific tasks. Overall, one should express the prompt as a clear question or detailed instruction that can be well understood and answered. In some cases, it is also useful to add a prefix or suffix to better guide LLMs. For example, using the prefix “Let us think step by step” can help elicit LLMs to perform step-by-step reasoning, and using the prefix “You are an expert on this task (or in this domain)” can boost the performance of LLMs on some specific tasks. Further, for chat-based LLMs (e.g., ChatGPT), instead of directly feeding a long or complex task prompt, it is suggested to decompose it into multiple prompts for the sub-tasks and then feed them into LLMs via a multi-turn conversation [448].

Design Principles. Based on the key ingredients of prompts, we summarize several critical design principles that can help create more effective prompts for solving various tasks.
• Expressing the task goal clearly. Task descriptions should not be ambiguous or unclear, which likely leads to inaccurate or inappropriate responses. This highlights the need for clear and unambiguous directives when utilizing these models [66]. A clear and detailed description should contain various elements to explain a task, including the task objective, input/output data (e.g., “Given a long document, I want you to generate a concise summary.”), and the response constraints (e.g., “the length of the summary cannot exceed 50.”). By providing a well-clarified task description, LLMs can more effectively understand the target task and generate the desired output.
• Decomposing into easy, detailed sub-tasks. To solve complex tasks, it is important to decompose the difficult task into several easier, more detailed sub-tasks to help LLMs accomplish the goal step by step, which is closely related to the planning technique in Section 6.4. For example, following the suggestion [454], we can explicitly list the sub-tasks in the form of multiple numbered items (e.g., “Braid a coherent narrative by performing the following tasks: 1. ...; 2. ...; 3. ...”). By decomposing a target task into sub-tasks, LLMs can focus on solving easier sub-tasks and finally achieve more accurate results for complex tasks.
• Providing few-shot demonstrations. As discussed in Section 6.2, LLMs can benefit from in-context learning for solving complex tasks, where the prompts contain a small number of task examples of the desired input-output pairs, i.e., few-shot demonstrations. Few-shot demonstrations can help LLMs learn the semantic mapping between input and output without parameter tuning. In practice, it is suggested that one should generate a few high-quality demonstrations for the target task, which would highly benefit the final task performance.
• Utilizing model-friendly format. Since LLMs are pre-trained on specially constructed datasets, there are some prompt formats that can make LLMs better understand the instruction. For example, as the OpenAI documentation suggests, we can use ### or """ as a stop symbol to separate the instruction and context, which can be better understood by LLMs. As a general guideline, most existing LLMs perform a task better in English, thus it is useful to employ English instructions to solve difficult tasks based on machine translation.

Useful Tips. In addition to the design principles, we also present a collection of useful prompt tips based on existing work and our empirical experiences in Table 12. Note that these tips are suggested in a general manner and do not necessarily constitute the best prompts for the corresponding tasks. This part will be continuously updated with more guidelines or tips. We welcome readers to contribute to this collection of prompt tips. We present the detailed procedure for contributing to the prompt tips at the link: https://github.com/RUCAIBox/LLMSurvey/tree/main/Prompts.

Empirical Analysis. We further conduct empirical studies to present the impact of prompts on task performance. To conduct the experiments, we select a variety of tasks that span language generation, knowledge utilization, complex reasoning, structured data generation, and information retrieval. For each task, we manually write a prompt that follows the general guidelines introduced above. Note that the tested prompts may not be optimal for these tasks, since they mainly aim to help readers understand how to write an effective prompt for solving different tasks. Also, we add a simplified prompt as a comparison for most tasks. Following the experimental settings in Section 7.4, we examine the 3-shot performance of ChatGPT on complex reasoning tasks (Colored Objects and GSM8k), and the zero-shot performance on other tasks. We report the experimental results in Table 17, where we also include the supervised performance from existing papers as a reference.
• Carefully designed prompts can boost the zero-shot or few-shot performance of ChatGPT. By comparing the results of using different prompts on the same task, we can see that using carefully designed prompts can achieve better performance than simpler ones. In the carefully designed prompts, we provide a more clearly expressed task description (e.g., WMT and WikiFact), or use a model-friendly format (e.g., GSM8k and OBQA). For example, for the WikiFact task, the prompt with a more detailed task description leads to a performance increase from 29.25 to 31.21.
• More complex tasks can benefit more from careful prompt engineering on ChatGPT. In the WikiFact and Colored Objects tasks, the designed prompts have greatly improved the performance of ChatGPT, i.e., from 23.61 to 28.47 on WikiFact and from 53.20 to 66.75 on Colored Objects. It indicates the necessity of prompt engineering for LLMs to perform well on complex tasks, since these tasks typically have specific output formats or require background knowledge. Our example prompts provide more detailed task descriptions (e.g., output format and task goal), which can help ChatGPT better understand the complex task requirements and fulfill them.
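To make the above ingredients concrete, the following toy sketch (our own illustration, not one of the prompts used in the reported experiments) assembles a prompt from a task description, contextual information, input data, and a style prefix; the article and question are made-up placeholders mirroring the first example in Table 13:

# A toy illustration (ours) of composing a prompt from the four key ingredients
# discussed in Section 6.1.1.
task_description = (
    'Use the provided article delimited by triple quotes to answer the question. '
    'If the answer cannot be found, write "I could not find an answer."'
)
contextual_information = 'Article: """Joao Moutinho is a Portuguese footballer who last played for Wolverhampton Wanderers."""'
input_data = "Question: Is the following sentence plausible? 'Joao Moutinho was out at third.'"
prompt_style = "Answer: Let's think step by step."  # suffix that elicits step-by-step reasoning

prompt = "\n".join([task_description, contextual_information, input_data, prompt_style])
print(prompt)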

TABLE 12: A collection of useful tips for designing prompts that are collected from online notes [453–456] and experiences from our authors, where we also show the related ingredients and principles (introduced in Section 6.1.1). We abbreviate principles as Prin. and list the IDs of the related principles for each prompt. ①: expressing the task goal clearly; ②: decomposing into easy, detailed sub-tasks; ③: providing few-shot demonstrations; ④: utilizing model-friendly format.

Task Description
T1. Make your prompt as detailed as possible, e.g., “Summarize the article into a short paragraph within 50 words. The major storyline and conclusion should be included, and the unimportant details can be omitted.” (Prin. ①)
T2. It is helpful to let the LLM know that it is an expert with a prefixed prompt, e.g., “You are a sophisticated expert in the domain of computer science.” (Prin. ①)
T3. Tell the model more about what it should do, rather than what it should not do. (Prin. ①)
T4. To prevent the LLM from generating an overly long output, you can just use the prompt: “Question: Short Answer: ”. Besides, you can also use the following suffixes: “in one or a few words”, “in one or two sentences”. (Prin. ①)

Input Data
I1. For questions that require factual knowledge, it is useful to first retrieve relevant documents via a search engine, and then concatenate them into the prompt as reference. (Prin. ④)
I2. To highlight some important parts in your prompt, please use special marks, e.g., quotation marks (“”) and line breaks (\n). You can also use both of them for emphasis. (Prin. ④)

Contextual Information
C1. For complex tasks, you can clearly describe the required intermediate steps to accomplish it, e.g., “Please answer the question step by step as: Step 1 - Decompose the question into several sub-questions, · · · ” (Prin. ②)
C2. If you want LLMs to provide a score for a text, it is necessary to provide a detailed description of the scoring standard with examples as reference. (Prin. ①)
C3. When LLMs generate text according to some context (e.g., making recommendations according to purchase history), instructing them to explain the generated result conditioned on the context is helpful for improving the quality of the generated text. (Prin. ②)
C4. An approach similar to tree-of-thoughts that can be done in one prompt: e.g., Imagine three different experts are answering this question. All experts will write down one step of their thinking, then share it with the group of experts. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is (Prin. ②)

Demonstration
D1. Well-formatted in-context exemplars are very useful, especially for producing outputs with complex formats. (Prin. ③)
D2. For few-shot chain-of-thought prompting, you can also use the prompt “Let's think step-by-step”, and the few-shot examples should be separated by “\n” instead of full stops. (Prin. ①③)
D3. You can also retrieve similar examples in context to supply useful task-specific knowledge for LLMs. To retrieve more relevant examples, it is useful to first obtain the answer to the question, and then concatenate it with the question for retrieval. (Prin. ③④)
D4. The diversity of the in-context exemplars within the prompt is also useful. If it is not easy to obtain diverse questions, you can also seek to keep the diversity of the solutions for the questions. (Prin. ③)
D5. When using chat-based LLMs, you can decompose in-context exemplars into multi-turn messages, to better match the human-chatbot conversation format. Similarly, you can also decompose the reasoning process of an exemplar into a multi-turn conversation. (Prin. ③)
D6. Complex and informative in-context exemplars can help LLMs answer complex questions. (Prin. ③)
D7. As a symbol sequence can typically be divided into multiple segments (e.g., i1, i2, i3 → i1, i2 and i2, i3), the preceding ones can be used as in-context exemplars to guide LLMs to predict the subsequent ones, meanwhile providing historical information. (Prin. ②③)
D8. Order matters for in-context exemplars and prompt components. For very long input data, the position of the question (first or last) may also affect the performance. (Prin. ③)
D9. If you cannot obtain in-context exemplars from existing datasets, an alternative is to use zero-shot generated ones from the LLM itself. (Prin. ③)

Other Designs
O1. Let the LLM check its outputs before drawing the conclusion, e.g., “Check whether the above solution is correct or not.” (Prin. ②)
O2. If the LLM cannot solve the task well, you can seek help from external tools by prompting the LLM to manipulate them. In this way, the tools should be encapsulated into callable APIs with detailed descriptions about their functions, to better guide the LLM to utilize the tools. (Prin. ④)
O3. The prompt should be self-contained, and it is better not to include pronouns (e.g., it and they) in the context. (Prin. ①)
O4. When using LLMs to compare two or more examples, the order affects the performance a lot. (Prin. ①)
O5. At the beginning of the prompt, assigning a role to the LLM is useful to help it better fulfill the following task instruction, e.g., “I want you to act as a lawyer”. (Prin. ①)
O6. OpenAI models can perform a task better in English than in other languages. Thus, it is useful to first translate the input into English and then feed it to LLMs. (Prin. ④)
O7. For multiple-choice questions, it is useful to constrain the output space of the LLM. You can use a more detailed explanation or simply impose constraints on the logits. (Prin. ①)
O8. For sorting-based tasks (e.g., recommendation), instead of directly outputting the complete text of each item after sorting, one can assign indicators (e.g., ABCD) to the unsorted items and instruct the LLMs to directly output the sorted indicators. (Prin. ①)
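Tip D5 can be made concrete with the message format used by chat-based LLM APIs. The sketch below (ours; the exemplar content and the OpenAI-style message schema are only illustrative assumptions) decomposes two in-context exemplars into alternating user/assistant turns before appending the test query:

# A sketch (ours) of tip D5: presenting few-shot exemplars as multi-turn chat
# messages instead of a single long prompt. The exemplar content is made up.
exemplars = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: bread", "pain"),
]
query = "Translate to French: apple"

messages = [{"role": "system", "content": "You are a helpful translation assistant."}]
for question, answer in exemplars:
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": query})
# `messages` can now be sent to a chat-based LLM endpoint that accepts this schema.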

• For mathematical reasoning tasks, it is more effective to design specific prompts based on the format of programming language. For GSM8k, the designed prompt employs code-formatted few-shot demonstrations to convert this mathematical reasoning task into a code generation task, which can leverage the strong code synthesis ability of ChatGPT for solving mathematical problems. Further, with the help of an external program executor, we are able to obtain more precise results instead of using LLMs for arithmetic operations. As we can see, the performance is boosted from 78.47 to 79.30 on GSM8k, indicating the usefulness of programming languages in mathematical reasoning tasks.
• In knowledge utilization and complex reasoning tasks, ChatGPT with proper prompts achieves comparable performance to or even outperforms the supervised baseline methods. In knowledge utilization and complex reasoning tasks, ChatGPT with proper zero-shot or few-shot prompts can achieve comparable performance to or even outperform the supervised methods, e.g., 31.21 (ChatGPT) vs. 34.20 (supervised baseline) on WikiFact. Despite that, ChatGPT still performs worse than supervised baseline models on some specific tasks (e.g., ARC and WikiFact), since these supervised models have been specially optimized with task-specific data.

• Through suitable prompt engineering, LLMs can handle some non-traditional NLP tasks. With the help of specific prompts, ChatGPT can also accomplish non-traditional NLP tasks, i.e., general recommendation and conversational recommendation. A key point is that these tasks can be well expressed or described in natural language. However, the performance of ChatGPT is still far from the referenced performance on these tasks, as LLMs cannot directly fit these tasks, which require specific domain knowledge and task adaptation [357, 462].

6.1.2 Prompt Optimization
Although manually creating task prompts is more intuitive, it is time consuming and, more importantly, models are highly sensitive to the crafted prompts: improper prompts will lead to low task performance (as shown in Table 17). Therefore, a large body of studies propose automatic optimization approaches for discrete prompts and continuous prompts to achieve the optimal performance [396, 405]. In this part, we will detail these studies from two perspectives, i.e., discrete prompts and continuous prompts.

Discrete Prompt Optimization. A discrete prompt is typically composed of a sequence of natural language tokens. Although the form is simple and flexible, optimizing prompts in discrete space is a challenging problem due to the combinatorially huge search space. To automatically search effective prompts for downstream tasks, existing studies propose a wide spectrum of discrete prompt approaches, which are detailed as follows.
• Gradient-based approaches. This kind of approach aims to optimize the prompt search process by maximizing the output likelihood via gradient update [405, 464–466]. As a representative work, Auto-Prompt [405] proposes a gradient-guided method to greedily search the optimal token for each position of the prompt, leveraging the gradient approximated by the change in the log-likelihood when replacing a prompt token with another candidate token from the vocabulary. However, such a search process can be extremely expensive, since it needs to evaluate each candidate token for each position of the prompt, leading to a large number of additional forward passes. Therefore, an improved gradient method [464] has been proposed that transforms discrete tokens into continuous embeddings and computes the gradient in continuous space during optimization.
• RL-based approaches. Since discrete prompts are difficult to learn through gradient back-propagation, a number of studies propose to formulate discrete prompt optimization as a reinforcement learning (RL) problem and leverage RL algorithms for optimization [467, 468]. For example, RLPrompt [467] trains a policy network to generate desired prompts with multiple reward functions. In this approach, several effective reward stabilization strategies are also proposed to enhance the RL training efficiency. Compared to previous work that requires sufficient data for training, TEMPERA [468] proposes to directly generate prompts at test time by utilizing a pre-trained RL agent to sequentially edit different parts of a manually-written initial prompt.
• Edit-based approaches. For the above methods, gradient-based and RL-based tuning can be extremely computationally demanding for ever larger models, and may not be feasible for API-based model calls (e.g., ChatGPT). Therefore, another line of work aims to directly edit existing prompts based on the task performance. Specifically, GPS [469] borrows an idea from genetic algorithms and proposes a genetic prompt search method that utilizes a language model (i.e., T5) to edit prompts by taking the cloze task form. In addition to model-based edit methods, human-defined operations can also be employed for prompt editing [470], including delete, swap, paraphrase, and addition. Based on these operations, they iteratively edit the prompts and greedily search for the best prompt, guided by the model performance on a small pool of examples.
• LLM-based approaches. Due to the exceptional capacities of LLMs, an increasing number of studies directly leverage LLMs as prompt generators [471–473]. Specifically, APE [471] utilizes an LLM to generate initial prompts, then selects the best prompt with the highest accuracy, and finally improves the best candidate through an iterative Monte Carlo search method. Similarly, APO [472] instructs the LLM to generate text feedback on how to refine an old prompt into new improved prompts. However, their search in the prompt space might be inefficient without fully considering the whole refinement trace of previous prompts, thus potentially leading to sub-optimal results. Therefore, another study [473] incorporates the previous prompts with their scores to instruct LLMs to progressively generate better new prompts. However, these approaches still struggle in exploring the vast space of effective prompts. Inspired by human-like trial-and-error, prompt optimization is further formulated as a strategic planning problem [474] that uses Monte Carlo tree search to navigate the vast prompt space.

Continuous Prompt Optimization. Different from discrete prompts, continuous prompts consist of a set of continuous embeddings, which can be directly optimized through gradient update based on the loss of downstream tasks. Note that continuous prompt optimization has been mainly studied in PLMs, but draws limited attention in the era of LLMs due to the massive magnitude of their parameters. We include the discussion of this part for content completeness. In prior work, most studies typically rely on supervised learning to train continuous prompts based on task data. Furthermore, in data-scarce scenarios, transfer learning methods can be employed to alleviate the lack of labeled data on target tasks. These two approaches are detailed below.
• Prompt learning with sufficient data. In this approach, most existing methods regard continuous prompts as trainable model parameters and then leverage supervised learning to optimize the continuous prompts by minimizing the cross-entropy loss based on sufficient downstream task data [396, 397, 401, 475]. As discussed in Section 5.3.1, prefix tuning [396] prepends a sequence of prefixes (i.e., a set of trainable continuous vectors) to each Transformer layer in language models, while prompt tuning [397] only incorporates trainable prompt vectors at the input layer. By fixing the large-scale parameters of LLMs and only tuning the continuous prompt vectors, this kind of approach can be extremely parameter-efficient (Section 5.3).

TABLE 13: Example instructions collected from [454, 463]. The blue text denotes the task description, the red text denotes
the contextual information, the green text denotes the demonstrations, and the gold text denotes the prompt style.

Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write “I could not find an
answer.”
Articles: “““Joao Moutinho is a Portuguese footballer who last played as a central midfielder for Premier League club Wolverhampton Wanderers
and the Portugal national team.”””
Question: Is the following sentence plausible? ’Joao Moutinho was out at third.’
Answer: Let’s think step by step. Joao Moutinho is a soccer player. Being out at third is part of baseball, not soccer. So the answer is No.
...
<Demonstrations>

Articles: <insert articles, each delimited by triple quotes>


Question: <insert question>
Answer:

Prepare a meta-review by answering the following questions from the reviewer comments (provided after the questions).
1. Based on the reviewer’s comments, what are the core contributions made by this manuscript?
2. What are the common strengths of this work, as mentioned by multiple reviewers?
3. What are the common weaknesses of this work, as highlighted by multiple reviewers?
4. What suggestions would you provide for improving this paper?
5. What are the missing references mentioned by the individual reviews?
The review texts are below: <insert three comments R1 , R2 , R3 from the reviewers>
Meta-review: <insert meta-review>
...
<Demonstrations>

Provide justification for your response in detail by explaining why you made the choices you actually made. A good output should be coherent,
highlight major strengths/issues mentioned by multiple reviewers, be less than 400 words in length, and finally, the response should be in English
only.

The review texts are below: <insert three comments R1 , R2 , R3 from the reviewers>
Meta-review:

CREATE TABLE Highschooler (


ID int primary key,
name text,
grade int
);
/*
3 example rows:
SELECT * FROM Highschooler LIMIT 3;
ID name grade
1234 Janie 8
5678 Mary 8
9012 Mike 9
*/
Using valid SQLite, answer the following questions for the tables provided above.
Question: What is Kyle’s id?
SQL: SELECT ID FROM Highschooler WHERE name=“Kyle”;
...
<Demonstrations>

Question: <insert question>


SQL:

However, these approaches are typically independent of the inputs, lacking sufficient consideration of input semantics. Therefore, the authors in [475] propose context tuning, where the continuous prompts are derived based on the input text and learned through the downstream task losses.
• Prompt transferring with scarce data. Supervised learning approaches demand sufficient training data to learn optimal continuous prompts, which may not work well in data-scarce domains and tasks. To address this problem, SPoT [476] proposes a prompt-based transfer learning approach, which first learns a single continuous prompt for several representative source tasks and then uses this prompt to initialize the prompt for a target task. However, this approach leverages the same prompt for solving all instances of the target task. For a single task, even a well-learned prompt may not be suitable for all the data instances from a large population. To address this issue, an improved method [477] designs an adaptive attention mechanism during the prompt transfer process to derive the target prompts, considering both task- and instance-level information. The prompt transfer paradigm can leverage the knowledge of data-sufficient source tasks encoded in source prompts for solving data-scarce target tasks.
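To illustrate the LLM-based discrete prompt optimization described above, the following sketch (ours) follows an APE-style loop: an LLM proposes candidate instructions from a few input-output pairs, each candidate is scored on a small validation set, and the highest-scoring one is kept. The llm callable and the validation data are hypothetical placeholders, and real systems add iterative refinement (e.g., Monte Carlo search over prompt variants).

# An APE-style prompt search sketch (ours). `llm(prompt)` is a hypothetical helper
# that returns a model completion; `dev_set` is a small list of (input, label) pairs.
def propose_candidates(llm, dev_set, n_candidates=8):
    examples = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in dev_set[:3])
    meta_prompt = (
        "I gave a friend an instruction. Based on the instruction they produced "
        f"the following input-output pairs:\n{examples}\nThe instruction was:"
    )
    return [llm(meta_prompt) for _ in range(n_candidates)]

def score(llm, instruction, dev_set):
    # exact-match accuracy of the candidate instruction on the validation set
    correct = sum(llm(f"{instruction}\nInput: {x}\nOutput:").strip() == y for x, y in dev_set)
    return correct / len(dev_set)

def search_best_prompt(llm, dev_set):
    candidates = propose_candidates(llm, dev_set)
    return max(candidates, key=lambda c: score(llm, c, dev_set))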

6.2 In-Context Learning
As a special prompting form, in-context learning (ICL) is first proposed along with GPT-3 [55], and it has become a typical approach to utilizing LLMs.

6.2.1 ICL Formulation
As stated in [55], ICL uses a formatted natural language prompt, consisting of the task description and/or a few task examples as demonstrations. Figure 14 presents an illustration of ICL. First, starting with a task description, a few examples are selected from the task dataset as demonstrations. Then, they are combined in a specific order to form natural language prompts with specially designed templates. Finally, the test instance is appended to the demonstrations as the input for LLMs to generate the output. Based on task demonstrations, LLMs can recognize and perform a new task without explicit gradient update.
Formally, let Dk = {f(x1, y1), . . . , f(xk, yk)} represent a set of demonstrations with k examples, where f(xk, yk) is the prompt function that transforms the k-th task example into a natural language prompt. Given the task description I, the demonstrations Dk, and a new input query xk+1, the prediction of the output ŷk+1 generated from LLMs can be formulated as follows40:

$$\text{LLM}\big(I, \underbrace{f(x_1, y_1), \ldots, f(x_k, y_k)}_{\text{demonstrations}}, f(\underbrace{x_{k+1}}_{\text{input}}, \underbrace{\_\_\_}_{\text{answer}})\big) \rightarrow \hat{y}_{k+1}, \tag{12}$$

where the actual answer yk+1 is left as a blank to be predicted by the LLM. Since the performance of ICL heavily relies on demonstrations, it is important to properly design them in the prompts. According to the construction process in Equation (12), we focus on three major aspects of formatting demonstrations in the prompts, including how to select the examples that make up the demonstrations, format each example into the prompt with the function f(·), and arrange the demonstrations in a reasonable order.
A comprehensive review of ICL has been presented in the survey paper [50], and we suggest that readers refer to it for a more general, detailed discussion on this topic. Compared with that survey, we specially focus on the discussion of applying ICL to LLMs in two major aspects, i.e., demonstration design and the underlying mechanism of ICL. Also, ICL has a close connection with instruction tuning (discussed in Section 5.1) in that both utilize natural language to format the task or instances. However, instruction tuning needs to fine-tune LLMs for adaptation, while ICL only prompts LLMs for utilization. Furthermore, instruction tuning can enhance the ICL ability of LLMs to perform target tasks, especially in the zero-shot setting (only using task descriptions) [69].

6.2.2 Demonstration Design
Several studies have shown that the effectiveness of ICL is highly affected by the design of demonstrations [432, 478, 479]. Following the discussion in Section 6.2.1, we will introduce the demonstration design of ICL from three major aspects, i.e., demonstration selection, format, and order.

Demonstration Selection. The performance of ICL tends to have a large variance with different demonstration examples [428], so it is important to select a subset of examples that can effectively leverage the ICL capability of LLMs. There are two main demonstration selection approaches, namely heuristic and LLM-based approaches:
• Heuristic approaches. Due to their simplicity and low costs, existing work widely adopts heuristic methods to select demonstrations. Several studies employ a k-NN based retriever to select examples that are semantically relevant to the query [428, 480]. However, they perform the selection individually for each example, rather than evaluating the example set as a whole. To resolve this issue, diversity-based selection strategies are proposed to choose the most representative set of examples for specific tasks [481, 482]. Furthermore, in [483], both relevance and diversity are taken into consideration when selecting demonstrations.
• LLM-based approaches. Another line of work selects demonstrations by making use of LLMs. For example, LLMs can be utilized to directly measure the informativeness of each example according to the performance gain after adding the example [484]. In addition, EPR [429] proposes a two-stage retrieval approach that first recalls similar examples with an unsupervised method (e.g., BM25) and then ranks them using a dense retriever (trained with positive and negative examples labeled by LLMs). As an alternative approach, the task of demonstration selection can be formulated as an RL problem, where LLMs serve as the reward function to provide feedback for training the policy model [485]. Since LLMs perform well for text annotation [486], some recent studies employ the LLM itself as the demonstration generator without human intervention [487]. To summarize, as discussed in [488], for both of the above selection approaches, the selected demonstration examples in ICL should contain sufficient information about the task to solve and be relevant to the test query.

Demonstration Format. After selecting task examples, the next step is to integrate and format them into a natural language prompt for LLMs. A straightforward method is to instantiate a pre-defined template with the corresponding input-output pairs [36]. To construct more informative templates, recent studies consider adding task descriptions [69] or enhancing the reasoning capability of LLMs with chain-of-thought prompts [33]. For instance, in [166], the authors collect a large-scale dataset with task descriptions written by humans. After tuning with this dataset, the performance on seen tasks can be boosted, and LLMs can also generalize to unseen tasks to some extent. To reduce the annotation costs, a semi-automated approach has been proposed in [143] that employs a seed set consisting of human-written task descriptions to guide LLMs to generate task descriptions for new tasks. Since it is costly to manually annotate demonstration formats for different tasks, some work also studies how to automatically generate high-quality ones. As two representative methods, Auto-CoT [434] leverages LLMs with the zero-shot prompt “Let's think step by step” for generating intermediate reasoning steps, while least-to-most prompting [439] first queries LLMs to perform problem decomposition and then utilizes LLMs to sequentially solve sub-problems based on the intermediate answers to previously solved ones.

40. When ICL was introduced in the GPT-3 paper [55], it was originally defined to be a combination of the task description and demonstration examples, wherein either component is dispensable. Following this definition, when an LLM is required to solve an unseen task by using only task descriptions, it can also be considered to perform ICL for task solving, whereas the ICL ability can be enhanced by instruction tuning.

(Figure 14 shows two side-by-side example prompts, “In-Context Learning” and “Chain-of-Thought Prompting”, on mathematical reasoning questions: both contain a task description, demonstrations, and a test query, and the CoT panel additionally includes intermediate reasoning steps before each answer.)

Fig. 14: A comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts LLMs
with a natural language description, several demonstrations, and a test query, while CoT prompting involves a series of
intermediate reasoning steps in prompts.
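Following Equation (12) and the ICL panel sketched in Figure 14, the snippet below (our own minimal illustration) implements the prompt function f(·) as a simple template and assembles an ICL prompt from a task description I, k demonstrations, and a new test query:

# A minimal sketch (ours) of assembling an ICL prompt following Equation (12).
# The task description and demonstrations are toy examples taken from Figure 14.
def f(x, y=""):
    """Prompt function f(.) that turns a task example into a textual demonstration."""
    return f"Q: {x}\nA: {y}".rstrip()

task_description = "Answer the following mathematical reasoning questions:"
demonstrations = [
    ("If you have 12 candies and you give 4 candies to your friend, how many candies do you have left?", "The answer is 8."),
    ("If a rectangle has a length of 6 cm and a width of 3 cm, what is the perimeter of the rectangle?", "The answer is 18 cm."),
]
query = "Sam has 12 marbles. He gives 1/4 of them to his sister. How many marbles does Sam have left?"

icl_prompt = "\n\n".join([task_description, *[f(x, y) for x, y in demonstrations], f(query)])
print(icl_prompt)  # the LLM is expected to complete the blank answer slot of the final query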

Demonstration Order. LLMs are shown to sometimes suffer from recency bias, i.e., they are prone to repeat answers that are near the end of the demonstrations [479]. Thus, it is important to arrange demonstrations (i.e., task examples) in a reasonable order. Early work proposes several heuristic methods to quickly find a good order. For example, demonstrations can be directly organized according to their similarity to the query in the embedding space [428]: the more similar, the closer to the end. In addition, global and local entropy metrics can be used to score different demonstration orders [432]. To integrate more task information, some recent studies propose to minimize the code length required to compress and transmit task labels, which is inspired by information theory [489]. However, these methods need additional labeled data as the validation set to evaluate the performance of specific demonstration orders. To eliminate this need, the authors in [432] propose to sample the validation data from the LLM itself.

6.2.3 Underlying Mechanism
After pre-training, LLMs can exhibit intriguing ICL capability without being updated. In what follows, we discuss two key questions about the ICL ability of LLMs, i.e., “how does pre-training affect the ICL ability” and “how do LLMs perform ICL during inference”.

How Pre-Training Affects ICL? ICL is first proposed in GPT-3 [55], and it has been shown that the ICL ability becomes more significant with a larger model size. Further, some studies reveal that small-scale PLMs can also demonstrate a strong ICL ability through continual pre-training [490] or fine-tuning [491] on specially designed training tasks, which typically involve additional task examples in the input during the training process. This suggests that the design of training tasks is an important influence factor on the ICL capability of LLMs. Besides training tasks, recent studies have also investigated the relationship between ICL and pre-training corpora [488, 492]. For example, ICL can be theoretically explained as the product of pre-training on documents that exhibit long-range coherence [488]. Further, another study [492] theoretically shows that, when scaling parameters and data, LLMs based on next-word prediction can develop the ability of ICL by learning from the compositional structure (e.g., how words and phrases are combined to form larger linguistic units like sentences) present in language data.

How LLMs Perform ICL? At the inference stage, researchers focus on analyzing how the ICL capability operates based on given demonstrations, since no explicit learning or updating is involved. According to the discussion in [493], there are two main ways for LLMs to utilize demonstrations: task recognition and task learning.
• Task recognition. In the first way, LLMs recognize the task from demonstrations and utilize the prior knowledge obtained from pre-training to solve new test tasks. A Probably Approximately Correct (PAC) framework [494] has been proposed to assess the learnability of ICL. It assumes that there exists a latent variable representing the task in the pre-training data, and LLMs have been shown to be capable of capturing this variable from demonstrations, enabling them to recognize the task in ICL. Also, the interpretation of ICL as task recognition is supported by several empirical studies [478, 495]. For example, it has been observed that replacing the inputs or labels of demonstrations with random ones sampled from the input or label space does not seriously hurt the performance of LLMs, indicating that LLMs mainly recognize the target task from demonstrations instead of learning from them [478, 493]. Similarly, LLMs can exhibit decent performance even if the prompt template is irrelevant or misleading [495].

• Task learning. In the second way, LLMs learn new tasks unseen in the pre-training stage only through demonstrations. Specifically, task learning is analyzed mainly from the perspective of gradient descent and considered as implicit fine-tuning [65, 496]. Then, ICL can be explained as follows: by means of forward computation, LLMs generate meta-gradients with respect to demonstrations and implicitly perform gradient descent via the attention mechanism. Experiments also show that certain attention heads in LLMs are capable of performing task-agnostic atomic operations (e.g., copying and prefix matching), which are closely related to the ICL ability [497]. Furthermore, some studies abstract ICL as an algorithm learning process [498]. For example, the authors in [498] find that LLMs essentially encode implicit models through their parameters during pre-training. With the examples provided in ICL, LLMs can implement learning algorithms such as gradient descent or directly compute the closed-form solution to update these models during forward computation. Under this explanation framework, it has been shown that LLMs can effectively learn simple linear functions and even some complex functions like decision trees with ICL [498].
As discussed in a recent study [493], LLMs exhibit the abilities of both task recognition and task learning in ICL, but the two abilities seem to be possessed at different model scales. As shown in the experiments [493], the ability of task recognition is easier to obtain, and even a small LM with only 350M parameters can exhibit this ability, while task learning can only emerge for LLMs with at least 66B parameters. Another study [499] also supports this finding with specially designed experiments. They set up tasks with flipped and semantically unrelated labels in the experiment, which require task learning when performing ICL. The results suggest that small LMs tend to disregard the labels and mainly depend on their prior knowledge to accomplish the task, while LLMs have the ability to surpass their prior knowledge and acquire new knowledge from demonstrations, resulting in better outcomes. Furthermore, to improve the task learning ability, Meta-In-Context Learning [500] proposes to include multiple related tasks instead of just a single one in the prompt. In addition, Symbol Tuning [501] fine-tunes LLMs on demonstrations with semantically unrelated labels (e.g., foo/bar instead of positive/negative for sentiment analysis), forcing LLMs to learn the task from demonstrations instead of relying on prior knowledge.

6.3 Chain-of-Thought Prompting
Chain-of-Thought (CoT) prompting [33, 502] is an improved prompting strategy to boost the performance of LLMs on complex reasoning tasks, such as arithmetic reasoning [503], commonsense reasoning [504], and symbolic reasoning [33]. Instead of simply constructing the prompts with input-output pairs as in ICL, CoT prompting further incorporates intermediate reasoning steps, which serve as the bridge between inputs and outputs. Figure 14 presents an illustration of CoT. In the following part, we will first elaborate on the basic CoT prompting approach and its improved strategies, then discuss when and why CoT prompting works.

6.3.1 Basic CoT Prompting Approach
CoT prompting is first proposed as an extension of ICL [33], which augments each demonstration ⟨input, output⟩ as ⟨input, CoT, output⟩. A CoT is a series of intermediate reasoning steps for connecting the input and output. With these augmented demonstrations, LLMs can follow them to generate CoTs and the answer for a new input. However, unlike ⟨input, output⟩ pairs in ICL, CoTs are difficult to obtain and usually require human annotation. Fortunately, it has been found that LLMs can be triggered to generate CoTs through simple instructions like “Let's think step by step.” [505], making CoT prompting easy to use. There are also alternative magic prompts that can elicit the ability of CoT reasoning and further improve the performance of LLMs, such as “Take a deep breath and work on this problem step-by-step.” [473].
As illustrated in Figure 15, the generation process of CoT follows a chain structure in the basic CoT prompting approach, where LLMs generate CoTs step by step. Typically, CoT takes the format of natural language text. However, textual CoTs may not work well on complex tasks that require rigorous logic for reasoning. Considering this, some work uses code [506, 507] due to its structured and precise nature. Furthermore, the authors in [508] propose to dynamically select text or code as the format of CoTs to combine their advantages.

6.3.2 Improved CoT Prompting Strategies
Despite the performance improvement in complex reasoning tasks, CoT prompting still suffers from problems like incorrect reasoning and instability. In this part, we first introduce how to design better CoT prompts and enhanced CoT generation strategies, and then introduce the extension of the basic chain structure of CoT. Figure 15 illustrates the evolution of representative CoT prompting strategies.

Better Prompt Design. Since CoT prompting relies on prompts to elicit the reasoning capabilities of LLMs, the design of prompts is critical to its performance. As a direct approach, it is shown that using diverse CoTs (i.e., multiple reasoning paths for each problem) can effectively enhance the performance [437]. Another intuitive idea is that prompts with more complex reasoning paths are more likely to elicit the reasoning ability of LLMs [433], which can result in higher accuracy in generating correct answers. However, all these approaches rely on annotated CoT datasets, which limits their use in practice. To overcome this limitation, magic instructions such as “Let's think step by step” can be used to automatically construct CoTs by prompting LLMs [434].

Enhanced CoT Generation. Since LLMs are prone to producing incorrect reasoning steps and exhibiting instability in the generation process, a number of studies [436, 509] aim to improve the generation of CoT. In this part, we will introduce two typical approaches to enhancing the generation of CoT: sampling-based and verification-based methods.
• Sampling-based methods. LLMs are known to suffer from instability during inference, which can lead to unfaithfulness in the generated reasoning steps.

(Figure 15 depicts five panels (CoT, sampling-based CoT, verification-based CoT, ToT, and GoT), each mapping an input to an output through intermediate thoughts, annotated with reasoning, backtracking, aggregation, ensemble, and verification operations and with unevaluated, positive, and negative thoughts.)

Fig. 15: An illustration of the evolution of CoT prompting strategies. It begins with the basic CoT approach and progresses
to enhanced CoT generation techniques, including sampling-based and verification-based methods. Finally, it extends to
variations of the chain structure, such as trees and graphs. Here, “thought” refers to an intermediate reasoning step as
stated in [33, 451].

To address this issue, some work proposes to sample multiple reasoning paths instead of using greedy decoding. As a representative solution, self-consistency [436] first generates several reasoning paths and then takes an ensemble over the corresponding answers, selecting the most consistent one through majority voting. However, such a method can still lead to wrong answers when most of the reasoning paths are misled. Considering this, the authors in [433] only vote on the k most complex reasoning paths, based on their observation that reasoning paths with higher complexity (e.g., more reasoning steps) usually have better performance. Furthermore, MCR [510] proposes referring to the steps from other reasoning paths when generating the next step, and performs reasoning across multiple reasoning paths to generate the final answer.
• Verification-based methods. The sequential nature of reasoning steps in CoTs can lead to the accumulation of errors in the generated CoTs when certain steps are incorrect. To mitigate this problem, recent studies propose to verify the correctness of generated reasoning steps with either trained verifiers or LLMs themselves. For example, DIVERSE [509] trains solution-level and step-level verifiers respectively to examine the reasoning steps at different granularities. Another approach [511] utilizes LLMs to verify the correctness of reasoning steps through step-by-step self-verification with a specially designed reasoning format. In addition, several studies propose backward reasoning for verification: it first deduces the necessary question conditions [512, 513] or variables [514] from the model's predictions, and then compares them with the original ones.

Reasoning Structure Extension. Despite the generality, the chain reasoning structure of basic CoT prompting limits its effectiveness in solving complex tasks, which require exploration like foresight and backtracking during inference. Therefore, many studies have been devoted to extending the reasoning structure by designing more intricate thought processes, e.g., tree- and graph-structured reasoning.
• Tree-structured reasoning. This approach (exemplified by Tree of Thoughts (ToT) [451, 515]) formulates the reasoning process in a hierarchical tree structure, where intermediate thoughts are nodes. In this way, it enables LLMs to explore multiple reasoning paths in parallel and further supports the operations of lookahead and backtracking to facilitate more comprehensive decisions. In addition, TouT [516] takes the uncertainty of intermediate thoughts into account for thought evaluation based on Monte Carlo Dropout.
• Graph-structured reasoning. Although the tree structure facilitates parallel reasoning, it also imposes restrictions on the reasoning process. With more complex topological structures, graphs offer greater flexibility in reasoning, enabling the characterization of more intricate relationships and interactions. For instance, Graph of Thoughts (GoT) [517, 518] conceptualizes the reasoning process as an arbitrary graph, where vertices denote intermediate thoughts and edges denote the interdependence between these thoughts. Compared with ToT, it can further utilize thoughts from other reasoning paths when generating new thoughts. However, such an approach requires a large number of interactions with LLMs, making the thought exploration process highly inefficient. To reduce potentially meaningless thought exploration, XoT [519] further proposes to guide the search of thoughts with pre-trained policy and value networks.

6.3.3 Further Discussion on CoT Prompting
In this part, we present discussions regarding two fundamental questions related to CoT prompting, i.e., “when does CoT prompting work for LLMs” and “why can LLMs perform CoT reasoning”.

When CoT Prompting Works For LLMs? Since CoT reasoning is an emergent ability [31], it only has a positive effect on sufficiently large models (typically containing 10B or more parameters [33]) but not on small models.

Moreover, since CoT prompting augments the standard prompting with intermediate reasoning steps, it is mainly effective for the tasks that require step-by-step reasoning [33], e.g., arithmetic reasoning, commonsense reasoning, and symbolic reasoning. In contrast, for other tasks that do not rely on complex reasoning, CoT prompting might lead to worse performance than standard prompting [438], e.g., MNLI-m/mm, SST-2, and QQP from GLUE [260]. Interestingly, it seems that the performance gain brought by CoT prompting could be significant only when standard prompting yields poor results [33].

Why LLMs Can Perform CoT Reasoning? As the second question, we discuss the underlying mechanism of CoT prompting in the following two aspects.
• The source of CoT reasoning ability. Regarding the source of CoT reasoning capability, it is widely hypothesized that it can be attributed to training on code, since models trained on code show a strong reasoning ability [47, 520, 521]. Intuitively, code data is well organized with algorithmic logic and programming flow, which may be useful to improve the reasoning performance of LLMs. However, this hypothesis still lacks publicly reported evidence from ablation experiments (with and without training on code). In addition, instruction tuning seems not to be the key reason for obtaining the CoT reasoning ability, since it has been empirically shown that instruction tuning on non-CoT data does not improve the performance on held-out CoT reasoning benchmarks [69].
• The effect of CoT prompting components. The major distinction between CoT prompting and standard prompting is the incorporation of reasoning paths prior to the final answer. Thus, some researchers investigate the effects of different components in the reasoning paths. Specifically, a recent study identifies three key components in CoT prompting, namely symbols (e.g., numerical quantities in arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the rest of the tokens that are neither symbols nor patterns) [522]. It is shown that the latter two parts (i.e., patterns and text) are essential to the model performance, and removing either one would lead to a significant performance drop. However, the correctness of symbols and patterns does not seem critical. Further, there exists a symbiotic relationship between text and patterns: the text helps LLMs to generate useful patterns, and patterns aid LLMs to understand tasks and generate texts that help solve them [522].
In summary, CoT prompting provides a general and flexible approach to eliciting the reasoning ability of LLMs. There are also some preliminary attempts to extend this technique to solve multimodal [523] and multilingual tasks [524].

6.4 Planning for Complex Task Solving
Prompting with ICL and CoT is a conceptually simple yet general approach to solving various tasks. However, this approach struggles with complex tasks like mathematical reasoning [525] and multi-hop question answering [526]. As an enhanced approach, prompt-based planning has been proposed to break down complex tasks into smaller sub-tasks and generate a plan of actions to accomplish the task.

(Figure 16 diagram: a task planner (LLM) generates and refines a plan; a plan executor carries out the actions in an environment, which may be internal (the LLM itself) or external (e.g., humans, tools, or the world); the environment returns feedback to the planner, and a memory component supports plan storage and retrieval.)
Fig. 16: An illustration of the formulation for prompt-based planning by LLMs for solving complex tasks.

6.4.1 The Overall Framework
In this part, we first formulate the general planning paradigm of LLMs for solving complex tasks, which is illustrated in Figure 16.
In this paradigm, there are typically three components: task planner, plan executor, and environment41. Specifically, the task planner, which is played by LLMs, aims to generate the whole plan to solve a target task. The plan can be presented in various forms, e.g., an action sequence in the form of natural language [439] or an executable program written in a programming language [443]. The LLM-based task planner can be enhanced with a memory mechanism for plan storage and retrieval, which is helpful for long-horizon tasks. Then, the plan executor is responsible for executing the actions in the plan. It can be implemented by models like LLMs for textual tasks [441] or by tools like code interpreters for coding tasks [450]. Furthermore, the environment refers to where the plan executor carries out the actions, and it can be set differently according to specific tasks, e.g., the LLM itself [527] or an external virtual world like Minecraft [528]. It provides feedback about the execution result of each action to the task planner, either in the form of natural language [450] or through other multimodal signals [446].
For solving a complex task, the task planner first needs to clearly understand the task goal and generate a reasonable plan based on the reasoning of LLMs (See Section 6.4.2). Then, the plan executor acts according to the plan in the environment, and the environment will produce feedback for the task planner (See Section 6.4.3). The task planner can further incorporate the feedback obtained from the environment to refine its initial plan and iteratively perform the above process to get better results as the task solution (See Section 6.4.4).

41. Despite the similarity with RL, our formulation decouples the planning and execution phases, whereas in RL, they are typically interleaved in the agent. This paradigm is defined in a general yet slightly loose way, and it mainly aims to help readers understand the key idea underlying the planning approaches of LLMs.
58
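To make the planner–executor–environment formulation above concrete, the following is a minimal Python sketch of the "planning – execution – refinement" loop. It is only an illustrative sketch: the helper functions llm_generate and execute, the prompt wording, and the FINAL-answer convention are hypothetical placeholders, not the interface of any specific method surveyed here.

# Minimal sketch of prompt-based planning: an LLM-based task planner
# proposes a plan, a plan executor carries out the actions in an
# environment, and the resulting feedback is used to refine the plan.

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (e.g., an API request)."""
    return "1. look up relevant facts\n2. combine the facts\n3. answer"

def execute(action: str) -> str:
    """Hypothetical plan executor: runs one action in the environment
    and returns the observation/feedback as text."""
    return f"observation for: {action}"

def solve(task: str, max_rounds: int = 3) -> str:
    memory = []  # simple long-term storage of (plan, feedback) pairs
    plan = llm_generate(f"Task: {task}\nDevise a step-by-step plan.")
    for _ in range(max_rounds):
        actions = [line for line in plan.splitlines() if line.strip()]
        feedback = [execute(action) for action in actions]   # execution phase
        memory.append((plan, feedback))
        plan = llm_generate(                                  # refinement phase
            f"Task: {task}\nPrevious plan:\n{plan}\nFeedback:\n"
            + "\n".join(feedback)
            + "\nRevise the plan, or output FINAL: <answer> if done."
        )
        if plan.startswith("FINAL:"):
            return plan[len("FINAL:"):].strip()
    return plan

if __name__ == "__main__":
    print(solve("Who directed the film that won Best Picture in 1998?"))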

6.4.2 Plan Generation

Plan generation focuses on directly generating action sequences by prompting LLMs. Based on the format of the generated plans, existing work can be divided into two groups: text-based and code-based approaches.

Text-based Approaches. It is straightforward for LLMs to generate plans in the form of natural language. In this approach, LLMs are prompted to generate a sequence of actions for the plan executor to perform and solve the complex task. For example, Plan-and-Solve [441] adds explicit instructions like "devise a plan" to directly prompt the LLM for planning in a zero-shot manner, while Self-planning [529] and DECOMP [440] add demonstrations in the prompt to guide the LLM to devise a plan through ICL. Following this way, some work further considers incorporating extra tools or models when planning. For example, ToolFormer [80] first annotates a pre-training corpus with potential API calls using LLMs, and then fine-tunes LLMs on it, so that LLMs can learn when and how to call APIs and incorporate the results returned by APIs during generation. HuggingGPT [444] introduces the models available in HuggingFace and regards LLMs as the controller to select suitable models based on their descriptions and aggregate their results as the final solution.
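As an illustration of the text-based approach, the snippet below assembles a Plan-and-Solve-style zero-shot prompt. The trigger sentence is a paraphrase rather than a quotation of the prompt in [441], and call_llm is a hypothetical stand-in for a real LLM API client.

def build_plan_and_solve_prompt(question: str) -> str:
    # Zero-shot trigger asking the model to first devise a plan and then
    # carry it out step by step (paraphrased from Plan-and-Solve [441]).
    trigger = ("Let's first devise a plan to solve the problem, "
               "then carry out the plan step by step.")
    return f"Q: {question}\nA: {trigger}"

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return "(model-generated plan and step-by-step solution)"

if __name__ == "__main__":
    prompt = build_plan_and_solve_prompt(
        "A store sold 3 boxes of 12 pencils each. How many pencils in total?")
    print(prompt)
    print(call_llm(prompt))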
Code-based Approaches. Although text-based approaches sound intuitive, they cannot guarantee faithful execution of the plan, which may lead to failure even when the plan is sound. To address this issue, code-based approaches have been proposed to generate more verifiable plans in the form of executable code in programming languages, e.g., Python or PDDL. In this way, LLMs are first prompted to generate the program and then utilize a deterministic solver to execute it. For example, Faithful CoT [442] and PAL [443] decompose a reasoning task into two stages: at the first stage, the LLM generates a plan conditioned on the query; at the second stage, a deterministic solver executes the plan to derive the final answer. Furthermore, code-based approaches can be applied to embodied agents in a similar way. For example, PROGPROMPT [530] and LLM+P [531] first utilize LLMs to generate plans in the form of Python functions or PDDL files, and then leverage a virtual agent or classical planner to solve the problem according to the code-based plans.
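A minimal sketch of the code-based approach in the spirit of PAL [443] is given below: the LLM is prompted to emit a small Python program, and a deterministic interpreter (here, Python's exec) derives the answer. The generated program is hard-coded as a stand-in for a real model call, and no sandboxing is shown; executing model-generated code in practice requires isolation and timeouts.

def llm_generate_program(question: str) -> str:
    """Hypothetical LLM call that returns a Python program whose
    variable `answer` holds the final result."""
    return (
        "boxes = 3\n"
        "pencils_per_box = 12\n"
        "answer = boxes * pencils_per_box\n"
    )

def run_program(program: str):
    # Deterministic "solver": execute the generated code and read `answer`.
    # Real systems must sandbox this step (restricted builtins, timeouts).
    scope = {}
    exec(program, {}, scope)
    return scope.get("answer")

if __name__ == "__main__":
    q = "A store sold 3 boxes of 12 pencils each. How many pencils in total?"
    print(run_program(llm_generate_program(q)))  # -> 36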
6.4.3 Feedback Acquisition

After executing the generated plan, the environment would produce the feedback signal to the LLM-based task planner, which can be used to refine its initial plan for better results. In existing work, there are typically two sources of feedback from the environment, depending on their relationship with the LLM-based task planner: internal (i.e., the LLM itself) and external (e.g., tools or virtual worlds) feedback.

Internal Feedback. The LLM itself can be utilized as a feedback provider. One straightforward way is to directly evaluate the quality of the generated plans through prompting. For example, RAP [447] evaluates the likelihood that each candidate plan can lead to task success, while Tree of Thoughts [527] proposes to vote across plans by making comparisons between them. Further, LLMs can provide feedback based on the intermediate results from the plan executor. For example, Reflexion [450] utilizes LLMs to transform sparse result signals (e.g., success or failure) into concrete text-based feedback (e.g., "You should recommend comedies that the user mentions in the query instead of horror movies") and stores this feedback in long-term memory for future planning.

External Feedback. In addition to LLMs, external objects can also provide feedback signals. For example, tools like code interpreters are widely used in programming tasks to provide real-time error messages [450], models like Stable Diffusion [532] can be used in multimodal tasks to provide visual perception [446], and virtual worlds like Minecraft can provide immersive experiences [528]. Besides, some work (e.g., Generative Agents [533]) explores multi-agent collaboration in simulated environments, where each agent receives feedback not only from interaction with the environment but also from communication with other agents.

6.4.4 Plan Refinement

With access to feedback from the environment, the task planner can accordingly refine its current plan and iteratively go through the "planning – execution – refinement" loop for better results. In this part, we summarize three major refinement approaches in existing work.

Reasoning. The feedback data from the environment may not be directly suitable for LLMs to use for plan refinement, e.g., containing irrelevant information or taking a non-language form. To solve this, some work adds an explicit reasoning process to extract critical information from feedback [448, 449]. For example, React [449] prompts LLMs with demonstrations to generate reasoning traces over feedback. It has been widely used in autonomous agent projects, such as AutoGPT [534], which can automatically reason over the observed feedback to revise the initial plan for solving various user requests. However, these approaches typically fix the order of reasoning and planning. To support flexible switching between the two processes for better performance, ChatCoT [448] further unifies the tool-augmented reasoning process into a multi-turn conversation between the LLM-based task planner and the tool-based environment.

Backtracking. Early methods mainly consider planning forward actions while maintaining the existing plan, thus likely leading to locally optimal plans based on a short-term evaluation. To solve this, Tree of Thoughts [527] allows backtracking with search algorithms like breadth-first and depth-first search to perform global planning. It refines the plan step by step by backtracking to the last state in the initial plan and choosing the next unexplored action. Furthermore, some studies [446, 535] utilize feedback signals to revise the entire plan. For example, DEPS [535] selects a better plan according to feedback signals, while TIP [446] adds feedback signals to prompts for the LLM-based planner to revise each step in the initial plan.

Memorization. In order to handle long-horizon tasks, it has become a key approach to aid plan refinement with long-term memory in addition to utilizing the short-term memory of
LLMs through ICL. For example, Reflexion [450] stores the feedback from self-reflection into the memory, so previous feedback can be retrieved for plan refinement. Generative Agents [533] designs the memory stream mechanism as the core component of agents for action planning and reflection. Further, the skill library mechanism [445, 528] is proposed to store successful plans in a library, so that they can be reused and synthesized as complex plans for novel tasks. To implement the long-term memory mechanism, tools like vector databases (e.g., Milvus [536]) can be used to encode plans or feedback into high-dimensional vectors for efficient storage and retrieval at a large scale. MemoryBank [537] further proposes a memory updating mechanism to allow memory forgetting and strengthening following the Ebbinghaus Forgetting Curve theory.

7 CAPACITY AND EVALUATION

To examine the effectiveness and superiority of LLMs, a surge of tasks and benchmarks have been proposed for conducting empirical ability evaluation and analysis. In this section, we first introduce three types of basic ability evaluation of LLMs for language generation and understanding, then present several advanced ability evaluations with more complicated settings or goals, and finally discuss existing benchmarks, evaluation approaches, and empirical analysis.

7.1 Basic Ability

In this part, we mainly focus on three basic types of ability evaluation for LLMs, i.e., language generation, knowledge utilization, and complex reasoning. It is noted that we do not intend to have complete coverage of all the related tasks, but instead only focus on the most widely discussed or studied tasks for LLMs. Next, we introduce these tasks in detail.

7.1.1 Language Generation

According to the task definition, existing tasks about language generation can be roughly categorized into language modeling, conditional text generation, and code synthesis tasks. Note that although code synthesis is not a typical NLP task, we include it for discussion because it can be directly solved by a number of LLMs (trained on code data) in a similar generation approach as natural language text.

Language Modeling. As the most fundamental ability of LLMs, language modeling aims to predict the next token based on the previous tokens [1], which mainly focuses on the capacity of basic language understanding and generation. For evaluating such an ability, typical language modeling datasets used in existing work include Penn Treebank [538], WikiText-103 [539], and the Pile [161], where the metric of perplexity is commonly used for evaluating the model performance under the zero-shot setting. Empirical studies [55, 93] show that LLMs bring substantial performance gains over the previous state-of-the-art methods on these evaluation datasets. To better test the modeling capacity of long-range dependencies in text, the LAMBADA dataset [233] has been introduced, where LLMs are required to predict the last word of sentences based on a paragraph of context. Then, the accuracy and perplexity of the predicted last words are employed to evaluate LLMs. As shown in existing work, the performance on the language modeling tasks typically follows the scaling law [30], which means that scaling language models would improve the accuracy and reduce the perplexity.

Conditional Text Generation. As an important topic in language generation, conditional text generation [48] focuses on generating texts satisfying specific task demands based on the given conditions, typically including machine translation [624], text summarization [548], and question answering [557]. To measure the quality of the generated text, automatic metrics (e.g., accuracy, BLEU [625] and ROUGE [626]) and human ratings have typically been used for evaluating the performance. Due to their powerful language generation capabilities, LLMs have achieved remarkable performance on existing datasets and benchmarks. For instance, GPT-4 exhibits comparable performance to commercial translation products, even for the translation of languages with significant linguistic distance [627]. On news summarization tasks (i.e., CNN/DM and XSUM), LLMs also demonstrate comparable performance with human freelance writers [628]. Despite the rapid progress on model capacity, there are increasing concerns about the feasibility of existing automatic metrics to faithfully assess the performance of LLMs in conditional text generation tasks [628–630]. As alternatives to automatic metrics, recent studies also propose to incorporate LLMs as generation evaluators to examine the quality of the generated content [138, 631, 632]. Moreover, researchers also explore more challenging language generation tasks for LLMs, such as structured data generation [458] and long text generation [46, 633, 634].

Code Synthesis. In addition to generating high-quality natural language text, existing LLMs also show strong abilities to generate formal language, especially computer programs (i.e., code) that satisfy specific conditions, called code synthesis [635]. Unlike natural language generation, as the generated code can be directly checked by execution with corresponding compilers or interpreters, existing work mostly evaluates the quality of the generated code from LLMs by calculating the pass rate against the test cases, i.e., pass@k42. Recently, several code benchmarks focusing on functional correctness have been proposed to assess the code synthesis abilities of LLMs, such as APPS [378], HumanEval [105], and MBPP [208]. Typically, they consist of diverse programming problems, with text specifications and test cases for correctness checking. To improve such an ability, it is key to fine-tune (or pre-train) LLMs on code data, which can effectively adapt LLMs to code synthesis tasks [86]. In addition, existing work has proposed new strategies to generate code, e.g., sampling multiple candidate solutions [208] and planning-guided decoding [636], which can be considered as imitations of the bug-fixing and code-planning processes of programmers. Impressively, LLMs have recently shown competitive performance with humans by achieving a ranking in the top 28% among users on the programming contest platform Codeforces [114]. Further, GitHub Copilot has been released to assist programming in coding IDEs (e.g., Visual Studio and JetBrains IDEs), and it can support a variety of languages including Python, JavaScript, and Java. A viewpoint article entitled "The End of Programming" [637] in Communications of the ACM has discussed the impact of AI programming in the field of computer science, emphasizing an important shift towards the highly adaptive LLM as a new atomic unit of computation.

42. Given k programs generated by the LLM, pass@k is computed as 1 when at least one program passes all test cases, or else 0.
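Following the definition in footnote 42, a simple way to compute pass@k over a benchmark is sketched below. The passes_all_tests argument is a hypothetical checker that runs one generated program against a problem's test cases; the unbiased pass@k estimator used in some later work is omitted here for brevity.

from typing import Callable, List

def pass_at_k(problems: List[dict],
              samples_per_problem: List[List[str]],
              passes_all_tests: Callable[[dict, str], bool],
              k: int) -> float:
    """pass@k as defined in footnote 42: a problem counts as solved (1) if
    at least one of its k sampled programs passes all test cases, otherwise
    0; the score is averaged over all problems."""
    solved = 0
    for problem, samples in zip(problems, samples_per_problem):
        if any(passes_all_tests(problem, code) for code in samples[:k]):
            solved += 1
    return solved / len(problems)

if __name__ == "__main__":
    # Toy usage with a fake checker that "passes" any non-empty program.
    problems = [{"id": 0}, {"id": 1}]
    samples = [["print(1)", ""], ["", ""]]
    print(pass_at_k(problems, samples, lambda p, c: bool(c.strip()), k=2))  # 0.5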

TABLE 14: Representative basic and advanced abilities and corresponding representative datasets for evaluation.

Level: Basic
  Ability: Language Generation
    Language Modeling: Penn Treebank [538], WikiText-103 [539], the Pile [161], LAMBADA [233]
    Conditional Text Generation: WMT'14,16,19,20,21,22 [540–545], Flores-101 [546], DiaBLa [547], CNN/DailyMail [548], XSum [549], WikiLingua [550], OpenDialKG [551]
    Code Synthesis: APPS [378], HumanEval [105], MBPP [208], CodeContest [114], MTPB [86], DS-1000 [552], ODEX [553]
  Ability: Knowledge Utilization
    Closed-Book QA: Natural Questions [554], ARC [555], TruthfulQA [556], Web Questions [557], TriviaQA [558], PIQA [559], LC-quad2.0 [560], GrailQA [561], KQApro [562], CWQ [563], MKQA [564], ScienceQA [565]
    Open-Book QA: Natural Questions [554], OpenBookQA [566], ARC [555], TriviaQA [558], Web Questions [557], MS MARCO [567], QASC [568], SQuAD [569], WikiMovies [570]
    Knowledge Completion: WikiFact [571], FB15k-237 [572], Freebase [573], WN18RR [574], WordNet [575], LAMA [576], YAGO3-10 [577], YAGO [578]
  Ability: Complex Reasoning
    Knowledge Reasoning: CSQA [504], StrategyQA [185], HotpotQA [579], ARC [555], BoolQ [580], PIQA [559], SIQA [581], HellaSwag [582], WinoGrande [583], COPA [584], OpenBookQA [566], ScienceQA [565], proScript [585], ProPara [586], ExplaGraphs [587], ProofWriter [588], EntailmentBank [589], ProOntoQA [590]
    Symbolic Reasoning: CoinFlip [33], ReverseList [33], LastLetter [33], Boolean Assignment [591], Parity [591], Colored Object [70], Penguins in a Table [70], Repeat Copy [443], Object Counting [443]
    Mathematical Reasoning: MATH [364], GSM8k [184], SVAMP [592], MultiArith [593], ASDiv [503], MathQA [594], AQUA-RAT [595], MAWPS [596], DROP [597], NaturalProofs [598], PISA [599], miniF2F [600], ProofNet [601]

Level: Advanced
  Ability: Human Alignment
    Honesty: TruthfulQA [556], HaluEval [602]
    Helpfulness: HH-RLHF [170]
    Harmlessness: HH-RLHF [170], Crows-Pairs [603], WinoGender [604], RealToxicityPrompts [605]
  Ability: Interaction with External Environment
    Household: VirtualHome [606], BEHAVIOR [607], ALFRED [608], ALFWorld [609]
    Website Environment: WebShop [610], Mind2Web [611]
    Open World: MineRL [612], MineDojo [613]
  Ability: Tool Manipulation
    Search Engine: HotpotQA [579], TriviaQA [558], Natural Questions [554]
    Code Executor: GSM8k [184], TabMWP [614], Date Understanding [70]
    Calculator: GSM8k [184], MATH [364], CARP [615]
    Model Interface: GPT4Tools [616], Gorilla [617]
    Data Interface: WebQSP [618], MetaQA [619], WTQ [620], WikiSQL [621], TabFact [622], Spider [623]

Major Issues. Although LLMs have achieved splendid performance in generating human-like text, they are susceptible to two major issues in language generation, as discussed below.

• Unreliable generation evaluation. With the advancement of the language generation abilities of LLMs, existing studies find that the generated texts from LLMs have reached a comparable quality to the reference texts on a variety of text generation tasks. However, due to the intrinsic weaknesses of existing evaluation benchmarks, there exists pronounced inconsistency between human evaluation and automatic reference-based metrics [628–630, 638]. For example, in OpenDialKG [551], ChatGPT underperforms a fine-tuned GPT-2 on BLEU and ROUGE-L metrics, while earning more favor from human judgment [638]. Furthermore, existing work argues that even human evaluation may not be robust enough [628, 629, 639, 640]. In some cases, it is difficult to achieve a high level of consensus among human annotators [629], and there is also a large gap between the annotation quality of crowdworkers and experts [639, 640]. Thus, how to conduct reliable evaluation for language generation tasks in the era of LLMs has become a fundamental yet challenging research topic. Recently, increasing research work proposes to leverage LLMs to improve the evaluation quality of the generated texts. Specifically, LLMs can be used to improve the evaluation quality of existing metrics. For example, Para-Ref [641] augments various automatic metrics by leveraging LLMs to paraphrase existing references into
semantically equivalent references with diverse expressions. Further, LLMs are widely employed as evaluators of text generation in a reference-free manner, including evaluating a single prediction [631, 632, 642] or comparing several candidates [138, 643–645]. Nevertheless, LLMs may expose bias (e.g., order bias or a preference for LLM-generated texts over human-written texts) as language generation evaluators, demonstrating disparities when compared to human evaluation [632, 646, 647].

Unreliable Generation Evaluation

LLMs have been capable of generating texts with a comparable quality to human-written texts, which however might be underestimated by automatic reference-based metrics. As an alternative evaluation approach, LLMs can serve as language generation evaluators to evaluate a single text, compare multiple candidates, and improve existing metrics. However, this evaluation approach still needs more inspections and examinations in real-world tasks.

• Underperforming specialized generation. Although LLMs have learned general language patterns to generate coherent text, their proficiency in generation might be constrained when dealing with a specialized domain or task. For instance, a language model that has been trained on general web articles may face challenges when generating a medical report that involves much medical jargon and specialized methods. Intuitively, domain knowledge should be critical for model specialization. However, it is not easy to inject such specialized knowledge into LLMs. As discussed in recent analyses [47, 648], when LLMs are trained to exhibit some specific ability that allows them to excel in some areas, they might struggle in others. Such an issue is related to catastrophic forgetting [649, 650] in training neural networks, which refers to the conflict phenomenon of integrating new and old knowledge. Similar cases also occur in human alignment of LLMs, where an "alignment tax" [66] (e.g., a potential loss in the in-context learning ability) has to be paid for aligning to human values and needs. Moreover, due to the limitations of the sequence modeling architecture, LLMs still face challenges in the understanding and generation of structured data. Consequently, they often fall behind task-specific models on complex structured data tasks, such as knowledge-base question answering and semantic parsing [458, 651]. Therefore, it is important to develop effective model specialization methods that can flexibly adapt LLMs to various task scenarios, while retaining the original abilities as much as possible.

Underperforming Specialized Generation

LLMs may fall short in mastering generation tasks that require domain-specific knowledge or generating structured data. It is non-trivial to inject specialized knowledge into LLMs, while also maintaining their original abilities.

7.1.2 Knowledge Utilization

Knowledge utilization is an important ability of intelligent systems to accomplish knowledge-intensive tasks (e.g., commonsense question answering and fact completion) based on supporting factual evidence. Concretely, it requires LLMs to properly utilize the rich factual knowledge from the pre-training corpus or retrieve external data when necessary. In particular, question answering (QA) and knowledge completion have been two commonly used tasks for evaluating this ability. According to the test tasks (question answering or knowledge completion) and evaluation settings (with or without external resources), we categorize existing knowledge utilization tasks into three types, namely closed-book QA, open-book QA43, and knowledge completion.

Closed-Book QA. Closed-book QA tasks [652] test the acquired factual knowledge of LLMs from the pre-training corpus, where LLMs should answer the question only based on the given context without using external resources. For evaluating this ability, there are several datasets that can be leveraged, including Natural Questions [554], Web Questions [557], and TriviaQA [558], where the accuracy metric is widely adopted. Empirical results have revealed that LLMs can perform well in this setting and even match the performance of state-of-the-art open-domain QA systems [56]. Also, the performance of LLMs on closed-book QA tasks shows a scaling law pattern in terms of both model size and data size: scaling the parameters and training tokens can increase the capacity of LLMs and help them learn (or memorize) more knowledge from the pre-training data [56]. Further, under a similar parameter scale, LLMs with more pre-training data relevant to the evaluated tasks would achieve better performance [81]. Also, the closed-book QA setting provides a testbed for probing the accuracy of the factual knowledge encoded by LLMs. However, as shown in existing work [55], LLMs might perform less well on QA tasks relying on fine-grained knowledge, even when it exists in the pre-training data.

Open-Book QA. Unlike closed-book QA, in open-book QA tasks, LLMs can extract useful evidence from external knowledge bases or document collections, and then answer the question based on the extracted evidence [653–656]. Typical open-book QA datasets (e.g., Natural Questions [554], OpenBookQA [566], and SQuAD [569]) have overlap with closed-book QA datasets, but they incorporate external data sources, e.g., Wikipedia. The metrics of accuracy and F1 score are widely used in open-book QA tasks for evaluation. To select relevant knowledge from external resources, LLMs are often paired with a text retriever (or even a search engine), which is trained independently or jointly with LLMs [81, 653, 657]. Also, previous work [658–660] has indicated that retrievers can assist LLMs in verifying and rectifying the reasoning path. In evaluation, existing studies mainly focus on testing how LLMs utilize the extracted knowledge to answer the question and show that the retrieved evidence can largely improve the accuracy of the generated answers, even enabling a smaller LLM to outperform 10× larger ones [653, 657]. Further, open-book QA tasks can also be employed to evaluate the recency of knowledge information. Pre-training or retrieving from outdated knowledge resources may cause LLMs to generate incorrect answers for time-sensitive questions [653].

43. In this part, open-book QA refers to the QA tasks that require extracting and utilizing useful information from external knowledge resources, as the antithesis of closed-book QA (only using the encoded information from the pre-training corpus). Note that there is a dataset also named OpenBookQA [566], which follows the settings of open-book QA tasks by extracting and utilizing external science facts.
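A minimal sketch of the open-book QA setup described above follows: a retriever selects supporting passages, which are prepended to the question before querying the LLM. The retriever here is only a trivial keyword scorer and call_llm is a hypothetical stand-in; production systems would use dense retrievers or search engines instead.

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    # Toy lexical retriever: rank passages by word overlap with the query.
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return "(answer grounded in the retrieved evidence)"

def open_book_qa(question: str, corpus: list[str]) -> str:
    evidence = retrieve(question, corpus)
    prompt = ("Answer the question using the evidence.\n"
              + "\n".join(f"Evidence: {e}" for e in evidence)
              + f"\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)

if __name__ == "__main__":
    docs = ["The Eiffel Tower is located in Paris.",
            "Mount Everest is the highest mountain above sea level."]
    print(open_book_qa("Where is the Eiffel Tower located?", docs))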

[Figure 17 (two dialogue examples).
(a) Intrinsic hallucination. Input: "Bob's wife is Amy. Bob's daughter is Cindy. Who is Cindy to Amy?" LLM output: "Cindy is Amy's daughter-in-law."
(b) Extrinsic hallucination. Input: "Explain RLHF for LLMs." LLM output: "RLHF stands for 'Rights, Limitations, Harms, and Freedoms' and is a framework for ... models like LLMs (Large Language Models)."]

Fig. 17: Examples of intrinsic and extrinsic hallucination for a public LLM (access date: March 19, 2023). As an example of intrinsic hallucination, the LLM gives a conflicting judgment about the relationship between Cindy and Amy, which contradicts the input. For extrinsic hallucination, in this example, the LLM seems to have an incorrect understanding of the meaning of RLHF (reinforcement learning from human feedback), though it can correctly understand the meaning of LLMs (in this context).

Knowledge Completion. In knowledge completion tasks, LLMs might be (to some extent) considered as a knowledge base [576], which can be leveraged to complete or predict the missing parts of knowledge units (e.g., knowledge triples). Such tasks can probe and evaluate how much and what kind of knowledge LLMs have learned from the pre-training data. Existing knowledge completion tasks can be roughly divided into knowledge graph completion tasks (e.g., FB15k-237 [572] and WN18RR [574]) and fact completion tasks (e.g., WikiFact [571]), which aim to complete triples from a knowledge graph and incomplete sentences about specific facts, respectively. Empirical studies have revealed that it is difficult for existing LLMs to accomplish knowledge completion tasks related to specific relation types [520]. As shown in the evaluation results on WikiFact, LLMs perform well on several frequent relations that occur in the pre-training data (e.g., currency and author), while not well on rare ones (e.g., discoverer_or_inventor and place_of_birth). Interestingly, under the same evaluation settings (e.g., in-context learning), InstructGPT (i.e., text-davinci-002) outperforms GPT-3 in all subsets of WikiFact.

Major Issues. Although LLMs have achieved key progress in capturing and utilizing knowledge information, they suffer from two major issues as discussed below.

• Hallucination. In generating factual texts, a challenging issue is hallucinated generation [638, 661], where the generated information either conflicts with the existing source (intrinsic hallucination) or cannot be verified by the available source (extrinsic hallucination), as illustrated by two examples in Figure 17. Hallucination widely occurs in existing LLMs, even the most superior LLMs such as GPT-4 [46]. Furthermore, existing work shows that LLMs encounter difficulties in recognizing the hallucinated content in text [602], even the powerful ChatGPT. Additionally, beyond language tasks, a recent study has shown that large vision-language models (LVLM) also face challenges with hallucination, i.e., generating objects that are not present in the accompanying images [662]. In essence, LLMs seem to "unconsciously" utilize knowledge in task solving, and still lack the ability to accurately control the use of internal or external knowledge. Hallucinations would mislead LLMs into generating undesired outputs and mostly degrade the performance, leading to potential risks when deploying LLMs in real-world applications. To alleviate this problem, alignment tuning strategies (as discussed in Section 5.2) have been widely utilized in existing work [66], which rely on tuning LLMs on high-quality data or using human feedback. Moreover, the integration of external tools for the provision of credible information sources can help alleviate the hallucination issue [81, 602, 659]. Another line of research work leverages uncertainty estimation of LLMs to identify hallucinations [663, 664]. For instance, considering that hallucinated facts are prone to exhibit inconsistency across different sampled outputs, SelfCheckGPT [664] detects hallucination by measuring information inconsistency within sampled outputs. For the evaluation of the hallucination problem, a set of hallucination detection tasks have been proposed, e.g., TruthfulQA [556] for detecting human falsehood mimicked by models. More recently, HaluEval [602] creates a large-scale collection of LLM-generated and human-annotated hallucinated samples to evaluate the ability of language models to recognize hallucination in both task-specific and general scenarios.

Hallucination

LLMs are prone to generating untruthful information that either conflicts with the existing source or cannot be verified by the available source. Even the most powerful LLMs such as ChatGPT face great challenges in mitigating the hallucinations of the generated texts. This issue can be partially alleviated by special approaches such as alignment tuning and tool utilization.

• Knowledge recency. As another major challenge, LLMs would encounter difficulties when solving tasks that require
the latest knowledge beyond the training data. To tackle this issue, a straightforward approach is to regularly update LLMs with new data. However, it is very costly to fine-tune LLMs, and doing so is also likely to cause the catastrophic forgetting issue when incrementally training LLMs. Therefore, it is necessary to develop efficient and effective approaches that can integrate new knowledge into existing LLMs, making them up-to-date. Existing studies have explored how to utilize external knowledge sources (e.g., a search engine) to complement LLMs, which can be either jointly optimized with LLMs [653] or used as a plug-and-play module [659]. For instance, ChatGPT utilizes a retrieval plugin to access up-to-date information sources [665]. By incorporating the extracted relevant information into the context [666–668], LLMs can acquire new factual knowledge and perform better on relevant tasks. However, such an approach seems to remain at a superficial level. In addition, existing studies also explore editing the parameters of language models to update intrinsic knowledge [669–671]. Nevertheless, previous work [672] has shown that several parameter editing methods do not perform well on LLMs, though they can improve the performance of small language models. Therefore, it is still difficult to directly amend intrinsic knowledge or inject specific knowledge into LLMs, which remains an open research problem [672]. Recently, a useful framework, EasyEdit [673], has been released to facilitate the research of knowledge editing for LLMs.

Knowledge Recency

The parametric knowledge of LLMs is hard to update in a timely manner. Augmenting LLMs with external knowledge sources is a practical approach to tackling the issue. However, how to effectively update knowledge within LLMs remains an open research problem.

7.1.3 Complex Reasoning

Complex reasoning refers to the ability of understanding and utilizing supporting evidence or logic to derive conclusions or make decisions [51, 52]. According to the type of logic and evidence involved in the reasoning process, we consider dividing existing evaluation tasks into three major categories, namely knowledge reasoning, symbolic reasoning, and mathematical reasoning.

Knowledge Reasoning. Knowledge reasoning tasks rely on logical relations and evidence about factual knowledge to answer the given question. Existing work mainly uses specific datasets to evaluate the reasoning capacity on the corresponding type of knowledge, e.g., CSQA [504]/StrategyQA [185] for commonsense knowledge reasoning and ScienceQA [565] for science knowledge reasoning. In addition to the accuracy of the predicted results, existing work [565] has also evaluated the quality of the generated reasoning process, via automatic metrics (e.g., BLEU) or human evaluation. Typically, these tasks require LLMs to perform step-by-step reasoning based on factual knowledge, until reaching the answer to the given question. To elicit this step-by-step reasoning ability, the chain-of-thought (CoT) prompting strategy [33] has been proposed for enhancing the complex reasoning capacity of LLMs. As discussed in Section 6.3, CoT incorporates intermediate reasoning steps, which can be manually created [33] or automatically generated [674], into the prompts to guide LLMs to perform multi-step reasoning. Such a way largely improves the reasoning performance of LLMs, leading to new state-of-the-art results on several complex knowledge reasoning tasks [33, 56, 526]. Further, after reformulating knowledge reasoning tasks into code generation tasks, researchers have found that the performance of LLMs can be further improved [211], especially with the LLMs pre-trained on code. However, due to the complexity of knowledge reasoning tasks, the performance of current LLMs still lags behind human results on tasks such as commonsense reasoning [33, 56, 675]. As a common type of mistake, LLMs might generate inaccurate intermediate steps, leading to a wrong final result. To address this issue, existing work has proposed special decoding or ensemble strategies to improve the accuracy of the whole reasoning chain [436, 437].

Symbolic Reasoning44. Symbolic reasoning tasks mainly focus on manipulating symbols in a formal rule setting to fulfill some specific goal [51], where the operations and rules may have never been seen by LLMs during pre-training. Existing work [33, 439, 505] commonly evaluates LLMs on the tasks of last letter concatenation and coin flip, where the evaluation examples require the same number of reasoning steps as the in-context examples (called the in-domain test) or more steps (called the out-of-domain test). As an example of the out-of-domain test, LLMs may only see examples with two words in context, while the test requires LLMs to concatenate the last letters of three or more words. Typically, the accuracy of the generated symbols is adopted to evaluate the performance of LLMs on these tasks. Thus, LLMs need to understand the semantic relations among the symbolic operations and their composition in complex scenarios. However, under the out-of-domain setting, as LLMs have not seen complex compositions of symbolic operations and rules (e.g., twice the number of operations in the context examples), it is hard for LLMs to capture their accurate meanings. To solve this issue, existing studies incorporate scratchpad [591, 676] and tutor [677] strategies to help LLMs better manipulate symbolic operations, for generating longer and more complex reasoning processes. Another line of research work utilizes formal programming languages to represent the symbolic operations and rules, which requires LLMs to generate code and perform the reasoning process by executing it with external interpreters. Such a way can decompose the complex reasoning process into code synthesis and program execution for LLMs and interpreters, respectively, leading to a simplified reasoning process with yet more accurate results [443].

44. Following [33], we mainly discuss symbolic reasoning tasks specially designed for evaluating LLMs. We do not consider symbolic reasoning methods in traditional NLP tasks, such as deducing logical rules from the knowledge graphs in KBQA.

Mathematical Reasoning. Mathematical reasoning tasks need to comprehensively utilize mathematical knowledge, logic, and computation for solving problems or generating proof statements. Existing mathematical reasoning tasks can be mainly categorized into math problem solving
and automated theorem proving. For math problem solving tasks, the SVAMP [592], GSM8k [184] and MATH [364] datasets are commonly used for evaluation, where LLMs need to generate accurate concrete numbers or equations to answer the mathematical problem. As these tasks also require multi-step reasoning, the CoT prompting strategy has been widely adopted for LLMs to improve the reasoning performance [33]. As another practical strategy, continually pre-training LLMs on large-scale mathematical corpora can largely boost their performance on mathematical reasoning tasks [35, 203, 678]. Further, since math problems in different languages share the same mathematical logic, researchers also propose a multilingual math word problem benchmark [524] to evaluate the multilingual mathematical reasoning capacity of LLMs. As another challenging task, automated theorem proving (ATP) [598, 600, 679] requires the reasoning model to strictly follow the reasoning logic and mathematical skills. To evaluate the performance on this task, PISA [599] and miniF2F [600] are two typical ATP datasets with the proof success rate as the evaluation metric. As a typical approach, existing work on ATP utilizes LLMs to aid the search for proofs using an interactive theorem prover (ITP), such as Lean, Metamath, and Isabelle [680–682]. A major limitation of ATP research is the lack of related corpora in formal language. To tackle it, several studies utilize LLMs to convert informal statements into formal proofs for augmenting new data [683] or generate drafts and proof sketches to reduce the search space of the proofs [684].

Major Issues. In spite of the advancements, LLMs still have several limitations in solving complex reasoning tasks.

• Reasoning inconsistency. With improved reasoning strategies (e.g., CoT prompting), LLMs can solve some complex reasoning tasks by performing step-by-step reasoning based on the supporting logic and evidence. Despite the effectiveness, the reasoning inconsistency issue often occurs in the decomposed reasoning process. Concretely, LLMs may generate the correct answer following an invalid reasoning path, or produce a wrong answer after a correct reasoning process [33, 442], leading to inconsistency between the derived answer and the reasoning process. To alleviate this problem, existing work has proposed to guide the whole generation process of LLMs via external tools or models [437, 451, 636], to re-check the reasoning process and final answer for correcting potential errors [685–687], or to fine-tune LLMs with process-based feedback [688, 689]. For instance, Tree of Thoughts (ToT) [451] empowers LLMs to engage in the decision-making process by concurrently exploring and self-evaluating various reasoning paths. To refine the reasoning process, Self-Refine [685] elicits feedback from LLMs on self-generated solutions, enabling the iterative refinement of solutions based on the feedback. Moreover, several studies improve the consistency in the reasoning chain of LLMs through the integration of process-based supervision during training [688, 689]. As a promising solution, recent approaches reformulate complex reasoning tasks into code generation tasks, where the strict execution of the generated code ensures the consistency between the reasoning process and the outcome. Also, it has been revealed that there might exist inconsistency between tasks with similar inputs, where small changes in the task description may cause the model to produce different results [49, 592]. To mitigate this problem, self-consistency [436] adopts the ensemble of multiple reasoning paths to enhance the decoding process of LLMs.

Reasoning Inconsistency

LLMs may generate the correct answer following an invalid reasoning path, or produce a wrong answer after a correct reasoning process, leading to inconsistency between the derived answer and the reasoning process. The issue can be alleviated by fine-tuning LLMs with process-level feedback, using an ensemble of diverse reasoning paths, and refining the reasoning process with self-reflection or external feedback.

• Numerical computation. For complex reasoning tasks, LLMs still face difficulties in the involved numerical computation, especially for symbols that are seldom encountered during pre-training, such as arithmetic with large numbers [49, 677, 690]. To tackle this issue, a direct way is to tune LLMs on synthesized arithmetic problems [361, 691]. Also, a surge of studies improve the numerical computation performance by tracing intermediate calculation steps in the training and inference stages [361, 676, 692], e.g., scratchpad tracing. In addition, existing work [80] has also incorporated external tools (e.g., a calculator), especially for handling arithmetic operations. More recently, ChatGPT has provided a plugin mechanism to use external tools [665]. In this way, LLMs need to learn how to properly manipulate the tools. For this purpose, researchers have augmented the examples using tools (even the LLM itself) for tuning the LLM [80, 693], or devised instructions and exemplars for in-context learning [443]. In addition to the aid of external tools, recent studies find that tokenizing digits into individual tokens (e.g., the LLaMA and Galactica tokenizers) is a useful approach to enhancing the inherent arithmetic ability of LLMs [361, 690]. One possible explanation is that subword tokenization techniques can result in inconsistent sequences when tokenizing numbers. For instance, with a subword tokenizer the integer 7481 may be tokenized as 7 481, while 74815 may be tokenized as 748 15 (the same numerical substrings with different splits) [361]. As a comparison, digit-based tokenization for numbers can avoid such an inconsistency, thus likely improving the numerical computation ability of LLMs.

Numerical Computation

LLMs face difficulties in numerical computation, especially for symbols that are seldom encountered during pre-training. In addition to using mathematical tools, tokenizing digits into individual tokens is also an effective design choice for improving the arithmetic ability of LLMs.

7.2 Advanced Ability

In addition to the above basic evaluation tasks, LLMs also exhibit some superior abilities that require special
considerations for evaluation. In this part, we discuss several representative advanced abilities and the corresponding evaluation approaches, including human alignment, interaction with the external environment, and tool manipulation. Next, we discuss these advanced abilities in detail.

7.2.1 Human Alignment

It is desired that LLMs could well conform to human values and needs, i.e., human alignment, which is a key ability for the broad use of LLMs in real-world applications.

To evaluate this ability, existing studies consider multiple criteria for human alignment, such as helpfulness, honesty, and safety [46, 170, 368]. For helpfulness and honesty, adversarial question answering tasks (e.g., TruthfulQA [556]) can be utilized to examine an LLM's ability in detecting possible falsehood in the text [46, 81]. Furthermore, harmlessness can also be evaluated by several existing benchmarks, e.g., CrowS-Pairs [603] and Winogender [604]. Despite the automatic evaluation with the above datasets, human evaluation is still a more direct way to effectively test the human alignment ability of LLMs. OpenAI invites many experts in domains related to AI risks to evaluate and improve the behaviors of GPT-4 when encountering risky contents [46]. In addition, for other aspects of human alignment (e.g., truthfulness), several studies propose to use specific instructions and devise annotation rules to guide the annotation process [81]. Empirical studies have revealed that these strategies can greatly improve the human alignment ability of LLMs [170]. For instance, after alignment tuning on data collected through interactions with experts, the incorrect behavior rate of GPT-4 can be largely reduced when it deals with sensitive or disallowed prompts. In addition, high-quality pre-training data can reduce the effort required for alignment [46]. For instance, Galactica is potentially more harmless due to the less biased contents in the scientific corpus [35].

7.2.2 Interaction with External Environment

In addition to standard evaluation tasks, LLMs have the ability to receive feedback from the external environment and perform actions according to the behavior instruction, e.g., generating action plans in natural language to manipulate agents [694, 695]. Such an ability is also emergent in LLMs, which can generate detailed and highly realistic action plans, while smaller models (e.g., GPT-2) tend to generate shorter or meaningless plans [694].

To test this ability, several embodied AI environments and benchmarks can be used for evaluation, described as follows. VirtualHome [606] builds a 3D simulator for household tasks such as cleaning and cooking, in which the agent can execute natural language actions generated by LLMs. ALFRED [608] includes more challenging tasks that require LLMs to accomplish compositional targets. BEHAVIOR [607] focuses on everyday chores in simulation environments and requires LLMs to generate complex solutions, e.g., changing the internal status of objects. Apart from restricted environments such as household tasks, a line of research work investigates the proficiency of LLM-based agents to explore open-world environments, such as Minecraft and the Internet [696, 697]. Voyager [697] introduces an automatic curriculum module that enables LLMs to continuously acquire new skills based on feedback from the environment. GITM [696] focuses on solving various challenges in Minecraft based on LLMs, through task decomposition, planning, and invocation of interfaces. Based on the generated action plans or task completions, existing work either adopts regular metrics (e.g., executability and correctness of the generated action plans) [694] in the benchmark or directly conducts real-world experiments and measures the success rate [698], to evaluate such an ability. It has been shown that LLMs are capable of interacting with the external environment and generating accurate action plans [699]. Recently, several improvement methods have been proposed to enhance the interaction ability of LLMs, e.g., designing code-like prompts [530] and providing real-world grounding [698].

In addition, recent work also explores multi-agent collaboration based on LLMs in simulated environments [533, 700, 701]. These studies simulate human social behaviors by instantiating multiple LLM-based agents with observations, planning, and memories in a sandbox environment. In controlled evaluation, the abilities of generative agents to search, plan, and think are evaluated by humans in an interview-like manner. Further, they also conduct descriptive measurements on multiple agents within a simulated environment to examine emergent social behaviors.

7.2.3 Tool Manipulation

When solving complex problems, LLMs can turn to external tools if they determine it is necessary. By encapsulating available tools with API calls, existing work has involved a variety of external tools, e.g., search engines [81], calculators [80], and compilers [443], to enhance the performance of LLMs on several specific tasks. Recently, OpenAI has supported the use of plugins in ChatGPT [665], which can equip LLMs with broader capacities beyond language modeling. For example, the web browser plugin enables ChatGPT to access fresh information. Further, incorporating third-party plugins is particularly key for creating a prosperous ecosystem of applications based on LLMs.

To examine the ability of tool manipulation, existing work mostly adopts complex reasoning tasks for evaluation, such as mathematical problem solving (e.g., GSM8k [184] and SVAMP [592]) or knowledge question answering (e.g., TruthfulQA [556]), where the successful utilization of tools is very important for supplementing the required skills that LLMs lack (e.g., numerical calculation). In this way, the evaluated performance on these tasks can reflect the ability of LLMs in tool manipulation. To teach LLMs to utilize tools, existing studies add exemplars of tool use in context to elicit LLMs [443], or fine-tune LLMs on simulated data about tool utilization [80, 693]. It has been found that, with the help of tools, LLMs become more capable of handling the issues that they are not good at, e.g., equation calculation and answering timely questions [80, 448]. However, as the number of available tools increases, the limited context length of LLMs may pose challenges in describing and demonstrating extensive tool APIs. To address this issue, existing work retrieves the usage of relevant tools, or encodes tool information as tokens within the embedding space [702–704].
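To illustrate how tools can be encapsulated behind API-like calls for an LLM, the sketch below registers a calculator tool and routes a model-emitted call to it. The CALL text format and the llm_decide stub are illustrative assumptions, not the interface of any particular plugin system discussed above.

import ast
import operator

def calculator(expression: str) -> str:
    """Safely evaluate a basic arithmetic expression (+, -, *, /)."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.USub: operator.neg}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

# Registry of available tools, exposed to the LLM via name and description.
TOOLS = {"calculator": calculator}

def llm_decide(question: str) -> str:
    """Hypothetical LLM output that requests a tool call in the form
    'CALL <tool>: <argument>'."""
    return "CALL calculator: 1234 * 5678"

def answer_with_tools(question: str) -> str:
    decision = llm_decide(question)
    if decision.startswith("CALL "):
        name, arg = decision[len("CALL "):].split(":", 1)
        return TOOLS[name.strip()](arg.strip())
    return decision

if __name__ == "__main__":
    print(answer_with_tools("What is 1234 * 5678?"))  # -> 7006652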

In addition to existing tools developed by humans, LLMs possess the capability to make their own tools for specific tasks autonomously [705]. This enables the models to independently explore and manipulate these self-created tools, thereby expanding their potential for autonomous exploration in solving a wide range of real-world tasks.

Summary. The above three abilities are of great value to the practical performance of LLMs: conforming to human values and preferences (human alignment), acting properly in real-world scenarios (interaction with the external environment), and expanding the ability scope (tool manipulation). In addition to the above three advanced abilities, LLMs might also show other abilities that are specially related to some tasks (e.g., data annotation [486]) or learning mechanisms (e.g., self-improvement [706]). It will be an open direction to discover, measure and evaluate these newly emerging abilities, so as to better utilize and improve LLMs.

7.3 Benchmarks and Evaluation Approaches

In the above, we have discussed the basic and advanced abilities of LLMs. Next, we will introduce existing evaluation benchmarks and approaches [733, 734].

7.3.1 Comprehensive Evaluation Benchmarks

Recently, several comprehensive benchmarks [70, 364, 520] have been released for the evaluation of LLMs. In this part, we introduce several widely used benchmarks, i.e., MMLU, BIG-bench, HELM, and a series of human exam benchmarks.

• MMLU [364] is a versatile benchmark for large-scale evaluation of multi-task knowledge understanding, covering a wide range of knowledge domains from mathematics and computer science to humanities and social sciences. The difficulties of these tasks vary from basic to advanced. As shown in existing work, LLMs mostly outperform small models by a substantial margin on this benchmark [35, 56, 57, 69], which shows the scaling law in model size. More recently, GPT-4 achieves a remarkable record (86.4% in the 5-shot setting) in MMLU, which is significantly better than the previous state-of-the-art models [46].

• BIG-bench [70] is a collaborative benchmark intended to probe existing LLMs from various aspects. It comprises 204 tasks that encompass a broad range of topics, including linguistics, childhood development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and so on. By scaling the model size, LLMs can even outperform the average human performance under the few-shot setting on 65% of the tasks in BIG-bench [56]. Considering the high evaluation cost of the entire benchmark, a lightweight benchmark, BIG-bench-Lite, has been proposed, which contains 24 small yet diverse and challenging tasks from BIG-bench. Additionally, the BIG-bench hard (BBH) benchmark [365] has been proposed to concentrate on investigating the currently unsolvable tasks of LLMs by selecting the challenging tasks in which LLMs exhibit inferior performance compared to humans. Since BBH is more difficult, small models mostly achieve performance close to random. As a comparison, CoT prompting can elicit the abilities of LLMs to perform step-by-step reasoning and enhance the performance, even exceeding the average human performance in BBH.

• HELM [520] is a comprehensive benchmark that currently implements a core set of 16 scenarios and 7 categories of metrics. It is built on top of many prior studies, conducting a holistic evaluation of language models. As shown in the experimental results of HELM, instruction tuning can consistently boost the performance of LLMs in terms of accuracy, robustness, and fairness. Further, for reasoning tasks, the LLMs that have been pre-trained on code corpora show superior performance.

• Human-level test benchmarks aim to evaluate the comprehensive ability of LLMs with questions designed for testing humans, such as AGIEval [708], MMCU [709], M3KE [710], C-Eval [711] and Xiezhi [712]. These benchmarks encompass a wide range of domains, difficulty levels, and languages to provide a comprehensive evaluation of LLMs' general capabilities. On these evaluation benchmarks, models offering API services (e.g., GPT-4, ChatGPT, Claude) demonstrate superior performance compared to publicly available models. As the best-performing model in evaluations, GPT-4 surpasses average human performance in AGIEval [708]. However, it still lags behind the top human performance on these challenging benchmarks. Hence, there remains ample room for further enhancements in the overall abilities of LLMs, particularly for publicly accessible models.

The above benchmarks cover a variety of mainstream evaluation tasks and real-world human exam questions for the evaluation of LLMs. Also, there are several benchmarks that focus on evaluating specific abilities of LLMs, such as TyDiQA [735] for multilingual knowledge utilization and MGSM [524] for multilingual mathematical reasoning. To conduct the evaluation, one can select suitable benchmarks according to specific goals. In addition, there are also several open-source evaluation frameworks for researchers to evaluate LLMs on existing benchmarks or extend new tasks for customized evaluations, such as the Language Model Evaluation Harness [736] and OpenAI Evals [46]. Further, some researchers also construct continuously updated leaderboards by aggregating representative benchmarks, to compare the performance of existing LLMs, such as the Open LLM Leaderboard [707]. The above benchmarks and leaderboards provide important references to demonstrate the basic and advanced abilities of LLMs. We will give a deeper discussion of the pros and cons of evaluation approaches in Section 7.3.2.

7.3.2 Evaluation Approaches

After introducing existing benchmarks, in this part, we will review existing evaluation approaches for assessing the performance of LLMs. To organize our discussion, we categorize LLMs into three different types: base LLMs (pre-trained model checkpoints), fine-tuned LLMs (instruction or alignment fine-tuned model checkpoints), and specialized LLMs (adapted model checkpoints for some specific task or domain). Here, we keep both fine-tuned LLMs and specialized LLMs, to distinguish the different purposes of LLMs: general or specific task solvers. To evaluate the three types of LLMs, we can test the LLM's performance related to different abilities (e.g., basic or advanced abilities as

TABLE 15: A categorization of existing evaluation work. "General" denotes that the evaluation focuses on an overall performance
of multiple abilities. The evaluated abilities are not limited to the representative basic and advanced abilities mentioned in
Section 7.1 and 7.2.

Method | Evaluation | Model Types | Abilities/Domain | Data Source
MMLU [364] | Benchmark | Base/Fine-tuned/Specialized | General | Human exam/practice
BIG-bench [70] | Benchmark | Base/Fine-tuned/Specialized | General | Human annotation
HELM [520] | Benchmark | Base/Fine-tuned/Specialized | General | Benchmark collection
Open LLM Leaderboard [707] | Benchmark | Base/Fine-tuned/Specialized | General | Benchmark collection
AGIEval [708] | Benchmark | Base/Fine-tuned/Specialized | General | Human exam/practice
MMCU [709] | Benchmark | Base/Fine-tuned/Specialized | General | Human exam/practice
M3KE [710] | Benchmark | Base/Fine-tuned/Specialized | General | Human exam/practice
C-Eval [711] | Benchmark | Base/Fine-tuned/Specialized | General | Human exam/practice
Xiezhi [712] | Benchmark | Base/Fine-tuned/Specialized | General | Human exam/practice
OpenCompass [713] | Benchmark | Base/Fine-tuned/Specialized | General | Benchmark collection
Chain-of-Thought Hub [714] | Benchmark | Base/Fine-tuned | General | Benchmark collection
KoLA [715] | Benchmark | Base/Fine-tuned | Knowledge utilization | Web
ARB [716] | Benchmark | Fine-tuned | Complex reasoning | Human exam/practice
APIBench [717] | Benchmark | Base/Fine-tuned | Tool manipulation | Web
APIBank [718] | Benchmark | Fine-tuned | Tool manipulation | Synthesis
ToolAlpaca [719] | Benchmark | Base/Fine-tuned | Tool manipulation | Synthesis
T-Bench [720] | Benchmark | Fine-tuned | Tool manipulation | Synthesis
ToolBench [721] | Benchmark | Fine-tuned | Tool manipulation | Synthesis
BOLAA [722] | Benchmark | Base/Fine-tuned | Environment interaction | Benchmark collection
AgentBench [723] | Benchmark | Base/Fine-tuned | Environment interaction | Human annotation/Synthesis
HaluEval [602] | Benchmark | Base/Fine-tuned | Human alignment | Human annotation/Synthesis
PromptBench [724] | Benchmark | Base/Fine-tuned | Robustness | Benchmark collection
HumanEval [105] | Benchmark | Base/Fine-tuned/Specialized | Code synthesis | Human annotation
MultiMedQA [356] | Benchmark | Specialized | Healthcare | Benchmark collection
FLUE [725] | Benchmark | Specialized | Finance | Benchmark collection
LegalBench [726] | Benchmark | Specialized | Legal | Human annotation
Chatbot Arena [727] | Human | Base/Fine-tuned/Specialized | Human alignment | Human annotation
SciBench [728] | Human | Fine-tuned | Complex reasoning | Human exam/practice
AlpacaEval [729] | Model | Fine-tuned | Instruction following | Synthesis
MT-bench [727] | Model | Fine-tuned | Human alignment | Human annotation
TrustGPT [730] | Model | Base/Fine-tuned | Human alignment | Benchmark collection
LMExamQA [731] | Model | Base/Fine-tuned | Knowledge utilization | Synthesis
ChatEval [732] | Model | Base/Fine-tuned | Knowledge utilization | Benchmark collection

In general, there are three main approaches to evaluating LLMs, namely the benchmark-based approach [364], the human-based approach [727], and the model-based approach [729]. Table 15 shows an illustration of the relationship among LLM type, evaluation approach, and tested abilities. Next, we will discuss the evaluation approaches for different types of LLMs.

Evaluation of Base LLMs. Base LLMs refer to the model checkpoints obtained right after pre-training. For base LLMs, we mainly focus on examining the basic abilities (Section 7.1), such as complex reasoning and knowledge utilization. Since most of these basic abilities can be assessed with well-defined tasks, benchmark-based approaches have been widely used to evaluate base LLMs. Next, we will introduce common evaluation benchmarks and evaluation procedures for base LLMs.

• Common benchmarks. To evaluate base LLMs, typical benchmarks are designed in the form of close-ended problems like multiple-choice questions. These commonly used benchmarks can be mainly divided into two categories: knowledge-oriented and reasoning-oriented benchmarks. Knowledge-oriented benchmarks (e.g., MMLU [364] and C-Eval [711]) aim to evaluate the capacity of world knowledge, while reasoning-oriented benchmarks (e.g., GSM8K [643], BBH [365], and MATH [364]) focus on evaluating the capability of solving complex reasoning tasks. Further, some recently proposed benchmarks (e.g., OpenCompass [713]) combine these two types for a comprehensive comparison.

• Benchmark-based evaluation procedure. To perform the benchmark evaluation, each problem will first be formatted into a prompt for LLMs to generate the result text. Then, the generated result text will be parsed with human-written rules to get the predicted answer. Finally, the performance of LLMs can be automatically calculated using standard metrics like accuracy by comparing the predicted answer with the ground-truth one. The evaluation can be conducted in either the few-shot or zero-shot setting, which might lead to different evaluation results or rankings. Since base LLMs have not been instruction fine-tuned (with relatively weak task generalization ability), the few-shot setting is often more suitable for evaluation. For some complex reasoning tasks, CoT prompts also need to be used to fully exhibit the capacity during evaluation. Another note is that this evaluation approach can also be applied to assess the abilities of fine-tuned LLMs. Actually, several leaderboards (e.g., Open LLM Leaderboard [707]) are built upon this approach, evaluating both base and fine-tuned LLMs.
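To make the above procedure concrete, we give a minimal sketch of the benchmark-based evaluation loop in Python. The dataset format, the prompt template, the answer-parsing rule, and the generate() function are illustrative placeholders rather than the exact implementation used in any specific benchmark.

    import re

    def evaluate_benchmark(examples, generate, k_shot_prefix=""):
        """Minimal benchmark evaluation loop: format -> generate -> parse -> score.

        examples: list of dicts with "question", "options" (list), and "answer" (e.g., "A").
        generate: callable that maps a prompt string to the model's output text.
        k_shot_prefix: optional in-context demonstrations prepended to each prompt.
        """
        num_correct = 0
        for ex in examples:
            # 1. Format the problem into a prompt (zero-shot or few-shot).
            options = "\n".join(f"{chr(ord('A') + i)}. {opt}"
                                for i, opt in enumerate(ex["options"]))
            prompt = (f"{k_shot_prefix}Question: {ex['question']}\n{options}\n"
                      "Answer with A, B, C, or D.\nAnswer:")
            # 2. Generate the result text with the LLM.
            output = generate(prompt)
            # 3. Parse the generated text with a human-written rule.
            match = re.search(r"\b([A-D])\b", output)
            predicted = match.group(1) if match else None
            # 4. Compare the predicted answer with the ground-truth one.
            num_correct += int(predicted == ex["answer"])
        return num_correct / max(len(examples), 1)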

Evaluation of Fine-tuned LLMs. Fine-tuned LLMs in this part refer to the model checkpoints obtained after instruction tuning or alignment tuning based on pre-trained model weights (in some cases, they are also called chat models). Typically, fine-tuned LLMs will be tested on various abilities (e.g., knowledge utilization and human alignment), and thus it is common that they are assessed with multiple evaluation approaches. In addition to benchmark-based evaluation, human-based and model-based approaches have also been widely used to evaluate the advanced abilities of fine-tuned LLMs. Next, we will introduce the two evaluation methods.

• Human-based evaluation. Unlike automatic evaluation for basic abilities, human evaluation typically considers more factors or abilities in real-world use, such as human alignment and tool manipulation. In this evaluation approach, test tasks are usually in the form of open-ended questions, and human evaluators are invited to make judgments on the quality of the answers generated by LLMs. Typically, there are two main types of scoring methods for human evaluators: pairwise comparison and single-answer grading. In pairwise comparison, given the same question, humans are assigned two answers from different models to determine which one is better, while in single-answer grading, they only need to score a single answer at a time. For example, HELM [520] employs humans to perform single-answer grading on summarization and disinformation tasks, while Chatbot Arena [727] constructs a crowdsourcing platform that allows users to engage in conversations with two anonymous chat LLMs and report pairwise comparison results.

• Model-based evaluation. Since human-based evaluation is both expensive and time-consuming, some work has proposed leveraging powerful closed-source LLMs such as ChatGPT and GPT-4 as a surrogate for human evaluators [727, 729]. For example, AlpacaEval [729] collects a set of instructions and utilizes a capable LLM (e.g., GPT-4) as the judge to perform pairwise comparisons against the reference outputs. Furthermore, MT-bench [727] collects a set of multi-turn questions for evaluation and improves the reliability of LLM-based evaluators through methods like ICL and CoT. Compared with human evaluators, LLMs such as ChatGPT and GPT-4 can achieve high agreement with humans, in both small-scale handcrafted and large-scale crowdsourced evaluation tasks. Despite this, these closed-source LLMs are limited in access and have the potential risk of data leakage. To address this, recent work [727] has explored fine-tuning open-source LLMs (e.g., Vicuna [138]) as model evaluators using scoring data from human evaluators, which has narrowed the gap with powerful closed-source LLMs (e.g., GPT-4).
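As an illustration of the model-based evaluation just described, the sketch below performs a pairwise comparison with an LLM judge. The judge prompt, the call_judge() function, and the answer-parsing rule are simplified assumptions for illustration, not the exact protocols of AlpacaEval or MT-bench; both presentation orders are queried to reduce the position bias discussed later in this part.

    JUDGE_TEMPLATE = (
        "You are an impartial judge. Given a question and two answers, "
        "decide which answer is better. Reply with exactly 'A' or 'B'.\n\n"
        "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\nVerdict:"
    )

    def pairwise_judge(question, answer_1, answer_2, call_judge):
        """Ask an LLM judge which of two answers is better; call_judge maps a prompt to text."""
        votes = {1: 0, 2: 0}
        # Query both presentation orders to mitigate the judge's position bias.
        for a, b, first in ((answer_1, answer_2, 1), (answer_2, answer_1, 2)):
            prompt = JUDGE_TEMPLATE.format(question=question, answer_a=a, answer_b=b)
            verdict = call_judge(prompt).strip().upper()
            if verdict.startswith("A"):
                votes[first] += 1
            elif verdict.startswith("B"):
                votes[3 - first] += 1
        if votes[1] == votes[2]:
            return "tie"
        return "answer_1" if votes[1] > votes[2] else "answer_2"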
Evaluation of Specialized LLMs. Specialized LLMs refer to the model checkpoints specially adapted to some domains or applications like healthcare [356] and finance [737]. As special task solvers, specialized LLMs will be tested not only on general abilities (e.g., basic abilities like complex reasoning and advanced abilities like human alignment), but also on specific abilities related to their designated domains or applications. For this purpose, one often needs to construct specific benchmarks tailored for the target domains or applications. Then, these domain-specific benchmarks can be combined with general benchmarks to conduct both comprehensive and targeted evaluation for specialized LLMs. For example, MultiMedQA [356] is a specific benchmark in healthcare, which includes medical examinations and healthcare questions. In this work [356], MultiMedQA has been combined with MMLU [364] to assess the performance of specialized LLMs for healthcare, such as Med-PaLM [356]. Similarly, FLUE [737] constructs a benchmark for finance, spanning from financial sentiment analysis to question answering. It has been used collaboratively with BBH [365] to evaluate financial LLMs like BloombergGPT [360].

Pros and Cons of Different Evaluation Approaches. In the above, we have discussed different evaluation approaches to assess the abilities of LLMs. Next, we briefly analyze the pros and cons of each evaluation approach.

• Benchmark-based approach. This evaluation approach can leverage existing benchmarks for assessing the performance of LLMs. The tasks involved in these benchmarks often contain sufficient test samples to measure the core abilities (e.g., reasoning). The whole evaluation procedure can be (almost) automatic, and it is convenient to carry out test experiments for various base LLMs, which is especially useful for monitoring the performance of model checkpoints during pre-training. However, LLMs are often sensitive to the evaluation settings, including the question prompts, zero-shot or few-shot tests, and the answer parsing methods. Thus, one should take possible influencing factors into consideration when conducting the evaluation experiments, and the evaluation results should be reported along with the adopted evaluation settings. Another issue is data contamination [56, 738], i.e., the test data itself or relevant content has been contained in the pre-training corpora. This phenomenon has become increasingly severe since more and more open data has been collected for developing LLMs.

• Human-based approach. Human evaluation offers several advantages when assessing the capabilities of LLMs to solve real-world tasks. One of the key benefits is its ability to directly reflect the actual abilities of LLMs. Based on feedback and experiences from real users, human evaluation provides a more direct measure of LLMs' performance in real-world scenarios. Further, it can conduct more flexible and diverse evaluation tasks based on human evaluators. For instance, users can submit various queries and test the abilities of LLMs according to their own task cognition. It allows for a deep understanding of the strengths and weaknesses of LLMs across different types of tasks and contexts. However, human evaluation also has inherent limitations that could potentially affect its accuracy and consistency. Factors such as personalized tastes and varying education levels among evaluators can introduce biases or even inconsistencies in the evaluation process. In some cases, users' judgments are likely to be subjective, which may not reflect the true capabilities of the LLMs. Moreover, conducting robust and reliable human evaluations often requires a large number of evaluators, which can be very expensive and time-consuming. In addition, human evaluation is often not reproducible, making it infeasible to extend existing evaluation results or track the progress of LLMs.

• Model-based approach. As a surrogate for human-based approaches, model-based approaches serve to diminish the reliance on human involvement and enable more efficient and scalable evaluation. In addition, LLMs can provide meaningful explanations for the assigned rating scores,
thereby enhancing the interpretability of evaluations. Despite their scalability and explainability, model-based approaches have been found to suffer from several issues, including position, verbosity, and self-enhancement bias [727]. Specially, position bias (i.e., the order in which the responses are presented) refers to the fact that LLMs tend to assign higher scores to the answers at specific positions over others, verbosity bias means that LLMs favor verbose answers even if they are of lower quality than shorter answers, and self-enhancement bias indicates that LLMs often overrate their own generations. In addition, since LLMs have limited capacities in solving complex reasoning problems, they cannot serve as qualified evaluators for some difficult tasks (e.g., mathematical reasoning). These limitations can be mitigated to some extent by specific prompt engineering and fine-tuning strategies [727].

To summarize, our categorization (Table 15) of existing work on LLM evaluation is mainly based on two major dimensions, namely evaluation methodology and model type, which are further extended with the tested abilities. There is also some recent work [733, 734] that has discussed the categorization or taxonomies of existing work on LLM evaluation.

7.4 Empirical Evaluation

The above evaluation benchmarks and approaches are mainly employed to evaluate the overall abilities of LLMs. In this part, we conduct a fine-grained evaluation of the abilities discussed in Section 7.1 and Section 7.2. For each kind of ability, we select representative tasks and datasets for conducting evaluation experiments to examine the corresponding performance of LLMs.

7.4.1 Experimental Settings

In this part, we introduce the experimental settings for our evaluation.

Evaluation Models. To conduct the evaluation, we consider representative LLMs, from open-source models to closed-source API-accessing models, as follows:

• Open-source models. Existing open-source models can be categorized into base models and instruction-tuned models. Base models are only pre-trained on a large general-purpose corpus with the language modeling objective, but without further supervised fine-tuning. In our evaluation, we select four representative base models including LLaMA (7B) [57], LLaMA 2 (7B) [99], Pythia (7B and 12B) [96], and Falcon (7B) [747] (experiments with larger models are still in schedule due to the limit of computational resources). Instruction-tuned models are those fine-tuned using instructions (i.e., task datasets, daily chat, or synthetic instructions). In our experiments, we select four representative instruction-tuned models including Vicuna (7B and 13B) [138], Alpaca (7B) [137], and ChatGLM (6B) [93]. In addition, we also include LLaMA 2-Chat (7B) [99] for comparison; it is a representative model that has been aligned with humans via instruction tuning and RLHF, based on LLaMA 2 (7B).

• Closed-source models. In addition to the open-source models, there are also closed-source models that can only be accessed via APIs, which have gained much attention from both developers and researchers. Here, we select five representative closed-source models including text-davinci-002/003 (short as Davinci002/003), ChatGPT, Claude, and Claude 2, where the first three models are developed by OpenAI and the other two are developed by Anthropic.

Tasks and Datasets. Next, we set up the evaluation tasks and datasets for the abilities discussed in Section 7.1 and Section 7.2. We mainly evaluate the zero-shot performance of LLMs on these datasets. For more complex tasks that are hard to solve in the zero-shot manner (e.g., mathematical reasoning and tool manipulation), we mainly report the 3-shot performance, considering the context length limit of open-source models.

• Language generation. As discussed before, for language generation, we consider evaluating three kinds of tasks, i.e., language modeling, conditional text generation, and code synthesis. Specially, we select four commonly used datasets, namely LAMBADA [233] (language modeling), WMT'22 [545] (machine translation), XSum [549] (text summarization), and HumanEval [105] (code synthesis) for evaluation. In WMT'22, we construct a new evaluation set by selecting 1000 examples for each language pair from the original large-scale test set to examine the average performance of LLMs in machine translation. We evaluate the zero-shot performance of LLMs on these datasets, and compute the accuracy of predicting words for LAMBADA, BLEU-4 for WMT'22, ROUGE-L for XSum, and pass@10 for HumanEval.

• Knowledge utilization. To evaluate the ability of knowledge utilization, we select four question answering datasets (i.e., TriviaQA [558], Natural Questions [554], Web Questions [557], and ARC [555]) and a fact extraction dataset, WikiFact [571]. We also report the zero-shot performance of LLMs on these datasets, and compute accuracy for ARC and exact match for the other datasets.

• Complex reasoning. For complex reasoning, we evaluate the comparison models on OpenbookQA [566], HellaSwag [582], and SocialIQA [581] for knowledge reasoning; Colored Objects [70] and Penguins in the Table [70] for symbolic reasoning; and GSM8k [184] and MATH [364] for mathematical reasoning. We compute the accuracy for OpenbookQA, HellaSwag, and SocialIQA; the solve rate for Colored Objects and Penguins in the Table; and the accuracy for GSM8k and MATH. For knowledge reasoning tasks, we evaluate the zero-shot performance, since they are all QA tasks that can be solved in a zero-shot setting. For complex symbolic reasoning and mathematical reasoning tasks, we leverage 3-shot in-context exemplars to better elicit LLMs to accomplish them. Following existing work [33, 443], we also utilize the chain-of-thought prompting strategy for better solving the mathematical reasoning tasks.
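As a concrete illustration of the 3-shot chain-of-thought setting mentioned above, the sketch below assembles a prompt from three worked exemplars and parses the final number from the generated reasoning. The exemplars, template, and extraction rule are simplified assumptions for illustration; in practice the exemplars are drawn from the training split of the target dataset.

    import re

    # Hypothetical worked exemplars used as in-context demonstrations.
    COT_EXEMPLARS = [
        ("Tom has 3 boxes with 4 apples each. How many apples does he have?",
         "There are 3 boxes and each box has 4 apples, so 3 * 4 = 12. The answer is 12."),
        ("A pen costs $2 and a notebook costs $5. How much do 2 pens and 1 notebook cost?",
         "Two pens cost 2 * 2 = 4 dollars. Adding the notebook, 4 + 5 = 9. The answer is 9."),
        ("Lily read 15 pages on Monday and twice as many on Tuesday. How many pages in total?",
         "On Tuesday she read 2 * 15 = 30 pages. In total, 15 + 30 = 45. The answer is 45."),
    ]

    def build_cot_prompt(question):
        """Concatenate three chain-of-thought exemplars with the test question."""
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in COT_EXEMPLARS)
        return f"{shots}\n\nQ: {question}\nA:"

    def extract_answer(generated_text):
        """Take the last number in the generated reasoning as the predicted answer."""
        numbers = re.findall(r"-?\d+\.?\d*", generated_text.replace(",", ""))
        return numbers[-1] if numbers else None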

TABLE 16: Evaluation on the eight abilities of LLMs with specially selected tasks. The shade of the Orange and Blue
fonts denote the performance orders of the results in closed-source and open-source models, respectively. This table will
be continuously updated by incorporating the results of more models.

Language Generation Knowledge Utilization


Models
LBD↑ WMT↑ XSum↑ HumanEval↑ TriviaQA↑ NaturalQ↑ WebQ↑ ARC↑ WikiFact↑
ChatGPT 55.81 36.44 21.71 79.88 54.54 21.52 17.77 93.69 29.25
Claude 64.47 31.23 18.63 51.22 40.92 13.77 14.57 66.62 34.34
Claude 2 45.20 12.93 19.13 78.04 54.30 21.30 21.06 79.97 35.83
Davinci003 69.98 37.46 18.19 67.07 51.51 17.76 16.68 88.47 28.29
Davinci002 58.85 35.11 19.15 56.70 52.11 20.47 18.45 89.23 29.15
LLaMA 2-Chat (7B) 56.12 12.62 16.00 11.59 38.93 12.96 11.32 72.35 23.37
Vicuna (13B) 62.45 20.49 17.87 20.73 29.04 10.75 11.52 20.69 28.76
Vicuna (7B) 63.90 19.95 13.59 17.07 28.58 9.17 6.64 16.96 26.95
Alpaca (7B) 63.35 21.52 8.74 13.41 17.14 3.24 3.00 49.75 26.05
ChatGLM (6B) 33.34 16.58 13.48 13.42 13.42 4.40 9.20 55.39 16.01
LLaMA 2 (7B) 66.39 11.57 11.57 17.07 30.92 5.15 2.51 24.16 28.06
LLaMA (7B) 67.68 13.84 8.77 15.24 34.62 7.92 11.12 4.88 19.78
Falcon (7B) 66.89 4.05 10.00 10.37 28.74 10.78 8.46 4.08 23.91
Pythia (12B) 61.19 5.43 8.87 14.63 15.73 1.99 4.72 11.66 20.57
Pythia (7B) 56.96 3.68 8.23 9.15 10.16 1.77 3.74 11.03 15.75
Knowledge Reasoning Symbolic Reasoning Mathematical Reasoning Interaction with Environment
Models
OBQA↑ HellaSwag↑ SocialIQA↑ C-Objects↑ Penguins↑ GSM8k↑ MATH↑ ALFW↑ WebShop↑
ChatGPT 81.20 61.43 73.23 53.20 40.27 78.47 33.78 58.96 45.12/15.60
Claude 81.80 54.95 73.23 59.95 47.65 70.81 20.18 76.87 47.72/23.00
Claude 2 71.60 50.75 58.34 66.76 74.50 82.87 32.24 77.61 34.96/19.20
Davinci003 74.40 62.65 69.70 64.60 61.07 57.16 17.66 65.67 64.08/32.40
Davinci002 69.80 47.81 57.01 62.55 67.11 49.96 14.28 76.87 29.66/15.20
LLaMA 2-Chat (7B) 45.62 74.01 43.84 43.40 38.93 9.63 2.22 11.19 24.51/5.60
Vicuna (13B) 43.65 70.51 45.97 53.55 36.91 18.50 3.72 8.96 22.74/5.00
Vicuna (7B) 43.84 69.25 46.27 44.25 36.24 14.03 3.54 1.49 6.90/1.40
Alpaca (7B) 47.82 69.81 47.55 39.35 40.27 4.93 4.16 4.48 0.00/0.00
ChatGLM (6B) 30.42 29.27 33.18 14.05 14.09 3.41 1.10 0.00 0.00/0.00
LLaMA 2 (7B) 44.81 74.25 41.72 43.95 35.75 10.99 2.64 8.96 0.00/0.00
LLaMA (7B) 42.42 73.91 41.46 39.95 34.90 10.99 3.12 2.24 0.00/0.00
Falcon (7B) 39.46 74.58 42.53 29.80 24.16 1.67 0.94 7.46 0.00/0.00
Pythia (12B) 37.02 65.45 41.53 32.40 26.17 2.88 1.96 5.22 3.68/0.60
Pythia (7B) 34.88 61.82 41.01 29.05 27.52 1.82 1.46 7.46 10.75/1.80
Human Alignment Tool Manipulation
Models
TfQA↑ C-Pairs↓ WinoGender↑ RTP↓ HaluEval↑ HotpotQA↑ Gorilla-TH↑ Gorilla-TF↑ Gorilla-HF↑
ChatGPT 69.16 18.60 62.50/72.50/79.17 3.07 66.64 23.80 67.20 44.53 19.36
Claude 67.93 32.73 71.67/55.00/52.50 3.75 63.75 33.80 22.04 7.74 7.08
Claude 2 71.11 10.67 60.00/60.00/55.83 3.20 50.63 36.4 61.29 22.19 23.67
Davinci003 60.83 0.99 67.50/68.33/79.17 8.81 58.94 34.40 72.58 3.80 6.42
Davinci002 53.73 7.56 72.50/70.00/64.17 10.65 59.67 26.00 2.69 1.02 1.00
LLaMA 2-Chat (7B) 69.77 48.54 47.50/46.67/46.67 4.61 43.82 4.40 0.00 0.00 0.22
Vicuna (13B) 62.30 45.95 50.83/50.83/52.50 5.00 49.01 11.20 0.00 0.44 0.89
Vicuna (7B) 57.77 67.44 49.17/49.17/49.17 4.70 43.44 6.20 0.00 0.00 0.33
Alpaca (7B) 46.14 65.45 53.33/51.67/53.33 4.78 44.16 11.60 0.00 0.00 0.11
ChatGLM (6B) 63.53 50.53 47.50/47.50/46.67 2.89 41.82 4.00 0.00 0.00 0.00
LLaMA 2 (7B) 50.06 51.39 48.83/48.83/50.83 6.17 42.23 3.80 0.00 0.00 0.11
LLaMA (7B) 47.86 67.84 54.17/52.50/51.67 5.94 14.18 1.60 0.00 0.00 0.11
Falcon (7B) 53.24 68.04 50.00/50.83/50.00 6.71 37.41 1.00 0.00 0.00 0.00
Pythia (12B) 54.47 65.78 49.17/48.33/49.17 6.59 27.09 0.40 0.00 0.00 0.00
Pythia (7B) 50.92 64.79 51.67/49.17/50.00 13.02 25.84 0.20 0.00 0.00 0.00

• Human alignment. For human alignment, we select TruthfulQA [556] to measure whether a LLM is truthful in generating answers to questions, CrowS-Pairs [603] and WinoGender [604] to assess the stereotypes in LLMs, RealToxicityPrompts [605] to evaluate the extent to which LLMs generate toxic language, and HaluEval [602] to test the ability of LLMs to recognize hallucination. As the test set of RealToxicityPrompts is too large, we randomly sample 10,000 examples from it for evaluation. We follow LLaMA [57] to report the zero-shot performance, and compute the accuracy of identifying a claim as true for TruthfulQA, the accuracy of recognizing biased sentences (high perplexity) for CrowS-Pairs, the coreference resolution accuracy (he/she/they) for WinoGender, the toxicity score for RealToxicityPrompts, and the average accuracy of recognizing hallucinations for HaluEval. For TruthfulQA, we follow existing work [57] that utilizes text-davinci-003 to replace humans for scoring. For CrowS-Pairs and WinoGender, we follow the experimental settings of LLaMA [57] to compute the perplexity and coreference resolution score. For RealToxicityPrompts, we utilize the Perspective API (https://perspectiveapi.com/) for toxicity evaluation.

• Interaction with environment. To test this ability, we select ALFWorld [609] and WebShop [610] for evaluation, which simulate real-world scenarios such as household and e-commerce environments. We follow the setting of ReAct [449] that evaluates the 1-shot and 2-shot performance of LLMs on WebShop and ALFWorld respectively, and compute the success rate for ALFWorld and the average score/success rate for WebShop.
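For reference, the pass@10 metric used for HumanEval above is usually computed with the standard unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k), where n samples are generated per problem and c of them pass the unit tests. A small sketch is given below; the sampling and test-execution steps are assumed to be provided elsewhere.

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased estimator of pass@k given n generated samples with c correct ones."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def mean_pass_at_k(per_problem_counts, k=10):
        """Average pass@k over problems; per_problem_counts is a list of (n, c) pairs."""
        scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
        return sum(scores) / max(len(scores), 1)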

TABLE 17: Prompt examples and their performance of ChatGPT on representative tasks. For most tasks, we compare the
performance for simple and complex prompts. We also present the reported performance of supervised methods. “LG”,
“KU”, “CR”, “SDG”, “IR” are short for “language generation”, “knowledge utilization”, “complex reasoning”, “structured
data generation”, “information retrieval”. “-” means there is no reported supervised result previously on this dataset.

Tasks Datasets Instructions ChatGPT Supervised


I want you to act as a translator. Please translate the English 20.66
sentence into Czech.
Translation WMT 41.40 [739]
I want you to act as a translator. Translate the given English 21.12
sentence into Czech, and ensure that the translated sentence is
semantically consistent with the given sentence. \n Sentence:
{source sentence} \n Translation:
LG
Please generate a one-sentence summary for the given document. 21.71

Summarization XSum {document} Try your best to summarize the main content of the given 23.01 42.08 [740]
document. And generate a short summary in 1 sentence for it.\n
Summary:
Choose your answer to the question. {query} {options} 85.19
Closed-Book QA ARC 92.00 [741]
Choose a correct answer according to the given question, and output 85.86
the corresponding id, do not answer other content except the answer
id.
Choose your answer to the question: {question} {choices}. You must 81.20
KU only output A, B, C, or D without any extra explanation. The answer
is
Open-Book QA OBQA 87.20 [741]
Following is a question that requires multi-step reasoning, use 82.20
of additional common and commonsense knowledge, and rich text
comprehension. Choose your answer to the question: \n Question:
Frilled sharks and angler fish live far beneath the surface of the
ocean, which is why they are known as \n Choices: \n A. Deep sea
animals \n B. fish \n C. Long Sea Fish \n D. Far Sea Animals \n You
must only output A, B, C, or D without any extra explanation. The
answer is
Complete the sentence with one or a few words. 29.25
Fact Extraction WikiF 34.20 [520]
Complete the given sentence with one entity name in Wikipedia (MUST 31.21
be a noun) as short as possible, and ensure that the completed
sentence conforms to the facts.
Problem: {problem}\n Answer: 53.20
Symbolic Reasoning C-Objects —
You are an expert in reasoning problem. Here are some examples 66.75
about symbolic reasoning. You can use the knowledge in examples and
solve the last problem. You should follow the examples and generate
the final answer without external solution or words.
CR Problem: {problem}\n Solution: Let’s think step by step. 78.47

Math Word Problems GSM8k Let’s use python to solve math problems. Here are three examples 79.30 63.20 [742]
how to do it,\n Q: Olivia has $23. She bought five bagels for $3
each. How much money does she have left?\n‘‘‘def solution():\n
"""Olivia has $23. She bought five bagels for $3 each. How
much money does she have left?"""\n money_initial = 23\n
bagels = 5\n bagel_cost = 3\n money_spent = bagels *
bagel_cost\n money_left = money_initial - money_spent\n
result = money_left\n return result‘‘‘\n ...... \n How about
this question?\n Q:
Code Synthesis HumanEval I want you act as a code completer. Given a code snippet, your 79.88 48.20 [743]
objective is to complete the code and ensure that it can achieve
the described functionality.
SDG
Text-to-SQL Spider ### Complete sqlite SQL query only and with no explanation.\n 70.10 84.10 [744]
#\n### Sqlite SQL tables, with their properties: \n#\n{table}\n#
{foreign_key}\n#\n### {question}\n SELECT
Recommendation MovieLens I’ve watched the following movies in the past in order: \n 48.80 76.25 [745]
{user_his_text} \n\n Now there are {recall_budget} candidate movies
that I can watch next: \n {candidate_text_order} \n Please rank
these {recall_budget} movies by measuring the possibilities that I
would like to watch next most, according to my watching history.
Please think step by step. \n Note that my most recently watched
movie is {recent_item}. Please show me your ranking results with
IR order numbers. Split your output with line break. You MUST rank the
given candidate movies. You can not generate movies that are not in
the given candidate list.
Conversational ReDial Recommend 10 items that are consistent with user preference. The 17.20 25.60 [746]
Recommenda- recommendation list can contain items that the dialog mentioned
tion before. The format of the recommendation list is: no. title (year).
Don’t mention anything other than the title of items in your
recommendation list

Further, we also follow ReAct [449] to reduce the length of the input prompt and utilize the line break as the EOS token.

• Tool manipulation. For tool manipulation, we consider two kinds of tools, including search engines and model interfaces. Therefore, we adopt two tool manipulation benchmarks, i.e., HotpotQA [579] and Gorilla [617]. HotpotQA requires LLMs to use a search engine to retrieve documents from the web, and Gorilla to invoke model APIs from three hubs, namely TorchHub, TensorHub and HuggingFace. We compute exact match for HotpotQA and accuracy for Gorilla. For HotpotQA, we follow ReAct [449] to report the 3-shot performance. For Gorilla, we follow the code released by its paper [617], and evaluate the zero-shot performance.

Implementation Details. For each task and dataset, we evaluate the compared LLMs using the same prompts and results parsing method provided by existing work (i.e., TruthfulQA, HotpotQA, Gorilla, HaluEval) or designed according to our empirical experience (i.e., TriviaQA, Natural Questions, Web Questions, ARC, WikiFact, GSM8k, MATH, C-Objects, Penguins, LAMBADA, WMT'22, XSum, HumanEval, CrowS-Pairs, WinoGender, RealToxicityPrompts). Specifically, all the experiments on closed-source models are based on invoking their official APIs, while for open-source models, we utilize their publicly available code and model parameters, and perform the inference on 8 A800-80G GPUs. For TriviaQA, OpenbookQA, HellaSwag, and SocialIQA, we experiment on the development set since the test set is not publicly released, while for the other datasets, we experiment on the test set. To reproduce our experiments, we also publicly release our experimental code and data at https://github.com/RUCAIBox/LLMSurvey/tree/main/Experiments.

7.4.2 Results Analysis and Findings

We report the experimental results in Table 16, and analyze the results in the following.

Analysis of Closed-Source Models. We summarize our analysis and findings of the five closed-source models (i.e., ChatGPT, Claude, Claude 2, Davinci003 and Davinci002) as follows:

• These five closed-source models achieve promising results as general-purpose task solvers, in which ChatGPT mostly performs the best. ChatGPT, Claude, Claude 2, Davinci003 and Davinci002 perform well on most tasks, including complex tasks (e.g., GSM8k), which shows their great potential to be general-purpose task solvers. Among them, ChatGPT exhibits a more superior model capacity on the evaluation tasks, winning the most across all tasks. On some evaluation tasks, the performance gap between ChatGPT and other closed-source models is very large, especially for complex tasks, e.g., 78.47 (ChatGPT) v.s. 49.96 (Davinci002) on GSM8k, and 79.88 (ChatGPT) v.s. 51.22 (Claude) on HumanEval.

• Claude 2, ChatGPT and Davinci003 perform better on interaction with environment and tool manipulation tasks. On the two evaluation tasks, Claude 2, ChatGPT and Davinci003 perform better than other models by a large margin, e.g., 36.40 (Claude 2) v.s. 26.00 (Davinci002) on HotpotQA, 44.53 (ChatGPT) v.s. 7.74 (Claude) on Gorilla-TF, and 72.58 (Davinci003) v.s. 22.04 (Claude) on Gorilla-TH. A possible reason is that these three models have been specially optimized towards these advanced abilities, e.g., supporting the use of external plugins.

• All the comparison models perform poorly on very difficult reasoning tasks. On MATH and HotpotQA, all models (including ChatGPT) do not perform well. The two tasks are very difficult to solve, requiring accurate understanding of complex mathematical knowledge and performing multi-hop reasoning across documents, respectively. Further, these models also have a relatively weak performance on the machine translation task (WMT). A possible reason is that WMT also contains many evaluation examples in minor languages, which might not be well covered in the pre-training data of these LLMs.

Analysis of Open-Source Models. Next, we continue to show our analysis and findings about the eight open-source models (i.e., LLaMA 2-Chat, Vicuna, Alpaca, ChatGLM, LLaMA 2, LLaMA, Pythia and Falcon) as follows:

• Instruction-tuned models mostly perform better than the base models. Among all the compared open-source methods, the instruction-tuned models (i.e., LLaMA 2-Chat, Vicuna, Alpaca and ChatGLM) mostly perform better than the non-instruction-tuned models (i.e., LLaMA 2, LLaMA, Pythia and Falcon). It indicates that instruction tuning is generally capable of improving the few-shot or zero-shot ability of LLMs in solving various tasks. However, after instruction tuning, Vicuna (7B) and Alpaca (7B) suffer from performance degradations on LAMBADA, a language modeling task. The reason may be that the instruction data mainly focuses on enabling LLMs to follow human instructions, which is not always useful for the general language generation task.

• These small-sized open-source models perform poorly on mathematical reasoning, interaction with environment, and tool manipulation tasks. On the tasks of mathematical reasoning, interaction with environment and tool manipulation, all these evaluated open-source models do not perform well, including the instruction-tuned ones. A possible reason is that the instruction data for fine-tuning these models is not specifically designed for these tasks. In addition, these small-sized models may have limited model capacities.

• The top-performing model varies on different human alignment tasks. For different human alignment tasks, we can see that these models achieve inconsistent performance rankings. For example, LLaMA 2-Chat (7B) performs the best among the compared open-source models on TruthfulQA, while Vicuna (13B) performs the best on CrowS-Pairs. A possible reason is that these tasks are designed with specific purposes for evaluating different aspects of human alignment, and these models exhibit varied performance on different tasks, even for the variants of the same model (e.g., Pythia (7B) and Pythia (12B)). More experiments and analysis on human alignment evaluation are needed to reveal more detailed findings.

• As a more recently released model, LLaMA 2 (7B) overall achieves a good performance, especially on complex reasoning tasks. For complex reasoning tasks, LLaMA 2 (7B) mostly performs better than other base models, e.g., 43.95 (LLaMA 2 (7B)) v.s. 29.80 (Falcon (7B)) on C-Objects. For other
tasks (e.g., language generation and knowledge utilization), LLaMA 2 (7B) can also achieve comparable performance to the best-performing base models. It has used more data for pre-training (i.e., about 2 trillion tokens), which mainly contributes to the excellent performance. Furthermore, it also conducts a more robust data cleaning process.

• Scaling the open-source models can improve the performance consistently. By comparing the performance of Vicuna (7B) and Vicuna (13B), and Pythia (7B) and Pythia (12B), we can see that the models with larger scales mostly perform better than smaller ones on these evaluation tasks, indicating the effectiveness of scaling up the model size. Across different tasks, scaling the model is more beneficial for more complex tasks (e.g., symbolic and mathematical reasoning), where the larger models mostly outperform smaller ones by a large margin.

Readers should note that these findings about open-source language models are limited to the evaluated model sizes. We will continually update this part by including the results of larger versions of these models, and also call for the support of computational resources for more experiments.

8 APPLICATIONS

In this section, we briefly review the recent progress on the applications of LLMs in two aspects, namely the impact on the research community and on representative domains. Figure 18 shows a content organization of this section (note that we do not aim to cover all the related research directions or domains, but instead demonstrate the use or impact of LLMs via these selected examples).

8.1 LLM for Research Community

Since LLMs have revolutionized the way we develop AI algorithms, they pose a significant impact on the research community. In this part, we briefly review the advances led by LLMs for several representative research directions.

8.1.1 LLM for Classic NLP Tasks

As pre-trained language models (e.g., BERT) have originated in the field of NLP, the technical advances of language models have an important impact on the research of NLP. In this part, we discuss the application of LLMs on five kinds of classic NLP tasks, including word-level, sentence-level, sequence tagging, information extraction, and text generation tasks, which had been the foundation of many existing NLP systems and applications. Note that we do not intend to comprehensively cover all NLP tasks, but instead try to analyze the impact of LLMs on fundamental NLP research through these basic tasks. We also omit the discussion of several tasks (e.g., language modeling) that have been discussed earlier in this survey.

Word/Sentence-level Tasks. As long-standing NLP tasks, word-level (e.g., word clustering [748] and sense disambiguation [749]) and sentence-level tasks (sentence matching [750] and sentiment classification [751]) have been widely studied in the literature and applied in real-world platforms. To solve these tasks, the key is to accurately understand the semantic information about the words or sentences. As rich high-quality labeled data about these tasks has been accumulated so far, existing work [23, 39] finds that small language models can achieve very good performance by fine-tuning on it. Recent studies [55, 752] have also tested the performance of LLMs on these tasks, showing that LLMs can also perform well via in-context learning (with very few examples). Whereas, as small models can be specially optimized on these tasks to learn the specific task requirement and domain knowledge, full-data fine-tuned small models can mostly outperform LLMs using in-context learning on several classic tasks [753, 754], e.g., semantic matching and sentiment analysis.

Sequence Tagging. The sequence tagging tasks, e.g., named entity recognition (NER) [755] and part-of-speech (POS) tagging [756], are also fundamental tasks. Typically, such tasks require assigning each token in the input sequence a proper semantic category label, e.g., the classic B-I-O (Beginning, Inside and Outside) tagging scheme for NER tasks. In the era of deep learning, early efforts [757, 758] mainly integrate the learned sequence representations (e.g., using CNN, LSTM, and BERT) into the classic conditional random field model (CRF), which performs the tagging task based on structural prediction. Recently, researchers have tested the performance of LLMs on sequence tagging tasks, but observed that LLMs still face challenges in solving them using in-context learning [753], especially for special categories with ambiguous or rare names, e.g., the "MISC" (miscellaneous entity) and "ORG" (organization) classes. A possible reason is that LLMs may misunderstand the meanings of these classes in the human-annotated dataset, making it difficult to accurately understand their semantics according to the instruction and the limited examples in the context.

Information Extraction. The information extraction task focuses on automatically extracting useful structured information from unstructured text data, such as relation extraction [759] and event extraction [760], which is also a crucial task relating to many NLP applications. Typically, previous studies formulate this task as a text classification task or a sequential labeling task. As information extraction often needs to accurately understand and process complex semantic relations (multiple relations within one sentence), in-context learning with LLMs typically underperforms state-of-the-art full-data fine-tuning methods [761, 762]. Whereas, it is shown that enabling collaboration between LLMs and small models can further boost the performance of specific tasks [762, 763]. In addition, a recent study [425] also reveals that LLMs can achieve competitive zero-shot performance for information extraction with a two-stage workflow, making this approach attractive in future applications.

Text Generation. Text generation tasks, e.g., machine translation [624] and automatic summarization [548], are long-standing NLP tasks that have been widely studied, and there have been a number of deployed products and systems based on fine-tuned small models [311, 764]. Since the pre-training of LLMs is established on text prediction, they exhibit strong language generation abilities that are comparable to commercial products [627] and humans [628], with the help of proper prompts [765, 766]. Additionally, LLMs are flexible to effectively handle special requirements in real-world application scenarios, e.g., document-level translation [767], and also enable natural language interaction with users to further improve the generation quality [768].

LLM for Application
- Research Directions
  - Classic Scenarios
    - LLM for Classic NLP Tasks: Word/Sentence-level Tasks; Sequence Tagging; Information Extraction; Text Generation
    - LLM for IR: LLM as IR Model; LLM-Enhanced IR Models
    - LLM for Recommendation: LLM as Recommendation Model; LLM-enhanced Recommendation Models; LLM as Recommendation Simulator
  - Enhanced Capabilities
    - Multimodal LLMs: Vision-Language Alignment Pre-Training; Visual Instruction Tuning; Evaluation of MLLM
    - KG Enhanced LLM: Retrieval-augmented LLM; Synergy Augmented LLM
  - New Scenarios
    - LLM-based Agent: Components (Memory/Planning/Execution); Single/Multi-agent based Application
    - LLM for Evaluation: Score/Language-based Evaluation; Instruction Design, Multiple Feedbacks, Debate Agent; Meta-Evaluation
- Specific Domains: Healthcare, Finance, Scientific Research, Law, Education
Fig. 18: The applications of LLMs in representative research directions and downstream domains.

Despite the above success, recent work also reveals that LLMs can hardly address the generation tasks for low-resource languages and domains well, e.g., Marathi-to-English translation [769], due to their unbalanced training data across different languages.

Summary. Based on the above discussion, we summarize the suggestions and future directions for the use of LLMs in classic NLP tasks as follows:

• Suggestions: LLMs and small models have their own merits in different aspects: LLMs can provide unified solutions to various NLP tasks and achieve competitive performance (especially in the zero/few-shot setting), while small models are economical to develop and can be specially tuned according to target tasks, achieving good performance with sufficient high-quality labeled data [753, 754, 770, 771]. In applications, one can make suitable choices based on the actual needs, comprehensively considering flexibility, data availability, training compute, and efficiency.

• Future direction: Despite the excellent general capacities, LLMs still cannot effectively process the NLP tasks in low-resource domains, e.g., minor language translation. To tackle such tasks, effective approaches need to be developed to inject necessary task information or domain-specific knowledge into LLMs, either through fine-tuning or prompting. In addition, it is still challenging for LLMs to handle complex semantic relations in classic NLP tasks (e.g., nested entity extraction), which is worth more exploration from the underlying working mechanism of LLMs. It is also promising to combine LLMs and fine-tuned small language models for complementing each other in solving complex cases of classic NLP tasks [772]. Another promising direction is to conduct human-machine collaborative research (e.g., conversational translation [768]) on NLP tasks, since LLMs can effectively understand human instructions and make meaningful responses.

8.1.2 LLM for Information Retrieval

The goal of information retrieval (IR) systems is to assist users in discovering ideal information resources (typically documents) and mitigating the information overload issue. Typically, contemporary IR systems adopt a retrieve-then-rerank pipeline framework [54]. Within this framework, the retriever initially retrieves relevant information from a large-scale corpus, and the reranker subsequently performs a multi-stage ranking procedure to acquire the most relevant information [773]. Since the advent of LLMs has a significant impact on the way of information access, we discuss how it advances the development of IR from two main aspects, namely LLMs as IR models and LLM-enhanced IR models.

LLMs as IR Models. Existing IR models can be overall categorized into sparse models (relying on term-based lexical similarity) and dense models (relying on embedding-based semantic similarity) [740]. Specially, dense models are mainly implemented by fine-tuned PLMs (e.g., BERT). Compared to PLMs, LLMs have stronger model capacities in capturing text semantics, thus having the potential to improve existing dense IR models. However, due to the high overhead of LLMs, the majority of studies concentrate on employing LLMs as rerankers, aiming to refine the ranking of retrieved candidates. To achieve this, recent efforts often formulate special instructions that enable LLMs to perform reranking on a small set of provided candidate documents. Typically, such an approach does not necessitate model training, and achieves promising results compared with well-trained reranking methods [774, 775]. Specially, the LLM-based reranking approach can be implemented in different ways by zero-shot or few-shot instruction, including pointwise (estimating the relevance score for each query-document pair) [776], pairwise (determining the relevance order of two documents) [775], and listwise ranking (sorting a subset of candidate documents) [777]. The essence of these methods lies in the special design of instructions for text reranking, such as the sliding window strategy for document lists [774, 778], setwise selection prompting [779], fine-grained relevance label incorporation [780], and pairwise comparison prompting [775]. In addition, recent efforts employ LLMs to generate intermediate texts (e.g., URLs) as retrieval results using few-shot demonstrations [781]. To further enhance the model performance, LLMs can be specially fine-tuned as backbones for reranking [782, 783] or retrieval (including dense retrieval [54] and model-based retrieval [784, 785]), similar to the fine-tuning process for traditional PLM-based IR models [782]. However, fine-tuning LLMs as IR models entails considerable expenses given the huge parameter scale of LLMs.
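To illustrate the zero-shot instruction-based reranking idea discussed above, the sketch below builds a simple listwise reranking prompt over a window of candidate documents and parses the returned ordering. The prompt wording, window size, and generate() call are illustrative assumptions rather than the exact designs of the cited methods.

    import re

    def listwise_rerank(query, documents, generate, window=20):
        """Rerank the top-`window` candidates with a zero-shot listwise instruction."""
        candidates = documents[:window]
        listing = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(candidates))
        prompt = (
            "You are a search ranking assistant. Rank the passages below by their "
            f"relevance to the query.\n\nQuery: {query}\n\nPassages:\n{listing}\n\n"
            "Output the passage identifiers from most to least relevant, "
            "e.g., [2] > [1] > [3]."
        )
        output = generate(prompt)
        # Parse the permutation; fall back to the original order for missing items.
        order = [int(i) - 1 for i in re.findall(r"\[(\d+)\]", output)
                 if 0 < int(i) <= len(candidates)]
        seen, ranking = set(), []
        for idx in order + list(range(len(candidates))):
            if idx not in seen:
                seen.add(idx)
                ranking.append(candidates[idx])
        return ranking + documents[window:]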

LLM-Enhanced IR Models. As another major research direction, LLMs can be employed to improve existing IR models (e.g., small models). A common challenge faced by existing IR models is the lack of relevance judgment annotation [786, 787]. To tackle this problem, LLMs can be instructed to annotate positive or negative documents for a given query [788], or to generate corresponding queries based on a set of documents in the corpus by referring to a few demonstrations [789, 790]. In addition to training data augmentation, LLMs have the potential to improve existing IR models by refining the search-oriented informativeness of both queries and documents. In IR systems, the input queries may be constrained by a user's cognitive and cultural competency, making it challenging to accurately express the real intent, and irrelevant content present in documents can also impact the relevance evaluation with the query. As a solution, LLMs can be utilized to rewrite the query, enhancing the understanding of the query intent and incorporating additional knowledge into the query through well-designed instructions. The rewritten query can take the form of an improved version of the original query [791], a document in the corpus that is related to the query [792], or an expansion of the query that is concatenated with a pseudo-generated document [793]. In addition, documents can also be expanded with queries that are generated from the original documents using LLMs for context extension [794].
Remaining Issues. In this part, we further discuss several important issues in applying LLMs to improve IR systems. First, though LLMs are capable of serving as general-purpose task solvers, they are not directly well suited for existing IR systems: they require high overhead for inference [774, 782], have limitations in modeling long texts or document lists [778], and need special adaptation (e.g., instruction tuning) to perform the text ranking task [795]. Therefore, more systematic approaches to adapting LLMs for modern IR systems should be investigated, to leverage their benefits and meanwhile overcome these limitations. Secondly, the advent of LLMs sheds light on the development of new ways of information seeking (e.g., New Bing). It is meaningful to explore how to reshape the architecture and paradigm of IR by integrating the LLMs' capacities and the merits of existing IR systems [796]. Thirdly, existing work mainly focuses on text retrieval tasks, lacking a comprehensive consideration of multimodal information sources. As will be discussed in Section 8.1.4, multimodal large language models [797] are also widely studied, making it feasible to develop more powerful multimedia retrieval systems.

8.1.3 LLM for Recommender Systems

Unlike IR systems that analyze user search queries to retrieve relevant documents, recommender systems (RS) aim to capture the underlying user preference and provide appropriate information resources to users [798-801]. Typically, existing studies train a recommendation model (either a classic or a deep learning model) by fitting it over the user's logged data (e.g., click data) [745, 802]. However, these models often suffer from a series of technical issues, e.g., cold-start recommendation, domain transfer, and poor explainability. Recently, LLMs have demonstrated the potential to alleviate these issues of recommendation models [357, 803, 804], due to their strong capacities of domain generalization and language generation. In this part, we briefly review the recent progress of LLMs in recommender systems from the following three aspects, namely LLMs as recommendation models, LLM-enhanced recommendation models, and LLMs as recommendation simulators.

LLMs as Recommendation Models. With specific methods or mechanisms, LLMs can be adapted to serve as recommendation models. Existing work along this line can be generally divided into two main categories. First, some methods prompt LLMs for completing the recommendation task in a zero-shot paradigm (i.e., without parameter tuning) [805, 806]. A series of prompt engineering methods like recency-focused prompting and in-context learning are introduced to improve recommendation performance as well as alleviate potential model biases [807, 808]. Second, another category of studies aims to specialize LLMs for personalized recommendation through instruction tuning [357, 809]. Specially, high-quality instruction data is key to adapting LLMs to the recommendation tasks, which can be constructed based on user-item interactions with heuristic templates. To further improve the instruction diversity, InstructRec [357] employs the self-instruct technique to simulate large amounts of potential user instructions in various scenarios like product search and personalized recommendation. In addition to representing each item by its text description, there is also growing attention on extending the LLM's vocabulary with semantic identifiers in recommender systems [810, 811], to incorporate collaborative semantics into LLMs.

LLM-enhanced Recommendation Models. In addition to instructing LLMs to directly provide recommendations, researchers also propose leveraging the universal knowledge encoded in LLMs to improve traditional recommender systems. Existing approaches in this line can be divided into three main categories. The first category employs LLMs to infer users' potential intention from their historical interaction data. Furthermore, traditional recommendation/search models employ the inferred intentions to improve the retrieval of relevant items [812, 813]. Additionally, several studies explore the use of LLMs as feature encoders. They employ LLMs to encode the side information of items and users (e.g., item descriptions and user reviews), thus deriving more informative representations of users and items. These representations are then fed into traditional recommender systems as augmented input [814, 815]. As another alternative approach, several studies [816, 817] adopt a distillation-like way to transfer the LLM's capacities (e.g., semantic encoding) to improve traditional recommenders (i.e., small models). Specially, they align the hidden states of LLMs and traditional recommendation models via joint training. After training, since only the enhanced small model will be deployed online, the huge overhead of LLMs in online service can be avoided.

LLM as Recommendation Simulator. Inspired by the recent success of autonomous AI agents [818], LLMs have also been utilized to develop recommendation simulators [819, 820] (exemplified by RecAgent [819]), showing great potential to simulate the real behaviors of users in recommender systems [819, 821, 822]. Specifically, to make the simulation personalized, an agent is equipped with a profiling module that encompasses relevant identity information. Then, a memory module is introduced to store the agent's past interaction experiences. During the process of simulation, agents are further prompted to conduct self-reflection based on their past experiences, to capture their underlying user preference. Most existing recommendation simulators are conducted in a user-oriented way, without explicitly modeling the items in the interaction process. To address this, AgentCF [821] models both users and items as agents, and further facilitates collaborative reflections to simulate user-item interactions, so as to capture the two-sided relations between users and items.

Remaining Issues. Despite these efforts, there are still several challenges to address when applying LLMs in recommender systems. First, existing studies have shown that LLM-based recommendation models in zero/few-shot settings tend to perform worse than traditional ID-based recommenders [806, 807]. This indicates that LLMs might lack an understanding of personalized user behaviors and domain-specific collaborative semantics. Although instruction tuning alleviates this issue to some extent [357, 809], it cannot fully reduce the semantic gap between LLMs and recommender systems, and it also suffers from high tuning costs. Furthermore, recommender systems prioritize minimizing inference latency to enhance the user experience in low-resourced environments (e.g., phones), which poses a challenge to the inference speed as well as the memory overhead of LLMs. Therefore, it is important to explore improvement techniques, such as efficient tuning and quantization methods, to deploy LLMs efficiently and effectively in real-world recommender systems. In addition, existing LLMs have limited capacities in long context modeling, making it difficult to process the huge amount of user-item interaction data. Improved context length extension and context information utilization approaches should be developed to improve the modeling capacities of LLMs for long interaction sequences.
8.1.4 Multimodal Large Language Model
In existing literature [823, 824], multimodal models mainly 49. In existing work, large vision language models (LVLMs) [662] are
also used to term such bimodal models that are developed based on
refer to the models that can process and integrate informa- LLMs. We use the naming of MLLMs in this part due to its wide use in
tion of various modalities (e.g., text, image, and audio) from existing literature.
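To make the generation pipeline and two-stage recipe described above more concrete, the following is a minimal, illustrative sketch in PyTorch-style Python. It is not the implementation of any particular MLLM: the class name, constructor arguments, and the assumption that the language model accepts pre-computed input embeddings are ours, and real systems such as LLaVA use more elaborate connection modules and training procedures.

    import torch
    import torch.nn as nn

    class MiniMLLM(nn.Module):
        """Illustrative MLLM: image encoder + connection module + LLM."""

        def __init__(self, image_encoder, llm, vision_dim, llm_dim):
            super().__init__()
            self.image_encoder = image_encoder                # e.g., a ViT producing patch features
            self.connector = nn.Linear(vision_dim, llm_dim)   # aligns vision features with the LLM space
            self.llm = llm                                    # a decoder-only LM that accepts input embeddings

        def forward(self, image, text_embeds):
            patch_feats = self.image_encoder(image)           # (batch, num_patches, vision_dim)
            visual_embeds = self.connector(patch_feats)       # (batch, num_patches, llm_dim)
            # Concatenate visual and textual embeddings, then let the LLM
            # generate the response autoregressively from this joint input.
            inputs = torch.cat([visual_embeds, text_embeds], dim=1)
            return self.llm(inputs_embeds=inputs)

Under the two-stage recipe above, the first stage would typically update only the connector (and optionally the LLM or vision encoder, depending on the data scale), while visual instruction tuning further fine-tunes the connector and, in some works, the LLM.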
To boost the performance, high-quality visual instruction data is key to eliciting and enhancing the abilities of MLLMs. Therefore, most studies are dedicated to constructing various visual instruction datasets. As the basic approaches, early studies construct visual instructions by distilling from GPT-4 [149] or reformulating vision-language task datasets [151]. To enhance the quality of instruction data, recent work further proposes improved strategies, e.g., increasing the instruction diversity [834], incorporating fine-grained information (e.g., coordinates of objects) into the instructions [833], or synthesizing complex visual reasoning instructions [835].

Evaluation of MLLM. After introducing the approaches to developing MLLMs, we further discuss how to effectively assess the multimodal capabilities of MLLMs from the following three aspects.

• Evaluation perspectives. The evaluation tasks for MLLMs can be categorized into two main types: perception and cognition tasks. Specifically, perception tasks aim to assess the model's abilities in understanding the basic semantics of the image content, while cognition tasks evaluate models with more complex tasks that require reasoning based on perception results. The perception ability is typically evaluated through classification tasks about attributes of the image (e.g., topic and style) and objects (e.g., existence and color) or OCR-related tasks, based on existing datasets or new datasets derived from existing images with annotations by humans or LLMs [836-839]. A notable perception issue is hallucination [840], where the model's responses contain content that is inconsistent with the image. Among existing studies about hallucination in MLLMs [834, 841, 842], object hallucination [843] has received much research attention. To conduct a stable, robust evaluation of object hallucination, POPE [844] proposes a polling-based object probing approach that converts object recognition into a series of binary questions, and the results indicate that current MLLMs often struggle with object hallucination. Cognition tasks, on the other hand, require MLLMs to perform reasoning based on image perception. A common reasoning task is visual question answering (VQA), where models answer questions about images that demand reasoning about spatial relationships [845], general knowledge [846], or scene text [847]. To fully explore the capabilities of MLLMs, HallusionBench [848] collects 200 sophisticated visual dependent or supplement questions, on which even the most advanced MLLMs like LLaVA-1.5 [831] and GPT-4V [133] fail to achieve good performance.

• Evaluation paradigms. The responses of MLLMs can be evaluated either in a closed-ended or an open-ended manner. Traditional multimodal tasks often rely on a closed-ended evaluation framework, where the assessment is based on the exact match between the model's response and the ground-truth answer. Examples include the VQA score [849] for visual question answering tasks and the CIDEr [850] score for captioning tasks. However, MLLMs generate responses in an open-ended way, which may contain the correct answer but not exactly match the ground truth. This discrepancy can lead to the underestimation of the model's performance under previous evaluation paradigms. To address this issue, recent approaches have incorporated humans or LLMs as evaluators [829]. For instance, MMBench [838] employs ChatGPT to align the model responses with the most relevant option in a set of multiple-choice questions. Similarly, LLaVA [851] utilizes GPT-4 for evaluating MLLMs' output, where GPT-4 takes the generated image captions and object bounding boxes as visual inputs for assessment. Such open-ended evaluation methods can improve assessment accuracy while incurring higher costs due to the involvement of humans or LLMs.

• Evaluation benchmarks. To facilitate a more thorough evaluation of MLLMs, various benchmarks have been developed. Part of them collect existing vision-language tasks for comprehensive evaluation. For instance, LVLM-eHub [852] aggregates 47 existing text-related visual tasks to assess six distinct capabilities of MLLMs, and Reform-Eval [853] takes this a step further by standardizing questions from existing benchmarks into a uniform format and discusses how the backbone models influence MLLMs' performance. In addition to incorporating existing tasks, several works also derive new questions annotated by humans or with the help of LLMs. MME [839] creates a dataset by pairing images from public sources with manually collected text instructions for perception and cognition evaluations. MMBench [838] transforms these instructions into multiple-choice questions and introduces CircularEval to ensure evaluation consistency. SEED-Bench [854] further considers temporal understanding tasks and enlarges the evaluation scale to 19K multiple-choice questions with the assistance of LLMs. MM-Vet [855] presents more complex tasks to assess the integrated multimodal capabilities of MLLMs. It starts by defining six essential multimodal abilities and then creates intricate questions by combining multiple abilities. In summary, the above benchmarks collectively contribute to the comprehensive evaluation and improved development of MLLMs.

Key Points for Improving MLLMs. To develop capable MLLMs, we continue to discuss three key points for improving the model capacities, from the perspectives of instruction data, training strategy, and safety and alignment.

• Visual instruction data. Extensive work [831, 856] has empirically found that both the quantity and quality of visual instructions have an important impact on the model performance of MLLMs. One basic way to construct visual instructions is to leverage the exceptional capability of LLMs to synthesize instructions based on text descriptions of images [851]. To further enhance the quality of instructions, one can construct fine-grained visual instructions with the help of human annotation [833, 857] or synthesize more complex data through carefully designed prompts [835]. Despite the effectiveness of the above LLM-based approaches, one primary question emerges as to whether an LLM (i.e., a text generation model without training on any images) possesses the ability to generate sufficiently good visual instructions solely based on verbalized visual information (e.g., captions and coordinates). Indeed, existing work has revealed that visual instructions generated by LLMs sometimes contain misinterpretations of the visual information, e.g., object hallucination [844]. Therefore, it is crucial to design effective verification methods to control the quality of instruction data generated by LLMs [835]. Furthermore, it still needs more investigation what makes good visual instructions and how visual instructions elicit specific multimodal abilities in MLLMs.
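As a concrete illustration of the data being discussed, the following shows what a single visual instruction instance may look like, together with the kind of verbalized visual information (caption and object coordinates) that a text-only LLM could be given to synthesize it. The schema, field names, and content here are hypothetical examples of ours, not drawn from any specific dataset.

    # A hypothetical visual instruction example (schema and content are illustrative only).
    visual_instruction_example = {
        "image": "street_scene_001.jpg",   # identifier of the input image
        "instruction": "How many people are waiting at the bus stop, and what is the weather like?",
        "response": (
            "Two people are waiting at the bus stop. The wet road and the open "
            "umbrella suggest that it is raining."
        ),
    }

    # Verbalized visual information that a text-only LLM might receive when
    # synthesizing such instructions: a caption plus object coordinates.
    verbalized_context = {
        "caption": "Two people stand at a bus stop on a rainy street.",
        "objects": [
            {"name": "person", "bbox": [34, 120, 78, 260]},
            {"name": "person", "bbox": [90, 118, 130, 255]},
            {"name": "umbrella", "bbox": [30, 95, 140, 150]},
        ],
    }

Verification methods for LLM-generated instructions can, for example, check that every object mentioned in the generated response also appears in such grounded annotations, which directly targets the object hallucination issue noted above.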
• Model training. Different from LLMs, MLLMs are not trained from scratch, but instead developed based on pre-trained language and vision models. Existing work employs a typical two-stage approach for training MLLMs, i.e., vision-language alignment pre-training and visual instruction tuning. In essence, existing MLLMs aim to (1) preserve the inherent capabilities and parametric knowledge of LLMs as much as possible, and meanwhile (2) effectively adapt to multimodal tasks by leveraging the pre-trained LLMs and visual encoders. To achieve the above two goals, two typical training strategies are often employed for visual instruction tuning, either only optimizing the connection module [151] or fine-tuning both the connection module and the LLM component [851]. As we can see, the former can preserve the original capacities of LLMs but likely yields weaker adaptation performance, while the latter can fully adapt to multimodal tasks but suffers from a loss of the original capacities of LLMs. More efforts should be made to investigate how to effectively balance the two aspects, so as to achieve improved multimodal capacities. In addition, existing MLLMs are still overly dependent on the capacities of LLMs, which limits the performance on many multimodal tasks (e.g., spatial positioning). It will be meaningful to explore improved training approaches for language models, so that multimodal information can also be utilized in this process.

• Safety and alignment. Safety and alignment have been widely discussed for LLMs, aiming to regulate the behaviors of models by technical approaches [66]. This topic is also important to MLLMs. Even a highly advanced MLLM (e.g., GPT-4V [133]) can be susceptible to safety issues. For example, GPT-4V might occasionally exhibit factual inaccuracies and baseless inferences about images. In some cases, it may even generate harmful content targeting specific individuals or groups [133]. Furthermore, open-sourced MLLMs are also prone to generating hallucinated responses [844] and can be easily manipulated to produce harmful content [858]. To address the aforementioned issues, some studies collect specialized visual instructions to mitigate the problem of hallucination [834]. Another alternative approach is to train a revision model to rectify hallucinated responses generated by MLLMs in a post-hoc way [859]. Additionally, aligning MLLMs with RLHF can also assist MLLMs in generating responses with improved factuality [860]. Despite these efforts, existing alignment techniques for MLLMs mainly concentrate on several specific aspects (e.g., hallucination), lacking a comprehensive consideration of alignment criteria. More efforts should be made to promote the research of safety and alignment for MLLMs.

8.1.5 KG-Enhanced LLM

Despite their excellent capacities, LLMs often suffer from challenges on knowledge-intensive tasks, such as the potential to generate hallucinated content [602] and the lack of domain-specific knowledge [861]. As a promising solution, knowledge graphs (KGs), which store enormous knowledge in the triple format, i.e., ⟨head entity, relation, tail entity⟩, can be utilized to enhance the task performance of LLMs by providing precise and necessary knowledge. Generally, knowledge-enhanced approaches can be expanded to other forms of structured data (e.g., tables and databases) [862], while we limit our discussion to the integration of KGs for improving LLMs, which is detailed in two aspects, namely retrieval-augmented LLM and synergy-augmented LLM.

Retrieval-Augmented LLM. Due to the huge amount of fact records in a KG, existing work typically adopts a retrieval model to first obtain a relatively small subgraph from the KG, and then leverages it to enhance LLMs by enriching the relevant knowledge. Before the advent of LLMs, the retrieved subgraphs were often supplemented into training data, injecting knowledge information into PLMs via parameter learning [863-865]. In contrast, to leverage the retrieved knowledge, LLMs mainly incorporate it as part of the prompt, without parameter update. To implement this approach, there are two main technical problems, i.e., how to retrieve relevant knowledge from KGs and how to make better use of the structured data by LLMs. For the first issue (i.e., retrieving relevant knowledge), a typical approach is to train a small language model (e.g., RoBERTa) to identify question-related fact triples [866]. To further improve the retrieval performance, several studies also propose an iterative reading-then-reasoning framework, enabling the LLM to interact with the KG multiple times and acquire the required knowledge in a more accurate way [458]. For the second issue (i.e., utilizing retrieved knowledge), a straightforward approach is to serialize the retrieved subgraph and craft specific prompts to include it as the input of LLMs [471, 651]. However, due to the loss of structured information in knowledge serialization, LLMs cannot fully capture the structural semantics conveyed by the original KGs. To address this issue, several model-based approaches train a specialized language model (e.g., T5) to transform the subgraph into natural language text [867]. To guarantee the transformation accuracy, this relies on sufficient training pairs (often constructed in an unsupervised way) [868] and excellent model capability [869].

Synergy-Augmented LLM. To solve complex tasks (e.g., multi-hop question answering [656]), LLMs often need to query a KG multiple times, following a systematic solution plan. We call such a multi-turn interaction approach to enhancing LLMs synergy-augmented LLM. To better synergize the LLM and the KG in a complementary manner, recent studies propose to decompose the complex task into multiple sub-goals and iteratively solve each one by leveraging the necessary knowledge from the KG [458, 870, 871]. In this process, the LLM can be regarded as an autonomous agent (detailed in Section 8.1.6), which automatically generates the plan and executes it through interaction with the KG environment [870]. Specifically, the mainstream approaches typically start by enumerating the candidates using the available knowledge information at the current step, and then retrieve the most appropriate candidates for the next step according to the question [870, 871]. By iterating the above two steps, LLMs can gradually collect relevant evidence [870, 871] and finally approach the correct solution. Despite the effectiveness, enumeration of the candidates over the KG would lead to a vast search space [872].
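To illustrate the straightforward serialization strategy mentioned above for retrieval-augmented LLMs, the sketch below linearizes retrieved triples and embeds them in a prompt. The retrieval step is assumed to be given, and the function names and prompt wording are illustrative placeholders of ours rather than the design of any specific system.

    def serialize_triples(triples):
        """Linearize KG triples <head, relation, tail> into plain text."""
        return "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)

    def build_kg_prompt(question, triples):
        """Compose a knowledge-augmented prompt for an LLM (wording is illustrative)."""
        return (
            "The following facts are retrieved from a knowledge graph:\n"
            f"{serialize_triples(triples)}\n\n"
            f"Question: {question}\n"
            "Answer the question using only the facts above."
        )

    # Example usage with a toy retrieved subgraph.
    retrieved = [
        ("Marie Curie", "award_received", "Nobel Prize in Physics"),
        ("Marie Curie", "spouse", "Pierre Curie"),
    ]
    prompt = build_kg_prompt("Which prize did Marie Curie receive?", retrieved)

As noted above, such flat serialization loses structural semantics, which motivates model-based verbalization as well as the interface-based approach discussed next.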
To address the search space issue, StructGPT [458] proposes a more efficient way to access knowledge information using specialized interfaces for KGs. Specifically, it carefully designs the specialized interfaces according to the common data operations on KGs (e.g., relation extraction and triple extraction), to ensure efficient and accurate data extraction. In this way, LLMs can be instructed to better manipulate and process the structural information of KGs, thus achieving improved task performance.

Future Directions. Besides the above approaches, several promising directions for KG-enhanced LLMs remain underexplored. First, due to the variety of structured data, it is still difficult for LLMs to directly leverage various kinds of knowledge sources, e.g., domain-specific KGs. Therefore, it is essential to explore a unified way for LLMs to manipulate and utilize different knowledge sources. As a potential solution, it is promising to develop effective approaches that help LLMs comprehend and make use of the access interfaces provided by specific knowledge sources to acquire precise knowledge [458], while more efforts should be made to investigate how to adapt to the data variety in a cost-effective way. Second, with the evolution of real-world information, the knowledge stored in LLMs may become outdated or incorrect. It is necessary to explore how to synchronize the updated knowledge into LLMs in a cost-effective manner [873, 874]. Third, it is promising to investigate the use of factual information from KGs to align LLMs toward generating more faithful content [875, 876], which can help reduce the hallucination of LLMs.

In addition to exploring KG-enhanced LLMs, it is also meaningful to leverage LLMs to improve the tasks on the KG side (i.e., LLM4KG) [861, 877]. A typical example is that LLMs can help supplement or construct the KG. We omit the discussion of this part, since it is beyond our scope.

8.1.6 LLM-based Agent

The research on agents in AI aims to develop entities that can perceive the environment, make decisions, and take actions to achieve specific goals [878]. However, traditional agents are often limited to heuristic rules or specific environments, which constrain their generalization to open-domain scenarios [879]. Given that LLMs possess excellent capacities in solving complex tasks, they have rapidly emerged as promising solutions for serving as the core computation unit of agents [818]. In this part, we will first introduce the framework for LLM-based agents and then discuss their applications.

Overall Framework. Next, we first detail the key components of an LLM-based agent and then present the typical workflow.

• Components. Typically, there are three main components in an LLM-based agent: memory, planning^50, and execution. Specifically, the memory component aims to store the information perceived from the environment and can be utilized to support decision-making. In particular, LLM-based agents usually maintain information in both short-term memory and long-term memory with the operations of reading and writing. Short-term memory usually refers to the internal context window of LLMs (i.e., input), which LLMs can read and write through actions like reasoning [880]. In contrast, long-term memory can be mapped to external storage such as vector databases [537], where LLMs can read through retrieval and write with reflection [686]. Notably, profiles are usually implemented with long-term memory, which is an important feature for an agent that specifies its role and function [818]. The planning component is responsible for generating the action plan based on the information from the memory component. In data format, the plan usually takes the form of text-based instructions [441] or code-based programs [443]. To generate it, LLM-based agents will first propose several candidates and then select a more suitable one among them [436]. The initial plan can be further refined with execution feedback from the environment [528]. The execution component is in charge of carrying out the plan from the planning component, which can be fulfilled by the internal LLM [441] or external tools [880].

50. Section 6.4 introduces planning as a utilization approach for LLMs, while in this section, we describe its utilization as a functional component in LLM-based agents.

• Workflow. With the three components mentioned above, a typical workflow of an LLM-based agent is as follows. First, it receives information from the environment and writes it into short-term memory. Then, the agent processes the newly received information in the short-term memory. Such a process can be enhanced with information retrieved from long-term memory. Subsequently, the planning component utilizes the processed information from short-term memory to generate the next plan. Finally, the execution component carries out the plan generated by the planning component, which can be further assisted with external tools. By repeating the aforementioned process, the LLM-based agent can autonomously adjust its behavior in response to feedback from the environment and ultimately achieve its goal. Once LLM-based agents receive user requests or are assigned goals, they follow the above workflow to accomplish tasks through multi-turn interactions with the environment.

To summarize, in an LLM-based agent, the LLM serves as the core computation unit and is equipped with components including memory, planning, and execution. These components are integrated in a systematic way under the control of the LLM during interactions with the environment. For more details, the readers might refer to the comprehensive survey on LLM-based AI agents [818].

Applications. Recently, LLM-based agents have shown great potential in autonomously solving complex tasks, making it feasible to rapidly develop capable applications for specific domains or tasks. In this section, we will discuss the applications in single-agent and multi-agent scenarios.

• Single-agent based applications. Applications based on a single-agent mode mainly aim to develop capable task solvers that can autonomously complete user requests. A large number of single-agent projects have been developed, which focus on general-purpose task solving. As a representative project, AutoGPT [534] empowers LLMs with long/short-term memory management and external tools like search engines. In order to autonomously address a user request, AutoGPT understands the request with knowledge from its memory and actions like reasoning, decomposes it into a detailed plan, executes the plan step-by-step with the assistance of tools, and refines the rest of the plan based on feedback from the environment. Such an iterative process continues until the user request is successfully resolved.
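The workflow just described, as instantiated by projects such as AutoGPT, can be summarized as a simple perceive-plan-act loop. The sketch below is a deliberately simplified illustration under assumed interfaces: the llm, tools, and memory objects and their methods are hypothetical placeholders, not the API of AutoGPT or any other framework.

    def run_agent(goal, llm, tools, long_term_memory, max_steps=10):
        """A minimal LLM-based agent loop: memory -> planning -> execution."""
        short_term_memory = [f"Goal: {goal}"]
        for _ in range(max_steps):
            # Enrich the working context with relevant long-term memories.
            recalled = long_term_memory.retrieve(query=goal)
            context = "\n".join(short_term_memory + recalled)

            # Planning: ask the LLM for the next action, e.g. "tool: input" or "finish: answer".
            plan = llm.generate(
                context + "\nDecide the next action as 'tool: input' or 'finish: answer'."
            )
            if plan.startswith("finish:"):
                return plan.split("finish:", 1)[1].strip()

            # Execution: carry out the plan with an external tool and observe the result.
            tool_name, tool_input = plan.split(":", 1)
            observation = tools[tool_name.strip()](tool_input.strip())

            # Write the new experience back into memory for later steps.
            step_record = f"Action: {plan}\nObservation: {observation}"
            short_term_memory.append(step_record)
            long_term_memory.write(step_record)
        return "Stopped after reaching the step limit."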
Other similar projects include GPT-Engineer [881] and XAgent [882]. In addition, there is also some work that aims to develop autonomous agents for specific domains, such as WebGPT [81] for the web-browsing environment, ProgPrompt [530] for the real-life environment, and Voyager [697] for the Minecraft environment.

• Multi-agent based applications. Different from single-agent systems where agents work independently, multi-agent systems work in collaboration to unleash collective intelligence. Typically, multiple agents can be instantiated from the same or different LLMs, each with their respective roles and functions. According to the coordinating strategies among these agents, multi-agent systems can be divided into two categories: cooperation-based and competition-based. In the cooperation-based mode, to share information and seek collaborative actions among agents, various communication protocols have been proposed, including free-form dialogue [883], structured documents [884], and data embedding [885]. Based on the communication protocol, agents can be effectively organized for downstream applications, such as software engineering [884], user behavior analysis [819, 821], and society simulation [533]. In the competition-based mode, debate serves as one of the popular communication protocols to foster divergent thinking and elicit valuable external feedback among agents. Such a mode is beneficial for domains that demand precise decision-making and accurate responses, such as mathematical reasoning [886] and evaluation [732].

Remaining Issues. Despite the huge success, there are still several issues that limit the development and applications of LLM-based agents. First, with the explosive growth of the model scale, the efficiency of LLM-based agents, including both the time and memory overhead, becomes an important issue for large-scale deployment, especially for multi-agent systems with numerous instances of LLMs. Second, with the scaling of the number of LLM-based agents, more effective and efficient communication protocols and architectures are required to support the increased complexity of coordination among agents. Furthermore, building capable agents poses technical challenges for the capacities of LLMs like instruction following and long text modeling. Since existing LLMs are not specially optimized for instantiating agents, most public-sourced LLMs like LLaMA cannot effectively facilitate the development of agents. Therefore, it is crucial to develop capable, specialized models to serve as the core computation unit of agents.

8.1.7 LLM for Evaluation

While human evaluation can generally offer reliable quality assessment, it is also often hindered by high annotation costs, significant time requirements, and annotation inconsistencies [887]. In contrast, automatic evaluation can be employed as a scalable alternative to human evaluation. Traditional automatic evaluations have relied on reference-based metrics (e.g., BLEU and ROUGE). Recently, the emergence of LLMs as general task solvers highlights their potential as automatic evaluators [647, 727], making it promising to conduct LLM-based evaluation. In the following part, we will introduce the recent progress on LLM for evaluation, including evaluation formats, methods, meta-evaluation, and the remaining issues.

Evaluation Formats. Depending on the type of evaluation outcome, the evaluation format can be categorized into score-based evaluation and language-based evaluation. Score-based evaluation employs measurable metrics to assign quality scores (e.g., ratings or rankings) to evaluated texts. A prevalent way is to conduct pairwise comparison, where LLMs are used to determine the partial order relation of candidate texts following specific guidelines [354, 647, 727], which greatly simplifies the evaluation task. However, it may face an inefficiency issue when scaling up the number of candidates [727]. When high-quality reference texts are available during evaluation, LLMs can be instructed to score texts under the guidance provided by references [716, 727, 728]. On the other hand, language-based evaluation focuses on generating critiques and suggestions, offering qualitative explanations beyond simple quantitative scoring [371, 888-890]. It is particularly useful for gathering language feedback signals for human alignment tuning [371, 888]. Furthermore, it can evolve into a multi-turn interaction framework, where LLM-based evaluators provide natural language feedback to existing solutions from task solvers [891]. This framework evaluates the ability of LLMs to leverage language feedback for refining self-generated solutions.

Evaluation Methods. A common method for LLM-based evaluation involves prompting LLMs with specific instructions. To further improve the quality of LLM-based evaluation, recent work proposes to prompt LLMs with varied contexts to generate diverse evaluation feedback. These contexts vary in aspects such as the candidate order [647, 727], evaluation perspectives [892, 893] (e.g., relevance, clarity, originality), and evaluation explanation [647]. The multiple generated evaluation feedbacks are then aggregated to produce a final evaluation result, which makes the evaluation process less prone to biases from individual feedback and allows for a more thorough evaluation by covering a wider range of evaluation aspects. To further improve the quality of single-model evaluation, recent studies also develop multi-agent collaboration frameworks [893-895] or fine-tune LLMs as specialized evaluators [371, 888-890, 896]. In a multi-model collaboration mode, different LLMs evaluate the candidates by engaging in discussions to align preferences and reach a consensus [894, 895]. This method helps reduce the potential biases in individual models through the consensus reached by multiple agents. Another approach to improving single-model evaluation is to specialize LLMs as scorers or critics through fine-tuning [371, 888-890, 896]. This process involves creating datasets annotated with preferences and feedback from humans or proficient LLMs. These datasets are then used to train evaluation-oriented models, enabling them to generate pairwise preferences or language feedback. The specialized LLM evaluators demonstrate competitive performance with fewer parameters [889, 890, 896].
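As a concrete illustration of the score-based pairwise format and of how varied contexts (here, the candidate order) can be used to counteract evaluator bias, the sketch below prompts an LLM judge twice with swapped answer positions. The prompt wording and the llm.generate interface are illustrative placeholders of ours, not the protocol of MT-Bench or any specific evaluator.

    def pairwise_judge(llm, question, answer_a, answer_b):
        """Ask an LLM evaluator which of two candidate answers is better ('A' or 'B')."""
        prompt = (
            "You are an impartial judge. Given the question and two answers, "
            "reply with a single letter, 'A' or 'B', for the better answer.\n\n"
            f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\nVerdict:"
        )
        return llm.generate(prompt).strip()

    def debiased_comparison(llm, question, answer_1, answer_2):
        """Run the comparison in both orders to mitigate order (position) bias."""
        first = pairwise_judge(llm, question, answer_1, answer_2)   # answer_1 shown as A
        second = pairwise_judge(llm, question, answer_2, answer_1)  # answer_1 shown as B
        if first == "A" and second == "B":
            return "answer_1"
        if first == "B" and second == "A":
            return "answer_2"
        return "tie"  # inconsistent verdicts are treated as a tie

Aggregating such swapped-order (or multi-perspective) verdicts is one simple instance of the feedback-aggregation idea discussed under evaluation methods above.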
Meta-Evaluation. To effectively assess the quality of LLM-based evaluators, meta-evaluation benchmarks have been introduced for gauging the agreement with human preferences and the fairness of the evaluations made by LLMs [647, 727, 893, 897, 898]. As a representative benchmark, MT-Bench [727] evaluates the agreement between LLMs and human judgments, demonstrating that GPT-4 aligns closely with human preferences in no-tie comparisons on 80 multi-turn questions. In addition, to address potential biases arising from subjective human evaluations, LLMBar [897] manually designs outputs that are objectively worse but superficially appealing, which could mislead evaluators. The evaluation results reveal that even the most advanced LLMs still fall short of human-level evaluation in this challenging setting.

Remaining Issues. As discussed in Section 7.1.1, recent studies demonstrate that LLM-based evaluators expose multiple types of bias, such as order bias, self-preference bias, and length bias [647, 727]. Although some biases can be mitigated through methods like multi-path ensemble or multi-agent collaboration, they remain inherent to LLM-based evaluators. Consequently, addressing these biases intrinsically within the models continues to be a challenging issue. In addition, recent work has revealed that LLMs may be incapable of understanding self-generated content, exhibiting a weaker understanding capacity compared to their generation capabilities [899]. Even the most advanced LLMs still struggle to identify their reasoning or factual errors without external feedback [900, 901]. Consequently, current LLM-based evaluators might not be adequate for evaluating top-tier LLMs or complex tasks. This underscores the importance of improvement approaches for LLM-based evaluators, especially for evaluating capable LLMs and complex tasks demanding sophisticated reasoning, planning, and domain-specific knowledge.

8.2 LLM for Specific Domains

In this part, we discuss the applications of LLMs in several representative domains, including healthcare, education, law, finance, and scientific research assistance.

Healthcare is a vital application field closely related to human life. Ever since the advent of ChatGPT, a number of studies have applied ChatGPT or other LLMs to the medical domain. It has been shown that LLMs are capable of handling a variety of healthcare tasks, e.g., biology information extraction [763], medical advice consultation [902], mental health analysis [903], and report simplification [904]. As the major technical approach, researchers typically design specific prompts or instructions to guide LLMs to perform a wide range of medical tasks. To further harness the power of LLMs in the healthcare domain, researchers propose to develop healthcare-related LLMs [356, 905, 906]. Specifically, the Med-PaLM models [356, 905] achieve expert-level performance on the United States Medical Licensing Examination (USMLE), and earn greater approval from physicians in answering consumers' medical questions. However, LLMs may fabricate medical misinformation [904, 907], e.g., misinterpreting medical terms and suggesting advice inconsistent with medical guidelines. In addition, uploading the health information of patients [763] to a commercial server that supports the LLM also raises privacy concerns.

Education is also an important application domain where LLMs potentially exert significant influence. Existing work has found that LLMs can achieve student-level performance on standardized tests [46] in a variety of subjects (e.g., mathematics, physics, and computer science), on both multiple-choice and free-response problems. In addition, empirical studies have shown that LLMs can serve as writing or reading assistants for education [908, 909]. A recent study [909] reveals that ChatGPT is capable of generating logically consistent answers across disciplines, balancing both depth and breadth. Another quantitative analysis [908] shows that students utilizing ChatGPT (either keeping or refining the results from LLMs as their own answers) perform better than average students in some courses from the computer security field. Recently, several perspective papers [910, 911] also explore various application scenarios of LLMs in classroom teaching, such as teacher-student collaboration, personalized learning, and assessment automation. However, the application of LLMs in education may lead to a series of practical issues, e.g., plagiarism, potential bias in AI-generated content, overreliance on LLMs, and inequitable access for non-English speaking individuals [912].

Law is a specialized domain that is built on professional domain knowledge. Recently, a number of studies have applied LLMs to solve various legal tasks, e.g., legal document analysis [913], legal judgment prediction [914], and legal document writing [915]. A recent study [916] has found that LLMs exhibit powerful abilities of legal interpretation and reasoning. Moreover, the latest GPT-4 model achieves a top-10% score on a simulated bar exam compared with human test-takers [46]. To further improve the performance of LLMs in the law domain, specially designed legal prompt engineering is employed to yield advanced performance in long legal document comprehension and complex legal reasoning [917, 918]. To summarize the progress, LLMs can act as helpful assistants to the legal profession. Despite the progress, the use of LLMs in law raises concerns about legal challenges, including copyright issues [919], personal information leakage [920], and bias and discrimination [921].

Finance is an important field where LLMs have promising application prospects. LLMs have been employed on various finance-related tasks, such as numerical claim detection [922], financial sentiment analysis [923], financial named entity recognition [924], and financial reasoning [925]. Despite the competitive zero-shot performance exhibited by general-purpose LLMs on finance tasks, they still underperform domain-specific PLMs containing million-scale parameters [922]. To leverage the scaling effect of LLMs, researchers collect large-scale finance corpora for continually pre-training LLMs (e.g., BloombergGPT [360], XuanYuan 2.0 [926], and FinGPT [927]). BloombergGPT has demonstrated remarkable performance across a diverse range of financial tasks while maintaining competitive performance on general-purpose tasks [360]. Nevertheless, it is imperative to consider the potential risks in the application of LLMs in finance, as the generation of inaccurate or harmful content by LLMs could have significant adverse implications for financial markets [360]. Therefore, stricter reviewing and monitoring are needed for the use of LLMs in the financial field.
Scientific research is another promising field where LLMs can empower the development progress. Prior research demonstrates the effectiveness of LLMs in handling knowledge-intensive scientific tasks (e.g., PubMedQA [928], BioASQ [929]), especially for LLMs that are trained on scientific-related corpora [35, 203, 930]. Given the excellent general abilities and broad scientific knowledge, LLMs hold significant potential as helpful assistants across various stages of the scientific research pipeline [931]. First, during the literature survey stage, LLMs can help conduct a comprehensive overview of the progress in a specific research field [932, 933]. Second, during the research idea generation stage, LLMs demonstrate the ability to generate intriguing scientific hypotheses [934]. Third, during the data analysis stage, LLMs can be employed to conduct automatic approaches to analyzing the data characteristics, including data exploration, visualization, and deriving analytical conclusions [935, 936]. Fourth, during the paper writing stage, researchers can also benefit from the assistance of LLMs in scientific writing [937, 938], in which LLMs can offer valuable support through diverse means, such as summarizing the existing content and polishing the writing [939]. In addition, LLMs can aid in the automated paper review process, encompassing tasks such as error detection, checklist verification, and candidate ranking [940]. Despite these advances, there is much room for improving the capacities of LLMs to serve as helpful, trustworthy scientific assistants, to both increase the quality of the generated scientific content and reduce the harmful hallucinations.

Summary. In addition to the aforementioned work, the applications of LLMs have also been discussed in several other domains. For instance, in the psychology domain, some recent work has studied the human-like characteristics of LLMs, such as self-awareness, theory of mind (ToM), and affective computing [941, 942]. In particular, an empirical evaluation of ToM conducted on two classic false-belief tasks speculates that LLMs may have ToM-like abilities, since the model in the GPT-3.5 series achieves comparable performance with nine-year-old children on ToM tasks [941]. In addition, another line of work has investigated applying LLMs to the software development domain, e.g., code suggestion [943], code summarization [944], and automated program repair [945]. To summarize, using LLMs to assist humans in real-world tasks has become a significant area of research. However, it also presents challenges. Ensuring the accuracy of LLM-generated content, addressing biases, and maintaining user privacy and data security are crucial considerations when applying LLMs to real-world scenarios.

9 CONCLUSION AND FUTURE DIRECTIONS

In this survey, we have reviewed the recent progress of large language models (LLMs), and introduced the key concepts, findings, and techniques for understanding and utilizing LLMs. We focus on the large-sized models (i.e., having a size larger than 10B) while excluding the contents of early pre-trained language models (e.g., BERT and GPT-2) that have been well covered in the existing literature. In particular, our survey has discussed four important aspects of LLMs, i.e., pre-training, adaptation, utilization, and evaluation. For each aspect, we highlight the techniques or findings that are key to the success of LLMs. Furthermore, we also summarize the available resources for developing LLMs and discuss important implementation guidelines for reproducing LLMs. This survey tries to cover the most recent literature about LLMs and provides a good reference resource on this topic for both researchers and engineers.

Next, we summarize the discussions of this survey, and introduce the challenges and future directions for LLMs, in the following aspects.

Basics and Principles. Instead of training on specific task goals, LLMs learn from unsupervised pre-training on large-scale text data. This is quite different from previous multi-task learning approaches, which aim to cover as many training tasks as possible to achieve sufficient generalization. Thus, it is essential to reveal the basic principles or elements that establish the foundation of the abilities of LLMs. Although the basic idea of language models is intuitive, it is still challenging to formally explain why LLMs trained by simple language modeling objectives (e.g., next token prediction) can become capable of solving various real-world tasks. To investigate this problem, a promising approach is to study the capacity learning (or selection) mechanism based on unsupervised pre-training, since the model capacity of LLMs strongly depends on pre-training data. In addition, scaling plays an important role in improving the capacity of LLMs [31, 55, 64], and it is very useful to conduct more theoretical analysis about how the behaviors of large models relate to those of small models, e.g., what behaviors of large models can be inferred from small models and what cannot be predicted. Another research direction is to conduct deeper analysis of model generalization for LLMs, since increasing concerns have been raised about whether LLMs can generalize beyond the knowledge encoded by pre-training data. Furthermore, data contamination has become a severe issue for fairly assessing the performance of LLMs [738], and thus setting an appropriate evaluation protocol will be the basis to investigate and analyze the model capacity of LLMs.

Model Architecture. Due to its scalability and effectiveness, the Transformer has become the de facto architecture for building LLMs. Various strategies have been proposed to improve the performance of this architecture, such as neural network configuration and scalable parallel training (see discussions in Section 4.2.2). However, the Transformer still suffers from high training costs and slow inference rates. More efforts [251, 252] are still needed to develop improved model architectures for large-scale pre-training. In particular, system-level or hardware-level optimization (e.g., FlashAttention [284]) is worth more exploration to improve the efficiency of Transformer architectures. In addition, as an important basic capacity, existing LLMs typically maintain a long context window. For example, the most recent GPT-4 Turbo enables a long context of 128K tokens, and Claude 2.1 also supports inputs of up to 200K tokens. Although many efforts have been made to enhance the long context modeling ability of LLMs [264, 291], the resulting models
still cannot effectively process the information in the context window [299]. To address this issue, specific architecture adaptations or algorithms might be needed to enhance the modeling and utilization of long context information. Another worrying concern is that existing work mostly focuses on training LLMs with decoder-only Transformers. Despite the effectiveness, this severely limits wider, more diverse explorations of alternative model architectures.

Model Training. For pre-training, it is essential to establish a data-centric infrastructure and training procedure for LLM optimization, which can effectively support a systematic process of data collection, data cleaning, data mixture, and data curriculum. Furthermore, it also calls for more flexible mechanisms of hardware support or resource scheduling, so as to better organize and utilize the resources in a computing cluster. In practice, it is very challenging to pre-train capable LLMs, due to the huge compute consumption and the sensitivity to data quality and training tricks [78, 93]. Thus, it becomes particularly important to develop systematic, economical pre-training approaches for optimizing LLMs, e.g., predictable scaling [46] and proxy model training [59]. More training recipes or principles should be investigated and shared to reduce the potential risk of degradation or failure in large-scale model optimization. Although increasingly more model checkpoints and cleaned datasets have been released, there is still a lack of reproducible work on pre-training data preparation (e.g., detailed cleaning strategies) and data scheduling (e.g., data mixture and curriculum). Since it is very costly to pre-train an LLM from scratch, it is important to design suitable mechanisms for continually pre-training or fine-tuning the LLM based on publicly available model checkpoints (e.g., LLaMA [57] and Flan-T5 [69]). For this purpose, a number of technical issues have to be resolved, e.g., catastrophic forgetting and task specialization. Furthermore, it is also useful to develop effective tuning strategies that effectively inject or edit specific knowledge [672], e.g., correcting outdated facts.

Model Utilization. Based on the natural language interface, prompting has become the prominent approach for using LLMs to solve various tasks. By combining task descriptions and demonstration examples into prompts, in-context learning (ICL) endows LLMs with the ability to perform well on new tasks, even outperforming full-data fine-tuned models in some cases. To enhance the ability of complex reasoning, advanced prompting techniques have been proposed, exemplified by the chain-of-thought (CoT) strategy, which incorporates intermediate reasoning steps into prompts. Furthermore, planning is a promising approach for solving complex tasks, which iteratively invokes LLMs by leveraging tool use capacities. Despite these efforts, several basic problems related to prompting are still under-explored: why a good prompt can elicit the correct answer while a bad prompt cannot, how to reveal the working principles of advanced prompting methods (e.g., ICL and CoT) and further improve these existing approaches, and how to efficiently find effective prompts for LLMs on specific tasks. Furthermore, from a practical perspective, it has become a fundamental challenge to reduce the inference cost of LLMs, especially in large-scale deployment. Another popular research direction is retrieval-augmented generation, where retrieved contexts from supporting sources are included in prompts for task solving. It has been shown that retrieval augmentation can extend the knowledge boundary and improve the question answering capacity [461], but it may be limited by the effectiveness of long context utilization by LLMs [299].

Safety and Alignment. Despite their capacities, LLMs are faced with great safety challenges in practical use. As a fundamental issue of their probabilistic modeling nature, LLMs exhibit a tendency to generate hallucinations [638], referring to texts that seem plausible but may be factually incorrect [46]. What is worse, LLMs might be elicited by intentional instructions to produce harmful, biased, or toxic texts for malicious systems, leading to the potential risks of misuse [55, 66]. For a detailed discussion of the safety issues of LLMs (e.g., privacy, overreliance, disinformation, and influence operations), the readers can refer to the GPT-3/4 technical reports [46, 55]. As the major technical approach to averting these issues, alignment methods (e.g., RLHF) [66, 116] have been widely used by leveraging human feedback for developing well-aligned LLMs. However, RLHF heavily relies on high-quality human feedback data from professional labelers, and it is costly and time-consuming to recruit qualified human annotators. Therefore, it is necessary to improve the RLHF framework for reducing the efforts of human labelers and seek a more efficient annotation approach with guaranteed data quality, e.g., LLMs can be employed to assist the labeling work. Furthermore, it is also suggested to develop simplified optimization algorithms for alignment [386, 389], to reduce the training difficulty and instability of RLHF. As another practical approach, red teaming [132, 369] has been adopted for improving the model safety of LLMs, which utilizes collected adversarial prompts to refine the LLMs (i.e., avoiding the attacks from red teaming). In addition, privacy concerns are also important to consider when fine-tuning LLMs with domain-specific data, and thus federated learning [946] can be useful in privacy-restricted scenarios.

Application and Ecosystem. As LLMs have shown strong capacities in solving various tasks, they can be applied in a broad range of real-world applications (i.e., following task-specific natural language instructions). As remarkable progress, ChatGPT has potentially changed the way humans access information, and it has additionally been integrated into the release of New Bing. Generally, in the near future, it can be foreseen that LLMs would have a significant impact on information-seeking techniques, including both search engines and recommender systems. Furthermore, LLMs make it possible to develop more intelligent systems (e.g., autonomous AI agents) to tackle various complex tasks in real-world scenarios. Notably, the Assistants API has been launched by OpenAI (featuring instructions, knowledge, and tool use), enabling rapid development of agent-like assistants within applications. This wave of technical innovation would lead to an ecosystem of LLM-empowered applications (e.g., OpenAI's GPT Store), which has a close connection with human life. Lastly, the rise of LLMs sheds light on the exploration of artificial general intelligence (AGI).

intelligence (AGI). It is promising to develop more smart AI architectures in Figure 9, and add the detailed formulas
systems than ever. However, in this development process, in Table 6.
AI safety should be one of the primary concerns, i.e., making • Update on April 25, 2023: revise some copy errors in
AI lead to good for humanity but not bad [40]. figures and tables.
• Update on April 27, 2023: add efficient tuning in Sec-
tion 5.3.
C ODA • Update on April 28, 2023: revise Section 5.3.
It is not an easy job to write this long survey and update • Update on May 7, 2023: revise Table 1, Table 2, and
its content with timely work. First of all, we would like to some minor points.
sincerely thank the support from the readers and our team • Update on June 29, 2023 (major revision):
members. We work very hard on this survey, and hope that – Section 1: add Figure 1 for the trends of published
it can present a comprehensive, timely reference for LLMs. LLM papers in arXiv;
– Section 2: add Figure 4 for GPT’s evolution and the
Survey Writing. This survey was planned during a discus-
corresponding discussion;
sion meeting held by our research team, and we aimed to
– Section 3: add Figure 5 for LLaMA family and the
summarize the recent advances of large language models
corresponding discussion;
as a highly readable report for our team members. The
– Section 5: add latest discussion about the synthetic
first draft was finished on March 13, 2023, in which our
data formatting of instruction tuning in Section 5.1.1,
team members tried their best to include the related stud-
the empirical analysis for instruction tuning in Sec-
ies about LLMs in a relatively objective, comprehensive
tion 5.1.4, parameter-efficient model adaptation in
way. Then, we have extensively revised the writing and
Section 5.3 and memory-efficient adaptation in Sec-
contents in several passes. Due to the space limit, we can
tion 5.4;
only include a fraction of existing LLMs in Figure 3 and
– Section 6: add latest discussion about the underlying
Table 1, by setting the selection criterion. However, we set
mechanism of ICL 6.2.3, planning for complex task
a more relaxed criterion for model selection on our GitHub
solving in Section 6.4;
page (https://github.com/RUCAIBox/LLMSurvey), which
– Section 7: update Table 14 for representative datasets
will be regularly maintained. We release the initial version
for evaluating advanced abilities of LLMs, and em-
on March 31, 2023, the major revision on June 29, 2023,
pirical ability evaluation in Section 7.4;
and second version on September 10, 2023, and this latest
– Section 6.1.1: add prompt design;
version (major revision) on November 23, 2023.
– Section 8: add the discussions on applications of
Seeking for Advice. Despite all our efforts, this survey LLMs in finance and scientific research domains;
is still far from perfect: we are likely to miss important • Update on September 10, 2023 (major revision):
references or topics, and might also have non-rigorous – Claim the copyrights of the figures and tables in this
expressions or discussions. We will continuously update paper.
this survey, and improve the quality as much as we can. – Add latest LLMs, techniques and their descriptions in
For us, survey writing is also a learning process for LLMs Section 3, Section 4, Section 5, Section 6 and Section 7;
by ourselves. For readers with constructive suggestions to – Section 4: add latest discussion about the decoding
improve this survey, you are welcome to leave comments on strategy in Section 4.2.5;
the GitHub page of our survey or directly email our authors. – Section 5: add latest discussion about the practical
We will make revisions following the received comments tricks for instruction tuning in Section 5.1.2, the
or suggestions in a future version, and acknowledge the empirical analysis on LLaMA (13B) for instruction
readers who have contributed constructive suggestions in tuning in Section 5.1.4, practical strategies for RLHF
our survey. in Section 5.2.3, alignment without RLHF in Sec-
Update log. In this part, we regularly maintain an update tion 5.2.4 and remarks on SFT and RLHF in Sec-
log for the submissions of this survey to arXiv: tion 5.2.5;
– Section 6: update the content about the planning for
• First release on March 31, 2023: the initial version.
complex task solving in Section 6.4;
• Update on April 9, 2023: add the affiliation information,
– Section 7: add discussions about evaluation ap-
revise Figure 3 and Table 1 and clarify the correspond-
proaches in Section 7.3.2, Table 15 for the category
ing selection criterion for LLMs, improve the writing,
of existing evaluation work, and update empirical
and correct some minor errors.
ability evaluation in Section 7.4 and the results on
• Update on April 11, 2023: correct the errors for library
Table 16;
resources.
– Section 6.1.1: add new prompt examples in Table 12;
• Update on April 12, 2023: revise Figure 3 and Table 1,
and clarify the release date of LLMs. • Update on November 23, 2023 (this version):
• Update on April 16, 2023: add a new Section 2.2 about – Section 1: add Figure 2 for the evolution process of
the technical evolution of GPT-series models. four generations of language models;
• Update on April 24, 2023: add the discussion about – Section 2: add more discussion about scaling laws
scaling laws and add some explanations about the and how emergent abilities relate to scaling laws;
model sizes for emergent abilities (Section 2.1); add an – Section 3: add latest LLMs in Figure 3 and Table 1,
illustrative figure for the attention patterns for different latest APIs in Section 3.1, commonly used datasets
  – Section 4: add the latest discussion about data scheduling, including data mixtures and data curriculum, in Section 4.1.3; add a summary of data preparation in Section 4.1.4; add discussion about modeling long context in Section 4.2.4; add discussion about decoding efficiency issues and add the latest decoding strategies in Section 4.2.5;
  – Section 5: add the latest discussion about instance construction and tuning strategies in Section 5.1; add the latest discussion about process-supervised RLHF in Section 5.2.3, and the empirical study on quantized LLaMA models (7B and 13B) in Section 5.4.3;
  – Section 6: add the latest discussion about prompt optimization in Section 6.1.2, and update the content about chain-of-thought prompting in Section 6.3;
  – Section 8: add the latest discussion about LLM for research directions in Section 8.1;
  – Section 9: revise the content in several aspects.

Planning Content. We will regularly include new content into this survey, to make it more self-contained and up-to-date. Here, we list several potential topics that might appear in the next major version(s): (1) more experiments with larger language models for both instruction tuning and ability evaluation; (2) more detailed prompting practice; (3) training recipes; (4) more theoretical analysis and discussion; (5) more discussions on applications.

Clarifications on Experiments. In this version, we have included a number of experiments on instruction tuning (Table 9), overall ability evaluation (Table 16), and prompt engineering (Table 17). Due to the limit of computational resources, our experiments are not complete, limited to small-sized models or a few comparisons. Despite that, we feel that it might be meaningful to share the partial results with the public. We will try to include the missing results of larger models or more comparisons in future versions. We also call for support of computing power for conducting more comprehensive experiments.

Chinese Version. We also provide a translated Chinese version (corresponding to the first release) of this survey paper at the link: https://github.com/RUCAIBox/LLMSurvey/blob/main/assets/LLM Survey Chinese.pdf. Four volunteers contributed to checking and revising the content, and they are Yiwen Hu, Xin Deng, Xinming Hou, Yanbin Yin, and Zhanshuo Cao (in order of contribution). We will also continuously update the Chinese version, but it may not be as timely as the latest English version.

ACKNOWLEDGMENTS

The authors would like to thank Yankai Lin and Yutao Zhu for proofreading this paper. Since the first release of this paper, we have received a number of valuable comments from the readers. We sincerely thank the readers who have written to us with constructive suggestions and comments: Gaoyan Ou, Todd Morrill, Hao Liu, Zhenyu Zhang, and Xinlin Zhuang.

Since the v11 version (June 29, 2023), we have been adding a large number of experiments and prompt practices. These new contents are completed by a number of volunteers in our team. Here, we add a special part to thank all the students who have worked very hard on this part (also including the ones on our author list).

Contribution on Experiments. We would like to sincerely thank the following people for their hard work involved in the experiments shown in Table 16.
• Xiaoxue Cheng: implement the experiments for evaluation on Language Generation and HaluEval tasks.
• Yuhao Wang: implement the experiments for evaluation on interaction with environment tasks.
• Bowen Zheng: implement the experiments for evaluation on tool manipulation tasks.

Contribution on Tips. We list the following contributors with the numbers of the tips for designing prompts in Table 12 that they provided.
• Xiaolei Wang: T3, O3
• Beichen Zhang: D2, D5
• Zhipeng Chen: D3, D4
• Junjie Zhang: D6
• Bowen Zheng: D7
• Zican Dong: D8
• Xinyu Tang: C2
• Yifan Du: T4
• Tianyi Tang: O6, O7, D9
• Yupeng Hou: O8, C3
• Salvatore Raieli: C4

REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137-1155, 2003.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493-2537, 2011.
[3] S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014.
[4] M. D. Hauser, N. Chomsky, and W. T. Fitch, "The faculty of language: what is it, who has it, and how did it evolve?" Science, vol. 298, no. 5598, pp. 1569-1579, 2002.
[5] A. M. Turing, "Computing machinery and intelligence," Mind, vol. LIX, no. 236, pp. 433-460, 1950.
[6] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[7] J. Gao and C. Lin, "Introduction to the special issue on statistical language modeling," ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87-93, 2004.
Tyler Suard, Damai Dai, Liang Ding, Stella Biderman, [8] R. Rosenfeld, “Two decades of statistical language
Kevin Gray, Jay Alammar, Yubo Feng, Mark Holmstrom, modeling: Where do we go from here?” Proceedings
Xingdong Liu, Il-Seok Oh, Yiting Liu, Shaojun Wang, of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
learners,” in Proceedings of the 2022 Conference on Em- ciation for Computational Linguistics, ACL 2016, August
pirical Methods in Natural Language Processing, EMNLP 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The
2022, Abu Dhabi, United Arab Emirates, December 7-11, Association for Computer Linguistics, 2016.
2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. [223] M. Schuster and K. Nakajima, “Japanese and korean
Association for Computational Linguistics, 2022, pp. voice search,” in 2012 IEEE international conference on
1384–1403. acoustics, speech and signal processing (ICASSP). IEEE,
[212] S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts, 2012, pp. 5149–5152.
B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno et al., [224] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
“A pretrainer’s guide to training data: Measuring W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey,
the effects of data age, domain coverage, quality, & J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser,
toxicity,” arXiv preprint arXiv:2305.13169, 2023. S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens,
[213] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith,
D. Gao, Y. Xie, Z. Liu, J. Gao, Y. Li, B. Ding, and J. Riesa, A. Rudnick, O. Vinyals, G. Corrado,
J. Zhou, “Data-juicer: A one-stop data processing sys- M. Hughes, and J. Dean, “Google’s neural machine
tem for large language models,” 2023. translation system: Bridging the gap between human
[214] D. Hernandez, T. B. Brown, T. Conerly, N. DasSarma, and machine translation,” CoRR, vol. abs/1609.08144,
D. Drain, S. E. Showk, N. Elhage, Z. Hatfield-Dodds, 2016.
T. Henighan, T. Hume, S. Johnston, B. Mann, C. Olah, [225] T. Kudo, “Subword regularization: Improving neural
C. Olsson, D. Amodei, N. Joseph, J. Kaplan, and S. Mc- network translation models with multiple subword
Candlish, “Scaling laws and interpretability of learn- candidates,” in Proceedings of the 56th Annual Meeting
ing from repeated data,” CoRR, vol. abs/2205.10487, of the Association for Computational Linguistics, ACL
2022. 2018, Melbourne, Australia, July 15-20, 2018, Volume 1:
[215] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, Long Papers, I. Gurevych and Y. Miyao, Eds. Associ-
“The curious case of neural text degeneration,” in ation for Computational Linguistics, 2018, pp. 66–75.
8th International Conference on Learning Representations, [226] T. Kudo and J. Richardson, “Sentencepiece: A simple
ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. and language independent subword tokenizer and
OpenReview.net, 2020. detokenizer for neural text processing,” in Proceedings
[216] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, of the 2018 Conference on Empirical Methods in Natural
C. Callison-Burch, and N. Carlini, “Deduplicating Language Processing, EMNLP 2018: System Demonstra-
training data makes language models better,” in Pro- tions, Brussels, Belgium, October 31 - November 4, 2018,
ceedings of the 60th Annual Meeting of the Association for E. Blanco and W. Lu, Eds. Association for Computa-
Computational Linguistics (Volume 1: Long Papers), ACL tional Linguistics, 2018.
2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 8424– [227] M. Davis and M. Dürst, “Unicode normalization
8445. forms,” 2001.
[217] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, [228] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak,
96

and I. Sutskever, “Deep double descent: Where bigger 4012.


models and more data hurt,” in 8th International Con- [242] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov,
ference on Learning Representations, ICLR 2020, Addis “Bag of tricks for efficient text classification,” in EACL,
Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2017, pp. 427–431.
2020. [243] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge,
[229] K. Tirumala, D. Simig, A. Aghajanyan, and A. S. D. Gao, Y. Xie, Z. Liu, J. Gao et al., “Data-juicer: A
Morcos, “D4: Improving llm pretraining via document one-stop data processing system for large language
de-duplication and diversification,” arXiv preprint models,” arXiv preprint arXiv:2309.02033, 2023.
arXiv:2308.12284, 2023. [244] B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia,
[230] Z. Shen, T. Tao, L. Ma, W. Neiswanger, J. Hestness, J. Shen, and O. Firat, “Examining scaling and transfer
N. Vassilieva, D. Soboleva, and E. Xing, “Slimpajama- of language model architectures for machine transla-
dc: Understanding data combinations for llm train- tion,” in International Conference on Machine Learning,
ing,” arXiv preprint arXiv:2309.10818, 2023. ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA,
[231] S. M. Xie, S. Santurkar, T. Ma, and P. Liang, “Data 2022, pp. 26 176–26 192.
selection for language models via importance resam- [245] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang,
pling,” arXiv preprint arXiv:2302.03169, 2023. J. Gao, M. Zhou, and H. Hon, “Unified language
[232] X. Wang, W. Zhou, Q. Zhang, J. Zhou, S. Gao, model pre-training for natural language understand-
J. Wang, M. Zhang, X. Gao, Y. Chen, and T. Gui, ing and generation,” in Advances in Neural Informa-
“Farewell to aimless large-scale pretraining: Influ- tion Processing Systems 32: Annual Conference on Neu-
ential subset selection for language model,” arXiv ral Information Processing Systems 2019, NeurIPS 2019,
preprint arXiv:2305.12816, 2023. December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
[233] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. 13 042–13 054.
Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, [246] A. Clark, D. de Las Casas, A. Guy, A. Mensch,
and R. Fernández, “The LAMBADA dataset: Word M. Paganini, J. Hoffmann, B. Damoc, B. A. Hecht-
prediction requiring a broad discourse context,” in man, T. Cai, S. Borgeaud, G. van den Driessche,
ACL (1). The Association for Computer Linguistics, E. Rutherford, T. Hennigan, M. J. Johnson, A. Cassirer,
2016. C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osin-
[234] M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang, dero, O. Vinyals, M. Ranzato, J. W. Rae, E. Elsen,
F. Sala, and C. Ré, “Skill-it! a data-driven skills frame- K. Kavukcuoglu, and K. Simonyan, “Unified scaling
work for understanding and training language mod- laws for routed language models,” in International
els,” arXiv preprint arXiv:2307.14430, 2023. Conference on Machine Learning, ICML 2022, 17-23 July
[235] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, 2022, Baltimore, Maryland, USA, 2022, pp. 4057–4086.
I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, [247] A. Gu, K. Goel, and C. Ré, “Efficiently modeling
J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, long sequences with structured state spaces,”
M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong, in The Tenth International Conference on Learning
A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Mar- Representations, ICLR 2022, Virtual Event, April 25-29,
tin, N. Usunier, T. Scialom, and G. Synnaeve, “Code 2022. OpenReview.net, 2022. [Online]. Available:
llama: Open foundation models for code,” CoRR, vol. https://openreview.net/forum?id=uYLFoz1vlAC
abs/2308.12950, 2023. [248] H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur,
[236] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Long range language modeling via gated state
“Curriculum learning,” in ICML, 2009, pp. 41–48. spaces,” CoRR, vol. abs/2206.13947, 2022. [Online].
[237] C. Xu, C. Rosset, L. Del Corro, S. Mahajan, J. McAuley, Available: https://doi.org/10.48550/arXiv.2206.13947
J. Neville, A. H. Awadallah, and N. Rao, “Contrastive [249] T. Dao, D. Y. Fu, K. K. Saab, A. W. Thomas, A. Rudra,
post-training large language models on data curricu- and C. Ré, “Hungry hungry hippos: Towards
lum,” arXiv preprint arXiv:2310.02263, 2023. language modeling with state space models,”
[238] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, CoRR, vol. abs/2212.14052, 2022. [Online]. Available:
H. Michalewski, and P. Milos, “Focused transformer: https://doi.org/10.48550/arXiv.2212.14052
Contrastive training for context scaling,” CoRR, vol. [250] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao,
abs/2307.03170, 2023. S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena hi-
[239] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, erarchy: Towards larger convolutional language mod-
S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and els,” in ICML, 2023.
S. Welleck, “Llemma: An open language model for [251] B. Peng, E. Alcaide, Q. Anthony, A. Albalak,
mathematics,” arXiv preprint arXiv:2310.10631, 2023. S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella,
[240] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extending K. K. G. V., X. He, H. Hou, P. Kazienko, J. Kocon,
context window of large language models via posi- J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom,
tional interpolation,” CoRR, vol. abs/2306.15595, 2023. A. Saito, X. Tang, B. Wang, J. S. Wind, S. Wozniak,
[241] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud- R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, J. Zhu, and
hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet: R. Zhu, “RWKV: reinventing rnns for the transformer
Extracting high quality monolingual datasets from era,” CoRR, vol. abs/2305.13048, 2023. [Online].
web crawl data,” in Proceedings of the Twelfth Language Available: https://doi.org/10.48550/arXiv.2305.13048
Resources and Evaluation Conference, 2020, pp. 4003– [252] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue,
97

J. Wang, and F. Wei, “Retentive network: A succes- v37/ioffe15.html


sor to transformer for large language models,” arXiv [266] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry,
preprint arXiv:2307.08621, 2023. M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan,
[253] J. T. Smith, A. Warrington, and S. Linderman, “Sim- Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts,
plified state space layers for sequence modeling,” in and C. Raffel, “Do transformer modifications transfer
ICLR, 2023. across implementations and applications?” in Proceed-
[254] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gul- ings of the 2021 Conference on Empirical Methods in Nat-
cehre, R. Pascanu, and S. De, “Resurrecting recurrent ural Language Processing, EMNLP 2021, Virtual Event /
neural networks for long sequences,” in ICML, 2023. Punta Cana, Dominican Republic, 7-11 November, 2021,
[255] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, 2021, pp. 5758–5773.
D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, [267] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing,
“Cogview: Mastering text-to-image generation via H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer nor-
transformers,” in Advances in Neural Information Pro- malization in the transformer architecture,” in ICML,
cessing Systems 34: Annual Conference on Neural Infor- 2020.
mation Processing Systems 2021, NeurIPS 2021, December [268] A. Baevski and M. Auli, “Adaptive input represen-
6-14, 2021, virtual, 2021, pp. 19 822–19 835. tations for neural language modeling,” in 7th Inter-
[256] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal- national Conference on Learning Representations, ICLR
ization,” vol. abs/1607.06450, 2016. 2019, New Orleans, LA, USA, May 6-9, 2019. Open-
[257] B. Zhang and R. Sennrich, “Root mean square layer Review.net, 2019.
normalization,” in Advances in Neural Information Pro- [269] L. Liu, X. Liu, J. Gao, W. Chen, and J. Han, “Under-
cessing Systems 32: Annual Conference on Neural Infor- standing the difficulty of training transformers,” in
mation Processing Systems 2019, NeurIPS 2019, December Proceedings of the 2020 Conference on Empirical Methods
8-14, 2019, Vancouver, BC, Canada, 2019, pp. 12 360– in Natural Language Processing, EMNLP 2020, Online,
12 371. November 16-20, 2020. Association for Computational
[258] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and Linguistics, 2020, pp. 5747–5763.
F. Wei, “Deepnet: Scaling transformers to 1, 000 lay- [270] D. Hendrycks and K. Gimpel, “Gaussian error linear
ers,” vol. abs/2203.00555, 2022. units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[259] V. Nair and G. E. Hinton, “Rectified linear units im- [271] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier,
prove restricted boltzmann machines,” in Proceedings “Language modeling with gated convolutional net-
of the 27th international conference on machine learning works,” in Proceedings of the 34th International Confer-
(ICML-10), 2010, pp. 807–814. ence on Machine Learning, ICML 2017, Sydney, NSW,
[260] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, Australia, 6-11 August 2017, 2017, pp. 933–941.
and S. R. Bowman, “GLUE: A multi-task bench- [272] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S. Bari,
mark and analysis platform for natural language un- S. Biderman, H. Elsahar, N. Muennighoff, J. Phang,
derstanding,” in Proceedings of the Workshop: Analyz- O. Press, C. Raffel, V. Sanh, S. Shen, L. Sutawika, J. Tae,
ing and Interpreting Neural Networks for NLP, Black- Z. X. Yong, J. Launay, and I. Beltagy, “What language
boxNLP@EMNLP 2018, Brussels, Belgium, November 1, model to train if you have one million GPU hours?” in
2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds. Findings of the Association for Computational Linguistics:
Association for Computational Linguistics, 2018, pp. EMNLP 2022, Abu Dhabi, United Arab Emirates, Decem-
353–355. ber 7-11, 2022, 2022, pp. 765–782.
[261] P. Ramachandran, B. Zoph, and Q. V. Le, [273] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-
“Searching for activation functions,” arXiv preprint attention with relative position representations,”
arXiv:1710.05941, 2017. in Proceedings of the 2018 Conference of the North
[262] N. Shazeer, “GLU variants improve transformer,” vol. American Chapter of the Association for Computational
abs/2002.05202, 2020. Linguistics: Human Language Technologies, NAACL-
[263] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: En- HLT, New Orleans, Louisiana, USA, June 1-6, 2018,
hanced transformer with rotary position embedding,” Volume 2 (Short Papers), M. A. Walker, H. Ji,
vol. abs/2104.09864, 2021. and A. Stent, Eds. Association for Computational
[264] O. Press, N. A. Smith, and M. Lewis, “Train short, test Linguistics, 2018, pp. 464–468. [Online]. Available:
long: Attention with linear biases enables input length https://doi.org/10.18653/v1/n18-2074
extrapolation,” in The Tenth International Conference on [274] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell,
Learning Representations, ICLR 2022, Virtual Event, April Q. V. Le, and R. Salakhutdinov, “Transformer-xl:
25-29, 2022, 2022. Attentive language models beyond a fixed-length
[265] S. Ioffe and C. Szegedy, “Batch normalization: context,” in Proceedings of the 57th Conference of
Accelerating deep network training by reducing the Association for Computational Linguistics, ACL
internal covariate shift,” in Proceedings of the 32nd 2019, Florence, Italy, July 28- August 2, 2019, Volume
International Conference on Machine Learning, ICML 1: Long Papers, A. Korhonen, D. R. Traum, and
2015, Lille, France, 6-11 July 2015, ser. JMLR Workshop L. Màrquez, Eds. Association for Computational
and Conference Proceedings, F. R. Bach and D. M. Linguistics, 2019, pp. 2978–2988. [Online]. Available:
Blei, Eds., vol. 37. JMLR.org, 2015, pp. 448–456. https://doi.org/10.18653/v1/p19-1285
[Online]. Available: http://proceedings.mlr.press/ [275] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdi-
98

nov, and Q. V. Le, “Xlnet: Generalized autoregressive [291] J. Su. (2023) Transformer upgrade path: 12, infinite
pretraining for language understanding,” Advances in extrapolation of rerope?
neural information processing systems, vol. 32, 2019. [292] X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin,
[276] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn: “Scaling laws of rope-based extrapolation,” CoRR, vol.
Efficient context window extension of large language abs/2310.05209, 2023.
models,” CoRR, vol. abs/2309.00071, 2023. [293] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sun-
[277] Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang, dararajan, and S. Naidu, “Giraffe: Adventures in
A. Benhaim, V. Chaudhary, X. Song, and F. Wei, expanding context lengths in llms,” CoRR, vol.
“A length-extrapolatable transformer,” CoRR, vol. abs/2308.10882, 2023.
abs/2212.10554, 2022. [Online]. Available: https: [294] G. Izacard and E. Grave, “Leveraging passage re-
//doi.org/10.48550/arXiv.2212.10554 trieval with generative models for open domain ques-
[278] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. tion answering,” in Proceedings of the 16th Conference of
Smith, and L. Kong, “Random feature attention,” in the European Chapter of the Association for Computational
9th International Conference on Learning Representations, Linguistics: Main Volume, EACL 2021, Online, April 19 -
ICLR 2021, Virtual Event, Austria, May 3-7, 2021. 23, 2021. Association for Computational Linguistics,
[279] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, 2021, pp. 874–880.
C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, [295] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar,
L. Yang, and A. Ahmed, “Big bird: Transformers for O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown,
longer sequences,” in Advances in Neural Information and Y. Shoham, “Parallel context windows for large
Processing Systems 33: Annual Conference on Neural language models,” in Proceedings of the 61st Annual
Information Processing Systems 2020, NeurIPS 2020, De- Meeting of the Association for Computational Linguistics
cember 6-12, 2020, virtual, 2020. (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July
[280] R. Child, S. Gray, A. Radford, and I. Sutskever, “Gen- 9-14, 2023. Association for Computational Linguis-
erating long sequences with sparse transformers,” tics, 2023, pp. 6383–6402.
CoRR, vol. abs/1904.10509, 2019. [296] Y. Hao, Y. Sun, L. Dong, Z. Han, Y. Gu, and F. Wei,
[281] N. Shazeer, “Fast transformer decoding: One write- “Structured prompting: Scaling in-context learning to
head is all you need,” CoRR, vol. abs/1911.02150, 1, 000 examples,” CoRR, 2022.
2019. [Online]. Available: http://arxiv.org/abs/1911. [297] I. Beltagy, M. E. Peters, and A. Cohan, “Long-
02150 former: The long-document transformer,” CoRR, vol.
[282] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, abs/2004.05150, 2020.
F. Lebrón, and S. Sanghai, “Gqa: Training general- [298] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis,
ized multi-query transformer models from multi-head “Efficient streaming language models with attention
checkpoints,” arXiv preprint arXiv:2305.13245, 2023. sinks,” CoRR, vol. abs/2309.17453, 2023.
[283] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re, [299] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilac-
“Flashattention: Fast and memory-efficient exact at- qua, F. Petroni, and P. Liang, “Lost in the middle:
tention with IO-awareness,” in NeurIPS, 2022. How language models use long contexts,” CoRR, vol.
[284] T. Dao, “Flashattention-2: Faster attention with bet- abs/2307.03172, 2023.
ter parallelism and work partitioning,” arXiv preprint [300] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and
arXiv:2307.08691, 2023. S. Wang, “Lm-infinite: Simple on-the-fly length gen-
[285] “vllm: Easy, fast, and cheap llm serving with eralization for large language models,” CoRR, vol.
pagedattention.” [Online]. Available: https://vllm.ai/ abs/2308.16137, 2023.
[286] A. Yuan, A. Coenen, E. Reif, and D. Ippolito, “Word- [301] A. Bertsch, U. Alon, G. Neubig, and M. R. Gormley,
craft: story writing with large language models,” in “Unlimiformer: Long-range transformers with unlim-
27th International Conference on Intelligent User Inter- ited length input,” CoRR, vol. abs/2305.01625, 2023.
faces, 2022, pp. 841–852. [302] Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy,
[287] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das, “Memorizing transformers,” in The Tenth International
and S. Reddy, “The impact of positional encoding Conference on Learning Representations, ICLR 2022, Vir-
on length generalization in transformers,” CoRR, vol. tual Event, April 25-29, 2022. OpenReview.net, 2022.
abs/2305.19466, 2023. [303] H. Chen, R. Pasunuru, J. Weston, and A. Celiky-
[288] W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, ilmaz, “Walking down the memory maze: Beyond
R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, context limit through interactive reading,” CoRR, vol.
B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang, abs/2310.05029, 2023.
K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis, [304] W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou,
S. Wang, and H. Ma, “Effective long-context scaling of R. Cotterell, and M. Sachan, “Recurrentgpt: Interac-
foundation models,” CoRR, vol. abs/2309.16039, 2023. tive generation of (arbitrarily) long text,” CoRR, vol.
[289] kaiokendev, “Things I’m learning while training su- abs/2305.13304, 2023.
perhot.” 2023. [305] C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and
[290] Z. Dong, T. Tang, J. Li, W. X. Zhao, and J. Wen, J. E. Gonzalez, “Memgpt: Towards llms as operating
“BAMBOO: A comprehensive benchmark for evalu- systems,” CoRR, vol. abs/2310.08560, 2023.
ating long text modeling capacities of large language [306] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu,
models,” CoRR, vol. abs/2309.13345, 2023. S. Subramanian, E. Bakhturina, M. Shoeybi, and
99

B. Catanzaro, “Retrieval meets long context large lan- 202. PMLR, 2023, pp. 31 094–31 116.
guage models,” CoRR, vol. abs/2310.03025, 2023. [322] T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash-
[307] K. Murray and D. Chiang, “Correcting length bias in decoding for long-context inference,” https://crfm.
neural machine translation,” in WMT. Association stanford.edu/2023/10/12/flashdecoding.html, 2023.
for Computational Linguistics, 2018, pp. 212–223. [323] Y. Leviathan, M. Kalman, and Y. Matias, “Fast infer-
[308] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, ence from transformers via speculative decoding,” in
“The curious case of neural text degeneration,” in International Conference on Machine Learning, 2023.
ICLR, 2020. [324] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre,
[309] C.-M. U. P. P. D. O. C. SCIENCE, Speech Understanding and J. Jumper, “Accelerating large language model
Systems. Summary of Results of the Five-Year Research decoding with speculative sampling,” CoRR, vol.
Effort at Carnegie-Mellon University, 1977. abs/2302.01318, 2023.
[310] P. Koehn and R. Knowles, “Six challenges for neural [325] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang,
machine translation,” in NMT@ACL. Association for R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar,
Computational Linguistics, 2017, pp. 28–39. and Z. Jia, “Specinfer: Accelerating generative LLM
[311] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, serving with speculative inference and token tree ver-
W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, ification,” CoRR, vol. abs/2305.09781, 2023.
J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, [326] B. Spector and C. Ré, “Accelerating LLM infer-
S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, ence with staged speculative decoding,” CoRR, vol.
G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, abs/2308.04623, 2023.
J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, [327] L. D. Corro, A. D. Giorno, S. Agarwal, B. Yu, A. H.
M. Hughes, and J. Dean, “Google’s neural machine Awadallah, and S. Mukherjee, “Skipdecode: Autore-
translation system: Bridging the gap between human gressive skip decoding with batching and caching for
and machine translation,” CoRR, vol. abs/1609.08144, efficient LLM inference,” CoRR, vol. abs/2307.02628,
2016. 2023.
[312] R. Paulus, C. Xiong, and R. Socher, “A deep rein- [328] D. P. Kingma and J. Ba, “Adam: A method for
forced model for abstractive summarization,” in ICLR stochastic optimization,” in 3rd International Confer-
(Poster). OpenReview.net, 2018. ence on Learning Representations, ICLR 2015, San Diego,
[313] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, CA, USA, May 7-9, 2015, Conference Track Proceedings,
Q. Sun, S. Lee, D. J. Crandall, and D. Batra, “Diverse Y. Bengio and Y. LeCun, Eds., 2015.
beam search: Decoding diverse solutions from neural [329] I. Loshchilov and F. Hutter, “Fixing weight decay
sequence models,” CoRR, vol. abs/1610.02424, 2016. regularization in adam,” CoRR, vol. abs/1711.05101,
[314] A. Fan, M. Lewis, and Y. N. Dauphin, “Hierarchical 2017.
neural story generation,” in ACL (1). Association for [330] N. Shazeer and M. Stern, “Adafactor: Adaptive learn-
Computational Linguistics, 2018, pp. 889–898. ing rates with sublinear memory cost,” in Proceedings
[315] J. Hewitt, C. D. Manning, and P. Liang, “Trunca- of the 35th International Conference on Machine Learning,
tion sampling as language model desmoothing,” in ICML 2018, Stockholmsmässan, Stockholm, Sweden, July
EMNLP (Findings). Association for Computational 10-15, 2018, ser. Proceedings of Machine Learning
Linguistics, 2022, pp. 3414–3427. Research, J. G. Dy and A. Krause, Eds., vol. 80. PMLR,
[316] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and 2018, pp. 4603–4611.
N. Collier, “A contrastive framework for neural text [331] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen,
generation,” in NeurIPS, 2022. M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and
[317] C. Meister, T. Pimentel, G. Wiher, and R. Cotterell, Z. Chen, “Gpipe: Efficient training of giant neural
“Locally typical sampling,” Trans. Assoc. Comput. Lin- networks using pipeline parallelism,” in Advances in
guistics, 2023. Neural Information Processing Systems 32: Annual Con-
[318] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, ference on Neural Information Processing Systems 2019,
T. Hashimoto, L. Zettlemoyer, and M. Lewis, “Con- NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
trastive decoding: Open-ended text generation as op- Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer,
timization,” in ACL (1). Association for Computa- F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019,
tional Linguistics, 2023, pp. 12 286–12 312. pp. 103–112.
[319] Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and [332] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri,
P. He, “Dola: Decoding by contrasting layers im- N. R. Devanur, G. R. Ganger, and P. B. Gibbons,
proves factuality in large language models,” CoRR, “Pipedream: Fast and efficient pipeline parallel DNN
vol. abs/2309.03883, 2023. training,” CoRR, vol. abs/1806.03377, 2018.
[320] L. Chen, “Dissecting batching effects in gpt inference,” [333] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He,
2023. [Online]. Available: https://le.qun.ch/en/blog/ “Zero: memory optimizations toward training trillion
2023/05/13/transformer-batching/ parameter models,” in Proceedings of the International
[321] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, Conference for High Performance Computing, Networking,
B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, Storage and Analysis, SC 2020, Virtual Event / Atlanta,
“Flexgen: High-throughput generative inference of Georgia, USA, November 9-19, 2020, C. Cuicchi, I. Qual-
large language models with a single GPU,” in ICML, ters, and W. T. Kramer, Eds. IEEE/ACM, 2020, p. 20.
ser. Proceedings of Machine Learning Research, vol. [334] P. Micikevicius, S. Narang, J. Alben, G. F. Di-
100

amos, E. Elsen, D. Garcı́a, B. Ginsburg, M. Houston, alignment,” arXiv preprint arXiv:2305.11206, 2023.
O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed preci- [350] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Ya-
sion training,” CoRR, vol. abs/1710.03740, 2017. dav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and
[335] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient 2d H. Jin, “Alpagasus: Training A better alpaca with
method for training super-large deep learning mod- fewer data,” CoRR, vol. abs/2307.08701, 2023.
els,” CoRR, vol. abs/2104.05343, 2021. [351] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal,
[336] B. Wang, Q. Xu, Z. Bian, and Y. You, “Tesseract: H. Palangi, and A. H. Awadallah, “Orca: Progressive
Parallelize the tensor parallelism efficiently,” in Pro- learning from complex explanation traces of GPT-4,”
ceedings of the 51st International Conference on Parallel CoRR, vol. abs/2306.02707, 2023.
Processing, ICPP 2022, Bordeaux, France, 29 August 2022 [352] YuLan-Chat-Team, “Yulan-chat: An open-source
- 1 September 2022. ACM, 2022. bilingual chatbot,” https://github.com/RUC-GSAI/
[337] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing YuLan-Chat, 2023.
parallelism in distributed training for huge neural [353] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R.
networks,” CoRR, vol. abs/2105.14450, 2021. Chandu, D. Wadden, K. MacMillan, N. A. Smith,
[338] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, “Sequence I. Beltagy, and H. Hajishirzi, “How far can camels
parallelism: Long sequence training from system per- go? exploring the state of instruction tuning on open
spective,” arXiv e-prints, pp. arXiv–2105, 2021. resources,” CoRR, vol. abs/2306.04751, 2023.
[339] FairScale authors, “Fairscale: A general purpose [354] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruc-
modular pytorch library for high performance tion tuning with GPT-4,” CoRR, vol. abs/2304.03277,
and large scale training,” https://github.com/ 2023.
facebookresearch/fairscale, 2021. [355] M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgib-
[340] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, bon, “Efficient sequence packing without cross-
Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing et al., contamination: Accelerating large language mod-
“Alpa: Automating inter-and {Intra-Operator} paral- els without impacting performance,” arXiv preprint
lelism for distributed deep learning,” in OSDI, 2022, arXiv:2107.02027, 2021.
pp. 559–578. [356] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei,
[341] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,
deep nets with sublinear memory cost,” CoRR, vol. S. Pfohl et al., “Large language models encode clinical
abs/1604.06174, 2016. knowledge,” arXiv preprint arXiv:2212.13138, 2022.
[342] R. Lou, K. Zhang, and W. Yin, “Is prompt all you [357] J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and
need? no. A comprehensive and broader view of in- J. Wen, “Recommendation as instruction following:
struction learning,” CoRR, vol. abs/2303.10475, 2023. A large language model empowered recommendation
[343] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep approach,” CoRR, vol. abs/2305.07001, 2023.
neural networks for natural language understand- [358] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin,
ing,” in ACL (1). Association for Computational and T. Liu, “Huatuo: Tuning llama model with chinese
Linguistics, 2019, pp. 4487–4496. medical knowledge,” arXiv preprint arXiv:2304.06975,
[344] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, 2023.
L. Zettlemoyer, and S. Gupta, “Muppet: Massive [359] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen,
multi-task representations with pre-finetuning,” in Z. Wu, and Y. Feng, “Lawyer llama technical report,”
EMNLP (1). Association for Computational Linguis- arXiv preprint arXiv:2305.15062, 2023.
tics, 2021, pp. 5799–5811. [360] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
[345] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, S. Gehrmann, P. Kambadur, D. Rosenberg, and
Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and G. Mann, “Bloomberggpt: A large language model for
A. Roberts, “The flan collection: Designing data and finance,” arXiv preprint arXiv:2303.17564, 2023.
methods for effective instruction tuning,” CoRR, vol. [361] T. Liu and B. K. H. Low, “Goat: Fine-tuned llama
abs/2301.13688, 2023. outperforms gpt-4 on arithmetic tasks,” arXiv preprint
[346] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, arXiv:2305.14201, 2023.
C. Tao, and D. Jiang, “Wizardlm: Empowering large [362] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan, X. Liu,
language models to follow complex instructions,” Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng, Z. Zhou,
CoRR, vol. abs/2304.12244, 2023. [Online]. Available: R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu, Z. Yin,
https://doi.org/10.48550/arXiv.2304.12244 X. Huang, and X. Qiu, “Moss: Training conversational
[347] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, language models from synthetic data,” 2023.
Y. Yang, and C. Gan, “Principle-driven self-alignment [363] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani,
of language models from scratch with minimal human J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto,
supervision,” arXiv preprint arXiv:2305.03047, 2023. “Alpacafarm: A simulation framework for methods
[348] X. Li, P. Yu, C. Zhou, T. Schick, L. Zettle- that learn from human feedback,” CoRR, vol.
moyer, O. Levy, J. Weston, and M. Lewis, “Self- abs/2305.14387, 2023. [Online]. Available: https:
alignment with instruction backtranslation,” CoRR, //doi.org/10.48550/arXiv.2305.14387
vol. abs/2308.06259, 2023. [364] D. Hendrycks, C. Burns, S. Basart, A. Zou,
[349] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, M. Mazeika, D. Song, and J. Steinhardt, “Measuring
A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for massive multitask language understanding,” in ICLR.
101

OpenReview.net, 2021. ratory for alignment,” arXiv preprint arXiv:2112.00861,


[365] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, 2021.
H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, [375] R. Zheng, S. Dou, S. Gao, W. Shen, B. Wang, Y. Liu,
D. Zhou, and J. Wei, “Challenging big-bench tasks and S. Jin, Q. Liu, L. Xiong, L. Chen et al., “Secrets of rlhf
whether chain-of-thought can solve them,” CoRR, vol. in large language models part i: Ppo,” arXiv preprint
abs/2210.09261, 2022. arXiv:2307.04964, 2023.
[366] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Miku- [376] J. Uesato, N. Kushman, R. Kumar, H. F. Song, N. Y.
lik, and G. Irving, “Alignment of language agents,” Siegel, L. Wang, A. Creswell, G. Irving, and I. Hig-
CoRR, vol. abs/2103.14659, 2021. gins, “Solving math word problems with process- and
[367] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Rad- outcome-based feedback,” CoRR, vol. abs/2211.14275,
ford, D. Amodei, P. F. Christiano, and G. Irving, “Fine- 2022.
tuning language models from human preferences,” [377] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
CoRR, vol. abs/1909.08593, 2019. B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
[368] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- abs/2305.20050, 2023.
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, [378] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika,
J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song,
Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka- and J. Steinhardt, “Measuring coding challenge com-
plan, “A general language assistant as a laboratory petence with APPS,” in NeurIPS Datasets and Bench-
for alignment,” CoRR, vol. abs/2112.00861, 2021. marks, 2021.
[369] E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring, [379] Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and
J. Aslanides, A. Glaese, N. McAleese, and G. Irving, H. Yang, “Let’s reward step by step: Step-level reward
“Red teaming language models with language mod- model as the navigators for reasoning,” CoRR, vol.
els,” in Proceedings of the 2022 Conference on Empirical abs/2310.10080, 2023.
Methods in Natural Language Processing, EMNLP 2022, [380] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou,
Abu Dhabi, United Arab Emirates, December 7-11, 2022, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai,
Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Asso- A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre,
ciation for Computational Linguistics, 2022, pp. 3419– G. van den Driessche, T. Graepel, and D. Hassabis,
3448. “Mastering the game of go without human knowl-
[370] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, edge,” Nat., pp. 354–359, 2017.
H. F. Song, M. Chadwick, M. Glaese, S. Young, [381] T. Anthony, Z. Tian, and D. Barber, “Thinking fast
L. Campbell-Gillingham, G. Irving, and N. McAleese, and slow with deep learning and tree search,” in
“Teaching language models to support answers with Advances in Neural Information Processing Systems 30:
verified quotes,” CoRR, vol. abs/2203.11147, 2022. Annual Conference on Neural Information Processing Sys-
[371] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, tems 2017, December 4-9, 2017, Long Beach, CA, USA,
A. Jones, A. Chen, A. Goldie, A. Mirhoseini, 2017, pp. 5360–5370.
C. McKinnon, C. Chen, C. Olsson, C. Olah, [382] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao,
D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran- X. Geng, Q. Lin, S. Chen, and D. Zhang, “Wizard-
Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, math: Empowering mathematical reasoning for large
J. Landau, K. Ndousse, K. Lukosiute, L. Lovitt, language models via reinforced evol-instruct,” CoRR,
M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, vol. abs/2308.09583, 2023.
N. DasSarma, R. Lasenby, R. Larson, S. Ringer, [383] R. Liu, C. Jia, G. Zhang, Z. Zhuang, T. X. Liu, and
S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, S. Vosoughi, “Second thoughts are best: Learning
T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, to re-align with human values from text edits,” in
S. R. Bowman, Z. Hatfield-Dodds, B. Mann, NeurIPS, 2022.
D. Amodei, N. Joseph, S. McCandlish, T. Brown, and [384] X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West,
J. Kaplan, “Constitutional AI: harmlessness from AI P. Ammanabrolu, and Y. Choi, “QUARK: control-
feedback,” CoRR, vol. abs/2212.08073, 2022. [Online]. lable text generation with reinforced unlearning,” in
Available: https://doi.org/10.48550/arXiv.2212.08073 NeurIPS, 2022.
[372] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, [385] J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan,
C. Bishop, V. Carbune, and A. Rastogi, “RLAIF: scal- A. Chen, K. Cho, and E. Perez, “Training language
ing reinforcement learning from human feedback with models with language feedback at scale,” CoRR, vol.
AI feedback,” CoRR, vol. abs/2309.00267, 2023. abs/2303.16755, 2023.
[373] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, [386] G. Guo, R. Zhao, T. Tang, W. X. Zhao, and
K. Shum, and T. Zhang, “RAFT: reward ranked fine- J.-R. Wen, “Beyond imitation: Leveraging fine-
tuning for generative foundation model alignment,” grained quality signals for alignment,” arXiv preprint
CoRR, vol. abs/2304.06767, 2023. [Online]. Available: arXiv:2311.04072, 2023.
https://doi.org/10.48550/arXiv.2304.06767 [387] R. Krishna, D. Lee, L. Fei-Fei, and M. S. Bernstein,
[374] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, “Socially situated artificial intelligence enables
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- learning from human interaction,” Proceedings of
Sarma et al., “A general language assistant as a labo- the National Academy of Sciences of the United States
102

of America, vol. 119, 2022. [Online]. Available: https: guage models,” CoRR, vol. abs/2304.01933, 2023.
//api.semanticscholar.org/CorpusID:252381954 [400] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and
[388] H. Liu, C. Sferrazza, and P. Abbeel, “Chain of hind- G. Neubig, “Towards a unified view of parameter-
sight aligns language models with feedback,” CoRR, efficient transfer learning,” in The Tenth International
vol. abs/2302.02676, 2023. Conference on Learning Representations, ICLR 2022, Vir-
[389] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, tual Event, April 25-29, 2022. OpenReview.net, 2022.
C. D. Manning, and C. Finn, “Direct preference [401] X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, “P-
optimization: Your language model is secretly a tuning v2: Prompt tuning can be comparable to fine-
reward model,” CoRR, vol. abs/2305.18290, 2023. tuning universally across scales and tasks,” CoRR, vol.
[Online]. Available: https://doi.org/10.48550/arXiv. abs/2110.07602, 2021.
2305.18290 [402] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang,
[390] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and J. Tang, “GPT understands, too,” CoRR, vol.
and F. Huang, “RRHF: rank responses to align abs/2103.10385, 2021.
language models with human feedback without [403] Y. Gu, X. Han, Z. Liu, and M. Huang, “Ppt: Pre-trained
tears,” CoRR, vol. abs/2304.05302, 2023. [Online]. prompt tuning for few-shot learning,” in Proceedings
Available: https://doi.org/10.48550/arXiv.2304.05302 of the 60th Annual Meeting of the Association for Com-
[391] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and putational Linguistics (Volume 1: Long Papers), 2022, pp.
P. J. Liu, “Slic-hf: Sequence likelihood calibration with 8410–8423.
human feedback,” CoRR, vol. abs/2305.10425, 2023. [404] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can
[392] T. Zhang, F. Liu, J. Wong, P. Abbeel, and J. E. we know what language models know?” Transactions
Gonzalez, “The wisdom of hindsight makes language of the Association for Computational Linguistics, vol. 8,
models better instruction followers,” CoRR, vol. pp. 423–438, 2020.
abs/2302.05206, 2023. [Online]. Available: https: [405] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace,
//doi.org/10.48550/arXiv.2302.05206 and S. Singh, “Autoprompt: Eliciting knowledge
[393] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, from language models with automatically gener-
“Imitation learning: A survey of learning methods,” ated prompts,” in Proceedings of the 2020 Conference
ACM Comput. Surv., vol. 50, no. 2, apr 2017. [Online]. on Empirical Methods in Natural Language Processing
Available: https://doi.org/10.1145/3054912 (EMNLP), 2020, pp. 4222–4235.
[394] S. Levine, “Should i imitate or reinforce,” [406] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng,
2022. [Online]. Available: https://www.youtube. W. Chen, and T. Zhao, “Adaptive budget allocation
com/watch?v=sVPm7zOrBxM for parameter-efficient fine-tuning,” CoRR, vol.
[395] J. Schulman, “Reinforcement learning from human abs/2303.10512, 2023. [Online]. Available: https:
feedback: Progress and challenges,” 2023. [On- //doi.org/10.48550/arXiv.2303.10512
line]. Available: https://www.youtube.com/watch? [407] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and
v=hhiLw5Q UFg A. Ghodsi, “Dylora: Parameter efficient tuning of
[396] X. L. Li and P. Liang, “Prefix-tuning: Optimizing pre-trained models using dynamic search-free low-
continuous prompts for generation,” in Proceedings rank adaptation,” CoRR, vol. abs/2210.07558, 2022.
of the 59th Annual Meeting of the Association for Com- [Online]. Available: https://doi.org/10.48550/arXiv.
putational Linguistics and the 11th International Joint 2210.07558
Conference on Natural Language Processing, ACL/IJCNLP [408] N. Ding, Y. Qin, G. Yang, F. Wei, Y. Zonghan, Y. Su,
2021, (Volume 1: Long Papers), Virtual Event, August 1- S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao,
6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang,
Association for Computational Linguistics, 2021, pp. J. Li, and M. Sun, “Parameter-efficient fine-tuning
4582–4597. of large-scale pre-trained language models,” Nature
[397] B. Lester, R. Al-Rfou, and N. Constant, “The power Machine Intelligence, vol. 5, pp. 1–16, 03 2023.
of scale for parameter-efficient prompt tuning,” in [409] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li,
Proceedings of the 2021 Conference on Empirical Methods P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-
in Natural Language Processing, EMNLP 2021, Virtual tuning of language models with zero-init attention,”
Event / Punta Cana, Dominican Republic, 7-11 November, CoRR, vol. abs/2303.16199, 2023.
2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, [410] J. Pfeiffer, I. Vulic, I. Gurevych, and S. Ruder, “MAD-
Eds. Association for Computational Linguistics, 2021, X: an adapter-based framework for multi-task cross-
pp. 3045–3059. lingual transfer,” in Proceedings of the 2020 Conference
[398] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, on Empirical Methods in Natural Language Processing,
Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and EMNLP 2020, Online, November 16-20, 2020, B. Webber,
S. Gelly, “Parameter-efficient transfer learning for T. Cohn, Y. He, and Y. Liu, Eds. Association for
NLP,” in Proceedings of the 36th International Conference Computational Linguistics, 2020, pp. 7654–7673.
on Machine Learning, ICML 2019, 9-15 June 2019, Long [411] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, and
Beach, California, USA, 2019, pp. 2790–2799. S. Paul, “Peft: State-of-the-art parameter-efficient fine-
[399] Z. Hu, Y. Lan, L. Wang, W. Xu, E. Lim, R. K. Lee, tuning methods,” https://github.com/huggingface/
L. Bing, and S. Poria, “Llm-adapters: An adapter peft, 2022.
family for parameter-efficient fine-tuning of large lan- [412] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W.
103

Mahoney, and K. Keutzer, “A survey of quantization Proceedings of the 60th Annual Meeting of the Association
methods for efficient neural network inference,” for Computational Linguistics (Volume 1: Long Papers),
CoRR, vol. abs/2103.13630, 2021. [Online]. Available: ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan,
https://arxiv.org/abs/2103.13630 P. Nakov, and A. Villavicencio, Eds. Association for
[413] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, Computational Linguistics, 2022, pp. 4821–4836.
“Llm.int8(): 8-bit matrix multiplication for transform- [428] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and
ers at scale,” CoRR, vol. abs/2208.07339, 2022. W. Chen, “What makes good in-context examples for
[414] G. Xiao, J. Lin, M. Seznec, J. Demouth, and gpt-3?” in Proceedings of Deep Learning Inside Out: The
S. Han, “Smoothquant: Accurate and efficient post- 3rd Workshop on Knowledge Extraction and Integration for
training quantization for large language models,” Deep Learning Architectures, DeeLIO@ACL 2022, Dublin,
CoRR, vol. abs/2211.10438, 2022. [Online]. Available: Ireland and Online, May 27, 2022, 2022, pp. 100–114.
https://doi.org/10.48550/arXiv.2211.10438 [429] O. Rubin, J. Herzig, and J. Berant, “Learning to re-
[415] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, trieve prompts for in-context learning,” in Proceedings
and Y. He, “Zeroquant: Efficient and affordable post- of the 2022 Conference of the North American Chapter
training quantization for large-scale transformers,” in of the Association for Computational Linguistics: Human
NeurIPS, 2022. Language Technologies, NAACL 2022, Seattle, WA, United
[416] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, States, July 10-15, 2022, 2022, pp. 2655–2671.
“Awq: Activation-aware weight quantization for llm [430] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and
compression and acceleration,” 2023. S. Lee, “Self-generated in-context learning: Leverag-
[417] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alis- ing auto-regressive language models as a demonstra-
tarh, “Gptq: Accurate post-training quantization for tion generator,” CoRR, vol. abs/2206.08082, 2022.
generative pre-trained transformers,” arXiv preprint [431] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,
arXiv:2210.17323, 2022. H. Chan, and J. Ba, “Large language models are
[418] E. Frantar and D. Alistarh, “Optimal brain compres- human-level prompt engineers,” in Proc. of ICLR, 2023.
sion: A framework for accurate post-training quanti- [432] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stene-
zation and pruning,” in NeurIPS, 2022. torp, “Fantastically ordered prompts and where to
[419] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettle- find them: Overcoming few-shot prompt order sen-
moyer, “Qlora: Efficient finetuning of quantized llms,” sitivity,” in Proceedings of the 60th Annual Meeting of
arXiv preprint arXiv:2305.14314, 2023. the Association for Computational Linguistics (Volume 1:
[420] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Long Papers), ACL 2022, Dublin, Ireland, May 22-27,
Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds.,
“Llm-qat: Data-free quantization aware training for 2022, pp. 8086–8098.
large language models,” 2023. [433] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot,
[421] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant- “Complexity-based prompting for multi-step reason-
v2: Exploring post-training quantization in llms from ing,” CoRR, vol. abs/2210.00720, 2022.
comprehensive study to low rank compensation,” [434] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic
2023. chain of thought prompting in large language mod-
[422] T. Dettmers and L. Zettlemoyer, “The case for 4-bit els,” CoRR, vol. abs/2210.03493, 2022.
precision: k-bit inference scaling laws,” CoRR, vol. [435] A. Creswell, M. Shanahan, and I. Higgins, “Selection-
abs/2212.09720, 2022. inference: Exploiting large language models
[423] L. Peiyu, L. Zikang, G. Ze-Feng, G. Dawei, Z. W. Xin, for interpretable logical reasoning,” CoRR, vol.
L. Yaliang, D. Bolin, and W. Ji-Rong, “Do emergent abs/2205.09712, 2022.
abilities exist in quantized large language models: [436] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H.
An empirical study,” arXiv preprint arXiv:2307.08072, Chi, and D. Zhou, “Self-consistency improves chain
2023. of thought reasoning in language models,” CoRR, vol.
[424] T. Dettmers, M. Lewis, Y. Belkada, and abs/2203.11171, 2022.
in Proceedings of the 2021 Conference on Empirical 2022.
Methods in Natural Language Processing, EMNLP 2021, [684] A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu,
Virtual Event / Punta Cana, Dominican Republic, 7-11 M. Jamnik, T. Lacroix, Y. Wu, and G. Lample, “Draft,
November, 2021, M. Moens, X. Huang, L. Specia, and sketch, and prove: Guiding formal theorem provers
S. W. Yih, Eds. Association for Computational Lin- with informal proofs,” CoRR, vol. abs/2210.12283,
guistics, 2021, pp. 5484–5495. 2022.
[672] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, [685] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao,
H. Chen, and N. Zhang, “Editing large language mod- S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,
els: Problems, methods, and opportunities,” CoRR, Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yaz-
vol. abs/2305.13172, 2023. danbakhsh, and P. Clark, “Self-refine: Iterative refine-
[673] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian, ment with self-feedback,” CoRR, vol. abs/2303.17651,
M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and 2023.
H. Chen, “Easyedit: An easy-to-use knowledge edit- [686] N. Shinn, B. Labash, and A. Gopinath, “Reflexion: an
ing framework for large language models,” CoRR, vol. autonomous agent with dynamic memory and self-
abs/2308.07269, 2023. reflection,” CoRR, vol. abs/2303.11366, 2023.
[674] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and [687] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan,
W. Chen, “Synthetic prompting: Generating chain-of- and W. Chen, “CRITIC: large language models can
thought demonstrations for large language models,” self-correct with tool-interactive critiquing,” CoRR,
CoRR, vol. abs/2302.00618, 2023. vol. abs/2305.11738, 2023.
[675] Sifatkaur, M. Singh, V. S. B, and N. Malviya, “Mind [688] J. Uesato, N. Kushman, R. Kumar, H. F. Song, N. Y.
meets machine: Unravelling gpt-4’s cognitive psychol- Siegel, L. Wang, A. Creswell, G. Irving, and I. Hig-
ogy,” CoRR, vol. abs/2303.11436, 2023. gins, “Solving math word problems with process- and
[676] M. I. Nye, A. J. Andreassen, G. Gur-Ari, outcome-based feedback,” CoRR, vol. abs/2211.14275,
H. Michalewski, J. Austin, D. Bieber, D. Dohan, 2022.
A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, [689] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
and A. Odena, “Show your work: Scratchpads for B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
intermediate computation with language models,” and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
CoRR, vol. abs/2112.00114, 2021. abs/2305.20050, 2023.
[677] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita- [690] Z. Yuan, H. Yuan, C. Tan, W. Wang, and S. Huang,
tions of language models in arithmetic and symbolic “How well do large language models perform in
induction,” CoRR, vol. abs/2208.05051, 2022. arithmetic tasks?” CoRR, vol. abs/2304.02015, 2023.
[678] W. X. Zhao, K. Zhou, Z. Gong, B. Zhang, Y. Zhou, [691] X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Q. Fu, Y. Gao,
J. Sha, Z. Chen, S. Wang, C. Liu, and J. Wen, “Jiuzhang: J. Lou, and W. Chen, “Reasoning like program execu-
A chinese pre-trained language model for mathemat- tors,” in Proceedings of the 2022 Conference on Empirical
115

Methods in Natural Language Processing, EMNLP 2022, S. Lu, L. Ji, S. Mao, Y. Wang, L. Shou, M. Gong,
Abu Dhabi, United Arab Emirates, December 7-11, 2022, and N. Duan, “Taskmatrix.ai: Completing tasks by
2022, pp. 761–779. connecting foundation models with millions of apis,”
[692] H. Zhou, A. Nova, H. Larochelle, A. C. Courville, CoRR, vol. abs/2303.16434, 2023.
B. Neyshabur, and H. Sedghi, “Teaching algorith- [705] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou,
mic reasoning via in-context learning,” CoRR, vol. “Large language models as tool makers,” CoRR, vol.
abs/2211.09066, 2022. abs/2305.17126, 2023.
[693] A. Parisi, Y. Zhao, and N. Fiedel, “TALM: [706] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and
tool augmented language models,” CoRR, vol. J. Han, “Large language models can self-improve,”
abs/2205.12255, 2022. CoRR, vol. abs/2210.11610, 2022.
[694] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, [707] E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lam-
“Language models as zero-shot planners: Extracting bert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf,
actionable knowledge for embodied agents,” in ICML, “Open llm leaderboard,” https://huggingface.co/
ser. Proceedings of Machine Learning Research, vol. spaces/HuggingFaceH4/open llm leaderboard,
162. PMLR, 2022, pp. 9118–9147. 2023.
[695] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, [708] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang,
and P. Oudeyer, “Grounding large language models A. Saied, W. Chen, and N. Duan, “Agieval: A human-
in interactive environments with online reinforcement centric benchmark for evaluating foundation models,”
learning,” CoRR, vol. abs/2302.02662, 2023. CoRR, vol. abs/2304.06364, 2023.
[696] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang, [709] H. Zeng, “Measuring massive multitask chinese un-
G. Huang, B. Li, L. Lu, X. Wang, Y. Qiao, Z. Zhang, derstanding,” CoRR, vol. abs/2304.12986, 2023.
and J. Dai, “Ghost in the minecraft: Generally capable [710] C. Liu, R. Jin, Y. Ren, L. Yu, T. Dong, X. Peng,
agents for open-world environments via large lan- S. Zhang, J. Peng, P. Zhang, Q. Lyu, X. Su, Q. Liu,
guage models with text-based knowledge and mem- and D. Xiong, “M3KE: A massive multi-level multi-
ory,” CoRR, vol. abs/2305.17144, 2023. subject knowledge evaluation benchmark for chinese
[697] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, large language models,” CoRR, vol. abs/2305.10263,
Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An 2023.
open-ended embodied agent with large language [711] Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su,
models,” CoRR, vol. abs/2305.16291, 2023. J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and
[698] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, J. He, “C-eval: A multi-level multi-discipline chinese
B. David, C. Finn, K. Gopalakrishnan, K. Hausman, evaluation suite for foundation models,” CoRR, vol.
A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, abs/2305.08322, 2023.
E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, [712] Z. Gu, X. Zhu, H. Ye, L. Zhang, J. Wang, S. Jiang,
R. Julian, D. Kalashnikov, Y. Kuang, K. Lee, S. Levine, Z. Xiong, Z. Li, Q. He, R. Xu, W. Huang, W. Zheng,
Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, H. Feng, and Y. Xiao, “Xiezhi: An ever-updating
K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Siev- benchmark for holistic domain knowledge evalua-
ers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, tion,” CoRR, vol. abs/2306.05783, 2023.
P. Xu, S. Xu, and M. Yan, “Do as I can, not as I say: [713] O. Contributors, “Opencompass: A universal evalua-
Grounding language in robotic affordances,” CoRR, tion platform for foundation models,” https://github.
vol. abs/2204.01691, 2022. com/InternLM/OpenCompass, 2023.
[699] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, [714] Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and T. Khot,
B. Ichter, P. Florence, and A. Zeng, “Code as policies: “Chain-of-thought hub: A continuous effort to mea-
Language model programs for embodied control,” sure large language models’ reasoning performance,”
CoRR, vol. abs/2209.07753, 2022. CoRR, vol. abs/2305.17306, 2023.
[700] Y. Fu, H. Peng, T. Khot, and M. Lapata, “Improv- [715] J. Yu, X. Wang, S. Tu, S. Cao, D. Zhang-li, X. Lv,
ing language model negotiation with self-play and H. Peng, Z. Yao, X. Zhang, H. Li, C. Li, Z. Zhang,
in-context learning from AI feedback,” CoRR, vol. Y. Bai, Y. Liu, A. Xin, N. Lin, K. Yun, L. Gong, J. Chen,
abs/2305.10142, 2023. Z. Wu, Y. Qi, W. Li, Y. Guan, K. Zeng, J. Qi, H. Jin,
[701] N. Mehta, M. Teruel, P. F. Sanz, X. Deng, A. H. J. Liu, Y. Gu, Y. Yao, N. Ding, L. Hou, Z. Liu, B. Xu,
Awadallah, and J. Kiseleva, “Improving grounded J. Tang, and J. Li, “Kola: Carefully benchmarking
language understanding in a collaborative environ- world knowledge of large language models,” CoRR,
ment by interacting with agents through help feed- vol. abs/2306.09296, 2023.
back,” CoRR, vol. abs/2304.10750, 2023. [716] T. Sawada, D. Paleka, A. Havrilla, P. Tadepalli, P. Vi-
[702] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Go- das, A. Kranias, J. J. Nay, K. Gupta, and A. Ko-
rilla: Large language model connected with massive matsuzaki, “ARB: advanced reasoning benchmark for
apis,” CoRR, vol. abs/2305.15334, 2023. large language models,” CoRR, vol. abs/2307.13692,
[703] S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Aug- 2023.
menting frozen language models with massive tools [717] Y. Peng, S. Li, W. Gu, Y. Li, W. Wang, C. Gao, and
via tool embeddings,” CoRR, vol. abs/2305.11554, M. R. Lyu, “Revisiting, benchmarking and exploring
2023. API recommendation: How far are we?” IEEE Trans.
[704] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou, Software Eng., vol. 49, no. 4, pp. 1876–1897, 2023.
116

[718] M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li, “Benchmarking foundation models with language-
“Api-bank: A benchmark for tool-augmented llms,” model-as-an-examiner,” CoRR, vol. abs/2306.04181,
CoRR, vol. abs/2304.08244, 2023. 2023.
[719] Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and [732] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang,
L. Sun, “Toolalpaca: Generalized tool learning for J. Fu, and Z. Liu, “Chateval: Towards better llm-based
language models with 3000 simulated cases,” CoRR, evaluators through multi-agent debate,” CoRR, vol.
vol. abs/2306.05301, 2023. abs/2308.07201, 2023.
[720] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang, [733] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen,
“On the tool manipulation capability of open-source L. Yang, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang,
large language models,” CoRR, vol. abs/2305.16504, Y. Chang, P. S. Yu, Q. Yang, and X. Xie, “A survey
2023. on evaluation of large language models,” CoRR, vol.
[721] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, abs/2307.03109, 2023.
X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, [734] Z. Zhuang, Q. Chen, L. Ma, M. Li, Y. Han, Y. Qian,
J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, H. Bai, Z. Feng, W. Zhang, and T. Liu, “Through the
“Toolllm: Facilitating large language models to master lens of core competency: Survey on evaluation of large
16000+ real-world apis,” CoRR, vol. abs/2307.16789, language models,” CoRR, vol. abs/2308.07902, 2023.
2023. [735] J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Gar-
[722] Z. Liu, W. Yao, J. Zhang, L. Xue, S. Heinecke, rette, M. Collins, and T. Kwiatkowski, “Tydi QA: A
R. Murthy, Y. Feng, Z. Chen, J. C. Niebles, benchmark for information-seeking question answer-
D. Arpit, R. Xu, P. Mui, H. Wang, C. Xiong, and ing in typologically diverse languages,” Trans. Assoc.
S. Savarese, “BOLAA: benchmarking and orchestrat- Comput. Linguistics, vol. 8, pp. 454–470, 2020.
ing llm-augmented autonomous agents,” CoRR, vol. [736] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi, C. Fos-
abs/2308.05960, 2023. ter, L. Golding, J. Hsu, K. McDonell, N. Muennighoff,
[723] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, J. Phang, L. Reynolds, E. Tang, A. Thite, B. Wang,
H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, K. Wang, and A. Zou, “A framework for few-shot
Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, language model evaluation,” Sep. 2021.
M. Huang, Y. Dong, and J. Tang, “Agentbench: Evalu- [737] R. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du,
ating llms as agents,” CoRR, vol. abs/2308.03688, 2023. S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang,
[724] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, “When flue meets flang: Benchmarks and large pre-
L. Yang, W. Ye, N. Z. Gong, Y. Zhang, and X. Xie, trained language model for financial domain,” in Pro-
“Promptbench: Towards evaluating the robustness ceedings of the 2022 Conference on Empirical Methods in
of large language models on adversarial prompts,” Natural Language Processing, 2022, pp. 2322–2335.
CoRR, vol. abs/2306.04528, 2023. [738] K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao,
[725] R. S. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du, X. Chen, Y. Lin, J.-R. Wen, and J. Han, “Don’t make
S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang, your llm an evaluation benchmark cheater,” arXiv
“WHEN FLUE MEETS FLANG: benchmarks and preprint arXiv:2311.01964, 2023.
large pre-trained language model for financial do- [739] C. Zan, K. Peng, L. Ding, B. Qiu, B. Liu, S. He, Q. Lu,
main,” CoRR, vol. abs/2211.00083, 2022. Z. Zhang, C. Liu, W. Liu, Y. Zhan, and D. Tao, “Vega-
[726] N. Guha, D. E. Ho, J. Nyarko, and C. Ré, “Legalbench: mt: The JD explore academy machine translation sys-
Prototyping a collaborative benchmark for legal rea- tem for WMT22,” in Proceedings of the Seventh Con-
soning,” CoRR, vol. abs/2209.06120, 2022. ference on Machine Translation, WMT 2022, Abu Dhabi,
[727] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, United Arab Emirates (Hybrid), December 7-8, 2022,
Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chat-
J. E. Gonzalez, and I. Stoica, “Judging llm-as-a- terjee, M. R. Costa-jussà, C. Federmann, M. Fishel,
judge with mt-bench and chatbot arena,” CoRR, vol. A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz,
abs/2306.05685, 2023. P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes,
[728] X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Subrama- T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Na-
niam, A. R. Loomba, S. Zhang, Y. Sun, and W. Wang, gata, T. Nakazawa, M. Negri, A. Névéol, M. Neves,
“Scibench: Evaluating college-level scientific problem- M. Popel, M. Turchi, and M. Zampieri, Eds. Associa-
solving abilities of large language models,” CoRR, vol. tion for Computational Linguistics, 2022, pp. 411–422.
abs/2307.10635, 2023. [740] Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh,
[729] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, and P. J. Liu, “Calibrating sequence likelihood
C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpacae- improves conditional language generation,” CoRR,
val: An automatic evaluator of instruction-following vol. abs/2210.00045, 2022. [Online]. Available: https:
models,” https://github.com/tatsu-lab/alpaca eval, //doi.org/10.48550/arXiv.2210.00045
2023. [741] D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord,
[730] Y. Huang, Q. Zhang, P. S. Yu, and L. Sun, “Trustgpt: P. Clark, and H. Hajishirzi, “Unifiedqa: Crossing for-
A benchmark for trustworthy and responsible large mat boundaries with a single QA system,” in EMNLP
language models,” CoRR, vol. abs/2306.11507, 2023. (Findings), ser. Findings of ACL, vol. EMNLP 2020.
[731] Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu, Association for Computational Linguistics, 2020, pp.
K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou, 1896–1907.
117

[742] X. Zhu, J. Wang, L. Zhang, Y. Zhang, R. Gan, J. Zhang, in natural language processing, 1996.
and Y. Yang, “Solving math word problem via co- [757] V. Yadav and S. Bethard, “A survey on recent ad-
operative reasoning induced language models,” arXiv vances in named entity recognition from deep learn-
preprint arXiv:2210.16257, 2022. ing models,” in Proceedings of the 27th International
[743] A. Nguyen, N. Karampatziakis, and W. Chen, “Meet Conference on Computational Linguistics, 2018, pp. 2145–
in the middle: A new pre-training paradigm,” 2158.
CoRR, vol. abs/2303.07295, 2023. [Online]. Available: [758] F. Souza, R. Nogueira, and R. Lotufo, “Portuguese
https://doi.org/10.48550/arXiv.2303.07295 named entity recognition using bert-crf,” arXiv
[744] H. Li, J. Zhang, C. Li, and H. Chen, “RESDSQL: preprint arXiv:1909.10649, 2019.
decoupling schema linking and skeleton parsing [759] S. Pawar, G. K. Palshikar, and P. Bhattacharyya,
for text-to-sql,” CoRR, vol. abs/2302.05965, 2023. “Relation extraction: A survey,” arXiv preprint
[Online]. Available: https://doi.org/10.48550/arXiv. arXiv:1712.05191, 2017.
2302.05965 [760] C. Walker and et al., “Ace 2005 multilingual training
[745] W. Kang and J. J. McAuley, “Self-attentive sequential corpus ldc2006t06,” Philadelphia, 2006.
recommendation,” in IEEE International Conference on [761] J. Gao, H. Zhao, C. Yu, and R. Xu, “Exploring the
Data Mining, ICDM 2018, Singapore, November 17-20, feasibility of chatgpt for event extraction,” CoRR, vol.
2018. IEEE Computer Society, 2018, pp. 197–206. abs/2303.03836, 2023.
[746] B. Yang, C. Han, Y. Li, L. Zuo, and Z. Yu, “Improv- [762] Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language
ing conversational recommendation systems’ quality model is not a good few-shot information extractor,
with context-aware item meta-information,” in Find- but a good reranker for hard samples!” CoRR, vol.
ings of the Association for Computational Linguistics: abs/2303.08559, 2023.
NAACL 2022, Seattle, WA, United States, July 10-15, [763] R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic
2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruı́z, data generation of llms help clinical text mining?”
Eds. Association for Computational Linguistics, 2022, arXiv preprint arXiv:2303.04360, 2023.
pp. 38–48. [764] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet,
[747] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cap- A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalch-
pelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Hes- brenner, N. Parmar et al., “Tensor2tensor for neural
low, J. Launay, Q. Malartic, B. Noune, B. Pannier, machine translation,” in Proceedings of the 13th Con-
and G. Penedo, “Falcon-40B: an open large language ference of the Association for Machine Translation in the
model with state-of-the-art performance,” 2023. Americas (Volume 1: Research Track), 2018, pp. 193–199.
[748] S. Martin, J. Liermann, and H. Ney, “Algorithms for [765] B. Zhang, B. Haddow, and A. Birch, “Prompting
bigram and trigram word clustering,” Speech commu- large language model for machine translation: A case
nication, vol. 24, no. 1, pp. 19–37, 1998. study,” arXiv preprint arXiv:2301.07069, 2023.
[749] R. Navigli, “Word sense disambiguation: A survey,” [766] M. Ghazvininejad, H. Gonen, and L. Zettlemoyer,
ACM computing surveys (CSUR), vol. 41, no. 2, pp. 1– “Dictionary-based phrase-level prompting of large
69, 2009. language models for machine translation,” arXiv
[750] W. H. Gomaa, A. A. Fahmy et al., “A survey of text preprint arXiv:2302.07856, 2023.
similarity approaches,” international journal of Com- [767] L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu,
puter Applications, vol. 68, no. 13, pp. 13–18, 2013. S. Shi, and Z. Tu, “Document-level machine trans-
[751] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, lation with large language models,” arXiv preprint
M. Chenaghlu, and J. Gao, “Deep learning–based text arXiv:2304.02210, 2023.
classification: a comprehensive review,” ACM comput- [768] W. Jiao, J.-t. Huang, W. Wang, X. Wang, S. Shi, and
ing surveys (CSUR), vol. 54, no. 3, pp. 1–40, 2021. Z. Tu, “Parrot: Translating during chat using large lan-
[752] N. Alex, E. Lifland, L. Tunstall, A. Thakur, P. Maham, guage models,” arXiv preprint arXiv:2304.02426, 2023.
C. J. Riedel, E. Hine, C. Ashurst, P. Sedille, A. Carlier, [769] W. Yang, C. Li, J. Zhang, and C. Zong, “Bigtrans:
M. Noetel, and A. Stuhlmüller, “RAFT: A real-world Augmenting large language models with multilin-
few-shot text classification benchmark,” in NeurIPS gual translation capability over 100 languages,” arXiv
Datasets and Benchmarks, 2021. preprint arXiv:2305.18098, 2023.
[753] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, [770] J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek,
and D. Yang, “Is chatgpt a general-purpose nat- D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz,
ural language processing task solver?” CoRR, vol. K. Kanclerz, A. Kocon, B. Koptyra, W. Mieleszczenko-
abs/2302.06476, 2023. Kowszewicz, P. Milkowski, M. Oleksy, M. Piasecki,
[754] X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng, L. Radlinski, K. Wojtasik, S. Wozniak, and P. Kazienko,
J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How robust “Chatgpt: Jack of all trades, master of none,” CoRR,
is gpt-3.5 to predecessors? a comprehensive study on vol. abs/2302.10724, 2023.
language understanding tasks,” 2023. [771] Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao,
[755] D. Nadeau and S. Sekine, “A survey of named entity “Can chatgpt understand too? A comparative study
recognition and classification,” Lingvisticae Investiga- on chatgpt and fine-tuned BERT,” CoRR, vol.
tiones, vol. 30, no. 1, pp. 3–26, 2007. abs/2302.10198, 2023.
[756] A. Ratnaparkhi, “A maximum entropy model for part- [772] D. Cheng, S. Huang, J. Bi, Y. Zhan, J. Liu, Y. Wang,
of-speech tagging,” in Conference on empirical methods H. Sun, F. Wei, D. Deng, and Q. Zhang, “Uprise:
118

Universal prompt retrieval for improving zero-shot timized training approach to dense passage retrieval
evaluation,” arXiv preprint arXiv:2303.08518, 2023. for open-domain question answering,” in Proceedings
[773] R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, of the 2021 Conference of the North American Chapter
H. Wang, and J.-R. Wen, “Rocketqav2: A joint train- of the Association for Computational Linguistics: Human
ing method for dense passage retrieval and passage Language Technologies, 2021, pp. 5835–5847.
re-ranking,” in Proceedings of the 2021 Conference on [787] R. Ren, S. Lv, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu,
Empirical Methods in Natural Language Processing, 2021, H. Wang, and J.-R. Wen, “Pair: Leveraging passage-
pp. 2825–2835. centric similarity relation for improving dense pas-
[774] W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren, sage retrieval,” in Findings of the Association for Compu-
“Is chatgpt good at search? investigating large lan- tational Linguistics: ACL-IJCNLP 2021, 2021, pp. 2173–
guage models as re-ranking agent,” arXiv preprint 2183.
arXiv:2304.09542, 2023. [788] Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning
[775] Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, J. Shen, for augmenting dense retrieval with large language
T. Liu, J. Liu, D. Metzler, X. Wang et al., “Large lan- models,” arXiv preprint arXiv:2307.08303, 2023.
guage models are effective text rankers with pairwise [789] Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu,
ranking prompting,” arXiv preprint arXiv:2306.17563, A. Bakalov, K. Guu, K. Hall, and M.-W. Chang,
2023. “Promptagator: Few-shot dense retrieval from 8 ex-
[776] S. Cho, S. Jeong, J. Seo, and J. C. Park, “Discrete amples,” in The Eleventh International Conference on
prompt optimization via constrained generation for Learning Representations, 2023.
zero-shot re-ranker,” arXiv preprint arXiv:2305.13729, [790] A. Askari, M. Aliannejadi, E. Kanoulas, and S. Ver-
2023. berne, “Generating synthetic documents for cross-
[777] R. Tang, X. Zhang, X. Ma, J. Lin, and F. Ture, “Found encoder re-rankers: A comparative study of chatgpt
in the middle: Permutation self-consistency improves and human experts,” arXiv preprint arXiv:2305.02320,
listwise ranking in large language models,” arXiv 2023.
preprint arXiv:2310.07712, 2023. [791] K. Mao, Z. Dou, H. Chen, F. Mo, and H. Qian, “Large
[778] X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot language models know your contextual search intent:
listwise document reranking with a large language A prompting framework for conversational search,”
model,” arXiv preprint arXiv:2305.02156, 2023. arXiv preprint arXiv:2303.06573, 2023.
[779] S. Zhuang, H. Zhuang, B. Koopman, and G. Zuccon, [792] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-
“A setwise approach for effective and highly efficient shot dense retrieval without relevance labels,” in Pro-
zero-shot ranking with large language models,” arXiv ceedings of the 61st Annual Meeting of the Association
preprint arXiv:2310.09497, 2023. for Computational Linguistics (Volume 1: Long Papers).
[780] H. Zhuang, Z. Qin, K. Hui, J. Wu, L. Yan, X. Wang, and Association for Computational Linguistics, 2023, pp.
M. Berdersky, “Beyond yes and no: Improving zero- 1762–1777.
shot llm rankers via scoring fine-grained relevance [793] L. Wang, N. Yang, and F. Wei, “Query2doc: Query
labels,” arXiv preprint arXiv:2310.14122, 2023. expansion with large language models,” arXiv preprint
[781] N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large arXiv:2303.07678, 2023.
language models are built-in autoregressive search [794] G. Ma, X. Wu, P. Wang, Z. Lin, and S. Hu, “Pre-
engines,” arXiv preprint arXiv:2305.09612, 2023. training with large language model-based document
[782] X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin, “Fine- expansion for dense passage retrieval,” arXiv preprint
tuning llama for multi-stage text retrieval,” arXiv arXiv:2308.08285, 2023.
preprint arXiv:2310.08319, 2023. [795] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren,
[783] R. Pradeep, S. Sharifymoghaddam, and J. Lin, Z. Chen, D. Yin, and Z. Ren, “Instruction distilla-
“Rankvicuna: Zero-shot listwise document rerank- tion makes large language models efficient zero-shot
ing with open-source large language models,” arXiv rankers,” arXiv preprint arXiv:2311.01555, 2023.
preprint arXiv:2309.15088, 2023. [796] X. Wang, W. Zhu, and W. Y. Wang, “Large language
[784] Y. Tay, V. Q. Tran, M. Dehghani, J. Ni, D. Bahri, models are implicitly topic models: Explaining and
H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. Gupta et al., finding good demonstrations for in-context learning,”
“Transformer memory as a differentiable search in- CoRR, vol. abs/2301.11916, 2023.
dex,” in Advances in Neural Information Processing Sys- [797] C. Li, Z. Gan, Z. Yang, J. Yang, L. Li, L. Wang,
tems, 2022. and J. Gao, “Multimodal foundation models: From
[785] R. Ren, W. X. Zhao, J. Liu, H. Wu, J.-R. Wen, specialists to general-purpose assistants,” CoRR, vol.
and H. Wang, “TOME: A two-stage approach for abs/2309.10020, 2023.
model-based retrieval,” in Proceedings of the 61st [798] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, Y. Chen, X. Pan,
Annual Meeting of the Association for Computational K. Li, Y. Lu, H. Wang, C. Tian, Y. Min, Z. Feng, X. Fan,
Linguistics (Volume 1: Long Papers). Association X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, and J. Wen,
for Computational Linguistics, 2023, pp. 6102–6114. “Recbole: Towards a unified, comprehensive and ef-
[Online]. Available: https://aclanthology.org/2023. ficient framework for recommendation algorithms,”
acl-long.336 in CIKM, G. Demartini, G. Zuccon, J. S. Culpepper,
[786] Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, Z. Huang, and H. Tong, Eds. ACM, 2021, pp. 4653–
D. Dong, H. Wu, and H. Wang, “Rocketqa: An op- 4664.
119

[799] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, R. Zhang, and Y. Yu, “Towards open-world recom-
F. Zhang, Z. Wang, and J. Wen, “S3-rec: Self- mendation with knowledge augmentation from large
supervised learning for sequential recommendation language models,” CoRR, vol. abs/2306.10933, 2023.
with mutual information maximization,” in CIKM, [813] Q. Liu, N. Chen, T. Sakai, and X. Wu, “A first look
M. d’Aquin, S. Dietze, C. Hauff, E. Curry, and at llm-powered generative news recommendation,”
P. Cudré-Mauroux, Eds. ACM, 2020, pp. 1893–1902. CoRR, vol. abs/2305.06566, 2023.
[800] W. X. Zhao, Y. Hou, X. Pan, C. Yang, Z. Zhang, Z. Lin, [814] R. Li, W. Deng, Y. Cheng, Z. Yuan, J. Zhang, and
J. Zhang, S. Bian, J. Tang, W. Sun, Y. Chen, L. Xu, F. Yuan, “Exploring the upper limits of text-based
G. Zhang, Z. Tian, C. Tian, S. Mu, X. Fan, X. Chen, collaborative filtering using large language models:
and J. Wen, “Recbole 2.0: Towards a more up-to-date Discoveries and insights,” CoRR, vol. abs/2305.11700,
recommendation library,” in CIKM, M. A. Hasan and 2023.
L. Xiong, Eds. ACM, 2022, pp. 4722–4726. [815] W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng,
[801] L. Xu, Z. Tian, G. Zhang, J. Zhang, L. Wang, B. Zheng, J. Wang, D. Yin, and C. Huang, “Llmrec: Large lan-
Y. Li, J. Tang, Z. Zhang, Y. Hou, X. Pan, W. X. Zhao, guage models with graph augmentation for recom-
X. Chen, and J. Wen, “Towards a more user-friendly mendation,” CoRR, vol. abs/2311.00423, 2023.
and easy-to-use benchmark library for recommender [816] X. Li, B. Chen, L. Hou, and R. Tang, “Ctrl: Connect
systems,” in SIGIR, H. Chen, W. E. Duh, H. Huang, tabular and language model for ctr prediction,” arXiv
M. P. Kato, J. Mothe, and B. Poblete, Eds. ACM, preprint arXiv:2306.02841, 2023.
2023, pp. 2837–2847. [817] A. Muhamed, I. Keivanloo, S. Perera, J. Mracek, Y. Xu,
[802] S. Rendle, C. Freudenthaler, Z. Gantner, and Q. Cui, S. Rajagopalan, B. Zeng, and T. Chilimbi, “Ctr-
L. Schmidt-Thieme, “BPR: bayesian personalized bert: Cost-effective knowledge distillation for billion-
ranking from implicit feedback,” CoRR, vol. parameter teacher models,” in NeurIPS Efficient Natu-
abs/1205.2618, 2012. ral Language and Speech Processing Workshop, 2021.
[803] W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang, [818] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang,
and Q. Li, “Recommender systems in the era of large J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X.
language models (llms),” CoRR, 2023. Zhao, Z. Wei, and J. Wen, “A survey on large lan-
[804] L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen, guage model based autonomous agents,” CoRR, vol.
C. Qin, C. Zhu, H. Zhu, Q. Liu, H. Xiong, and E. Chen, abs/2308.11432, 2023.
“A survey on large language models for recommenda- [819] L. Wang, J. Zhang, X. Chen, Y. Lin, R. Song, W. X.
tion,” CoRR, 2023. Zhao, and J. Wen, “Recagent: A novel simulation
[805] Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and paradigm for recommender systems,” CoRR, vol.
J. Zhang, “Chat-rec: Towards interactive and explain- abs/2306.02552, 2023.
able llms-augmented recommender system,” CoRR, [820] E. Ie, C. Hsu, M. Mladenov, V. Jain, S. Narvekar,
vol. abs/2303.14524, 2023. J. Wang, R. Wu, and C. Boutilier, “Recsim: A con-
[806] S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun, figurable simulation platform for recommender sys-
X. Zhang, and J. Xu, “Uncovering chatgpt’s capabil- tems,” CoRR, vol. abs/1909.04847, 2019.
ities in recommender systems,” in RecSys, J. Zhang, [821] J. Zhang, Y. Hou, R. Xie, W. Sun, J. J. McAuley, W. X.
L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basil- Zhao, L. Lin, and J. Wen, “Agentcf: Collaborative
ico, L. Pizzato, and Y. Song, Eds. ACM, 2023, pp. learning with autonomous language agents for recom-
1126–1132. mender systems,” CoRR, vol. abs/2310.09233, 2023.
[807] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley, [822] A. Zhang, L. Sheng, Y. Chen, H. Li, Y. Deng, X. Wang,
and W. X. Zhao, “Large language models are zero-shot and T. Chua, “On generative agents in recommenda-
rankers for recommender systems,” CoRR, 2023. tion,” CoRR, vol. abs/2310.10108, 2023.
[808] J. Liu, C. Liu, R. Lv, K. Zhou, and Y. Zhang, “Is chatgpt [823] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of
a good recommender? A preliminary study,” CoRR, vision-language pre-trained models,” in Proceedings of
vol. abs/2304.10149, 2023. the Thirty-First International Joint Conference on Artificial
[809] K. Bao, J. Zhang, Y. Zhang, W. Wang, F. Feng, Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022,
and X. He, “Tallrec: An effective and efficient tun- L. D. Raedt, Ed. ijcai.org, 2022, pp. 5436–5443.
ing framework to align large language model with [824] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, and
recommendation,” in RecSys, J. Zhang, L. Chen, J. Gao, “Vision-language pre-training: Basics, recent
S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Piz- advances, and future trends,” Found. Trends Comput.
zato, and Y. Song, Eds. ACM, 2023, pp. 1007–1014. Graph. Vis., vol. 14, no. 3-4, pp. 163–352, 2022.
[810] Y. Zhu, L. Wu, Q. Guo, L. Hong, and J. Li, “Collabora- [825] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen,
tive large language model for recommender systems,” A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen,
arXiv preprint arXiv:2311.01343, 2023. D. E. Badawy, W. Han, E. Kharitonov et al., “Au-
[811] B. Zheng, Y. Hou, H. Lu, Y. Chen, W. X. diopalm: A large language model that can speak and
Zhao, and J.-R. Wen, “Adapting large language listen,” CoRR, 2023.
models by integrating collaborative semantics for [826] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr,
recommendation,” 2023. [Online]. Available: https: Y. Hasson, K. Lenc, A. Mensch, K. Millican,
//api.semanticscholar.org/CorpusID:265213194 M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han,
[812] Y. Xi, W. Liu, J. Lin, J. Zhu, B. Chen, R. Tang, W. Zhang, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick,
120

S. Borgeaud, A. Brock, A. Nematzadeh, S. Shar- Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and
ifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zis- R. Ji, “MME: A comprehensive evaluation benchmark
serman, and K. Simonyan, “Flamingo: a visual lan- for multimodal large language models,” CoRR, vol.
guage model for few-shot learning,” in NeurIPS, 2022. abs/2306.13394, 2023.
[827] C. Schuhmann, R. Beaumont, R. Vencu, C. Gor- [840] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang,
don, R. Wightman, M. Cherti, T. Coombes, A. Katta, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi,
C. Mullis, M. Wortsman, P. Schramowski, S. Kun- F. Shi, and S. Shi, “Siren’s song in the AI ocean: A
durthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, survey on hallucination in large language models,”
and J. Jitsev, “LAION-5B: an open large-scale dataset CoRR, vol. abs/2309.01219, 2023.
for training next generation image-text models,” in [841] A. Gunjal, J. Yin, and E. Bas, “Detecting and prevent-
NeurIPS, 2022. ing hallucinations in large vision language models,”
[828] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, CoRR, vol. abs/2308.06394, 2023.
“Conceptual 12m: Pushing web-scale image-text pre- [842] J. Lu, J. Rao, K. Chen, X. Guo, Y. Zhang, B. Sun,
training to recognize long-tail visual concepts,” in C. Yang, and J. Yang, “Evaluation and mitigation of
IEEE Conference on Computer Vision and Pattern Recog- agnosia in multimodal large language models,” CoRR,
nition, CVPR 2021, virtual, June 19-25, 2021. Computer vol. abs/2309.04041, 2023.
Vision Foundation / IEEE, 2021, pp. 3558–3568. [843] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell,
[829] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, and K. Saenko, “Object hallucination in image cap-
A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen, J. Tian, tioning,” in EMNLP. Association for Computational
Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Mod- Linguistics, 2018, pp. 4035–4045.
ularization empowers large language models with [844] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and
multimodality,” CoRR, vol. abs/2304.14178, 2023. J.-R. Wen, “Evaluating object hallucination in large
[830] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, vision-language models,” in The 2023 Conference on
C. Zhou, and J. Zhou, “Qwen-vl: A frontier large Empirical Methods in Natural Language Processing, 2023.
vision-language model with versatile abilities,” CoRR, [Online]. Available: https://openreview.net/forum?
vol. abs/2308.12966, 2023. id=xozJw0kZXF
[831] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved [845] D. A. Hudson and C. D. Manning, “GQA: A new
baselines with visual instruction tuning,” CoRR, vol. dataset for real-world visual reasoning and compo-
abs/2310.03744, 2023. sitional question answering,” in CVPR. Computer
[832] P. Zhang, X. Dong, B. Wang, Y. Cao, C. Xu, L. Ouyang, Vision Foundation / IEEE, 2019, pp. 6700–6709.
Z. Zhao, S. Ding, S. Zhang, H. Duan, W. Zhang, [846] P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu,
H. Yan, X. Zhang, W. Li, J. Li, K. Chen, C. He, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain:
X. Zhang, Y. Qiao, D. Lin, and J. Wang, “Internlm- Multimodal reasoning via thought chains for science
xcomposer: A vision-language large model for ad- question answering,” in NeurIPS, 2022.
vanced text-image comprehension and composition,” [847] A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen,
CoRR, vol. abs/2309.15112, 2023. D. Parikh, and M. Rohrbach, “Towards vqa models
[833] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and that can read,” in Proceedings of the IEEE Conference
R. Zhao, “Shikra: Unleashing multimodal llm’s ref- on Computer Vision and Pattern Recognition, 2019, pp.
erential dialogue magic,” CoRR, vol. abs/2306.15195, 8317–8326.
2023. [848] F. Liu, T. Guan, Z. Li, L. Chen, Y. Yacoob, D. Manocha,
[834] F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang, and T. Zhou, “Hallusionbench: You see what you
“Aligning large multi-modal model with robust in- think? or you think what you see? an image-context
struction tuning,” CoRR, vol. abs/2306.14565, 2023. reasoning benchmark challenging for gpt-4v(ision),
[835] Y. Du, H. Guo, K. Zhou, W. X. Zhao, J. Wang, C. Wang, llava-1.5, and other multi-modality models,” CoRR,
M. Cai, R. Song, and J.-R. Wen, “What makes for vol. abs/2310.14566, 2023.
good visual instructions? synthesizing complex visual [849] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra,
reasoning instructions for visual instruction tuning,” C. L. Zitnick, and D. Parikh, “VQA: visual question
2023. answering,” in ICCV. IEEE Computer Society, 2015,
[836] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grau- pp. 2425–2433.
man, J. Luo, and J. P. Bigham, “Vizwiz grand chal- [850] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider:
lenge: Answering visual questions from blind peo- Consensus-based image description evaluation,” in
ple,” in CVPR. Computer Vision Foundation / IEEE CVPR. IEEE Computer Society, 2015, pp. 4566–4575.
Computer Society, 2018, pp. 3608–3617. [851] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
[837] A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down tuning,” CoRR, vol. abs/2304.08485, 2023.
and bottom-up cues for scene text recognition,” in [852] P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei,
CVPR. IEEE Computer Society, 2012, pp. 2687–2694. F. Meng, S. Huang, Y. Qiao, and P. Luo, “Lvlm-ehub:
[838] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, A comprehensive evaluation benchmark for large
Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin, vision-language models,” CoRR, vol. abs/2306.09265,
“Mmbench: Is your multi-modal model an all-around 2023.
player?” CoRR, vol. abs/2307.06281, 2023. [853] Z. Li, Y. Wang, M. Du, Q. Liu, B. Wu, J. Zhang,
[839] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, C. Zhou, Z. Fan, J. Fu, J. Chen, X. Huang, and Z. Wei,
121

“Reform-eval: Evaluating large vision language mod- 2021.


els via unified re-formulation of task-oriented bench- [866] J. Zhang, X. Zhang, J. Yu, J. Tang, J. Tang, C. Li,
marks,” CoRR, vol. abs/2310.02569, 2023. and H. Chen, “Subgraph retrieval enhanced model
[854] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and for multi-hop knowledge base question answering,”
Y. Shan, “Seed-bench: Benchmarking multimodal in Proceedings of the 60th Annual Meeting of the As-
llms with generative comprehension,” CoRR, vol. sociation for Computational Linguistics (Volume 1: Long
abs/2307.16125, 2023. Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022.
[855] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, Association for Computational Linguistics, 2022, pp.
X. Wang, and L. Wang, “Mm-vet: Evaluating large 5773–5784.
multimodal models for integrated capabilities,” CoRR, [867] P. Ke, H. Ji, Y. Ran, X. Cui, L. Wang, L. Song, X. Zhu,
vol. abs/2308.02490, 2023. and M. Huang, “Jointgt: Graph-text joint represen-
[856] J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and Y. Jiang, tation learning for text generation from knowledge
“To see is to believe: Prompting GPT-4V for better graphs,” in Findings of the Association for Computational
visual instruction tuning,” CoRR, vol. abs/2311.07574, Linguistics: ACL/IJCNLP 2021, Online Event, August 1-
2023. 6, 2021, ser. Findings of ACL, vol. ACL/IJCNLP 2021.
[857] Y. Zhang, R. Zhang, J. Gu, Y. Zhou, N. Lipka, D. Yang, Association for Computational Linguistics, 2021, pp.
and T. Sun, “Llavar: Enhanced visual instruction tun- 2526–2538.
ing for text-rich image understanding,” arXiv preprint [868] O. Agarwal, H. Ge, S. Shakeri, and R. Al-Rfou, “Large
arXiv:2306.17107, 2023. scale knowledge graph based synthetic corpus gener-
[858] X. Qi, K. Huang, A. Panda, M. Wang, and P. Mittal, ation for knowledge-enhanced language model pre-
“Visual adversarial examples jailbreak aligned large training,” CoRR, vol. abs/2010.12688, 2020.
language models,” in The Second Workshop on New [869] W. Chen, Y. Su, X. Yan, and W. Y. Wang, “KGPT:
Frontiers in Adversarial Machine Learning, 2023. knowledge-grounded pre-training for data-to-text
[859] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, generation,” in Proceedings of the 2020 Conference
M. Bansal, and H. Yao, “Analyzing and mitigating on Empirical Methods in Natural Language Processing,
object hallucination in large vision-language models,” EMNLP 2020, Online, November 16-20, 2020. Associ-
arXiv preprint arXiv:2310.00754, 2023. ation for Computational Linguistics, 2020, pp. 8635–
[860] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, 8648.
L.-Y. Gui, Y.-X. Wang, Y. Yang et al., “Aligning large [870] Y. Gu, X. Deng, and Y. Su, “Don’t generate, discrimi-
multimodal models with factually augmented rlhf,” nate: A proposal for grounding language models to
arXiv preprint arXiv:2309.14525, 2023. real-world environments,” in Proceedings of the 61st
[861] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and Annual Meeting of the Association for Computational
X. Wu, “Unifying large language models and knowl- Linguistics (Volume 1: Long Papers), ACL 2023, Toronto,
edge graphs: A roadmap,” CoRR, vol. abs/2306.08302, Canada, July 9-14, 2023. Association for Computa-
2023. tional Linguistics, 2023, pp. 4928–4949.
[862] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, [871] L. Luo, Y. Li, G. Haffari, and S. Pan, “Reasoning
J. Chen, and K. Srinivas, “Semtab 2019: Resources to on graphs: Faithful and interpretable large language
benchmark tabular data to knowledge graph match- model reasoning,” CoRR, vol. abs/2310.01061, 2023.
ing systems,” in The Semantic Web - 17th International [872] Y. Lan and J. Jiang, “Query graph generation for
Conference, ESWC 2020, Heraklion, Crete, Greece, May answering multi-hop complex questions from knowl-
31-June 4, 2020, Proceedings, ser. Lecture Notes in Com- edge bases,” in Proceedings of the 58th Annual Meeting of
puter Science, vol. 12123. Springer, 2020, pp. 514–530. the Association for Computational Linguistics, ACL 2020,
[863] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, Online, July 5-10, 2020, D. J. and, Ed. Association for
J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, Computational Linguistics, 2020, pp. 969–974.
W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, [873] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian,
X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang, M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and
“ERNIE 3.0: Large-scale knowledge enhanced pre- H. Chen, “Easyedit: An easy-to-use knowledge edit-
training for language understanding and generation,” ing framework for large language models,” CoRR, vol.
CoRR, vol. abs/2107.02137, 2021. [Online]. Available: abs/2308.07269, 2023.
https://arxiv.org/abs/2107.02137 [874] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng,
[864] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and H. Chen, and N. Zhang, “Editing large language mod-
Q. Liu, “ERNIE: enhanced language representation els: Problems, methods, and opportunities,” CoRR,
with informative entities,” in Proceedings of the 57th vol. abs/2305.13172, 2023.
Conference of the Association for Computational Linguis- [875] S. Choi, T. Fang, Z. Wang, and Y. Song, “KCTS:
tics, ACL 2019, Florence, Italy, July 28- August 2, 2019, knowledge-constrained tree search decoding with
Volume 1: Long Papers. Association for Computational token-level hallucination detection,” CoRR, vol.
Linguistics, 2019, pp. 1441–1451. abs/2310.09044, 2023.
[865] X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li, and [876] S. Zhang, L. Pan, J. Zhao, and W. Y. Wang, “Mit-
J. Tang, “KEPLER: A unified model for knowledge igating language model hallucination with inter-
embedding and pre-trained language representation,” active question-knowledge alignment,” CoRR, vol.
Trans. Assoc. Comput. Linguistics, vol. 9, pp. 176–194, abs/2305.13669, 2023.
122

[877] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, abs/2309.10691, 2023.


Y. Yao, S. Deng, H. Chen, and N. Zhang, “Llms [892] S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. Weston,
for knowledge graph construction and reasoning: and X. Li, “Branch-solve-merge improves large lan-
Recent capabilities and future opportunities,” CoRR, guage model evaluation and generation,” CoRR, vol.
vol. abs/2305.13168, 2023. [Online]. Available: https: abs/2310.15123, 2023.
//doi.org/10.48550/arXiv.2305.13168 [893] X. Zhang, B. Yu, H. Yu, Y. Lv, T. Liu, F. Huang, H. Xu,
[878] S. Russell and P. Norvig, Artificial Intelligence: and Y. Li, “Wider and deeper LLM networks are fairer
A Modern Approach (4th Edition). Pearson, 2020. LLM evaluators,” CoRR, vol. abs/2308.01862, 2023.
[Online]. Available: http://aima.cs.berkeley.edu/ [894] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang,
[879] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. J. Fu, and Z. Liu, “Chateval: Towards better llm-based
Gershman, “Building machines that learn and think evaluators through multi-agent debate,” CoRR, vol.
like people,” CoRR, vol. abs/1604.00289, 2016. abs/2308.07201, 2023.
[880] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, [895] R. Li, T. Patel, and X. Du, “PRD: peer rank and
K. Narasimhan, and Y. Cao, “React: Synergizing rea- discussion improve large language model based eval-
soning and acting in language models,” CoRR, vol. uations,” CoRR, vol. abs/2307.02762, 2023.
abs/2210.03629, 2022. [896] L. Zhu, X. Wang, and X. Wang, “Judgelm: Fine-tuned
[881] 2023. [Online]. Available: https://github.com/ large language models are scalable judges,” CoRR, vol.
AntonOsika/gpt-engineer abs/2310.17631, 2023.
[882] X. Team, “Xagent: An autonomous agent for complex [897] Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal, and D. Chen,
task solving,” 2023. “Evaluating large language models at evaluating in-
[883] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, struction following,” CoRR, vol. abs/2310.07641, 2023.
and B. Ghanem, “CAMEL: communicative agents for [898] R. Koo, M. Lee, V. Raheja, J. I. Park, Z. M. Kim,
”mind” exploration of large scale language model and D. Kang, “Benchmarking cognitive biases in
society,” CoRR, vol. abs/2303.17760, 2023. large language models as evaluators,” CoRR, vol.
[884] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, abs/2309.17012, 2023.
C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, [899] P. West, X. Lu, N. Dziri, F. Brahman, L. Li, J. D.
C. Ran, L. Xiao, and C. Wu, “Metagpt: Meta pro- Hwang, L. Jiang, J. Fisher, A. Ravichander, K. Chandu,
gramming for multi-agent collaborative framework,” B. Newman, P. W. Koh, A. Ettinger, and Y. Choi, “The
CoRR, vol. abs/2308.00352, 2023. generative AI paradox: ”what it can create, it may not
[885] C. Pham, B. Liu, Y. Yang, Z. Chen, T. Liu, J. Yuan, understand”,” CoRR, vol. abs/2311.00059, 2023.
B. A. Plummer, Z. Wang, and H. Yang, “Let mod- [900] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu,
els speak ciphers: Multiagent debate through embed- X. Song, and D. Zhou, “Large language models cannot
dings,” CoRR, vol. abs/2310.06272, 2023. self-correct reasoning yet,” CoRR, vol. abs/2310.01798,
[886] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mor- 2023.
datch, “Improving factuality and reasoning in lan- [901] K. Stechly, M. Marquez, and S. Kambhampati, “GPT-
guage models through multiagent debate,” CoRR, vol. 4 doesn’t know it’s wrong: An analysis of itera-
abs/2305.14325, 2023. tive prompting for reasoning problems,” CoRR, vol.
[887] M. Karpinska, N. Akoury, and M. Iyyer, “The per- abs/2310.12397, 2023.
ils of using mechanical turk to evaluate open-ended [902] O. Nov, N. Singh, and D. M. Mann, “Putting chat-
text generation,” in Proceedings of the 2021 Conference gpt’s medical advice to the (turing) test,” CoRR, vol.
on Empirical Methods in Natural Language Processing, abs/2301.10035, 2023.
EMNLP 2021, Virtual Event / Punta Cana, Dominican [903] K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Ananiadou,
Republic, 7-11 November, 2021, M. Moens, X. Huang, “On the evaluations of chatgpt and emotion-enhanced
L. Specia, and S. W. Yih, Eds. Association for Com- prompting for mental health analysis,” CoRR, vol.
putational Linguistics, 2021, pp. 1265–1285. abs/2304.03347, 2023.
[888] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, [904] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T.
C. Bishop, V. Carbune, and A. Rastogi, “RLAIF: scal- Stüber, J. Topalis, T. Weber, P. Wesp, B. O. Sabel,
ing reinforcement learning from human feedback with J. Ricke, and M. Ingrisch, “Chatgpt makes medicine
AI feedback,” CoRR, vol. abs/2309.00267, 2023. easy to swallow: An exploratory case study on sim-
[889] T. Wang, P. Yu, X. E. Tan, S. O’Brien, R. Pa- plified radiology reports,” CoRR, vol. abs/2212.14882,
sunuru, J. Dwivedi-Yu, O. Golovneva, L. Zettlemoyer, 2022.
M. Fazel-Zarandi, and A. Celikyilmaz, “Shepherd: [905] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn,
A critic for language model generation,” CoRR, vol. L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal,
abs/2308.04592, 2023. M. Schaekermann, A. Wang, M. Amin, S. Lachgar,
[890] G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, P. A. Mansfield, S. Prakash, B. Green, E. Dominowska,
G. Xie, Z. Liu, and M. Sun, “Ultrafeedback: Boosting B. A. y Arcas, N. Tomasev, Y. Liu, R. Wong, C. Sem-
language models with high-quality feedback,” CoRR, turs, S. S. Mahdavi, J. K. Barral, D. R. Webster, G. S.
vol. abs/2310.01377, 2023. Corrado, Y. Matias, S. Azizi, A. Karthikesalingam, and
[891] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng, V. Natarajan, “Towards expert-level medical question
and H. Ji, “MINT: evaluating llms in multi-turn inter- answering with large language models,” CoRR, vol.
action with tools and language feedback,” CoRR, vol. abs/2305.09617, 2023.
123

[906] S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y. Jia, and H. Zan, “Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue,” CoRR, vol. abs/2308.03549, 2023.
[907] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts, G. K. Savova, R. H. Mak, and D. S. Bitterman, “The utility of chatgpt for cancer treatment information,” medRxiv, 2023.
[908] K. Malinka, M. Peresíni, A. Firc, O. Hujnak, and F. Janus, “On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree?” CoRR, vol. abs/2303.11146, 2023.
[909] T. Susnjak, “Chatgpt: The end of online exam integrity?” CoRR, vol. abs/2212.09292, 2022.
[910] K. Tan, T. Pang, and C. Fan, “Towards applying powerful large ai models in classroom teaching: Opportunities, challenges and prospects,” 2023.
[911] F. Kamalov and I. Gurrib, “A new era of artificial intelligence in education: A multifaceted revolution,” CoRR, vol. abs/2305.18303, 2023.
[912] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., “Chatgpt for good? on opportunities and challenges of large language models for education,” Learning and Individual Differences, vol. 103, p. 102274, 2023.
[913] A. Blair-Stanek, N. Holzenberger, and B. V. Durme, “Can GPT-3 perform statutory reasoning?” CoRR, vol. abs/2302.06100, 2023.
[914] D. Trautmann, A. Petrova, and F. Schilder, “Legal prompt engineering for multilingual legal judgement prediction,” CoRR, vol. abs/2212.02199, 2022.
[915] J. H. Choi, K. E. Hickman, A. Monahan, and D. Schwarcz, “Chatgpt goes to law school,” Available at SSRN, 2023.
[916] J. J. Nay, “Law informs code: A legal informatics approach to aligning artificial intelligence with humans,” CoRR, vol. abs/2209.13020, 2022.
[917] F. Yu, L. Quartey, and F. Schilder, “Legal prompting: Teaching a language model to think like a lawyer,” CoRR, vol. abs/2212.01326, 2022.
[918] D. Trautmann, A. Petrova, and F. Schilder, “Legal prompt engineering for multilingual legal judgement prediction,” CoRR, vol. abs/2212.02199, 2022.
[919] A. Tamkin, M. Brundage, J. Clark, and D. Ganguli, “Understanding the capabilities, limitations, and societal impact of large language models,” CoRR, vol. abs/2102.02503, 2021.
[920] Z. Sun, “A short survey of viewing large language models in legal aspect,” CoRR, vol. abs/2303.09136, 2023.
[921] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-muslim bias in large language models,” in AIES ’21: AAAI/ACM Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021, M. Fourcade, B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM, 2021, pp. 298–306.
[922] A. Shah and S. Chava, “Zero is not hero yet: Benchmarking zero-shot performance of llms for financial tasks,” CoRR, vol. abs/2305.16633, 2023.
[923] D. Araci, “Finbert: Financial sentiment analysis with pre-trained language models,” CoRR, vol. abs/1908.10063, 2019.
[924] J. C. S. Alvarado, K. Verspoor, and T. Baldwin, “Domain adaption of named entity recognition to support credit risk assessment,” in Proceedings of the Australasian Language Technology Association Workshop, ALTA 2015, Parramatta, Australia, December 8 - 9, 2015, B. Hachey and K. Webster, Eds. ACL, 2015, pp. 84–90.
[925] G. Son, H. Jung, M. Hahm, K. Na, and S. Jin, “Beyond classification: Financial reasoning in state-of-the-art language models,” CoRR, vol. abs/2305.01505, 2023.
[926] X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters,” arXiv preprint arXiv:2305.12002, 2023.
[927] H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-source financial large language models,” CoRR, vol. abs/2306.06031, 2023.
[928] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 2567–2577.
[929] A. Krithara, A. Nentidis, K. Bougiatiotis, and G. Paliouras, “Bioasq-qa: A manually curated corpus for biomedical question answering,” 2022.
[930] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, and H. Chen, “Oceangpt: A large language model for ocean science tasks,” CoRR, vol. abs/2310.02031, 2023.
[931] C. Zhang, C. Zhang, C. Li, Y. Qiao, S. Zheng, S. K. Dam, M. Zhang, J. U. Kim, S. T. Kim, J. Choi, G. Park, S. Bae, L. Lee, P. Hui, I. S. Kweon, and C. S. Hong, “One small step for generative ai, one giant leap for AGI: A complete survey on chatgpt in AIGC era,” CoRR, vol. abs/2304.06488, 2023.
[932] M. Haman and M. Skolnik, “Using chatgpt to conduct a literature review,” Accountability in Research, 2023.
[933] Ö. Aydın and E. Karaarslan, “Openai chatgpt generated literature review: Digital twin in healthcare,” SSRN Electronic Journal, 2022.
[934] Y. J. Park, D. Kaplan, Z. Ren, C. Hsu, C. Li, H. Xu, S. Li, and J. Li, “Can chatgpt be used to generate scientific hypotheses?” CoRR, vol. abs/2304.12208, 2023.
[935] M. M. Hassan, R. A. Knipper, and S. K. K. Santu, “Chatgpt as your personal data scientist,” CoRR, vol. abs/2305.13657, 2023.
[936] L. Cheng, X. Li, and L. Bing, “Is GPT-4 a good data analyst?” CoRR, vol. abs/2305.15038, 2023.
[937] S. I. M. Hussam Alkaissi, “Artificial hallucinations in chatgpt: Implications in scientific writing,” PubMed, 2023.
[938] A. Azaria, R. Azoulay, and S. Reches, “Chatgpt is a remarkable tool – for experts,” CoRR, vol. abs/2306.03102, 2023.
[939] O. O. Buruk, “Academic writing with GPT-3.5: reflections on practices, efficacy and transparency,” CoRR, vol. abs/2304.11079, 2023.
[940] R. Liu and N. B. Shah, “Reviewergpt? an exploratory study on using large language models for paper reviewing,” CoRR, vol. abs/2306.00622, 2023.
[941] M. Kosinski, “Theory of mind may have spontaneously emerged in large language models,” CoRR, vol. abs/2302.02083, 2023.
[942] M. M. Amin, E. Cambria, and B. W. Schuller, “Will affective computing emerge from foundation models and general ai? A first evaluation on chatgpt,” CoRR, vol. abs/2303.03186, 2023.
[943] G. Sridhara, R. H. G., and S. Mazumdar, “Chatgpt: A study on its utility for ubiquitous software engineering tasks,” CoRR, vol. abs/2305.16837, 2023.
[944] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li, G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian, Y. Liu, and Z. Chen, “Automatic code summarization via chatgpt: How far are we?” CoRR, vol. abs/2305.12865, 2023.
[945] C. S. Xia and L. Zhang, “Conversational automated program repair,” CoRR, vol. abs/2301.13246, 2023.
[946] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan, Y. Xie, Y. Li, B. Ding, and J. Zhou, “Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning,” 2023.
