A Survey On Efficient Inference For Large Language Models
Abstract—Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we summarize the acquired knowledge and discuss future research directions.

• Z. Zhou, K. Hong, T. Fu, S. Li, L. Wang are with Infinigence-AI and the Department of Electronic Engineering, Tsinghua University, China. E-mail: zhouzx21@mails.tsinghua.edu.cn (Z. Zhou)
• X. Ning, Y. Lou, Y. Wang are with the Department of Electronic Engineering, Tsinghua University, China. E-mail: foxdoraame@gmail.com (X. Ning), yu-wang@tsinghua.edu.cn (Y. Wang)
• J. Xu, G. Dai are with Infinigence-AI and the Department of Electronic Engineering, Shanghai Jiao Tong University, China. E-mail: daiguohao@sjtu.edu.cn (G. Dai)
• X.-P. Zhang, Y. Dong are with Tsinghua Shenzhen International Graduate School. E-mail: xpzhang@ieee.org (X.-P. Zhang), dongyuhan@sz.tsinghua.edu.cn (Y. Dong)
• Z. Yuan, S. Yan are with Infinigence-AI.
• X. Li is with Peking University.
• Corresponding authors: Yu Wang, Xuefei Ning, Guohao Dai.
• *Equal contribution.

1 INTRODUCTION

Large Language Models (LLMs) have garnered substantial attention from both academia and industry in recent years. The field of LLMs has experienced notable growth and significant achievements. Numerous open-source LLMs have emerged, including the GPT-series (GPT-1 [1], GPT-2 [2], and GPT-3 [3]), OPT [4], the LLaMA-series (LLaMA [5], LLaMA 2 [5], Baichuan 2 [6], Vicuna [7], LongChat [8]), BLOOM [9], FALCON [10], GLM [11], and Mistral [12], which are used for both academic research and commercial purposes. The success of LLMs stems from their robust capability in handling diverse tasks such as natural language understanding (NLU), natural language generation (NLG), reasoning [13], [14], and code generation [15], consequently enabling impactful applications like ChatGPT, Copilot, and Bing. There is a growing belief [16] that the rise and achievements of LLMs signify a significant stride towards Artificial General Intelligence (AGI) for humanity.

Fig. 1. The challenges of LLM deployment: higher computational cost, higher memory access cost, and higher memory cost during inference lead to higher latency, lower throughput, higher power consumption, and higher storage requirements.

However, the deployment of LLMs does not always go smoothly. As shown in Fig. 1, LLM inference typically demands high computational cost, memory access cost, and memory usage (we analyze the root causes in Sec. 2.3), which degrades efficiency indicators (e.g., latency, throughput, power consumption and storage) in resource-constrained scenarios. This poses challenges for the application of LLMs in both edge and cloud scenarios. For example, the immense storage requirements render the deployment of a 70-billion-parameter model impractical on personal laptops for tasks such as development assistance. Additionally, low throughput would result in significant costs if LLMs were used for every search engine request, leading to a considerable reduction in the profits of the search engine.

Fortunately, a substantial array of techniques has been proposed to enable efficient inference for LLMs. To gain a comprehensive understanding of existing studies and inspire further research, this survey employs a hierarchical classification and systematic summarization of the current landscape of efficient LLM inference. Specifically, we categorize relevant studies into three levels: data-level optimization, model-level optimization, and system-level optimization (refer to Sec. 3 for elaboration). Moreover, we conduct experimental analyses on representative methods
within critical sub-fields to consolidate knowledge, offer practical recommendations, and provide guidance for future research endeavors.

TABLE 1
Comparison of existing surveys.

                    Optimization Levels
Survey              Data-level   Model-level   System-level   Experimental Analysis
[17], [18], [19]                 ✓
[20]                ✓            ✓
[21]                             ✓             ✓
[22], [23]          ✓            ✓             ✓
Ours                ✓            ✓             ✓              ✓

Currently, several surveys [17], [18], [19], [20], [21], [22] have been conducted in the field of efficient LLMs. These surveys primarily focus on different aspects of LLM efficiency but offer opportunities for further improvement. Zhu et al. [17], Park et al. [18] and Wang et al. [19] concentrate on model compression techniques within model-level optimization. Ding et al. [20] center on efficiency research considering both data and model architecture perspectives. Miao et al. [21] approach efficient LLM inference from a machine learning system (MLSys) research perspective. In contrast, our survey provides a more comprehensive research scope, addressing optimization at three levels: data-level, model-level, and system-level, with the inclusion of recent advancements. While Wan et al. [22] and Xu et al. [23] also deliver comprehensive reviews of efficient LLM research, our work extends theirs by incorporating comparative experiments and offering practical insights and recommendations based on experimental analyses in several critical sub-fields, such as model quantization and serving systems. A comparison of these surveys is summarized in Table 1.

The remainder of this survey is organized as follows: Sec. 2 introduces the basic concepts and background of LLMs and presents a detailed analysis of the efficiency bottlenecks in the LLM inference process. Sec. 3 presents our taxonomy. Sec. 4 to Sec. 6 respectively present and discuss studies on efficiency optimization at the three distinct levels. Sec. 7 offers broader discussions for several key application scenarios. Sec. 8 concludes the key contributions of this survey.

2 PRELIMINARIES

2.1 Transformer-Style LLMs

Language modeling, as the fundamental function of language models (LMs), involves modeling the likelihood of word sequences and predicting the distribution of subsequent words. Over recent years, researchers have discovered that scaling up language models not only enhances their language modeling ability but also engenders emergent capabilities for tackling more intricate tasks beyond conventional NLP tasks [24]. These scaled-up language models are referred to as large language models (LLMs).

The mainstream LLMs are designed based on the Transformer architecture [25]. Specifically, a typical Transformer architecture is composed of several stacked Transformer blocks. Typically, a Transformer block consists of a Multi-Head Self-Attention (MHSA) block, a Feed Forward Network (FFN), and a LayerNorm (LN) operation. Each block receives the output features of the previous block as its input and passes the features through each sub-module to obtain the output. Specifically, before the first block, a tokenizer is used to convert the original input sentence into a sequence of tokens, and a subsequent embedding layer converts the tokens into the input features. Then, additional position embeddings are added to the input features to encode the sequential order of each input token.

The core concept of the Transformer architecture is the self-attention mechanism, which is adopted in the MHSA block. Specifically, denoting the input features as X = [x_1, x_2, ..., x_n], the MHSA block applies linear projections to them and obtains a set of queries Q, keys K and values V as in Eq. 1:

    Q_i = X W^Q_i,  K_i = X W^K_i,  V_i = X W^V_i,    (1)

where W^Q_i, W^K_i and W^V_i are the projection matrices corresponding to the i-th attention head. Then the self-attention operation is applied to each tuple (Q_i, K_i, V_i) to obtain the feature of the i-th attention head Z_i as in Eq. 2:

    Z_i = Attention(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / √d_k) V_i,    (2)

where d_k is the dimension of the queries (keys). Note that because the self-attention operation involves matrix multiplication over the whole sequence, its computational complexity is quadratic in the input length. Finally, the MHSA block concatenates the features of all the attention heads and applies a linear projection to them to form its output Z as in Eq. 3:

    Z = Concat(Z_1, Z_2, ..., Z_h) W^O,    (3)

where W^O is the projection matrix. As can be seen, the self-attention mechanism allows the model to identify the importance of different input parts regardless of their distance, and thus can capture long-range dependencies and complex relationships in the input sentence.

Another important module in the Transformer block is the FFN. Typically, the FFN is placed after the MHSA block and consists of two linear transformation layers with a non-linear activation function. It receives the output features X from the MHSA block and processes them as in Eq. 4:

    FFN(X) = W_2 σ(W_1 X),    (4)

where W_1 and W_2 denote the weight matrices of the two linear layers, and σ(·) denotes the activation function.
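To make Eqs. 1-4 concrete, the following minimal PyTorch sketch implements one Transformer block as described above: multi-head self-attention (Eqs. 1-3) followed by a two-layer FFN (Eq. 4). It is an illustrative simplification, not the configuration of any particular LLM: the dimensions are toy values; LayerNorm, residual connections, causal masking, and dropout are omitted; ReLU stands in for the activation σ(·); and the row-vector convention X·W is used, which matches Eq. 4 up to transposition.

```python
import math
import torch

def mhsa(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention following Eqs. (1)-(3).
    X: (n, d); Wq/Wk/Wv/Wo: (d, d)."""
    n, d = X.shape
    d_k = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # Eq. (1)
    # Split into heads: (n_heads, n, d_k).
    Q = Q.view(n, n_heads, d_k).transpose(0, 1)
    K = K.view(n, n_heads, d_k).transpose(0, 1)
    V = V.view(n, n_heads, d_k).transpose(0, 1)
    attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    Z = attn @ V                                           # Eq. (2), per head
    Z = Z.transpose(0, 1).reshape(n, d)                    # concatenate heads
    return Z @ Wo                                          # Eq. (3)

def ffn(X, W1, W2):
    """Two-layer feed-forward network, Eq. (4); ReLU stands in for sigma."""
    return torch.relu(X @ W1) @ W2

# Toy dimensions, illustrative only.
n, d, d_ff, n_heads = 8, 64, 256, 4
X = torch.randn(n, d)
Wq, Wk, Wv, Wo = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
W1, W2 = torch.randn(d, d_ff) * d ** -0.5, torch.randn(d_ff, d) * d_ff ** -0.5
out = ffn(mhsa(X, Wq, Wk, Wv, Wo, n_heads), W1, W2)
print(out.shape)  # torch.Size([8, 64])
```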
2.2 Inference Process of LLMs

The most popular LLMs, i.e., decoder-only LLMs, often adopt the auto-regressive method to generate the output sentence. Specifically, the auto-regressive method generates tokens one by one. In each generation step, the LLM takes the whole token sequence as input, including the input tokens and the previously generated tokens, and generates the next token. As the sequence length increases, the time cost of the generation process grows rapidly. To address this challenge, a crucial technique, namely the key-value (KV) cache, has been introduced to expedite the generation
process. The KV cache technique, as its name suggests, involves storing and reusing previous key (K) and value (V) pairs within the Multi-Head Self-Attention (MHSA) block. This technique has been widely adopted in LLM inference engines and systems due to its substantial reduction of generation latency. Based on the above methods and techniques, the inference process of LLMs can be divided into two stages:
• Prefilling Stage: The LLM calculates and stores the KV cache of the initial input tokens, and generates the first output token, as shown in Fig. 2(a).
• Decoding Stage: The LLM generates the output tokens one by one with the KV cache, and then updates it with the key (K) and value (V) pairs of the newly generated token, as shown in Fig. 2(b).
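The interaction between the two stages and the KV cache can be sketched as follows. This is a toy, self-contained illustration rather than the code of any real inference engine: the "model" is a single stripped-down attention layer with random weights shared across layers, greedy sampling is used, and causal masking inside the prompt is omitted.

```python
import torch

# Hypothetical stand-ins for illustration only: a real LLM computes K/V
# inside every Transformer layer and returns next-token logits.
VOCAB, D, N_LAYERS = 1000, 64, 2
W_k = torch.randn(D, D) * D ** -0.5
W_v = torch.randn(D, D) * D ** -0.5
embed = torch.randn(VOCAB, D)

def forward(token_ids, kv_cache):
    """Run the toy 'model' on the new tokens only, appending their K/V to the cache."""
    x = embed[token_ids]                                   # (t_new, D)
    for layer in range(N_LAYERS):
        k_new, v_new = x @ W_k, x @ W_v
        k_old, v_old = kv_cache[layer]
        # KV cache: keys/values of past tokens are reused, not recomputed.
        k = torch.cat([k_old, k_new]) if k_old is not None else k_new
        v = torch.cat([v_old, v_new]) if v_old is not None else v_new
        kv_cache[layer] = (k, v)
        attn = torch.softmax(x @ k.T / D ** 0.5, dim=-1)   # no causal mask (toy)
        x = attn @ v
    return x[-1] @ embed.T                                 # logits for the next token

prompt = torch.randint(0, VOCAB, (16,))
kv_cache = {layer: (None, None) for layer in range(N_LAYERS)}

# Prefilling stage: process the whole prompt at once, build the KV cache,
# and produce the first output token.
logits = forward(prompt, kv_cache)
next_token = int(logits.argmax())

# Decoding stage: feed back one token at a time, reusing the cached K/V.
generated = [next_token]
for _ in range(8):
    logits = forward(torch.tensor([next_token]), kv_cache)
    next_token = int(logits.argmax())
    generated.append(next_token)
print(generated)
```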
As shown in Fig. 3, we illustrate some critical efficiency indicators. As for latency, we denote first token latency as the latency to generate the first output token in the prefilling stage, while we denote per-output token latency as the average latency to generate one output token in the decoding stage. Besides, we use generation latency to denote the latency to generate the whole output token sequence. As for memory, we use model size to denote the memory to store the model weights, and KV cache size to denote the memory to store the KV cache. Additionally, peak memory denotes the maximum memory usage during the generation process, which is approximately equal to the sum of the memory for the model weights and the KV cache. Apart from latency and memory, throughput is also a widely-used indicator in LLM serving systems. We use token throughput to denote the number of generated tokens per second, and request throughput to denote the number of completed requests per second.
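These indicators can be measured directly around a generation loop, as sketched below. The prefill and generate_one_token callbacks are hypothetical hooks standing in for the prefilling stage and one decoding step of whatever engine is being profiled; they are not part of any specific framework's API.

```python
import time

def profile_generation(prefill, generate_one_token, prompt, max_new_tokens=128):
    """Measure first-token latency, per-output-token latency, generation latency,
    and token throughput for a single request. `prefill` and `generate_one_token`
    are hypothetical hooks into an inference engine."""
    start = time.perf_counter()
    state, token = prefill(prompt)                   # prefilling stage
    first_token_latency = time.perf_counter() - start

    tokens = [token]
    while len(tokens) < max_new_tokens:
        state, token = generate_one_token(state)     # one decoding step
        tokens.append(token)
    generation_latency = time.perf_counter() - start

    decode_time = generation_latency - first_token_latency
    per_output_token_latency = decode_time / max(len(tokens) - 1, 1)
    token_throughput = len(tokens) / generation_latency
    return {
        "first_token_latency_s": first_token_latency,
        "per_output_token_latency_s": per_output_token_latency,
        "generation_latency_s": generation_latency,
        "token_throughput_tok_per_s": token_throughput,
    }
```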
2.3 Efficiency Analysis

Deploying LLMs in resource-constrained scenarios while preserving their powerful capabilities poses a significant challenge for both practitioners and researchers. For instance, consider deploying a LLaMA-2-70B model, which contains 70 billion parameters. Storing its weights in FP16 format necessitates 140 GB of VRAM, requiring at least 6 RTX 3090Ti GPUs (each with 24 GB VRAM) or 2 NVIDIA A100 GPUs (each with 80 GB VRAM) for inference. As for latency, generating one token on 2 NVIDIA A100 GPUs requires approximately 100 milliseconds. Consequently, generating a sequence with hundreds of tokens requires more than 10 seconds. In addition to storage and latency, efficiency indicators such as throughput, energy and power consumption also need to be considered. During the LLM inference process, three important factors largely affect these indicators, i.e., the computational cost, the memory access cost and the memory usage. Yuan et al. [26] provide a more systematic analysis that demonstrates how these factors affect inference efficiency with a roofline model. In the following, we further analyze three root causes of inefficiency in the LLM inference process, focusing on the above three key factors:
• Model Size: Mainstream LLMs typically incorporate billions or even trillions of parameters. For instance, the LLaMA-70B model comprises 70 billion parameters, while the GPT-3 model scales up to 175 billion parameters. This considerable model size contributes significantly to the elevated computational cost, memory access cost, and memory usage during the LLM inference process.
• Attention Operation: As illustrated in Sec. 2.1 and Sec. 2.2, in the prefilling stage, the self-attention operation exhibits quadratic computational complexity in the input length. Consequently, as the input length increases, the computational cost, memory access cost, and memory usage of the attention operation escalate rapidly.
• Decoding Approach: The auto-regressive decoding approach generates the tokens one by one. In each decoding step, all the model weights are loaded from the off-chip HBM to the GPU chip, leading to a large memory access cost. In addition, the size of the KV cache increases with the growth of the input length, potentially leading to fragmented memory and irregular memory access patterns.
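As a back-of-the-envelope check on the numbers above, the weight memory is simply the parameter count times the bytes per element, and the KV cache grows linearly with batch size and sequence length. The sketch below reproduces the 140 GB figure and adds an illustrative KV cache estimate; the architecture hyper-parameters (80 layers, 8 KV heads of dimension 128, i.e., a LLaMA-2-70B-like configuration with grouped-query attention) are assumptions made for the example rather than values taken from this survey.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory to store the model weights (FP16 -> 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, one entry per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem) / 1e9

# 70B parameters in FP16 -> roughly 140 GB of weights, matching the text.
print(round(weight_memory_gb(70e9)))                 # 140

# Assumed LLaMA-2-70B-like configuration, 4K context, batch size 8.
print(round(kv_cache_gb(80, 8, 128, 4096, 8), 1))    # ~10.7 GB on top of the weights
```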
Fig. 4. Taxonomy of efficient inference methods for Large Language Models: data-level optimization (Sec. 4) covers input compression (Sec. 4.1; prompt pruning, prompt summary, soft prompt-based compression, retrieval-augmented generation) and output organization (Sec. 4.2); model-level optimization (Sec. 5) covers quantization (post-training quantization, quantization-aware training), sparsification (weight pruning, sparse attention), structure optimization (structure factorization, neural architecture search), knowledge distillation (white-box KD, black-box KD), and dynamic inference under model compression (Sec. 5.2); system-level optimization covers the inference engine and serving system (e.g., distributed systems).
3 TAXONOMY

In the aforementioned discussion, we identify the key factors (i.e., computational cost, memory access cost and memory usage) that significantly impact the efficiency of the LLM inference process, and further analyze three root causes (i.e., model size, attention operation and decoding approach). Many efforts have been made to optimize the inference efficiency from different perspectives. By carefully reviewing and summarizing these studies, we classify them into three levels, i.e., data-level optimization, model-level optimization and system-level optimization (as shown in Fig. 4):
• Data-level Optimization refers to improving the efficiency via optimizing the input prompts (i.e., input compression) or better organizing the output content (i.e., output organization). This line of optimization typically does not change the original model, and is thus free of costly model training (note that a small amount of training for auxiliary models might be required, but this cost can be ignored compared with the training cost of the original LLMs).
• Model-level Optimization refers to designing an efficient model structure (i.e., efficient structure design) or compressing the pre-trained models (i.e., model compression) in the inference process to improve efficiency. This line of optimization (1) often requires costly pre-training or a smaller amount of fine-tuning to retain or recover the model ability, and (2) is typically lossy in model performance.
• System-level Optimization refers to optimizing the inference engine or the serving system. This line of optimization (1) does not involve costly model training, and (2) is typically lossless in model performance. In addition, we provide a brief introduction to hardware accelerator design in Sec. 6.3.
Fig. 5. Taxonomy of the input compression methods for Large Language Models.
train a pre-trained LM to compress the prompts into summary vectors via unsupervised learning. ICAE [33] trains an autoencoder to compress the original context into short memory slots. Specifically, ICAE employs a LoRA-adapted LLM as the encoder, and uses the target LLM as the decoder. A set of memory tokens is added before the input tokens and encoded into memory slots.

4.1.4 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) [27] aims to improve the quality of LLMs' responses by incorporating external knowledge sources. RAG can also be viewed as a technique to improve the inference efficiency when handling a large amount of data. Instead of merging all information into an excessively long prompt, RAG only adds the relevant retrieved information to the original prompt, ensuring that the model receives the necessary information while reducing the prompt length significantly. FLARE [28] uses predictions of upcoming sentences to proactively decide when and what information to retrieve. REPLUG [29] treats the LLM as a black box and augments it with a tuneable retrieval model. It prepends retrieved documents to the input for the frozen black-box LLM, and further utilizes the LLM to supervise the retrieval model. Self-RAG [30] enhances the LLM's quality and factuality through retrieval and self-reflection. It introduces reflection tokens to make the LLM controllable during the inference phase.

4.2 Output Organization

The traditional generation process of LLMs is entirely sequential, leading to significant time consumption. Output organization techniques aim to (partially) parallelize generation by organizing the structure of the output content.

Skeleton-of-Thought (SoT) [45] is pioneering in this direction. The core idea behind SoT is to leverage the emerging ability of LLMs to plan the structure of the output content. Specifically, SoT consists of two main phases. In the first phase (i.e., the skeleton phase), SoT instructs the LLM to generate a concise skeleton of the answer using a predefined "skeleton prompt." For instance, given a question like "What are the typical types of Chinese dishes?", the output at this stage would be a list of dishes (e.g., noodles, hot pot, rice) without elaborate descriptions. Then, in the second phase (i.e., the point-expanding phase), SoT instructs the LLM to expand each point in the skeleton simultaneously using a "point-expanding prompt," and then concatenates these expansions to form the final answer. When applied to open-source models, point-expanding can be performed through batch inference, which optimizes hardware utilization and reduces the overall generation latency using the same computational resources. To mitigate the additional computation overhead brought by the extra prompts (i.e., the skeleton prompt and the point-expanding prompt), SoT discusses the possibility of sharing the KV cache of the common prompt prefix across multiple points in the point-expanding phase. Additionally, SoT uses a router model to decide whether applying SoT is appropriate for a specific question, aiming to limit its use to suitable cases. As a result, SoT achieves up to a 2.39× speed-up on 12 recently released LLMs, and improves the answer quality for many questions by improving the diversity and relevance of their answers.

Fig. 6. Demonstration of the inference process of SoT: (a) the skeleton stage produces a concise list of points (e.g., "1. Noodles 2. Hot pot 3. Rice ..." for the question "What are the typical types of Chinese dishes?"); (b) the point-expanding stage expands each point in parallel (e.g., "1. Noodles: various noodle dishes, such as ...", "2. Hot pot: a communal pot of simmering broth at the center of the table ...", "3. Rice: Fried Rice, Yangzhou Fried Rice, and other rice-based dishes ...").
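The two SoT phases can be sketched as follows: request a skeleton, parse it into points, and expand all points in one batched call. The helpers llm_generate and llm_generate_batch are hypothetical wrappers around an arbitrary open-source LLM, and the prompt wording is illustrative rather than the exact prompts used by SoT [45].

```python
import re
from typing import Callable, List

def skeleton_of_thought(question: str,
                        llm_generate: Callable[[str], str],
                        llm_generate_batch: Callable[[List[str]], List[str]]) -> str:
    # Phase 1 (skeleton stage): ask for a short numbered list of points.
    skeleton_prompt = (
        "Answer the question with a concise skeleton of 3-10 short points, "
        f"formatted as '1. ...'.\nQuestion: {question}\nSkeleton:"
    )
    skeleton = llm_generate(skeleton_prompt)
    points = re.findall(r"^\s*\d+\.\s*(.+)$", skeleton, flags=re.MULTILINE)

    # Phase 2 (point-expanding stage): expand every point in parallel via
    # batch inference, which is where the latency reduction comes from.
    expand_prompts = [
        f"Question: {question}\nSkeleton: {skeleton}\n"
        f"Expand point {i + 1} ('{p}') into 1-2 sentences:"
        for i, p in enumerate(points)
    ]
    expansions = llm_generate_batch(expand_prompts)

    # Concatenate the expansions to form the final answer.
    return "\n".join(f"{i + 1}. {p}: {e.strip()}"
                     for i, (p, e) in enumerate(zip(points, expansions)))
```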
SGD [46] further extends the idea of SoT by organizing sub-problem points into a Directed Acyclic Graph (DAG) and answering the logic-independent sub-problems in parallel in one turn. Similar to SoT, SGD also leverages the emerging ability of LLMs to generate the output structure by providing manually-crafted prompts along with several examples. SGD relaxes the strict independence assumption among different points to enhance the quality of answers, especially for math and coding problems. Compared with SoT, SGD prioritizes answer quality over speed. Additionally, SGD introduces an adaptive model selection approach, assigning an optimal model size to handle each sub-problem based on its estimated complexity, thus further improving efficiency.

APAR [47] adopts a similar idea to SoT, leveraging LLMs to output special control tokens (i.e., [fork]) for automatically and dynamically triggering parallel decoding. To effectively exploit the inherent parallelizable structure within the output content and accurately generate control
tokens, APAR fine-tunes the LLMs on carefully-designed data formatted in a specific tree structure. As a result, APAR achieves an average 1.4∼2.0× speed-up on benchmarks and causes a negligible impact on the answer quality. Furthermore, APAR combines its decoding approach with the speculative decoding technique (i.e., Medusa [48]) and a serving system (i.e., vLLM [49]) to further improve the inference latency and system throughput, respectively.

SGLang [50] introduces a domain-specific language (DSL) in Python featuring primitives that flexibly facilitate LLM programming. The core idea behind SGLang is to automatically analyze the dependencies among various generation calls, and to perform batch inference and KV cache sharing based on this analysis. With this language, users can easily implement various prompting strategies (e.g., SoT [45], ToT [51]) and benefit from the automatic efficiency optimization of SGLang. Furthermore, SGLang introduces and combines several system-level compilation techniques, such as code movement and prefetching annotations.

4.3 Knowledge, Suggestions and Future Direction

The growing demand for LLMs to handle longer inputs and generate longer outputs highlights the importance of data-level optimization techniques. Within these techniques, input compression methods primarily target the prefilling stage by diminishing the computational and memory cost resulting from the attention operation. Additionally, for API-based LLMs, these methods can reduce the API cost associated with input tokens. In contrast, output organization methods concentrate on optimizing the decoding stage by alleviating the substantial memory access cost associated with the auto-regressive decoding approach.

As LLMs become more and more capable, there is potential to utilize them to compress the input prompts or structure the output content. Recent advancements in output organization methods [45], [46], [47] demonstrate the effectiveness of leveraging LLMs to organize the output content into independent points or a dependency graph, facilitating batch inference for improving generation latency. These methods capitalize on the inherent parallelizable structure within output content, enabling LLMs to perform parallel decoding to enhance hardware utilization and thereby reduce end-to-end generation latency.

Recently, diverse prompting pipelines (e.g., ToT [51], GoT [52]) and agent frameworks [53], [54], [55] are emerging. While these innovations enhance LLMs' capabilities, they also extend the length of inputs, leading to increased computational cost. To address this challenge, adopting input compression techniques to reduce input length shows promise as a solution. Simultaneously, these pipelines and frameworks naturally introduce more parallelism into output structures, offering increased potential for parallel decoding and key-value (KV) cache sharing across different decoding threads. SGLang [50] supports flexible LLM programming and offers opportunities for front-end and back-end co-optimization, laying the groundwork for further extensions and improvements in this area. In summary, data-level optimization, including input compression and output organization techniques, will become increasingly necessary to enhance efficiency in the foreseeable future.

In addition to optimizing the efficiency of existing frameworks, certain studies focus on designing more efficient agent frameworks directly. For example, FrugalGPT [56] proposes a model cascade comprising LLMs of varying sizes, with the inference process being halted early if the model reaches a sufficient level of certainty regarding the answer. This approach aims to achieve efficiency by leveraging a tiered model architecture and intelligent inference termination based on model confidence estimation. Compared with model-level dynamic inference techniques (Sec. 5.2.5), FrugalGPT performs dynamic inference at the pipeline level.

5 MODEL-LEVEL OPTIMIZATION

The model-level optimization for efficient LLM inference mainly concentrates on optimizing the model structure or data representation. Model structure optimization involves directly designing an efficient model structure, modifying the original model, or adjusting the inference-time architecture. In terms of data representation optimization, the model quantization technique is commonly employed.

In this section, we categorize model-level optimization techniques based on the additional training overhead they require. The first category involves designing more efficient model structures (referred to as efficient structure design). Models developed using this approach typically require training from scratch. The second category focuses on compressing pre-trained models (referred to as model compression). Compressed models in this category generally require only minimal fine-tuning to restore their performance.

5.1 Efficient Structure Design

Currently, state-of-the-art LLMs commonly employ the Transformer architecture, as discussed in Section 2.1. However, the key components of Transformer-based LLMs, including the Feed Forward Network (FFN) and the attention operation, present efficiency challenges during inference. We identify the causes as follows:
• The FFN contributes a substantial portion of the model parameters in Transformer-based LLMs, resulting in significant memory access cost and memory usage, particularly during the decoding stage. For instance, the FFN module accounts for 63.01% of the parameters in the LLaMA-7B model and 71.69% in the LLaMA-70B model.
• The attention operation demonstrates quadratic complexity in the input length, leading to substantial computational cost and memory usage, especially when dealing with longer input contexts.

To tackle these efficiency challenges, several studies have concentrated on developing more efficient model structures. We categorize these studies into three groups (as depicted in Fig. 7): efficient FFN design, efficient attention design, and Transformer alternates.

5.1.1 Efficient FFN Design

In this field, many studies concentrate on integrating the Mixture-of-Experts (MoE) technique [96] into LLMs to enhance their performance while maintaining the computational cost.
Fig. 7. Taxonomy of the efficient structure design for Large Language Models (efficient FFN design: Switch Transformers [86], MoEfication [87], MPOE [88], Sparse Upcycling [89], BASE [90], Expert Choice [91], SE-MoE [92], StableMoE [93], SMoE-Dropout [94], GLaM [95], Mixtral 8x7B [12]).
MoE dynamically allocates varying computational budgets to different input tokens. In MoE-based Transformers, multiple parallel Feed Forward Networks (FFNs), namely experts, are utilized alongside a trainable routing module. During inference, the model selectively activates specific experts for each token, controlled by the routing module.

Some studies concentrate on the construction of FFN experts, mainly focusing on optimizing the process of acquiring expert weights or making these experts more lightweight for efficiency. For instance, MoEfication [87] devises a method to transform a non-MoE LLM into the MoE version using its pre-trained weights. This approach eliminates the need for expensive pre-training of the MoE model. To accomplish this, MoEfication first divides the FFN neurons of the pre-trained LLM into multiple groups. Within each group, the neurons are commonly activated simultaneously by the activation function. Then, it restructures each group of neurons as an expert. Sparse Upcycling [89] introduces a method to initialize the weights of an MoE-based LLM directly from a dense model's checkpoint. In this approach, the experts within the MoE-based LLM are exact replicas of the FFN from the dense model. By employing this straightforward initialization, Sparse Upcycling can efficiently train the MoE model to achieve high performance. MPOE [88] proposes to reduce the parameters of MoE-based LLMs through Matrix Product Operator (MPO) decomposition. This method involves decomposing each weight matrix of the FFN into a global shared tensor containing common information and a set of local auxiliary tensors that capture specialized features.

Another line of research focuses on improving the design of the routing module (or strategy) within MoE models. In previous MoE models, the routing module often causes the load imbalance problem, which means that some experts are assigned a large number of tokens while the others handle only a few. This imbalance not only wastes the capacity of the under-utilized experts, which degrades model performance, but also degrades the inference efficiency. Current MoE implementations [86], [97], [98] often use batched matrix multiplication to compute all FFN experts simultaneously. This requires that the input matrices of all experts have the same shape. However, since the load imbalance problem exists, the input token sets of under-utilized experts need to be padded to meet the shape constraint, resulting in wasted computation. Therefore, the major aim of routing module design is to achieve better balance in the token assignment across MoE experts. Switch Transformers [86] introduces an additional loss, namely the load balancing loss, into the final loss function to penalize imbalanced assignments by the routing module. This loss is formulated as the scaled dot-product between the token assignment fraction vector and a uniform distribution vector. As a result, the loss is minimized only when the token assignment is balanced across all experts. This approach encourages the routing module to distribute tokens evenly among experts, promoting load balance and ultimately improving model performance and efficiency. BASE [90] learns an embedding for each expert in an end-to-end manner and then assigns experts to tokens based on the similarity of their embeddings. To ensure load balance, BASE formulates a linear assignment problem and utilizes the auction algorithm [99] to solve it efficiently. Expert Choice [91] introduces a simple yet effective strategy to ensure perfect load balance within MoE-based models. Unlike previous methods that assign experts to tokens, Expert Choice allows each expert to independently select the top-k tokens based on their embedding similarities. This approach ensures that each expert handles a fixed number of tokens, even though each token might be assigned to a different number of experts.
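A minimal sketch of token-to-expert routing with an auxiliary load-balancing loss is given below. The loss follows the formulation popularized by the Switch Transformers paper (alpha · N · Σ_i f_i · P_i, with f_i the fraction of tokens dispatched to expert i and P_i the mean routing probability of expert i); the code is a schematic illustration of the idea rather than a faithful reimplementation of any of the systems above.

```python
import torch
import torch.nn.functional as F

def route_and_balance(x, router_weight, n_experts, top_k=1, alpha=0.01):
    """Token-to-expert routing for one MoE layer.
    x: (n_tokens, d); router_weight: (d, n_experts).
    Returns per-token expert ids/weights and an auxiliary load-balancing loss."""
    logits = x @ router_weight                        # (n_tokens, n_experts)
    probs = F.softmax(logits, dim=-1)
    top_probs, top_experts = probs.topk(top_k, dim=-1)

    # f_i: fraction of tokens whose top-1 choice is expert i.
    counts = torch.bincount(top_experts[:, 0], minlength=n_experts).float()
    f = counts / x.shape[0]
    # P_i: mean routing probability assigned to expert i.
    P = probs.mean(dim=0)
    aux_loss = alpha * n_experts * torch.sum(f * P)   # minimized when balanced
    return top_experts, top_probs, aux_loss

# Toy usage: 32 tokens of dimension 16 routed among 4 experts.
x = torch.randn(32, 16)
router_weight = torch.randn(16, 4)
experts, weights, aux_loss = route_and_balance(x, router_weight, n_experts=4)
print(experts.shape, float(aux_loss))
```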
In addition to the aforementioned research focusing on the model architecture itself, there are also studies that concentrate on improving the training methods for MoE-based models. SE-MoE [92] introduces a new auxiliary loss called the router z-loss, which aims to enhance the stability of model training without compromising performance.
SE-MoE identifies that the exponential functions introduced by softmax operations in the routing module can exacerbate roundoff errors, leading to training instability. To address this issue, the router z-loss penalizes large logits that are input into the exponential functions, thereby minimizing roundoff errors during training. StableMoE [93] points out the routing fluctuation problem existing in MoE-based LLMs, which denotes the inconsistency of the expert assignment between the training and inference stages. The same input token may be assigned to different experts as training proceeds, but it only activates one expert at inference time. To address this issue, StableMoE suggests a more consistent training approach. It first learns a routing strategy and then keeps it fixed during both the model backbone training and the inference stage. SMoE-Dropout [94] designs a novel training method for MoE-based LLMs, which proposes to gradually increase the number of activated experts during the training process. This approach enhances the scalability of MoE-based models for inference and downstream fine-tuning. GLaM [95] pre-trains and releases a series of models with various parameter sizes, demonstrating their comparable performance to dense LLMs on few-shot tasks. The largest model in this family has a parameter size of up to 1.2 trillion. Mixtral 8x7B [12] is a remarkable recently released open-source model. During inference, it utilizes only 13 billion active parameters and achieves superior performance compared to the LLaMA-2-70B model across different benchmarks. Mixtral 8x7B consists of 8 Feed-Forward Network (FFN) experts in each layer, with each token assigned to two experts during inference.

5.1.2 Efficient Attention Design

The attention operation is a critical component of the Transformer architecture. However, its quadratic complexity in relation to the input length leads to substantial computational cost, memory access cost, and memory usage, especially when dealing with long contexts. To address this issue, researchers are exploring more efficient approaches to approximate the functionality of the original attention operation. These studies can be broadly categorized into two main branches: multi-query attention and low-complexity attention.

Multi-Query Attention. Multi-query attention (MQA) [75] optimizes the attention operation by sharing the key (K) and value (V) cache across different attention heads. This strategy effectively reduces both memory access cost and memory usage during inference, contributing to improved efficiency in Transformer models. As introduced in Sec. 2.2, Transformer-style LLMs typically adopt the multi-head attention (MHA) operation. This operation requires storing and retrieving K and V pairs for each attention head during the decoding stage, leading to substantial increases in memory access cost and memory usage. MQA tackles this challenge by using the same K and V pairs across different heads while maintaining distinct query (Q) values. Through extensive testing, it has been demonstrated that MQA significantly reduces memory requirements with only a minimal impact on model performance, making it a crucial strategy for enhancing inference efficiency. The concept of MQA is further extended by Grouped-query attention (GQA) [76], which can be seen as a blend of MHA and MQA. Specifically, GQA segments the attention heads into groups, storing a single set of K and V values for each group. This method not only sustains the benefits of MQA in reducing memory overhead but also offers an enhanced balance between inference speed and output quality.
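The difference between MHA, MQA, and GQA comes down to how many K/V heads are stored and how they are shared across query heads, as the sketch below illustrates; the shapes are toy values and caching and masking are omitted. With n_kv_heads = 1 it reduces to MQA, and with n_kv_heads equal to the number of query heads it recovers standard MHA.

```python
import math
import torch

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d_k); K, V: (n_kv_heads, n, d_k) with n_q_heads a
    multiple of n_kv_heads. Each group of query heads shares one K/V head,
    shrinking the KV cache by a factor of n_q_heads / n_kv_heads."""
    n_q_heads, n, d_k = Q.shape
    n_kv_heads = K.shape[0]
    group = n_q_heads // n_kv_heads
    # Repeat each KV head for the query heads in its group.
    K = K.repeat_interleave(group, dim=0)
    V = V.repeat_interleave(group, dim=0)
    attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return attn @ V                                    # (n_q_heads, n, d_k)

Q = torch.randn(8, 16, 32)   # 8 query heads
K = torch.randn(2, 16, 32)   # 2 shared KV heads (GQA); 1 -> MQA, 8 -> MHA
V = torch.randn(2, 16, 32)
print(grouped_query_attention(Q, K, V).shape)  # torch.Size([8, 16, 32])
```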
Low-Complexity Attention. Low-complexity attention methods aim to design new mechanisms that reduce the computational complexity of each attention head. To simplify the discussion, we assume that the dimensions of the Q (query), K (key), and V (value) matrices are identical, with Q, K, V ∈ R^{n×d}. Since the following work does not involve altering the number of attention heads as MQA does, our discussion focuses on the attention mechanism within each head. As introduced in Section 2.2, the computational complexity of the conventional attention mechanism scales as O(n^2), exhibiting quadratic growth with respect to the input length n. To address this inefficiency, kernel-based attention and low-rank attention methods have been proposed to reduce the complexity to O(n); a schematic sketch of both approaches is given after this list.
• Kernel-based Attention. Kernel-based attention designs a kernel ϕ to approximate the non-linear softmax operation of Softmax(QK^T) with a linear dot product between kernel-transformed feature maps, i.e., ϕ(Q)ϕ(K)^T. It avoids the conventional quadratic computation associated with QK^T ∈ R^{n×n} by prioritizing the computation of ϕ(K)^T V ∈ R^{d×d}, followed by its multiplication with ϕ(Q) ∈ R^{n×d}. Specifically, the input Q and K matrices are first mapped into kernel space using a kernel function ϕ, while maintaining their original dimensions. Leveraging the associative property of matrix multiplication allows for the multiplication of K and V prior to their interaction with Q. The attention mechanism is reformulated as:

    Softmax(QK^T)V ≈ ϕ(Q)(ϕ(K)^T V),    (5)

where ϕ(Q), ϕ(K) ∈ R^{n×d}. This strategy effectively reduces the computational complexity to O(nd^2), rendering it linear with respect to the input length. Linear Transformer [82] is the first work to propose kernel-based attention. It adopts ϕ(x) = elu(x) + 1 as the kernel function, where elu(·) denotes the exponential linear unit activation function. Performers [83] and RFA [84] propose to use random feature projection to better approximate the softmax function. PolySketchFormer [85] employs polynomial functions and sketching techniques to approximate the softmax function.
• Low-Rank Attention. The low-rank attention technique compresses the token dimension (i.e., n) of the K and V matrices to a smaller, fixed length (i.e., k) before performing the attention computation. The approach is based on the insight that the n × n attention matrix often exhibits a low-rank property, making it feasible to compress it along the token dimension. The main focus of this line of research is to design effective methods for the compression, where X can be the context matrix or the K and V matrices:

    X ∈ R^{n×d} → X′ ∈ R^{k×d}.    (6)

One line of work uses linear projection to compress the token dimension. This is done by multiplying the K and V matrices with projection matrices P_k, P_v ∈ R^{k×n}. In this way,
the computational complexity of the attention operation is reduced to O(nkd), which is linear in the input length. Linformer [77] first observes and analyzes the low-rank property of the attention map, and proposes the low-rank attention framework. LRT [78] proposes to simultaneously apply the low-rank transformation to both the attention block and the FFN to further improve the computational efficiency. FLuRKA [79] combines the low-rank transformation and kernelization of the attention matrices to further improve efficiency. Specifically, it first reduces the token dimension of the K and V matrices, and then applies the kernel function to the Q and low-rank K matrices.

Aside from linear projection, other token-dimension compression methods have also been proposed. Luna [80] and Set Transformer [81] leverage additional attention computations alongside smaller queries to effectively compress the K and V matrices. Luna [80] involves an extra query matrix of fixed length k. The small query performs attention with the original context matrix, termed pack attention, to compress the context matrix to size R^{k×d}. Subsequently, the regular attention, termed unpack attention, applies attention to the original Q matrices and the compressed K and V matrices. The extra query matrix can be learnable parameters or acquired from previous layers. Set Transformer [81] designs a similar technique by introducing an inducing points vector of fixed length. Unlike previous works that compress K and V, Funnel-Transformer [100] uses a pooling operation to gradually compress the sequence length of the Q matrix.
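To make the two low-complexity approaches concrete, the sketch below contrasts (i) kernel-based attention in the style of Eq. 5, with the elu(x)+1 feature map adopted by Linear Transformer, and (ii) a Linformer-style linear projection of K and V along the token dimension as in Eq. 6. Both are schematic single-head illustrations with random projection matrices, not reproductions of the cited implementations.

```python
import torch

def kernel_attention(Q, K, V):
    """Kernel-based linear attention (Eq. 5) with phi(x) = elu(x) + 1.
    Q, K, V: (n, d). Cost is O(n * d^2) instead of O(n^2 * d)."""
    phi_q = torch.nn.functional.elu(Q) + 1
    phi_k = torch.nn.functional.elu(K) + 1
    kv = phi_k.T @ V                                        # (d, d), computed first
    normalizer = phi_q @ phi_k.sum(dim=0, keepdim=True).T   # (n, 1)
    return (phi_q @ kv) / normalizer

def low_rank_attention(Q, K, V, Pk, Pv):
    """Linformer-style low-rank attention (Eq. 6): project K and V from
    n tokens down to k tokens before the softmax attention."""
    n, d = Q.shape
    K_proj, V_proj = Pk @ K, Pv @ V                         # (k, d)
    attn = torch.softmax(Q @ K_proj.T / d ** 0.5, dim=-1)   # (n, k)
    return attn @ V_proj                                    # O(n * k * d)

n, d, k = 128, 32, 16
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
Pk, Pv = torch.randn(k, n) / n ** 0.5, torch.randn(k, n) / n ** 0.5
print(kernel_attention(Q, K, V).shape, low_rank_attention(Q, K, V, Pk, Pv).shape)
```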
5.1.3 Transformer Alternates

In addition to applying efficient techniques to the attention operation, recent studies have also innovated in the design of sequence modeling architectures that are efficient yet effective. Table 2 compares the efficiency of some representative non-Transformer models. These architectures exhibit sub-quadratic computational complexity with respect to sequence length during both training and inference, enabling LLMs to significantly increase their context length.

Within this research field, two prominent lines of study have garnered significant attention. One line of studies concentrates on the State Space Model (SSM), which formulates sequence modeling as a recurrent transformation based on the HiPPO theory [62]. Additionally, other studies primarily focus on employing long convolutions or designing attention-like formulations to model sequences.

State Space Model. The State Space Model (SSM) has demonstrated competitive modeling capabilities in certain Natural Language Processing (NLP) [73] and Computer Vision (CV) [101] tasks. Compared to attention-based Transformers, the SSM exhibits linear computational and memory complexity with respect to the input sequence length, which enhances its efficiency in handling long-context sequences. In this survey, SSM refers to a series of model architectures that satisfy the following two properties: (1) They model the sequence based on the following formulation proposed by HiPPO [62] and LSSL [63]:

    x_k = A x_{k−1} + B u_k,
    y_k = C x_k,    (7)

where A, B and C denote the transition matrices, x denotes the intermediate state and u denotes the input sequence. (2) They design the transition matrix A based on the HiPPO theory [62]. Specifically, HiPPO proposes to compress the input sequence into a sequence of coefficients (namely the state) by projecting it onto a set of polynomial bases.

Building upon the aforementioned framework, several studies concentrate on improving the parameterization or initialization of the transition matrix A. This involves refining how the matrix is formulated or initialized within the SSM to enhance its effectiveness and performance in sequence modeling tasks. LSSL [63] first proposes to initialize A with the optimal transition matrix HiPPO-LegS designed by HiPPO. In addition, LSSL also trains the SSM in a convolutional manner by unrolling Eq. 7. Specifically, through a convolution kernel defined as K_L(A, B, C) = (C A^i B)_{i∈[L]} = (CB, CAB, ..., C A^{L−1} B), Eq. 7 can be rewritten as y = K_L(A, B, C) ∗ u, which can be computed efficiently via the Fast Fourier Transform (FFT). However, computing this convolution kernel is expensive, since it requires repeated multiplication by A. To this end, S4 [64], DSS [65] and S4D [66] propose to diagonalize the matrix A, which accelerates the computation. This can be seen as a parameterization technique for the transition matrix A. Previous SSMs processed each input dimension independently, resulting in a large number of trainable parameters. To enhance efficiency, S5 [70] proposes to simultaneously process all input dimensions using a single set of parameters. Building upon this structure, S5 introduces a parameterization and initialization method for A based on the standard HiPPO matrix. Liquid S4 [69] and Mamba [73] parameterize the transition matrices in an input-dependent manner, which further enhances the modeling capability of the SSM. Additionally, both S5 [70] and Mamba [73] adopt a parallel scan technique for efficient model training without the need for convolution operations. This technique offers advantages in implementation and deployment on modern GPU hardware.
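For a single input dimension, the sketch below shows the two equivalent ways of evaluating the linear SSM of Eq. 7 that the discussion above relies on: the step-by-step recurrence used at inference time, and the unrolled convolution with kernel K_L(A, B, C) used for training (computed naively here; real implementations use the FFT or parallel scans). The matrices are random placeholders rather than a HiPPO initialization.

```python
import torch

def ssm_recurrent(A, B, C, u):
    """Eq. (7) as a recurrence: x_k = A x_{k-1} + B u_k, y_k = C x_k."""
    x = torch.zeros(A.shape[0], 1)
    ys = []
    for u_k in u:                               # one step per input element
        x = A @ x + B * u_k
        ys.append((C @ x).squeeze())
    return torch.stack(ys)

def ssm_convolution(A, B, C, u):
    """Equivalent unrolled form: y = K_L(A, B, C) * u with
    K_L = (C B, C A B, ..., C A^{L-1} B)."""
    L = u.shape[0]
    kernel, mat = [], torch.eye(A.shape[0])
    for _ in range(L):
        kernel.append((C @ mat @ B).squeeze())
        mat = A @ mat
    kernel = torch.stack(kernel)                # (L,)
    # Causal convolution: y_k = sum_{i <= k} kernel_i * u_{k-i}.
    return torch.stack([(kernel[: k + 1].flip(0) * u[: k + 1]).sum()
                        for k in range(L)])

d_state, L = 4, 6
A = torch.randn(d_state, d_state) * 0.3         # random placeholder, not HiPPO
B, C = torch.randn(d_state, 1), torch.randn(1, d_state)
u = torch.randn(L)
print(torch.allclose(ssm_recurrent(A, B, C, u), ssm_convolution(A, B, C, u), atol=1e-5))
```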
Another line of research aims to design better model architectures based on SSMs. GSS [67] and BiGS [72] combine the Gated Attention Unit (GAU) [102] with the SSM. Specifically, they replace the attention operation in GAU with the SSM operation. BST [71] combines the SSM model with the proposed Block Transformer, which introduces a strong local inductive bias. H3 [68] observes that the SSM is weak at recalling earlier tokens and comparing a token across the sequence. To this end, it proposes to add a shift SSM operation before the standard SSM operation, which is used to directly shift the input tokens into the state. MambaFormer [74] combines the standard Transformer and the SSM model by substituting the FFN layer in the Transformer with an SSM layer. Jamba [103] introduces another approach to combining the Transformer and SSM models by adding four Transformer layers into an SSM model. DenseMamba [104] explores the issue of hidden state degradation in traditional SSMs and introduces dense connections within the SSM architecture to preserve fine-grained information across deeper layers of the model. BlackMamba [105] and MoE-Mamba [106] propose to enhance SSM models with the Mixture-of-Experts (MoE) technique to optimize the training and inference efficiency while maintaining the model
TABLE 2
Efficiency comparison of some novel non-Transformer models. Note that we denote n as the input length and d as the input dimension.
GEMM operation, allowing computation with low-precision Tensor Cores (e.g., INT8). This quantization approach is referred to as Weight-Activation Quantization.

In contrast, during the decoding stage, LLMs process only one token at each generation step, using general matrix-vector multiplication (GEMV) as the core operation. The latency of the decoding stage is mainly influenced by the loading of large weight tensors. To tackle this challenge, existing methods focus on quantizing only the weights to accelerate memory access. This method, known as Weight-only Quantization, involves offline quantization of the weights, followed by de-quantization of the low-precision weights into FP16 format for computation, as shown in Figure 9(a).
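A minimal sketch of the weight-only scheme described above is shown below: the weights are quantized offline to 4-bit integers with one scale per group, then de-quantized at inference time before the usual matrix-vector product. It is a plain symmetric round-to-nearest illustration, not the AWQ or GPTQ algorithms and not the packed low-bit kernels used by real engines (the de-quantized matmul runs in float32 here for portability, whereas engines de-quantize to FP16 on the GPU).

```python
import torch

def quantize_weight_int4(W, group_size=128):
    """Offline symmetric 4-bit quantization with one FP16 scale per group
    of `group_size` consecutive weights along the input dimension."""
    out_features, in_features = W.shape
    Wg = W.reshape(out_features, in_features // group_size, group_size)
    scale = (Wg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)  # int4 range [-8, 7]
    q = torch.clamp(torch.round(Wg / scale), -8, 7).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_int4(q, scale):
    """De-quantize back to higher precision right before the matrix-vector
    product (FP16 on GPU in practice; float32 here for a portable CPU demo)."""
    return (q.float() * scale.float()).reshape(q.shape[0], -1)

# Weight-only quantized "linear layer" during decoding (GEMV):
W = torch.randn(256, 256)
x = torch.randn(256)
q, scale = quantize_weight_int4(W)          # done once, offline
y = dequantize_int4(q, scale) @ x           # low-precision storage, high-precision compute
print(y.shape, float((dequantize_int4(q, scale) - W).abs().mean()))
```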
Post-Training Quantization. Post-training quantization (PTQ) involves quantizing pre-trained models without the need for retraining, which can be a costly process. While PTQ methods have been well-explored for smaller models, applying existing quantization techniques directly to LLMs presents challenges. This is primarily because the weights and activations of LLMs often exhibit more outliers and have a wider distribution range than those of smaller models, making their quantization more challenging. In summary, the size and complexity of LLMs necessitate tailored quantization techniques that can account for outliers and wide distribution ranges without compromising model performance or efficiency. Numerous studies have concentrated on developing
TABLE 3
Summary of the representative studies on Post-Training Quantization. Quantized Tensor Type denotes which parts of the tensors are quantized. Quantized Format denotes whether uniform or non-uniform quantization is adopted. Quantized Criterion denotes how the quantization parameters (e.g., scaling factor, zero-point) are decided. Quantized Value Update denotes whether the model weights are changed (e.g., compensation, re-parameterization) during the quantization process.
non-uniform quantization to the remaining weights. The values for non-uniform quantization are determined based on quantization sensitivity, which contributes to improved performance of the quantized model. QuIP [188] introduces LDLQ, an optimal adaptive method for a quadratic proxy objective. The study reveals that ensuring incoherence between the weight and Hessian matrices can enhance the effectiveness of LDLQ. QuIP utilizes LDLQ and achieves incoherence by employing random orthogonal matrix multiplication. FineQuant [189] utilizes a heuristic approach to determine the granularity of quantization per column, combining empirical insights gained from experiments to design a quantization scheme. QuantEase [190] builds upon GPTQ. When quantizing each layer, it proposes a method based on Coordinate Descent to compensate for the unquantized weights more precisely. Additionally, QuantEase can leverage quantized weights from GPTQ as an initialization and further refine the compensation process. LLM-MQ [191] protects the weight outliers in FP16 format, and stores them in Compressed Sparse Row (CSR) format for efficient computation. Besides, LLM-MQ models the bit-width assignment to each layer as an integer programming problem, and employs an efficient solver to solve it within a few seconds. Moreover, LLM-MQ designs an efficient CUDA kernel to integrate dequantization operators, thereby reducing memory access cost during computation.

For weight-activation quantization, ZeroQuant [192] employs finer-grained quantization for weights and activations, leveraging kernel fusion to minimize the memory access cost during quantization and conducting layer-by-layer knowledge distillation to recover the performance. FlexGen [193] quantizes weights and the KV cache directly into INT4 to reduce the memory footprint during inference with large batch sizes. LLM.int8() [194] identifies that outliers in activations are concentrated within a small subset of channels. Leveraging this insight, LLM.int8() splits activations and weights into two distinct parts based on the outlier distribution within input channels to minimize quantization errors in activations. Channels containing outlier data in both activations and weights are stored in FP16 format, while the other channels are stored in INT8 format. SmoothQuant [195] employs a reparameterization technique to address the challenges of quantizing activation values. This method introduces a scaling factor that expands the data range of the weight channels while shrinking the data range of the corresponding activation channels. ZeroQuant [192] introduces a group-wise quantization strategy for weights and a token-wise quantization approach for activations. Building upon this methodology, ZeroQuant-V2 [196] presents the LoRC (Low-Rank Compensation) technique, employing low-rank matrices to mitigate quantization inaccuracies. RPTQ [197] identifies substantial variations in the distribution of different activation channels, which present challenges for quantization. To mitigate this issue, RPTQ reorganizes channels with similar activation distributions into clusters and independently applies quantization within each cluster. OliVe [198] observes that the normal values neighboring the outliers are less critical. Therefore, it pairs each outlier with a normal value, sacrificing the latter to achieve a broader representation range for the outliers. OS+ [166] observes that the distribution of outliers is concentrated and asymmetrical, posing a challenge to LLM quantization. To address this, OS+ introduces a channel-wise shifting and scaling technique aimed at alleviating these challenges. The shifting and scaling parameters are determined through a search process to effectively handle the concentrated and asymmetrical outlier distribution. ZeroQuant-FP [199] investigates the feasibility of quantizing weight and activation values into FP4 and FP8 formats. The study reveals that quantizing activations into floating-point types (FP4 and FP8) produces superior results compared to integer types. OmniQuant [200] diverges from prior approaches that rely on empirical design of quantization parameters. Instead, it optimizes the boundaries for weight clipping and the scaling factor for the equivalent transformation to minimize quantization errors. QLLM [201] addresses the impact of outliers on quantization by implementing channel reassembly. Additionally, it introduces learnable low-rank parameters to minimize quantization errors in the post-quantized model. Atom [202] employs a strategy involving mixed-precision and dynamic quantization for activations. Notably, it extends this approach to quantize the KV cache into INT4 to enhance throughput performance. LLM-FP4 [177] endeavors to quantize the entire model into FP4 format and introduces a pre-shifted exponent bias technique. This approach combines the scaling factor of the activation values with the weights to address quantization challenges posed by outliers. BiLLM [203] represents one of the lowest-bit PTQ efforts to date. BiLLM identifies the bell-shaped distribution of the weights and the exceptionally long-tailed distribution of the weights' Hessian matrix. Based on this, it proposes to categorize weights into salient and non-salient values structurally based on the Hessian matrix and binarizes them separately. As a result, BiLLM can extensively quantize LLMs to 1.08 bits without significant degradation in perplexity. KVQuant [205] proposes a non-uniform quantization scheme for KV cache quantization by deriving the optimal datatypes offline on a calibration set. KIVI [206] proposes a tuning-free 2-bit KV cache quantization algorithm, which utilizes per-channel quantization for the key cache and per-token quantization for the value cache in a group-wise manner. Li et al. [204] conduct a thorough evaluation to assess the impact of quantization on different tensor types (including the KV cache), various tasks, 11 LLM families, and SOTA quantization methods.
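The equivalent transformation behind SmoothQuant-style reparameterization can be written in a few lines: a per-input-channel scale s migrates quantization difficulty from activations to weights, since (X diag(s)^{-1})(diag(s) W) = XW. The sketch below is a schematic illustration with the commonly used alpha-controlled scale, not the full SmoothQuant pipeline (calibration over a dataset, fused kernels, etc.).

```python
import torch

def smooth_scales(X, W, alpha=0.5, eps=1e-8):
    """Per-input-channel smoothing scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    X: (n_tokens, d_in); W: (d_in, d_out)."""
    act_max = X.abs().amax(dim=0)                # per input channel of the activations
    w_max = W.abs().amax(dim=1)                  # per input channel of the weights
    return (act_max.clamp(min=eps) ** alpha) / (w_max.clamp(min=eps) ** (1 - alpha))

X = torch.randn(64, 32)
X[:, 3] *= 50.0                                  # emulate an activation outlier channel
W = torch.randn(32, 16)

s = smooth_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]       # equivalent transformation
# The product is unchanged, but the activation outliers are flattened,
# which makes quantizing the activations much easier.
print(torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3))
print(float(X.abs().max()), float(X_smooth.abs().max()))
```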
Quantization-Aware Training. Quantization-aware training (QAT) incorporates the influence of quantization within the model training procedure. By integrating layers that replicate quantization effects, this approach facilitates weight adaptation to quantization-induced errors, leading to enhanced task performance. Nevertheless, training LLMs typically demands substantial training data and considerable computational resources, posing potential bottlenecks for QAT implementation. Consequently, current research endeavors focus on strategies to reduce the training data requirements or alleviate the computational burden associated with QAT implementation.
TABLE 4
Comparison of speed-ups in different scenarios (e.g., model size, batch size, input context length, inference framework) with W4A16 quantization based on the TensorRT-LLM [208] and LMDeploy [209] frameworks, respectively. We report the speed-ups of prefilling/decoding/end-to-end latency on a single NVIDIA A100 GPU. B denotes the batch size; the columns correspond to input context lengths of 128, 256, 512, 1024, and 2048 tokens. OOM denotes "Out Of Memory".

TensorRT-LLM

LLaMA-2-7B
B     128              256              512              1024             2048
1     1.06/2.40/2.37   0.90/2.38/2.34   0.92/2.30/2.28   0.88/2.19/2.17   0.91/2.00/1.98
2     0.88/2.10/2.05   0.91/2.07/2.04   0.89/2.01/1.98   0.91/1.92/1.89   0.88/1.78/1.76
4     0.92/1.72/1.67   0.89/1.67/1.64   0.90/1.61/1.58   0.87/1.53/1.51   0.84/1.42/1.40
8     0.91/1.43/1.36   0.88/1.38/1.33   0.83/1.33/1.28   0.77/1.25/1.21   0.78/1.16/1.14
16    0.91/1.43/1.36   0.88/1.38/1.33   0.83/1.33/1.28   0.77/1.25/1.21   0.78/1.16/1.14

LLaMA-2-13B
B     128              256              512              1024             2048
1     1.24/2.51/2.50   0.89/2.45/2.47   0.94/2.34/2.42   0.90/2.18/2.32   0.83/1.94/2.16
2     0.90/2.51/2.50   0.95/2.45/2.47   0.90/2.34/2.42   0.83/2.18/2.32   0.80/1.94/2.16
4     0.96/1.80/1.76   0.91/1.78/1.74   0.83/1.73/1.69   0.80/1.65/1.62   0.83/1.54/1.52
8     0.91/1.86/1.77   0.83/1.81/1.73   0.80/1.73/1.66   0.82/1.62/1.56   0.75/1.46/1.41
16    0.84/1.84/1.69   0.81/1.77/1.63   0.82/1.63/1.53   0.78/1.46/1.39   OOM

LMDeploy

LLaMA-2-7B
B     128              256              512              1024             2048
1     1.30/2.11/2.09   0.94/2.07/2.05   0.90/2.03/2.02   0.88/1.97/1.96   0.94/1.92/1.91
2     1.03/2.24/2.20   0.90/2.19/2.15   0.88/2.11/2.08   0.93/1.97/1.95   0.85/1.78/1.76
4     0.90/2.18/2.10   0.87/2.12/2.05   0.93/2.01/1.96   0.92/1.86/1.83   0.92/1.64/1.62
8     0.92/1.92/1.77   0.91/1.82/1.71   0.92/1.65/1.57   0.93/1.45/1.41   0.94/1.28/1.26
16    0.92/1.92/1.77   0.91/1.82/1.71   0.92/1.65/1.57   0.93/1.45/1.41   0.94/1.28/1.26

LLaMA-2-13B
B     128              256              512              1024             2048
1     1.32/2.34/2.32   0.94/2.31/2.28   0.92/2.22/2.20   0.94/2.15/2.13   0.94/2.01/1.99
2     1.06/2.42/2.36   0.92/2.37/2.32   0.94/2.29/2.25   0.94/2.15/2.12   0.95/1.95/1.93
4     0.93/2.36/2.26   0.94/2.29/2.21   0.94/2.18/2.12   0.95/2.01/1.97   0.96/1.78/1.75
8     0.92/2.24/2.10   0.93/1.93/2.02   0.94/1.81/1.89   0.94/1.65/1.71   0.95/1.45/1.49
16    0.93/2.02/1.85   0.94/1.90/1.76   0.94/1.73/1.63   0.95/1.50/1.45   OOM
To reduce the data requirements, LLM-QAT [177] introduces a data-free method to generate the training data by using the original FP16 LLMs. Specifically, LLM-QAT uses every token in the tokenization vocabulary as a starting token to generate sentences. Based on the generated training data, LLM-QAT applies a distillation-based workflow to train the quantized LLM to match the output distribution of the original FP16 LLM. Norm Tweaking [178] limits the selection of the starting token to only those language categories listed among the top languages with the highest proportion. This strategy can effectively improve the generalization of the quantized model on different tasks.

To reduce the computation cost, many methods apply parameter-efficient tuning (PEFT) strategies to accelerate QAT. QLoRA [179] quantizes the weights of LLMs into 4-bit and subsequently employs LoRA [210] in BF16 for each 4-bit weight matrix to fine-tune the quantized model. QLoRA allows for the efficient fine-tuning of a 65B parameter LLM on one GPU with only 30GB of memory. QA-LoRA [180] proposes to incorporate group-wise quantization into QLoRA. The authors observe that the number of quantization parameters in QLoRA is significantly smaller than the number of LoRA parameters, leading to an imbalance between quantization and low-rank adaptation. They suggest that group-wise operations can address this issue by increasing the number of parameters dedicated to quantization. In addition, QA-LoRA can merge the LoRA terms into the corresponding quantized weight matrices. LoftQ [181] identifies that initializing LoRA matrices with zeros in QLoRA is inefficient for downstream tasks. As an alternative, LoftQ suggests initializing the LoRA matrices using the Singular Value Decomposition (SVD) of the difference between the original FP16 weights and the quantized weights. LoftQ iteratively applies quantization and SVD to achieve a more accurate approximation of the original weights. Norm Tweaking [178] proposes to train the LayerNorm layer after quantization and use knowledge distillation to match the output distribution of the quantized model with that of the FP16 model, achieving effects similar to LLM-QAT while avoiding high training costs.
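To make the parameter-efficient QAT recipe above concrete, the sketch below freezes a (simulated) 4-bit group-quantized base weight and attaches trainable low-rank adapters, in the spirit of QLoRA; the symmetric integer quantization, the group size, and the rank are illustrative assumptions (QLoRA itself uses the NF4 data type with double quantization), and the input dimension is assumed divisible by the group size.

    import math
    import torch

    class QuantLoRALinear(torch.nn.Module):
        """Frozen 4-bit (simulated) base weight plus trainable low-rank adapters."""

        def __init__(self, weight: torch.Tensor, rank: int = 16, alpha: int = 32, group: int = 128):
            super().__init__()
            out_f, in_f = weight.shape
            # Group-wise symmetric 4-bit quantization of the frozen base weight.
            w = weight.reshape(out_f, in_f // group, group)
            scale = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
            self.register_buffer("q_weight", torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8))
            self.register_buffer("scale", scale)
            # Only the LoRA factors are trained (kept in higher precision, e.g., BF16).
            self.lora_A = torch.nn.Parameter(torch.randn(rank, in_f) / math.sqrt(in_f))
            self.lora_B = torch.nn.Parameter(torch.zeros(out_f, rank))
            self.scaling = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Dequantize the base weight on the fly, then add the low-rank update.
            w = (self.q_weight.float() * self.scale).reshape(self.lora_B.shape[0], -1)
            return x @ w.t() + (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling

Because lora_B starts at zero, the adapted layer initially reproduces the quantized base model, and fine-tuning only updates the small A/B matrices.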
Comparative Experiments and Analysis. In this section, we conduct experiments to evaluate the speed-ups achieved by employing the weight-only quantization technique in various scenarios. Specifically, we focus on two widely-used large language models (LLMs), LLaMA-2-7B and LLaMA-2-13B, and quantize their weights to 4-bit using the AWQ [184] algorithm. Subsequently, we deploy these quantized models on a single NVIDIA A100 GPU using two different inference frameworks: TensorRT-LLM [208] and LMDeploy [209]. We then evaluate the speed-ups achieved by these frameworks across different input sequences characterized by varying batch sizes and context lengths.

We present the speed-ups of prefilling latency, decoding latency, and end-to-end latency, as summarized in Tab. 4. From the results, several key observations can be made: (1) Weight-only quantization can substantially accelerate the decoding stage, leading to improvements in end-to-end latency. This enhancement primarily stems from the capability of loading the quantized model with low-precision weight tensors much more swiftly from the High Bandwidth Memory (HBM), as illustrated in the preceding "Efficient Analysis" part. Consequently, this approach markedly diminishes the memory access overhead. (2) Regarding the prefilling stage, weight-only quantization may actually increase the latency. This is due to the fact that the bottleneck in the prefilling stage is the computational cost rather than the memory access cost. Therefore, quantizing only the weights without the activations has minimal impact on latency. Additionally, as illustrated in Fig. 9, weight-only quantization necessitates the de-quantization of low-precision weights to FP16, leading to additional computational overhead and consequently slowing down the prefilling stage. (3) As the batch size and input length increase, the extent of speed-up achieved by weight-only quantization gradually diminishes. This is primarily because, with larger batch sizes and input lengths, the computational cost constitutes a larger proportion of latency. While weight-only quantization predominantly reduces memory access cost, its impact on latency becomes less significant as the computational demands become more prominent with larger batch sizes and input lengths. (4) Weight-only quantization offers greater benefits for larger models due to the significant memory access overhead associated with larger model sizes. As models grow in complexity and size, the amount of memory required to store and access weights increases proportionally. By quantizing the model weights, weight-only quantization effectively reduces this memory footprint and memory access overhead.

5.2.2 Sparsification
Sparsification is a compression technique that increases the proportion of zero-valued elements in data structures such as model parameters or activations. This method aims to decrease computational complexity and memory usage by efficiently ignoring zero elements during computation. In the context of LLMs, sparsification is commonly applied to weight parameters and attention activations. It leads to the development of weight pruning strategies and sparse attention mechanisms.

Weight Pruning. Weight pruning systematically removes less critical weights and structures from models, aiming to reduce computational and memory cost during both the prefilling and decoding stages without significantly compromising performance. This sparsification approach is categorized into two main types: unstructured pruning and structured pruning. The categorization is based on the granularity of the pruning process, as illustrated in Figure 10.

Fig. 10. Illustration of Unstructured Pruning (left; granularity: weight) and Structured Pruning (right; granularity: channel/group/layer).

Unstructured pruning prunes individual weight values with fine granularity. Compared with structured pruning, it typically achieves a greater level of sparsity with minimal impact on model prediction. However, the sparse pattern achieved through unstructured pruning lacks high-level regularity, leading to irregular memory access and computation patterns. This irregularity can significantly hinder the potential for hardware acceleration, as modern computing architectures are optimized for dense, regular data patterns. Consequently, despite achieving higher sparsity levels, the practical benefits of unstructured pruning in terms of hardware efficiency and computational speedup may be limited.

The common focus of this line of work is the pruning criterion, including the weight importance and the pruning ratio. Considering the huge parameter size of LLMs, improving the pruning efficiency is also crucial. One pruning criterion is to minimize the reconstruction loss of the model. SparseGPT [162] is a representative approach in this field. It follows the idea of OBS [211], which considers the impact of removing each weight on the network's reconstruction loss. OBS iteratively decides a pruning mask to prune the weights and reconstructs the unpruned weights to compensate for the pruning loss. SparseGPT overcomes the efficiency bottleneck of OBS via the Optimal Partial Updates technique, and designs an adaptive mask selection technique based on the OBS reconstruction error. Prune and Tune [165] improves upon SparseGPT by fine-tuning the LLMs with minimal training steps during pruning. ISC [164] designs a novel pruning criterion by combining the saliency criteria in OBS [211] and OBD [212]. It further assigns non-uniform pruning ratios to each layer based on Hessian information. BESA [167] learns a differentiable binary mask via gradient descent of the reconstruction loss. The pruning ratio for each layer is sequentially decided by minimizing the reconstruction error. The other popular pruning criterion is magnitude-based. Wanda [163] proposes to use the element-wise product between the weight magnitude and the norm of the input activation as the pruning criterion.
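As a minimal sketch of this magnitude-times-activation idea (in the spirit of Wanda, with the per-output-row comparison group and the calibration-based feature norms as illustrative assumptions):

    import torch

    def prune_by_magnitude_and_activation(weight: torch.Tensor,
                                          act_norm: torch.Tensor,
                                          sparsity: float = 0.5) -> torch.Tensor:
        """Zero out the lowest-scoring weights of a linear layer.

        weight:   (out_features, in_features) weight matrix.
        act_norm: (in_features,) L2 norm of each input feature, collected on calibration data.
        """
        score = weight.abs() * act_norm              # per-weight importance |W| * ||x||
        k = int(weight.shape[1] * sparsity)
        # Rank weights within each output row and drop the k least important ones.
        drop = torch.topk(score, k, dim=1, largest=False).indices
        mask = torch.ones_like(weight)
        mask.scatter_(1, drop, 0.0)
        return weight * mask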
RIA [168] jointly considers the weights and activations by using the metric of Relative Importance and Activations, which evaluates the importance of each weight element based on all its connected weights. In addition, RIA converts the unstructured sparsity pattern to a structured N:M sparsity pattern, which can enjoy actual speed-up on NVIDIA GPUs. Additionally, OWL [166] focuses on deciding the pruning ratio of each layer. It assigns the pruning ratios to each layer based on its activation outlier ratios.

Structured pruning prunes larger structural units of the model, such as entire channels or layers, operating at a coarser granularity compared to unstructured pruning. These methods directly facilitate inference speed-up on conventional hardware platforms due to their alignment with the dense, regular data patterns these systems are optimized to process. However, the coarse granularity of structured pruning often results in a more pronounced impact on model performance. The pruning criterion of this line of work additionally enforces the structured pruning pattern. LLM-Pruner [169] proposes a task-agnostic structured pruning algorithm. Specifically, it first identifies the coupled structures in the LLM, based on the connection dependencies between neurons. Then, it decides which structure groups to remove based on a well-designed group-wise pruning metric. After pruning, it further proposes to recover the model performance with a parameter-efficient training technique, i.e., LoRA [210]. Sheared LLaMA [170] proposes to prune the original LLM to a specific target architecture of existing pre-trained LLMs. In addition, it designs dynamic batch-loading techniques to improve post-training performance. ZipLM [171] iteratively identifies and prunes the structural components with the worst trade-off between loss and runtime. LoRAPrune [172] proposes a structured pruning framework for pre-trained LLMs with LoRA modules to enable fast inference of LoRA-based models. It designs a LoRA-guided pruning criterion that uses the weights and gradients of LoRA, and an iterative pruning scheme to remove the unimportant weights based on the criterion. LoRAShear [173] also designs a pruning method for LoRA-based LLMs with (1) a graph algorithm to identify the minimal removal structures, (2) a progressive structured pruning algorithm, LHSPG, and (3) a dynamic knowledge recovery mechanism to recover the model performance. SliceGPT [174] builds on the idea of computational invariance of the RMSNorm operation. It proposes to structurally arrange the sparsity in each weight matrix, and to slice out entire rows or columns. PLATON [175] proposes to prune the weights by considering both their importance and uncertainty. It uses the exponential moving average (EMA) of the importance scores to estimate the importance, and adopts the upper confidence bound (UCB) for the uncertainty. SIMPLE [176] proposes to prune the attention heads, FFN neurons and hidden dimension via learning the corresponding sparsity masks. After pruning, it further adopts knowledge distillation to fine-tune the pruned models for performance recovery.

Fig. 11. Examples of different sparse attention masks. (a) Static mask with local, global, and random attention patterns. (b) Static mask with dilated attention patterns of different dilation rates. (c) Dynamic token pruning. (d) Dynamic attention pruning.

Sparse Attention. Sparse attention techniques in the Multi-Head Self-Attention (MHSA) components of transformer models strategically omit certain attention calculations to enhance the computational efficiency of the attention operation, mainly in the prefilling stage. These mechanisms diverge into static and dynamic categories based on their reliance on specific input data.

Static sparse attention removes activation values independently of specific inputs [147], [149], [150], [151]. These methods pre-determine the sparse attention mask and enforce it on the attention matrix during inference. Previous studies combine different sparse patterns to preserve the most essential elements within each attention matrix. As shown in Figure 11(a), the most common sparse attention patterns are the local and global attention patterns. The local attention pattern captures the local context of each token with a fixed-size window attention surrounding each token. The global attention pattern captures the correlation of specific tokens to all other tokens by computing and attending to all tokens across the sequence. Note that leveraging global patterns can eliminate the need to store key-value (KV) pairs for unused tokens, thereby reducing memory access cost and memory usage during the decoding stage. Sparse Transformer [147] combines these patterns to capture the local context with a local pattern, and then aggregates the information with the global pattern for every few words. StreamingLLM [148] applies the local pattern, along with the global pattern only for the first few tokens. It shows that such a global pattern serves as the attention sink that keeps strong attention scores toward initial tokens. It helps the LLMs to generalize to infinite input sequence length. BigBird [150] also uses the random pattern, where all tokens attend to a set of random tokens. The combination of local, global and random patterns is proven to encapsulate all continuous sequence-to-sequence functions, affirming its Turing completeness. As shown in Figure 11(b), Longformer [149] additionally introduces the dilated sliding window pattern. It is analogous to dilated CNNs and makes the sliding window "dilated" to increase the receptive field. To adapt the model to the sparse setting, Structured Sparse Attention [151] advocates an entropy-aware training method that congregates high-probability attention values into denser regions. Unlike previous studies that manually design sparse patterns, SemSA [152] uses gradient-based profiling to identify important attention patterns and automatically optimizes the attention density distribution to further improve model efficiency.
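The static patterns above can be summarized as a fixed boolean mask applied to the attention scores. A minimal sketch combining a sliding local window with a few global tokens is shown below; the window size and the choice of the first tokens as global positions are illustrative assumptions.

    import torch

    def local_global_mask(seq_len: int, window: int = 4, n_global: int = 2) -> torch.Tensor:
        """Return a (seq_len, seq_len) boolean mask; True means the score is kept."""
        idx = torch.arange(seq_len)
        # Local pattern: each token attends to a fixed-size window around itself.
        mask = (idx[:, None] - idx[None, :]).abs() <= window
        # Global pattern: the first n_global tokens attend to, and are attended by, all tokens.
        mask[:n_global, :] = True
        mask[:, :n_global] = True
        return mask

    # Usage: scores.masked_fill_(~mask, float("-inf")) before the softmax,
    # optionally intersected with a causal mask for decoder-only LLMs.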
In contrast, dynamic sparse attention adaptively eliminates activation values based on varying inputs, employing real-time monitoring of neuronal activation values to bypass computations for neurons with negligible impact, thereby achieving pruning. Most dynamic sparse attention methods employ dynamic token-pruning methods, as Figure 11(c) shows. Spatten [153], SeqBoat [154] and Adaptively Sparse Attention [155] leverage the inherent redundancy in linguistic constructs to propose dynamic token-level pruning strategies. Spatten [153] assesses the cumulative importance of each word by aggregating the attention matrix columns, subsequently pruning tokens with minimal cumulative significance from the input in subsequent layers. SeqBoat [154] trains a linear State Space Model (SSM) with a sparse sigmoid function to determine which token to prune for each attention head. Both Spatten and SeqBoat prune the uninformative tokens for the whole input. Adaptively Sparse Attention [155] gradually prunes the tokens during the generation process. It drops parts of the context that are no longer required for future generation.
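A minimal sketch of such dynamic token-level pruning is given below: it accumulates how much attention each cached token receives and keeps only the KV entries of the highest-scoring tokens. The aggregation over heads and queries and the fixed budget are illustrative assumptions in the spirit of the cumulative-importance criteria described above, not the exact procedure of any single method.

    import torch

    def prune_kv_by_attention(attn: torch.Tensor, k: torch.Tensor, v: torch.Tensor, budget: int):
        """Keep only the KV entries of the `budget` most-attended tokens.

        attn: (num_heads, q_len, kv_len) attention probabilities of the current layer.
        k, v: (kv_len, hidden) cached key / value tensors (layout is illustrative).
        """
        importance = attn.sum(dim=(0, 1))                  # cumulative score per cached token
        keep = torch.topk(importance, min(budget, importance.numel())).indices
        keep = keep.sort().values                          # preserve the original token order
        return k[keep], v[keep], keep                      # pruned cache and surviving positions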
In addition to dynamic token pruning, dynamic attention pruning strategies are also employed [156], [157], [158], [159], [160]. As Figure 11(d) shows, instead of pruning all the attention values of certain tokens, these methods dynamically prune a selective part of the attention based on the input. A prominent approach within this domain is dynamically segmenting input tokens into groups, known as buckets, and strategically omitting the attention calculations for tokens that reside in separate buckets. The challenge and focus of these methods lie in how to cluster related tokens together, thereby facilitating attention computations solely among them to enhance efficiency. Reformer [156] leverages locality-sensitive hashing to cluster keys and queries that share identical hash codes into the same bucket. Following this, Sparse Flash Attention [157] introduces specialized GPU kernels optimized for this hash-based sparse attention mechanism, further improving computational efficiency. Meanwhile, the Routing Transformer [158] employs a spherical k-means clustering algorithm to aggregate tokens into buckets, optimizing the selection process for attention computations. Sparse Sinkhorn Attention [159] adopts a learned sorting network to align keys with their relevant query buckets, ensuring that attention is computed only between the corresponding query-key pairs. Diverging from the bucket-level operation, H2O [160] introduces a token-level dynamic attention pruning mechanism. It combines static local attention with dynamic computations between the current query and a set of dynamically identified key tokens, termed heavy-hitters (H2). These heavy-hitters are dynamically adjusted with an eviction policy aimed at removing the least significant keys at each generation step, effectively managing the size and relevance of the heavy-hitter set.

Moreover, viewing each token as a graph node and attention between tokens as edges offers an extended perspective on static sparse attention [150], [161]. The original, full attention mechanism equates to a complete graph with a uniform shortest path distance of 1. Sparse attention, with its random mask, introduces random edges, effectively reducing the shortest path distance between any two nodes to O(log n), thus maintaining efficient information flow akin to full attention. Diffuser [161] utilizes this graph-theoretic perspective to expand the receptive field of sparse attention with multi-hop token correlations. It also takes inspiration from expander graph properties to design better sparse patterns that approximate the information flow of full attention.

Beyond attention-level and token-level sparsity, the scope of attention pruning extends to various granularities. Spatten [153] also extends pruning beyond token granularity to attention head granularity, eliminating computations for inessential attention heads to further reduce computational and memory demands.

5.2.3 Structure Optimization
The objective of structure optimization is to refine the model architecture or structure with the goal of enhancing the balance between model efficiency and performance. Within this field of research, two prominent techniques stand out: Neural Architecture Search (NAS) and Low Rank Factorization (LRF).

Neural Architecture Search. Neural Architecture Search (NAS) [213] aims to automatically search for optimal neural architectures that strike an optimized balance between efficiency and performance. AutoTinyBERT [135] utilizes one-shot Neural Architecture Search (NAS) to discover the hyper-parameters of the Transformer architecture. Notably, it introduces a compelling batch-wise training approach to train a Super Pre-trained Language Model (SuperPLM) and subsequently employs an evolutionary algorithm to identify the optimal sub-models. NAS-BERT [136] trains a large super-net on conventional self-supervised pre-training tasks using several innovative techniques, such as block-wise search, search space pruning, and performance approximation. This approach allows NAS-BERT to be applied efficiently across various downstream tasks without requiring extensive re-training. Structure pruning via NAS [137] treats structural pruning as a multi-objective NAS problem, and solves it via the one-shot NAS method. LiteTransformerSearch [138] proposes to use a training-free indicator, i.e., the number of parameters, as a proxy indicator to guide the search. This method enables efficient exploration and selection of optimal architectures without the need for actual training during the search phase. AutoDistil [139] presents a fully task-agnostic few-shot NAS algorithm featuring three primary techniques: search space partitioning, task-agnostic SuperLM training, and task-agnostic search. This approach aims to facilitate efficient architecture discovery across various tasks with minimal task-specific adaptations. Typically, NAS algorithms necessitate evaluating the performance of each sampled architecture, which can incur significant training cost. Consequently, these techniques are challenging to apply to LLMs.

Low Rank Factorization. Low Rank Factorization (LRF), or Low Rank Decomposition, aims to approximate a matrix A^{m×n} with two low-rank matrices B^{m×r} and C^{r×n} by:

A^{m×n} ≈ B^{m×r} × C^{r×n},   (11)

where r is much smaller than m and n. In this way, LRF can diminish memory usage and enhance computational efficiency. Furthermore, during the decoding stage of LLM inference, the memory access cost presents a bottleneck to the decoding speed. Therefore, LRF can reduce the number of parameters that need to be loaded, thereby accelerating the decoding speed. LoRD [140] shows the potential of compressing LLMs via LRF without largely degrading the performance. Specifically, it adopts Singular Value Decomposition (SVD) to factorize the weight matrices, and successfully compresses an LLM with 16B parameters to 12.3B with minimal performance drop. TensorGPT [141] introduces a method to compress the embedding layer using Tensor-Train Decomposition. Each token embedding is treated as a Matrix Product State (MPS) and efficiently computed in a distributed manner. LoSparse [142] combines the benefits of LRF and weight pruning for LLM compression. By leveraging low-rank approximation, LoSparse mitigates the risk of losing too many expressive neurons that typically occurs with direct model pruning. LPLR [143] and ZeroQuant-V2 [144] both propose to compress the weight matrix by simultaneously applying LRF and quantization to it. DSFormer [145] proposes to factorize the weight matrix into the product of a semi-structured sparse matrix and a
small dense matrix. ASVD [146] designs an activation-aware SVD method. This approach involves scaling the weight matrix based on the activation distribution prior to applying SVD for matrix decomposition. ASVD also involves determining an appropriate truncation rank for each layer through a search process.
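A minimal sketch of the plain SVD-based factorization of Eq. (11) is shown below; activation-aware scaling (as in ASVD) and per-layer rank search are omitted, and the rank is an illustrative input.

    import torch

    def low_rank_factorize(w: torch.Tensor, rank: int):
        """Approximate an (m, n) weight matrix with B (m, r) and C (r, n) via truncated SVD."""
        U, S, Vh = torch.linalg.svd(w, full_matrices=False)
        B = U[:, :rank] * S[:rank]     # fold the singular values into the left factor
        C = Vh[:rank, :]
        return B, C                    # w ≈ B @ C: r * (m + n) values instead of m * n

    # A factorized linear layer y = x @ W.T then becomes y = (x @ C.T) @ B.T,
    # so far fewer weight values have to be loaded at each decoding step.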
5.2.4 Knowledge Distillation
Knowledge Distillation (KD) is a well-established technique for model compression, wherein knowledge from large models (referred to as teacher models) is transferred to smaller models (referred to as student models). In the context of LLMs, KD involves using the original LLMs as teacher models to distill smaller LMs. Numerous studies have focused on effectively transferring various abilities of LLMs to smaller models. In this domain, methods can be categorized into two main types: white-box KD and black-box KD (as illustrated in Fig. 12).

Fig. 12. Illustration of White-Box KD (features, logits, outputs) and Black-Box KD (outputs; ICL, CoT, and IF abilities).

In contrast to white-box KD, which can leverage the teacher's internal features and output logits, black-box KD only uses the teacher's generated outputs to distill the student models. In the field of LLMs, black-box KD mainly guides the student models to learn LLMs' generalization ability and emergent abilities, including the In-Context Learning (ICL) ability [43], Chain-of-Thought (CoT) reasoning ability [14] and Instruction Following (IF) ability [214].

Regarding the ICL ability, Multitask-ICT [116] introduces in-context learning distillation to transfer the multitask few-shot ability of Large Language Models (LLMs), leveraging both in-context learning and language modeling proficiency. MCKD [117] observes that student models distilled from in-context learned teacher models often exhibit superior performance on unseen input prompts. Building on this observation, MCKD devises a multi-stage distillation paradigm where the student model from previous stages is employed to generate distillation data for subsequent stages, enhancing the effectiveness of the distillation method.

To distill the Chain-of-Thought (CoT) reasoning ability, several techniques such as Distilling Step-by-Step [118], SCoTD [119], CoT Prompting [120], MCC-KD [121], and Fine-tune-CoT [122] propose distillation methods that incorporate responses and rationales extracted from LLMs to train student models. Socratic CoT [123] also targets the distillation of CoT reasoning ability.
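A minimal sketch of black-box sequence-level distillation is given below: the teacher only exposes its generated text, which is used to fine-tune the student with the ordinary language-modeling loss. The Hugging Face-style interface, the model names, and the sampling settings are assumptions for illustration; CoT-style recipes additionally prompt the teacher to emit rationales and train the student on them.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical checkpoints; any teacher/student pair sharing a tokenizer works the same way.
    teacher = AutoModelForCausalLM.from_pretrained("teacher-llm").eval()
    student = AutoModelForCausalLM.from_pretrained("student-lm")
    tok = AutoTokenizer.from_pretrained("teacher-llm")
    opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

    def distill_step(prompt: str) -> float:
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():                       # black-box: only the teacher's output text is observed
            out = teacher.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
        # Standard causal-LM loss on the teacher's output (in practice the prompt tokens
        # would be masked out of the labels).
        loss = student(input_ids=out, labels=out.clone()).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
        return loss.item()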
5.2.5 Dynamic Inference
Dynamic inference techniques adaptively adjust the model structure used for each input during inference, typically through early exiting; existing methods operate at either the sample level or the token level, as illustrated in Fig. 13.

Fig. 13. Illustration of Token-level (up) and Sample-level (down) dynamic inference.
Sample-level. Sample-level early exiting techniques focus on determining the optimal size and structure of Language Models (LLMs) for individual input samples. A common approach is to augment LLMs with additional modules after each layer, leveraging these modules to decide whether to terminate inference at a specific layer. FastBERT [110], DeeBERT [113], MP [215], and MPEE [111] train these modules directly to make decisions (e.g., outputting 0 to continue or 1 to stop) based on features from the current layer. Global Past-Future Early Exit [112] proposes a method that enriches the input to these modules with linguistic information from both preceding and subsequent layers. Given that future layer features are not directly accessible during inference, a simple feed-forward layer is trained to estimate these future features. PABEE [114] trains the modules as output heads for direct prediction, suggesting inference termination when predictions remain consistent. HASHEE [115] employs a non-parametric decision-making approach based on the hypothesis that similar samples should exit inference at the same layer.
ten sacrifice critical information, resulting in performance
Token-level. In the decoding stage of LLM inference, where degradation. Therefore, the challenge of preserving essen-
tokens are generated sequentially, token-level early exiting tial information while efficiently managing long contexts
techniques aim to optimize the size and structure of LLMs remains an important area for future exploration. As for
for each output token. CALM [108] introduces early exit the weight pruning techniques, LLM-KICK [217] notes that
classifiers after each Transformer layer, training them to current state-of-the-art (SOTA) methods experience con-
output confidence scores that determine whether to halt siderable performance degradation even at relatively low
inference at a specific layer. Notably, in the self-attention sparsity ratios. Consequently, developing effective weight
block, computing the current token’s feature at each layer pruning methods to maintain LLM performance remains an
relies on all previous tokens’ features (i.e., KV cache) in emerging and critical research direction.
the same layer. To address the issue of missing KV cache The optimization of model structures often involves the
due to early exiting of previous tokens, CALM proposes use of Neural Architecture Search (NAS), which typically
directly copying the feature from the exiting layer to subse- demands extensive computational resources, posing a po-
quent layers, with experimental results showing only minor tential barrier to its practical application in compressing
performance degradation. SkipDecode [109] addresses lim- LLMs. Therefore, investigating the feasibility of employ-
itations of previous early exiting methods that hinder their ing automatic structure optimization for LLM compression
applicability to batch inference and KV caching, thereby lim- warrants further exploration. Additionally, the challenge
iting actual speed-up gains. For batch inference, SkipDecode remains for techniques like low-rank factorization (LRF) to
proposes a unified exit point for all tokens within a batch. achieve an optimal balance between compression ratio and
Regarding KV caching, SkipDecode ensures a monotonic task performance. For instance, ASVD [146] achieves only a
decrease in exit points to prevent recomputation of KV modest 10% to 20% compression ratio without compromis-
cache, facilitating efficiency gains during inference. ing the reasoning capabilities of LLMs.
In addition to employing individual model compression techniques, several studies explore the combination of different methods to compress LLMs, leveraging their respective advantages for improved efficiency. For instance, MPOE [88] applies weight matrix factorization specifically to the expert Feed-Forward Networks (FFNs) in MoE-based LLMs, with the goal of further reducing memory requirements. LLM-MQ [191] utilizes weight sparsity techniques to protect weight outliers during model quantization, thereby minimizing quantization errors. LPLR [143] focuses on quantizing low-rank factorized weight matrices to further decrease memory footprint and memory access cost during LLM inference. Furthermore, LoSparse [142] combines low-rank factorization with weight pruning, leveraging pruning to enhance the diversity of low-rank approximation while using low-rank factorization to retain important weights and prevent the loss of critical information. These approaches highlight the potential of integrating multiple compression techniques to achieve better optimization of LLMs.

6 SYSTEM-LEVEL OPTIMIZATION
The system-level optimization for LLM inference primarily involves enhancing the model forward pass. Considering the computational graph of an LLM, there exist multiple operators, with attention and linear operators dominating most of the runtime. As mentioned in Sec. 2.3, system-level optimization primarily considers the distinctive characteristics of the attention operator and the decoding approach within LLMs. In particular, to address the specific issues related to the decoding approach of LLMs, the linear operator requires special tiling designs, and speculative decoding methods are proposed to improve utilization. Furthermore, in the context of online serving, requests come from multiple users. Therefore, beyond the optimizations discussed earlier, online serving faces challenges related to memory, batching and scheduling arising from asynchronous requests.

6.1 Inference Engine
The optimizations for inference engines are dedicated to accelerating the model forward process. The main operators and the computational graph in LLM inference are highly optimized. Besides, speculative decoding techniques are proposed to accelerate the inference speed without performance degradation.

6.1.1 Graph and Operator Optimization
Runtime Profiling. Using the HuggingFace [238] implementation, we profile the inference runtime with different models and context lengths. The profiling results in Fig. 15 demonstrate that attention operators and linear operators collectively dominate runtime, with their combined duration often exceeding 75% of the inference duration. Consequently, a significant portion of optimization efforts at the operator level is dedicated to enhancing the performance of these two operators. Furthermore, there are multiple operators occupying a small proportion of runtime, which fragments the operator execution timeline and increases the cost of kernel launch on the CPU side. To address this issue, at the computational graph level, current optimized inference engines implement highly fused operators.

Attention Operator Optimization. The standard attention computation (e.g., using PyTorch) involves the multiplication of the Query matrix (Q) with the Key matrix (K), resulting in quadratic time and space complexity in relation to the input sequence length. As shown in Fig. 15, the time proportion of the attention operator increases as the context length grows. This translates to high demands on memory size and computational capability, especially when dealing with long sequences. To address the computational and memory overhead of standard attention computation on GPUs, customized attention operators are essential. FlashAttention [233], [234] fuses the entire attention operation into a single, memory-efficient operator to alleviate memory access overhead. The input matrices (Q, K, V) and the attention matrix are tiled into multiple blocks, which eliminates the need for complete data loading. Built upon FlashAttention, FlashDecoding [237] aims to maximize computational parallelism for decoding. Due to the application of the decoding approach, the Q matrix degrades into a batch of vectors during decoding, which makes it challenging to fill the computational units if the parallelism is limited to the batch size dimension. FlashDecoding addresses this by introducing parallel computation along the sequence dimension. While this introduces some synchronization overhead to the softmax computation, it leads to noticeable improvements in parallelism, particularly for small batch sizes and long sequences. The subsequent work, FlashDecoding++ [231], observes that in previous works [233], [234], [237], the maximum value within the softmax only serves as a scaling factor to prevent data overflow. However, the dynamic maximum value incurs significant synchronization overhead. Moreover, extensive experiments indicate that in typical LLMs (e.g., Llama2 [239], ChatGLM [240]), over 99.99% of the softmax inputs fall within a certain range. Thus, FlashDecoding++ proposes to determine the scaling factor based on statistics in advance. This eliminates the synchronization overhead in the softmax computation, enabling parallel execution of subsequent operations alongside the softmax computation.
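The sketch below illustrates the tiling and online-softmax idea behind these fused attention kernels for a single head: the K/V sequence is processed block by block while a running maximum and a running denominator keep the result numerically identical to standard attention, without ever materializing the full score matrix. Real kernels additionally tile over queries and keep all intermediates on-chip; the block size here is an arbitrary choice.

    import torch

    def blockwise_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, block: int = 128):
        """Single-head attention computed over K/V blocks with an online softmax.

        q: (Lq, d); k, v: (Lk, d). Only O(Lq * d) state is kept instead of the
        full (Lq, Lk) attention matrix.
        """
        scale = q.shape[-1] ** -0.5
        m = torch.full((q.shape[0], 1), float("-inf"), device=q.device)  # running max of scores
        l = torch.zeros(q.shape[0], 1, device=q.device)                  # running softmax denominator
        o = torch.zeros_like(q)                                          # running (unnormalized) output
        for start in range(0, k.shape[0], block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = (q @ kb.t()) * scale
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            alpha = torch.exp(m - m_new)          # rescale previously accumulated results
            p = torch.exp(s - m_new)
            l = l * alpha + p.sum(dim=-1, keepdim=True)
            o = o * alpha + p @ vb
            m = m_new
        return o / l                              # equal to softmax(q k^T / sqrt(d)) v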
Linear Operator Optimization. The linear operator plays a pivotal role in LLM inference, performing feature projection and the Feedforward Neural Networks (FFNs). In traditional neural networks, linear operators can be abstracted into General Matrix-Matrix Multiplication (GEMM) operations. However, in the case of LLMs, the application of the decoding approach results in a notably reduced dimension, diverging from the conventional GEMM workload. The low-level implementation of traditional GEMM has been highly optimized, and mainstream LLM frameworks (e.g., DeepSpeed [236], vLLM [49], OpenPPL [241], etc.) primarily call the GEMM APIs offered by cuBLAS [242] for linear operators. Without an explicitly tailored implementation for GEMMs with a reduced dimension, the linear operators during decoding suffer from inefficiency. A notable trend to address the issue is observed in the latest release of TensorRT-LLM [208]. It introduces a dedicated General Matrix-Vector Multiplication (GEMV) implementation, potentially improving efficiency for the decoding step.
Fig. 15. Runtime profiling results for (a) Llama2-7B, 128 context length; (b) Llama2-7B, 2k context length; (c) Baichuan2-13B, 128 context length; (d) Baichuan2-13B, 2k context length; (e) Mixtral-8x7B, 128 context length; (f) Mixtral-8x7B, 2k context length.
Recent research, FlashDecoding++ [231], makes a further step, addressing the inefficiency of the cuBLAS [242] and CUTLASS [243] libraries when dealing with small batch sizes during the decode step. The authors first introduce the concept of the FlatGEMM operation to represent the workload of GEMM with a highly reduced dimension (dimension size < 8 in FlashDecoding++). As FlatGEMM poses new computational characteristics, the tiling strategy for traditional GEMMs necessitates modification to be applied. The authors observe that two challenges exist as the workload varies: low parallelism and a memory access bottleneck. To tackle these challenges, FlashDecoding++ adopts a fine-grained tiling strategy to improve parallelism, and leverages the double buffering technique to hide memory access latency. Furthermore, recognizing that the linear operations in typical LLMs (e.g., Llama2 [239], ChatGLM [240]) often have fixed shapes, FlashDecoding++ establishes a heuristic selection mechanism. This mechanism dynamically chooses between different linear operators based on the input size. The options include FastGEMV [244], FlatGEMM, and the GEMM provided by the cuBLAS [242], [243] libraries. This approach ensures the selection of the most efficient operator for the given linear workload, potentially leading to better end-to-end performance.
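A toy version of such a shape-based dispatch is sketched below; the thresholds and the placeholder flat_gemm kernel are illustrative assumptions rather than FlashDecoding++'s actual rules.

    import torch

    def flat_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # Stand-in for a specialized small-batch ("flat") kernel; numerically just a matmul.
        return x @ w.t()

    def linear_dispatch(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        """Pick a linear kernel based on the reduced batch dimension seen at decode time.

        x: (m, k) activations, w: (n, k) weights; m is often 1 during decoding.
        """
        m = x.shape[0]
        if m == 1:
            return torch.mv(w, x[0]).unsqueeze(0)   # GEMV path for a single query vector
        if m < 8:
            return flat_gemm(x, w)                  # flat-GEMM path for very small batches
        return x @ w.t()                            # regular GEMM path (cuBLAS under the hood)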
Recently, the application of the MoE FFN to enhance model capability has become a trend in LLMs [12]. This model structure also puts forward new requirements for operator optimization. As shown in Fig. 15, in the Mixtral model with MoE FFN, the linear operator dominates the runtime due to the non-optimized FFN computation in the HuggingFace implementation. Besides, Mixtral's adoption of the GQA attention structure decreases the attention operator's runtime proportion, which further points out the urgent need to optimize the FFN layer. MegaBlocks [232] is the first to optimize the computation for MoE FFN layers. The work formulates the MoE FFN computation into block-sparse operations and proposes tailored GPU kernels for acceleration. However, MegaBlocks concentrates on the efficient training of MoE models and hence ignores the characteristics of inference (e.g., the decoding approach). Existing frameworks are working hard to optimize the computation of the MoE FFN inference stage. The official repository of vLLM [49] integrates fused kernels for MoE FFN in Triton [245], seamlessly removing the index overhead.

Graph-Level Optimization. Kernel fusion stands out as a prevalent graph-level optimization because of its capability to reduce runtime. There are three main advantages of applying kernel fusion: (1) To reduce memory access. The fused kernel inherently removes the memory access of intermediate results, mitigating the memory bottleneck for operators. (2) To mitigate kernel launching overhead. For some lightweight operators (e.g., residual adding), the kernel launching time occupies most of the latency, and kernel fusion reduces individual kernel launches. (3) To enhance parallelism. For those operators without data dependency, when one-by-one kernel execution fails to fill the hardware capacity, it is beneficial to parallelize the kernels via fusion.

The technique of kernel fusion proves effective for LLM inference, with all of the aforementioned benefits. FlashAttention [233] formulates the attention operator into a single kernel, removing the overhead of accessing the attention results. Based on the fact that the attention operator is memory-bounded, the reduction of memory access effectively transfers to runtime speed-up. ByteTransformer [235]
and DeepSpeed [236] propose to fuse lightweight operators, including residual adding, layernorm and activation functions, into the preceding linear operators to reduce the kernel launching overhead. As a result, those lightweight operators disappear from the timeline with nearly no extra latency. Moreover, kernel fusion is also adopted to enhance the utilization of LLM inference. The projections of the Query, Key and Value matrices are originally three individual linear operations, and are fused into one linear operator for deployment on modern GPUs. Currently, the kernel fusion technique has been exploited in LLM inference practice, and highly optimized inference engines employ only a few fused kernels within the runtime. For example, in the FlashDecoding++ [231] implementation, a transformer block integrates merely seven fused kernels. Leveraging the aforementioned operator and kernel fusion optimizations, FlashDecoding++ achieves up to a 4.86× speed-up over the HuggingFace implementation.
6.1.2 Speculative Decoding
Speculative decoding [218] (i.e., speculative sampling [219]) is an innovative decoding technique for auto-regressive LLMs designed to enhance decoding efficiency without compromising the fidelity of outputs. The core idea of this approach involves employing a smaller model, termed a draft model, to predict several subsequent tokens efficiently, followed by validation of these predictions using the target LLM in parallel. This methodology aims to enable the LLM to generate multiple tokens within the time frame typically required for a single inference. Fig. 16 demonstrates the comparison of the traditional auto-regressive decoding method and the speculative decoding approach. Formally, the speculative decoding approach consists of two steps:
1) Draft Construction: It employs the draft model to generate several subsequent tokens, namely draft tokens, in parallel or in an auto-regressive manner.
2) Draft Verification: It employs the target model to compute the conditional probabilities of all the draft tokens in a single LLM inference step, subsequently determining the acceptance of each draft token sequentially. The acceptance rate, representing the average number of accepted draft tokens per inference step, serves as a key metric for evaluating the performance of a speculative decoding algorithm.

Fig. 16. Comparison of auto-regressive decoding (a) and speculative decoding (b).

Speculative decoding ensures output equivalence with standard auto-regressive decoding methods. Traditional decoding techniques typically employ two primary sampling strategies: greedy sampling and nucleus sampling. Greedy sampling involves selecting the token with the highest probability at each decoding step to generate a specific output sequence. The initial attempt at speculative decoding, known as Blockwise Parallel Decoding [246], aims to ensure that the draft tokens precisely match the tokens sampled via greedy sampling, thus preserving output token equivalence. In contrast, nucleus sampling involves sampling tokens from a probability distribution, resulting in diverse token sequences with each run. This diversity makes nucleus sampling popular. To accommodate nucleus sampling within speculative decoding frameworks, speculative sampling techniques [218], [219] have been proposed. Speculative sampling maintains output distribution equivalence, aligning with the probabilistic nature of nucleus sampling to generate varied token sequences. Formally, given a sequence of tokens x_1, x_2, ..., x_n and a sequence of draft tokens x̂_{n+1}, x̂_{n+2}, ..., x̂_{n+k}, the speculative sampling strategy accepts the i-th draft token with the following probability:

min(1, p(x̂_i | x_1, x_2, ..., x_{i-1}) / q(x̂_i | x_1, x_2, ..., x_{i-1})),   (12)

where p(·|·) and q(·|·) denote the conditional probabilities from the target LLM and the draft model, respectively. If the i-th draft token is accepted, it sets x_i ← x̂_i. Otherwise, it quits the verification of the following draft tokens, and resamples x_i from the following distribution:

norm(max(0, p(·|x_1, x_2, ..., x_{i-1}) − q(·|x_1, x_2, ..., x_{i-1}))).   (13)
Building upon speculative sampling, several variants [225], [230] have emerged, aimed at validating multiple draft token sequences. Notably, the token tree verifier [225] has become a widely adopted verification strategy within this context. This approach utilizes a tree-structured representation of draft token sets and employs a tree attention mechanism to efficiently perform the verification process.

In the speculative decoding approach, the acceptance rate of draft tokens is significantly influenced by the degree to which the output distributions of draft models align with those of original LLMs. As a result, considerable research efforts have been directed towards improving the design of draft models. DistillSpec [220] directly distills a smaller draft model from the target LLM. SSD [221] involves automatically identifying a sub-model (a subset of model layers) from the target LLM to serve as the draft model, eliminating the need for separate training of the draft model. OSD [222] dynamically adjusts the output distribution of the draft model to match the user query distribution in online LLM services. It achieves this by monitoring rejected draft tokens from the LLM and using this data to refine the draft model through distillation. PaSS [223] proposes utilizing the target LLM itself as the draft model, incorporating trainable tokens (look-ahead tokens) into the input sequence to enable simultaneous generation of subsequent tokens. REST [224] introduces a retrieval-based speculative decoding approach, employing a non-parametric retrieval data store as the draft model. SpecInfer [225] introduces a collective boost-tuning technique to align the output distribution of a group of
TABLE 5
Comparison of several open-source implementations of speculative decoding. In this table, we also show the additional overhead of constructing draft models. Note that for SpD [218], [219], LADE [228], Medusa [48] and Eagle [229], we report the training cost from their original papers. And for SSD [221] and REST [224], we run the sub-LLM search and datastore construction with the code they provide, and report the time cost. Besides, for Medusa, we use Medusa-1 [48], which does not fine-tune the original LLM backbone.

Method            Draft Model              Draft Construction   Draft Verifier         Additional Overhead (GPU hours)   Acceptance Rate   Speed-up
SpD [218], [219]  small speculative model  one draft sequence   speculative sampling   275                               1.77∼2.02×        1.05∼1.77×
LADE [228]        LLM + N-grams            one draft sequence   greedy sampling        0                                 1.92∼2.14×        1.12∼1.30×
SSD [221]         sub-LLM                  one draft sequence   speculative sampling   4                                 1.64∼1.74×        1.01∼1.23×
REST [224]        datastore                token tree           speculative sampling   1.5                               2.18∼2.31×        1.72∼2.27×
Medusa-1 [48]     four LLM heads           token tree           speculative sampling   ∼24                               2.52∼2.62×        2.04∼2.86×
Eagle [229]       one Transformer layer    token tree           speculative sampling   96∼192                            3.47∼3.72×        2.77∼3.74×
draft models with that of the target LLM. Lookahead decoding [228] involves generating n-grams of the target LLM in parallel to aid in generating draft tokens. Medusa [48] fine-tunes several heads of the LLM specifically for generating subsequent draft tokens. Eagle [229] adopts a lightweight transformer layer called an auto-regression head to generate draft tokens in an auto-regressive manner, integrating rich contextual features from the target LLM into the draft model's input.

Another line of studies focuses on designing more effective draft construction strategies. Conventional approaches often yield single draft token sequences, posing challenges for passing verification. In response, Spectr [230] advocates generating multiple draft token sequences and employs a k-sequential draft selection technique to concurrently verify k sequences. This method leverages speculative sampling, ensuring equivalence in output distributions. Similarly, SpecInfer [225] adopts a comparable approach. However, unlike Spectr, SpecInfer merges draft token sequences into a "token tree" and introduces a tree attention mechanism for validation. This strategy is called the "token tree verifier". Due to its efficacy, the token tree verifier has been widely embraced in numerous speculative decoding algorithms [48], [224], [226], [229]. In addition to these efforts, Stage Speculative Decoding [226] and Cascade Speculative Drafting (CS Drafting) [227] propose accelerating draft construction by integrating speculative decoding directly into the token generation process.

Comparative Experiments and Analysis. We conduct an experiment to evaluate the speed-up performance of the speculative decoding methods. Specifically, we thoroughly review the studies of this field, and select six of them that have open-sourced their code, i.e., Speculative Decoding (SpD) [218], [219], Lookahead Decoding (LADE) [228], REST [224], Self-speculative Decoding (SSD) [221], Medusa [48] and Eagle [229]. As for the evaluation dataset, we use Vicuna-80 [7] to evaluate the above methods, which contains 80 questions classified into 10 categories. We report the average results on these 80 questions. As for target LLMs, we adopt five popular open-source LLMs, i.e., Vicuna-7B-V1.3 [7], Vicuna-13B-V1.3 [7], Vicuna-33B-V1.3 [7], LLaMA-2-7B [5] and LLaMA-2-13B [5]. We report the range of evaluation metrics across these five LLMs. As for draft models, we adopt two well-trained draft models, i.e., LLaMA-68M and LLaMA-160M [225], for SpD. For other speculative decoding methods, we follow their proposed draft construction approaches and use the checkpoints they provide. As for the evaluation metrics, we adopt acceptance rate, which denotes the ratio of the number of accepted tokens to the number of generation steps, and speed-up, which denotes the ratio of the latency of original auto-regressive decoding to the latency of speculative decoding when fixing the total length of output.

Tab. 5 provides a comparison of various speculative decoding methods, highlighting several key observations: (1) Eagle demonstrates exceptional performance, achieving a notable 3.47∼3.72× end-to-end speed-up across multiple LLMs. To understand its success, a deeper analysis of Eagle reveals two key factors. Firstly, Eagle employs an auto-regressive approach for decoding draft tokens, leveraging information from previously generated tokens directly. Secondly, Eagle integrates rich features from previous tokens of both the original LLMs and the draft models to enhance the accuracy of the next draft token generation. (2) The token tree verifier proves to be an effective technique in enhancing the performance of speculative decoding methods. (3) The end-to-end speed-up achieved by these methods is often lower than the acceptance rate. This difference arises due to the practical consideration that the generation cost associated with draft models cannot be overlooked.

6.2 Serving System
The optimizations for serving systems are dedicated to improving the efficiency of handling asynchronous requests. The memory management is optimized to hold more requests, and efficient batching and scheduling strategies are integrated to improve the system throughput. Besides, optimizations specific to distributed systems are proposed to exploit distributed computational resources.

6.2.1 Memory Management
The storage of the KV cache dominates the memory usage in LLM serving, especially when the context length is long (see Sec. 2.3). Since the generation length is uncertain, it is challenging to allocate the space for KV cache storage in advance. Earlier implementations [261] usually allocate storage space in advance based on the preset maximum length of each request. However, in instances where request generation is terminated early, this approach incurs significant wastage of storage resources. To address the issue, S3 [259] proposes to predict an upper bound of the generation length for each request, in order to reduce the
waste of the pre-allocated space. However, the static way The computation of each request encompasses multiple
of KV cache memory allocation still fails when no such iterations, with each iteration representing either a pre-
large contiguous space exists. To deal with the fragmented filling step or a decoding step. The author suggests that
storage, vLLM [49] proposes to store the KV cache in a different requests can be batched at the iteration level. The
paged manner following the operating system. vLLM first work implements iteration-level batching in linear oper-
allocates a memory space as large as possible and divides ators, concatenating different requests together in the se-
it equally into multiple physical blocks. When a request quence dimension. Hence, the spare storage and computa-
comes, vLLM dynamically maps the generated KV cache to tional resources corresponding to the completed requests
the pre-allocated physical blocks in a discontinuous fashion. are promptly released. Following ORCA, vLLM [49] ex-
In this way, vLLM significantly reduces storage fragmenta- tends the technique to the attention computation, enabling
tion and achieves a higher throughput in LLM serving. On requests with different KV cache lengths to be batched to-
the basis of vLLM, LightLLM [253] uses a more fine-grained gether. Sarathi [257], DeepSpeed-FastGen [254] and Sarathi-
KV cache storage to cut down the waste happening with Serve [258] further introduce a split-and-fuse method to
the irregular boundary. Instead of a block, LightLLM treats batch together prefilling requests and decoding requests.
the KV cache of a token as a unit, so that the generated KV Specifically, this method first splits the long prefilling re-
cache always saturates the pre-allocated space. quest in the sequence dimension, and then batches it to-
Current optimized serving systems commonly employ this paged approach to manage the KV cache storage, thereby mitigating the waste of redundant KV cache memory. However, the paged storage leads to irregular memory access in the attention operator. For an attention operator using the paged KV cache, this necessitates accounting for the mapping relationship between the virtual address space of the KV cache and its corresponding physical address space. To enhance the efficiency of the attention operator, the loading pattern of the KV cache must be tailored to facilitate contiguous memory access. For instance, in the case of the PagedAttention of vLLM [49], the head-size dimension of the K cache is stored as a 16-byte contiguous vector, while FlashInfer [260] orchestrates diverse data layouts for the KV cache, accompanied by an appropriately designed memory access scheme. The optimization of the attention operator in conjunction with paged KV cache storage remains a forefront challenge in the advancement of serving systems.
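The sketch below illustrates the virtual-to-physical translation that such an attention operator must perform. A real kernel (e.g., vLLM's PagedAttention or FlashInfer) fuses this lookup into the attention computation and arranges the block layout so that the inner loads stay contiguous; this toy version simply materializes the gathered keys. The tensor shapes and helper name are assumptions for illustration.

```python
# Illustrative logical-to-physical gather performed over a paged K cache; a real
# kernel (e.g., vLLM's PagedAttention or FlashInfer) fuses this lookup into the
# attention computation instead of materializing the gathered keys.
import torch

def gather_keys(k_cache: torch.Tensor, block_table, context_len: int, block_size: int):
    """k_cache: [num_physical_blocks, block_size, num_heads, head_dim].
    Returns the logically contiguous keys of one request: [context_len, num_heads, head_dim]."""
    rows = []
    for pos in range(context_len):
        phys_block = block_table[pos // block_size]         # virtual position -> physical block
        rows.append(k_cache[phys_block, pos % block_size])  # irregular read within the block pool
    return torch.stack(rows)                                # contiguous tensor handed to attention

k_cache = torch.randn(1024, 16, 32, 128)                    # 1024 blocks of 16 tokens, 32 heads, dim 128
keys = gather_keys(k_cache, block_table=[7, 42, 3], context_len=40, block_size=16)
print(keys.shape)                                           # torch.Size([40, 32, 128])
```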
6.2.2 Continuous Batching
The request lengths in a batch can be different, leading to low utilization when shorter requests are finished while longer requests are still running. Due to the asynchronous nature of requests in serving scenarios, there exists an opportunity to mitigate such periods of low utilization. The continuous batching technique is proposed to leverage this opportunity by batching new requests as soon as some old requests are finished. ORCA [252] is the first to utilize the continuous batching technique in LLM serving. The computation of each request consists of multiple iterations, where each iteration is either a prefilling step or a decoding step, and the authors suggest that different requests can be batched at the iteration level. The work implements iteration-level batching in linear operators, concatenating different requests together in the sequence dimension. Hence, the spare storage and computational resources corresponding to completed requests are promptly released. Following ORCA, vLLM [49] extends the technique to the attention computation, enabling requests with different KV cache lengths to be batched together. Sarathi [257], DeepSpeed-FastGen [254] and Sarathi-Serve [258] further introduce a split-and-fuse method to batch prefilling requests and decoding requests together. Specifically, this method first splits a long prefilling request in the sequence dimension, and then batches it together with multiple short decoding requests. The split-and-fuse method balances the workloads among different iterations, and significantly reduces the tail latency by removing the stalls that newly arriving requests would otherwise cause. LightLLM [253] also adopts the split-and-fuse method.

The split-and-fuse technique operates on the premise that requests in the prefilling stage can be partitioned into discrete chunks. Chunked prefill segments prefilling requests along the sequence dimension, thereby preventing them from becoming a bottleneck for other requests. This strategy capitalizes on the auto-regressive characteristic of LLMs, where attention computation only relies on prior tokens. Consequently, the mathematical equivalence of the chunked-prefill technique is guaranteed, positioning it as a leading approach for reducing request latency in LLM serving.
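The toy serving loop below sketches how iteration-level (continuous) batching and split-and-fuse scheduling can be combined: finished requests free their slots immediately, long prefills are split into chunks, and each iteration mixes prefill chunks with single-token decoding steps. The data structures, the chunk budget, and the batch-size limit are illustrative assumptions, not the actual ORCA, Sarathi, or DeepSpeed-FastGen implementations.

```python
# Toy continuous-batching loop with split-and-fuse (chunked prefill);
# all structures and constants are illustrative.
from dataclasses import dataclass
from collections import deque

CHUNK = 512                          # max prefill tokens admitted per iteration
MAX_BATCH = 32                       # max requests running concurrently

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0               # prompt tokens processed so far
    generated: int = 0               # tokens decoded so far

    def done(self) -> bool:
        return self.generated >= self.max_new_tokens

waiting: deque = deque()             # newly arrived requests
running: list = []                   # requests currently in the batch

def step(run_model_iteration):
    """One serving iteration: admit new work, then build a mixed prefill/decode batch."""
    # Continuous batching: slots freed by finished requests are refilled immediately.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    batch, budget = [], CHUNK
    for req in running:
        if req.prefilled < req.prompt_len:                       # still prefilling
            chunk = min(budget, req.prompt_len - req.prefilled)  # split a long prefill
            if chunk > 0:
                batch.append((req, "prefill", chunk))
                budget -= chunk
        else:                                                    # decoding: one token per iteration
            batch.append((req, "decode", 1))

    run_model_iteration(batch)                                   # fused prefill+decode model call
    for req, kind, n in batch:
        if kind == "prefill":
            req.prefilled += n
        else:
            req.generated += n
    running[:] = [r for r in running if not r.done()]            # release finished requests

# Toy usage: one long-prompt request and one short one share iterations.
waiting.append(Request(rid=0, prompt_len=1300, max_new_tokens=8))
waiting.append(Request(rid=1, prompt_len=40, max_new_tokens=8))
while waiting or running:
    step(lambda batch: None)                                     # stand-in for the real model call
```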
6.2.3 Scheduling Strategy
In LLM serving, the job length of each request exhibits variability, and hence the order in which requests are executed significantly impacts the throughput of the serving system. Head-of-line blocking [255] happens when long requests are accorded priority. Specifically, memory usage grows rapidly in response to long requests, resulting in the impeding of subsequent requests when the system exhausts its memory capacity. The pioneering work ORCA [252] and open-source systems, including vLLM [49] and LightLLM [253], employ the simple first-come-first-serve (FCFS) principle to schedule requests. DeepSpeed-FastGen [254] gives priority to the decoding requests in its scheduling.
FastServe [255] proposes a preemptive scheduling strategy to mitigate the head-of-line blocking problem, achieving low job completion time (JCT) in LLM serving. FastServe employs a multi-level feedback queue (MLFQ) to prioritize the requests with the shortest remaining time. Since the auto-regressive decoding approach makes request lengths unknown in advance, FastServe first predicts the length and then uses a skip-join fashion to find the proper priority level for each request. Unlike previous work, VTC [256] discusses fairness in LLM serving. VTC introduces a cost function based on token counts to measure fairness among clients, and further proposes a fair scheduler to ensure fairness.
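A rough sketch of the skip-join multi-level feedback queue idea is given below: instead of always entering the highest-priority queue, a request joins the level whose quantum matches its predicted output length, and it is demoted when it exhausts its quantum without finishing. The number of levels, the quantum sizes, and the class interface are invented for illustration and do not reproduce FastServe's actual implementation.

```python
# Sketch of skip-join MLFQ admission in the spirit of FastServe; the length
# predictor, quantum sizes, and interface are made up for illustration.
import heapq
import itertools

NUM_QUEUES = 4
QUANTUM = [1, 4, 16, 64]            # decoding steps a request may run at each level

class SkipJoinMLFQ:
    def __init__(self):
        self.queues = [[] for _ in range(NUM_QUEUES)]   # one FIFO (min-heap on arrival tick) per level
        self._tick = itertools.count()

    def admit(self, request, predicted_len: int):
        # Skip-join: enter the level whose quantum already covers the predicted
        # length, instead of always starting from the highest-priority queue.
        level = next((i for i, q in enumerate(QUANTUM) if predicted_len <= q),
                     NUM_QUEUES - 1)
        heapq.heappush(self.queues[level], (next(self._tick), request))

    def next_request(self):
        # Serve the highest-priority non-empty level first (approximates
        # shortest-remaining-time-first scheduling).
        for level, q in enumerate(self.queues):
            if q:
                _, request = heapq.heappop(q)
                return level, request
        return None

    def demote(self, request, level: int):
        # A request that exhausts its quantum without finishing moves down one level.
        heapq.heappush(self.queues[min(level + 1, NUM_QUEUES - 1)],
                       (next(self._tick), request))

mlfq = SkipJoinMLFQ()
mlfq.admit("short request", predicted_len=3)     # joins level 1 (quantum 4)
mlfq.admit("long request", predicted_len=500)    # joins the lowest-priority level
level, req = mlfq.next_request()                 # "short request" is served first
```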
6.2.4 Distributed Systems
In order to achieve high throughput, LLM services are commonly deployed on distributed platforms. Recent works have additionally focused on optimizing the performance of such inference services by exploiting distributed characteristics. Notably, observing that prefilling is compute-intensive and decoding is memory-intensive, Splitwise [247], TetriInfer [248] and DistServe [249] demonstrate the efficiency of disaggregating the prefilling and decoding steps of a request. In this way, the two distinct stages are processed independently based on their characteristics. SpotServe [250] is designed to provide LLM service on clouds with preemptible GPU instances. SpotServe efficiently handles challenges including dynamic parallelism control and instance migration, and also utilizes the auto-regressive nature of LLMs to achieve token-level state recovery. Moreover, Infinite-LLM [251] extends the paged KV cache method of vLLM [49] to the distributed cloud environment.
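The sketch below conveys the disaggregation idea at a high level: prefilling and decoding run in separate worker pools, and the KV cache produced by a prefill worker is handed to a decode worker. The queues, the request interface, and the model callbacks are placeholders; real systems additionally handle KV-cache transfer across devices, parallelism configuration, and scheduling.

```python
# Minimal sketch of disaggregated prefill/decode serving in the spirit of
# Splitwise/TetriInfer/DistServe: compute-bound prefilling and memory-bound
# decoding run on separate worker pools and communicate via a hand-off queue.
# The queues, request interface, and model callbacks are illustrative placeholders.
import queue

prefill_jobs = queue.Queue()    # requests whose prompts still need processing
decode_jobs = queue.Queue()     # (request, kv_cache) pairs handed over for decoding

def prefill_worker(run_prefill):
    """Runs on the prefill pool: one compute-intensive pass over the whole prompt."""
    while True:
        request = prefill_jobs.get()
        kv_cache = run_prefill(request)           # builds the prompt's KV cache
        decode_jobs.put((request, kv_cache))      # hand the state over to the decode pool

def decode_worker(run_decode_step, is_finished, emit):
    """Runs on the decode pool: memory-bound, generates one token per step."""
    while True:
        request, kv_cache = decode_jobs.get()
        while not is_finished(request):
            token = run_decode_step(request, kv_cache)
            emit(request, token)                  # stream each token back to the client
```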
6.3 Hardware Accelerator Design
Previous research efforts [262], [263], [264] have focused on optimizing the Transformer architecture, particularly the attention operator, often employing sparse methods to facilitate FPGA deployment. The FACT [265] accelerator achieves superior energy efficiency compared to the NVIDIA V100 GPU through mixed-precision quantization for linear operators and algorithm-hardware co-design, yet these approaches are not tailored for generative LLMs. Recent work like ALLO [266] highlights FPGA advantages in managing the memory-intensive decoding stage and emphasizes the importance of model compression techniques for the efficient FPGA deployment of LLMs. Conversely, DFX [267] focuses on decoding-stage optimizations but lacks model compression methods, limiting its scalability to larger models and longer inputs (up to a 1.5B model and 256 tokens). ALLO builds on these insights, further offering a library of composable and reusable High-Level Synthesis (HLS) kernels. ALLO's implementation demonstrates superior generation speed-up compared to DFX in the prefilling stage, and achieves enhanced energy efficiency and speedup over the NVIDIA A100 GPU during decoding.

FlightLLM [268] also leverages these insights, introducing a configurable sparse digital signal processor (DSP) chain for various sparsity patterns with high computational efficiency. It proposes an always-on-chip decode scheme with mixed-precision support to enhance memory bandwidth utilization. FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency than the NVIDIA V100S GPU for Llama2-7B models, with 1.2× higher throughput than the NVIDIA A100 GPU during decoding.

6.4 Comparison of LLM Frameworks
TABLE 6
Comparison of multiple open-source inference engines and serving systems. "-" denotes no serving support. Note that the scheduling method of TensorRT-LLM is not open-sourced.

We compare the performance of multiple LLM frameworks in Table 6. The inference throughput is measured with Llama2-7B (batch size=1, input length=1k, output length=128). The serving performance is the maximum throughput measured on the ShareGPT [269] dataset. Both are measured on a single NVIDIA A100 80GB GPU. Among the mentioned frameworks, DeepSpeed [236], vLLM [49], LightLLM [253] and TensorRT-LLM [208] integrate a serving function to serve asynchronous requests from multiple users. We also list the optimizations of each framework in the table. All the frameworks except HuggingFace implement operator-level or graph-level optimizations to enhance performance, and some of them also support the speculative decoding technique. Note that the speculative decoding technique is turned off when we measure the inference performance of all frameworks. The inference throughput results show that FlashDecoding++ and TensorRT-LLM outperform the others with optimizations covering the predominant operators and the computational graph. From the serving aspect, all the frameworks use fine-grained and discontiguous storage for the KV cache, and apply continuous batching techniques to improve system utilization. Unlike vLLM and LightLLM, DeepSpeed prioritizes decoding requests in scheduling, which means no new request is merged if there are enough existing decoding requests in the batch.
6.5 Knowledge, Suggestions and Future Direction
The system-level optimization improves efficiency while bringing no accuracy degradation, and has hence become prevalent in LLM inference practice. The optimization for inference is also applicable to serving. Recently, operator optimization has been closely combined with practical serving scenarios, e.g., RadixAttention [50] designed specifically for prefix caching, and tree attention [225] to accelerate speculative decoding verification. The iteration of applications and scenarios will continue to put forward new requirements for operator development.

Given the multifaceted objectives inherent in real-world serving systems, such as JCT, system throughput, and fairness, the design of scheduling strategies becomes correspondingly intricate. Within the domain of LLM serving, where the length of requests is indeterminate, the existing literature commonly relies on predictive mechanisms to facilitate the design of scheduling strategies. However, the efficacy of current predictors [248] falls short of ideal standards, indicating the potential for refinement and optimization in the development of serving scheduling strategies.
7 Discussions of Key Application Scenarios
Current research endeavors have made significant strides in exploring the boundaries of efficient LLM inference across various optimization levels. However, further studies are warranted to enhance LLM efficiency in practical scenarios. We have provided promising future directions for optimization techniques at the data-level (Sec. 4.3), model-level (Sec. 5.3), and system-level (Sec. 6.5). In this section, we summarize four critical scenarios: agent and multi-model framework, long-context LLMs, edge scenario deployment, and security-efficiency synergy, and provide a broader discussion on them.

Agent and Multi-Model Framework. As discussed in Sec. 4.3, recent advancements in agent and multi-model frameworks [53], [54], [55] have significantly improved agents' capabilities to handle complex tasks and human requests by harnessing the powerful abilities of LLMs. These frameworks, while increasing the computational demands of LLMs, introduce more parallelism into the structure of LLMs' output content, thereby creating opportunities for data-level and system-level optimizations such as output organization techniques [50]. Furthermore, these frameworks naturally introduce a new optimization level, i.e., the pipeline level, which holds potential for efficiency enhancements [56].

In addition, there is a growing research trend [270] focused on extending AI agents into the multimodal domain; such agent systems often utilize Large Multimodal Models (LMMs) as their core. To enhance the efficiency of these emerging LMM-based agents, designing optimization techniques for LMMs is a promising research direction.

Long-Context LLMs. Currently, LLMs face the challenge of handling increasingly longer input contexts. However, the self-attention operation, the fundamental component of Transformer-style LLMs, exhibits quadratic complexity in relation to the context length, imposing constraints on the maximum context length during both the training and inference phases. Various strategies have been explored to address this limitation, including input compression (Sec. 4.1), sparse attention (Sec. 5.2.2), design of low-complexity structures (Sec. 5.1.3), and optimization of attention operators (Sec. 6.1.1). Notably, non-Transformer architectures (Sec. 5.1.3) with sub-quadratic or linear complexity have recently garnered significant interest from researchers.

Despite their efficiency, the competitiveness of these novel architectures compared to the Transformer architecture across various abilities, such as in-context learning ability and long-range modeling ability, is still under scrutiny [74], [271]. Therefore, exploring the capabilities of these new architectures from multiple angles and addressing their limitations remains a valuable pursuit. Moreover, it is crucial to determine the necessary context lengths for various scenarios and tasks, as well as to identify the next-generation architecture that will serve as the foundational backbone for LLMs in the future.

Edge Scenario Deployment. While considerable efforts have been directed towards enhancing the efficiency of LLM inference, deploying LLMs onto extremely resource-constrained edge devices like mobile phones presents ongoing challenges. Recently, numerous researchers [272], [273], [274], [275], [276], [277], [278], [279], [280], [281], [282] have shown interest in pre-training smaller language models with 1B to 3B parameters. Models of this scale offer reduced resource costs during inference and hold potential for achieving generalization abilities and competitive performance compared to larger models. However, the methods to develop such efficient and powerful smaller language models remain under-explored.

Several studies have initiated this promising direction. For instance, MiniCPM [281] conducts sandbox experiments to determine optimal pre-training hyper-parameters. PanGu-π-Pro [274] suggests initializing model weights from pre-trained LLMs using metrics and techniques from model pruning. MobileLLM [282] adopts a "deep and thin" architecture for small model design and proposes weight sharing across different layers to increase the number of layers without additional memory costs. Nevertheless, a performance gap still exists between small and large models, necessitating future studies to narrow this gap. In the future, there is a crucial need for research aimed at identifying the model scale limit in edge scenarios, and at exploring the boundaries of various optimization methods for designing smaller models.

Beyond designing smaller models, system-level optimization offers a promising direction for LLM deployment. A notable recent project, MLC-LLM [283], successfully deploys the LLaMA-7B model on mobile phones. MLC-LLM primarily employs compilation techniques like fusion, memory planning, and loop optimization to reduce latency and memory cost during inference. Additionally, adopting cloud-edge collaboration techniques, or designing more sophisticated hardware accelerators, can also help deploy LLMs onto edge devices.

Security-Efficiency Synergy. In addition to task performance and efficiency, security is also a crucial factor that must be considered in LLM applications [284], [285]. Current research primarily focuses on efficiency optimization
without adequately addressing security considerations. Therefore, it is critical to investigate the interplay between efficiency and security and to determine whether the current optimization techniques compromise the security of LLMs. If these techniques negatively impact LLMs' security, a promising direction would involve developing new optimization methods or refining the existing ones to achieve a better trade-off between LLMs' efficiency and security.

8 Conclusion
Efficient LLM inference focuses on reducing the computational, memory access, and memory costs during LLM inference processes, aiming to optimize efficiency metrics such as latency, throughput, storage, power, and energy. This survey offers a comprehensive review of efficient LLM inference research, presenting insights, recommendations, and future directions for key techniques. Initially, we introduce a hierarchical taxonomy encompassing data-, model-, and system-level optimizations. Subsequently, guided by this taxonomy, we meticulously examine and summarize studies at each level and sub-field. For well-established techniques like model quantization and efficient serving systems, we conduct experiments to evaluate and analyze their performance. Based on these analyses, we offer practical suggestions and identify promising research avenues for practitioners and researchers in the field.

Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62325405, 62104128, U19B2019, U21B2031, 61832007, 62204164), the Tsinghua EE Xilinx AI Research Fund, and the Beijing National Research Center for Information Science and Technology (BNRist). We thank Infinigence-AI for all its support. We thank Xiangsheng Shi, Zinan Lin, Xinhao Yang, Hongyi Wang, Linfeng Zhang, Yulin Wang, Xuemin Sun, and Saiqian Zhang for their valuable suggestions on the paper. We thank Shengxiang Wang and Qiuli Mao for providing the efficiency profiling data of quantized operators.

References
[1] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[4] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[6] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan et al., “Baichuan 2: Open large-scale language models,” arXiv preprint arXiv:2309.10305, 2023.
[7] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
[8] D. Li, R. Shao, A. Xie, Y. Sheng, L. Zheng, J. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “How long can context length of open-source llms truly promise?” in NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
[9] B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon et al., “Bloom: A 176b-parameter open-access multilingual language model,” arXiv preprint arXiv:2211.05100, 2022.
[10] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic et al., “The falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023.
[11] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “Glm: General language model pretraining with autoregressive blank infilling,” arXiv preprint arXiv:2103.10360, 2021.
[12] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
[13] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” ACM Transactions on Knowledge Discovery from Data, 2023.
[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[16] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
[17] X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A survey on model compression for large language models,” arXiv preprint arXiv:2308.07633, 2023.
[18] S. Park, J. Choi, S. Lee, and U. Kang, “A comprehensive survey of compression algorithms for language models,” arXiv preprint arXiv:2401.15347, 2024.
[19] W. Wang, W. Chen, Y. Luo, Y. Long, Z. Lin, L. Zhang, B. Lin, D. Cai, and X. He, “Model compression and efficient inference for large language models: A survey,” arXiv preprint arXiv:2402.09748, 2024.
[20] T. Ding, T. Chen, H. Zhu, J. Jiang, Y. Zhong, J. Zhou, G. Wang, Z. Zhu, I. Zharkov, and L. Liang, “The efficiency spectrum of large language models: An algorithmic survey,” arXiv preprint arXiv:2312.00678, 2023.
[21] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards efficient generative large language model serving: A survey from algorithms to systems,” arXiv preprint arXiv:2312.15234, 2023.
[22] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, Z. Qu, S. Yan, Y. Zhu, Q. Zhang, M. Chowdhury et al., “Efficient large language models: A survey,” arXiv preprint arXiv:2312.03863, vol. 1, 2023.
[23] M. Xu, W. Yin, D. Cai, R. Yi, D. Xu, Q. Wang, B. Wu, Y. Zhao, C. Yang, S. Wang et al., “A survey of resource-efficient llm and multimodal foundation models,” arXiv preprint arXiv:2401.08092, 2024.
[24] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[26] Z. Yuan, Y. Shang, Y. Zhou, Z. Dong, C. Xue, B. Wu, Z. Li, Q. Gu, Y. J. Lee, Y. Yan et al., “Llm inference unveiled: Survey and roofline model insights,” arXiv preprint arXiv:2402.16363, 2024.
[27] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[28] A. Chevalier, A. Wettig, A. Ajith, and D. Chen, “Adapt- [51] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and
ing language models to compress contexts,” arXiv preprint K. Narasimhan, “Tree of thoughts: Deliberate problem solving
arXiv:2305.14788, 2023. with large language models,” Advances in Neural Information
[29] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, Processing Systems, vol. 36, 2024.
L. Zettlemoyer, and W. tau Yih, “Replug: Retrieval-augmented [52] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski,
black-box language models,” 2023. L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk
[30] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self- et al., “Graph of thoughts: Solving elaborate problems with
rag: Learning to retrieve, generate, and critique through self- large language models,” in Proceedings of the AAAI Conference on
reflection,” 2023. Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690.
[31] D. Wingate, M. Shoeybi, and T. Sorensen, “Prompt compres- [53] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang,
sion and contrastive conditioning for controllability and toxicity J. Wang, S. Jin, E. Zhou et al., “The rise and potential of
reduction in language models,” arXiv preprint arXiv:2210.03162, large language model based agents: A survey,” arXiv preprint
2022. arXiv:2309.07864, 2023.
[32] J. Mu, X. L. Li, and N. Goodman, “Learning to compress prompts [54] Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex:
with gist tokens,” arXiv preprint arXiv:2304.08467, 2023. Pushing the boundaries of complex reasoning through multi-
[33] T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context model collaboration,” arXiv preprint arXiv:2310.00280, 2023.
autoencoder for context compression in a large language model,” [55] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla,
arXiv preprint arXiv:2307.06945, 2023. O. Wiest, and X. Zhang, “Large language model based multi-
[34] F. Xu, W. Shi, and E. Choi, “Recomp: Improving retrieval- agents: A survey of progress and challenges,” arXiv preprint
augmented lms with compression and selective augmentation,” arXiv:2402.01680, 2024.
arXiv preprint arXiv:2310.04408, 2023. [56] L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large
[35] W. Fei, X. Niu, P. Zhou, L. Hou, B. Bai, L. Deng, and W. Han, “Ex- language models while reducing cost and improving perfor-
tending context window of large language models via semantic mance,” arXiv preprint arXiv:2305.05176, 2023.
compression,” arXiv preprint arXiv:2312.09571, 2023. [57] Y. Li, T. Cai, Y. Zhang, D. Chen, and D. Dey, “What makes
[36] W. Zhou, Y. E. Jiang, R. Cotterell, and M. Sachan, “Efficient convolutional models great on long sequence modeling?” arXiv
prompting via dynamic in-context learning,” arXiv preprint preprint arXiv:2210.09298, 2022.
arXiv:2305.11170, 2023. [58] D. W. Romero, A. Kuzina, E. J. Bekkers, J. M. Tomczak, and
[37] Y. Li, B. Dong, F. Guerin, and C. Lin, “Compressing context M. Hoogendoorn, “Ckconv: Continuous kernel convolution for
to enhance inference efficiency of large language models,” in sequential data,” arXiv preprint arXiv:2102.02611, 2021.
Proceedings of the 2023 Conference on Empirical Methods in Natural [59] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus,
Language Processing, 2023, pp. 6342–6353. Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger
[38] F. Yin, J. Vig, P. Laban, S. Joty, C. Xiong, and C.-S. J. Wu, “Did you convolutional language models,” in International Conference on
read the instructions? rethinking the effectiveness of task defi- Machine Learning. PMLR, 2023, pp. 28 043–28 078.
nitions in instruction learning,” arXiv preprint arXiv:2306.01150, [60] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho,
2023. H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV et al.,
[39] H. Jung and K.-J. Kim, “Discrete prompt compression with “Rwkv: Reinventing rnns for the transformer era,” arXiv preprint
reinforcement learning,” arXiv preprint arXiv:2308.08758, 2023. arXiv:2305.13048, 2023.
[40] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, “Llmlingua: [61] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and
Compressing prompts for accelerated inference of large language F. Wei, “Retentive network: A successor to transformer for large
models,” in The 2023 Conference on Empirical Methods in Natural language models,” arXiv preprint arXiv:2307.08621, 2023.
Language Processing, 2023.
[62] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent
[41] H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and memory with optimal polynomial projections,” Advances in neural
L. Qiu, “Longllmlingua: Accelerating and enhancing llms in information processing systems, vol. 33, pp. 1474–1487, 2020.
long context scenarios via prompt compression,” arXiv preprint
[63] A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, and
arXiv:2310.06839, 2023.
C. Ré, “Combining recurrent, convolutional, and continuous-
[42] X. Huang, L. L. Zhang, K.-T. Cheng, and M. Yang, “Boosting llm
time models with linear state space layers,” Advances in neural
reasoning: Push the limits of few-shot learning with reinforced
information processing systems, vol. 34, pp. 572–585, 2021.
in-context pruning,” arXiv preprint arXiv:2312.08901, 2023.
[64] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences
[43] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu,
with structured state spaces,” arXiv preprint arXiv:2111.00396,
and Z. Sui, “A survey for in-context learning,” arXiv preprint
2021.
arXiv:2301.00234, 2022.
[65] A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as ef-
[44] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous
fective as structured state spaces,” Advances in Neural Information
prompts for generation,” in Proceedings of the 59th Annual Meeting
Processing Systems, vol. 35, pp. 22 982–22 994, 2022.
of the Association for Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Language Processing (Volume 1: [66] A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization
Long Papers), 2021, pp. 4582–4597. and initialization of diagonal state space models,” Advances in
[45] X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of- Neural Information Processing Systems, vol. 35, pp. 35 971–35 983,
thought: Large language models can do parallel decoding,” arXiv 2022.
preprint arXiv:2307.15337, 2023. [67] H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur, “Long
[46] S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, range language modeling via gated state spaces,” in International
A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph Conference on Learning Representations, 2023.
decoding,” arXiv preprint arXiv:2402.12280, 2024. [68] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré,
[47] M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Hungry hungry hippos: Towards language modeling with state
“Apar: Llms can do auto-parallel auto-regressive decoding,” space models,” arXiv preprint arXiv:2212.14052, 2022.
arXiv preprint arXiv:2401.06761, 2024. [69] R. Hasani, M. Lechner, T.-H. Wang, M. Chahine, A. Amini, and
[48] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, D. Rus, “Liquid structural state-space models,” arXiv preprint
“Medusa: Simple llm inference acceleration framework with mul- arXiv:2209.12951, 2022.
tiple decoding heads,” 2024. [70] J. T. Smith, A. Warrington, and S. W. Linderman, “Simpli-
[49] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, fied state space layers for sequence modeling,” arXiv preprint
J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory manage- arXiv:2208.04933, 2022.
ment for large language model serving with pagedattention,” in [71] J. Pilault, M. Fathi, O. Firat, C. Pal, P.-L. Bacon, and R. Goroshin,
Proceedings of the 29th Symposium on Operating Systems Principles, “Block-state transformers,” Advances in Neural Information Pro-
2023, pp. 611–626. cessing Systems, vol. 36, 2024.
[50] L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, [72] J. Wang, J. N. Yan, A. Gu, and A. M. Rush, “Pretraining without
C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., “Efficiently pro- attention,” arXiv preprint arXiv:2212.10544, 2022.
gramming large language models using sglang,” arXiv preprint [73] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with
arXiv:2312.07104, 2023. selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
[74] J. Park, J. Park, Z. Xiong, N. Lee, J. Cho, S. Oymak, K. Lee, [95] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu,
and D. Papailiopoulos, “Can mamba learn how to learn? a M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scal-
comparative study on in-context learning tasks,” arXiv preprint ing of language models with mixture-of-experts,” in International
arXiv:2402.04248, 2024. Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
[75] N. Shazeer, “Fast transformer decoding: One write-head is all [96] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton,
you need,” arXiv preprint arXiv:1911.02150, 2019. and J. Dean, “Outrageously large neural networks: The sparsely-
[76] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, gated mixture-of-experts layer,” in International Conference on
and S. Sanghai, “Gqa: Training generalized multi-query trans- Learning Representations, 2016.
former models from multi-head checkpoints,” arXiv preprint [97] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang,
arXiv:2305.13245, 2023. M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant
[77] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Lin- models with conditional computation and automatic sharding,”
former: Self-attention with linear complexity,” arXiv preprint arXiv preprint arXiv:2006.16668, 2020.
arXiv:2006.04768, 2020. [98] C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang,
[78] G. I. Winata, S. Cahyawijaya, Z. Lin, Z. Liu, and P. Fung, R. Salas, J. Jose, P. Ram et al., “Tutel: Adaptive mixture-of-experts
“Lightweight and efficient end-to-end speech recognition using at scale,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
low-rank transformer,” in ICASSP 2020-2020 IEEE International [99] D. P. Bertsekas, “Auction algorithms for network flow problems:
Conference on Acoustics, Speech and Signal Processing (ICASSP). A tutorial introduction,” Computational optimization and applica-
IEEE, 2020, pp. 6144–6148. tions, vol. 1, pp. 7–66, 1992.
[79] A. Gupta, Y. Yuan, Y. Zhou, and C. Mendis, “Flurka: Fast fused [100] Z. Dai, G. Lai, Y. Yang, and Q. Le, “Funnel-transformer: Filtering
low-rank & kernel attention,” arXiv preprint arXiv:2306.15799, out sequential redundancy for efficient language processing,”
2023. Advances in neural information processing systems, vol. 33, pp. 4271–
[80] X. Ma, X. Kong, S. Wang, C. Zhou, J. May, H. Ma, and L. Zettle- 4282, 2020.
moyer, “Luna: Linear unified nested attention,” Advances in Neu- [101] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang,
ral Information Processing Systems, vol. 34, pp. 2441–2453, 2021. “Vision mamba: Efficient visual representation learning with
[81] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, bidirectional state space model,” arXiv preprint arXiv:2401.09417,
“Set transformer: A framework for attention-based permutation- 2024.
invariant neural networks,” in International conference on machine [102] W. Hua, Z. Dai, H. Liu, and Q. Le, “Transformer quality in linear
learning. PMLR, 2019, pp. 3744–3753. time,” in International Conference on Machine Learning. PMLR,
[82] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Trans- 2022, pp. 9099–9117.
formers are rnns: Fast autoregressive transformers with linear [103] AI21, “Jamba: Ai21’s groundbreaking ssm-transformer model,”
attention,” in International conference on machine learning. PMLR, March 2024. [Online]. Available: https://www.ai21.com/blog/
2020, pp. 5156–5165. announcing-jamba
[83] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, [104] W. He, K. Han, Y. Tang, C. Wang, Y. Yang, T. Guo, and
A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, Y. Wang, “Densemamba: State space models with dense hidden
L. Kaiser et al., “Rethinking attention with performers,” in In- connection for efficient large language models,” arXiv preprint
ternational Conference on Learning Representations, 2020. arXiv:2403.00818, 2024.
[84] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. Smith, and [105] Q. Anthony, Y. Tokpanov, P. Glorioso, and B. Millidge, “Black-
L. Kong, “Random feature attention,” in International Conference mamba: Mixture of experts for state-space models,” arXiv preprint
on Learning Representations, 2022. arXiv:2402.01771, 2024.
[85] P. Kacham, V. Mirrokni, and P. Zhong, “Polysketchformer: Fast [106] M. Pióro, K. Ciebiera, K. Król, J. Ludziejewski, and S. Jaszczur,
transformers via sketches for polynomial kernels,” arXiv preprint “Moe-mamba: Efficient selective state space models with mixture
arXiv:2310.01655, 2023. of experts,” arXiv preprint arXiv:2401.04081, 2024.
[86] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling [107] S. Zhai, W. Talbott, N. Srivastava, C. Huang, H. Goh, R. Zhang,
to trillion parameter models with simple and efficient sparsity,” and J. Susskind, “An attention free transformer,” arXiv preprint
The Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232– arXiv:2105.14103, 2021.
5270, 2022. [108] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran,
[87] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “Moefication: Y. Tay, and D. Metzler, “Confident adaptive language modeling,”
Transformer feed-forward layers are mixtures of experts,” in Advances in Neural Information Processing Systems, vol. 35, pp.
Findings of the Association for Computational Linguistics: ACL 2022, 17 456–17 472, 2022.
2022, pp. 877–890. [109] L. Del Corro, A. Del Giorno, S. Agarwal, B. Yu, A. Awadallah, and
[88] Z.-F. Gao, P. Liu, W. X. Zhao, Z.-Y. Lu, and J.-R. Wen, “Parameter- S. Mukherjee, “Skipdecode: Autoregressive skip decoding with
efficient mixture-of-experts architecture for pre-trained language batching and caching for efficient llm inference,” arXiv preprint
models,” in Proceedings of the 29th International Conference on arXiv:2307.02628, 2023.
Computational Linguistics, 2022, pp. 3263–3273. [110] W. Liu, P. Zhou, Z. Wang, Z. Zhao, H. Deng, and Q. Ju, “Fastbert:
[89] A. Komatsuzaki, J. Puigcerver, J. Lee-Thorp, C. R. Ruiz, a self-distilling bert with adaptive inference time,” in Proceedings
B. Mustafa, J. Ainslie, Y. Tay, M. Dehghani, and N. Houlsby, of the 58th Annual Meeting of the Association for Computational
“Sparse upcycling: Training mixture-of-experts from dense Linguistics, 2020, pp. 6035–6044.
checkpoints,” arXiv preprint arXiv:2212.05055, 2022. [111] J. Kong, J. Wang, L.-C. Yu, and X. Zhang, “Accelerating inference
[90] M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, for pretrained language models by unified multi-perspective
“Base layers: Simplifying training of large, sparse models,” in early exiting,” in Proceedings of the 29th International Conference
International Conference on Machine Learning. PMLR, 2021, pp. on Computational Linguistics, 2022, pp. 4677–4686.
6265–6274. [112] K. Liao, Y. Zhang, X. Ren, Q. Su, X. Sun, and B. He, “A global
[91] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. past-future early exit method for accelerating inference of pre-
Dai, Q. V. Le, J. Laudon et al., “Mixture-of-experts with expert trained language models,” in Proceedings of the 2021 Conference
choice routing,” Advances in Neural Information Processing Systems, of the North American Chapter of the Association for Computational
vol. 35, pp. 7103–7114, 2022. Linguistics: Human Language Technologies, 2021, pp. 2013–2023.
[92] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, [113] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “Deebert: Dynamic
and W. Fedus, “St-moe: Designing stable and transferable sparse early exiting for accelerating bert inference,” in Proceedings of the
expert models,” arXiv preprint arXiv:2202.08906, 2022. 58th Annual Meeting of the Association for Computational Linguistics,
[93] D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, 2020, pp. 2246–2251.
“Stablemoe: Stable routing strategy for mixture of experts,” in [114] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, “Bert loses
Proceedings of the 60th Annual Meeting of the Association for Compu- patience: Fast and robust inference with early exit,” Advances in
tational Linguistics (Volume 1: Long Papers), 2022, pp. 7085–7095. Neural Information Processing Systems, vol. 33, pp. 18 330–18 341,
[94] T. Chen, Z. Zhang, A. K. JAISWAL, S. Liu, and Z. Wang, “Sparse 2020.
moe as the new dropout: Scaling dense and self-slimmable [115] T. Sun, X. Liu, W. Zhu, Z. Geng, L. Wu, Y. He, Y. Ni, G. Xie, X.-J.
transformers,” in The Eleventh International Conference on Learning Huang, and X. Qiu, “A simple hash-based early exiting approach
Representations, 2022. for language understanding and generation,” in Findings of the
Association for Computational Linguistics: ACL 2022, 2022, pp. 2409– [138] M. Javaheripi, G. de Rosa, S. Mukherjee, S. Shah, T. Religa,
2421. C. C. Teodoro Mendes, S. Bubeck, F. Koushanfar, and D. Dey,
[116] Y. Huang, Y. Chen, Z. Yu, and K. McKeown, “In-context learning “Litetransformersearch: Training-free neural architecture search
distillation: Transferring few-shot learning ability of pre-trained for efficient language models,” Advances in Neural Information
language models,” arXiv preprint arXiv:2212.10670, 2022. Processing Systems, vol. 35, pp. 24 254–24 267, 2022.
[117] J. Zhao, W. Zhao, A. Drozdov, B. Rozonoyer, M. A. Sultan, [139] D. D. Xu, S. Mukherjee, X. Liu, D. Dey, W. Wang, X. Zhang,
J.-Y. Lee, M. Iyyer, and A. McCallum, “Multistage collabora- A. Awadallah, and J. Gao, “Few-shot task-agnostic neural archi-
tive knowledge distillation from large language models,” arXiv tecture search for distilling large language models,” Advances in
preprint arXiv:2311.08640, 2023. Neural Information Processing Systems, vol. 35, pp. 28 644–28 656,
[118] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, 2022.
R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! [140] A. Kaushal, T. Vaidhya, and I. Rish, “Lord: Low rank decomposi-
outperforming larger language models with less training data tion of monolingual code llms for one-shot compression,” arXiv
and smaller model sizes,” arXiv preprint arXiv:2305.02301, 2023. preprint arXiv:2309.14021, 2023.
[119] L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi, [141] M. Xu, Y. L. Xu, and D. P. Mandic, “Tensorgpt: Efficient com-
“Symbolic chain-of-thought distillation: Small models can also” pression of the embedding layer in llms based on the tensor-train
think” step-by-step,” arXiv preprint arXiv:2306.14050, 2023. decomposition,” arXiv preprint arXiv:2307.00526, 2023.
[120] L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Sev- [142] Y. Li, Y. Yu, Q. Zhang, C. Liang, P. He, W. Chen, and T. Zhao,
eryn, “Teaching small language models to reason,” arXiv preprint “Losparse: Structured compression of large language models
arXiv:2212.08410, 2022. based on low-rank and sparse approximation,” arXiv preprint
[121] H. Chen, S. Wu, X. Quan, R. Wang, M. Yan, and J. Zhang, “Mcc- arXiv:2306.11222, 2023.
kd: Multi-cot consistent knowledge distillation,” arXiv preprint [143] R. Saha, V. Srivastava, and M. Pilanci, “Matrix compression via
arXiv:2310.14747, 2023. randomized low rank and low precision factorization,” arXiv
[122] N. Ho, L. Schmid, and S.-Y. Yun, “Large language models are preprint arXiv:2310.11028, 2023.
reasoning teachers,” arXiv preprint arXiv:2212.10071, 2022. [144] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring
[123] K. Shridhar, A. Stolfo, and M. Sachan, “Distilling reasoning post-training quantization in llms from comprehensive study to
capabilities into smaller language models,” in Findings of the low rank compensation,” arXiv preprint arXiv:2303.08302, 2023.
Association for Computational Linguistics: ACL 2023, 2023, pp. 7059– [145] R. Chand, Y. Prabhu, and P. Kumar, “Dsformer: Effective com-
7073. pression of text-transformers by dense-sparse weight factoriza-
[124] X. Zhu, B. Qi, K. Zhang, X. Long, and B. Zhou, “Pad: Program- tion,” arXiv preprint arXiv:2312.13211, 2023.
aided distillation specializes large models in reasoning,” arXiv [146] Z. Yuan, Y. Shang, Y. Song, Q. Wu, Y. Yan, and G. Sun, “Asvd:
preprint arXiv:2305.13888, 2023. Activation-aware singular value decomposition for compressing
[125] P. Wang, Z. Wang, Z. Li, Y. Gao, B. Yin, and X. Ren, “Scott: large language models,” arXiv preprint arXiv:2312.05821, 2023.
Self-consistent chain-of-thought distillation,” arXiv preprint [147] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generat-
arXiv:2305.01879, 2023. ing long sequences with sparse transformers,” arXiv preprint
[126] Z. Chen, Q. Gao, A. Bosselut, A. Sabharwal, and K. Richardson, arXiv:1904.10509, 2019.
“Disco: distilling counterfactuals with large language models,” [148] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient
in Proceedings of the 61st Annual Meeting of the Association for streaming language models with attention sinks,” arXiv preprint
Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5514– arXiv:2309.17453, 2023.
5528. [149] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
[127] M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, and A. F. Aji, document transformer,” arXiv preprint arXiv:2004.05150, 2020.
“Lamini-lm: A diverse herd of distilled models from large-scale [150] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti,
instructions,” arXiv preprint arXiv:2304.14402, 2023. S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big
[128] Y. Jiang, C. Chan, M. Chen, and W. Wang, “Lion: Adversarial bird: Transformers for longer sequences,” Advances in neural
distillation of proprietary large language models,” in Proceedings information processing systems, vol. 33, pp. 17 283–17 297, 2020.
of the 2023 Conference on Empirical Methods in Natural Language [151] S. Dai, H. Genc, R. Venkatesan, and B. Khailany, “Efficient trans-
Processing, 2023, pp. 3134–3154. former inference with statically structured sparse attention,” in
[129] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge distillation 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE,
of large language models,” arXiv preprint arXiv:2306.08543, 2023. 2023, pp. 1–6.
[130] R. Agarwal, N. Vieillard, P. Stanczyk, S. Ramos, M. Geist, and [152] Anonymous, “SemSA: Semantic sparse attention is hidden
O. Bachem, “Gkd: Generalized knowledge distillation for auto- in large language models.” 2023. [Online]. Available: https:
regressive sequence models,” arXiv preprint arXiv:2306.13649, //openreview.net/forum?id=eG9AkHtYYH
2023. [153] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse at-
[131] C. Liang, S. Zuo, Q. Zhang, P. He, W. Chen, and T. Zhao, “Less tention architecture with cascade token and head pruning,” in
is more: Task-aware layer-wise distillation for language model 2021 IEEE International Symposium on High-Performance Computer
compression,” in International Conference on Machine Learning. Architecture (HPCA). IEEE, pp. 97–110.
PMLR, 2023, pp. 20 852–20 867. [154] L. Ren, Y. Liu, S. Wang, Y. Xu, C. Zhu, and C. Zhai, “Sparse mod-
[132] I. Timiryasov and J.-L. Tastet, “Baby llama: knowledge distillation ular activation for efficient sequence modeling,” arXiv preprint
from an ensemble of teachers trained on a small dataset with no arXiv:2306.11197, 2023.
performance penalty,” arXiv preprint arXiv:2308.02019, 2023. [155] S. Anagnostidis, D. Pavllo, L. Biggio, L. Noci, A. Lucchi,
[133] C. Zhang, Y. Yang, J. Liu, J. Wang, Y. Xian, B. Wang, and D. Song, and T. Hoffmann, “Dynamic context pruning for efficient
“Lifting the curse of capacity gap in distilling language models,” and interpretable autoregressive transformers,” arXiv preprint
arXiv preprint arXiv:2305.12129, 2023. arXiv:2305.15805, 2023.
[134] S. Padmanabhan, Y. Onoe, M. J. Zhang, G. Durrett, and E. Choi, [156] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient
“Propagating knowledge updates to lms through distillation,” transformer,” arXiv preprint arXiv:2001.04451, 2020.
arXiv preprint arXiv:2306.09306, 2023. [157] M. Pagliardini, D. Paliotta, M. Jaggi, and F. Fleuret, “Faster causal
[135] Y. Yin, C. Chen, L. Shang, X. Jiang, X. Chen, and Q. Liu, “Au- attention over large sequences through sparse flash attention,”
totinybert: Automatic hyper-parameter optimization for efficient arXiv preprint arXiv:2306.01160, 2023.
pre-trained language models,” arXiv preprint arXiv:2107.13686, [158] A. Roy, M. Saffar, A. Vaswani, and D. Grangier, “Efficient content-
2021. based sparse attention with routing transformers,” Transactions of
[136] J. Xu, X. Tan, R. Luo, K. Song, J. Li, T. Qin, and T.-Y. Liu, “Nas- the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021.
bert: task-agnostic and adaptive-size bert compression with neu- [159] Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse
ral architecture search,” in Proceedings of the 27th ACM SIGKDD sinkhorn attention,” in International Conference on Machine Learn-
Conference on Knowledge Discovery & Data Mining, 2021, pp. 1933– ing. PMLR, 2020, pp. 9438–9447.
1943. [160] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song,
[137] A. Klein, J. Golebiowski, X. Ma, V. Perrone, and C. Archambeau, Y. Tian, C. Ré, C. Barrett et al., “H2o: Heavy-hitter oracle for
“Structural pruning of large language models via neural archi- efficient generative inference of large language models,” Advances
tecture search,” 2023. in Neural Information Processing Systems, vol. 36, 2024.
[161] A. Feng, I. Li, Y. Jiang, and R. Ying, “Diffuser: efficient transform- [184] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “Awq:
ers with multi-hop attention diffusion for long sequences,” in Activation-aware weight quantization for llm compression and
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, acceleration,” arXiv preprint arXiv:2306.00978, 2023.
no. 11, 2023, pp. 12 772–12 780. [185] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “Owq: Lessons learned
[162] E. Frantar and D. Alistarh, “Sparsegpt: Massive language models from activation outliers for weight quantization in large language
can be accurately pruned in one-shot,” 2023. models,” arXiv preprint arXiv:2306.02272, 2023.
[163] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective [186] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev,
pruning approach for large language models,” arXiv preprint E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh,
arXiv:2306.11695, 2023. “Spqr: A sparse-quantized representation for near-lossless llm
[164] H. Shao, B. Liu, and Y. Qian, “One-shot sensitivity-aware mixed weight compression,” arXiv preprint arXiv:2306.03078, 2023.
sparsity pruning for large language models,” arXiv preprint [187] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W.
arXiv:2310.09499, 2023. Mahoney, and K. Keutzer, “Squeezellm: Dense-and-sparse quan-
[165] A. Syed, P. H. Guo, and V. Sundarapandiyan, “Prune and tune: tization,” arXiv preprint arXiv:2306.07629, 2023.
Improving efficient pruning techniques for massive language [188] J. Chee, Y. Cai, V. Kuleshov, and C. De Sa, “Quip: 2-bit quantiza-
models,” 2023. tion of large language models with guarantees,” in Thirty-seventh
[166] X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu, Conference on Neural Information Processing Systems, 2023.
“Outlier suppression+: Accurate quantization of large language [189] Y. J. Kim, R. Henry, R. Fahim, and H. H. Awadalla, “Finequant:
models by equivalent and optimal shifting and scaling,” arXiv Unlocking efficiency with fine-grained weight-only quantization
preprint arXiv:2304.09145, 2023. for llms,” arXiv preprint arXiv:2308.09723, 2023.
[167] P. Xu, W. Shao, M. Chen, S. Tang, K. Zhang, P. Gao, F. An, [190] K. Behdin, A. Acharya, A. Gupta, S. Keerthi, and R. Mazumder,
Y. Qiao, and P. Luo, “Besa: Pruning large language models with “Quantease: Optimization-based quantization for language
blockwise parameter-efficient sparsity allocation,” in The Twelfth models–an efficient and intuitive algorithm,” arXiv preprint
International Conference on Learning Representations, 2023. arXiv:2309.01885, 2023.
[168] Y. Zhang, H. Bai, H. Lin, J. Zhao, L. Hou, and C. V. Cannistraci, [191] S. Li, X. Ning, K. Hong, T. Liu, L. Wang, X. Li, K. Zhong, G. Dai,
“An efficient plug-and-play post-training pruning strategy in H. Yang, and Y. Wang, “Llm-mq: Mixed-precision quantization
large language models,” 2023. for efficient llm deployment,” 2023.
[169] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural [192] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He,
pruning of large language models,” Advances in neural information “Zeroquant: Efficient and affordable post-training quantization
processing systems, vol. 36, 2024. for large-scale transformers,” in Advances in Neural Information
[170] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared llama: Accelerat- Processing Systems, 2022.
ing language model pre-training via structured pruning,” arXiv
[193] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang,
preprint arXiv:2310.06694, 2023.
C. Re, I. Stoica, and C. Zhang, “Flexgen: High-throughput gen-
[171] E. Kurtić, E. Frantar, and D. Alistarh, “Ziplm: Inference-aware erative inference of large language models with a single gpu,”
structured pruning of language models,” Advances in Neural 2023.
Information Processing Systems, vol. 36, 2024.
[194] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Llm. int8
[172] M. Zhang, H. Chen, C. Shen, Z. Yang, L. Ou, X. Yu, and
(): 8-bit matrix multiplication for transformers at scale,” arXiv
B. Zhuang, “Loraprune: Pruning meets low-rank parameter-
preprint arXiv:2208.07339, 2022.
efficient fine-tuning,” 2023.
[195] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han,
[173] T. Chen, T. Ding, B. Yadav, I. Zharkov, and L. Liang, “Lorashear:
“Smoothquant: Accurate and efficient post-training quantization
Efficient large language model structured pruning and knowl-
for large language models,” in International Conference on Machine
edge recovery,” arXiv preprint arXiv:2310.18356, 2023.
Learning. PMLR, 2023, pp. 38 087–38 099.
[174] S. Ashkboos, M. L. Croci, M. G. d. Nascimento, T. Hoefler,
[196] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring
and J. Hensman, “Slicegpt: Compress large language models
post-training quantization in llms from comprehensive study to
by deleting rows and columns,” arXiv preprint arXiv:2401.15024,
low rank compensation,” arXiv preprint arXiv:2303.08302, 2023.
2024.
[175] Q. Zhang, S. Zuo, C. Liang, A. Bukharin, P. He, W. Chen, [197] Z. Yuan, L. Niu, J. Liu, W. Liu, X. Wang, Y. Shang, G. Sun, Q. Wu,
and T. Zhao, “Platon: Pruning large transformer models with J. Wu, and B. Wu, “Rptq: Reorder-based post-training quantiza-
upper confidence bound of weight importance,” in International tion for large language models,” arXiv preprint arXiv:2304.01089,
Conference on Machine Learning. PMLR, 2022, pp. 26 809–26 823. 2023.
[176] C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, and [198] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu,
N. Wong, “Structured pruning for efficient generative pre-trained M. Guo, and Y. Zhu, “Olive: Accelerating large language models
language models,” in Findings of the Association for Computational via hardware-friendly outlier-victim pair quantization,” in Pro-
Linguistics: ACL 2023, 2023, pp. 10 880–10 895. ceedings of the 50th Annual International Symposium on Computer
[177] S.-y. Liu, Z. Liu, X. Huang, P. Dong, and K.-T. Cheng, “Llm-fp4: 4- Architecture, 2023, pp. 1–15.
bit floating-point quantized transformers,” in The 2023 Conference [199] X. Wu, Z. Yao, and Y. He, “Zeroquant-fp: A leap forward in llms
on Empirical Methods in Natural Language Processing, 2023. post-training w4a8 quantization using floating-point formats,”
[178] L. Li, Q. Li, B. Zhang, and X. Chu, “Norm tweaking: High- arXiv preprint arXiv:2307.09782, 2023.
performance low-bit quantization of large language models,” [200] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang,
arXiv preprint arXiv:2309.02784, 2023. P. Gao, Y. Qiao, and P. Luo, “Omniquant: Omnidirectionally
[179] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, calibrated quantization for large language models,” in The Twelfth
“Qlora: Efficient finetuning of quantized llms,” Advances in Neu- International Conference on Learning Representations, 2023.
ral Information Processing Systems, vol. 36, 2024. [201] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, and B. Zhuang, “Qllm:
[180] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, Accurate and efficient low-bitwidth quantization for large lan-
X. Zhang, and Q. Tian, “Qa-lora: Quantization-aware low- guage models,” in The Twelfth International Conference on Learning
rank adaptation of large language models,” arXiv preprint Representations, 2023.
arXiv:2309.14717, 2023. [202] Y. Zhao, C.-Y. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze,
[181] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit
T. Zhao, “Loftq: Lora-fine-tuning-aware quantization for large quantization for efficient and accurate llm serving,” arXiv preprint
language models,” arXiv preprint arXiv:2310.08659, 2023. arXiv:2310.19102, 2023.
[182] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: [203] W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and
Accurate post-training quantization for generative pre-trained X. Qi, “Billm: Pushing the limit of post-training quantization for
transformers,” arXiv preprint arXiv:2210.17323, 2022. llms,” 2024.
[183] G. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S. J. Kwon, B. Kim, [204] S. Li, X. Ning, L. Wang, T. Liu, X. Shi, S. Yan, G. Dai, H. Yang, and
Y. Lee, D. Lee et al., “Lut-gemm: Quantized matrix multiplication Y. Wang, “Evaluating quantized large language models,” arXiv
based on luts for efficient inference in large-scale generative lan- preprint arXiv:2402.18158, 2024.
guage models,” in The Twelfth International Conference on Learning [205] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S.
Representations, 2023. Shao, K. Keutzer, and A. Gholami, “Kvquant: Towards 10 million
[206] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for kv cache,” arXiv preprint arXiv:2402.02750, 2024.
[207] E. Frantar and D. Alistarh, “Optimal brain compression: A framework for accurate post-training quantization and pruning,” in Advances in Neural Information Processing Systems, 2022.
[208] N. Vaidya, F. Oh, and N. Comly, “Optimizing inference on large language models with nvidia tensorrt-llm, now publicly available,” [Online], 2023, https://github.com/NVIDIA/TensorRT-LLM.
[209] InternLM, “Lmdeploy,” 2024. [Online]. Available: https://github.com/InternLM/lmdeploy
[210] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[211] B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and general network pruning,” in IEEE International Conference on Neural Networks. IEEE, 1993, pp. 293–299.
[212] Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” Advances in Neural Information Processing Systems, vol. 2, 1989.
[213] B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations, 2016.
[214] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv preprint arXiv:2203.02155, 2022.
[215] X. He, I. Keivanloo, Y. Xu, X. He, B. Zeng, S. Rajagopalan, and T. Chilimbi, “Magic pyramid: Accelerating inference with early exiting and token pruning,” 2023.
[216] TogetherAI, “Paving the way to efficient architectures: Stripedhyena-7b, open source models offering a glimpse into a world beyond transformers,” December 2023. [Online]. Available: https://www.together.ai/blog/stripedhyena-7b
[217] A. Jaiswal, Z. Gan, X. Du, B. Zhang, Z. Wang, and Y. Yang, “Compressing llms: The truth is rarely pure and never simple,” arXiv preprint arXiv:2310.01382, 2023.
[218] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in International Conference on Machine Learning. PMLR, 2023, pp. 19274–19286.
[219] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” arXiv preprint arXiv:2302.01318, 2023.
[220] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J.-F. Kagy, and R. Agarwal, “Distillspec: Improving speculative decoding via knowledge distillation,” arXiv preprint arXiv:2310.08461, 2023.
[221] J. Zhang, J. Wang, H. Li, L. Shou, K. Chen, G. Chen, and S. Mehrotra, “Draft & verify: Lossless large language model acceleration via self-speculative decoding,” arXiv preprint arXiv:2309.08168, 2023.
[222] X. Liu, L. Hu, P. Bailis, I. Stoica, Z. Deng, A. Cheung, and H. Zhang, “Online speculative decoding,” arXiv preprint arXiv:2310.07177, 2023.
[223] G. Monea, A. Joulin, and E. Grave, “Pass: Parallel speculative sampling,” arXiv preprint arXiv:2311.13581, 2023.
[224] Z. He, Z. Zhong, T. Cai, J. D. Lee, and D. He, “Rest: Retrieval-based speculative decoding,” arXiv preprint arXiv:2311.08252, 2023.
[225] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia, “Specinfer: Accelerating generative llm serving with speculative inference and token tree verification,” arXiv preprint arXiv:2305.09781, 2023.
[226] B. Spector and C. Re, “Accelerating llm inference with staged speculative decoding,” arXiv preprint arXiv:2308.04623, 2023.
[227] Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang, “Cascade speculative drafting for even faster llm inference,” arXiv preprint arXiv:2312.11462, 2023.
[228] Y. Fu, P. Bailis, I. Stoica, and H. Zhang, “Breaking the sequential dependency of llm inference using lookahead decoding,” November 2023. [Online]. Available: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
[229] Y. Li, C. Zhang, and H. Zhang, “Eagle: Lossless acceleration of llm decoding by feature extrapolation,” December 2023. [Online]. Available: https://sites.google.com/view/eagle-llm
[230] Z. Sun, A. T. Suresh, J. H. Ro, A. Beirami, H. Jain, and F. Yu, “Spectr: Fast speculative decoding via optimal transport,” arXiv preprint arXiv:2310.15141, 2023.
[231] K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, Y. Dong, and Y. Wang, “Flashdecoding++: Faster large language model inference on gpus,” 2024.
[232] T. Gale, D. Narayanan, C. Young, and M. Zaharia, “Megablocks: Efficient sparse training with mixture-of-experts,” in Proceedings of Machine Learning and Systems (MLSys), 2023.
[233] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
[234] T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,” arXiv preprint arXiv:2307.08691, 2023.
[235] Y. Zhai, C. Jiang, L. Wang, X. Jia, S. Zhang, Z. Chen, X. Liu, and Y. Zhu, “Bytetransformer: A high-performance transformer boosted for variable-length inputs,” in 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2023, pp. 344–355.
[236] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley et al., “Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15.
[237] T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash-decoding for long-context inference,” [Online], 2023, https://crfm.stanford.edu/2023/10/12/flashdecoding.html.
[238] HuggingFace, “Transformers: State-of-the-art machine learning for pytorch, tensorflow, and jax,” [Online], 2024, https://github.com/huggingface/transformers.
[239] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[240] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “Glm: General language model pretraining with autoregressive blank infilling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 320–335.
[241] Sensetime, “Openppl: A high-performance deep learning inference platform,” [Online], 2023, https://openppl.ai/home.
[242] NVIDIA, “cublas: Basic linear algebra on nvidia gpus,” [Online], 2017, https://developer.nvidia.com/cublas.
[243] ——, “Cutlass: Cuda templates for linear algebra subroutines,” [Online], 2017, https://github.com/NVIDIA/cutlass.
[244] S. Wang, “Fastgemv: High-speed gemv kernels,” [Online], 2023, https://github.com/wangsiping97/FastGEMV.
[245] P. Tillet, H. T. Kung, and D. Cox, “Triton: an intermediate language and compiler for tiled neural network computations,” in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 10–19.
[246] M. Stern, N. Shazeer, and J. Uszkoreit, “Blockwise parallel decoding for deep autoregressive models,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[247] P. Patel, E. Choukse, C. Zhang, Í. Goiri, A. Shah, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative llm inference using phase splitting,” arXiv preprint arXiv:2311.18677, 2023.
[248] C. Hu, H. Huang, L. Xu, X. Chen, J. Xu, S. Chen, H. Feng, C. Wang, S. Wang, Y. Bao, N. Sun, and Y. Shan, “Inference without interference: Disaggregate llm inference for mixed downstream workloads,” arXiv preprint arXiv:2401.11181, 2024.
[249] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving,” arXiv preprint arXiv:2401.09670, 2024.
[250] X. Miao, C. Shi, J. Duan, X. Xi, D. Lin, B. Cui, and Z. Jia, “Spotserve: Serving generative large language models on preemptible instances,” arXiv preprint arXiv:2311.15566, 2023.
[251] B. Lin, T. Peng, C. Zhang, M. Sun, L. Li, H. Zhao, W. Xiao, Q. Xu, X. Qiu, S. Li, Z. Ji, Y. Li, and W. Lin, “Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache,” arXiv preprint arXiv:2401.02669, 2024.
[252] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for transformer-based generative models,” in Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, 2022, pp. 521–538.
[253] ModelTC, “Lightllm,” February 2024. [Online]. Available: https://github.com/ModelTC/lightllm/
[254] C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan, J. Rasley, S. Rajbhandari, R. Y. Aminabadi, H. Qin, A. Bakhtiari, L. Kurilenko, and Y. He, “Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference,” arXiv preprint arXiv:2401.08671, 2024.
[255] B. Wu, Y. Zhong, Z. Zhang, G. Huang, X. Liu, and X. Jin, “Fast distributed inference serving for large language models,” arXiv preprint arXiv:2305.05920, 2023.
[256] Y. Sheng, S. Cao, D. Li, B. Zhu, Z. Li, and D. Zhuo, “Fairness in serving large language models,” arXiv preprint arXiv:2401.00588, 2024.
[257] A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, and R. Ramjee, “Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,” arXiv preprint arXiv:2308.16369, 2023.
[258] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee, “Taming throughput-latency tradeoff in llm inference with sarathi-serve,” arXiv preprint arXiv:2403.02310, 2024.
[259] Y. Jin, C.-F. Wu, D. Brooks, and G.-Y. Wei, “S3: Increasing gpu utilization during generative inference for higher throughput,” arXiv preprint arXiv:2306.06000, 2023.
[260] Z. Ye, “flashinfer,” March 2024. [Online]. Available: https://github.com/flashinfer-ai/flashinfer
[261] NVIDIA, “Fastertransformer: About transformer related optimization, including bert, gpt,” [Online], 2017, https://github.com/NVIDIA/FasterTransformer.
[262] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding, “Ftrans: Energy-efficient acceleration of transformers using fpga,” arXiv preprint arXiv:2007.08563, 2020.
[263] T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jun, and J. W. Lee, “Elsa: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks,” in ACM/IEEE 48th Annual International Symposium on Computer Architecture, 2021, pp. 692–705.
[264] H. Fan, T. Chau, S. I. Venieris, R. Lee, A. Kouris, W. Luk, N. D. Lane, and M. S. Abdelfattah, “Adaptable butterfly accelerator for attention-based nns via hardware and algorithm co-design,” in IEEE/ACM International Symposium on Microarchitecture, 2022, pp. 599–615.
[265] Y. Qin, Y. Wang, D. Deng, Z. Zhao, X. Yang, L. Liu, S. Wei, Y. Hu, and S. Yin, “Fact: Ffn-attention co-optimized transformer architecture with eager correlation prediction,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–14.
[266] H. Chen, J. Zhang, Y. Du, S. Xiang, Z. Yue, N. Zhang, Y. Cai, and Z. Zhang, “Understanding the potential of fpga-based spatial acceleration for large language model inference,” arXiv preprint arXiv:2312.15159, 2023.
[267] S. Hong, S. Moon, J. Kim, S. Lee, M. Kim, D. Lee, and J.-Y. Kim, “Dfx: A low-latency multi-fpga appliance for accelerating transformer-based text generation,” in IEEE Hot Chips 34 Symposium, 2022.
[268] S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun, S. Li, Z. Huang et al., “Flightllm: Efficient large language model inference with a complete mapping flow on fpga,” arXiv preprint arXiv:2401.03868, 2024.
[269] ShareGPT team, “Sharegpt,” 2023. [Online]. Available: https://sharegpt.com/
[270] J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,” arXiv preprint arXiv:2402.15116, 2024.
[271] I. Lee, N. Jiang, and T. Berg-Kirkpatrick, “Exploring the relationship between model architecture and in-context learning ability,” arXiv preprint arXiv:2310.08049, 2023.
[272] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., “Pythia: A suite for analyzing large language models across training and scaling,” in International Conference on Machine Learning. PMLR, 2023, pp. 2397–2430.
[273] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[274] Y. Tang, F. Liu, Y. Ni, Y. Tian, Z. Bai, Y.-Q. Hu, S. Liu, S. Jui, K. Han, and Y. Wang, “Rethinking optimization and architecture for tiny language models,” arXiv preprint arXiv:2402.02791, 2024.
[275] Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical report,” arXiv preprint arXiv:2309.05463, 2023.
[276] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al., “Textbooks are all you need,” arXiv preprint arXiv:2306.11644, 2023.
[277] P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source small language model,” arXiv preprint arXiv:2401.02385, 2024.
[278] C. Zhang, D. Song, Z. Ye, and Y. Gao, “Towards the law of capacity gap in distilling language models,” arXiv preprint arXiv:2311.07052, 2023.
[279] X. Geng and H. Liu, “Openllama: An open reproduction of llama,” May 2023. [Online]. Available: https://github.com/openlm-research/open_llama
[280] M. Bellagente, J. Tow, D. Mahan, D. Phung, M. Zhuravinskyi, R. Adithyan, J. Baicoianu, B. Brooks, N. Cooper, A. Datta et al., “Stable lm 2 1.6b technical report,” arXiv preprint arXiv:2402.17834, 2024.
[281] “Minicpm: Unveiling the potential of end-side large language models,” 2024.
[282] Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong, E. Chang, Y. Shi, R. Krishnamoorthi et al., “Mobilellm: Optimizing sub-billion parameter language models for on-device use cases,” arXiv preprint arXiv:2402.14905, 2024.
[283] MLC team, “MLC-LLM,” 2023. [Online]. Available: https://github.com/mlc-ai/mlc-llm
[284] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,” High-Confidence Computing, p. 100211, 2024.
[285] Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun et al., “Personal llm agents: Insights and survey about the capability, efficiency and security,” arXiv preprint arXiv:2401.05459, 2024.