0% found this document useful (0 votes)

47 views34 pages

A Survey On Efficient Inference For Large Language Models

Uploaded by

498585298

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views34 pages

A Survey On Efficient Inference For Large Language Models

Uploaded by

498585298

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 34

1

A Survey on Efficient Inference for Large

Language Models
Zixuan Zhou*, Xuefei Ning*, Ke Hong*, Tianyu Fu, Jiaming Xu, Shiyao Li,
Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai,
Xiao-Ping Zhang Fellow, IEEE, Yuhan Dong, Yu Wang Fellow, IEEE

✦
arXiv:2404.14294v1 [cs.CL] 22 Apr 2024

Abstract—Large Language Models (LLMs) have attracted extensive (NLU), neural language generation (NLG), reasoning [13],
attention due to their remarkable performance across various tasks. [14], and code generation [15], consequently enabling im-
However, the substantial computational and memory requirements of pactful applications like ChatGPT, Copilot, and Bing. There
LLM inference pose challenges for deployment in resource-constrained is a growing belief [16] that the rise and achievements of
scenarios. Efforts within the field have been directed towards developing
LLMs signify a significant stride towards Artificial General
techniques aimed at enhancing the efficiency of LLM inference. This
paper presents a comprehensive survey of the existing literature on
Intelligence (AGI) for humanity.
efficient LLM inference. We start by analyzing the primary causes of
the inefficient LLM inference, i.e., the large model size, the quadratic- Higher Latency
complexity attention operation, and the auto-regressive decoding ap- Higher Computational
proach. Then, we introduce a comprehensive taxonomy that organizes Cost
the current literature into data-level, model-level, and system-level op- Lower Throughput
timization. Moreover, the paper includes comparative experiments on
Higher Memory Access
representative methods within critical sub-fields to provide quantitative Higher Power
Cost
insights. Last but not least, we provide some knowledge summary and Consumption
discuss future research directions.
Higher Memory Cost
Higher Storage
1 I NTRODUCTION
Large Language Models (LLMs) have garnered substantial Fig. 1. The challenges of LLM deployment.
attention from both academia and industry in recent years.
The field of LLMs has experienced notable growth and sig- However, the deployment of LLMs is not always going
nificant achievements. Numerous open-source LLMs have smoothly. As shown in Fig. 1, LLMs typically demand
emerged, including the GPT-series (GPT-1 [1], GPT-2 [2], higher computational cost, memory access cost and memory
and GPT-3 [3]), OPT [4], LLaMA-series (LLaMA [5], LLaMA usage in their inference process (we will analyse the root
2 [5], Baichuan 2 [6], Vicuna [7], LongChat [8]), BLOOM [9], causes in the Sec. 2.3), which deteriorates the efficiency
FALCON [10], GLM [11], and Mistral [12], which are used indicators (e.g., latency, throughput, power consumption
for both academic research and commercial purposes. The and storage) in the resource-constrained scenarios. This
success of LLMs stems from their robust capability in han- poses challenges for the application of LLMs in both edge
dling diverse tasks such as neural language understanding and cloud scenarios. For example, the immense storage re-
quirements render the deployment of a 70-billion-parameter
• Z. Zhou, K. Hong, T. Fu, S. Li, L. Wang are with Infinigence-AI and the model impractical on personal laptops for tasks such as
Department of Electronic Engineering, Tsinghua University, China. development assistance. Additionally, the low throughput
E-mail: zhouzx21@mails.tsinghua.edu.cn (Z. Zhou) would result in significant costs if LLMs are used for every
• X. Ning, Y. Lou, Y. Wang are with the Department of Electronic Engi- search engine request, leading to a considerable reduction
neering, Tsinghua University, China.
E-mail: foxdoraame@gmail.com (X. Ning), yu-wang@tsinghua.edu.cn (Y. in the profits of the search engine.
Wang) Fortunately, a substantial array of techniques has been
• J. Xu, G. Dai are with Infinigence-AI and the Department of Electronic proposed to enable efficient inference for LLMs. To gain
Engineering, Shanghai Jiaotong University, China.
E-mail: daiguohao@sjtu.edu.cn (G. Dai)
a comprehensive understanding of existing studies and
• X.-P. Zhang, Y. Dong are with Tsinghua Shenzhen International Graduate inspire further research, this survey employs a hierarchical
School. classification and systematic summarization of the current
E-mail: xpzhang@ieee.org (X.-P. Zhang), dongyuhan@sz.tsinghua.edu.cn landscape of efficient LLM inference. Specifically, we cat-
(Y. Dong)
• Z. Yuan, S. Yan are with Infinigence-AI. egorize relevant studies into three levels: data-level opti-
• X. Li is with Peking University. mization, model-level optimization, and system-level op-
• Corresponding authors: Yu Wang, Xuefei Ning, Guohao Dai. timization (refer to Sec. 3 for elaboration). Moreover, we
• *Equal contribution.
conduct experimental analyses on representative methods
2

within critical sub-fields to consolidate knowledge, offer blocks. Typically, a Transformer block consists of a Multi-
practical recommendations, and provide guidance for future Head Self-Attention (MHSA) block, a Feed Forward Net-
research endeavors. work (FFN), and a LayerNorm (LN) operation. For each
block, it receives the output features of the previous one
TABLE 1 as the input, and passes the features through each sub-
Comparison of existing surveys. module to obtain the output. Specially, before the first block,
a tokenizer is used to convert the original input sentence
Optimization Levels into a sequence of tokens, and a following embedding layer
Experimental
Survey
Data-level Model-level System-level Analysis serves to convert the tokens into the input features. Then,
the additional position embeddings are added into the input
[17], [18], [19] ✓
[20] ✓ ✓ features to encode the sequential order of each input token.
[21] ✓ ✓ The core concept of the Transformer architecture is the
[22], [23] ✓ ✓ ✓
Ours ✓ ✓ ✓ ✓ self-attention mechanism, which is adopted in the MHSA
block. Specifically, denoted the input features as X =
[x1 , x2 , ..., xn ], the MHSA block applies linear projection to
Currently, several surveys [17], [18], [19], [20], [21], [22] them and obtains a set of queries Q, keys K and values V as
have been conducted in the field of efficient LLMs. These Eq. 1:
surveys primarily focus on different aspects of LLM effi-
ciency but offer opportunities for further improvement. Zhu
Qi = XW Qi , Ki = XW Ki , Vi = XW Vi , (1)
et al. [17], Park et al. [18] and Wang et al. [19] concen- where W Qi , W Ki and W Vi are the projection matrices
trate on model compression techniques within model-level corresponding to the i-th attention head. Then the self-
optimization. Ding et al. [20] center on efficiency research attention operation is applied to each tuple of (Qi , Ki , Vi )
considering both data and model architecture perspectives. and get the feature of the i-th attention head Zi as Eq. 2:
Miao et al. [21] approach efficient LLM inference from a ma-
chine learning system (MLSys) research perspective. In con- Qi K T
Zi = Attention(Qi , Ki , Vi ) = Softmax( √ i )Vi , (2)
trast, our survey provides a more comprehensive research dk
scope, addressing optimization at three levels: data-level, where dk is the dimension of the queries (keys). Note that
model-level, and system-level, with the inclusion of recent the self-attention operation contains the matrix multipli-
advancements. While Wan et al. [22] and Xu et al. [23] also cation operation, its computation complexity is quadratic
deliver comprehensive review of efficient LLM research, our in the input length. Finally, the MHSA block concatenates
work extends by incorporating comparative experiments the features of all the attention heads and applies a linear
and offering practical insights and recommendations based projection to them to form its output Z as Eq. 3:
on experimental analyses in several critical sub-fields like
model quantization and serving systems. A comparison of Z = Concat(Z1 , Z2 , ..., Zh )W O , (3)
these surveys is summarized in Table 1. where WO is the projection matrix. As can be seen, the
The remainder of this survey is organized as follows: self-attention mechanism allows the model to identify the
Sec. 2 introduces the basic concept and knowledge about importance of different input parts regardless of the dis-
LLMs and presents a detailed analysis of the efficiency tance, and thus can capture the long-range dependencies
bottlenecks during the inference process of LLMs. Sec. 3 and complex relationships in the input sentence.
demonstrates our taxonomy. Sec. 4 to Sec. 6 respectively Another important module in the Transformer block is
present and discuss studies on efficiency optimization at the FFN. Typically, FFN is placed after the MHSA block
three distinct levels. Sec. 7 offers broader discussions for and consists of two linear transformation layers with a non-
several key application scenarios. Sec. 8 concludes the key linear activation function. It receives the output features X
contributions provided by this survey. from the MHSA block and processes them as Eq 4:
FFN(X) = W2 σ(W1 X), (4)
2 P RELIMINARIES where W1 and W2 denote the weight matrices of the two
2.1 Transformer-Style LLMs linear layers, and σ(·) denotes the activation function.
Language modeling, as the fundamental function of lan-
guage models (LMs), involves modeling the likelihood of 2.2 Inference Process of LLMs
the word sequence and predicting the distribution of subse- The most popular LLMs, i.e., decoder-only LLMs, often
quent words. Over recent years, researchers have discovered adopt the auto-regressive method to generate the output
that scaling up language models not only enhances their sentence. Specifically, the auto-regressive method generates
language modeling ability but also engenders emergent the tokens one by one. In each generation step, the LLM
capabilities for tackling more intricate tasks beyond conven- takes as input the whole token sequences, including the in-
tional NLP tasks [24]. These scaled-up language models are put tokens and previously generated tokens, and generates
referred to as large language models (LLMs). the next token. With the increase in sequence length, the
The mainstream LLMs are designed based on the Trans- time cost of the generation process grows rapidly. To ad-
former architecture [25]. Specifically, a typical Transformer dress this challenge, a crucial technique, namely key-value
architecture is composed of several stacked Transformer (KV) cache, has been introduced to expedite the generation
3

Output: ['Processing'] (1dim) Output: ['! '] (1dim) Memory

Add & LayerNorm Add & LayerNorm

FFN FFN
FC2 FC2

Activation Activation KV Cache

Size
FC1 FC1

Add & LayerNorm Add & LayerNorm Peak

Memory
𝑸𝑲𝑻 𝑸𝑲𝑻
MHSA WO Softmax 𝑽 MHSA WO Softmax 𝑽 First Token
𝒅𝒌 𝒅𝒌
where 𝑄, 𝐾, 𝑉 ∈ 𝑅𝑁×𝑑 Self-Attention
where 𝑄 ∈ 𝑅1×𝑑 Latency
Self-Attention 𝐾, 𝑉 ∈ 𝑅𝑁×𝑑
Q K V Q K V Model
Per-output Token
K Cache V Cache Size
Latency
WQ WK WV WQ WK WV
Latency
Input: ['I', 'like', 'natural', 'language'] Input: ['I', 'like', 'natural', 'language', 'Processing'] Generation Latency
(a) The prefilling stage (b) The decoding stage
Fig. 3. Illustration of the memory variation through time (latency) during
one generation process. Note that we ignore the activation size in this
Fig. 2. Demonstration of the prefilling stage (a) and decoding stage (b). figure for a simplification.

process. The KV cache technique, as its name suggests, inference. As for latency, generating one token on 2 NVIDIA
involves storing and reusing previous key (K) and value (V) A100 GPUs requires approximately 100 milliseconds. Con-
pairs within the Multi-Head Self-Attention (MHSA) block. sequently, generating a sequence with hundreds of tokens
This technique has been widely adopted in LLM inference requires more than 10 seconds. In addition to storage and
engines and systems due to its substantial optimization latency, the efficiency indicators, such as throughput, energy
of generation latency. Based on the above methods and and power consumption, also need to be considered. During
techniques, the inference process of LLMs can be divided the LLM inference process, three important factors would
into two stages: largely affect these indicators, i.e., the computational cost,
• Prefilling Stage: The LLM calculates and stores the KV the memory access cost and the memory usage. Yuan et
cache of the initial input tokens, and generates the first al. [26] provide a more systematic analysis to demonstrate
output token, as shown in Fig. 2(a). how these factors affect the inference inefficiency with a
• Decoding Stage: The LLM generates the output tokens roofline model. In the following, we further analyze three
one by one with the KV cache, and then updates it with root causes of inefficiency in the LLM inference process,
the key (K) and value (V) pairs of the newly generated focusing on the above three key factors:
token, as shown in Fig. 2(b). • Model Size: Mainstream LLMs typically incorporate
As shown in Fig. 3, we illustrate some critical efficiency billions or even trillions of parameters. For instance,
indicators. As for the latency, we denote first token latency the LLaMA-70B model comprises 70 billion parame-
as the latency to generate the first output token in the ters, while the GPT-3 model scales up to 175 billion
prefilling stage, while we denote per-output token latency parameters. This considerable model size contributes
as the average latency to generate one output token in significantly to the elevated computational cost, mem-
the decoding stage. Besides, we use generation latency ory access cost, and memory usage during the LLM
to denote the latency to generate the whole output token inference process.
sequences. As for the memory, we use model size to denote • Attention Operation: As illustrated in Sec. 2.1 and
the memory to store the model weights, and use KV cache Sec. 2.2, in the prefilling stage, the self-attention oper-
size to denote the memory to store the KV cache. Addition- ation exhibits quadratic computational complexity in
ally, peak memory denotes the maximum memory usage the input length. Consequently, as the input length
during the generation process, which is approximately equal increases, the computational cost, memory access cost,
to the memory sum of model weights and KV cache. Apart and memory usage of the attention operation escalate
from the latency and memory, throughput is also a widely- rapidly.
used indicator in the LLM serving system. We use token • Decoding Approach: The auto-regressive decoding ap-
throughput to denote the number of generated tokens per proach generates the tokens one by one. In each decod-
second, and use request throughput to denote the number ing step, all the model weights are loaded from the off-
of completed requests per second. chip HBM to the GPU chip, leading to a large memory
access cost. In addition, the size of KV cache increases
with the growth in the input length, potentially leading
2.3 Efficiency Analysis to fragmented memory and irregular memory access
Deploying LLMs on resource-constrained scenarios while patterns.
preserving their powerful capabilities poses a significant
challenge for both practitioners and researchers. For in-
stance, let’s consider to deploy a LLaMA-2-70B model, 3 TAXONOMY
which contains 70 billion parameters. Storing its weights In the aforementioned discussion, we identify key factors
in FP16 format necessitates 140 GB of VRAM, requiring (i.e., computational cost, memory access cost and mem-
at least 6 RTX 3090Ti GPUs (each with 24 GB VRAM) ory usage) that significantly impact the efficiency during
or 2 NVIDIA A100 GPUs (each with 80 GB VRAM) for the LLM inference process, and further analyze three root
4

Prompt Pruning

Prompt Summary
Input Compression
(Sec. 4.1) Soft Prompt-based
Data-level Compression
Optimization
(Sec. 4) Output Organization
(Sec. 4.2) Retrieval-Augmented
Generation

Efficient FFN Design

Low-Complexity Attention
Efficient Structure Design Efficient Attention Design
(Sec. 5.1) Multi/Group-
Transformer Alternate Query Attention
Efficient Inference for Large Language Models

Post-Training Quantization
Quantization
Model-level Quantization-
Optimization aware Training
(Sec. 5)
Weight Pruning
Sparsification
Sparse Attention
Model Compression
(Sec. 5.2) Structure Factorization
Structure Optimization
Neural Architecture Search

White-box KD
Knowledge Distillation
Black-box KD
Dynamic Inference

Graph and Operator

Optimization
Inference Engine
(Sec. 6.1)
Speculative Decoding

System-level Memory Management

Optimization
(Sec. 6)
Batching
Serving System
(Sec. 6.2) Scheduling

Distributed Systems

Fig. 4. Taxonomy of efficient inference methods for Large Language Models.

causes (i.e., model size, attention operation and decoding • Model-level Optimization refers to designing an ef-
approach). Many efforts have been made to optimize the ficient model structure (i.e., efficient structure design)
inference efficiency from different perspectives. By carefully or compressing the pre-trained models (i.e., model
reviewing and summarizing these studies, we classify them compression) in the inference process to improve its
into three levels, i.e., data-level optimization, model-level efficiency. This line of optimization (1) often requires
optimization and system-level optimization (as shown in costly pre-training or a smaller amount of fine-tuning
Fig. 4): cost to retain or recover the model ability, and (2) is
• Data-level Optimization refers to improving the ef- typically lossy in the model performance.
ficiency via optimizing the input prompts (i.e., input • System-level Optimization refers to optimizing the
compression) or better organizing the output content inference engine or the serving system. This line of opti-
(i.e., output organization). This line of optimization typ- mization (1) does not involve the costly model training,
ically does not change the original model, thus is free and (2) is typically lossless in model performance. In
of costly model training cost (note that a small amount addition, we provide a brief introduction for hardware
of training for auxiliary models might be required, but accelerator design in Sec. 6.3.
this cost can be ignored compared with the training cost
for original LLMs).
5

4 DATA - LEVEL O PTIMIZATION by token-level pruning based on perplexity. To enhance

In the data level, prior studies can be divided into two performance, LLMLingua proposes a budget controller that
categories, i.e., input compression and output organization. dynamically allocates the pruning budget across different
Input compression techniques directly shorten the model in- parts of prompts. Additionally, it utilizes an iterative token-
put to reduce the inference cost. While output organization level compression algorithm to address inaccuracies in-
techniques enable batch (parallel) inference via organizing troduced by conditional independence assumptions. Fur-
the structure of output content, which can improve the thermore, LLMLingua incorporates a distribution align-
hardware utilization and reduce the generation latency. ment strategy to align the output distribution of the target
LLM with a smaller LLM used for perplexity calculation.
LongLLMLingua [41] builds upon LLMLingua with several
4.1 Input Compression enhancements: (1) It utilizes perplexity conditioned on the
In the practical application of LLMs, prompts are crucial. input question as the indicator for prompt pruning. (2) It
Numerous studies suggest new ways to design prompts allocates varying pruning ratios to different demonstrations
effectively and show in practice that well-designed prompts and reorders the demonstrations within the final prompt
can unleash the capabilities of LLMs. For instance, In- based on their indicator values. (3) It restores the original
Context Learning (ICL) [43] suggests to include multiple content based on the response. CoT-Influx [42] introduces
relevant examples within the prompt. This approach en- a coarse-to-grained pruning method for Chain-of-Thought
courages LLMs to learn through analogy. Chain-of-Thought (CoT) prompts using reinforcement learning. Specifically, it
(CoT) [14] proposes to incorporate a sequence of intermedi- prunes unimportant examples, followed by pruning unim-
ate reasoning steps within the in-context examples, which portant tokens within the remaining examples.
help LLMs to conduct complex reasoning. However, these
prompting techniques inevitably lead to longer prompts, 4.1.2 Prompt Summary
which poses a challenge because the computational cost and
memory usage increase quadratically during the prefilling The core idea of prompt summary is to condense the
stage (as illustrated in Sec. 2.3). original prompt into a shorter summary while preserving
To address this challenge, input prompt compres- similar semantic information. These techniques also serve as
sion [31] has been proposed to shorten prompts without online compression methods for prompts. In contrast to the
significantly impacting the quality of answers from LLMs. aforementioned prompt pruning techniques that preserve
Within this field, relevant studies are categorized into four the unpruned tokens unchanged, this line of methods con-
groups, as depicted in Figure 5: prompt pruning, prompt verts the entire prompt into its summation. RECOMP [34]
summary, soft prompt-based compression, and retrieval- introduces an Abstractive Compressor that takes an input
augmented generation. question and retrieved documents as input, and produces
a concise summary. Specifically, it distills a lightweight
4.1.1 Prompt Pruning compressor from the extreme-scale LLMs to perform the
summary. SemanticCompression [35] proposes a semantic
The core idea behind the prompt pruning is to remove
compression method. It starts by breaking down the text
unimportant tokens, sentences, or documents online from
into sentences. Next, it groups sentences together by topic
each input prompt based on predefined or learnable impor-
and then summarizes the sentences within each group.
tance indicators. DYNAICL [36] proposes to dynamically
decide the optimal number of in-context examples for a
given input based on the computational budget via a well- 4.1.3 Soft Prompt-based Compression
trained LLM-based meta controller. Selective Context [37] The core idea of this kind of compression techniques is to
proposes to merge tokens into units, and then applies a design a soft prompt, significantly shorter than the orig-
unit-level prompt pruning based on the self-information inal prompt, for use as input to LLMs. The soft prompt
indicator (i.e., negative log likelihood). STDC [38] prunes is defined as a sequence of learnable continuous tokens.
the prompts based on the parse tree, which iteratively Some techniques adopt offline compression for the fixed
removes phrase nodes that cause the smallest performance prefix prompt (e.g., system prompt, task-specific prompt).
drop after pruning it. PCRL [39] introduces a token-level For example, PromptCompression [31] trains a soft prompt
pruning scheme based on reinforcement learning. The main to emulate a predetermined system prompt. The approach
idea behind PCRL is to train a policy LLM by combining involves adding several soft tokens before the input tokens
faithfulness and compression ratio into the reward func- and enabling these soft tokens to be adjusted during back-
tion. Faithfulness is measured as the output similarity be- propagation. Following fine-tuning on the prompt dataset,
tween the compressed prompt and the original prompt. the sequence of soft tokens serves as the soft prompt.
RECOMP [34] implements a sentence-level pruning strategy Gisting [32] introduces a method to condense task-specific
to compress prompts for Retrieval-Augmented Language prompts into a concise set of gist tokens using prefix-
Models (RALMs). The approach involves encoding the in- tuning [44]. Given that task-specific prompts differ across
put question and documents into latent embeddings using tasks, prefix-tuning is applied individually for each task.
a pre-trained encoder. Then, it decides which documents To enhance efficiency, Gisting further introduces a meta-
to remove based on the similarity of their embeddings learning approach that predicts gist tokens for new unseen
with the question’s embedding. LLMLingua [40] introduces tasks based on the gist tokens of previous tasks.
a coarse-to-fine pruning scheme for prompt compression. Other techniques adopt online compression for every
Initially, it performs a demonstration-level pruning followed new input prompts. For instance, AutoCompressors [28]
6

Prompt Pruning DYNAICL [36], Selective Context [37],

STDC [38], PCRL [39], RECOMP [34], LLM-
Lingua [40], LongLLMLingua [41], CoT-
Influx [42]

Input Prompt Summary RECOMP [34], SemanticCompression [35]

Compression
Soft Prompt-based PromptCompression [31], Gisting [32], Auto-
Compression Compressors [28], ICAE [33]

Retrieval-Augmented RAG [27], FLARE [28], REPLUG [29], Self-

Generation RAG [30]

Fig. 5. Taxonomy of the input compression methods for Large Language Models.

train a pre-trained LM to compress the prompts into sum- expansions to form the final answer. When applied to open-
mary vectors via unsupervised learning. ICAE [33] trains source models, point-expanding can be performed through
an autoencoder to compress the original context into short batch inference, which optimizes hardware utilization and
memory slots. Specifically, ICAE employs a LoRA-adapted reduces overall generation latency using the same compu-
LLM as the encoder, and uses the target LLM as the decoder. tational resources. To mitigate the additional computation
A set of memory tokens is added before the input tokens
and encoded into memory slots. 1. Noodles: Various noodle
dishes, such as …
4.1.4 Retrieval-Augmented Generation What are the 1. Noodles 2. Hot pot: A communal pot
2. Hot pot of simmering broth at the
typical types of
Retrieval-Augmented Generation (RAG) [27] aims to im- Chinese dishes? 3. Rice center of the table …
prove the quality of LLMs’ responses by incorporating exter- … 3. Rice: Fried Rice,
Yangzhou Fried Rice, and
nal knowledge sources. RAG can be also viewed as a tech- other rice-based dishes …
nique to improve the inference efficiency when handling a …
large amount of data. Instead of merging all information
into an excessively long prompt, RAG only adds relevant (a) The skeleton stage (b) The point-expanding stage
retrieved information to the original prompt, ensuring that
the model receives necessary information while reducing Fig. 6. Demonstration of the inference process of SoT.
prompt length significantly. FLARE [28] uses predictions of
upcoming sentences to proactively decide when and what overhead brought by the extra prompt (i.e., skeleton prompt
information to retrieve. REPLUG [29] treats the LLM as a and point-expanding prompt), SoT discusses the possibility of
black box and augments it with a tuneable retrieval model. sharing the KV cache of the common prompt prefix across
It prepends retrieved documents to the input for the frozen multiple points in the point expansion phase. Additionally,
black-box LLM, and further utilizes the LLM to supervise SoT uses a router model to decide whether applying SoT is
the retrieval model. Self-RAG [30] enhances LLM’s quality appropriate for specific questions, aiming to limit its use to
and factuality through retrieval and self-reflection. It intro- suitable cases. As a result, SoT achieves up to a 2.39× speed-
duces reflection tokens to make the LLM controllable during up on 12 recently released LLMs, and improves the answer
the inference phase. quality for many questions by improving the diversity and
relevance of their answer.
SGD [46] further extends the idea of SoT by organizing
4.2 Output Organization sub-problem points into a Directed Acyclic Graph (DAG)
The traditional generation process of LLMs is entirely se- and answering the logic-independent sub-problems in par-
quential, leading to significant time consumption. Output allel in one turn. Similar to SoT, SGD also leverages the
organization techniques aim to (partially) parallelize gener- emerging ability of LLMs to generate the output structure
ation via organizing the structure of output content. by providing manually-crafted prompts along with several
Skeleton-of-Thought (SoT) [45] is pioneering in this di- examples. SGD relaxes the strict independence assumption
rection. The core idea behind SoT is to leverage the emerg- among different points to enhance the quality of answers,
ing ability of LLMs to plan the output content’s struc- especially for math and coding problems. Compared with
ture. Specifically, SoT consists of two main phases. In the SoT, SGD prioritizes answer quality over speed. Addition-
first phase (i.e., skeleton phase), SoT instructs the LLM to ally, SGD introduces an adaptive model selection approach,
generate a concise skeleton of the answer using a prede- assigning an optimal model size to handle each sub-problem
fined ”skeleton prompt.” For instance, given a question like based on its estimated complexity, thus further improving
”What are the typical types of Chinese dishes?”, the output efficiency.
at this stage would be a list of dishes (e.g., noodles, hot APAR [47] adopts a similar idea with SoT, leveraging
pot, rice) without elaborate descriptions. Then, in the second LLMs to output special control tokens (i.e., [fork]) for auto-
phase (i.e., point-expanding phase), SoT instructs the LLM matically and dynamically triggering the parallel decoding.
to expand each point in the skeleton simultaneously using To effectively exploit the inherent parallelizable structure
a ”point-expanding prompt,” and then concatenates these within the output content and accurately generate control
7

tokens, APAR fine-tunes the LLMs on carefully-designed In addition to optimizing the efficiency of existing frame-
data that formed in specific tree structure. As a result, APAR works, certain studies focus on designing more efficient
achieves an average 1.4∼2.0× speed-up on benchmarks and agent frameworks directly. For example, FrugalGPT [56]
cases a negligible impact on the answer quality. Further- proposes a model cascade comprising LLMs of varying
more, APAR combines their decoding approach with the sizes, with the inference process being halted early if the
speculative decoding technique (i.e., Medusa [48]) and serv- model reaches a sufficient level of certainty regarding the
ing system (i.e. vLLM [49]) to further improve the inference answer. This approach aims to achieve efficiency by leverag-
latency and system throughput, respectively. ing a tiered model architecture and intelligent inference ter-
SGLang [50] introduces a domain-specific language mination based on model confidence estimation. Compared
(DSL) in Python featuring primitives that flexibly facili- with model-level dynamic inference techniques (Sec. 5.2.5),
tate LLM programming. The core idea behind SGLang is FrugalGPT performs dynamic inference at the pipeline level.
to analyze dependencies among various generation calls
automatically, and perform batch inference and KV cache
5 M ODEL - LEVEL O PTIMIZATION
sharing based on this analysis. With this language, users can
implement various prompting strategies easily and benefit The model-level optimization for LLM efficient inference
from the automatic efficiency optimization of SGLang (e.g., mainly concentrates on optimizing the model structure or
SoT [45], ToT [51]). Furthermore, SGLang introduces and data representation. Model structure optimization involves
combines several system-level compilation techniques, such directly designing efficient model structure, modifying the
as code movement and prefetching annotations. original model and adjusting the inference-time architec-
ture. In terms of data representation optimization, the model
quantization technique is commonly employed.
4.3 Knowledge, Suggestions and Future Direction In this section, we categorize model-level optimization
The growing demand for LLMs to handle longer inputs and techniques based on the additional training overhead they
generate longer outputs highlights the importance of the require. The first category involves designing more efficient
data-level optimization techniques. Within these techniques, model structures (referred to as efficient structure design).
input compression methods primarily target enhancing the Models developed using this approach typically require
prefilling stage by diminishing the computational and mem- training from scratch. The second category focuses on com-
ory cost resulting from the attention operation. Additionally, pressing pre-trained models (referred to as model compres-
for API-based LLMs, these methods can reduce the API cost sion). Compressed models in this category generally require
associated with input tokens. In contrast, output organiza- only minimal fine-tuning to restore their performance.
tion methods concentrate on optimizing the decoding stage
by alleviating the substantial memory access cost associated 5.1 Efficient Structure Design
with auto-regressive decoding approach.
Currently, state-of-the-art LLMs commonly employ the
As LLMs become more and more capable, there is poten-
Transformer architecture, as discussed in Section 2.1. How-
tial to utilize them to compress the input prompts or struc-
ever, the key components of Transformer-based LLMs, in-
ture the output content. Recent advancements in output
cluding the Feed Forward Network (FFN) and attention
organization methods [45], [46], [47] demonstrate the effec-
operation, present efficiency challenges during inference.
tiveness of leveraging LLMs to organize the output content
We identify the causes as follows:
into independent points or a dependency graph, facilitating
• The FFN contributes a substantial portion of the model
batch inference for improving generation latency. These
methods capitalize on the inherent parallelizable structure parameters in Transformer-based LLMs, resulting in
within output content, enabling LLMs to perform parallel significant memory access cost and memory usage,
decoding to enhance hardware utilization and thereby re- particularly during the decoding stage. For instance, the
duce end-to-end generation latency. FFN module accounts for 63.01% of the parameters in
Recently, diverse prompting pipelines (e.g., ToT [51], the LLaMA-7B model and 71.69% in the LLaMA-70B
GoT [52]) and agent frameworks [53], [54], [55] are emerg- model.
• The attention operation demonstrates quadratic com-
ing. While these innovations enhance LLMs’ capabilities,
they also extend the length of inputs, leading to increased plexity in the input length, leading to substantial com-
computational cost. To address this challenge, adopting putational cost and memory usage, especially when
input compression techniques to reduce input length shows dealing with longer input contexts.
promise as a solution. Simultaneously, these pipelines and To tackle these efficiency challenges, several studies have
frameworks naturally introduce more parallelism into out- concentrated on developing more efficient model structures.
put structures, offering increased potential for parallel de- We categorize these studies into three groups (as depicted in
coding and key-value (KV) cache sharing across different Fig. 7): efficient FFN design, efficient attention design, and
decoding threads. SGLang [50] supports flexible LLM pro- Transformer alternates.
gramming and offers opportunities for front-end and back-
end co-optimization, laying the groundwork for further 5.1.1 Efficient FFN Design
extensions and improvements in this area. In summary, In this field, many studies concentrate on integrating the
data-level optimization, including input compression and Mixture-of-Experts (MoE) technique [96] into LLMs to en-
output organization techniques, would become increasingly hance their performance while maintaining the computa-
necessary to enhance efficiency in the foreseeable future. tional cost. The core idea of MoE is to dynamically allocate
8

Efficient FFN Design Switch Transformers [86], MoEfication [87], MPOE [88], Sparse Upcy-
cling [89], BASE [90], Expert Choice [91], SE-MoE [92], StableMoE [93], SMoE-
Dropout [94], GLaM [95], Mixtral 8x7B [12]

Kernel-based Linear Transformer [82],

Attention Performers [83], RFA [84],
Low-Complexity Attention PolySketchFormer [85]

Low-Rank Linformer [77], LRT [78],

Efficient Efficient Attention Design Attention FLuRKA [79],Luna [80],
Structure Set Transformer [81]
Design
Multi/Group- MQA [75], GQA [76]
Query Attention

SSM HiPPO [62], LSSL [63], S4 [64], DSS [65],

S4D [66], GSS [67], H3 [68], Liquid S4 [69],
S5 [70], BST [71], BiGS [72], Mamba [73],
Transformer Alternates MambaFormer [74]

Others SGConv [57], CKConv [58], Hyena [59],

RWKV [60], RetNet [61]

Fig. 7. Taxonomy of the efficient structure design for Large Language Models.

varying computational budgets to different input tokens. In ence efficiency. Current MoE implementations [86], [97], [98]
MoE-based Transformers, multiple parallel Feed Forward often use batched matrix multiplication to compute all FFN
Networks (FFNs), namely experts, are utilized alongside a experts simultaneously. This requires that the input matrices
trainable routing module. During inference, the model se- of each expert must have the same shape. However, since
lectively activates specific experts for each token controlled the load imbalance problem exists, input token sets for
by the routing module. these under-utilized experts are needed to be padded to
Some researches concentrate on the construction of FFN meet the shape constraint, resulting in a waste of compu-
expert, which mainly focus on optimizing the process of tation. Therefore, the major aim of routing module design
acquiring expert weights or making these experts more is achieving better balance in token assignment for MoE
lightweight for efficiency. For instance, MoEfication [87] de- experts. Switch Transformers [86] introduces an additional
vises a method to transform a non-MoE LLM into the MoE loss, namely the load balancing loss, into the final loss
version using its pre-trained weights. This approach elimi- function to penalize imbalanced assignments by the routing
nates the need for expensive pre-training of the MoE model. module. This loss is formulated as the scaled dot-product
To accomplish this, MoEfication first divides FFN neurons between the token assignment fraction vector and a uniform
of the pre-trained LLM into multiple groups. Within each distribution vector. As a result, the loss is minimized only
group, the neurons are commonly activated simultaneously when the token assignment is balanced across all experts.
by the activation function. Then, it restructures each group This approach encourages the routing module to distribute
of neurons as an expert. Sparse Upcycling [89] introduces a tokens evenly among experts, promoting load balance and
method to initialize the weights of MoE-based LLM directly ultimately improving model performance and efficiency.
from a dense model’s checkpoint. In this approach, the BASE [90] learns an embedding for each expert in an end-
experts within the MoE-based LLM are exact replicas of to-end manner and then assigns experts to tokens based on
the FFN from the dense model. By employing this straight- the similarity of their embeddings. To ensure load balance,
forward initialization, Sparse Upcycling can efficiently train BASE formulates a linear assignment problem and utilizes
the MoE model to achieve high performance. MPOE [88] the auction algorithm [99] to solve this problem efficiently.
proposes to reduce the parameters of MoE-based LLMs Expert Choice [91] introduces a simple yet effective strategy
through Matrix Product Operators (MPO) decomposition. to ensure perfect load balance within MoE-based models.
This method involves decomposing each weight matrix of Unlike previous methods that assign experts to tokens,
the FFN into a global shared tensor containing common Expert Choice allows each expert to independently select
information and a set of local auxiliary tensors that capture the top-k tokens based on their embedding similarities. This
specialized features. approach ensures that each expert handles a fixed number
of tokens, even though each token might be assigned to a
Another line of researches focuses on improving the
different number of experts.
design of the routing module (or strategy) within MoE
models. In previous MoE models, the routing module often In addition to the aforementioned researches focusing
causes the load imbalance problem, which denotes that on the model architecture itself, there are also studies that
some experts are assigned a large number of tokens while concentrate on improving the training methods for MoE-
the others handle only a few. This imbalance not only based models. SE-MoE [92] introduces a new auxiliary loss
wastes the capacities of the under-utilized experts, which called the router z-loss, which aims to enhance the stability
degrades model performance, but also degrades the infer- of model training without compromising performance. SE-
9

MoE identifies that the exponential functions introduced by MQA. Specifically, GQA segments the attention heads into
softmax operations in the routing module can exacerbate groups, storing a single set of K and V values for each
roundoff errors, leading to training instability. To address group. This method not only sustains the benefits of MQA
this issue, the router z-loss penalizes large logits that are in- in reducing memory overhead but also offers an enhanced
put into exponential functions, thereby minimizing roundoff balance between inference speed and output quality.
errors during training. StableMoE [93] points out the routing Low-Complexity Attention. Low-complexity attention
fluctuation problem existing in the MoE-based LLMs, which methods aim to design new mechanisms that reduce the
denotes the inconsistency of the expert assignment in the computational complexity of each attention head. To sim-
training and inference stage. For the same input token, it is plify the discussion, we assume that the dimensions of the
assigned to different experts along with training, but only Q (query), K (key), and V (value) matrices are identical,
activates one expert at inference time. To address this issue, with Q, K, V ∈ Rn×d . Since the following work does not
StableMoE suggests a more consistent training approach. involve altering the number of attention heads like MQA,
It first learns a routing strategy and then keeps it fixed our discussions focus on the attention mechanism within
during both the model backbone training and the inference each head. As introduced in Section 2.2, the computational
stage. SMoE-Dropout [94] designs a novel training method complexity of the conventional attention mechanism scales
for MoE-based LLMs, which proposes to gradually increase as O(n2 ), exhibiting quadratic growth with respect to the in-
the number of activated experts during the training process. put length n. To address the inefficiency issue, kernel-based
This approach enhances the scalability of MoE-based mod- attention and low-rank attention methods are proposed to
els for inference and downstream fine-tuning. GLaM [95] reduce the complexity to O(n).
pre-trains and releases a series of models with various pa- • Kernel-based Attention. Kernel-based attention designs
rameter sizes, demonstrating their comparable performance kernel ϕ to approximate the non-linear softmax oper-
to dense LLMs on few-shot tasks. The largest model in this ation of Softmax(QK T ) with a linear dot product be-
family has a parameter size of up to 1.2 trillion. Mixtral tween kernel-transformed feature maps, i.e., ϕ(Q)ϕ(K)T .
8x7B [12] is a remarkable recently released open-source It avoids the conventional quadratic computation associ-
model. During inference, it utilizes only 13 billion active ated with QK T ∈ Rn×n by prioritizing the computation
parameters and achieves superior performance compared of ϕ(K)T V ∈ Rd×d , followed by its multiplication with
to the LLaMA-2-70B model across different benchmarks. ϕ(Q) ∈ Rn×d . Specifically, the input Q and K matrices are
Mixtral 8x7B consists of 8 Feed-Forward Network (FFN) first mapped into kernel space using a kernel function ϕ,
experts in each layer, with each token assigned to two while maintaining their original dimensions. Leveraging
experts during inference. the associative property of matrix multiplication allows
for the multiplication of K and V prior to their interaction
5.1.2 Efficient Attention Design with Q. The attention mechanism is reformulated as:
The attention operation is a critical component in the Trans-
former architecture. However, its quadratic complexity in Softmax(QK T )V ≈ ϕ(Q)(ϕ(K)T V ), (5)
relation to input length leads to substantial computational where ϕ(Q), ϕ(K) ∈ R n×d
. This strategy effectively re-
cost, memory access cost, and memory usage, especially duces the computational complexity to O(nd2 ), render-
when dealing with long contexts. To address this issue, ing it linear with respect to the input length. Linear
researchers are exploring more efficient approaches to ap- Transformer [82] is the first work to propose the kernel-
proximate the functionality of the original attention oper- based attention. It adopts ϕ(x) = elu(x) + 1 as the ker-
ation. These studies can be broadly categorized into two nel function, where elu(·) denotes the exponential linear
main branches: multi-query attention and low-complexity unit activation function. Performers [83] and RFA [84]
attention. proposes to use random feature projection to better ap-
Multi-Query Attention. Multi-query attention (MQA) [75] proximate the softmax function. PolySketchFormer [85]
optimizes the attention operation by sharing the key (K) employs polynomial functions and sketching techniques
and value (V) cache across different attention heads. This to approximate the softmax function.
strategy effectively reduces both memory access cost and
• Low-Rank Attention. Low-Rank Attention technique em-
memory usage during inference, contributing to improved
ploys compression on the token dimensions (i.e., n) of the
efficiency in Transformer models. As introduced in Sec. 2.2,
K and V matrices to a smaller, fixed length (i.e., k ) before
the Transformer-style LLMs typically adopts multi-head
performing the attention computation. The approach is
attention (MHA) operation. This operation requires stor-
based on the insight that the n × n attention matrix
ing and retrieving K and V pairs for each attention head
often exhibits a low-rank property, making it feasible to
during the decoding stage, leading to substantial increases
compress it in the token dimension. The main focus of
in memory access cost and memory usage. MQA tackles
this line of researches is to design effective methods for
this challenge by using the same K and V pairs across
the compression, where X can be context matrix or K and
different heads while maintaining distinct query (Q) values.
V matrices:
Through extensive testing, it has been demonstrated that
MQA significantly reduces memory requirements with only
X ∈ Rn×d → X ′ ∈ Rk×d . (6)
a minimal impact on model performance, making it a cru-
cial strategy for enhancing inference efficiency. The concept One line of work uses linear projection to compress the
of MQA is further extended by Grouped-query attention token dimension. It is done by multiplying K and V matri-
(GQA) [76], which can be seen as a blend of MHA and ces with projection matrices Pk , Pv ∈ Rk×n . In this way,
10

the computational complexity of the attention operation where A, B and C denote the transition matrices, x denotes
is reduced to O(nkd), which is linear to the input length. the intermediate state and u denotes the input sequence. (2)
Linformer [77] first observes and analyses the low-rank They design the transition matrix A based on the HiPPO
property of the attention map, and proposes the low-rank theory [62]. Specifically, HiPPO proposes to compress the
attention framework. LRT [78] proposes to simultaneously input sequence into a sequence of coefficients (namely state)
apply low-rank transformation to both attention block by projecting it onto a set of polynomial bases.
and FFN to further improve the computational efficiency. Building upon the aforementioned framework, several
FLuRKA [79] combines the low-rank transformation and studies concentrate on improving the parameterization or
kernalization to the attention matrices to further improve initialization of the transition matrix A. This involves re-
the efficiency. Specifically, it first reduces the token dimen- fining how the matrix is formulated or initialized within
sion of K and V matrices, and then applies kernel function the SSM to enhance its effectiveness and performance in
to the Q and low-rank K matrices. sequence modeling tasks. LSSL [63] firstly proposes to ini-
Aside from linear projection, other token-dimension tialize A with the optimal transition matrix HiPPO-LegS
compression methods are also proposed. Luna [80] and designed by HiPPO. In addition, LSSL also trains the SSM
Set Transformer [81] leverage additional attention compu- in a convolution manner by unrolling the Eq. 7. Specifically,
tations alongside smaller queries to effectively compress through a convolution kernel defined as KL (A, B, C) =
the K and V matrices. Luna [80] involves an extra query (CAi B)i∈[L] = (CB, CAB, ..., CAL−1 B), the Eq. 7 can
matrix of fixed length k . The small query performs at- be rewritten as y = KL (A, B, C) ∗ u and also can be
tention with the original context matrix, termed as pack computed efficiently via Fast Fourier Transform (FFT). How-
attention, to compress the context matrix to size Rk×d . ever, computing this convolution kernel is expensive, since
Subsequently, the regular attention, termed unpack atten- it requires multiple times of multiplication by A. To this
tion, applies attention to the original Q matrices and the end, S4 [64], DSS [65] and S4D [66] propose to diagonalize
compressed K and V matrices. The extra query matrix the matrix A, which can accelerate the computing. This can
can be learnable parameters or acquired from previous be seen as a parameterization technique to the transition
layers. Set Transformer [81] designs the similar technique matrix A. Previous SSMs processed each input dimension
by introducing an inducing points vector with fixed length. independently, resulting in a large number of trainable
Unlike previous works that compress K and V, Funnel- parameters. To enhance efficiency, S5 [70] proposes to simul-
Transformer [100] uses pooling operation to gradually taneously process all input dimensions using a single set of
compress the sequence length of the Q matrix. parameters. Building upon this structure, S5 introduces a
parameterization and initialization method for A based on
the standard HiPPO matrix. Liquid S4 [69] and Mamba [73]
5.1.3 Transformer Alternates
parameterize the transition matrices in a input-dependent
In addition to applying efficient techniques to the attention manner, which further enhances the modeling capability of
operation, recent studies have also innovated to design SSM. Additionally, both S5 [70] and Mamba [73] adopt a
sequence modeling architectures that are efficient yet ef- parallel scan technique for efficient model training without
fective. Table 2 compares the efficiency of some represen- the need for convolution operations. This technique offers
tative non-Transformer models. These architectures exhibit advantages in implementation and deployment on modern
sub-quadratic computational complexity with respect to se- GPU hardware.
quence length during both training and inference, enabling Another line of research aim to design better model
LLMs to significantly increase their context length. architecture based on SSMs. GSS [67] and BiGS [72] com-
Within this research field, two prominent lines of study bines the Gated Attention Unit (GAU) [102] with SSM.
have garnered significant attention. One line of studies con- Specifically, they replace the attention operation in GAU
centrates on the State Space Model (SSM), which formulates with SSM operation. BST [71] combines the SSM model
sequence modeling as a recurrence transformation based with the proposed Block Transformer which introduces a
on the HiPPO theory [62]. Additionally, other studies pri- strong local inductive bias. H3 [68] observes that SSM is
marily focus on employing long convolutions or designing weak in recalling the earlier tokens and comparing a token
attention-like formulations to model sequences. across the sequence. To this end, it proposes to add a shift
State Space Model. The State Space Model (SSM) has SSM operation before the standard SSM operation, which
demonstrated competitive modeling capabilities in certain is used to directly shift the input tokens into the state.
Natural Language Processing (NLP) [73] and and Computer MambaFormer [74] combines the standard Transformer and
Vision (CV) [101] tasks. Compared to attention-based Trans- SSM model by substituting the FFN layer in the Trans-
formers, SSM exhibits linear computational and memory former with an SSM layer. Jamba [103] introduces another
complexity with respect to the input sequence length, which approach to combining the Transformer and SSM models by
enhances its efficiency in handling long-context sequences. adding four Transformer layers into an SSM model. Dense-
In this survey, SSM refers to a series of model architectures Mamba [104] explores the issue of hidden state degradation
that satisfy the following two properties: (1) They model in traditional SSMs and introduces dense connections within
sequence based on the following formulation proposed by the SSM architecture to preserve fine-grained information
HiPPO [62] and LSSL [63]: across deeper layers of the model. BlackMamba [105] and
MoE-Mamba [106] propose to enhance SSM models with the
xk = Axk−1 + Buk , Mixture-of-Experts (MoE) technique to optimize the train-
(7)
yk = Cxk , ing and inference efficiency while maintaining the model
11

TABLE 2
Efficiency comparison of some novel non-Transformer models. Note that we denote n as the input length and d as the input dimension.

Training Computational Training Memory Inference Computational Complexity

Model Training Form Complexity Complexity Inference Form
Prefilling Decoding (per token)
Transformer [25] Transformer-like O(n2 d) O(n2 + nd) Transformer-like O(n2 d) O(nd)
S4 [64] Convolution O(nd2 log n) O(nd) Recurrence O(nd2 ) O(d2 )
Mamba [73] Recurrence O(nd2 log n) O(nd) Recurrence O(nd2 ) O(d2 )
Hyena [59] Convolution O(nd log n) O(nd) Convolution O(nd log n) O(nd log n)
RetNet [61] Transformer-like O(n2 d) O(n2 + nd) Recurrence O(nd2 ) O(d2 )
RWKV [60] Recurrence O(nd2 ) O(nd) Recurrence O(nd2 ) O(d2 )

performance. length-agnostic in the decoding stage. Furthermore, in the

Other Alternates. In addition to SSMs, several other efficient decoding phase, these novel architectures eliminate the need
alternates have also garnered significant attention, including to cache and load features of previous tokens (similar to the
long convolution and attention-like recurrence operation. key-value cache in Transformer-based language models),
Long convolution has been adopted in long sequence resulting in significant memory access cost savings.
modeling in some studies [57], [58], [59]. These efforts
mainly focus on the parameterization of the convolution 5.2 Model Compression
parameters. For example, Hyena [59] adopts a implicit
Model compression encompasses a range of techniques de-
parametrization for the long convolution via a shallow
signed to enhance the inference efficiency of a pre-trained
feed-forward neural network (FFN). Besides, it also intro-
model by modifying its data representation (e.g., quantiza-
duces a element-wise multiplicative gating to control the
tion) or altering its architecture (e.g., sparsification, struc-
parametrization by the input data. Several recent studies
tural optimization, and dynamic inference), as depicted in
have applied long convolution in the context of modeling
Fig. 8.
long sequences [57], [58], [59]. These investigations primar-
ily concentrate on refining the parameterization of the con- 5.2.1 Quantization
volution kernel. For instance, Hyena [59] employs an data-
Quantization is a widely employed technique that reduces
dependent parameterization method for long convolutions
the computational and memory cost of LLMs by converting
using a shallow feed-forward neural network (FFN).
the models’ weights and activations from high bit-width to
Other studies [60], [61] aim to design the operation that
low bit-width representations. Specifically, many methods
has a similar form as the attention operation but can be
involve quantizing FP16 tensors into low-bit integer tensors,
enrolled to the recurrent manner, enabling both efficient
which can be represented as follows:
training and efficient inference. For instance, RWKV [60]
XFP16 − Z

builds upon AFT [107], which proposes to substitute the
XINT = , (9)
attention operation in the Transformer model with the fol- S
lowing equation:
max(XFP16 ) − min(XFP16 )
PT S= , (10)
t′ =1 exp(Kt + wt,t ) ⊙ Vt
′ ′ ′ 2N −1 − 1
Yt = σq (Qt ) ⊙ , (8)
PT where XFP16 denotes the 16-bit floating-point (FP16) value,
t′ =1 exp(Kt + wt,t )
′ ′
XINT denotes the low-precision integer value, N denotes
where Q, K , and V are the query, key, and value matrices the number of bits, and S and Z denote the scaling factor
as in Transformer, w ∈ RT ×T denotes a learnable pair- and zero-point.
wise position bias and σq (·) denotes a non-linear function. In the following, we start with an efficiency analysis to
Specifically, it further reparameterizes the position bias as illustrate how quantization techniques reduce the end-to-
′
wt,t′ = −(t − t )w, and thus can rewrite Eq. 8 in a recursive end inference latency of LLMs. Subsequently, we offer a de-
form. In this way, RWKV can combine the effective paral- tailed introduction to two distinct quantization workflows:
lelizable training feature of Transformer and the efficient Post-Training Quantization (PTQ) and Quantization-Aware
inference ability of RNN. Training (QAT), respectively.
Efficiency Analysis. We analyze and compare the com- Efficiency Analysis. As discussed in Section 2.2, the infer-
putational and memory complexity of several innovative ence process of LLMs involves two stages: the prefilling
and representative non-transformer architectures in Table 2. stage and the decoding stage. During the prefilling stage,
In terms of training time, many studies (e.g., S4, Hyena, LLMs typically handle long token sequences, and the pri-
RetNet) aim to preserve training parallelism by adopting mary operation is general matrix multiplication (GEMM).
training forms such as the convolution or attention. Notably, The latency of the prefilling stage is primarily constrained
Mamba utilizes parallel scan techniques for processing input by the computation performed by high-precision CUDA
sequences, thereby leveraging training parallelism as well. Cores. To address this challenge, existing methods quan-
On the other hand, during inference, most studies opt tize both weights and activations to accelerate computation
for recurrent architectures to maintain linear computational using low-precision Tensor Cores. As illustrated in Figure 9
complexity in the prefilling stage and to remain context (b), activation quantization is performed online before each
12

Post-Training Quantization GPTQ [182], LUT-GEMM [183], AWQ [184],

OWQ [185], SpQR [186], SqueezeLLM [187],
QuIP [188], FineQuant [189], QuantEase [190],
LLM-MQ [191], ZeroQuant [192], Flex-
Gen [193], LLM.int8() [194], Smoothquant
Quantization [195], ZeroQuant-V2 [196], RPTQ [197],
OliVe [198], OS+ [166], ZeroQuant-FP [199],
Omniquant [200], QLLM [201], ATOM [202],
LLM-FP4 [177], BiLLM [203], Li et.al. [204]

Quantization- LLM-QAT [177], Norm Tweaking [178],

aware Training QLoRA [179], QA-LoRA [180], LoftQ [181]

Weight Pruning SparseGPT [162], Wanda [163], ISC [164],

Prune and Tune [165], OWL [166], BESA [167],
RIA [168], LLM-Pruner [169], Sheared
LLaMA [170], ZipLM [171], LoRAPrune [172],
Sparsification LoRAShear [173], SliceGPT [174], PLA-
TON [175], SIMPLE [176]

Sparse Attention Sparse Transformer [147], StreamingLLM [148],

Longformer [149], Bigbird [150], Structured
Sparse Attention [151], SemSA [152], Spat-
ten [153], SeqBoat [154], Adaptively Sparse
Attention [155], Reformer [156], Sparse Flash
Attention [157], Routing Transformer [158],
Sparse Sinkhorn Attention [159], H2 O [160],
Model Diffuser [161]
Compression
Structure Factorization LoRD [140], TensorGPT [141], LoSparse [142],
LPLR [143], ZeroQuant-V2 [144], DS-
Structure Optimization Former [145], ASVD [146]

Neural Architecture Search AutoTinyBERT [135], NAS-BERT [136], Struc-

ture pruning via NAS [137], LiteTransform-
erSearch [138], AutoDistil [139]

White-box KD MiniLLM [129], GKD [130], TED [131], BabyL-

lama [132], MiniMoE [133], KPTD [134]
Knowledge Distillation
Black-box KD Multitask-ICT [116], MCKD [117], Distilling
Step-by-Step [118], SCoTD [119], CoT Prompt-
ing [120], MCC-KD [121], Fine-tune-CoT [122],
Socratic CoT [123], PaD [124], SCOTT [125],
DISCO [126], LaMini-LM [127], Lion [128]

Sample-level FastBERT [110], MPEE [111], Global Past-

Future Early Exit [112], DeeBERT [113],
Dynamic Inference PABEE [114],HASHEE [115]

Token-level CALM [108], SkipDecode [109]

Fig. 8. Taxonomy of model compression methods for Large Language Models.

GEMM operation, allowing computation with low-precision need for retraining, which can be a costly process. While
Tensor Cores (e.g., INT8). This quantization approach is PTQ methods have been well-explored for smaller mod-
referred to as Weight-Activation Quantization. els, applying existing quantization techniques directly to
In contrast, during the decoding stage, LLMs process LLMs presents challenges. This is primarily because the
only one token at each generation step using general matrix- weights and activations of LLMs often exhibit more outliers
vector multiplication (GEMV) as the core operation. The and have a wider distribution range compared to smaller
latency of the decoding stage is mainly influenced by the models, making their quantization more challenging. In
loading of large weight tensors. To tackle this challenge, summary, the complex nature of LLMs, characterized by
existing methods focus on quantizing only the weights to their size and complexity, requires specialized approaches
accelerate memory access. This method, known as Weight- to effectively handle the quantization process. The presence
only Quantization, involves offline quantization of weights, of outliers and wider distribution ranges in LLMs necessi-
followed by de-quantization of the low-precision weights tates the development of tailored quantization techniques
into FP16 format for computation, as shown in Figure 9 (a). that can account for these unique characteristics without
Post-Training Quantization. Post-training quantization compromising model performance or efficiency.
(PTQ) involves quantizing pre-trained models without the Numerous studies have concentrated on developing
13

TABLE 3
Summary of the representative studies on Post-Training Quantization. Quantized Tensor Type denotes which parts of tensors are quantized.
Quantized Format denotes whether to adopt uniform or non-uniform quantization. Quantized Criterion denotes the how to decide the parameters
(e.g., scaling factor, zero-point). Quantized Value Update denotes whether to change the model weight (e.g., compensation, re-parameterization)
during the quantization process.

Quantized Tensor Type Quantized Quantized Quantized

Model Format Criterion Value Update
Weight Activation KV Cache
GPTQ [182] ✓ Uniform Statistic-based ✓
LUT-GEMM [183] ✓ Non-uniform Statistic-based
AWQ [184] ✓ Uniform Search-based ✓
SqueezeLLM [187] ✓ Non-uniform Statistic-based
LLM.int8() [194] ✓ ✓ Uniform Statistic-based
SmoothQuant [195] ✓ ✓ Uniform Statistic-based ✓
RPTQ [197] ✓ ✓ Uniform Statistic-based
OmniQuant [200] ✓ ✓ Uniform Search-based
FlexGen [193] ✓ ✓ Uniform Statistic-based
Atom [202] ✓ ✓ ✓ Uniform Statistic-based
KVQuant [205] ✓ Non-uniform Statistic-based
KIVI [206] ✓ Uniform Statistic-based

In weight-only quantization, GPTQ [182] represents an

Weight Bias
De-quantize De-quantize early advancement in LLM quantization, building upon the
(INT8) (INT8)
traditional algorithm OBQ [207]. OBQ utilizes an optimal
quantization order per row of the weight matrix, guided
Activation FP16 FP16 Activation
(FP16) GEMM/GEMV Accumulator (FP16)
by the reconstruction error relative to the Hessian matrix
of unquantized weights. After each quantization step, OBQ
(a) Weight-only Quantization iteratively adjusts the unquantized weights to mitigate re-
construction errors. However, the frequent updating of the
Weight INT8 INT8 Hessian matrix during quantization escalates computational
Bias
(INT8) GEMM/GEMV Accumulator (INT8) complexity. GPTQ streamlines this process by adopting a
INT32
uniform left-to-right order for quantizing each row, thus cir-
cumventing the need for extensive Hessian matrix updates.
Activation Activation
Quantization De-quantize This strategy substantially reduces computational demands
(FP16) (FP16)
by computing the Hessian matrix solely during the quanti-
(b) Weight-Activation Quantization zation of one row, then leveraging the computing results
for subsequent rows, expediting the overall quantization
Fig. 9. (a) The inference workflow of Weight-only Quantization. (b) The procedure. LUT-GEMM [183] presents a novel dequantiza-
inference workflow of Weight-Activation Quantization. tion method utilizing a Look-Up Table (LUT), aiming to
accelerate the inference process of quantized LLMs by re-
ducing the dequantization overhead. Additionally, it adopts
a non-uniform quantization approach known as Binary-
effective quantization algorithms to compress LLMs. We
Coding Quantization (BCQ), which incorporates learnable
provide a synthesis of representative algorithms categorized
quantization intervals. AWQ [184] observes that weight
across four dimensions in Tab. 3. Regarding the types of
channels vary in importance for performance, particularly
quantized tensors, certain studies [182], [183], [184], [187]
emphasizing those aligned with input channels exhibiting
concentrate on weight-only quantization, whereas many
outliers in activations. To enhance the preservation of crit-
others [194], [195], [197] focus on quantizing both weights
ical weight channels, AWQ utilizes a reparameterization
and activations. Notably, in LLMs, the KV cache represents
method. This technique selects reparameterization coeffi-
a distinctive component that impacts memory and mem-
cients via grid search to minimize reconstruction errors
ory access. Consequently, some investigations [193], [202],
effectively. OWQ [185] observes the difficulty of quantiz-
[205] propose KV cache quantization. Regarding quantized
ing weights associated with activation outliers. To address
formats, the majority of algorithms adopt a uniform format
this challenge, OWQ employs a mixed-precision quanti-
for straightforward hardware implementation. Concerning
zation strategy. This method identifies weak columns in
the determination of quantized parameters (e.g., scale, zero-
the weight matrix and allocates higher precision to these
point), most studies rely on statistics derived from weight or
specific weights, while quantizing the rest of the weights at a
activation values. Nevertheless, some research efforts [184],
lower precision level. SpQR [186] introduces a methodology
[200] advocate for searching optimal parameters based on
where weight outliers are identified and allocated higher
reconstruction loss. Furthermore, certain studies [182], [184],
precision during quantization, while the rest of the weights
[195] suggest updating unquantized weights (referred to as
are quantized to 3 bits. SqueezeLLM [187] proposes to store
Quantized Value Update) before or during the quantization
the outliers in a full-precision sparse matrix, and apply
process to enhance performance.
14

non-uniform quantization to the remaining weights. The liers is concentrated and asymmetrical, posing a challenge
values for non-uniform quantization are determined based to LLM quantization. To address this, OS+ introduces a
on quantization sensitivity, which contributes to improved channel-wise shifting and scaling technique aimed at allevi-
performance of the quantized model. QuIP [188] introduces ating these challenges. The shifting and scaling parameters
LDLQ, an optimal adaptive method for a quadratic proxy are determined through a search process to effectively han-
objective. The study reveals that ensuring incoherence be- dle the concentrated and asymmetrical outlier distribution.
tween weight and Hessian matrices can enhance the ef- ZeroQuant-FP [199] investigates the feasibility of quantizing
fectiveness of LDLQ. QuIP utilizes LDLQ and achieves weight and activation values into FP4 and FP8 formats.
incoherence by employing random orthogonal matrix mul- The study reveals that quantizing activations into floating-
tiplication. FineQuant [189] utilizes a heuristic approach point types (FP4 and FP8) produces superior results com-
to determine the granularity of quantization per column, pared to integer types. Omniquant [200] diverges from prior
combining empirical insights gained from experiments to approaches that rely on empirical design of quantization
design a quantization scheme. QuantEase [190] builds upon parameters. Instead, it optimizes the boundaries for weight
GPTQ. When quantizing each layer, it proposes a method clipping and the scaling factor for equivalent transformation
based on Coordinate Descent to compensate for the unquan- to minimize quantization errors. QLLM [201] addresses
tized weights more precisely. Additionally, QuantEase can the impact of outliers on quantization by implementing
leverage quantized weights from GPTQ as an initialization channel reassembly. Additionally, it introduces learnable
and further refine the compensation process. LLM-MQ [191] low-rank parameters to minimize quantization errors in
protects the weight outliers with FP16 format, and stores the post-quantized model. Atom [202] employs a strategy
them in Compressed Sparse Row (CSR) format for efficient involving mixed-precision and dynamic quantization for
computation. Besides, LLM-MQ models the bit-width as- activations. Notably, it extends this approach to quantize the
signment to each layer as an integer programming problem, KV cache into INT4 to enhance throughput performance.
and employs an efficient solver to solve it within a few LLM-FP4 [177] endeavors to quantize the entire model
seconds. Moveover, LLM-MQ designs a efficient CUDA ker- into FP4 format and introduces a pre-shifted exponent
nel to integrate dequantization operators, thereby reducing bias technique. This approach combines the scaling factor
memory access cost during computation. of activation values with weights to address quantization
For weight-activation quantization, ZeroQuant [192] em- challenges posed by outliers. BiLLM [203] represents one
ploys finer-grained quantization for weights and activa- of the lowest-bit PTQ efforts to date. BiLLM identified the
tions, leveraging kernel fusion to minimize the memory bell-shaped distribution of weights and the exceptionally
access cost during quantization and conducting layer-by- long-tail distribution of weights’ Hessian matrix. Based on
layer knowledge distillation to recover the performance. this, it proposes to categorize weights into salient and non-
FlexGen [193] quantizes weights and KV cache directly salient values structurally based on the Hessian matrix
into INT4 to reduce the memory footprint during infer- and binarizes them separately. As a result, BiLLM can
ence with large batch sizes. LLM.int8() [194] identifies that extensively quantize LLMs to 1.08 bits without significant
outliers in activations are concentrated within a small sub- degradation in perplexity. KVQuant [205] proposes a non-
set of channels. Leveraging this insight, LLM.int8() splits uniform quantization scheme for KV cache quantization, by
activations and weights into two distinct parts based on deriving the optimal datatypes offline on a calibration set.
the outlier distribution within input channels to minimize KIVI [206] proposes a tuning-free 2bit KV cache quantiza-
quantization errors in activations. Channels containing out- tion algorithm, which utilizes per-channel quantization for
lier data in both activations and weights are stored in key cache and per-token quantization for value cache in a
FP16 format, while other channels are stored in INT8 group-wise manner. Li et al. [204] conducted a thorough
format. SmoothQuant [195] employs a reparameterization evaluation to assess the impact of quantization on different
technique to address the challenges of quantizing activa- tensor types (including KV Cache), various tasks, 11 LLM
tion values. This method introduces a scaling factor that families, and SOTA quantization methods.
expands the data range of weight channels while shrinking Quantization-Aware Training. Quantization-aware training
the data range of corresponding activation channels. Zero- (QAT) incorporates the influence of quantization within the
Quant [192] introduces a group-wise quantization strategy model training procedure. By integrating layers that repli-
for weights and a token-wise quantization approach for cate quantization effects, this approach facilitates weight
activations. Building upon this methodology, ZeroQuant- adaptation to quantization-induced errors, leading to en-
V2 [196] presents the LoRC (Low-Rank Compensation) tech- hanced task performance. Nevertheless, training LLMs typ-
nique, employing low-rank matrices to mitigate quantiza- ically demands substantial training data and considerable
tion inaccuracies. RPTQ [197] identifies substantial varia- computational resources, posing potential bottlenecks for
tions in the distribution of different activation channels, QAT implementation. Consequently, current research en-
which present challenges for quantization. To mitigate this deavors focus on strategies to reduce the training data re-
issue, RPTQ reorganizes channels with similar activation quirements or alleviate the computational burden associated
distributions into clusters and independently applies quan- with QAT implementation.
tization within each cluster. OliVe [198] observes that the To reduce the data requirements, LLM-QAT [177] intro-
normal values neighboring to the outliers are less critical. duces a data-free method to generate the training data by
Therefore, it pairs each outlier with a normal value, sacri- using the original FP16 LLMs. Specifically, LLM-QAT uses
ficing the latter to achieve a broader representation range every token in the tokenization vocabulary as a starting
for outliers. OS+ [166] observes that the distribution of out- token to generate sentences. Based on the generated training
15

TABLE 4
Comparison of speed-ups in different scenarios (e.g., model size, batch size, input context length, inference framework) with W4A16 quantization
based on TensorRT-LLM [208] and LMDeploy [209] framework, respectively. We test the speed-ups of prefilling/decoding/end-to-end latency on a
single NVIDIA A100 GPU. OOM denotes “Out Of Memory”.

TensorRT-LLM
B 128 256 512 1024 2048
1 1.06/2.40/2.37 0.90/2.38/2.34 0.92/2.30/2.28 0.88/2.19/2.17 0.91/2.00/1.98
2 0.88/2.10/2.05 0.91/2.07/2.04 0.89/2.01/1.98 0.91/1.92/1.89 0.88/1.78/1.76
LLaMA-2-7B 4 0.92/1.72/1.67 0.89/1.67/1.64 0.90/1.61/1.58 0.87/1.53/1.51 0.84/1.42/1.40
8 0.91/1.43/1.36 0.88/1.38/1.33 0.83/1.33/1.28 0.77/1.25/1.21 0.78/1.16/1.14
16 0.91/1.43/1.36 0.88/1.38/1.33 0.83/1.33/1.28 0.77/1.25/1.21 0.78/1.16/1.14
B 128 256 512 1024 2048
1 1.24/2.51/2.50 0.89/2.45/2.47 0.94/2.34/2.42 0.90/2.18/2.32 0.83/1.94/2.16
2 0.90/2.51/2.50 0.95/2.45/2.47 0.90/2.34/2.42 0.83/2.18/2.32 0.80/1.94/2.16
LLaMA-2-13B 4 0.96/1.80/1.76 0.91/1.78/1.74 0.83/1.73/1.69 0.80/1.65/1.62 0.83/1.54/1.52
8 0.91/1.86/1.77 0.83/1.81/1.73 0.80/1.73/1.66 0.82/1.62/1.56 0.75/1.46/1.41
16 0.84/1.84/1.69 0.81/1.77/1.63 0.82/1.63/1.53 0.78/1.46/1.39 OOM
LMDeploy
B 128 256 512 1024 2048
1 1.30/2.11/2.09 0.94/2.07/2.05 0.90/2.03/2.02 0.88/1.97/1.96 0.94/1.92/1.91
2 1.03/2.24/2.20 0.90/2.19/2.15 0.88/2.11/2.08 0.93/1.97/1.95 0.85/1.78/1.76
LLaMA-2-7B 4 0.90/2.18/2.10 0.87/2.12/2.05 0.93/2.01/1.96 0.92/1.86/1.83 0.92/1.64/1.62
8 0.92/1.92/1.77 0.91/1.82/1.71 0.92/1.65/1.57 0.93/1.45/1.41 0.94/1.28/1.26
16 0.92/1.92/1.77 0.91/1.82/1.71 0.92/1.65/1.57 0.93/1.45/1.41 0.94/1.28/1.26
B 128 256 512 1024 2048
1 1.32/2.34/2.32 0.94/2.31/2.28 0.92/2.22/2.20 0.94/2.15/2.13 0.94/2.01/1.99
2 1.06/2.42/2.36 0.92/2.37/2.32 0.94/2.29/2.25 0.94/2.15/2.12 0.95/1.95/1.93
LLaMA-2-13B 4 0.93/2.36/2.26 0.94/2.29/2.21 0.94/2.18/2.12 0.95/2.01/1.97 0.96/1.78/1.75
8 0.92/2.24/2.10 0.93/1.93/2.02 0.94/1.81/1.89 0.94/1.65/1.71 0.95/1.45/1.49
16 0.93/2.02/1.85 0.94/1.90/1.76 0.94/1.73/1.63 0.95/1.50/1.45 OOM

data, LLM-QAT applies a distillation-based workflow to Tweaking [178] proposes to train the LayerNorm layer after
train the quantized LLM to match the output distribution quantization and use knowledge distillation to match the
of the original FP16 LLM. Norm Tweaking [178] limits output distribution of the quantized model with that of the
the selection of the starting token to only those language FP16 model, achieving effects similar to LLM-QAT while
categories listed among the top languages with the highest avoiding high training costs.
proportion. This strategy can effectively improve the gener- Comparative Experiments and Analysis. In this section, we
alization of the quantized model on different tasks. conduct experiments to evaluate the speed-ups achieved
To reduce the computation cost, many methods apply by employing the weight-only quantization technique in
parameter-efficient tuning (PEFT) strategies to accelerate various scenarios. Specifically, we focus on two widely-used
QAT. QLoRA [179] quantizes the weights of LLMs into large language models (LLMs), LLaMA-2-7B and LLaMA-2-
4-bit and subsequently employs LoRA [210] in BF16 for 13B, and quantize their weights to 4-bit using the AWQ [184]
each 4-bit weight matrix to fine-tune the quantized model. algorithm. Subsequently, we deploy these quantized models
QLoRA allows for the efficient fine-tuning of a 65B param- on a single NVIDIA A100 GPU using two different inference
eter LLM on one GPU with only 30GB of memory. QA- frameworks: TensorRT-LLM [208] and LMDeploy [209]. We
LoRA [180] proposes to incorporate group-wise quantiza- then evaluate the speed-ups achieved by these frameworks
tion into QLoRA. The authors observe that the number of across different input sequences characterized by varying
quantization parameters in QLoRA is significantly smaller batch sizes and context lengths.
than the number of LoRA parameters, leading to an imbal- We present the speed-ups of prefilling latency, decoding
ance between quantization and low-rank adaptation. They latency, and end-to-end latency, as summarized in Tab. 4.
suggest that group-wise operations can address this issue by From the results, several key observations can be made:
increasing the number of parameters dedicated to quantiza- (1) Weight-only quantization can substantially accelerate the
tion. In addition, QA-LoRA can merge the LoRA terms into decoding stage, leading to improvements in end-to-end la-
the corresponding quantized weight matrices. LoftQ [181] tency. This enhancement primarily stems from the capability
identifies that initializing LoRA matrices with zeros in of loading the quantized model with low-precision weight
QLoRA is inefficient for downstream tasks. As an alterna- tensors much more swiftly from the High Bandwidth Mem-
tive, LoftQ suggests initializing the LoRA matrices using ory (HBM), as illustrated in the preceding “Efficient Analy-
the Singular Value Decomposition (SVD) of the difference sis” part. Consequently, this approach markedly diminishes
between the original FP16 weights and quantized weights. the memory access overhead. (2) Regarding the prefilling
LoftQ iteratively applies quantization and SVD to achieve a stage, weight-only quantization may actually increase the
more accurate approximation of the original weights. Norm latency. This is due to the fact that the bottleneck in the
16

prefilling stage is the computational cost rather than the potential for hardware acceleration, as modern computing
memory access cost. Therefore, quantizing only the weights architectures are optimized for dense, regular data patterns.
without the activations has minimal impact on latency. Ad- Consequently, despite achieving higher sparsity levels, the
ditionally, as illustrated in Fig. 9, weight-only quantization practical benefits of unstructured pruning in terms of hard-
necessitates the de-quantization of low-precision weights ware efficiency and computational speedup may be limited.
to FP16, leading to additional computational overhead and The common focus of this line of work is the pruning
consequently slowing down the prefilling stage. (3) As the criterion, including the weight importance and pruning
batch size and input length increase, the extent of speed-up ratio. Considering the huge parameter size of LLMs, im-
achieved by weight-only quantization gradually diminishes. proving the pruning efficiency is also crucial. One pruning
This is primarily because, with larger batch sizes and input criterion is to minimize the reconstruction loss of the model.
lengths, the computational cost constitutes a larger propor- SparseGPT [162] is a representative approach in this field. It
tion of latency. While weight-only quantization predomi- follows the idea of OBS [211], which considers the impact of
nantly reduces memory access cost, its impact on latency removing each weight on the network’s reconstruction loss.
becomes less significant as the computational demands OBS iteratively decides a pruning mask to prune the weights
become more prominent with larger batch sizes and input and reconstructs the unpruned weights to compensate for
lengths. (4) Weight-only quantization offers greater benefits the pruning loss. SparseGPT overcomes the efficiency bot-
for larger models due to the significant memory access over- tleneck of OBS via the Optimal Partial Updates technique,
head associated with larger model sizes. As models grow and designs an adaptive mask selection technique based
in complexity and size, the amount of memory required on the OBS reconstruction error. Prune and Tune [165]
to store and access weights increases proportionally. By improves upon SparseGPT by fine-tuning the LLMs with
quantizing the model weights, weight-only quantization ef- minimal training steps during pruning. ISC [164] designs a
fectively reduces this memory footprint and memory access novel pruning criterion by combining the saliency criteria
overhead. in OBS [211] and OBD [212]. It further assigns non-uniform
pruning ratios to each layer based on Hessian information.
5.2.2 Sparsification BESA [167] learns a differentiable binary mask via gradi-
Sparsification is a compression technique that increases the ent descent of the reconstruction loss. The pruning ratio
proportion of zero-valued elements in data structures such for each layer is sequentially decided by minimizing the
as model parameters or activations. This method aims to reconstruction error. The other popular pruning criterion is
decrease computational complexity and memory usage by magnitude-based. Wanda [163] proposes to use the element-
efficiently ignoring zero elements during computation. In wise product between the weight magnitude and the norm
the context of LLMs, sparsification is commonly applied of input activation as the pruning criterion. RIA [168] jointly
to weight parameters and attention activations. It leads to considers the weights and activations by using the metric
the development of weight pruning strategies and sparse of Relative Importance and Activations, which evaluates
attention mechanisms. the importance of each weight element based on all its
Weight Pruning. Weight pruning systematically removes connected weights. In addition, RIA converts the unstruc-
less critical weights and structures from models, aiming tured sparsity pattern to a structured N:M sparsity pattern,
to reduce computational and memory cost during both which can enjoy the actual speed-up on NVIDIA GPUs.
prefilling stages and decoding stages without significantly Additionally, OWL [166] focuses on deciding the pruning
compromising performance. This sparsification approach is ratio of each layer. It assigns the pruning ratios to each layer
categorized into two main types: unstructured pruning and based on its activation outlier ratios.
structured pruning. The categorization is based on the gran- Structured pruning prunes larger structural units of
ularity of the pruning process, as illustrated in Figure 10. the model, such as entire channels or layers, operating at
a coarser granularity compared to unstructured pruning.
These methods directly facilitate inference speed-up on
conventional hardware platforms due to their alignment
with the dense, regular data patterns these systems are
optimized to process. However, the coarse granularity of
structured pruning often results in a more pronounced
impact on model performance. The pruning criterion of this
Unstructured Pruning Structured Pruning
Granularity: Weight Granularity: Channel/Group/Layer
line of work additionally enforces the structured pruning
pattern. LLM-Pruner [169] proposes a task-agnostic struc-
tured pruning algorithm. Specifically, it first identifies the
Fig. 10. Illustration of Unstructured Pruning (left) and Structured Pruning
(right). couple structures in the LLM, based on the connection
dependencies between neurons. Then, it decides which
Unstructured pruning prunes individual weight values structure groups to remove based on a well-designed group-
with fine granularity. Compared with structured pruning, it wise pruning metric. After pruning, it further proposes
typically achieves a greater level of sparsity with minimal to recover the model performance by a parameter-efficient
impact on model prediction. However, the sparse pattern training technique, i.e., LoRA [210]. Sheared LLaMA [170]
achieved through unstructured pruning lacks high-level proposes to prune the original LLM to a specific target
regularity, leading to irregular memory access and compu- architecture of existing pre-trained LLMs. In addition, it
tation patterns. This irregularity can significantly hinder the designs dynamic batch-loading techniques to improve post-
17

training performance. ZipLM [171] iteratively identifies and Static sparse attention removes activation values inde-
prunes the structural components with the worst trade- pendently of specific inputs [147], [149], [150], [151]. These
off between loss and runtime. LoRAPrune [172] proposes methods pre-determine the sparse attention mask and en-
a structured pruning framework for the pre-trained LLMs force it on the attention matrix during inference. Previous
with LoRA modules to enable fast inference of LoRA-based studies combine different sparse patterns to preserve the
models. It designs a LoRA-guided pruning criterion that most essential elements within each attention matrix. As
uses the weights and gradients of LoRA, and an iterative shown in Figure 11(a), the most common sparse attention
pruning scheme to remove the unimportant weights based patterns are the local and global attention patterns. The local
on the criterion. LoRAShear [173] also designs a pruning attention pattern captures the local context of each token
method for LoRA-based LLMs with (1) a graph algorithm with a fixed-size window attention surrounding each token.
to identify the minimal removal structures, (2) a progressive The global attention pattern captures the correlation of spe-
structured pruning algorithm LHSPG, and (3) a dynamic cific tokens to all other tokens by computing and attending
knowledge recovery mechanism to recover the model per- to all tokens across the sequence. Note that leveraging global
formance. SliceGPT [174] builds on the idea of computa- patterns can eliminate the need to store key-value (KV)
tional invariance of RMSNorm operation. It proposes to pairs for unused tokens, thereby reducing memory access
structurally arrange the sparsity in each weight matrix, and cost and memory usage during the decoding stage. Sparse
to slice out the entire rows or columns. PLATON [175] Transformer [147] combines these patterns to capture the
proposes to prune the weights by considering both their local context with a local pattern, and then aggregates the
importance and uncertainty. It uses the exponential moving information with the global pattern for every few words.
average (EMA) of the importance scores to estimate the StreamingLLM [148] applies the local pattern, along with
importance, and adopts the upper confidence bound (UCB) the global pattern only for the first few tokens. It shows
for the uncertainty. SIMPLE [176] proposes to prune the that such a global pattern serves as the attention sink to
attention head, FFN neurons and hidden dimension via keep the strong attention scores toward initial tokens. It
learning the corresponding sparsity masks. After pruning, helps the LLMs to generalize to infinite input sequence
it further adopts knowledge distillation to fine-tune the length. Bigbird [150] also uses the random pattern, where
pruned models for performance recovery. all tokens attend to a set of random tokens. The combi-
nation of local, global and random patterns is proven to
encapsulate all continuous sequence-to-sequence functions,
local global random dilated rate 1/2/8
affirming its Turing completeness. As shown in Figure 11(b),
Longformer [149] additionally introduces the dilated sliding
window pattern. It is analogous to dilated CNNs and makes
the sliding window “dilated” to increase the receptive
field. To adapt the model to the sparse setting, Structured
Sparse Attention [151] advocates an entropy-aware training
method that congregates high-probability attention values
into denser regions. Unlike previous studies that manually
design sparse patterns, SemSA [152] uses gradient-based
(a) (b)
profiling to identify important attention patterns and au-
tomatically optimizes the attention density distribution to
attended pruned bucket 0/1
further improve model efficiency.
In contrast, Dynamic sparse attention adaptively elim-
inates activation values based on varying inputs, employ-
ing real-time monitoring of neuronal activation values to
bypass computations for neurons with negligible impact,
thereby achieving pruning. Most dynamic sparse attention
methods employ the dynamic token-pruning methods, as
Figure 11(c) shows. Spatten [153], SeqBoat [154] and Adap-
(c) (d) tively Sparse Attention [155] leverage the inherent redun-
dancy in linguistic constructs to propose dynamic token-
level pruning strategies. Spatten [153] assesses the cumula-
Fig. 11. Examples of different sparse attention masks. (a) Static mask
with local, global, and random attention pattern. (b) Static mask with tive importance of each word by aggregating the attention
dilated attention pattern of different dilated rate. (c) Dynamic token matrix columns, subsequently pruning tokens with minimal
pruning. (d) Dynamic attention pruning. cumulative significance from the input in subsequent layers.
SeqBoat [154] trains a linear State Space Model (SSM) with a
Sparse Attention. Sparse attention techniques in Multi- sparse sigmoid function to determine which token to prune
Head Self-Attention (MHSA) components of transformer for each attention head. Both Spatten and SeqBoat prune
models strategically omit certain attention calculations to the uninformative tokens for the whole input. Adaptively
enhance computational efficiency of the attention operation Sparse Attention [155] gradually prunes the tokens during
mainly in the prefilling stage. These mechanisms diverge the generation process. It drops parts of the context that are
into static and dynamic categories based on their reliance no longer required for future generation.
on specific input data. In addition to dynamic token pruning, dynamic atten-
18

tion pruning strategies are also employed [156], [157], [158], Neural Architecture Search. Neural Architecture Search
[159], [160]. As Figure 11(d) shows, instead of pruning all (NAS) [213] aims to automatically search the optimal neu-
the attention values of certain tokens, these methods dy- ral architectures that strike an optimized balance between
namically prune the selective part of the attention based on efficiency and performance. AutoTinyBERT [135] utilizes
the input. A prominent approach within this domain is dy- one-shot Neural Architecture Search (NAS) to discover the
namically segmenting input tokens into groups, known as hyper-parameters of the Transformer architecture. Notably,
buckets, and strategically omitting the attention calculations it introduces a compelling batch-wise training approach
for tokens that reside in separate buckets. The challenge to train a Super Pre-trained Language Model (SuperPLM)
and focus of these methods lie in the way to cluster related and subsequently employs an evolutionary algorithm to
tokens together, thereby facilitating attention computations identify the optimal sub-models. NAS-BERT [136] trains a
solely among them to enhance efficiency. Reformer [156] large super-net on conventional self-supervised pre-training
leverages locality-sensitive hashing to cluster keys and tasks using several innovative techniques, such as block-
queries that share identical hash codes into the same bucket. wise search, search space pruning, and performance ap-
Following this, Sparse Flash Attention [157] introduces spe- proximation. This approach allows NAS-BERT to be applied
cialized GPU kernels optimized for this hash-based sparse efficiently across various downstream tasks without requir-
attention mechanism, further improving computational effi- ing extensive re-training. Structure pruning via NAS [137]
ciency. Meanwhile, the Routing Transformer [158] employs a treats structural pruning as a multi-objective NAS problem,
spherical k-means clustering algorithm to aggregate tokens and solves it via the one-shot NAS method. LiteTransform-
into buckets, optimizing the selection process for attention erSearch [138] proposes to use a training-free indicator, i.e.,
computations. Sparse Sinkhorn Attention [159] adopts a the number of parameters, as a proxy indicator to guide the
learned sorting network to align keys with their relevant search. This method enables efficient exploration and selec-
query buckets, ensuring that attention is computed only tion of the optimal architectures without the need for actual
between the corresponding query-key pairs. Diverging from training during the search phase. AutoDistil [139] presents a
the bucket-level operation, H2 O [160] introduces the token- fully task-agnostic few-shot NAS algorithm featuring three
level dynamic attention pruning mechanism. It combines primary techniques: search space partitioning, task-agnostic
static local attention with dynamic computations between SuperLM training, and task-agnostic search. This approach
the current query and a set of dynamically identified key aims to facilitate efficient architecture discovery across vari-
tokens, termed heavy-hitters (H2 ). These heavy-hitters are ous tasks with minimal task-specific adaptations. Typically,
dynamically adjusted with an eviction policy aimed at re- NAS algorithms necessitate evaluating the performance of
moving the least significant keys at each generation step, each sampled architecture, which can incur significant train-
effectively managing the size and relevance of the heavy- ing cost. Consequently, these techniques are challenging to
hitter set. apply to LLMs.
Moreover, viewing each token as a graph node and Low Rank Factorization. Low Rank Factorization (LRF), or
attention between tokens as edges offers an extended per- Low Rank Decomposition, aims to approximate a matrix
spective on static sparse attention [150], [161]. The original, Am×n with two low-rank matrices B m×r and C r×n by:
full attention mechanism equates to a complete graph with
Am×n ≈ B m×r × C r×n , (11)
a uniform shortest path distance of 1. Sparse attention,
with its random mask, introduces random edges, effectively where r is much smaller than m and n. In this way, LRF
reducing the shortest path distance between any two nodes can diminish memory usage and enhance computational
to O(log n), thus maintaining efficient information flow efficiency. Furthermore, during the decoding stage of LLM
akin to full attention. Diffuser [161] utilizes the perspective inference, memory access cost presents a bottleneck to the
of graph theory to expand the receptive field of sparse decoding speed. Therefore, LRF can reduce the number
attention with multi-hop token correlations. It also takes of parameters that need to be loaded, thereby accelerat-
inspiration from the expander graph properties to design ing the decoding speed. LoRD [140] shows the potential
better sparse patterns that approximate the information flow of compressing the LLMs without largely degrading the
of full attention. performance via LRF. Specifically, it adopts Singular Value
Beyond the attention-level and token-level sparsity, the Decomposition (SVD) to factorize the weight matrices, and
scope of attention pruning extends to various granularities. successfully compresses a LLM with 16B parameters to
Spatten [153] also extends pruning beyond token granular- 12.3B with minimal performance drop. TensorGPT [141]
ity to attention head granularity, eliminating computations introduces a method to compress the embedding layer
for inessential attention heads to further reduce computa- using Tensor-Train Decomposition. Each token embedding
tional and memory demands. is treated as a Matrix Product State (MPS) and efficiently
computed in a distributed manner. LoSparse [142] combines
the benefits of LRF and weight pruning for LLM com-
5.2.3 Structure Optimization
pression. By leveraging low-rank approximation, LoSparse
The objective of structure optimization is to refine model mitigates the risk of losing too many expressive neurons that
architecture or structure with the goal of enhancing the typically occurs with direct model pruning. LPLR [143] and
balance between model efficiency and performance. Within ZeroQuant-V2 [144] both propose to compress the weight
this field of research, two prominent techniques stand out: matrix by simultaneously applying LRF and quantization to
Neural Architecture Search (NAS) and Low Rank Factoriza- it. DSFormer [145] proposes to factorize the weight matrix
tion (LRF). into the product of a semi-structured sparse matrix and a
19

small dense matrix. ASVD [146] designs an activation-aware to distill the student models. In the field of LLMs, black-
SVD method. This approach involves scaling the weight ma- box KD mainly guides the student models to learn LLMs’
trix based on activation distribution prior to applying SVD generalization ability and emergent ability, including In-
for matrix decomposition. ASVD also involves determining Context Learning (ICL) ability [43], Chain-of-Thought (CoT)
an appropriate truncation rank for each layer through a reasoning ability [14] and Instruction Following (IF) abil-
search process. ity [214].
Regarding the ICL ability, Multitask-ICT [116] introduces
5.2.4 Knowledge Distillation in-context learning distillation to transfer the multitask few-
Knowledge Distillation (KD) is a well-established technique shot ability of Large Language Models (LLMs), leveraging
for model compression, wherein knowledge from large both in-context learning and language modeling proficiency.
models (referred to as teacher models) is transferred to MCKD [117] observes that student models distilled from in-
smaller models (referred to as student models). In the context learned teacher models often exhibit superior per-
context of LLMs, KD involves using the original LLMs as formance on unseen input prompts. Building on this obser-
teacher models to distill smaller LMs. Numerous studies vation, MCKD devises a multi-stage distillation paradigm
have focused on effectively transferring various abilities of where the student model from previous stages is employed
LLMs to smaller models. In this domain, methods can be to generate distillation data for subsequent stages, enhanc-
categorized into two main types: white-box KD and black- ing the effectiveness of the distillation method.
box KD (as illustrated in Fig. 12). To distill the Chain-of-Thought (CoT) reasoning ability,
several techniques such as Distilling Step-by-Step [118],
White-Box KD Black-Box KD SCoTD [119], CoT Prompting [120], MCC-KD [121], and
Features
Outputs Fine-tune-CoT [122] propose distillation methods that in-
ICL ability
corporate responses and rationales extracted from LLMs
Logits

Outputs
CoT ability
to train student models. Socratic CoT [123] also targets
IF ability

Teacher Model Student Model

API-based Teacher reasoning ability transfer to smaller models. Specifically,
Model
it fine-tunes a pair of student models, namely a Question
Fig. 12. Illustration of White-Box KD (left) and Black-Box KD (right).
Generation (QG) model and a Question Answering (QA)
model. The QG model is trained to generate intermediate
White-box KD. White-box KD refers to distillation methods questions based on input questions, guiding the QA model
that leverage access to the structure and parameters of the in producing the final response. PaD [124] observes that
teacher models. This approach enables KD to effectively faulty reasoning (i.e., correct final answer but incorrect
utilize the intermediate features and output logits of the reasoning steps) can be detrimental to student models. To
teacher models for enhanced performance of the student address this, PaD proposes generating synthetic programs
models. MiniLLM [129] proposes to adopt the standard for reasoning problems, which can then be automatically
white-box KD approach but replace the forward Kullback- checked by an additional interpreter. This approach helps in
Leibler divergence (KLD) with the reverse KLD. GKD [130] removing distillation data with faulty reasoning, enhancing
introduces the use of on-policy data, which includes output the quality of the training data for student models.
sequences generated by the student model itself, to further For the IF ability, several methods have been proposed
distill the student model. This method focuses on aligning to transfer this capability to smaller models. DISCO [126]
the output logits between the teacher and student models introduces a technique where phrasal perturbations are gen-
using these on-policy data. TED [131] presents a task-aware erated using a LLM. These perturbations are then filtered by
layer-wise KD method. This approach involves adding fil- a task-specific teacher model to distill high-quality counter-
ters after each layer in both the teacher and student models, factual data. LaMini-LM [127] aims to transfer instruction
training these task-specific filters, and subsequently freezing following ability by designing a diverse instruction set for
the teacher model’s filters while training the student filters distilling student models. Lion [128] utilizes the teacher
to align their output features with the corresponding teacher model to identify difficult instructions, and generates new
filters. MiniMoE [133] mitigates the capacity gap by utilizing and complex instructions to distill the small model.
a Mixture-of-Experts (MoE) model as the student model.
For newly emerging entities, pre-trained language models 5.2.5 Dynamic Inference
(LLMs) may lack up-to-date information. To address this, Dynamic inference involves the adaptive selection of model
one solution involves incorporating additional retrieved sub-structures during the inference process, conditioned on
texts into prompts, albeit at an increased inference cost. input data. This section focuses on early exiting techniques,
Alternatively, KPTD [134] suggests transferring knowledge which enable a LLM to halt its inference at different model
from entity definitions into LLM parameters via knowledge layers depending on specific samples or tokens. Notably,
distillation. This method generates a transfer set based on while MoE techniques (discussed in Sec. 5.1.1) also adjust
entity definitions and distills the student model to match model structure during inference, they typically involve
output distributions with the teacher model based on these expensive pre-training cost. In contrast, early exiting tech-
definitions. niques only require training a small module to determine
Black-box KD. Black-box KD refers to the knowledge dis- when to conclude the inference. We categorize studies on
tillation methods in which the structure and parameters of early exiting techniques into two main types: sample-level
teacher models are not available. Typically, black-box KD early exiting and token-level early exiting (illustrated in
only uses the final results obtained by the teacher models Fig. 13).
20

Output 1
5.3 Knowledge, Suggestions and Future Direction

Sample-level Dynamic
In the field of efficient structure design, the pursuit of
Input 1
Inference alternative architectures to Transformers is a burgeoning
area of research. Examples such as Mamba [73], RWKV [60],
Input 2 Output 2 and their respective variants [101], [104] have demonstrated
competitive performance across various tasks, garnering
Token 1 increasing attention in recent times. Nevertheless, it remains
pertinent to investigate whether these non-Transformer
Token-level Dynamic models may exhibit certain shortcomings compared to
Prompt
Inference Transformer models. Concurrently, exploring the integra-
Token 2 tion of non-Transformer architectures with the attention
operation [74], [103], [216] represents another promising
avenue for future research.
Fig. 13. Illustration of Token-level (up) and Sample-level (down) dynamic In the realm of model compression, quantization stands
inference.
out as the predominant method employed in Large Lan-
guage Model (LLM) deployment, primarily due to two key
factors. Firstly, quantization presents a convenient means of
compressing LLMs. For instance, employing Post-Training
Sample-level. Sample-level early exiting techniques focus Quantization (PTQ) methods can reduce the parameter
on determining the optimal size and structure of Language count of an LLM with seven billion parameters to a com-
Models (LLMs) for individual input samples. A common pressed form within a matter of minutes. Secondly, quan-
approach is to augment LLMs with additional modules after tization holds the potential to achieve substantial reduc-
each layer, leveraging these modules to decide whether to tions in memory consumption and inference speed, while
terminate inference at a specific layer. FastBERT [110], Dee- introducing only minor performance trade-offs. This com-
BERT [113], MP [215], and MPEE [111] train these modules promise is generally deemed acceptable for numerous real-
directly to make decisions (e.g., outputting 0 to continue or world applications. However, it’s worth noting that quan-
1 to stop) based on features from the current layer. Global tization may still compromise certain emergent abilities
Past-Future Early Exit [112] proposes a method that enriches of LLMs, such as self-calibration or multi-step reasoning.
the input to these modules with linguistic information from Additionally, in specific scenarios like dealing with long
both preceding and subsequent layers. Given that future contexts, quantization could lead to significant performance
layer features are not directly accessible during inference, a degradation [204]. Consequently, it is required to carefully
simple feed-forward layer is trained to estimate these future select appropriate quantization methods to mitigate the risk
features. PABEE [114] trains the modules as output heads of such degradation in these specialized cases.
for direct prediction, suggesting inference termination when Extensive literature has devoted into studying sparse at-
predictions remain consistent. HASHEE [115] employs a tention techniques for efficient long-context processing. For
non-parametric decision-making approach based on the hy- example, a recent representative work, StreamingLLM [148],
pothesis that similar samples should exit inference at the can process 4 million tokens by only restoring several
same layer. attention sink tokens. Nonetheless, these approaches of-
ten sacrifice critical information, resulting in performance
Token-level. In the decoding stage of LLM inference, where degradation. Therefore, the challenge of preserving essen-
tokens are generated sequentially, token-level early exiting tial information while efficiently managing long contexts
techniques aim to optimize the size and structure of LLMs remains an important area for future exploration. As for
for each output token. CALM [108] introduces early exit the weight pruning techniques, LLM-KICK [217] notes that
classifiers after each Transformer layer, training them to current state-of-the-art (SOTA) methods experience con-
output confidence scores that determine whether to halt siderable performance degradation even at relatively low
inference at a specific layer. Notably, in the self-attention sparsity ratios. Consequently, developing effective weight
block, computing the current token’s feature at each layer pruning methods to maintain LLM performance remains an
relies on all previous tokens’ features (i.e., KV cache) in emerging and critical research direction.
the same layer. To address the issue of missing KV cache The optimization of model structures often involves the
due to early exiting of previous tokens, CALM proposes use of Neural Architecture Search (NAS), which typically
directly copying the feature from the exiting layer to subse- demands extensive computational resources, posing a po-
quent layers, with experimental results showing only minor tential barrier to its practical application in compressing
performance degradation. SkipDecode [109] addresses lim- LLMs. Therefore, investigating the feasibility of employ-
itations of previous early exiting methods that hinder their ing automatic structure optimization for LLM compression
applicability to batch inference and KV caching, thereby lim- warrants further exploration. Additionally, the challenge
iting actual speed-up gains. For batch inference, SkipDecode remains for techniques like low-rank factorization (LRF) to
proposes a unified exit point for all tokens within a batch. achieve an optimal balance between compression ratio and
Regarding KV caching, SkipDecode ensures a monotonic task performance. For instance, ASVD [146] achieves only a
decrease in exit points to prevent recomputation of KV modest 10% to 20% compression ratio without compromis-
cache, facilitating efficiency gains during inference. ing the reasoning capabilities of LLMs.
21

In addition to employing individual model compres- the computational graph level, current optimized inference
sion techniques, several studies explore the combination engines implement highly fused operator.
of different methods to compress LLMs, leveraging their Attention Operator Optimization. The standard attention
respective advantages for improved efficiency. For instance, computation (e.g., using Pytorch) involves the multiplica-
MPOE [88] applies weight matrix factorization specifically tion of the Query matrix (Q) with the Key matrix (K),
to the expert Feed-Forward Networks (FFNs) in MoE-based resulting in quadratic time and space complexity in relation
LLMs, with the goal of further reducing memory require- to the input sequence length. As shown in Fig. 15, the
ments. LLM-MQ [191] utilizes weight sparsity techniques to time proportion of the attention operator increases as the
protect weight outliers during model quantization, thereby context length grows. This translates to high demands on
minimizing quantization errors. LPLR [143] focuses on memory size and computational capability, especially when
quantizing low-rank factorized weight matrices to further dealing with long sequences. To address the computational
decrease memory footprint and memory access cost during and memory overhead of standard attention computation
LLM inference. Furthermore, LoSparse [142] combines low- on GPUs, customized attention operators are essential.
rank factorization with weight pruning, leveraging pruning FlashAttention [233], [234] fuses the entire attention oper-
to enhance the diversity of low-rank approximation while ation into a single, memory-efficient operator to alleviate
using low-rank factorization to retain important weights memory access overhead. The input matrices (Q, K, V) and
and prevent loss of critical information. These approaches attention matrix are tiled into multiple blocks, which elimi-
highlight the potential of integrating multiple compression nates the need for complete data loading. Built upon Flash
techniques to achieve better optimization of LLMs. Attention, FlashDecoding [237] aims to maximize compu-
tational parallelism for decoding. Due to the application of
the decoding approach, the Q matrix degrades into a batch
6 S YSTEM - LEVEL O PTIMIZATION of vectors during decoding, which makes it challenging
The system-level optimization for LLM inference primarily to fill the computational units if the parallelism is limited
involves enhancing the model forward pass. Considering to the batch size dimension. FlashDecoding addresses this
the computational graph of a LLM, there exist multiple by introducing parallel computation along the sequence
operators, with attention and linear operators dominating dimension. While this introduces some synchronization
most of the runtime. As mentioned in Sec. 2.3, system- overhead to softmax computation, it leads to noticeable
level optimization primarily considers the distinctive char- improvements in parallelism, particularly for small batch
acteristics of the attention operator and the decoding ap- sizes and long sequences. The subsequent work, FlashDe-
proach within LLM. In particular, to address the specific coding++ [231], observes that in previous works [233], [234],
issues related to the decoding approach of LLMs, the linear [237], the maximum value within the softmax only serves as
operator requires special tiling designs, and speculative a scaling factor to prevent data overflow. However, the dy-
decoding methods are proposed to improve the utilization. namical maximum value incurs significant synchronization
Furthermore, in the context of online serving, requests come overhead. Moreover, extensive experiments indicate that
from multiple users. Therefore, beyond the optimizations in typical LLM (e.g., Llama2 [239], ChatGLM [240]), over
discussed earlier, online serving faces challenges related 99.99% of the softmax inputs fall within a certain range.
to memory, batching and scheduling arising from asyn- Thus, FlashDecoding++ proposes to determine the scaling
chronous requests. factor based on statistics in advance. This eliminates the
synchronization overhead in softmax computation, enabling
parallel execution of subsequent operations alongside the
6.1 Inference Engine
softmax computation.
The optimizations for inference engines are dedicated to Linear Operator Optimization The linear operator plays
accelerate the model forward process. Main operators and a pivotal role in LLM inference, performing in feature
the computational graph in LLM inference are highly op- projection and Feedforward Neural Networks (FFNs). In
timized. Besides, speculative decoding technique is pro- traditional neural networks, linear operators can be ab-
posed to accelerate the inference speed without performance stracted into General Matrix-Matrix Multiplication (GEMM)
degradation. operations. However, in the case of LLM, the application of
the decoding approach results in a notably reduced dimen-
6.1.1 Graph and Operator Optimization sion, diverging from the conventional GEMM workload.
Runtime Profiling. Using HuggingFace [238] implementa- The low-level implementation of traditional GEMM has
tion, we profile the inference runtime with different mod- been highly optimized, and mainstream LLM frameworks
els and context lengths. The profiling results in Fig. 15 (e.g., DeepSpeed [236], vLLM [49], OpenPPL [241] and etc.)
demonstrate that attention operators and linear operators primarily call the GEMM APIs offered by cuBLAS [242]
collectively dominate runtime, with their combined dura- for linear operators. Without an explicitly tailored imple-
tion often exceeding 75% of the inference duration. Conse- mentation for GEMMs with a reduced dimension, the lin-
quently, a significant portion of optimization efforts at the ear operators during decoding suffer inefficiency. A no-
operator level is dedicated to enhancing the performance of table trend to address the issue is observed in the latest
the two operators. Furthermore, there are multiple operators release of TensorRT-LLM [208]. It introduces a dedicated
occupying a small proportion of runtime, which fragments General Matrix-Vector Multiplication (GEMV) implemen-
the operator execution timeline and increases the cost of tation, potentially improving efficiency for the decoding
kernel launch on the CPU side. To address this issue, at step. Recent research FlashDecoding++ [231] makes a fur-
22

Attention Operator FlashAttention [233], [234], FlashDecod-

Optimization ing [237], FlashDecoding++ [231]

Graph-Level Optimization FlashAttention [233], [234], ByteTrans-

Graph and Operator
former [235], DeepSpeed [236], FlashDecod-
Optimization
ing++ [231]

Inference Linear Operator TensorRT-LLM [208], FlashDecoding++ [231],

Engine Optimization MegaBlocks [232], vLLM [49]

Speculative Decoding Speculative decoding [218], Speculative sampling [219], DistillSpec [220], Self-
speculative decoding [221], OSD [222], PaSS [223], REST [224], SpecInfer [225],
Stage speculative decoding [226], Cascade Speculative Drafting [227], Looka-
head decoding [228], Medusa [48], Eagle [229], Spectr [230]

Fig. 14. Taxonomy of the optimization for LLM inference engine.

attention others attention others

1.84% 6.72% 2.33% 7.95%
others others others others
23.63% linear 21.68% linear 27.36% 24.64%
linear linear
40.55% 38.82%
53.37% 50.42%
attention attention attention attention linear linear
35.82% 39.50% 19.27% 29.94% 91.44% 89.72%

(a) Llama2-7B, (b) Llama2-7B, (c) Baichuan2-13B, (d) Baichuan2-13B, (e) Mixtral-8x7B, (f) Mixtral-8x7B,
128 context length 2k context length 128 context length 2k context length 128 context length 2k context length

Fig. 15. Inference runtime breakdown over multiple LLMs.

ther step, addressing the inefficiency of cuBLAS [242] and urgent need to optimize the FFN layer. MegaBlocks [232]
CUTLASS [243] libraries when dealing with small batch is the first to optimize the computation for MoE FFN lay-
sizes during the decode step. The authors first introduce ers. The work formulates the MoE FFN computation into
the concept of the FlatGEMM operation to represent the block-sparse operations and proposes tailored GPU kernels
workload of GEMM with a highly reduced dimension (di- for acceleration. However, MegaBlocks concentrates on the
mension size < 8 in FlashDecoding++). As FlatGEMM poses efficient training of the MoE models and hence ignores the
new computational characteristics, the tiling strategy for characteristics of inference (e.g.,, the decoding approach).
traditional GEMMs necessitates modification to be applied. Existing frameworks are working hard to optimize the
The authors observe that two challenges exist as the work- computations of the MoE FFN inference stage. The official
load varies: low parallelism and memory access bottleneck. repository of vLLM [49] integrates the fused kernels for
To tackle the challenges, FlashDecoding++ adopts a fine- MoE FFN in Triton [245], seamlessly removing the index
grained tiling strategy to improve parallelism, and leverages overhead.
the double buffering technique to hide memory access la- Graph-Level Optimization. Kernel fusion stands out as a
tency. Furthermore, recognizing that the linear operations in prevalent graph-level optimization because of its capabil-
typical LLM (e.g., Llama2 [239], ChatGLM [240]) often have ity to reduce runtime. There are three main advantages
fixed shapes, FlashDecoding++ establishes a heuristic selec- of applying kernel fusion: (1) To reduce memory access.
tion mechanism. This mechanism dynamically chooses be- The fused kernel inherently removes the memory access of
tween different linear operators based on the input size. The intermediate results, mitigating the memory bottleneck for
options include FastGEMV [244], FlatGEMM, and GEMM operators. (2) To mitigate kernel launching overhead. For
provided by cuBLAS [242], [243] libraries. This approach some lightweight operators (e.g., residual adding), the ker-
ensures the selection of the most efficient operator for the nel launching time occupies most of the latency, and kernel
given linear workload, potentially leading to better end-to- fusion reduces individual kernel launchings. (3) To enhance
end performance. parallelism. For those operators without data dependency,
Recently, the application of the MoE FFN to enhance the when one-by-one kernel execution fails to fill the hardware
model capability has become a trend in LLMs [12]. This capacity, it is beneficial to parallel the kernels via fusion.
model structure also puts forward new requirements for The technique of kernel fusion proves effective with
operator optimization. As shown in Fig. 15, in the Mixtral LLM inference, with all of the aforementioned benefits.
model with MoE FFN, the linear operator dominates the FlashAttention [233] formulates the attention operator into
runtime due to the non-optimized FFN computation in the one single kernel, removing the overhead of accessing the at-
HuggingFace implementation. Besides, Mixtral’s adoption tention results. Based on the fact that the attention operator
of the GQA attention structure decreases the attention op- is memory-bounded, the reduction of memory access effec-
erator’s runtime proportion, which further points out the tively transfers to runtime speed-up. ByteTransformer [235]
23

and DeepSpeed [236] propose to fuse lightweight operators coding techniques typically employ two primary sampling
including residual adding, layernorm and activation func- strategies: greedy sampling and nucleus sampling. Greedy
tions, into the former linear operators to reduce the kernel sampling involves selecting the token with the highest
launching overhead. As a result, those lightweight operators probability at each decoding step to generate a specific
disappear in the timeline with nearly no extra latency. More- output sequence. The initial attempt at speculative decod-
over, kernel fusion is also adopted to enhance the utilization ing, known as Blockwise Parallel Decoding [246], aims to
of LLM inference. The projections of Query, Key and Value ensure that the draft tokens precisely match the tokens
matrices are originally three individual linear operations, sampled via greedy sampling, thus preserving output to-
and are fused into one linear operator to deploy on mod- ken equivalence. In contrast, nucleus sampling involves
ern GPUs. Currently, the kernel fusion technique has been sampling tokens from a probability distribution, resulting
exploited in LLM inference practice, and highly optimized in diverse token sequences with each run. This diversity
inference engines employ only a few fused kernels within makes nucleus sampling popular. To accommodate nucleus
the runtime. For example, in FlashDecoding++ [231] im- sampling within speculative decoding frameworks, specu-
plementation, a transformer block integrates merely seven lative sampling techniques [218], [219] have been proposed.
fused kernels. Leveraging the aforementioned operators and Speculative sampling maintains output distribution equiv-
kernel fusion optimization, FlashDecoding++ achieves up to alence, aligning with the probabilistic nature of nucleus
4.86× speed-up over the HuggingFace implementation. sampling to generate varied token sequences. Formally,
given a sequence of tokens x1 , x2 , ..., xn and a sequence of
6.1.2 Speculative Decoding
draft tokens x̂n+1 , x̂n+2 , ..., x̂n+k , the speculative sampling
Speculative decoding [218] (i.e., speculative sampling [219]) strategy accepts the i-th draft token with the following
is an innovative decoding technique for auto-regressive probabilities:
LLMs designed to enhance decoding efficiency without
p(x̂i |x1 , x2 , ..., xi−1 )

compromising the fidelity of outputs. The core idea of this
min 1, , (12)
approach involves employing a smaller model, termed a q(x̂i |x1 , x2 , ..., xi−1 )
draft model, to predict several subsequent tokens efficiently,
where p(·|·) and q(·|·) denote the conditional probabilities
followed by validation of these predictions using the target
from the target LLM and the draft model, respectively. If
LLM in parallel. This methodology aims to enable the LLM
the i-th draft token is accepted, it sets xi ←
− x̂i . Otherwise,
to generate multiple tokens within the time frame typi-
it quits the verification of the following draft tokens, and
cally required for a single inference. Fig. 16 demonstrates
resamples xi from the following distribution:
the comparison of the traditional auto-regressive decoding
method and the speculative decoding approach. Formally, norm(max(0, p(·|x1 , x2 , ..., xi−1 ) − q(·|x1 , x2 , ..., xi−1 ))).
speculative decoding approach consists of two steps: (13)
1) Draft Construction: It employs the draft model to gen- Building upon speculative sampling, several variants [225],
erate several subsequent tokens, namely draft tokens, [230] have emerged, aimed at validating multiple draft
in parallel or in the auto-regressive manner. token sequences. Notably, the token tree verifier [225] has
2) Draft Verification: It employs the target model to com- become a widely adopted verification strategy within this
pute the conditional probabilities of all the draft tokens context. This approach utilizes a tree-structured represen-
in a single LLM inference step, subsequently determin- tation of draft token sets and employs a tree attention
ing the acceptance of each draft token sequentially. The mechanism to efficiently perform the verification process.
acceptance rate, representing the average number of In the speculative decoding approach, the acceptance
accepted draft tokens per inference step, serves as a key rate of draft tokens is significantly influenced by the degree
metric for evaluating the performance of a speculative to which the output distributions of draft models align
decoding algorithm. with those of original LLMs. As a result, considerable re-
search efforts have been directed towards improving the
Generated Token Accept
Optional design of draft models. DistillSpec [220] directly distills a
Token
smaller draft model from the target LLM. SSD [221] involves
Draft Token Token Reject
automatically identifying a sub-model (a subset of model
Draft Model
layers) from the target LLM to serve as the draft model,
eliminating the need for separate training of the draft model.
OSD [222] dynamically adjusts the output distribution of the
draft model to match the user query distribution in online
LLM LLM
LLM services. It achieves this by monitoring rejected draft
tokens from the LLM and using this data to refine the draft
model through distillation. PaSS [223] proposes utilizing the
target LLM itself as the draft model, incorporating trainable
(a) Auto-regressive Decoding (b) Speculative Decoding
tokens (look-ahead tokens) into the input sequence to enable
simultaneous generation of subsequent tokens. REST [224]
Fig. 16. Comparison of auto-regressive decoding (a) and speculative
decoding (b).
introduces a retrieval-based speculative decoding approach,
employing a non-parametric retrieval data store as the draft
Speculative decoding ensures output equivalence with model. SpecInfer [225] introduces a collective boost-tuning
standard auto-regressive decoding methods. Traditional de- technique to align the output distribution of a group of
24

TABLE 5
Comparison of several open-source implementations of speculative decoding. In this table, we also show the additional overhead of constructing
draft models. Note that for SpD [218], [219], LADE [228], Medusa [48] and Eagle [229], we report the training cost from their original papers. And
for SSD [221] and REST [27], we run the sub-LLM search and datastore construction with the code they provide, and report the time cost.
Besides, for Medusa, we use Medusa-1 [48] which does not fine-tune the original LLM backbone.

Additional Overhead
Method Draft Model Draft Construction Draft Verifier Acceptance Rate Speed-up
(GPU hours)
SpD [218], [219] small speculative model one draft sequence speculative sampling 275 1.77∼2.02× 1.05∼1.77×
LADE [228] LLM + N grams one draft sequence greedy sampling 0 1.92∼2.14× 1.12∼1.30×
SSD [221] sub-LLM one draft sequence speculative sampling 4 1.64∼1.74× 1.01∼1.23×
REST [27] datastore token tree speculative sampling 1.5 2.18∼2.31× 1.72∼2.27×
Medusa-1 [48] four LLM heads token tree speculative sampling ∼24 2.52∼2.62× 2.04∼2.86×
Eagle [229] one Transformer Layer token tree speculative sampling 96∼192 3.47∼3.72× 2.77∼3.74×

draft models with that of the target LLM. Lookahead decod- their proposed draft construction approach and use the
ing [228] involves generating n-grams of the target LLM in checkpoints they provide. As for the evaluation metrics, we
parallel to aid in generating draft tokens. Medusa [48] fine- adopt acceptance rate, which denotes the ratio of the number
tunes several heads of the LLM specifically for generating of accepted tokens to the number of generation steps, and
subsequent draft tokens. Eagle [229] adopts a lightweight speed-up, which denotes the ratio of the latency of original
transformer layer called an auto-regression head to gener- auto-regressive decoding to the latency of speculative de-
ate draft tokens in an auto-regressive manner, integrating coding when fixing the total length of output.
rich contextual features from the target LLM into the draft Tab. 5 provides a comparison of various speculative
model’s input. decoding methods, highlighting several key observations:
Another line of studies focus on designing more effective (1) Eagle demonstrates exceptional performance, achieving
draft construction strategies. Conventional approaches often a notable 3.47∼3.72× end-to-end speed-up across multiple
yield single draft token sequences, posing challenges for LLMs. To understand its success, a deeper analysis of Eagle
passing verification. In response, Spectr [230] advocates reveals two key factors. Firstly, Eagle employs an auto-
generating multiple draft token sequences and employs a regressive approach for decoding draft tokens, leveraging
k -sequential draft selection technique to concurrently verify information from previously generated tokens directly. Sec-
k sequences. This method leverages speculative sampling, ondly, Eagle integrates rich features from previous tokens
ensuring equivalence in output distributions. Similarly, of both original LLMs and draft models to enhance the
SpecInfer [225] adopts a comparable approach. However, accuracy of next draft token generation. (2) The token tree
unlike Spectr, SpecInfer merges draft token sequences into a verifier proves to be an effective technique in enhancing the
“token tree” and introduces a tree attention mechanism for performance of speculative decoding methods. (3) The end-
validation. This strategy is called the ”token tree verifier”. to-end speed-up achieved by these methods is often lower
Due to its efficacy, token tree verifier has been widely em- than the acceptance rate. This difference arises due to the
braced in numerous speculative decoding algorithms [48], practical consideration that the generation cost associated
[224], [226], [229]. In addition to these efforts, Stage Spec- with draft models cannot be overlooked.
ulative Decoding [226] and Cascade Speculative Drafting
(CS Drafting) [227] propose accelerating draft construction 6.2 Serving System
by integrating speculative decoding directly into the token The optimizations for serving systems are dedicated to
generation process. improve the efficiency in handling asynchronous requests.
Comparative Experiments and Analysis. We conduct The memory management is optimized to hold more re-
an experiment to evaluate the speed-up performance of quests, and efficient batching and scheduling strategies are
the speculative decoding methods. Specifically, we thor- integrated to improve the system throughput. Besides, op-
oughly review the studies of this field, and select six timizations specific to distributed systems are proposed to
of them that have open-sourced their codes, i.e., Spec- exploit distributed computational resources.
ulative Decoding (SpD) [218], [219], Lookahead Decod-
ing (LADE) [228], REST [224], Self-speculative Decoding 6.2.1 Memory Management
(SSD) [221], Medusa [48] and Eagle [229]. As for the eval- The storage of KV cache dominates the memory usage in
uation dataset, we use Vicuna-80 [7] to evaluate the above LLM serving, especially when the context length is long
methods, which contains 80 questions that classified into (see Sec. 2.3). Since the generation length is uncertain, it
10 categories. We report the average results on these 80 is challenging to allocate the space for KV cache storage
questions. As for target LLMs, we adopt five fashion open- in advance. Earlier implementations [261] usually allocate
source LLMs, i.e., Vicuna-7B-V1.3 [7], Vicuna-13B-V1.3 [7], storage space in advance based on the preset maximum
Vicuna-33B-V1.3 [7], LLaMA-2-7B [5] and LLaMA-2-13B [5]. length of each request. However, in instances where re-
We report the range of evaluation metrics across these 5 quest generation is terminated early, this approach incurs
LLMs. As for draft models, we adopt two well-trained significant wastage of storage resources. To address the
draft models, i.e., LLaMA-68M and LLaMA-160M [225] for issue, S3 [259] proposes to predict an upper bound of the
SpD. For other speculative decoding methods, we follow generation length for each request, in order to reduce the
25

Memory Management S3 [259], vLLM [49], LightLLM [253], FlashIn-

fer [260]

Batching ORCA [252], vLLM [49], Sarathi [257],

DeepSpeed-FastGen [254], Sarathi-Serve [258],
LightLLM [253]
Serving System
Scheduling ORCA [252], vLLM [49], LightLLM [253],
DeepSpeed-FastGen [254], FastServe [255],
VTC [256]

Distributed Systems Splitwise [247], TetriInfer [248], Dist-

Serve [249], SpotServe [250], Infinite-LLM [251]

Fig. 17. Taxonomy of the optimization for LLM serving system.

waste of the pre-allocated space. However, the static way The computation of each request encompasses multiple
of KV cache memory allocation still fails when no such iterations, with each iteration representing either a pre-
large contiguous space exists. To deal with the fragmented filling step or a decoding step. The author suggests that
storage, vLLM [49] proposes to store the KV cache in a different requests can be batched at the iteration level. The
paged manner following the operating system. vLLM first work implements iteration-level batching in linear oper-
allocates a memory space as large as possible and divides ators, concatenating different requests together in the se-
it equally into multiple physical blocks. When a request quence dimension. Hence, the spare storage and computa-
comes, vLLM dynamically maps the generated KV cache to tional resources corresponding to the completed requests
the pre-allocated physical blocks in a discontinuous fashion. are promptly released. Following ORCA, vLLM [49] ex-
In this way, vLLM significantly reduces storage fragmenta- tends the technique to the attention computation, enabling
tion and achieves a higher throughput in LLM serving. On requests with different KV cache lengths to be batched to-
the basis of vLLM, LightLLM [253] uses a more fine-grained gether. Sarathi [257], DeepSpeed-FastGen [254] and Sarathi-
KV cache storage to cut down the waste happening with Serve [258] further introduce a split-and-fuse method to
the irregular boundary. Instead of a block, LightLLM treats batch together prefilling requests and decoding requests.
the KV cache of a token as a unit, so that the generated KV Specifically, this method first splits the long prefilling re-
cache always saturates the pre-allocated space. quest in the sequence dimension, and then batches it to-
Current optimized service systems commonly employ gether with multiple short decoding requests. The split-and-
this paged approach to manage the KV cache storage, fuse method balances the workloads among different itera-
thereby mitigating the waste of redundant KV cache mem- tions, and significantly reduces the tail latency via removing
ory. However, the paged storage leads to irregular memory the stalls from new requests. LightLLM [253] also adopts the
access in the attention operator. For the attention operator split-and-fuse method.
using the paged KV cache, this necessitates the consider- The split-and-fuse technology operates on the premise
ation of the mapping relationship between the virtual ad- that requests during the prefilling stage can be partitioned
dress space of the KV cache and its corresponding physical into discrete chunks. Chunked-prefill methodology involves
address space. To enhance the efficiency of the attention segmenting prefilling requests along the sequence dimen-
operator, the loading pattern of the KV cache must be tai- sion, thereby preventing the potential bottlenecks for other
lored to facilitate contiguous memory access. For instance, requests. This strategy capitalizes on the auto-regressive
in the case of the PagedAttention by vLLM [49], the storage characteristics inherent in LLMs, where attention compu-
of the head size dimension is structured as a 16-byte con- tation only relies on prior tokens. Consequently, the math-
tiguous vector for K cache, while FlashInfer [260] orches- ematical equivalence of chunked-prefill technology is guar-
trates diverse data layouts for the KV cache, accompanied anteed, positioning it as a leading approach for reducing
by an appropriately designed memory access scheme. The request latency in LLM serving.
optimization of the attention operator in conjunction with
paged KV cache storage remains a forefront challenge in the 6.2.3 Scheduling Strategy
advancement of serving systems. In LLM serving, the job length of each request exhibits
variability, and hence the order of executing requests signif-
6.2.2 Continuous Batching icantly impacts the throughput of the serving system. The
The request lengths in a batch can be different, leading head-of-line blocking [255] happens when long requests are
to low utilization when shorter requests are finished and accorded priority. Specifically, memory usage grows rapidly
longer requests are still running. Due to the asynchronous in response to long requests, resulting in the impeding of
nature of requests in serving scenarios, there exists an subsequent requests when the system exhausts its mem-
opportunity that such periods of low utilization could be ory capacity. The pioneering work ORCA [252] and open-
mitigated. The continuous batching technique is proposed source systems, including vLLM [49] and LightLLM [253],
to leverage the opportunity by batching new requests once employ the simple first-come-first-serve (FCFS) principle
some old requests are finished. ORCA [252] is the first to to schedule requests. DeepSpeed-FastGen [254] gives pri-
utilize the continuous batching technique in LLM serving. ority to the decoding requests to enhance the performance.
26

TABLE 6
Comparison of multiple open-source inference engines and serving systems. ”-” denotes no serving support. Note that the scheduling method of
TensorRT-LLM is not open-sourced.

Inference Optimization Inference Serving Optimization Serving

Model (token/s) (req/s)
Attention Linear Graph Speculative Decoding Memory Batching Scheduling
HuggingFace [238] ✓ 38.963 - - - -
DeepSpeed [236] ✓ ✓ 80.947 blocked split-and-fuse decode prioritized 6.78
vLLM [49] ✓ ✓ 90.052 paged continuous batching prefill prioritized 7.11
OpenPPL [241] ✓ ✓ 81.169 - - - -
FlashDecoding++ [231] ✓ ✓ ✓ 106.636 - - - -
LightLLM [253] ✓ 73.599 token-wise split-and-fuse prefill prioritized 10.29
TensorRT-LLM [208] ✓ ✓ ✓ ✓ 92.512 paged continuous batching - 5.87

FastServe [255] proposes a preemptive scheduling strategy model compression methods, limiting scalability to larger
to optimize the head-of-line blocking problem, achieving models and longer inputs (up to 1.5B model and 256 tokens).
low job completion time (JCT) in LLM serving. FastServe ALLO builds on these insights, further offering a library
employs a multi-level feedback queue (MLFQ) to prioritize of High-level Synthesis (HLS) kernels that are composable
the requests with the shortest remaining time. Since the and reusable. ALLO’s implementation demonstrates supe-
auto-regressive decoding approach poses unknown request rior generation speed-up compared to DFX in the prefilling
lengths, FastServe predicts the length first and utilizes a stage, achieving enhanced energy efficiency and speedup
skip-join fashion to find the proper priority for each request. over the NVIDIA A100 GPU during decoding.
Unlike previous work, VTC [256] discusses the fairness in FlightLLM [268] also leverages these insights, introduc-
LLM serving. VTC introduces a cost function based on token ing a configurable sparse digital signal processor (DSP)
numbers to measure fairness among clients, and further chain for various sparsity patterns with high computa-
proposes a fair scheduler to ensure fairness. tional efficiency. It proposes an always-on-chip decode
scheme with mixed-precision support to enhance mem-
6.2.4 Distributed Systems ory bandwidth utilization. FlightLLM achieves 6.0× higher
In order to achieve high throughput, LLM services are energy efficiency and 1.8× better cost efficiency than the
commonly deployed on distributed platforms. Recent works NVIDIA V100S GPU for Llama2-7B models, with 1.2×
have additionally focused on optimizing the performance higher throughput than the NVIDIA A100 GPU during
of such inference services by exploiting distributed charac- decoding.
teristics. Notably, observing that the prefilling is compute-
intensive and the decoding is memory-intensive, split-
wise [247], TetriInfer [248] and DistServe [249] demonstrate
the efficiency of disaggregating the prefilling and the de- 6.4 Comparison of LLM Frameworks
coding steps of a request. In this way, the two distinct
stages are processed independently based on their char- We compare the performance of multiple LLM frame-
acteristics. SpotServe [250] is designed to provide LLM works in Table 6. The inference throughput is measured
service on clouds with preemptible GPU instances. Spot- with Llama2-7B (batch size=1, input length=1k, output
Serve efficiently handles challenges including dynamic par- length=128). The serving performance is the maximum
allel control and instance migration, and also utilizes the throughput measured on the ShareGPT [269] dataset. Both
auto-regressive nature of LLMs to achieve token-level state are derived on a single NVIDIA A100 80GB GPU. Among
recovery. Moreover, Infinite-LLM [251] extends the paged the mentioned frameworks, DeepSpeed [236], vLLM [49],
KV cache method in vLLM [49] to the distributed cloud LightLLM [253] and TensorRT-LLM [208] integrate the serv-
environment. ing function to serve asynchronous requests from multiple
users. We also list the optimizations for each framework in
the table. All the frameworks except HuggingFace imple-
6.3 Hardware Accelerator Design ment operator-level or graph-level optimizations to enhance
Previous research efforts [262], [263], [264] have focused on performance, and some of them also support the speculative
optimizing Transformer architectures, particularly enhanc- decoding technique. Note that the speculative decoding
ing the attention operator, often employing sparse methods technique is off when we measure the inference perfor-
to facilitate FPGA deployment. The FACT [265] accelera- mance for all frameworks. The results of inference through-
tor achieves superior energy efficiency compared to the put show that FlashDecoding++ and TensorRT-LLM out-
NVIDIA V100 GPU through mixed-precision quantization perform others with optimizations covering predominant
for linear operators and algorithm-hardware co-design, yet operators and the computational graph. From the aspect of
these approaches are not tailored for generative LLMs. serving, all the frameworks use fine-grained and discontigu-
Recent work like ALLO [266] highlights FPGA advan- ous storage for KV cache, and apply the continuous batching
tages in managing the memory-intensive decoding stage techniques to improve the system utilization. Unlike vLLM
and emphasizes the importance of model compression tech- and LightLLM, DeepSpeed prioritizes the decoding requests
niques for LLMs’ efficient FPGA deployment. Conversely, in scheduling, which means no new request is merged if
DFX [267] focuses on decoding stage optimizations but lacks there are enough existing decoding requests in the batch.
27

6.5 Knowledge, Suggestions and Future Direction maximum context length during both training and infer-
The system-level optimization improves efficiency while ence phases. Various strategies have been explored to ad-
bringing no accuracy degradation, hence becoming preva- dress this limitation, including input compression (Sec. 4.1),
lent in the LLM inference practice. The optimization for sparse attention (Sec. 5.2.2), design of low-complexity struc-
inference is also applicable to serving. Recently, the oper- tures (Sec. 5.1.3), and optimization of attention opera-
ator optimization has been closely combined with practi- tors (Sec. 6.1.1). Notably, non-Transformer architectures
cal serving scenarios, e.g.,, RadixAttention [50] designed (Sec. 5.1.3) with sub-quadratic or linear complexity have
specifically for prefix caching, and tree attention [225] to recently garnered significant interest from researchers.
accelerate speculative decoding verification. The iterating of Despite their efficiency, the competitiveness of these
applications and scenarios will continue to put forward new novel architectures compared to the Transformer archi-
requirements for operator development. tecture across various abilities, such as in-context learn-
Given the multifaceted objectives inherent in real-world ing ability and long-range modeling ability, is still under
serving systems, such as JCT, system throughput, and fair- scrutiny [74], [271]. Therefore, exploring the capabilities of
ness, the design of scheduling strategies becomes corre- these new architectures from multiple angles and address-
spondingly intricate. Within the domain of LLM serving, ing their limitations remains a valuable pursuit. Moreover,
where the length of requests is indeterminate, extant litera- it is crucial to determine the necessary context lengths for
ture commonly relies on predictive mechanisms to facilitate various scenarios and tasks, as well as identify the next-
the design of scheduling strategies. However, the efficacy generation architecture that will serve as the foundational
of current predictors [248] falls short of ideal standards, backbone for LLMs in the future.
indicating the potential for refinement and optimization in Edge Scenario Deployment. While considerable efforts
serving scheduling strategy development. have been directed towards enhancing the efficiency of
LLM inference, deploying LLMs onto extremely resource-
constrained edge devices like mobile phones presents ongo-
7 D ISCUSSIONS OF K EY A PPLICATION S CENAR - ing challenges. Recently, numerous researchers [272], [273],
[274], [275], [276], [277], [278], [279], [280], [281], [282] have
IOS
shown interest in pre-training smaller language models
Current research endeavors have made significant strides in with 1B to 3B parameters. Models of this scale offer re-
exploring the boundaries of efficient LLM inference across duced resource costs during inference and hold potential for
various optimization levels. However, further studies are achieving generalization abilities and competitive perfor-
warranted to enhance LLM efficiency in practical scenarios. mance compared to larger models. However, the methods
We have provided promising future directions for opti- to develop such efficient and powerful smaller language
mization techniques at the data-level (Sec. 4.3), model-level models remain under-explored.
(Sec. 5.3), and system-level (Sec. 6.5). In this section, we Several studies have initiated this promising direction.
summarize four critical scenarios: agent and multi-model For instance, MiniCPM [281] conducts sandbox experi-
framework, long-context LLMs, edge scenario deployment, ments to determine optimal pre-training hyper-parameters.
and security-efficiency synergy, and provide a broader dis- PanGu-π -Pro [274] suggests initializing model weights from
cussion on them. pre-trained LLMs using metrics and techniques from model
Agent and Multi-Model Framework. As discussed in pruning. MobileLLM [282] adopts a“deep and thin” archi-
Sec. 4.3, recent advancements in agent and multi-model tecture for small model design and proposes weight sharing
frameworks [53], [54], [55] have significantly improved across different layers to increase the number of layers
agents’ capabilities to handle complex tasks and human without additional memory costs. Nevertheless, a perfor-
requests by harnessing the powerful abilities of LLMs. These mance gap still exists between small and large models,
frameworks, while increasing the computational demands necessitating future studies to narrow this gap. In the future,
of LLMs, introduce more parallelism into the structure of there is a crucial need for research aimed at identifying the
LLMs’ output content, thereby creating opportunities for model scale limited in the edge scenarios, and exploring the
data-level and system-level optimizations such as output or- boundaries of various optimization methods on designing
ganization techniques [50]. Furthermore, these frameworks smaller models.
naturally introduce a new optimization level, i.e., pipeline- Beyond designing smaller models, system-level opti-
level, which holds potential for efficiency enhancements at mization offers a promising direction in LLM deployment. A
this level [56]. notable recent project, MLC-LLM [283], successfully deploys
In addition, there is a growing research trend [270] fo- the LLaMA-7B model on mobile phones. MLC-LLM pri-
cused on extending AI agents into the multimodal domain, marily employs compilation techniques like fusion, memory
which often utilize Large Multimodal Models (LMMs) as planning, and loop optimization to enhance latency and re-
the core of these agent systems. To enhance the efficiency of duce memory cost during inference. Additionally, adopting
these emerging LMM-based agents, designing optimization the cloud-edge collaboration techniques, or designing more
techniques for LMMs is a promising research direction. sophisticated hardware accelerators can also help deploy
Long-Context LLMs. Currently, LLMs face the challenge LLMs onto edge devices.
of handling increasingly longer input contexts. However, Security-Efficiency Synergy. In addition to task perfor-
the self-attention operation, the fundamental component mance and efficiency, security is also a crucial factor that
of Transformer-style LLMs, exhibits quadratic complexity must be considered in LLM applications [284], [285]. Cur-
in relation to the context length, imposing constraints on rent research primarily focuses on efficiency optimiza-
28

tion without adequately addressing security considerations. [7] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng,
Therefore, it is critical to investigate the interplay between S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-
source chatbot impressing gpt-4 with 90%* chatgpt quality,” See
efficiency and security and determine whether the current https://vicuna. lmsys. org (accessed 14 April 2023), 2023.
optimization techniques compromise the security of LLMs. [8] D. Li, R. Shao, A. Xie, Y. Sheng, L. Zheng, J. Gonzalez, I. Stoica,
If these techniques negatively impacts LLMs’ security, a X. Ma, and H. Zhang, “How long can context length of open-
promising direction would involve developing new opti- source llms truly promise?” in NeurIPS 2023 Workshop on Instruc-
tion Tuning and Instruction Following, 2023.
mization methods or refining the existing ones to achieve [9] B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić,
a better trade-off between LLMs’ efficiency and security. D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon et al., “Bloom: A
176b-parameter open-access multilingual language model,” arXiv
preprint arXiv:2211.05100, 2022.
8 C ONCLUSION [10] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojo-
caru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic
Efficient LLM inference focuses on reducing the compu- et al., “The falcon series of open language models,” arXiv preprint
tational, memory access, and memory costs during LLM arXiv:2311.16867, 2023.
[11] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang,
inference processes, aiming to optimize efficiency metrics “Glm: General language model pretraining with autoregressive
such as latency, throughput, storage, power, and energy. blank infilling,” arXiv preprint arXiv:2103.10360, 2021.
This survey offers a comprehensive review of efficient LLM [12] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary,
inference research, presenting insights, recommendations, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand
et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
and future directions for key techniques. Initially, we intro- [13] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong,
duce a hierarchical taxonomy encompassing data-, model- B. Yin, and X. Hu, “Harnessing the power of llms in practice: A
, and system-level optimizations. Subsequently, guided by survey on chatgpt and beyond,” ACM Transactions on Knowledge
Discovery from Data, 2023.
this taxonomy, we meticulously examine and summarize
[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le,
studies at each level and sub-field. For well-established D. Zhou et al., “Chain-of-thought prompting elicits reasoning in
techniques like model quantization and efficient serving large language models,” Advances in Neural Information Processing
systems, we conduct experiments to evaluate and analyze Systems, vol. 35, pp. 24 824–24 837, 2022.
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan,
their performance. Based on these analyses, we offer practi- H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Eval-
cal suggestions and identify promising research avenues for uating large language models trained on code,” arXiv preprint
practitioners and researchers in the field. arXiv:2107.03374, 2021.
[16] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz,
E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks
of artificial general intelligence: Early experiments with gpt-4,”
ACKNOWLEDGEMENTS arXiv preprint arXiv:2303.12712, 2023.
This work was supported by National Natural Science [17] X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A survey on
model compression for large language models,” arXiv preprint
Foundation of China (No. 62325405, 62104128, U19B2019, arXiv:2308.07633, 2023.
U21B2031, 61832007, 62204164), Tsinghua EE Xilinx AI Re- [18] S. Park, J. Choi, S. Lee, and U. Kang, “A comprehensive survey
search Fund, and Beijing National Research Center for In- of compression algorithms for language models,” arXiv preprint
formation Science and Technology (BNRist). We thank for arXiv:2401.15347, 2024.
[19] W. Wang, W. Chen, Y. Luo, Y. Long, Z. Lin, L. Zhang, B. Lin,
all the support from Infinigence-AI. We thank Xiangsheng D. Cai, and X. He, “Model compression and efficient infer-
Shi, Zinan Lin, Xinhao Yang, Hongyi Wang, Linfeng Zhang, ence for large language models: A survey,” arXiv preprint
Yulin Wang, Xuemin Sun, Saiqian Zhang for their valuable arXiv:2402.09748, 2024.
suggestions on the paper. We thank Shengxiang Wang, Qiuli [20] T. Ding, T. Chen, H. Zhu, J. Jiang, Y. Zhong, J. Zhou, G. Wang,
Z. Zhu, I. Zharkov, and L. Liang, “The efficiency spectrum of
Mao for providing the efficiency profiling data of quantized large language models: An algorithmic survey,” arXiv preprint
operators. arXiv:2312.00678, 2023.
[21] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen,
and Z. Jia, “Towards efficient generative large language model
R EFERENCES serving: A survey from algorithms to systems,” arXiv preprint
arXiv:2312.15234, 2023.
[1] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., [22] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, Z. Qu, S. Yan, Y. Zhu,
“Improving language understanding by generative pre-training,” Q. Zhang, M. Chowdhury et al., “Efficient large language models:
2018. A survey,” arXiv preprint arXiv:2312.03863, vol. 1, 2023.
[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever [23] M. Xu, W. Yin, D. Cai, R. Yi, D. Xu, Q. Wang, B. Wu, Y. Zhao,
et al., “Language models are unsupervised multitask learners,” C. Yang, S. Wang et al., “A survey of resource-efficient llm and
OpenAI blog, vol. 1, no. 8, p. 9, 2019. multimodal foundation models,” arXiv preprint arXiv:2401.08092,
[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- 2024.
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., [24] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min,
“Language models are few-shot learners,” Advances in neural B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language
information processing systems, vol. 33, pp. 1877–1901, 2020. models,” arXiv preprint arXiv:2303.18223, 2023.
[4] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
transformer language models,” arXiv preprint arXiv:2205.01068, Advances in neural information processing systems, vol. 30, 2017.
2022. [26] Z. Yuan, Y. Shang, Y. Zhou, Z. Dong, C. Xue, B. Wu, Z. Li,
[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, Q. Gu, Y. J. Lee, Y. Yan et al., “Llm inference unveiled: Survey and
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., roofline model insights,” arXiv preprint arXiv:2402.16363, 2024.
“Llama: Open and efficient foundation language models,” arXiv [27] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal,
preprint arXiv:2302.13971, 2023. H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-
[6] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, augmented generation for knowledge-intensive nlp tasks,” Ad-
D. Wang, D. Yan et al., “Baichuan 2: Open large-scale language vances in Neural Information Processing Systems, vol. 33, pp. 9459–
models,” arXiv preprint arXiv:2309.10305, 2023. 9474, 2020.
29

[28] A. Chevalier, A. Wettig, A. Ajith, and D. Chen, “Adapt- [51] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and
ing language models to compress contexts,” arXiv preprint K. Narasimhan, “Tree of thoughts: Deliberate problem solving
arXiv:2305.14788, 2023. with large language models,” Advances in Neural Information
[29] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, Processing Systems, vol. 36, 2024.
L. Zettlemoyer, and W. tau Yih, “Replug: Retrieval-augmented [52] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski,
black-box language models,” 2023. L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk
[30] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self- et al., “Graph of thoughts: Solving elaborate problems with
rag: Learning to retrieve, generate, and critique through self- large language models,” in Proceedings of the AAAI Conference on
reflection,” 2023. Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690.
[31] D. Wingate, M. Shoeybi, and T. Sorensen, “Prompt compres- [53] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang,
sion and contrastive conditioning for controllability and toxicity J. Wang, S. Jin, E. Zhou et al., “The rise and potential of
reduction in language models,” arXiv preprint arXiv:2210.03162, large language model based agents: A survey,” arXiv preprint
2022. arXiv:2309.07864, 2023.
[32] J. Mu, X. L. Li, and N. Goodman, “Learning to compress prompts [54] Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex:
with gist tokens,” arXiv preprint arXiv:2304.08467, 2023. Pushing the boundaries of complex reasoning through multi-
[33] T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context model collaboration,” arXiv preprint arXiv:2310.00280, 2023.
autoencoder for context compression in a large language model,” [55] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla,
arXiv preprint arXiv:2307.06945, 2023. O. Wiest, and X. Zhang, “Large language model based multi-
[34] F. Xu, W. Shi, and E. Choi, “Recomp: Improving retrieval- agents: A survey of progress and challenges,” arXiv preprint
augmented lms with compression and selective augmentation,” arXiv:2402.01680, 2024.
arXiv preprint arXiv:2310.04408, 2023. [56] L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large
[35] W. Fei, X. Niu, P. Zhou, L. Hou, B. Bai, L. Deng, and W. Han, “Ex- language models while reducing cost and improving perfor-
tending context window of large language models via semantic mance,” arXiv preprint arXiv:2305.05176, 2023.
compression,” arXiv preprint arXiv:2312.09571, 2023. [57] Y. Li, T. Cai, Y. Zhang, D. Chen, and D. Dey, “What makes
[36] W. Zhou, Y. E. Jiang, R. Cotterell, and M. Sachan, “Efficient convolutional models great on long sequence modeling?” arXiv
prompting via dynamic in-context learning,” arXiv preprint preprint arXiv:2210.09298, 2022.
arXiv:2305.11170, 2023. [58] D. W. Romero, A. Kuzina, E. J. Bekkers, J. M. Tomczak, and
[37] Y. Li, B. Dong, F. Guerin, and C. Lin, “Compressing context M. Hoogendoorn, “Ckconv: Continuous kernel convolution for
to enhance inference efficiency of large language models,” in sequential data,” arXiv preprint arXiv:2102.02611, 2021.
Proceedings of the 2023 Conference on Empirical Methods in Natural [59] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus,
Language Processing, 2023, pp. 6342–6353. Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger
[38] F. Yin, J. Vig, P. Laban, S. Joty, C. Xiong, and C.-S. J. Wu, “Did you convolutional language models,” in International Conference on
read the instructions? rethinking the effectiveness of task defi- Machine Learning. PMLR, 2023, pp. 28 043–28 078.
nitions in instruction learning,” arXiv preprint arXiv:2306.01150, [60] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho,
2023. H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV et al.,
[39] H. Jung and K.-J. Kim, “Discrete prompt compression with “Rwkv: Reinventing rnns for the transformer era,” arXiv preprint
reinforcement learning,” arXiv preprint arXiv:2308.08758, 2023. arXiv:2305.13048, 2023.
[40] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, “Llmlingua: [61] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and
Compressing prompts for accelerated inference of large language F. Wei, “Retentive network: A successor to transformer for large
models,” in The 2023 Conference on Empirical Methods in Natural language models,” arXiv preprint arXiv:2307.08621, 2023.
Language Processing, 2023.
[62] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent
[41] H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and memory with optimal polynomial projections,” Advances in neural
L. Qiu, “Longllmlingua: Accelerating and enhancing llms in information processing systems, vol. 33, pp. 1474–1487, 2020.
long context scenarios via prompt compression,” arXiv preprint
[63] A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, and
arXiv:2310.06839, 2023.
C. Ré, “Combining recurrent, convolutional, and continuous-
[42] X. Huang, L. L. Zhang, K.-T. Cheng, and M. Yang, “Boosting llm
time models with linear state space layers,” Advances in neural
reasoning: Push the limits of few-shot learning with reinforced
information processing systems, vol. 34, pp. 572–585, 2021.
in-context pruning,” arXiv preprint arXiv:2312.08901, 2023.
[64] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences
[43] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu,
with structured state spaces,” arXiv preprint arXiv:2111.00396,
and Z. Sui, “A survey for in-context learning,” arXiv preprint
2021.
arXiv:2301.00234, 2022.
[65] A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as ef-
[44] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous
fective as structured state spaces,” Advances in Neural Information
prompts for generation,” in Proceedings of the 59th Annual Meeting
Processing Systems, vol. 35, pp. 22 982–22 994, 2022.
of the Association for Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Language Processing (Volume 1: [66] A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization
Long Papers), 2021, pp. 4582–4597. and initialization of diagonal state space models,” Advances in
[45] X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of- Neural Information Processing Systems, vol. 35, pp. 35 971–35 983,
thought: Large language models can do parallel decoding,” arXiv 2022.
preprint arXiv:2307.15337, 2023. [67] H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur, “Long
[46] S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, range language modeling via gated state spaces,” in International
A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph Conference on Learning Representations, 2023.
decoding,” arXiv preprint arXiv:2402.12280, 2024. [68] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré,
[47] M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong, “Hungry hungry hippos: Towards language modeling with state
“Apar: Llms can do auto-parallel auto-regressive decoding,” space models,” arXiv preprint arXiv:2212.14052, 2022.
arXiv preprint arXiv:2401.06761, 2024. [69] R. Hasani, M. Lechner, T.-H. Wang, M. Chahine, A. Amini, and
[48] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, D. Rus, “Liquid structural state-space models,” arXiv preprint
“Medusa: Simple llm inference acceleration framework with mul- arXiv:2209.12951, 2022.
tiple decoding heads,” 2024. [70] J. T. Smith, A. Warrington, and S. W. Linderman, “Simpli-
[49] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, fied state space layers for sequence modeling,” arXiv preprint
J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory manage- arXiv:2208.04933, 2022.
ment for large language model serving with pagedattention,” in [71] J. Pilault, M. Fathi, O. Firat, C. Pal, P.-L. Bacon, and R. Goroshin,
Proceedings of the 29th Symposium on Operating Systems Principles, “Block-state transformers,” Advances in Neural Information Pro-
2023, pp. 611–626. cessing Systems, vol. 36, 2024.
[50] L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, [72] J. Wang, J. N. Yan, A. Gu, and A. M. Rush, “Pretraining without
C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., “Efficiently pro- attention,” arXiv preprint arXiv:2212.10544, 2022.
gramming large language models using sglang,” arXiv preprint [73] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with
arXiv:2312.07104, 2023. selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
30

[74] J. Park, J. Park, Z. Xiong, N. Lee, J. Cho, S. Oymak, K. Lee, [95] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu,
and D. Papailiopoulos, “Can mamba learn how to learn? a M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scal-
comparative study on in-context learning tasks,” arXiv preprint ing of language models with mixture-of-experts,” in International
arXiv:2402.04248, 2024. Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
[75] N. Shazeer, “Fast transformer decoding: One write-head is all [96] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton,
you need,” arXiv preprint arXiv:1911.02150, 2019. and J. Dean, “Outrageously large neural networks: The sparsely-
[76] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, gated mixture-of-experts layer,” in International Conference on
and S. Sanghai, “Gqa: Training generalized multi-query trans- Learning Representations, 2016.
former models from multi-head checkpoints,” arXiv preprint [97] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang,
arXiv:2305.13245, 2023. M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant
[77] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Lin- models with conditional computation and automatic sharding,”
former: Self-attention with linear complexity,” arXiv preprint arXiv preprint arXiv:2006.16668, 2020.
arXiv:2006.04768, 2020. [98] C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang,
[78] G. I. Winata, S. Cahyawijaya, Z. Lin, Z. Liu, and P. Fung, R. Salas, J. Jose, P. Ram et al., “Tutel: Adaptive mixture-of-experts
“Lightweight and efficient end-to-end speech recognition using at scale,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
low-rank transformer,” in ICASSP 2020-2020 IEEE International [99] D. P. Bertsekas, “Auction algorithms for network flow problems:
Conference on Acoustics, Speech and Signal Processing (ICASSP). A tutorial introduction,” Computational optimization and applica-
IEEE, 2020, pp. 6144–6148. tions, vol. 1, pp. 7–66, 1992.
[79] A. Gupta, Y. Yuan, Y. Zhou, and C. Mendis, “Flurka: Fast fused [100] Z. Dai, G. Lai, Y. Yang, and Q. Le, “Funnel-transformer: Filtering
low-rank & kernel attention,” arXiv preprint arXiv:2306.15799, out sequential redundancy for efficient language processing,”
2023. Advances in neural information processing systems, vol. 33, pp. 4271–
[80] X. Ma, X. Kong, S. Wang, C. Zhou, J. May, H. Ma, and L. Zettle- 4282, 2020.
moyer, “Luna: Linear unified nested attention,” Advances in Neu- [101] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang,
ral Information Processing Systems, vol. 34, pp. 2441–2453, 2021. “Vision mamba: Efficient visual representation learning with
[81] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, bidirectional state space model,” arXiv preprint arXiv:2401.09417,
“Set transformer: A framework for attention-based permutation- 2024.
invariant neural networks,” in International conference on machine [102] W. Hua, Z. Dai, H. Liu, and Q. Le, “Transformer quality in linear
learning. PMLR, 2019, pp. 3744–3753. time,” in International Conference on Machine Learning. PMLR,
[82] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Trans- 2022, pp. 9099–9117.
formers are rnns: Fast autoregressive transformers with linear [103] AI21, “Jamba: Ai21’s groundbreaking ssm-transformer model,”
attention,” in International conference on machine learning. PMLR, March 2024. [Online]. Available: https://www.ai21.com/blog/
2020, pp. 5156–5165. announcing-jamba
[83] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, [104] W. He, K. Han, Y. Tang, C. Wang, Y. Yang, T. Guo, and
A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, Y. Wang, “Densemamba: State space models with dense hidden
L. Kaiser et al., “Rethinking attention with performers,” in In- connection for efficient large language models,” arXiv preprint
ternational Conference on Learning Representations, 2020. arXiv:2403.00818, 2024.
[84] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. Smith, and [105] Q. Anthony, Y. Tokpanov, P. Glorioso, and B. Millidge, “Black-
L. Kong, “Random feature attention,” in International Conference mamba: Mixture of experts for state-space models,” arXiv preprint
on Learning Representations, 2022. arXiv:2402.01771, 2024.
[85] P. Kacham, V. Mirrokni, and P. Zhong, “Polysketchformer: Fast [106] M. Pióro, K. Ciebiera, K. Król, J. Ludziejewski, and S. Jaszczur,
transformers via sketches for polynomial kernels,” arXiv preprint “Moe-mamba: Efficient selective state space models with mixture
arXiv:2310.01655, 2023. of experts,” arXiv preprint arXiv:2401.04081, 2024.
[86] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling [107] S. Zhai, W. Talbott, N. Srivastava, C. Huang, H. Goh, R. Zhang,
to trillion parameter models with simple and efficient sparsity,” and J. Susskind, “An attention free transformer,” arXiv preprint
The Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232– arXiv:2105.14103, 2021.
5270, 2022. [108] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran,
[87] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “Moefication: Y. Tay, and D. Metzler, “Confident adaptive language modeling,”
Transformer feed-forward layers are mixtures of experts,” in Advances in Neural Information Processing Systems, vol. 35, pp.
Findings of the Association for Computational Linguistics: ACL 2022, 17 456–17 472, 2022.
2022, pp. 877–890. [109] L. Del Corro, A. Del Giorno, S. Agarwal, B. Yu, A. Awadallah, and
[88] Z.-F. Gao, P. Liu, W. X. Zhao, Z.-Y. Lu, and J.-R. Wen, “Parameter- S. Mukherjee, “Skipdecode: Autoregressive skip decoding with
efficient mixture-of-experts architecture for pre-trained language batching and caching for efficient llm inference,” arXiv preprint
models,” in Proceedings of the 29th International Conference on arXiv:2307.02628, 2023.
Computational Linguistics, 2022, pp. 3263–3273. [110] W. Liu, P. Zhou, Z. Wang, Z. Zhao, H. Deng, and Q. Ju, “Fastbert:
[89] A. Komatsuzaki, J. Puigcerver, J. Lee-Thorp, C. R. Ruiz, a self-distilling bert with adaptive inference time,” in Proceedings
B. Mustafa, J. Ainslie, Y. Tay, M. Dehghani, and N. Houlsby, of the 58th Annual Meeting of the Association for Computational
“Sparse upcycling: Training mixture-of-experts from dense Linguistics, 2020, pp. 6035–6044.
checkpoints,” arXiv preprint arXiv:2212.05055, 2022. [111] J. Kong, J. Wang, L.-C. Yu, and X. Zhang, “Accelerating inference
[90] M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, for pretrained language models by unified multi-perspective
“Base layers: Simplifying training of large, sparse models,” in early exiting,” in Proceedings of the 29th International Conference
International Conference on Machine Learning. PMLR, 2021, pp. on Computational Linguistics, 2022, pp. 4677–4686.
6265–6274. [112] K. Liao, Y. Zhang, X. Ren, Q. Su, X. Sun, and B. He, “A global
[91] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. past-future early exit method for accelerating inference of pre-
Dai, Q. V. Le, J. Laudon et al., “Mixture-of-experts with expert trained language models,” in Proceedings of the 2021 Conference
choice routing,” Advances in Neural Information Processing Systems, of the North American Chapter of the Association for Computational
vol. 35, pp. 7103–7114, 2022. Linguistics: Human Language Technologies, 2021, pp. 2013–2023.
[92] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, [113] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “Deebert: Dynamic
and W. Fedus, “St-moe: Designing stable and transferable sparse early exiting for accelerating bert inference,” in Proceedings of the
expert models,” arXiv preprint arXiv:2202.08906, 2022. 58th Annual Meeting of the Association for Computational Linguistics,
[93] D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, 2020, pp. 2246–2251.
“Stablemoe: Stable routing strategy for mixture of experts,” in [114] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, “Bert loses
Proceedings of the 60th Annual Meeting of the Association for Compu- patience: Fast and robust inference with early exit,” Advances in
tational Linguistics (Volume 1: Long Papers), 2022, pp. 7085–7095. Neural Information Processing Systems, vol. 33, pp. 18 330–18 341,
[94] T. Chen, Z. Zhang, A. K. JAISWAL, S. Liu, and Z. Wang, “Sparse 2020.
moe as the new dropout: Scaling dense and self-slimmable [115] T. Sun, X. Liu, W. Zhu, Z. Geng, L. Wu, Y. He, Y. Ni, G. Xie, X.-J.
transformers,” in The Eleventh International Conference on Learning Huang, and X. Qiu, “A simple hash-based early exiting approach
Representations, 2022. for language understanding and generation,” in Findings of the
31

Association for Computational Linguistics: ACL 2022, 2022, pp. 2409– [138] M. Javaheripi, G. de Rosa, S. Mukherjee, S. Shah, T. Religa,
2421. C. C. Teodoro Mendes, S. Bubeck, F. Koushanfar, and D. Dey,
[116] Y. Huang, Y. Chen, Z. Yu, and K. McKeown, “In-context learning “Litetransformersearch: Training-free neural architecture search
distillation: Transferring few-shot learning ability of pre-trained for efficient language models,” Advances in Neural Information
language models,” arXiv preprint arXiv:2212.10670, 2022. Processing Systems, vol. 35, pp. 24 254–24 267, 2022.
[117] J. Zhao, W. Zhao, A. Drozdov, B. Rozonoyer, M. A. Sultan, [139] D. D. Xu, S. Mukherjee, X. Liu, D. Dey, W. Wang, X. Zhang,
J.-Y. Lee, M. Iyyer, and A. McCallum, “Multistage collabora- A. Awadallah, and J. Gao, “Few-shot task-agnostic neural archi-
tive knowledge distillation from large language models,” arXiv tecture search for distilling large language models,” Advances in
preprint arXiv:2311.08640, 2023. Neural Information Processing Systems, vol. 35, pp. 28 644–28 656,
[118] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, 2022.
R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! [140] A. Kaushal, T. Vaidhya, and I. Rish, “Lord: Low rank decomposi-
outperforming larger language models with less training data tion of monolingual code llms for one-shot compression,” arXiv
and smaller model sizes,” arXiv preprint arXiv:2305.02301, 2023. preprint arXiv:2309.14021, 2023.
[119] L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi, [141] M. Xu, Y. L. Xu, and D. P. Mandic, “Tensorgpt: Efficient com-
“Symbolic chain-of-thought distillation: Small models can also” pression of the embedding layer in llms based on the tensor-train
think” step-by-step,” arXiv preprint arXiv:2306.14050, 2023. decomposition,” arXiv preprint arXiv:2307.00526, 2023.
[120] L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Sev- [142] Y. Li, Y. Yu, Q. Zhang, C. Liang, P. He, W. Chen, and T. Zhao,
eryn, “Teaching small language models to reason,” arXiv preprint “Losparse: Structured compression of large language models
arXiv:2212.08410, 2022. based on low-rank and sparse approximation,” arXiv preprint
[121] H. Chen, S. Wu, X. Quan, R. Wang, M. Yan, and J. Zhang, “Mcc- arXiv:2306.11222, 2023.
kd: Multi-cot consistent knowledge distillation,” arXiv preprint [143] R. Saha, V. Srivastava, and M. Pilanci, “Matrix compression via
arXiv:2310.14747, 2023. randomized low rank and low precision factorization,” arXiv
[122] N. Ho, L. Schmid, and S.-Y. Yun, “Large language models are preprint arXiv:2310.11028, 2023.
reasoning teachers,” arXiv preprint arXiv:2212.10071, 2022. [144] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring
[123] K. Shridhar, A. Stolfo, and M. Sachan, “Distilling reasoning post-training quantization in llms from comprehensive study to
capabilities into smaller language models,” in Findings of the low rank compensation,” arXiv preprint arXiv:2303.08302, 2023.
Association for Computational Linguistics: ACL 2023, 2023, pp. 7059– [145] R. Chand, Y. Prabhu, and P. Kumar, “Dsformer: Effective com-
7073. pression of text-transformers by dense-sparse weight factoriza-
[124] X. Zhu, B. Qi, K. Zhang, X. Long, and B. Zhou, “Pad: Program- tion,” arXiv preprint arXiv:2312.13211, 2023.
aided distillation specializes large models in reasoning,” arXiv [146] Z. Yuan, Y. Shang, Y. Song, Q. Wu, Y. Yan, and G. Sun, “Asvd:
preprint arXiv:2305.13888, 2023. Activation-aware singular value decomposition for compressing
[125] P. Wang, Z. Wang, Z. Li, Y. Gao, B. Yin, and X. Ren, “Scott: large language models,” arXiv preprint arXiv:2312.05821, 2023.
Self-consistent chain-of-thought distillation,” arXiv preprint [147] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generat-
arXiv:2305.01879, 2023. ing long sequences with sparse transformers,” arXiv preprint
[126] Z. Chen, Q. Gao, A. Bosselut, A. Sabharwal, and K. Richardson, arXiv:1904.10509, 2019.
“Disco: distilling counterfactuals with large language models,” [148] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient
in Proceedings of the 61st Annual Meeting of the Association for streaming language models with attention sinks,” arXiv preprint
Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5514– arXiv:2309.17453, 2023.
5528. [149] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
[127] M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, and A. F. Aji, document transformer,” arXiv preprint arXiv:2004.05150, 2020.
“Lamini-lm: A diverse herd of distilled models from large-scale [150] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti,
instructions,” arXiv preprint arXiv:2304.14402, 2023. S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big
[128] Y. Jiang, C. Chan, M. Chen, and W. Wang, “Lion: Adversarial bird: Transformers for longer sequences,” Advances in neural
distillation of proprietary large language models,” in Proceedings information processing systems, vol. 33, pp. 17 283–17 297, 2020.
of the 2023 Conference on Empirical Methods in Natural Language [151] S. Dai, H. Genc, R. Venkatesan, and B. Khailany, “Efficient trans-
Processing, 2023, pp. 3134–3154. former inference with statically structured sparse attention,” in
[129] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge distillation 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE,
of large language models,” arXiv preprint arXiv:2306.08543, 2023. 2023, pp. 1–6.
[130] R. Agarwal, N. Vieillard, P. Stanczyk, S. Ramos, M. Geist, and [152] Anonymous, “SemSA: Semantic sparse attention is hidden
O. Bachem, “Gkd: Generalized knowledge distillation for auto- in large language models.” 2023. [Online]. Available: https:
regressive sequence models,” arXiv preprint arXiv:2306.13649, //openreview.net/forum?id=eG9AkHtYYH
2023. [153] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse at-
[131] C. Liang, S. Zuo, Q. Zhang, P. He, W. Chen, and T. Zhao, “Less tention architecture with cascade token and head pruning,” in
is more: Task-aware layer-wise distillation for language model 2021 IEEE International Symposium on High-Performance Computer
compression,” in International Conference on Machine Learning. Architecture (HPCA). IEEE, pp. 97–110.
PMLR, 2023, pp. 20 852–20 867. [154] L. Ren, Y. Liu, S. Wang, Y. Xu, C. Zhu, and C. Zhai, “Sparse mod-
[132] I. Timiryasov and J.-L. Tastet, “Baby llama: knowledge distillation ular activation for efficient sequence modeling,” arXiv preprint
from an ensemble of teachers trained on a small dataset with no arXiv:2306.11197, 2023.
performance penalty,” arXiv preprint arXiv:2308.02019, 2023. [155] S. Anagnostidis, D. Pavllo, L. Biggio, L. Noci, A. Lucchi,
[133] C. Zhang, Y. Yang, J. Liu, J. Wang, Y. Xian, B. Wang, and D. Song, and T. Hoffmann, “Dynamic context pruning for efficient
“Lifting the curse of capacity gap in distilling language models,” and interpretable autoregressive transformers,” arXiv preprint
arXiv preprint arXiv:2305.12129, 2023. arXiv:2305.15805, 2023.
[134] S. Padmanabhan, Y. Onoe, M. J. Zhang, G. Durrett, and E. Choi, [156] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient
“Propagating knowledge updates to lms through distillation,” transformer,” arXiv preprint arXiv:2001.04451, 2020.
arXiv preprint arXiv:2306.09306, 2023. [157] M. Pagliardini, D. Paliotta, M. Jaggi, and F. Fleuret, “Faster causal
[135] Y. Yin, C. Chen, L. Shang, X. Jiang, X. Chen, and Q. Liu, “Au- attention over large sequences through sparse flash attention,”
totinybert: Automatic hyper-parameter optimization for efficient arXiv preprint arXiv:2306.01160, 2023.
pre-trained language models,” arXiv preprint arXiv:2107.13686, [158] A. Roy, M. Saffar, A. Vaswani, and D. Grangier, “Efficient content-
2021. based sparse attention with routing transformers,” Transactions of
[136] J. Xu, X. Tan, R. Luo, K. Song, J. Li, T. Qin, and T.-Y. Liu, “Nas- the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021.
bert: task-agnostic and adaptive-size bert compression with neu- [159] Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse
ral architecture search,” in Proceedings of the 27th ACM SIGKDD sinkhorn attention,” in International Conference on Machine Learn-
Conference on Knowledge Discovery & Data Mining, 2021, pp. 1933– ing. PMLR, 2020, pp. 9438–9447.
1943. [160] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song,
[137] A. Klein, J. Golebiowski, X. Ma, V. Perrone, and C. Archambeau, Y. Tian, C. Ré, C. Barrett et al., “H2o: Heavy-hitter oracle for
“Structural pruning of large language models via neural archi- efficient generative inference of large language models,” Advances
tecture search,” 2023. in Neural Information Processing Systems, vol. 36, 2024.
32

[161] A. Feng, I. Li, Y. Jiang, and R. Ying, “Diffuser: efficient transform- [184] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “Awq:
ers with multi-hop attention diffusion for long sequences,” in Activation-aware weight quantization for llm compression and
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, acceleration,” arXiv preprint arXiv:2306.00978, 2023.
no. 11, 2023, pp. 12 772–12 780. [185] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “Owq: Lessons learned
[162] E. Frantar and D. Alistarh, “Sparsegpt: Massive language models from activation outliers for weight quantization in large language
can be accurately pruned in one-shot,” 2023. models,” arXiv preprint arXiv:2306.02272, 2023.
[163] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective [186] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev,
pruning approach for large language models,” arXiv preprint E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh,
arXiv:2306.11695, 2023. “Spqr: A sparse-quantized representation for near-lossless llm
[164] H. Shao, B. Liu, and Y. Qian, “One-shot sensitivity-aware mixed weight compression,” arXiv preprint arXiv:2306.03078, 2023.
sparsity pruning for large language models,” arXiv preprint [187] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W.
arXiv:2310.09499, 2023. Mahoney, and K. Keutzer, “Squeezellm: Dense-and-sparse quan-
[165] A. Syed, P. H. Guo, and V. Sundarapandiyan, “Prune and tune: tization,” arXiv preprint arXiv:2306.07629, 2023.
Improving efficient pruning techniques for massive language [188] J. Chee, Y. Cai, V. Kuleshov, and C. De Sa, “Quip: 2-bit quantiza-
models,” 2023. tion of large language models with guarantees,” in Thirty-seventh
[166] X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu, Conference on Neural Information Processing Systems, 2023.
“Outlier suppression+: Accurate quantization of large language [189] Y. J. Kim, R. Henry, R. Fahim, and H. H. Awadalla, “Finequant:
models by equivalent and optimal shifting and scaling,” arXiv Unlocking efficiency with fine-grained weight-only quantization
preprint arXiv:2304.09145, 2023. for llms,” arXiv preprint arXiv:2308.09723, 2023.
[167] P. Xu, W. Shao, M. Chen, S. Tang, K. Zhang, P. Gao, F. An, [190] K. Behdin, A. Acharya, A. Gupta, S. Keerthi, and R. Mazumder,
Y. Qiao, and P. Luo, “Besa: Pruning large language models with “Quantease: Optimization-based quantization for language
blockwise parameter-efficient sparsity allocation,” in The Twelfth models–an efficient and intuitive algorithm,” arXiv preprint
International Conference on Learning Representations, 2023. arXiv:2309.01885, 2023.
[168] Y. Zhang, H. Bai, H. Lin, J. Zhao, L. Hou, and C. V. Cannistraci, [191] S. Li, X. Ning, K. Hong, T. Liu, L. Wang, X. Li, K. Zhong, G. Dai,
“An efficient plug-and-play post-training pruning strategy in H. Yang, and Y. Wang, “Llm-mq: Mixed-precision quantization
large language models,” 2023. for efficient llm deployment,” 2023.
[169] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural [192] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He,
pruning of large language models,” Advances in neural information “Zeroquant: Efficient and affordable post-training quantization
processing systems, vol. 36, 2024. for large-scale transformers,” in Advances in Neural Information
[170] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared llama: Accelerat- Processing Systems, 2022.
ing language model pre-training via structured pruning,” arXiv
[193] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang,
preprint arXiv:2310.06694, 2023.
C. Re, I. Stoica, and C. Zhang, “Flexgen: High-throughput gen-
[171] E. Kurtić, E. Frantar, and D. Alistarh, “Ziplm: Inference-aware erative inference of large language models with a single gpu,”
structured pruning of language models,” Advances in Neural 2023.
Information Processing Systems, vol. 36, 2024.
[194] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Llm. int8
[172] M. Zhang, H. Chen, C. Shen, Z. Yang, L. Ou, X. Yu, and
(): 8-bit matrix multiplication for transformers at scale,” arXiv
B. Zhuang, “Loraprune: Pruning meets low-rank parameter-
preprint arXiv:2208.07339, 2022.
efficient fine-tuning,” 2023.
[195] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han,
[173] T. Chen, T. Ding, B. Yadav, I. Zharkov, and L. Liang, “Lorashear:
“Smoothquant: Accurate and efficient post-training quantization
Efficient large language model structured pruning and knowl-
for large language models,” in International Conference on Machine
edge recovery,” arXiv preprint arXiv:2310.18356, 2023.
Learning. PMLR, 2023, pp. 38 087–38 099.
[174] S. Ashkboos, M. L. Croci, M. G. d. Nascimento, T. Hoefler,
[196] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring
and J. Hensman, “Slicegpt: Compress large language models
post-training quantization in llms from comprehensive study to
by deleting rows and columns,” arXiv preprint arXiv:2401.15024,
low rank compensation,” arXiv preprint arXiv:2303.08302, 2023.
2024.
[175] Q. Zhang, S. Zuo, C. Liang, A. Bukharin, P. He, W. Chen, [197] Z. Yuan, L. Niu, J. Liu, W. Liu, X. Wang, Y. Shang, G. Sun, Q. Wu,
and T. Zhao, “Platon: Pruning large transformer models with J. Wu, and B. Wu, “Rptq: Reorder-based post-training quantiza-
upper confidence bound of weight importance,” in International tion for large language models,” arXiv preprint arXiv:2304.01089,
Conference on Machine Learning. PMLR, 2022, pp. 26 809–26 823. 2023.
[176] C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, and [198] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu,
N. Wong, “Structured pruning for efficient generative pre-trained M. Guo, and Y. Zhu, “Olive: Accelerating large language models
language models,” in Findings of the Association for Computational via hardware-friendly outlier-victim pair quantization,” in Pro-
Linguistics: ACL 2023, 2023, pp. 10 880–10 895. ceedings of the 50th Annual International Symposium on Computer
[177] S.-y. Liu, Z. Liu, X. Huang, P. Dong, and K.-T. Cheng, “Llm-fp4: 4- Architecture, 2023, pp. 1–15.
bit floating-point quantized transformers,” in The 2023 Conference [199] X. Wu, Z. Yao, and Y. He, “Zeroquant-fp: A leap forward in llms
on Empirical Methods in Natural Language Processing, 2023. post-training w4a8 quantization using floating-point formats,”
[178] L. Li, Q. Li, B. Zhang, and X. Chu, “Norm tweaking: High- arXiv preprint arXiv:2307.09782, 2023.
performance low-bit quantization of large language models,” [200] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang,
arXiv preprint arXiv:2309.02784, 2023. P. Gao, Y. Qiao, and P. Luo, “Omniquant: Omnidirectionally
[179] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, calibrated quantization for large language models,” in The Twelfth
“Qlora: Efficient finetuning of quantized llms,” Advances in Neu- International Conference on Learning Representations, 2023.
ral Information Processing Systems, vol. 36, 2024. [201] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, and B. Zhuang, “Qllm:
[180] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, Accurate and efficient low-bitwidth quantization for large lan-
X. Zhang, and Q. Tian, “Qa-lora: Quantization-aware low- guage models,” in The Twelfth International Conference on Learning
rank adaptation of large language models,” arXiv preprint Representations, 2023.
arXiv:2309.14717, 2023. [202] Y. Zhao, C.-Y. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze,
[181] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit
T. Zhao, “Loftq: Lora-fine-tuning-aware quantization for large quantization for efficient and accurate llm serving,” arXiv preprint
language models,” arXiv preprint arXiv:2310.08659, 2023. arXiv:2310.19102, 2023.
[182] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: [203] W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and
Accurate post-training quantization for generative pre-trained X. Qi, “Billm: Pushing the limit of post-training quantization for
transformers,” arXiv preprint arXiv:2210.17323, 2022. llms,” 2024.
[183] G. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S. J. Kwon, B. Kim, [204] S. Li, X. Ning, L. Wang, T. Liu, X. Shi, S. Yan, G. Dai, H. Yang, and
Y. Lee, D. Lee et al., “Lut-gemm: Quantized matrix multiplication Y. Wang, “Evaluating quantized large language models,” arXiv
based on luts for efficient inference in large-scale generative lan- preprint arXiv:2402.18158, 2024.
guage models,” in The Twelfth International Conference on Learning [205] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S.
Representations, 2023. Shao, K. Keutzer, and A. Gholami, “Kvquant: Towards 10 million
33

context length llm inference with kv cache quantization,” arXiv [230] Z. Sun, A. T. Suresh, J. H. Ro, A. Beirami, H. Jain, and F. Yu,
preprint arXiv:2401.18079, 2024. “Spectr: Fast speculative decoding via optimal transport,” arXiv
[206] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, preprint arXiv:2310.15141, 2023.
and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for [231] K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, Y. Dong,
kv cache,” arXiv preprint arXiv:2402.02750, 2024. and Y. Wang, “Flashdecoding++: Faster large language model
[207] E. Frantar and D. Alistarh, “Optimal brain compression: A frame- inference on gpus,” 2024.
work for accurate post-training quantization and pruning,” in [232] T. Gale, D. Narayanan, C. Young, and M. Zaharia, “Megablocks:
Advances in Neural Information Processing Systems, 2022. Efficient sparse training with mixture-of-experts,” in Proceedings
[208] N. Vaidya, F. Oh, and N. Comly, “Optimizing inference on of Machine Learning and Systems (MLSys), 2023.
large language models with nvidia tensorrt-llm, now pub- [233] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention:
licly available,” [Online], 2023, https://github.com/NVIDIA/ Fast and memory-efficient exact attention with io-awareness,”
TensorRT-LLM. Advances in Neural Information Processing Systems, vol. 35, pp.
[209] InternLM, “Lmdeploy,” 2024. [Online]. Available: https:// 16 344–16 359, 2022.
github.com/InternLM/lmdeploy [234] T. Dao, “Flashattention-2: Faster attention with better parallelism
[210] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and work partitioning,” arXiv preprint arXiv:2307.08691, 2023.
and W. Chen, “Lora: Low-rank adaptation of large language [235] Y. Zhai, C. Jiang, L. Wang, X. Jia, S. Zhang, Z. Chen, X. Liu,
models,” arXiv preprint arXiv:2106.09685, 2021. and Y. Zhu, “Bytetransformer: A high-performance transformer
[211] B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon boosted for variable-length inputs,” in 2023 IEEE International
and general network pruning,” in IEEE international conference on Parallel and Distributed Processing Symposium (IPDPS). IEEE,
neural networks. IEEE, 1993, pp. 293–299. 2023, pp. 344–355.
[212] Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” [236] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li,
Advances in neural information processing systems, vol. 2, 1989. E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley et al.,
[213] B. Zoph and Q. Le, “Neural architecture search with reinforce- “Deepspeed-inference: enabling efficient inference of transformer
ment learning,” in International Conference on Learning Representa- models at unprecedented scale,” in SC22: International Conference
tions, 2016. for High Performance Computing, Networking, Storage and Analysis.
[214] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, IEEE, 2022, pp. 1–15.
P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Train- [237] T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash-decoding for
ing language models to follow instructions with human feedback, long-context inference,” [Online], 2023, https://crfm.stanford.
2022,” URL https://arxiv. org/abs/2203.02155, vol. 13, 2022. edu/2023/10/12/flashdecoding.html.
[215] X. He, I. Keivanloo, Y. Xu, X. He, B. Zeng, S. Rajagopalan, and [238] HuggingFace, “Transformers: State-of-the-art machine learning
T. Chilimbi, “Magic pyramid: Accelerating inference with early for pytorch, tensorflow, and jax.” [Online], 2024, https://github.
exiting and token pruning,” Image, 2023. com/huggingface/transformers.
[216] TogetherAI, “Paving the way to efficient architectures: [239] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
Stripedhyena-7b, open source models offering a glimpse T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al.,
into a world beyond transformers,” December 2023. [Online]. “Llama: Open and efficient foundation language models,” arXiv
Available: https://www.together.ai/blog/stripedhyena-7b preprint arXiv:2302.13971, 2023.
[217] A. Jaiswal, Z. Gan, X. Du, B. Zhang, Z. Wang, and Y. Yang, [240] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang,
“Compressing llms: The truth is rarely pure and never simple,” “Glm: General language model pretraining with autoregressive
arXiv preprint arXiv:2310.01382, 2023. blank infilling,” in Proceedings of the 60th Annual Meeting of the
[218] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from Association for Computational Linguistics (Volume 1: Long Papers),
transformers via speculative decoding,” in International Confer- 2022, pp. 320–335.
ence on Machine Learning. PMLR, 2023, pp. 19 274–19 286. [241] Sensetime, “Openppl: A high-performance deep learning infer-
[219] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and ence platform,” [Online], 2023, https://openppl.ai/home.
J. Jumper, “Accelerating large language model decoding with
[242] NVIDIA, “cublas: Basic linear algebra on nvidia gpus,” [Online],
speculative sampling,” arXiv preprint arXiv:2302.01318, 2023.
2017, https://developer.nvidia.com/cublas.
[220] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh,
[243] ——, “Cutlass: Cuda templates for linear algebra subroutines,”
S. Kumar, J.-F. Kagy, and R. Agarwal, “Distillspec: Improving
[Online], 2017, https://github.com/NVIDIA/cutlass.
speculative decoding via knowledge distillation,” arXiv preprint
arXiv:2310.08461, 2023. [244] S. Wang, “Fastgemv: High-speed gemv kernels,” [Online], 2023,
[221] J. Zhang, J. Wang, H. Li, L. Shou, K. Chen, G. Chen, and S. Mehro- https://github.com/wangsiping97/FastGEMV.
tra, “Draft & verify: Lossless large language model acceleration [245] P. Tillet, H. T. Kung, and D. Cox, “Triton: an intermediate lan-
via self-speculative decoding,” arXiv preprint arXiv:2309.08168, guage and compiler for tiled neural network computations,” in
2023. Proceedings of the 3rd ACM SIGPLAN International Workshop on
[222] X. Liu, L. Hu, P. Bailis, I. Stoica, Z. Deng, A. Cheung, Machine Learning and Programming Languages, 2019, pp. 10–19.
and H. Zhang, “Online speculative decoding,” arXiv preprint [246] M. Stern, N. Shazeer, and J. Uszkoreit, “Blockwise parallel
arXiv:2310.07177, 2023. decoding for deep autoregressive models,” Advances in Neural
[223] G. Monea, A. Joulin, and E. Grave, “Pass: Parallel speculative Information Processing Systems, vol. 31, 2018.
sampling,” arXiv preprint arXiv:2311.13581, 2023. [247] P. Patel, E. Choukse, C. Zhang, Íñigo Goiri, A. Shah, S. Maleki,
[224] Z. He, Z. Zhong, T. Cai, J. D. Lee, and D. He, “Rest: Retrieval- and R. Bianchini, “Splitwise: Efficient generative llm inference
based speculative decoding,” arXiv preprint arXiv:2311.08252, using phase splitting,” arXiv preprint arXiv:2311.18677, 2023.
2023. [248] C. Hu, H. Huang, L. Xu, X. Chen, J. Xu, S. Chen, H. Feng,
[225] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Y. Y. C. Wang, S. Wang, Y. Bao, N. Sun, and Y. Shan, “Inference without
Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia, “Specinfer: interference: Disaggregate llm inference for mixed downstream
Accelerating generative llm serving with speculative inference workloads,” arXiv preprint arXiv:2401.11181, 2024.
and token tree verification,” arXiv preprint arXiv:2305.09781, 2023. [249] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and
[226] B. Spector and C. Re, “Accelerating llm inference with staged H. Zhang, “Distserve: Disaggregating prefill and decoding for
speculative decoding,” arXiv preprint arXiv:2308.04623, 2023. goodput-optimized large language model serving,” arXiv preprint
[227] Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang, arXiv:2401.09670, 2024.
“Cascade speculative drafting for even faster llm inference,” [250] X. Miao, C. Shi, J. Duan, X. Xi, D. Lin, B. Cui, and Z. Jia, “Spot-
arXiv preprint arXiv:2312.11462, 2023. serve: Serving generative large language models on preemptible
[228] Y. Fu, P. Bailis, I. Stoica, and H. Zhang, “Breaking the sequential instances,” arXiv preprint arXiv:2311.15566, 2023.
dependency of llm inference using lookahead decoding,” [251] B. Lin, T. Peng, C. Zhang, M. Sun, L. Li, H. Zhao, W. Xiao,
November 2023. [Online]. Available: https://lmsys.org/blog/ Q. Xu, X. Qiu, S. Li, Z. Ji, Y. Li, and W. Lin, “Infinite-llm: Efficient
2023-11-21-lookahead-decoding/ llm service for long context with distattention and distributed
[229] Y. Li, C. Zhang, and H. Zhang, “Eagle: Lossless acceleration kvcache,” arXiv preprint arXiv:2401.02669, 2024.
of llm decoding by feature extrapolation,” December 2023. [252] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca:
[Online]. Available: https://sites.google.com/view/eagle-llm A distributed serving system for transformer-based generative
34

models,” in Proceedings of the 16th USENIX Symposium on Operat- [274] Y. Tang, F. Liu, Y. Ni, Y. Tian, Z. Bai, Y.-Q. Hu, S. Liu, S. Jui,
ing Systems Design and Implementation, 2022, pp. 521–538. K. Han, and Y. Wang, “Rethinking optimization and architecture
[253] ModelTC, “Lightllm,” February 2024. [Online]. Available: for tiny language models,” arXiv preprint arXiv:2402.02791, 2024.
https://github.com/ModelTC/lightllm/ [275] Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T.
[254] C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan, J. Rasley, S. Ra- Lee, “Textbooks are all you need ii: phi-1.5 technical report,”
jbhandari, R. Y. Aminabadi, H. Qin, A. Bakhtiari, L. Kurilenko, arXiv preprint arXiv:2309.05463, 2023.
and Y. He, “Deepspeed-fastgen: High-throughput text genera- [276] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes,
tion for llms via mii and deepspeed-inference,” arXiv preprint A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa,
arXiv:2401.08671, 2024. O. Saarikivi et al., “Textbooks are all you need,” arXiv preprint
[255] B. Wu, Y. Zhong, Z. Zhang, G. Huang, X. Liu, and X. Jin, “Fast arXiv:2306.11644, 2023.
distributed inference serving for large language models,” arXiv [277] P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-
preprint arXiv:2305.05920, 2023. source small language model,” arXiv preprint arXiv:2401.02385,
[256] Y. Sheng, S. Cao, D. Li, B. Zhu, Z. Li, and D. Zhuo, “Fairness in 2024.
serving large language models,” arXiv preprint arXiv:2401.00588, [278] C. Zhang, D. Song, Z. Ye, and Y. Gao, “Towards the law
2024. of capacity gap in distilling language models,” arXiv preprint
arXiv:2311.07052, 2023.
[257] A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani,
[279] X. Geng and H. Liu, “Openllama: An open reproduction of
and R. Ramjee, “Sarathi: Efficient llm inference by piggybacking
llama,” May 2023. [Online]. Available: https://github.com/
decodes with chunked prefills,” arXiv preprint arXiv:2308.16369,
openlm-research/open llama
2023.
[280] M. Bellagente, J. Tow, D. Mahan, D. Phung, M. Zhuravinskyi,
[258] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S.
R. Adithyan, J. Baicoianu, B. Brooks, N. Cooper, A. Datta
Gulavani, A. Tumanov, , and R. Ramjee, “Taming throughput-
et al., “Stable lm 2 1.6 b technical report,” arXiv preprint
latency tradeoff in llm inference with sarathi-serve,” arXiv
arXiv:2402.17834, 2024.
preprint arXiv:2403.02310, 2024.
[281] “Minicpm: Unveiling the potential of end-side large language
[259] Y. Jin, C.-F. Wu, D. Brooks, and G.-Y. Wei, “S3 : Increasing gpu models,” 2024.
utilization during generative inference for higher throughput,” [282] Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong,
arXiv preprint arXiv:2306.06000, 2023. E. Chang, Y. Shi, R. Krishnamoorthi et al., “Mobilellm: Optimiz-
[260] Z. Ye, “flashinfer,” March 2024. [Online]. Available: https: ing sub-billion parameter language models for on-device use
//github.com/flashinfer-ai/flashinfer cases,” arXiv preprint arXiv:2402.14905, 2024.
[261] NVIDIA, “Fastertransformer: About transformer related opti- [283] M. team, “MLC-LLM,” 2023. [Online]. Available: https:
mization, including bert, gpt,” [Online], 2017, https://github. //github.com/mlc-ai/mlc-llm
com/NVIDIA/FasterTransformer. [284] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on
[262] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, large language model (llm) security and privacy: The good, the
H. Liu, and C. Ding, “Ftrans: Energy-efficient acceleration of bad, and the ugly,” High-Confidence Computing, p. 100211, 2024.
transformers using fpga,” arXiv preprint arXiv:2007.08563, 2020. [285] Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu,
[263] T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jun, and J. W. Lee, X. Wang, Y. Sun et al., “Personal llm agents: Insights and sur-
“Elsa: Hardware-software co-design for efficient, lightweight vey about the capability, efficiency and security,” arXiv preprint
self-attention mechanism in neural networks,” in ACM/IEEE 48th arXiv:2401.05459, 2024.
Annual International Symposium on Computer Architecture, 2021,
pp. 692–705.
[264] H. Fan, T. Chau, S. I. Venieris, R. Lee, A. Kouris, W. Luk, N. D.
Lane, and M. S. Abdelfattah, “Adaptable butterfly accelerator for
attention-based nns via hardware and algorithm co-design,” in
IEEE/ACM International Symposium on Microarchitecture, 2022, pp.
599–615.
[265] Y. Qin, Y. Wang, D. Deng, Z. Zhao, X. Yang, L. Liu, S. Wei,
Y. Hu, and S. Yin, “Fact: Ffn-attention co-optimized transformer
architecture with eager correlation prediction,” in Proceedings of
the 50th Annual International Symposium on Computer Architecture,
2023, pp. 1–14.
[266] H. Chen, J. Zhang, Y. Du, S. Xiang, Z. Yue, N. Zhang, Y. Cai,
and Z. Zhang, “Understanding the potential of fpga-based spatial
acceleration for large language model inference,” arXiv preprint
arXiv:2312.15159, 2023.
[267] S. Hong, S. Moon, J. Kim, S. Lee, M. Kim, D. Lee, and J.-Y.
Kim, “Dfx: A low-latency multi-fpga appliance for accelerating
transformer-based text generation,” in IEEE Hot Chips 34 Sympo-
sium, 2022.
[268] S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun,
S. Li, Z. Huang et al., “Flightllm: Efficient large language model
inference with a complete mapping flow on fpga,” arXiv preprint
arXiv:2401.03868, 2024.
[269] S. teams, “Sharegpt,” 2023. [Online]. Available: https://sharegpt.
com/
[270] J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal
agents: A survey,” arXiv preprint arXiv:2402.15116, 2024.
[271] I. Lee, N. Jiang, and T. Berg-Kirkpatrick, “Exploring the relation-
ship between model architecture and in-context learning ability,”
arXiv preprint arXiv:2310.08049, 2023.
[272] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley,
K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth,
E. Raff et al., “Pythia: A suite for analyzing large language models
across training and scaling,” in International Conference on Machine
Learning. PMLR, 2023, pp. 2397–2430.
[273] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge,
Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint
arXiv:2309.16609, 2023.

Sinan Ozdemir - Quick Start Guide To Large Language Models, Second Edition-Addison-Wesley (2024)
No ratings yet
Sinan Ozdemir - Quick Start Guide To Large Language Models, Second Edition-Addison-Wesley (2024)
279 pages
Sinan Ozdemir - Quick Start Guide To Large Language Models - Strategies and Best Practices For Using ChatGPT and Other LLMs-Addison-Wesley Professional (2023)
100% (5)
Sinan Ozdemir - Quick Start Guide To Large Language Models - Strategies and Best Practices For Using ChatGPT and Other LLMs-Addison-Wesley Professional (2023)
326 pages
LLMs in Production-MLC - GRC
No ratings yet
LLMs in Production-MLC - GRC
39 pages
Medical Informatics, E-Health - Fundamentals and Applications (PDFDrive) 1
100% (2)
Medical Informatics, E-Health - Fundamentals and Applications (PDFDrive) 1
495 pages
Survey On Efficient Inference For LLMs 1721657409
No ratings yet
Survey On Efficient Inference For LLMs 1721657409
36 pages
A Survey On Efficient Inference For Large Language Models
No ratings yet
A Survey On Efficient Inference For Large Language Models
35 pages
Efficient Large Language Models - A Survey
No ratings yet
Efficient Large Language Models - A Survey
67 pages
LLM Inference Unveiled Survey and Roofline Model Insights
No ratings yet
LLM Inference Unveiled Survey and Roofline Model Insights
27 pages
A Survey of Efficient LLM Inference Serving
No ratings yet
A Survey of Efficient LLM Inference Serving
20 pages
Survay
No ratings yet
Survay
59 pages
Chen Et Al. - An Agile Framework For Efficient LLM Accelerator Development and Model Inference
No ratings yet
Chen Et Al. - An Agile Framework For Efficient LLM Accelerator Development and Model Inference
9 pages
New Solutions On LLM Acceleration Optimization
No ratings yet
New Solutions On LLM Acceleration Optimization
12 pages
An Overview of Large Language Models For Statisticians
No ratings yet
An Overview of Large Language Models For Statisticians
67 pages
Understanding LLMS: A Comprehensive Overview From Training To Inference
No ratings yet
Understanding LLMS: A Comprehensive Overview From Training To Inference
30 pages
Understanding LLMS: A Comprehensive Overview From Training To Inference
No ratings yet
Understanding LLMS: A Comprehensive Overview From Training To Inference
30 pages
Weeks 1-4 AI Paper by Hand PDF
No ratings yet
Weeks 1-4 AI Paper by Hand PDF
22 pages
Serving LLM 2312.15234
No ratings yet
Serving LLM 2312.15234
32 pages
Exegpt: Constraint-Aware Resource Scheduling For LLM Inference
No ratings yet
Exegpt: Constraint-Aware Resource Scheduling For LLM Inference
16 pages
Reasoning On A Budget: A Survey of Adaptive and Controllable Test-Time Compute in Llms
No ratings yet
Reasoning On A Budget: A Survey of Adaptive and Controllable Test-Time Compute in Llms
28 pages
LLM Seminar PDF
No ratings yet
LLM Seminar PDF
10 pages
Survery On Fpga and LLM
No ratings yet
Survery On Fpga and LLM
16 pages
LLM-Based Edge Intelligence A Comprehensive Survey On Architectures Applications Security and Trustworthiness
No ratings yet
LLM-Based Edge Intelligence A Comprehensive Survey On Architectures Applications Security and Trustworthiness
58 pages
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
No ratings yet
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
16 pages
LLM Survey
No ratings yet
LLM Survey
31 pages
All You Should Kno About LLM'S
No ratings yet
All You Should Kno About LLM'S
10 pages
Deepseek LLM
No ratings yet
Deepseek LLM
48 pages
Exploring The Evolution of Large Language Models: Architectures, Applications, and Future Directions
No ratings yet
Exploring The Evolution of Large Language Models: Architectures, Applications, and Future Directions
11 pages
Toward A Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators
No ratings yet
Toward A Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators
10 pages
《A Primer on Large Language Models and their Limitations
No ratings yet
《A Primer on Large Language Models and their Limitations
33 pages
Large Language Models For Text Classification Case Study and 2rl2h1dz4onu
No ratings yet
Large Language Models For Text Classification Case Study and 2rl2h1dz4onu
12 pages
Wirelessllm: Empowering Large Language Models Towards Wireless Intelligence
No ratings yet
Wirelessllm: Empowering Large Language Models Towards Wireless Intelligence
12 pages
Inference Efficiency by Learning Task Complexity
No ratings yet
Inference Efficiency by Learning Task Complexity
9 pages
Survey
No ratings yet
Survey
23 pages
2024LLM Inference Bench ArXiv
No ratings yet
2024LLM Inference Bench ArXiv
18 pages
A Comprehensive Overview of Large Language Models: Preprint 1
No ratings yet
A Comprehensive Overview of Large Language Models: Preprint 1
46 pages
Fast Distributed Inference Serving For Large Language Models
No ratings yet
Fast Distributed Inference Serving For Large Language Models
15 pages
Aryan A. What Is LLMOps. Large Language Models in Production 2024
100% (1)
Aryan A. What Is LLMOps. Large Language Models in Production 2024
67 pages
Dokumen - Pub Quick Start Guide To Large Language Models Strategies and Best Practices For Using Chatgpt and Other Llms 9780138199425
No ratings yet
Dokumen - Pub Quick Start Guide To Large Language Models Strategies and Best Practices For Using Chatgpt and Other Llms 9780138199425
325 pages
Large Language Models and Their Use Cases
No ratings yet
Large Language Models and Their Use Cases
3 pages
Large Language Models A Comprehensive Survey of It
No ratings yet
Large Language Models A Comprehensive Survey of It
30 pages
LLM Model
No ratings yet
LLM Model
43 pages
Techniques, Tricks & Frameworks
No ratings yet
Techniques, Tricks & Frameworks
143 pages
Splitwise Efficient Generative LLM Inference Using Phase Splitting
No ratings yet
Splitwise Efficient Generative LLM Inference Using Phase Splitting
15 pages
Thinking Machines: A Survey of LLM Based Reasoning Strategies
No ratings yet
Thinking Machines: A Survey of LLM Based Reasoning Strategies
15 pages
Industrial Applications of Large Language Models
No ratings yet
Industrial Applications of Large Language Models
23 pages
Know Thy Frenemy
No ratings yet
Know Thy Frenemy
40 pages
A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I
No ratings yet
A Comprehensive Overview of Large Language Models: A A, B, C, D, E, F E, F G, I H I
46 pages
Training Large Language Models
No ratings yet
Training Large Language Models
7 pages
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
No ratings yet
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
36 pages
A Comprehensive Survey On Integrating Large Language Models With Knowledge-Based Methods
No ratings yet
A Comprehensive Survey On Integrating Large Language Models With Knowledge-Based Methods
68 pages
LLM Advancements Applications Challenges 20000 Words
No ratings yet
LLM Advancements Applications Challenges 20000 Words
3 pages
Efficiently Scaling Transformer Inference
No ratings yet
Efficiently Scaling Transformer Inference
18 pages
Chatgpt: A Technical Perspective: Presented by Teamx
No ratings yet
Chatgpt: A Technical Perspective: Presented by Teamx
18 pages
A Review On Large Language Models Archit
No ratings yet
A Review On Large Language Models Archit
32 pages
A Comprehensive Overview of Large Language Models: A B, C, D, E, F, G F, G H, J I J
No ratings yet
A Comprehensive Overview of Large Language Models: A B, C, D, E, F, G F, G H, J I J
47 pages
Attention Is All You Need.
No ratings yet
Attention Is All You Need.
5 pages
Planet, Code - PYTHON For LARGE LANGUAGE MODELS - A Beginners Handbook For Leveraging Llms Into Modern Development Workflows and Applications (2025)
No ratings yet
Planet, Code - PYTHON For LARGE LANGUAGE MODELS - A Beginners Handbook For Leveraging Llms Into Modern Development Workflows and Applications (2025)
254 pages
LLM Research Paper
No ratings yet
LLM Research Paper
2 pages
KenLM: Efficient Language Modeling in Practice
From Everand
KenLM: Efficient Language Modeling in Practice
William Smith
No ratings yet
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Transformers in Deep Learning Architecture: Definitive Reference for Developers and Engineers
From Everand
Transformers in Deep Learning Architecture: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
CS 234: Assignment #2: 1 Deep - Networks (DQN) (8 Pts Writeup)
No ratings yet
CS 234: Assignment #2: 1 Deep - Networks (DQN) (8 Pts Writeup)
9 pages
Project Proposal - Medical Image Analysis
No ratings yet
Project Proposal - Medical Image Analysis
2 pages
Distributed Fine-Tuning With The Transformers API by HuggingFace - Databricks
No ratings yet
Distributed Fine-Tuning With The Transformers API by HuggingFace - Databricks
7 pages
GRC Config Steps, ARA, EAM
100% (1)
GRC Config Steps, ARA, EAM
61 pages
Internetshutdown
No ratings yet
Internetshutdown
1 page
Ecommerce Assignment 1
No ratings yet
Ecommerce Assignment 1
3 pages
Enabling LDAP For IBM FlashSystem A9000
No ratings yet
Enabling LDAP For IBM FlashSystem A9000
28 pages
Paper HM
No ratings yet
Paper HM
50 pages
Speech Enhancement Using Kalman Filter
No ratings yet
Speech Enhancement Using Kalman Filter
14 pages
Rev 1 Module2 PLC
100% (2)
Rev 1 Module2 PLC
293 pages
Kenwood TRC-80 - User Manual PDF
73% (11)
Kenwood TRC-80 - User Manual PDF
33 pages
A Basic Derivation of The Finite Element Method (FEM)
No ratings yet
A Basic Derivation of The Finite Element Method (FEM)
11 pages
7 HomologyModelling 12oct2020
No ratings yet
7 HomologyModelling 12oct2020
8 pages
Udemy Strategic Plan
No ratings yet
Udemy Strategic Plan
27 pages
Getting Started With Power Query: Presented By: John Larimer
No ratings yet
Getting Started With Power Query: Presented By: John Larimer
45 pages
Brochurre Abbemat 300, 350,500
No ratings yet
Brochurre Abbemat 300, 350,500
7 pages
Raj Lab Basic Adobe Illustrator (CC) Guide
No ratings yet
Raj Lab Basic Adobe Illustrator (CC) Guide
28 pages
(AIEEE-2008) Ans. (4) Sol.: Section - 1: Single Choice Correct Questions
No ratings yet
(AIEEE-2008) Ans. (4) Sol.: Section - 1: Single Choice Correct Questions
36 pages
PWC Communication Tools and B.U.D.S. (Spark Series) - Shop Manual Supplement smr2016-108
No ratings yet
PWC Communication Tools and B.U.D.S. (Spark Series) - Shop Manual Supplement smr2016-108
6 pages
nm8sb 1 5
No ratings yet
nm8sb 1 5
4 pages
Spring Framework Notes
No ratings yet
Spring Framework Notes
93 pages
Communication
No ratings yet
Communication
3 pages
Satish - Quiz 1 Desktop Protection and Email - 1653325571341
No ratings yet
Satish - Quiz 1 Desktop Protection and Email - 1653325571341
11 pages
Algorithm Exam
No ratings yet
Algorithm Exam
14 pages
Resume Reference List Layout
100% (1)
Resume Reference List Layout
6 pages
Abstract 618 Letter
No ratings yet
Abstract 618 Letter
2 pages
Turbo HD DVR V3.4.70 Build160708 Release Notes - External
No ratings yet
Turbo HD DVR V3.4.70 Build160708 Release Notes - External
7 pages
CN Suggesion Ca3
No ratings yet
CN Suggesion Ca3
2 pages
5380CRP Operation Manual Main Body
No ratings yet
5380CRP Operation Manual Main Body
214 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

A Survey On Efficient Inference For Large Language Models

Uploaded by

A Survey On Efficient Inference For Large Language Models

Uploaded by

1

A Survey on Efficient Inference for Large

Output: ['Processing'] (1*dim) Output: ['! '] (1*dim) Memory

Activation Activation KV Cache

Add & LayerNorm Add & LayerNorm Peak

Efficient FFN Design

Graph and Operator

System-level Memory Management

Fig. 4. Taxonomy of efficient inference methods for Large Language Models.

4 DATA - LEVEL O PTIMIZATION by token-level pruning based on perplexity. To enhance

Prompt Pruning DYNAICL [36], Selective Context [37],

Input Prompt Summary RECOMP [34], SemanticCompression [35]

Retrieval-Augmented RAG [27], FLARE [28], REPLUG [29], Self-

Kernel-based Linear Transformer [82],

Low-Rank Linformer [77], LRT [78],

SSM HiPPO [62], LSSL [63], S4 [64], DSS [65],

Others SGConv [57], CKConv [58], Hyena [59],

Training Computational Training Memory Inference Computational Complexity

performance. length-agnostic in the decoding stage. Furthermore, in the

Post-Training Quantization GPTQ [182], LUT-GEMM [183], AWQ [184],

Quantization- LLM-QAT [177], Norm Tweaking [178],

Weight Pruning SparseGPT [162], Wanda [163], ISC [164],

Sparse Attention Sparse Transformer [147], StreamingLLM [148],

Neural Architecture Search AutoTinyBERT [135], NAS-BERT [136], Struc-

White-box KD MiniLLM [129], GKD [130], TED [131], BabyL-

Sample-level FastBERT [110], MPEE [111], Global Past-

Token-level CALM [108], SkipDecode [109]

Fig. 8. Taxonomy of model compression methods for Large Language Models.

Quantized Tensor Type Quantized Quantized Quantized

In weight-only quantization, GPTQ [182] represents an

Teacher Model Student Model

Attention Operator FlashAttention [233], [234], FlashDecod-

Graph-Level Optimization FlashAttention [233], [234], ByteTrans-

Inference Linear Operator TensorRT-LLM [208], FlashDecoding++ [231],

Fig. 14. Taxonomy of the optimization for LLM inference engine.

attention others attention others

Fig. 15. Inference runtime breakdown over multiple LLMs.

Memory Management S3 [259], vLLM [49], LightLLM [253], FlashIn-

Batching ORCA [252], vLLM [49], Sarathi [257],

Distributed Systems Splitwise [247], TetriInfer [248], Dist-

Fig. 17. Taxonomy of the optimization for LLM serving system.

Inference Optimization Inference Serving Optimization Serving

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Output: ['Processing'] (1dim) Output: ['! '] (1dim) Memory