FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Jiaao He and Jidong Zhai
Tsinghua University
arXiv:2403.11421v1 [cs.DC] 18 Mar 2024
Abstract

Cost of serving large language models (LLMs) is high, but the expensive and scarce GPUs are highly inefficient when generating tokens sequentially, unless the batch of sequences is enlarged. However, the batch size is limited by some constantly reused intermediate results, namely the KV-Cache. It occupies too much memory to fit more sequences into a GPU simultaneously. While it could be offloaded to host memory, the CPU-GPU bandwidth is an inevitable bottleneck.

We find a way to decompose the transformer models into two parts of different characteristics, one of which includes the memory-bound KV-Cache accesses. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes is an efficient option to process this part. Performance improvement comes from reduced data transmission overhead and boosted GPU throughput to process the other model part. Moreover, we address efficiency challenges brought by heterogeneity at both temporal and inter-device scopes using scheduling and performance modeling. Our system achieves 1.88x-5.04x the throughput of vLLM when serving modern LLMs with the same GPU.

1 Introduction

Large language models (LLMs) are gaining high attention. These transformer-based models are very hardware-friendly when training and evaluating [18, 20], because the main computation [...] changing the model. Different from prior use cases of neural networks, there is a larger opportunity for LLMs to run generation in batches, as an LLM usually serves many users online. The latency requirement to generate a token is much looser than in other use cases of NN models, e.g., object detection in autonomous driving. Keeping up with the reading speed of the user is the strictest latency requirement for text generation.

However, generating a new token depends on huge intermediate results of generating the previous tokens, namely the KV-cache [23]. Processing batched requests results in a much larger memory footprint, far beyond the capacity of GPU memory. Figure 1 shows the dilemma instantiated on a common 7b model and several different GPUs. Increasing the batch size makes the GPUs significantly better utilized, but the memory footprint of the KV-cache is much larger than the GPU memory. To make it worse, the KV-cache becomes even larger as more tokens are generated and the sequences get longer.

[Figure 1 plot: achieved versus maximum throughput (TFLOPS) of A10, V100, and H100 GPUs, and memory consumption (GB) split into model weight, KV-cache, and scratchpad memory, as the batch size grows.]
[...] cache is loaded into GPU memory to generate every token. Considering the enormous size of the KV-cache as suggested in Figure 1, transmitting it between GPU and host memory frequently is the bottleneck of the offloading design. Essentially, the bandwidth of PCIe is always much lower than the memory bandwidth of GPUs and even CPUs.

[Figure 2 plot: normalized performance of GPU vs. CPU (compute, memory bandwidth, memory/compute ratio) under "Hardware Specification" next to the normalized volume of the same quantities for S-Part vs. R-Part under "Model Characteristics"; accurate numbers are in Section 2.3.]
Figure 2: Performance characteristics of typical GPUs and CPUs, matching the need of two parts of the model

We study the performance characteristics, including compute throughput and memory bandwidth, of both GPUs and CPUs, as shown in Figure 2. We find that compared with the huge gap in compute power, the two types of hardware have a much closer gap in memory bandwidth.

Fortunately, we find a way to partition the transformer model into two parts, namely R-Part and S-Part. The KV-cache is included in the former. Little performance loss is introduced when completely moving the memory-bound part to the host side, as the ratio of memory bandwidth to compute throughput can fulfill its requirement. Therefore, we get our key insight that we should compute near the KV-cache on CPUs. Instead of transmitting KV-cache data over any inter-device connection, we transmit the activation tensors, which are orders of magnitude smaller than the KV-cache.

Our approach totally removes the intermediate data of sequences, the KV-cache, from GPU memory. Therefore, the batch size can be greatly increased, and the GPUs can be optimally utilized. However, such a heterogeneous approach faces three challenges to achieve high overall throughput.

Challenge 1: The CPU is busy but slow. It runs multiple tasks, including batch gathering, tokenization, and coordinating the GPUs. Performing extra computation interferes with these tasks, slowing down all of them. To add to the difficulty, the memory bandwidth of a CPU is lower than that of a GPU.

Challenge 2: The pattern of workload variation, as the generated sequences get longer, differs between the two parts. In our solution, the CPU and the GPU take turns to perform computation, and pass the results to each other. A basic pipeline of multiple batches of requests is used to utilize both of them. However, computation on the CPU takes longer as the generated sequence gets longer, while the latency of its counterpart on the GPU does not change at all. This makes it hard to always utilize both CPU and GPU.

Challenge 3: Careful orchestration is needed to balance the performance of both types of hardware. The bottleneck may be either the GPU or the CPU, because they are tightly coupled. We need to balance the two considering the heterogeneous hardware and the token generation workload. We seek a minimum CPU requirement that can fully exploit the compute power of the GPU, aiming at minimizing the overall cost.

Our system, FASTDECODE, is a CPU-GPU heterogeneous pipeline for LLM inference that addresses the challenges with the following innovations.

Innovation 1: We employ multiple out-of-chassis remote CPUs for the KV-cache and the related computation. The aggregated memory capacity and bandwidth of the system are scaled up. The distributed CPUs can achieve sufficient throughput to saturate the GPU, with moderate communication overhead.

Innovation 2: We invent a sequence-level load-stabilizing schedule to minimize idling and better utilize both types of hardware. The workload on a CPU is proportional to the total length of the sequences it maintains. To keep the latency stable, sequences are fed into the system following a workload control algorithm. Short and long sequences are simultaneously processed by CPU workers, keeping the total length of sequences stable. As a result, the overall latency of CPUs changes more gently, and both types of hardware are better utilized.

Innovation 3: We adopt a model-guided approach to orchestrate the GPU with CPUs. It quantitatively characterizes the performance bottleneck considering different aspects of the LLM inference tasks. Aggregated memory bandwidth is identified as the key metric in selecting the CPUs. For a given model and GPU setup, based on the profiling result of a micro-benchmark, we can estimate the minimum required aggregated CPU memory bandwidth for different batch sizes.

Overall, the throughput of a single GPU is saturated with a significantly larger batch size. Thanks to the scalability and aggregated power of CPUs across nodes, high overall token generation throughput is achieved with affordable GPU resources. In our evaluation, up to 5x the throughput of vLLM is achieved on the same GPU with acceptable latency.

Contributions of this paper are summarized as follows.

• We find an unconventional way to decompose the auto-regressive transformer model, with high potential of performance improvement.

• We propose a near-memory processing system over the KV-cache that exploits the aggregated memory bandwidth of out-of-chassis CPUs for higher throughput.

• We invent a sequence-level pipeline schedule to balance the growing-with-time workload and the fixed workload in token generation using LLMs.

• We create a performance model that can provide the optimal hardware configuration for different models and requirements using our system.
This paper is organized as follows. Section 2 provides background information on LLMs and hardware options to serve them. Section 3 shows our way to decompose the model, and illustrates our key insight that processing near the KV-cache can boost overall throughput. Section 4 introduces the design of our system, with techniques to resolve challenges brought by heterogeneity in terms of both workload and hardware. Section 5 includes more details of our implementation. Section 6 compares the performance of our system with other systems, and Section 7 shows more experiment results for analyzing our performance. Section 8 includes discussion with more related works, and Section 9 concludes our work.

2 Background and Motivation

2.1 Transformer Model and KV-Cache

The auto-regressive models are based on the transformer structure [28]. The key module of these models is the attention layer. We briefly illustrate its process as follows. Denote the feature vector of the i-th token in a sequence as X_i.

First, X_i is mapped to three different linear spaces, which is implemented by three fully-connected layers.

    Q_i = W_q X_i,    K_i = W_k X_i,    V_i = W_v X_i    (1)

For the i-th token, an inner product is applied between its feature vector and the K_j feature vectors of all tokens before it. This is actually the attention process, and it generates an attention vector for the current token.

    A_i = Normalize{ Q_i · K_j }  (j = 1, ..., i-1)    (2)

The attention vector is normalized, commonly using the softmax operation. Then, it is used as a weight to gather information, i.e., to add up the V_j vectors of all preceding tokens.

    O_i = Σ_{j=1}^{i-1} A_{ij} V_j    (3)

The output is applied with the final linear transformation using another fully-connected layer.

    Y_i = W_o O_i    (4)

A transformer block consists of one attention layer followed by a multi-layer perceptron (MLP) module, which includes multiple large fully-connected layers and non-linear activation functions in between. Connecting tens of such transformer blocks sequentially makes a complete decoder model, which is the backbone of most well-known LLMs, including GPT [3, 21], Llama [26, 27], and many more.

When using such models in real-world tasks including chat and text generation, tokens are produced one by one. To get the next token, only the latest token needs to be processed by the model, because the fully-connected layers and MLPs process each token independently. However, eq. (2) and eq. (3) of the attention layer involve interaction between the latest token and all previous tokens. Instead of re-computing them, K_j and V_j can be saved in memory and reused for the newly generated tokens. KV-cache [23] refers to these saved intermediate tensors. For a sequence of length S, this technique reduces the total amount of inner-product computation between feature vectors from O(S^3) to O(S^2). So, using KV-cache is mandatory in LLM inference.
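The decoding step with a KV-cache can be summarized in a short PyTorch sketch. It is a minimal single-head, single-sequence illustration of eqs. (1)-(4) under our own naming, not the paper's implementation; the 1/sqrt(h) scaling inside the softmax is the common convention and is left implicit in the paper's Normalize.

```python
import torch

def decode_step(x_i, Wq, Wk, Wv, Wo, k_cache, v_cache):
    """One decoding step of a single attention head for one sequence.

    x_i: (h,) feature vector of the latest token.
    k_cache, v_cache: (t, h) cached K_j / V_j of the t previous tokens.
    Returns Y_i and the caches extended by the new token.
    """
    q_i, k_i, v_i = Wq @ x_i, Wk @ x_i, Wv @ x_i           # eq. (1): three GeMVs
    k_cache = torch.cat([k_cache, k_i[None, :]])            # reuse, never recompute K_j, V_j
    v_cache = torch.cat([v_cache, v_i[None, :]])
    a_i = torch.softmax(k_cache @ q_i / x_i.numel() ** 0.5, dim=0)  # eq. (2)
    o_i = a_i @ v_cache                                      # eq. (3): weighted sum of V_j
    return Wo @ o_i, k_cache, v_cache                        # eq. (4)
```

Only the (h,)-sized vectors of the newest token are computed; everything about earlier tokens is read back from the cache, which is exactly the data that grows with the sequence length.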
2.2 Accelerating Decoding

There are two steps to use an LLM to respond to a request. First, during the prefilling stage, the entire input sequence from the user is processed by the model, where all the tokens in the sequence can be processed as a batch in the MLP layers. Then, in the decoding stage, the model uses the feature vector of the last known token to predict the next token to be appended to the sequence. So, each new token of the generated sequence goes through the model one by one.

Computation efficiency is extremely important, as it is directly related to the cost of serving LLMs. Unfortunately, using GPUs, generating a single sequence is poorly efficient, because the main computation workload is applying fully-connected layers to one feature vector in the decoding stage. In other words, the main computation task is multiplying matrices with vectors (GeMV). There is little chance to reuse the matrix data in near-processor memory, so accessing the global memory bounds the workload. The numerous floating point units on GPUs are underutilized.

To leverage the computation throughput of GPUs, enlarging the batch size is the most feasible approach. A batch of sequences is generated simultaneously, so multiple tokens are processed at the same time. The feature vectors are stacked together to form a matrix, and the GeMV computation becomes multiplying the weight matrix with the feature matrix (GeMM). GeMM is a highly optimized operation on GPUs. As long as the batch size is large enough, the computation power of GPUs is fully exploited.
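To make the contrast concrete, here is a hedged PyTorch sketch (the hidden size and batch size are illustrative): with one sequence, each fully-connected layer in the decoding stage degenerates to a matrix-vector product, while stacking the latest token of every sequence in a batch turns it into a matrix-matrix product that reuses the weights.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
h = 4096                                        # illustrative hidden size of a 7B-class model
W = torch.randn(h, 4 * h, device=device, dtype=dtype)   # one fully-connected layer of the MLP

# Decoding a single sequence: one feature vector, GeMV, memory-bound (W is streamed for one row).
x_single = torch.randn(1, h, device=device, dtype=dtype)
y_single = x_single @ W                          # (1, 4h)

# Decoding a batch: stack the latest tokens, GeMM, the same W is reused for all 1024 rows.
x_batch = torch.randn(1024, h, device=device, dtype=dtype)
y_batch = x_batch @ W                            # (1024, 4h)
```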
Specifically, for auto-regressive transformer-based models, Orca [30] points out that the granularity of batching can be reduced to improve performance. Instead of batching complete sequences together, it is more effective to batch the generation tasks of single tokens. This technique greatly improves the throughput of serving LLMs by introducing more chances of batching.

Unfortunately, besides leaving the memory issue of the KV-cache unaddressed, the flexible batching mechanism in Orca introduces significant memory fragmentation. The paged attention technique is adopted by vLLM [14] to address the memory issue. GPU and host memory of the KV-cache is managed by pages, so that the GPU memory can be better utilized without
fragmentation. Also, host memory can be used to store more KV-cache for more sequences, so the chance of batching token generation tasks is increased.

The chance of batching in vLLM is still limited, because swapping the large KV-cache over PCIe between GPU and host memory introduces high overhead. Therefore, vLLM has to reduce the swapping frequency. So, as the sequences get longer, the KV-cache of only a few sequences can reside in the GPU memory, resulting in a small batch size. The system achieves high throughput in less common cases, e.g., generating tokens with a shared prefix or wide beam search, where multiple independent new tokens share the same KV-cache.

FlexGen [24] studies finding an optimal offloading order of both model weights and KV-cache. Still, the KV-cache is orders of magnitude larger than the model weights. Transmitting them over the PCIe link, which is much slower than the memory bandwidth, is inherently inefficient.

In summary, it is a consensus that increasing the batch size is the most effective way to get higher throughput. However, because the KV-cache has to be in GPU memory for computation, few works can achieve an actual speedup due to its large memory footprint.

Differently, we find that the KV-cache does not need to be present in GPU memory. In this paper, we show the challenges and our solutions that unleash the power of CPUs to handle the KV-cache and achieve high token generation throughput by enabling a significantly larger batch size.

2.3 Memory-bound Workload Fits CPU

While the slower but larger host memory is used to compensate for the lack of GPU memory capacity, the CPUs are barely used to perform computation. They have up to a few TFLOPS of floating point computation throughput, negligible compared with the hundreds of TFLOPS achieved by the specialized tensor processing units on GPUs.

Table 1: Performance and Power Comparison

  Type  Model   TDP     Compute                      Memory
                        FLOPS    W per TFLOPS        GB/s   W per (GB/s)
  CPU   Xeon*   125 W   1.3 T    96.15               128    0.97
  CPU   Epyc*   155 W   1.2 T    129.2               205    0.76
  GPU   A10     150 W   125 T    1.2                 600    0.25
  GPU   V100    250 W   112 T    2.2                 900    0.27

  * Using Intel Xeon Gold 5218 and AMD Epyc 7452 CPUs.

However, in terms of memory access bandwidth, the gap between CPU and GPU is smaller. Table 1 lists the compute throughput and memory bandwidth of several common ones. Modern server-class CPUs can achieve hundreds of GB/s. The memory bandwidth of mid-range GPUs, e.g. the NVIDIA A10, is only a few times larger than that of the CPUs. A dual-socket AMD Epyc server can achieve 68% of its memory bandwidth. Even the top-level GPUs can barely have more than 10x the bandwidth. Additionally, different from the huge gap between GPUs of different levels, the memory bandwidth remains similar from entry-level to high-end in each generation of CPUs.

CPUs are more attractive considering the cost. As a general-purpose processor that exists in every computer, they are much more widely deployed than GPUs. It is much easier to acquire a large number of CPUs at relatively low cost. We can easily enlarge the memory capacity and bandwidth by adding standard DIMMs to the servers. On the contrary, GPU memory is not only expensive but also hard to extend, as the memory is soldered on the circuit board.

Table 1 includes the power consumption of the hardware as a metric of efficiency. The maximum power consumption of CPUs is a few times lower than that of GPUs. Besides, when memory access is the major workload, the hardware usually does not consume as much power as the TDP. So, the actual efficiency gap is even smaller than our estimation.

In conclusion, the CPU is an appealing option for memory-bound jobs.
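Our reading of the two per-watt columns, not stated explicitly in the text, is simply TDP divided by throughput; for the Xeon and A10 rows:

```latex
\frac{125\ \mathrm{W}}{1.3\ \mathrm{TFLOPS}} \approx 96.2\ \tfrac{\mathrm{W}}{\mathrm{TFLOPS}},\quad
\frac{125\ \mathrm{W}}{128\ \mathrm{GB/s}} \approx 0.98\ \tfrac{\mathrm{W}}{\mathrm{GB/s}},\quad
\frac{150\ \mathrm{W}}{125\ \mathrm{TFLOPS}} = 1.2\ \tfrac{\mathrm{W}}{\mathrm{TFLOPS}},\quad
\frac{150\ \mathrm{W}}{600\ \mathrm{GB/s}} = 0.25\ \tfrac{\mathrm{W}}{\mathrm{GB/s}}
```

By this reading, the CPU is roughly 80x less power-efficient per FLOP but only about 3-4x less power-efficient per byte of memory bandwidth, which is the gap that matters for the memory-bound part.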
3 Observation and Insights

3.1 Performance Dilemma and Decomposition

[Figure 3 diagram: with a small batch, GPU utilization is poor; with a large batch, GPU utilization is good but the per-sequence K & V tensors fall out of GPU memory. The shared model parameters (W_{q,k,v}, W_o & MLP, producing Q and Y) form S-Part, while the per-sequence K & V and O interactions form R-Part.]
Figure 3: Performance dilemma in auto-regressive generation

Throughput of the fully-connected layers in the transformer blocks increases significantly with a larger batch size. However, the attention operation benefits little when enlarging the batch size, because each sequence has different K and V. When generating sequences in a batch, instead of becoming GeMM, the GeMV becomes batched GeMV, which is still memory-bound. To make it worse, the size of the KV-cache is proportional to the batch size. As shown in Figure 3, when using a larger batch size to better saturate the computation power of the GPU, the memory footprint of the KV-cache is much larger than the capacity of GPU memory. In fact, the key difficulty is handling
the memory-intensive operations with the KV-cache when using only GPUs for computation.

We categorize the computation workload during generating a token into two parts. They are denoted using dashed and solid lines, respectively, in Figure 3, and a code sketch of the split follows the list.

• R-Part: The auto-Regressive computation related to preceding tokens in the sequence, as formulated in eq. (2) and eq. (3). Each sequence is processed independently with its own KV-cache. It benefits little from enlarging the batch size, but introduces a huge memory footprint. Notably, no model parameter is involved in R-Part.

• S-Part: The rest of the model, where sequences Share the same parameters. It mainly consists of fully-connected layers. GPU utilization can be significantly increased by batching tokens of more sequences together in S-Part.
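A hedged PyTorch sketch of the split for one transformer block, with our own function names and a single-head simplification (the paper's S-worker/R-worker code is not shown in the text): S-Part is the shared-parameter GeMMs that batch well, R-Part is the parameter-free, per-sequence attention over that sequence's own KV-cache.

```python
import torch

def s_part_qkv(X, Wq, Wk, Wv):
    """S-Part, first half: shared-weight projections over the whole batch (GPU-friendly GeMMs).
    X: (batch, h) latest-token features; weights: (h, h)."""
    return X @ Wq, X @ Wk, X @ Wv

def r_part(Q, k_caches, v_caches):
    """R-Part: eq. (2) and eq. (3) per sequence, against that sequence's own KV-cache.
    No model parameters are touched; the cost is dominated by reading the caches."""
    outputs = []
    for q, k_cache, v_cache in zip(Q, k_caches, v_caches):
        a = torch.softmax(k_cache @ q / q.numel() ** 0.5, dim=0)
        outputs.append(a @ v_cache)
    return torch.stack(outputs)                  # (batch, h)

def s_part_rest(O, Wo, mlp):
    """S-Part, second half: output projection and the MLP, again batched GeMMs."""
    return mlp(O @ Wo)
```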
3.2 CPUs can Undertake More in LLM

In the LLM inference workload, the KV-cache takes enormous memory and is ideal to be placed in the larger CPU-side DRAM, despite the bottleneck of moving the data from host memory to GPU memory for computation.

However, the R-Part is an inherently memory-bound workload, where using a GPU gets little benefit over using CPUs. Therefore, we get our key insight: not only should we store the KV-cache in CPU memory, but we should also process it with CPUs. In other words, R-Part should be processed near the data.

As a result, the KV-cache is removed from GPU memory, and the batch size can be as large as 1024 or more. Recall the throughput curve in Figure 1. The computation in S-Part is able to utilize the GPUs with much higher efficiency. The overall token generation throughput is significantly increased.

Two concerns are intuitively raised about such a design: (1) CPUs may not be fast enough to match the throughput of GPUs. (2) Transmitting the intermediate data between S-Part and R-Part across devices may be slow.

Table 2: Latency of Computation Operations

  Operation                 Batch Size   Lat. / ms          TFLOPS
                                         GPU      CPU       GPU      CPU
  R-Part                    1            0.084    0.287     0.050    0.015
  (eq. (2) & eq. (3))       1024         8.32     8.12      0.516    0.529
  S-Part                    1            1.46     49.5      0.366    0.011
  (~16x eq. (4))            1024         7.08     611       77.5     0.899

Table 2 compares the latency of computation tasks in generating tokens on two CPU nodes with those on an A10 GPU. We highlight our selected mapping between device and model part: R-Part on the CPUs and S-Part on the GPU. When generating tokens using a widely-used 7B foundation model, the latency of R-Part is almost identical between using the GPU or the CPUs, as the total hardware memory bandwidths are similar.

When computing on GPUs, the latency of S-Part is of a similar magnitude to R-Part. However, the S-Part cannot be moved to CPUs, because it involves much heavier computation that can be greatly accelerated by the GPU. The poor throughput of CPUs leads to very high latency. It is also notable that as the batch size becomes 1024x larger, the latency is only 5x larger. There is more than 100x potential throughput gain. Overall, with similar latency and throughput for R-Part, efficiency is increased in S-Part and thus in the whole system.

Table 3: Size of Data and Communication Latency

  Data                       Batch Size   Data Size   Latency / ms
                                                      PCIe*    RoCE*
  Model Weight               N/A          402 MB      12.6     32.2
  KV-Cache                   1            4.19 MB     0.131    0.335
                             1024         4.29 GB     134      343
  Intermediate               1            32.7 KB     <0.01    0.03
  Vectors (ours)             1024         33.5 MB     1.04     2.68

  * Calculated using 32 GB/s PCIe 4.0 x16 and 100 Gbps RoCE

The sizes of data within a transformer block of a typical 7B model and the latency to send them across different types of inter-device connection are shown in Table 3. Instead of previous works that may send the huge model or KV-cache across the links, we only communicate the intermediate vectors, i.e., Q_i, K_i, V_i, O_i in eq. (2) and eq. (3). These vectors are orders of magnitude smaller than the others. The estimated latency to send a large batch of them from 1024 sequences over the network is only a few milliseconds. The latency is a moderate portion of the computation latency presented in Table 2. Compared with existing systems, the communication overhead of our near-KV-cache processing design is much smaller.

In brief, our approach enables serving LLMs with a much larger batch size. It brings huge potential throughput gain, despite the minor overhead that is minimized in our system.
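A rough consistency check of the Table 3 entries under assumptions of ours (hidden size h = 4096, fp16, one transformer block, and a cached length of 256 tokens behind the per-sequence KV-cache row); the exact configuration behind the table is not spelled out in the text:

```python
h, fp16_bytes = 4096, 2

qkvo_per_token = 4 * h * fp16_bytes           # Q, K, V, O vectors of one token, one block
print(qkvo_per_token / 1e3)                   # ~32.8 KB, close to the 32.7 KB row
print(1024 * qkvo_per_token / 1e6)            # ~33.6 MB for a 1024-sequence batch

cached_tokens = 256                           # assumed sequence length behind the KV-cache row
kv_cache_one_seq = 2 * cached_tokens * h * fp16_bytes   # K and V of every cached token
print(kv_cache_one_seq / 1e6)                 # ~4.19 MB per sequence, about two orders larger
```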
4 Methodology

4.1 System Overview

The local CPU in a server equipped with a GPU may be too busy and too slow to provide sufficient throughput for processing the R-Part. Our approach uses out-of-chassis CPUs, whose aggregated compute power is exploited.

Figure 4 shows the basic design of FASTDECODE, which consists of two types of workers.

An S-worker computes the S-Part of an LLM. It may use one or multiple GPUs. All weights of the model are on the S-worker, and partitioned by a certain way of model parallelism if using multiple GPUs. It acts as a typical token generation worker simply using GPUs, except for its much larger batch size and the way R-Part is computed. To generate a new token, it goes through the transformer blocks. After Q_i, K_i, V_i are produced by the fully-connected layers in S-Part, instead of computing R-Part locally, the S-worker sends the different parts of them related to different sequences to the R-workers,
[Figure 4 diagram: basic design of FASTDECODE. In the original design, a GPU node holds the model parameters, S-Part, and R-Part with the KV-cache; in FASTDECODE, the S-worker keeps only the model parameters and S-Part, and exchanges the Q_i, K_i, V_i and O_i vectors with the R-workers.]

[Figure 5 diagram: (a) No pipeline: the S-worker and the R-workers alternate (S-Part 1, R-Part 1, S-Part 2, R-Part 2), so one side is always idling. (b) Ideal case of the basic 2-stage pipeline: two micro-batches A and B interleave, with S-Part A-1, S-Part B-1, S-Part A-2, S-Part B-2 on the S-worker and R-Part A-1, R-Part B-1, R-Part A-2 on R-workers 1 and 2.]

In this system, the S-worker and the R-workers work in turns to generate a token. When one type of worker is working, the other idles, as shown in Figure 5(a).
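A hedged, synchronous sketch of the basic 2-stage pipeline of Figure 5(b), with two micro-batches A and B taking turns; run_s_part and run_r_part are stand-ins we made up for the GPU and CPU work, and the real system overlaps these steps with asynchronous communication.

```python
from concurrent.futures import ThreadPoolExecutor

def run_s_part(batch):      # placeholder for the batched GeMMs of S-Part on the GPU
    return f"QKV({batch})"

def run_r_part(qkv):        # placeholder for the KV-cache attention of R-Part on the CPUs
    return f"O({qkv})"

def pipeline_step(cpu_pool):
    """While the R-workers process micro-batch A's attention, the S-worker
    already runs micro-batch B's S-Part, so neither side idles (Figure 5(b))."""
    qkv_a = run_s_part("A")                       # S-Part A-1
    fut_a = cpu_pool.submit(run_r_part, qkv_a)    # R-Part A-1 starts on the CPU side
    qkv_b = run_s_part("B")                       # S-Part B-1 overlaps with R-Part A-1
    fut_b = cpu_pool.submit(run_r_part, qkv_b)
    return fut_a.result(), fut_b.result()         # O vectors of both micro-batches

with ThreadPoolExecutor(max_workers=2) as pool:
    print(pipeline_step(pool))
```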
[...] may move the workload of R-Part, keeping its total amount unchanged. The triangular area of the idling S-worker is moved to the area of the idling R-workers. As the total overall latency can be indicated by the area under the latency curve, this can reduce the total overall latency by as much as 20%. In other words, the overall throughput is increased by 20%. Besides, the maximum latency to generate a token is reduced by 50%, indicated by the highest point of the latency curve.

We identify that the long latency of R-Part is caused by all sequences being long. As previous work [30] indicates, sequences of different lengths can be batched together in S-Part to increase throughput. And in R-Part, processing the sequences separately on different workers introduces no extra overhead. Therefore, we schedule the sequences in a pipeline to control the total length being processed at each step, as shown in Figure 7.
[Figure 7 diagram: load per token at each step (steps 1-6) when serving in large batches versus with the sequence-level load-stabilizing schedule.]

This schedule reduces the total length of all sequences by mixing sequences of different lengths together. To be more specific, assume that there are originally B sequences of length S. The total length of sequences can be W_max = BS in the final step if all sequences are started together. In our schedule, the size of micro-batches is defined as follows.

    M = BF / S    (5)

So, in the final step of generating a micro-batch, we have the maximum total sequence length.

    W'_max = Σ_{k=1}^{S/F} M·kF = B(S+F)/2 ≈ BS/2 = W_max/2    (6)

Although S = 3F in the example in Figure 7 leads to W'_max = (2/3) W_max, S is usually much larger than F in real cases (thousands compared to tens). So, (S + F) is closer to S. As [...]
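Plugging in values of the scale used later in the evaluation (S = 1024; F = 32 is our own choice of a step interval in the tens) gives

```latex
\frac{W'_{\max}}{W_{\max}} \;=\; \frac{B(S+F)/2}{BS} \;=\; \frac{S+F}{2S}
\;=\; \frac{1024+32}{2\cdot 1024} \;\approx\; 0.52,
```

so the peak load, and with it the worst-case R-Part latency, is close to the ideal half of the unscheduled peak.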
The sequence-level load-stabilizing schedule can be generalized to a load control algorithm that dynamically determines when a new micro-batch starts. Given a maximum load limit W_lim, the earliest starting step index for a micro-batch of M sequences can be calculated based on the current micro-batches being processed. Algorithm 1 shows the algorithm that figures out the earliest step index, with a little more information to maintain when a micro-batch actually starts. As the maximum total length is reached at the last step of each micro-batch, the algorithm maintains the workload at these steps. The margin between the maximum workload and the limit is used to get the maximum length of the new micro-batch at the specific step. Then, we get the earliest step index constrained by a certain peak of workload.

Notably, during cold starting of the schedule, there may be an issue in setting W_lim to W'_max. With sufficient input to be processed, a large number of sequences are started at step 0. Then, step S becomes the peak, and the remaining sequences can only launch at step S + 1. Instead, we need to gradually increase W_lim, or use a fixed F in the beginning.
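Algorithm 1 itself is not reproduced in the text; the following Python sketch is our reconstruction of the load control it describes, where the scheduler tracks the workload at the peak (final) step of every running micro-batch and finds the earliest step at which a new micro-batch of M sequences stays under the limit.

```python
def earliest_start_step(peak_load, M, S, W_lim, now):
    """peak_load: {step: total sequence length} at the final step of each running micro-batch.
    A micro-batch of M sequences started at step t adds about M * (p - t) tokens of load
    at a later peak step p (and creates its own peak at step t + S).
    This is a hedged reconstruction of Algorithm 1, not the authors' code."""
    t = now
    while True:
        if all(not (t <= p < t + S) or w + M * (p - t) <= W_lim
               for p, w in peak_load.items()):
            return t                     # margin at every affected peak is large enough
        t += 1                           # otherwise delay the new micro-batch by one step
```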
4.3 Workload-balanced Hardware Selection

FASTDECODE introduces hardware heterogeneity between the GPU and CPUs. To optimally utilize both of them, besides stabilizing the workload, the selection of hardware also makes a significant impact. Specifically, we need to determine the number of CPUs to use in our system. If we have too many CPUs, it is a waste because they have to wait for the GPU. If we have too few CPUs, it is the GPU that idles.

Also, the expected service latency to LLM users should be considered. In some cases, the acceptable latency to generate a single sequence is large, so we have more space to optimize the throughput. In other cases, the batch size should be reduced to fulfill a stricter latency limit.

We introduce a quantitative approach to determine the two most important parameters of our system: the batch size, B, and the number of CPUs, P.

To start with, there are two given conditions: the LLM and the GPU we use. Then, we need a few reference metrics. The throughput to compute the S-Part of the model on the GPU can be measured by a micro-benchmark. As shown in Figure 1, the throughput varies significantly as the batch size, B, changes. Therefore, we use a function T(B) to indicate the latency to compute the S-Part of one transformer block on the GPU. Also, we have the user-specified expected maximum length of sequences, S.

We assume that perfect efficiency is achieved by the pipelines. As the latency of S-Part is fixed, and the latency of R-Part equals the latency of S-Part in such a pipeline, the latency to generate a sequence using a model of N layers is calculated as follows.

    2N·S·T(B) ≤ L    (7)

L indicates the expected latency to generate a sequence. A larger B leads to a larger T(B), as well as better overall throughput. The maximum possible B is selected such that the above constraint is fulfilled.

If there is no constraint on L, B is selected based on the overall throughput on the GPU.

    E(B) = B / T(B)    (8)

E(B) is proportional to the GPU throughput shown in Figure 1. It increases sharply when B is small, indicating that increasing B brings much benefit. When it becomes more stable as B is large enough, the performance gains little. In this case, we should select a B beyond which further increase brings only marginal throughput improvement.

Another constraint on B is the host-side memory capacity. Assume that each CPU has memory for the K and V vectors of C tokens, which can be calculated from the size of the memory and the specifications of the model.

    (1/2) B·S ≤ C·P    (9)

In fact, this constraint is barely the actual limitation, because CPUs commonly have abundant memory.

After having B determined, P is minimized by the constraint of computing R-Part in a similar time to S-Part. Assume that we are using the same model of CPUs in the system. We use another micro-benchmark to get R, which indicates the latency for one CPU to process one token for R-Part. So, the CPU takes time R·k in R-Part when generating a token appended to a sequence of k existing tokens. We get the constraint for P using R.

    R · B·S / (2P) ≈ T(B)    (10)

Then, we get a direct approximation for the optimal number of CPUs to work with a GPU.

    P ≈ B·S·R / (2·T(B)) = (1/2) S·R·E(B)    (11)

Briefly, to cope with the increased GPU efficiency E(B) thanks to increased B, more CPUs are needed. However, as we select the B where increasing it brings marginal E(B), and P has to be an integer, it has little impact when tweaking B. Also, a longer expected sequence length S makes the CPUs more heavily loaded, so more of them are needed.

In summary, given the specifications of the hardware and the model, we first measure T(B) and R with micro-benchmarks. Then, a definite optimal choice of B and P is given by Equation (7), Equation (9), and Equation (11).

Furthermore, assume that the feature dimension of the model is h. The workload of S-Part, reflected in T(B), is proportional to h^2. Meanwhile, R, the per-token workload of
R-Part, is proportional to h. So, P is approximately proportional to 1/h. The optimal number of CPUs tends to be smaller for larger h, which commonly appears in larger models.
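A hedged sketch of how Equations (7)-(11) could be applied in practice, treating T(B) and R as profiled callables and with the candidate batch sizes, names, and fallback behavior being our own choices:

```python
def choose_batch_and_cpus(T, R, S, N, L=None, candidates=(128, 256, 512, 1024, 2048)):
    """Pick the batch size B and the number of CPUs P following eqs. (7)-(11).

    T(B): profiled latency (s) of S-Part of one transformer block on the GPU at batch size B.
    R:    profiled R-Part latency (s) of one CPU, per generated token and per cached token.
    S:    expected maximum sequence length; N: number of transformer blocks.
    L:    optional per-sequence latency budget (s); eq. (9) on host memory is skipped here
          because the paper notes it is rarely the binding constraint.
    """
    feasible = [B for B in candidates if L is None or 2 * N * S * T(B) <= L]   # eq. (7)
    B = max(feasible) if feasible else min(candidates)
    P = max(1, round(B * S * R / (2 * T(B))))        # eq. (11), i.e. S * R * E(B) / 2
    return B, P
```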
5 Implementation

The S-worker of FASTDECODE is implemented using PyTorch, for ease of adapting to various models and serving APIs. R-Part is stripped from the model, and the token generation scheduling is taken over by our system. The R-worker is implemented using C++. As a light-weight service, it receives data from the S-worker. We find that the performance of the R-worker is more critical but understudied.

5.1 Mix-precision CPU Attention

Optimizing the performance of the R-worker is critical to the overall throughput of the system. Compared to the well-established neural network libraries on GPUs, there is a lack of existing high-performance neural network libraries on CPUs that can be used out of the box. Most current LLMs use 16-bit floating point numbers (fp16), which is not supported by most CPU libraries. However, using fp32 libraries doubles the volume of memory access, which means doubling the latency.

We develop a mixed-precision attention operator that reads fp16 data from memory, converts it to fp32 in registers, and computes. Luckily, we find intrinsics in the AVX2 instruction set to perform the vectorized fp16-fp32 conversion in one instruction. Although an fp16 floating point multiply-and-add (FMA) instruction is included in the AVX-512 instruction set, we exclude it for compatibility with a wider range of CPUs.
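The kernel itself is C++ with AVX intrinsics; a hedged PyTorch analogue of the same idea is shown below. Note that .float() in PyTorch materializes an fp32 copy, whereas the real operator converts lane by lane in registers so that only fp16 is ever read from DRAM.

```python
import torch

def mixed_precision_attention(q, k_cache, v_cache):
    """q: (h,) fp16; k_cache, v_cache: (t, h) stored in fp16 to halve memory traffic.
    The dot products and the softmax are accumulated in fp32 for accuracy."""
    scores = k_cache.float() @ q.float() / q.numel() ** 0.5
    weights = torch.softmax(scores, dim=0)
    return (weights @ v_cache.float()).half()
```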
5.2 Supporting Quantization

The above fp16-fp32 mixed-precision implementation is lossless compared with the original fp16 computation on GPUs. We also support more aggressive performance optimization if model accuracy degradation is tolerated. Model quantization is widely used and welcomed to boost our performance.

Various quantization algorithms are supported, with a few extra functions to implement. Given the Q_i, K_i, V_i vectors in fp16, the user function adds K_i and V_i to the KV-cache after quantization. Q_i is transformed as the quantization algorithm requires to produce O_i.

Our throughput benefits from storing the KV-cache data in a quantized format. Suppose that 4-bit integers are used to [...]

[...] latency. It is mandatory in many cases as the model cannot fit in the memory of a single GPU.

FASTDECODE naturally has good support for such techniques. A separate group of R-workers is assigned to every S-worker that plays the part of a worker in the parallel groups. For inter-layer model parallelism, i.e., pipeline parallelism, different transformer blocks are processed by each worker. So, the R-Part related to each worker is totally independent.

For intra-layer model parallelism, i.e., tensor-model parallelism, the fully-connected layers before and after R-Part are commonly partitioned across attention heads [20]. Therefore, each group of R-workers maintains an independent KV-cache for different attention heads.

In both types of model parallelism, the workloads of S-Part and R-Part are divided by the same factor. Therefore, the number of R-workers to work with each S-worker remains unchanged, as revealed in Equation (11). The latency of FASTDECODE directly benefits from the resulting latency reduction.

6 Evaluation

6.1 Setup

Models and tasks: All the auto-regressive models use the same transformer backbone. We choose state-of-the-art open-source LLMs, Llama [27] and OPT [32]. We evaluate system performance over different sizes of the models, including Llama-7b, Llama-13b, and Opt-175b.

We reduce the number of layers in the models to reduce evaluation cost. The estimated throughput and latency of the original model is reported. The number of layers is strongly proportional to the overall latency, and inversely proportional to the throughput. So, the throughput and latency of the real original model can be directly calculated. Fairness of comparing systems is not lost by using models with a reduced number of layers, because there is little optimization opportunity across layers, and no system has done optimizations at the layer dimension. This is justified by Figure 8, showing the latency of the Opt-175b model using different numbers of layers, i.e., the number of transformer blocks. When keeping other settings unchanged, they are almost linearly related.

[Figure 8 plot: latency (ms) versus the number of layers.]
are used to generate a sequence over a short prompt. The total length of the generated sequences is 1024 for both models. The models run over the fp16 data format, without any quantization. [...] dual sockets of AMD Epyc CPUs are used as the R-workers of FASTDECODE. The cluster is connected via an Infiniband network.

As no existing system exploits out-of-chassis CPUs, all the baseline systems run on the GPU node only. While it may not look fair because FASTDECODE introduces extra hardware, existing approaches can only use the CPU nodes as stand-alone CPU workers for the text generation task, contributing less than 1% of the total throughput beside the GPU.

Baselines: vLLM [14] uses the paged attention technique to manage the KV-cache, efficiently swapping its parts to host memory. As vLLM is reported to totally outperform Orca [30], we omit Orca in our experiments.

TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM) is the newest generation of FasterTransformer (https://github.com/NVIDIA/FasterTransformer). Both systems are developed by NVIDIA with state-of-the-art performance optimizations that can best utilize the GPUs with hardware-specific tuning.

FastLLM (https://github.com/ztxz16/fastllm) is an accelerated LLM serving system crafted by experts. It adopts a pure C++ implementation that targets fast deployment, low latency, and high throughput with various accelerators.

A Vanilla implementation of Llama [26, 27] (https://github.com/facebookresearch/llama) is released with the model. The pure PyTorch [22] implementation and its derived versions are widely used in both academia and industry. It includes a simple KV-cache implementation on GPUs, and achieves competitive throughput thanks to the optimized PyTorch library.

6.2 Maximum Throughput

Figure 9 shows the measured throughput of all the systems. The number in brackets after "ours" indicates the batch size of FASTDECODE. The possible batch size is enormous in our system, because the distributed host memory is large enough for thousands of sequences. Increasing the batch size can increase the utilization of the GPUs, and thus the overall throughput. However, as there may be constraints on latency, the batch size should be set properly. Also, we observe that the performance gain of increasing the batch size gets smaller when the batch size is large enough. When the batch size increases by 8x from 128 to 1024, we only get 2x the throughput.

[Figure 9 plot: token generating throughput of vLLM, TensorRT-LLM, fastllm, Vanilla, and Ours (batch sizes 128, 256, 512, 1024) on Llama-7b and Llama-13b.]
Figure 9: Token generating throughput

Comparing with the most powerful baseline, vLLM, in the generation task of the 7b model, FASTDECODE achieves a maximum throughput of more than 2k tokens per second, 4x more than vLLM and 8.7x more than TensorRT-LLM. For the 13b model, our maximum throughput is 4.12x the throughput of vLLM. Even when reducing the batch size to 128 for lower latency, we achieve 2.32x / 1.88x the throughput of vLLM.

When running vLLM, we observe that it can achieve a batch size of 1024 in the beginning, because the sequences are short and the KV-cache can be entirely stored in GPU memory. However, as the sequences get longer, it finds less batching opportunity, and can only use a batch size similar to the other GPU-only systems. TensorRT-LLM performs better than fastllm and vanilla because of its more efficient CUDA kernels. However, the maximum possible batch size of these systems is barely more than 16, limited by the GPU memory. Thus, they have much lower throughput than ours. The average throughput of our system is 6.71x and 6.04x the average throughput of all baseline systems, respectively.

6.3 Token Generating Latency

Figure 10 shows the measured latency to generate a new token by all the systems. The wide bar indicates the average latency between generating two adjacent tokens, and the three narrow bars show the P = 0.01/0.5/0.99 latency, respectively.

When we maximize our batch size to target the highest throughput, the latency is about 3.5x the latency when using an 8x smaller batch size. This also implies the GPU utilization improvement of the increased batch size.
[Figure 10 plot: per-token latency (ms) of vLLM, TensorRT-LLM, fastllm, Vanilla, and Ours (128, 256, 512, 1024) on Llama-7b and Llama-13b, showing min, median, mean, and max.]
Figure 10: Token generating latency

TensorRT-LLM achieves the minimum average latency of 34.2 ms and 77.0 ms per token, respectively, on the two models. Using a batch size of 128, the average latency of FASTDECODE is 120.8 ms and 191.6 ms. With at most 2.5x larger latency on the 7b model, we have 4.5x the throughput. Potentially, given 4x the GPUs, we are able to retain the throughput improvement while reducing the per-token average latency to the same level as TensorRT-LLM.

The latency of vLLM is as low as the other systems when generating most tokens, because it uses a similarly small batch size in most steps. However, the average latency of vLLM is higher than all setups of FASTDECODE. This is because the few steps that swap the KV-cache between host and GPU memory are significantly slow, a key bottleneck of all systems that offload the KV-cache.

7 Performance Analysis

7.1 Coping with Heterogeneity

[Figure 11 plot: latency (ms) at each step over time for Ours w/o SLS, Ours w/ SLS, and Vanilla, on Llama-7b and Llama-13b.]
Figure 11: Latency at each step

[...] because the latency becomes dominated by the increasing latency of R-Part, and the GPU gets underutilized.

After a cold start period of high latency and low throughput due to the smaller batch size, the sequence-level load-stabilizing schedule provides a stable latency at 66%-70% of the maximum latency without it. The sustainable throughput is increased by 8%-11% by the technique. Overhead of the pipeline stops the system from achieving the ideal benefit of 50% maximum latency reduction and 20% throughput improvement indicated by Figure 6.

The smaller improvement on the 7b model compared with the 13b model is also caused by overload of the R-workers. The feature dimension h of the 13b model is larger than that of the 7b model. The workload of the fully-connected layers in S-Part is proportional to O(h^2), while it is O(h) in R-Part. Therefore, it is expected that R-Part has more workload than S-Part in smaller models.
length of generated sequences to 768. The latency of S-Part gets closer to the sustainable latency of the sequence-level load-stabilizing schedule, indicating more balanced workloads between the S-worker and the R-workers. The throughput improvement increases from 8% to 13%.

7.2 Scalability

Employing more CPUs is a basic requirement of FASTDECODE for certain workloads. So, the scalability of the R-workers is important to the overall efficiency. We use a fixed workload, generating tokens after 1024 sequences of length 1024 or 128. Each worker is bound to a socket. We evaluate the scalability of FASTDECODE on up to 8 sockets on 4 nodes.

[Figure 13 plot: throughput of 7b-1024, 13b-1024, and 13b-128 versus the Ideal scaling, over 1, 2, 4, and 8 R-workers (sockets).]
Figure 13: Strong scalability of FASTDECODE

Figure 13 shows the strong scalability experiment results of FASTDECODE over the 7b and 13b models. When the length of sequences is 1024, FASTDECODE achieves 72.8% and 84.1% efficiency when scaling up from 1 socket to 8 sockets, on the 7b and 13b models, respectively. As the total latency is smaller in the 7b model, the overhead of the pipeline is more significant, leading to lower efficiency with 8 sockets. When the sequence is as short as 128, the efficiency is 37.6% for the 13b model. Using 8 sockets achieves even lower throughput than using 4 sockets with 75.9% efficiency. This is implied by our performance model. Shorter sequences require fewer R-workers. Employing more R-workers does not increase the performance when the S-worker is the bottleneck.

[Figure 14 plot: normalized throughput of Baseline, 2x R-worker (1.17x), and 2x R-worker + 2x S-worker (1.84x).]
Figure 14: Using more workers in FASTDECODE

We show the ability to scale up FASTDECODE by using more S-workers with model parallelism. In this case, we use the Opt-175b model, which requires fewer R-workers per S-worker because the Opt model is larger, as Equation (11) suggests. As a baseline setting, we use one A10 GPU with two Epyc CPUs in a node. Both kinds of hardware are well utilized, while the R-workers are slightly overloaded. Figure 14 shows the results of introducing more hardware. When only using 2x the CPUs as R-workers, the overall throughput is only slightly increased. For the bar on the right, we double the number of both R-workers and S-workers. The two S-workers work in model parallelism by partitioning all the parameter and activation tensors. FASTDECODE achieves 1.84x the throughput using double the hardware.

[Figure 15 plot: timeline of the S-Worker and R-Workers 1-8 over about 40 ms, broken down into Wait, Send/Recv QKV, Copy QKV, S-Part before R-Part, R-Part, and S-Part after R-Part.]
Figure 15: Latency of two layers in a 13b model

To see the detailed utilization of the different workers, we trace the different operations of the workers, as shown in Figure 15. The R-workers are busy with computation more than 75% of the time. The performance variance across nodes makes some of the workers wait for others.

Copying the QKV data from GPU to CPU takes 3 ms, during the iteration of 43 ms to generate a new token. Sending QKV across the network takes another 7.4 ms. In total, the distributed design of FASTDECODE introduces about 25% overhead to transmit the feature vectors. Notably, we changed the asynchronous communication to synchronous mode for profiling. In production, the asynchronous communication can overlap part of the communication overhead.

The S-worker is actually working less than 50% of the time in the profiled case, due to overloading and performance variance of the R-workers. However, the overall throughput is
still competitive, as the efficiency to compute S-Part is significantly increased because of the much larger batch size.

8 Related Works and Discussion

Optimizing the Attention Operator: For training LLMs, FlashAttention [5, 6] is a widely-used optimized fused attention operator. It achieves performance improvement by eliminating the need to store the memory-consuming intermediate A matrix. The idea is ported to the token generation scenario by FlashDecoding [9]. However, as A is a vector of much smaller size in decoding, it has less impact than in training. These techniques can be ported to CPUs to accelerate our computation.

The idea of window attention [1], which is further extended in StreamingLLM [29], is a variation of the original attention algorithm that reduces the number of tokens each new token interacts with. Our system benefits from these techniques in the same way as from quantization [2, 7, 17] and pruning [13, 25]. They reduce the workload of R-Part, while the user has to be aware of the potential change in model quality.

Speculative token generation [19] is currently the only approach that essentially increases the efficiency of the attention operation, and it depends on accurate prediction of tokens.

Distributed and Heterogeneous LLM Serving: Typical model partitioning systems [11, 15, 33] can barely handle the token generation case where the hardware and the workload of the model are both heterogeneous. Besides offloading the KV-cache to host memory [10, 12, 14, 24], peer GPUs [16] may also be the place to offload, despite the expense. In fact, FPGAs [4, 31] may be a better choice to store and process the KV-cache.

The idea of this paper can be generalized as using heterogeneous hardware for the different parts of LLMs for better efficiency of the whole system. Besides CPUs, possible selections of hardware for the memory-intensive part include cheaper GPUs, FPGAs, and domain-specific chips. A memory pool directly connected to the GPU by CXL [8] would be an even more reasonable option for the KV-cache. Among the various possibilities, our approach that unleashes the power of CPUs for R-Part is an immediately feasible one that uses existing hardware with high affordability.

9 Conclusion

[...] load-stabilizing schedule and a performance model. Finally, as the GPU is better utilized thanks to the greatly enlarged batch size, the overall throughput is competitive.

References

[1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020.

[2] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7947-7969. Association for Computational Linguistics, 2021.

[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[4] Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, and Zhiru Zhang. Understanding the potential of fpga-based spatial acceleration for large language model inference, 2023.

[5] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691, 2023.
[8] Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung. Direct access, high-performance memory disaggregation with directcxl. In 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 287-294. USENIX Association, 2022.

[9] Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model inference on gpus. CoRR, abs/2311.01282, 2023.

[10] Chien-Chin Huang, Gu Jin, and Jinyang Li. Swapadvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, pages 1341-1355. ACM, 2020.

[11] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. In Proceedings of Machine Learning and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31 - April 2, 2019. mlsys.org, 2019.

[12] Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. Deepum: Tensor migration and prefetching in unified memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023, pages 207-221. ACM, 2023.

[13] Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.

[14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, pages 611-626. ACM, 2023.

[15] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Alpaserve: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023, pages 663-679. USENIX Association, 2023.

[16] Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, and Wei Lin. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024.

[17] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978, 2023.

[18] Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, Junyang Lin, Guanyu Feng, Zeqiang Huang, Jie Gao, Aohan Zeng, Jianwei Zhang, Runxin Zhong, Tianhui Shi, Sha Liu, Weimin Zheng, Jie Tang, Hongxia Yang, Xin Liu, Jidong Zhai, and Wenguang Chen. Bagualu: targeting brain scale pretrained models with over 37 million cores. In PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2 - 6, 2022, pages 192-204. ACM, 2022.

[19] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative LLM serving with speculative inference and token tree verification. CoRR, abs/2305.09781, 2023.

[20] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 58. ACM, 2021.

[21] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.

[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024-8035, 2019.
[23] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.

[24] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 31094-31116. PMLR, 2023.

[25] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu, 2023.

[26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.

[27] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023.

[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998-6008, 2017.

[29] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. CoRR, abs/2309.17453, 2023.

[30] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 521-538. USENIX Association, 2022.

[31] Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, and Yu Wang. Flightllm: Efficient large language model inference with a complete mapping flow on fpgas, 2024.

[32] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.

[33] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 559-578. USENIX Association, 2022.