
LLM in a flash:

Efficient Large Language Model Inference with Limited Memory

Keivan Alizadeh, Iman Mirzadeh∗, Dmitry Belenko∗, S. Karen Khatamifard,


Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
Apple †

arXiv:2312.11514v2 [cs.CL] 4 Jan 2024

Abstract

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

[Figure 1 omitted: bar chart of per-token inference latency (ms) for naive loading versus our method, with the components "compute", "load from flash", and "memory management", on Falcon 7B (CPU), OPT 6.7B (CPU), and OPT 6.7B (GPU).]

Figure 1: Inference latency of 1 token when half the memory of the model is available. Our method selectively loads parameters on demand per token generation step. The latency is the time needed to load from flash multiple times back and forth during the generation of all tokens and the time needed for the computations, averaged over all generated tokens.

∗ Major Contribution
† {kalizadehvahid, imirzadeh, d_belenko, skhatamifard, minsik, cdelmundo, mrastegari, farajtabar}@apple.com

1 Introduction

In recent years, large language models (LLMs), such as GPT-3 (Brown et al., 2020), OPT (Zhang et al., 2022b), and PaLM (Chowdhery et al., 2022), have demonstrated strong performance across a wide range of natural language tasks. However, the unprecedented capabilities of these models come with substantial computational and memory requirements for inference. LLMs can contain hundreds of billions or even trillions of parameters, which makes them challenging to load and run efficiently, especially on resource-constrained devices.

Currently, the standard approach is to load the entire model into DRAM (Dynamic Random Access Memory) for inference (Rajbhandari et al., 2021; Aminabadi et al., 2022). However, this severely limits the maximum model size that can be run. For example, a 7 billion parameter model requires over 14GB of memory just to load the parameters in half-precision floating point format, exceeding the capabilities of most edge devices.

To address this limitation, we propose to store the model parameters in flash memory, which is at least an order of magnitude larger than DRAM. Then, during inference, we directly load the required subset of parameters from the flash memory, avoiding the need to fit the entire model in DRAM.
[Figure 2 omitted: (a) a unified memory architecture with flash memory (~100 GB) connected to DRAM (~10 GB) at ~1 GB/s, and DRAM connected to the CPU/GPU at ~100 GB/s; (b) random read throughput of flash memory (MB/s), bounded above by the sequential-read upper bound, versus chunk size (4-64 KB) for 2 to 32 threads.]

Figure 2: (a) Flash memory offers significantly higher capacity but suffers from much lower bandwidth compared to DRAM and CPU/GPU caches and registers. (b) The throughput for random reads in flash memory increases with the size of sequential chunks and the number of threads.

Our method is built on top of recent works that have shown LLMs exhibit a high degree of sparsity in the Feed Forward Network (FFN) layers, with models like OPT (Zhang et al., 2022b), Falcon (Almazrouei et al., 2023), and Persimmon (Elsen et al., 2023) exhibiting more than 90% sparsity (Mirzadeh et al., 2023; Liu et al., 2023b). We exploit this sparsity to selectively load from flash memory only the parameters that either have non-zero input or are predicted to have non-zero output. Specifically, we discuss a hardware-inspired cost model that includes flash memory, DRAM, and compute (CPU or GPU). Then, we introduce two complementary techniques to minimize data transfer and maximize flash memory throughput:

• Windowing: We load and temporarily cache parameters for only the past few tokens, reusing the aggregate sparsity structure predicted over the past few tokens. This sliding window approach reduces the number of I/O requests needed to load weights.

• Row-column bundling: We store a concatenated row and column of the up-projection and down-projection layers to read bigger contiguous chunks from flash memory. This increases throughput by reading larger chunks.

To further minimize the number of weights to be transferred from flash memory to DRAM, we also employ methods to predict FFN sparsity and avoid loading zeroed-out parameters, akin to approaches documented in Deja Vu (Li and Lu, 2023). Together, windowing and sparsity prediction allow us to load only 2% of the FFN layer from flash for each inference query. We also propose a static memory preallocation to minimize transfers within DRAM and reduce inference latency. Our load-from-flash cost model captures the tradeoff between loading less data and reading bigger chunks. Optimizing this cost model and selectively loading parameters on demand yields flash loading strategies that can run models 2x larger than the device's DRAM capacity and speed up inference by 4-5x and 20-25x compared to naive implementations on CPU and GPU, respectively. It significantly outperforms the baseline approach, which reloads the model's weights on every forward pass.

2 Flash Memory & LLM Inference

In this section, we explore the characteristics of memory storage systems (e.g., flash, DRAM) and their implications for large language model (LLM) inference. Our aim is to elucidate the challenges and hardware-specific considerations essential for algorithm design, particularly in optimizing inference when working with flash memory.

2.1 Bandwidth and Energy Constraints

While modern NAND flash memories offer high bandwidth and low latency, they fall well short of the performance levels of DRAM (Dynamic Random-Access Memory), in terms of both latency and throughput. Figure 2a illustrates these differences. A naive inference implementation that relies on NAND flash memory might necessitate reloading the entire model for each forward pass. This process is not only time-consuming, often taking seconds for even compressed models, but it also consumes more energy than transferring data from DRAM to the CPU or GPU's internal memory.

Load times for the models can be a problem even in the traditional DRAM-resident setup where weights are not reloaded partially: the initial, full load of the model still incurs a penalty, particularly in situations requiring rapid response times for the first token. Our approach, leveraging activation sparsity in LLMs, addresses these challenges by enabling selective reading of model weights, thereby reducing the response latency.
2.2 Read Throughput

Flash memory systems perform optimally with large sequential reads. For instance, benchmarks on an Apple MacBook Pro M2 with 2TB flash demonstrate speeds exceeding 6GiB/s for a 1GiB linear read of an uncached file. However, this high bandwidth is not replicated for smaller, random reads due to the inherent multi-phase nature of these reads, encompassing the operating system, drivers, interrupt handling, and the flash controller, among others. Each phase introduces latency, disproportionately affecting smaller reads.

To circumvent these limitations, we advocate two primary strategies, which can be employed jointly. The first involves reading larger chunks of data. For smaller blocks, a substantial part of the overall read time is spent waiting for data transfer to begin; this is often referred to as latency to first byte. This latency considerably reduces the overall throughput of each read operation, because the overall measured throughput has to take into account not just the speed of transfer once it begins, but also the latency before it begins, which penalizes small reads. This means that if we coalesce the reads for rows and columns of the FFN matrices, we can pay the latency cost only once for any given row/column pair in both matrices, and higher throughput can be realized. This principle is depicted in Figure 2b. A perhaps counterintuitive yet interesting observation is that in some scenarios it is worthwhile to read more than needed (but in larger chunks) and then discard, rather than reading only the strictly necessary parts in smaller chunks. The second strategy leverages parallelized reads, utilizing the inherent parallelism within storage stacks and flash controllers. Our results indicate that throughputs appropriate for sparse LLM inference are achievable on modern off-the-shelf hardware using 32KiB or larger random reads across multiple threads.

Motivated by the challenges described in this section, in Section 3 we propose methods to optimize data transfer volume and enhance read throughput to significantly improve inference speeds.

3 Load From Flash

This section addresses the challenge of conducting inference on devices where the available DRAM is substantially smaller than the size of the model. This necessitates storing the full model weights in flash memory. Our primary metric for evaluating various flash loading strategies is latency, dissected into three distinct components: the I/O cost of loading from flash, the overhead of managing memory with newly loaded data, and the compute cost for inference operations.
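This decomposition can be written compactly as follows. The notation is ours, a restatement of the cost model rather than the paper's exact formulation; the latency-to-first-byte term corresponds to the per-read overhead discussed in Section 2.2:

\[
t_{\text{token}} \;=\; t_{\text{flash}} + t_{\text{mem}} + t_{\text{compute}},
\qquad
t_{\text{flash}} \;\approx\; n_{\text{reads}} \cdot \ell_{\text{first\,byte}} \;+\; \frac{\text{bytes read}}{\text{peak bandwidth}} .
\]

The strategies listed next reduce t_flash by shrinking both the number of separate reads and the bytes read, and reduce t_mem through preallocation (Section 3.3).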
It is important to note that our focus is not on the
overall measured throughput has to take into ac-
compute aspect of the process, as it is orthogonal to
count not just the speed of transfer once it begins,
the core concerns of our work. This delineation al-
but the latency before it begins as well, which pe-
lows us to concentrate on optimizing flash memory
nalizes small reads. This means that if we coalesce
interactions and memory management to achieve
the reads for rows and colums of the FFN matri-
efficient inference on memory-constrained devices.
ces, we can pay the latency cost only once for any
Finally, we will elaborate on the implementation
given row/column pair in both matrices, and higher
of these strategies in subsequent sections.
throughput can be realized. This principle is de-
picted in Figure 2b. Perhaps a counterintuitive yet 3.1 Reducing Data Transfer
interesting observation is that in some scenarios, it
will be worthwhile to read more than needed (but in Our methodology leverages the inherent activation
larger chunks) and then discard, than only reading sparsity found in Feed-Forward Network (FFN)
strictly the necessary parts but in smaller chunks. models, as documented in preceding research. The
The second strategy leverages parallelized reads, OPT 6.7B model, for instance, exhibits a notable
utilizing the inherent parallelism within storage 97% sparsity within its FFN layer. Similarly, the
stacks and flash controllers. Our results indicate Falcon 7B model has been adapted through fine-
that throughputs appropriate for sparse LLM infer- tuning, which involves swapping their activation
ence are achievable on modern off-the-shelf hard- functions to ReLU, resulting in 95% sparsity while
ware using 32KiB or larger random reads across being almost similar in accuracy (Mirzadeh et al.,
multiple threads. 2023). In light of this information, our approach
Motivated by the challenges described in this sec- 1
It is notable that, by data we mean weights of the neural
tion, in section 3, we propose methods to optimize network. However, our developed techniques can be eas-
ily generalized to other data types transferred and used for
data transfer volume and enhance read throughput LLM inference, such as activations or KV cache, as suggested
to significantly enhance inference speeds. by Sheng et al. (2023).

3
[Figure 3 omitted: (a) a histogram of output magnitude (before ReLU) for elements the predictor detected as positive versus the up projection; (b) the low-rank predictor: the up projection of size N = dmodel by M = dffn is replaced at prediction time by a low-rank (rank R) projection followed by a sigmoid thresholded at 0.5.]

Figure 3: (a) Preactivations of tokens in one sequence in OPT 6.7B. The blue graph shows the preactivation of elements that the predictor detected as positive, while the green graph is for the up projection. As can be seen, most of the false positives are close to 0 and the false negatives constitute a small portion of the elements. (b) A small low-rank predictor finds out which intermediate neurons are going to be activated, instead of running the heavy up projection.

In light of this information, our approach involves the iterative transfer of only the essential, non-sparse data from flash memory to DRAM for processing during inference. While we employ the 7B models as practical examples to elucidate our approach, our findings are adaptable and can be extrapolated to both larger and smaller scale models.

Selective Persistence Strategy. We opt to retain the embeddings and the matrices within the attention mechanism of the transformer constantly in RAM. For the Feed-Forward Network (FFN) portions, only the non-sparse segments are dynamically loaded into DRAM as needed. Storing the attention weights, which constitute approximately one-third of the model's size, in memory allows for more efficient computation and quicker access, thereby enhancing inference performance without the need for full model loading.

Anticipating ReLU Sparsity. The ReLU activation function naturally induces over 90% sparsity in the FFN's intermediate outputs, which reduces the memory footprint for subsequent layers that utilize these sparse outputs. However, the preceding layer, namely the up projection for OPT and Falcon, must be fully present in memory. To avoid loading the entire up projection matrix, we follow Liu et al. (2023b) and employ a low-rank predictor to identify the elements zeroed by ReLU (see Figure 3b). In contrast to their work, our predictor needs only the output of the current layer's attention module, and not the previous layer's FFN module. We have observed that postponing the prediction to the current layer is not only sufficient for hardware-aware weight loading algorithm design, but also leads to a more accurate outcome due to the deferred input. We thereby only load the elements indicated by the predictor.

Table 1: Using predictors does not significantly change the accuracy of zero-shot metrics, as the predictor of each layer accurately identifies sparsity.

Zero-Shot Task    OPT 6.7B    with Predictor
Arc Easy          66.1        66.2
Arc Challenge     30.6        30.6
HellaSwag         50.3        49.8
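To make the predictor concrete, below is a minimal PyTorch-style sketch for a single FFN block. The dimensions follow OPT 6.7B (dmodel = 4096, dffn = 16384), the rank r = 128 used for most layers, and the 0.5 threshold from Figure 3b; the class and function names are our own illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LowRankPredictor(nn.Module):
    """Predicts which FFN intermediate neurons survive ReLU, using only the
    current layer's attention output (see Figure 3b)."""
    def __init__(self, d_model: int = 4096, d_ffn: int = 16384, rank: int = 128):
        super().__init__()
        self.to_rank = nn.Linear(d_model, rank, bias=False)   # d_model -> r
        self.to_ffn = nn.Linear(rank, d_ffn, bias=False)      # r -> d_ffn

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # Probability that each intermediate neuron is positive after ReLU.
        return torch.sigmoid(self.to_ffn(self.to_rank(attn_out)))

def predicted_active_neurons(predictor: LowRankPredictor,
                             attn_out: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Indices of FFN neurons whose weights should be loaded for this token."""
    probs = predictor(attn_out)                       # shape: (d_ffn,)
    return torch.nonzero(probs > threshold, as_tuple=False).flatten()
```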
Neuron Data Management via Sliding Window Technique. In our study, we define an active neuron as one that yields a positive output in our low-rank predictor model. Our approach focuses on managing neuron data by employing a sliding window technique. This technique entails maintaining a DRAM cache of only the weight rows that were predicted to be required by the recent subset of input tokens. The key aspect of this technique is the incremental loading of neuron data that differs between the current input token and its immediate predecessors. This strategy allows for efficient memory utilization, as it frees up memory resources previously allocated to cached weights required by tokens that are no longer within the sliding window (as depicted in Figure 4b).

From a mathematical standpoint, let sagg(k) denote the cumulative use of neuron data across a sequence of k input tokens. Our memory architecture is designed to store an average of sagg(k) in DRAM. As we process each new token, the incremental neuron data, mathematically represented as sagg(k + 1) − sagg(k), is loaded from flash memory into DRAM. This practice is grounded in the observed trend of decreasing aggregated neuron usage over time. Consequently, larger values of k result in a lesser volume of data being loaded for each new token (refer to Figure 4a), while smaller values of k can help conserve the DRAM that is used to store the cached weights. In determining the size of the sliding window, the aim is to maximize it within the constraints imposed by the available memory capacity.
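A minimal sketch of this sliding-window bookkeeping is shown below, assuming the per-token active sets come from a predictor such as the one sketched above. The class and variable names are ours; the actual weight transfer and DRAM layout are handled as described in Sections 3.2 and 3.3.

```python
from collections import deque

class SlidingWindowNeuronCache:
    """Tracks the union of predicted-active neurons over the last k tokens.

    to_load  corresponds to sagg(k+1) - sagg(k): neurons newly required now.
    to_evict are neurons no longer predicted by any token in the window.
    """
    def __init__(self, window_size: int = 5):
        self.window = deque(maxlen=window_size)  # per-token sets of neuron ids
        self.cached = set()                      # neuron ids currently in DRAM

    def step(self, active_neurons: set[int]) -> tuple[set[int], set[int]]:
        self.window.append(set(active_neurons))
        required = set().union(*self.window)     # aggregate usage over the window
        to_load = required - self.cached
        to_evict = self.cached - required
        self.cached = required
        return to_load, to_evict
```

For each generated token, `to_load` is fetched from flash (bundled as in Section 3.2) and `to_evict` is released using the preallocated layout of Section 3.3.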
[Figure 4 omitted: (a) aggregated neuron use (percentage) versus window size, showing sagg(k), the increment sagg(k + 1) − sagg(k), and the aggregated usage; (b) the sliding window over the token sequence "Once Upon A Time There Was A Kid Who Had A Dream", marking neurons to be deleted, neurons kept from the initial window, and new neurons.]

Figure 4: (a) Aggregated neuron use of the tenth layer of Falcon 7B. As can be seen, the slope of aggregated neuron use is decreasing; other layers exhibit the same pattern. (b) Instead of deleting neurons that were brought into DRAM, we keep the active neurons of the past 5 tokens: when the new token "Was" is being processed, only a small amount of data needs to be changed.

3.2 Improving Transfer Throughput with Increased Chunk Sizes

To increase data throughput from flash memory, it is crucial to read data in larger chunks, preferably sized as multiples of the block size of the underlying storage pool. In this section, we detail the strategy we have employed to augment the chunk sizes for more efficient flash memory reads.

Bundling Columns and Rows. For OPT and Falcon models, the usage of the ith column from the upward projection and the ith row from the downward projection coincides with the activation of the ith intermediate neuron. Consequently, by storing these corresponding columns and rows together in flash memory, we can consolidate the data into larger chunks for reading. Refer to Figure 5 for an illustration of this bundling approach. If each weight element of the network is stored in num_bytes bytes, such bundling doubles the chunk size from dmodel × num_bytes to 2dmodel × num_bytes, as shown in Figure 5. Our analysis and experiments show this increases the throughput of the model.

[Figure 5 omitted: the predictor's output selects which up-projection columns and down-projection rows are read together from flash memory.]

Figure 5: By bundling columns of the up projection and rows of the down projection in OPT 6.7B, we load 2x larger chunks instead of reading columns or rows separately.
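As an illustration of this layout, the sketch below stores each neuron's up-projection column and down-projection row as one contiguous record, so a single seek-and-read retrieves the whole bundle. The file format and fp32 dtype are our simplifying assumptions, not the paper's on-disk format.

```python
import numpy as np

def write_bundled_ffn(up_proj: np.ndarray, down_proj: np.ndarray, path: str) -> None:
    """up_proj:   (d_ffn, d_model) with column i of the up projection stored as row i
    down_proj: (d_ffn, d_model) rows of the down projection.
    Record i = concat(up_proj[i], down_proj[i]): one 2*d_model chunk per neuron."""
    bundled = np.concatenate([up_proj, down_proj], axis=1)   # (d_ffn, 2*d_model)
    bundled.astype(np.float32).tofile(path)

def read_neuron_bundle(path: str, neuron_id: int, d_model: int = 4096) -> np.ndarray:
    """One contiguous read of 2 * d_model * 4 bytes (32 KiB for OPT 6.7B in fp32)."""
    record_elems = 2 * d_model
    with open(path, "rb") as f:
        f.seek(neuron_id * record_elems * 4)                 # 4 bytes per fp32 element
        return np.fromfile(f, dtype=np.float32, count=record_elems)
```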
Bundling Based on Co-activation. We had a conjecture that neurons may be highly correlated in their activity patterns, which may enable further bundling. To verify this, we calculated the activations of neurons over the C4 validation dataset. For each neuron, its coactivation with other neurons forms a power law distribution, as depicted in Figure 6a. Now, let us call the neuron that coactivates with a given neuron the most its closest friend. Indeed, the closest friend of each neuron coactivates with it very often. As Figure 6b demonstrates, it is interesting to see that each neuron and its closest friend coactivate with each other at least 95% of the time. The graphs for the 4th closest friend and the 8th closest friend are also drawn. Based on this information, we decided to put a bundle of each neuron and its closest friend in flash memory; whenever a neuron is predicted to be active, we bring its closest friend too. Unfortunately, this resulted in loading highly active neurons multiple times, and the bundling worked against our original intention: the neurons that are very active are the closest friends of almost everyone. We intentionally present this negative result, as we believe it may lead to interesting future research on how to effectively bundle neurons and how to leverage this for efficient inference.
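The co-activation statistics behind Figure 6 can be gathered along the following lines from recorded binary activation masks of one layer over a validation set. This is our illustrative code, not the analysis script used for the paper.

```python
import numpy as np

def coactivation_stats(masks: np.ndarray, neuron: int, top_k: int = 8):
    """masks: (num_tokens, d_ffn) boolean array, True where a neuron was active.

    Returns the indices of the neurons most often co-active with `neuron`
    and, for each, the fraction of `neuron`'s activations they share
    (the first entry is the "closest friend")."""
    active_rows = masks[masks[:, neuron]]              # tokens where `neuron` fired
    co_counts = active_rows.sum(axis=0).astype(np.float64)
    co_counts[neuron] = -1.0                           # exclude the neuron itself
    friends = np.argsort(co_counts)[::-1][:top_k]      # most co-activated first
    co_rates = co_counts[friends] / max(len(active_rows), 1)
    return friends, co_rates
```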
[Figure 6 omitted: four panels: (a) coactivation intensity (frequency vs. top coactivated neurons); histograms of the percentage of coactivation with (b) the closest friend, (c) the 4th closest friend, and (d) the 8th closest friend.]

Figure 6: (a) For a randomly selected neuron from the 10th layer of OPT 6.7B, there exists a group of neurons which are coactivated with high probability. (b) The closest friend of a neuron is defined as the most coactivated neuron in the same layer; the closest friend of every neuron in OPT 6.7B almost always gets coactivated. (c) The 3rd closest friend gets coactivated with each neuron 86% of the time on average. (d) The 7th closest friend seems to be less relevant and does not coactivate with the neuron very often.

3.3 Optimized Data Management in DRAM

Although data transfer within DRAM is more efficient than accessing flash memory, it still incurs a non-negligible cost. When introducing data for new neurons, reallocating the matrix and appending new matrices can lead to significant overhead due to the need for rewriting existing neuron data in DRAM. This is particularly costly when a substantial portion (approximately 25%) of the Feed-Forward Networks (FFNs) in DRAM needs to be rewritten. To address this issue, we adopt an alternative memory management strategy. This involves the preallocation of all necessary memory and the establishment of a corresponding data structure for efficient management. The data structure comprises elements such as pointer, matrix, bias, num_used, and last_k_active, shown in Figure 7.

Each row in the matrix represents the concatenated row of the 'up project' and the column of the 'down project' of a neuron. The pointer vector indicates the original neuron index corresponding to each row in the matrix. The bias for the 'up project' in the original model is represented in the corresponding bias element. The num_used parameter tracks the number of rows currently utilized in the matrix, initially set to zero. The matrix for the ith layer is preallocated with a size of Reqi × 2dmodel, where Reqi denotes the maximum number of neurons required for the specified window size in a subset of the C4 validation set. By allocating a sufficient amount of memory for each layer in advance, we minimize the need for frequent reallocation. Finally, the last_k_active component identifies the neurons from the original model that were most recently activated using the last k tokens.

The following operations are done during inference, as depicted in Figure 7 and sketched in the example after this list.

1. Deleting Neurons: Neurons that are no longer required are identified efficiently in linear time, utilizing the last_k_active data and the current prediction. The matrix, pointer, and scalars of these redundant neurons are replaced with the most recent elements, and their count is subtracted from num_rows. For O(c) neurons to be deleted, a memory rewrite of the order O(c × dmodel) is required.

2. Bringing in New Neurons: Necessary neuron data is retrieved from flash memory. The corresponding pointers and scalars are read from DRAM, and these rows are then inserted into the matrix, extending from num_rows to num_rows + num_new. This approach eliminates the need for reallocating memory in DRAM and copying existing data, reducing inference latency.

3. Inference Process: For the inference operation, the first half of the matrix, matrix[:num_rows,:d_model], is used as the 'up project', and the transposed second half, matrix[:num_rows,d_model:].transpose(), serves as the 'down project'. This configuration is possible because the order of neurons in the intermediate output of the feed-forward layer does not alter the final output, allowing for a streamlined inference process.
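Below is a simplified NumPy sketch of this structure and the three operations, following Figure 7's scheme of copying the last used rows over deleted rows so the block stays contiguous. Names such as `scalar` (the per-neuron 'up project' bias) and the exact interfaces are our simplifications rather than the paper's implementation.

```python
import numpy as np

class LayerNeuronStore:
    """Preallocated DRAM buffer for one FFN layer (see Figure 7)."""
    def __init__(self, req_neurons: int, d_model: int):
        self.matrix = np.zeros((req_neurons, 2 * d_model), dtype=np.float32)
        self.pointer = np.zeros(req_neurons, dtype=np.int64)   # original neuron index
        self.scalar = np.zeros(req_neurons, dtype=np.float32)  # 'up project' bias
        self.num_rows = 0
        self.d_model = d_model

    def delete(self, stale_rows) -> None:
        """Overwrite each stale row with the current last row (O(c * d_model))."""
        for row in sorted(stale_rows, reverse=True):
            last = self.num_rows - 1
            if row != last:
                self.matrix[row] = self.matrix[last]
                self.pointer[row] = self.pointer[last]
                self.scalar[row] = self.scalar[last]
            self.num_rows -= 1

    def insert(self, neuron_ids, bundles, biases) -> None:
        """Append newly loaded neurons (read from flash) after the used rows."""
        n = len(neuron_ids)
        end = self.num_rows + n
        self.matrix[self.num_rows:end] = bundles
        self.pointer[self.num_rows:end] = neuron_ids
        self.scalar[self.num_rows:end] = biases
        self.num_rows = end

    def ffn(self, x: np.ndarray) -> np.ndarray:
        """Sparse FFN using only the currently loaded neurons."""
        up = self.matrix[:self.num_rows, :self.d_model]     # (n, d_model)
        down = self.matrix[:self.num_rows, self.d_model:]   # (n, d_model)
        hidden = np.maximum(x @ up.T + self.scalar[:self.num_rows], 0.0)  # ReLU
        return hidden @ down
```

The swap-with-last deletion avoids shifting the remaining rows, which is what keeps the rewrite cost proportional to the number of deleted neurons rather than to the whole cached matrix.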
[Figure 7 omitted: the pointer/scalar/matrix structure shown at three stages: (1) start of deletion, (2) deletion complete, (3) insertion complete, with rows marked as to be deleted, remaining, and new, and num_rows updated at each stage.]

Figure 7: Memory management. First we copy the last elements over the neurons being deleted to maintain a consecutive block of memory; then the required ones are stacked at the end. This prevents copying the whole data multiple times.

These steps collectively ensure efficient memory management during inference, optimizing the neural network's performance and resource utilization.

4 Results

Experimental Setup: Our experiment is designed to optimize inference efficiency on personal devices. To this end, we process sequences individually, running only one sequence at a time. This approach allows us to allocate a specific portion of DRAM for the Key-Value (KV) cache while primarily focusing on the model size. This strategy is particularly effective when dealing with only one sequence/query at a time.²

² For the OPT 6.7B model with context length 2048, the KV cache requires 2048 × 2dmodel elements, which is only 8% of the model size. Also, the KV cache itself can be held in flash memory.

For the implementation of our inference process, we utilize HuggingFace's transformers and KV caching. This setup is tested under the condition where approximately half of the model size is available in DRAM. We select this amount as a showcase of the idea of hosting the LLM in flash. With a different level of sparsity or by employing quantization, one can work with a smaller available DRAM capacity as well. Such a configuration demonstrates the practicality of executing inference with lower memory footprints.

Hardware Configuration. Our models are evaluated using two distinct hardware setups. The first setup includes an Apple M1 Max with a 1TB solid-state drive (SSD) for flash memory. In this configuration, computations are performed on the CPU, and the models are maintained in a 32-bit format. The second setup involves a Linux machine equipped with a 24 GB NVIDIA GeForce RTX 4090 graphics card. For this machine, computations are GPU-based, and models are run in the bfloat16 format. For both setups, we operate under the assumption that almost half of the total available memory (DRAM plus GPU memory) is allocated for model computations.

Models. We use OPT 6.7B (Zhang et al., 2022b) and a sparsified Falcon 7B (Mirzadeh et al., 2023) model for our evaluations.

Baselines. For methods not employing sparsity or weight sharing, at least half of the model must be transferred from flash memory during the forward pass. This necessity arises because, initially, only half of the model is available in DRAM, but as the forward pass progresses, the entire model capacity is utilized. Consequently, any data not present at the start must be transferred at least once. Thus, the most efficient theoretical baseline involves loading half of the model size from flash memory into DRAM. This optimal I/O scenario serves as our primary baseline. Comparative methods, such as FlexGen (Sheng et al., 2023) and Petals (Borzunov et al., 2023), are also constrained by the limited available DRAM or GPU memory, and therefore cannot surpass this theoretical I/O efficiency.

Flash Memory Data Loading Implementation. To optimize data loading from flash memory, our system employs reads parallelized over 32 threads. This multithreaded approach is intended both to better amortize the latency to first byte, by not waiting for each read sequentially, and to maximize read throughput by reading multiple streams at once (Figure 2b).
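A simplified illustration of such multithreaded chunked reading, using only the Python standard library, is shown below. It is a stand-in for the actual implementation; in particular, the real benchmark additionally disables the OS page cache (e.g., F_NOCACHE on macOS or direct I/O on Linux, as discussed next), which is omitted here for brevity.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunks(path: str, offsets: list[int], chunk_size: int = 32 * 1024,
                num_threads: int = 32) -> list[bytes]:
    """Read many chunks from flash in parallel.

    os.pread releases the GIL while blocked on I/O, so a plain thread pool is
    enough to keep several requests in flight and amortize latency to first byte."""
    fd = os.open(path, os.O_RDONLY)
    try:
        def one_read(offset: int) -> bytes:
            return os.pread(fd, chunk_size, offset)
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(one_read, offsets))
    finally:
        os.close(fd)
```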
Caching Considerations for Data Loading from Flash Memory. When data is read from flash memory, the operating system typically caches these pages, anticipating future reuse. However, this caching mechanism consumes additional memory in DRAM beyond what is allocated for the model. To accurately assess the real throughput of flash memory under limited DRAM conditions, benchmarks should be conducted without relying on caching. Practical systems may or may not rely on the filesystem cache, depending on requirements.

For the purpose of our hardware benchmarking in this study, we deliberately and significantly pessimize our NVMe throughput measurements. On macOS and iOS, we employ the F_NOCACHE flag with the fcntl() function, while on Linux, we use DirectIO. Additionally, on macOS, we clear any resident buffers before initiating the benchmark using the purge command. This approach provides a conservative lower bound on throughput in scenarios where no caching is permitted, and makes the benchmarks repeatable. It is worth noting that these figures can improve if either the inference code or the operating system is allowed to cache some part of the weights.

While OS-level buffer caching is advantageous for general-purpose applications with high cache hit rates, it lacks fine-grained control over cache usage per process or buffer eviction at the application level. In the context of on-device memory constraints and large model sizes, this could lead to a situation where the filesystem cache does not help, because in order to evaluate later layers, earlier layers must be evicted in a rolling pattern, so the effective cache hit rate is close to zero. Aside from being inefficient, this can cause coexistence issues with other processes due to memory allocation pressure and Translation Lookaside Buffer (TLB) churn.

4.1 Results for OPT 6.7B Model

This section presents the outcomes for the OPT 6.7B model, specifically under conditions where the memory allocated for the model in DRAM is approximately half of its baseline requirement.

Predictors. For the initial 28 layers of the OPT 6.7B model, we train predictors with a rank of r = 128. To reduce the occurrence of false negatives, the final four layers employ predictors with a higher rank of r = 1024. These predictors achieve an average of 5% false negatives and 7% false positives in the OPT 6.7B model. As depicted in Figure 3a, our predictor accurately identifies most activated neurons, while occasionally misidentifying inactive ones with values near zero. Notably, these false negatives, being close to zero, do not significantly alter the final output when they are excluded. Furthermore, as demonstrated in Table 1, this level of prediction accuracy does not adversely affect the model's performance on zero-shot tasks.

Windowing in the OPT 6.7B Model. Utilizing a windowing method with k = 5 in the OPT 6.7B model significantly reduces the necessity for fresh data loading. Using the active neurons of the predictor alone would require about 10% of the DRAM memory capacity on average; however, with our method, it drops to 2.4%. This process involves reserving DRAM memory for a window of the past 5 tokens, which, in turn, increases the DRAM requirement for the Feed Forward Network (FFN) to 24%.

The overall memory retained in DRAM for the model comprises several components: the embeddings, the attention model, the predictor, and the loaded feed-forward layer. The predictor accounts for 1.25% of the model size, while embeddings constitute 3%. The attention model's weights make up 32.3%, and the FFN occupies 15.5% (calculated as 0.24 × 64.62). Summing these up, the total DRAM memory usage amounts to 52.1% of the model's size.

Latency Analysis: Using a window size of 5, each token requires access to 2.4% of the Feed Forward Network (FFN) neurons. For a 32-bit model, the data chunk size per read is 2dmodel × 4 bytes = 32 KiB, as it involves concatenated rows and columns. On an M1 Max, this results in an average latency of 125ms per token for loading from flash and 65ms for memory management (involving neuron deletion and addition). Thus, the total memory-related latency is less than 190ms per token (refer to Figure 1). In contrast, the baseline approach, which requires loading 13.4GB of data at a speed of 6.1GB/s, leads to a latency of approximately 2330ms per token. Therefore, our method represents a substantial improvement over the baseline.

For a 16-bit model on a GPU machine, the flash load time is reduced to 40.5ms, and memory management takes 40ms, slightly higher due to the additional overhead of transferring data from CPU to GPU. Nevertheless, the baseline method's I/O time remains above 2000 milliseconds.

Detailed comparisons of how each method impacts performance are provided in Table 2.
Table 2: The I/O latency of OPT 6.7B (16-bit) on M1 Max for different techniques when half the memory is available.

Hybrid  Predictor  Windowing  Bundling | DRAM (GB)  Flash→DRAM (GB)  Throughput (GB/s)  I/O Latency (ms)
✗       ✗          ✗          ✗        | 0          13.4             6.10               2130
✓       ✗          ✗          ✗        | 6.7        6.7              6.10               1090
✓       ✓          ✗          ✗        | 4.8        0.9              1.25               738
✓       ✓          ✓          ✗        | 6.5        0.2              1.25               164
✓       ✓          ✓          ✓        | 6.5        0.2              2.25               87
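As a quick sanity check on the figures in this section, the back-of-the-envelope arithmetic below reproduces the reported DRAM budget and the order of magnitude of the baseline I/O latency. This is our illustrative calculation; the measured numbers in the text also include software overheads that this simple model ignores.

```python
# DRAM budget for OPT 6.7B with a window of 5 tokens (fractions of model size)
predictor, embeddings, attention, ffn_total = 0.0125, 0.03, 0.323, 0.6462
dram_fraction = predictor + embeddings + attention + 0.24 * ffn_total
print(f"DRAM residency: {dram_fraction:.1%}")           # ~52.1% of the model size

# Per-read chunk for a 32-bit model: one bundled row/column pair
d_model = 4096
print(f"chunk size: {2 * d_model * 4 / 1024:.0f} KiB")  # 32 KiB

# Baseline: load 13.4 GB at 6.1 GB/s for every token
print(f"baseline I/O: {13.4 / 6.1:.2f} s per token")    # ~2.2 s, in line with ~2330 ms reported
```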

4.2 Results for Falcon 7B Model

To verify that our findings generalize beyond OPT models, we also apply the idea of LLM in a flash to the Falcon model. Since the baseline Falcon model is not sparse, we used a sparsified (relufied) version with almost the same performance as that of the base version (Mirzadeh et al., 2023). Similar to the previous section, we present the results obtained under the condition that approximately half of the model size is available for use in DRAM.

Predictors. In the Falcon 7B model, predictors of rank r = 256 are used for the initial 28 layers, and r = 1152 for the last four layers.

Window Configuration. Our model reserves memory for a window containing the last 4 tokens. This setup utilizes 33% of the Feed Forward Network (FFN). In terms of memory allocation, embeddings take 4.2% of the model size, attention weights account for 19.4%, and predictors require 4%. The active portion of the FFN, given our window size, is 25.3% (calculated as 0.33 × 76.8). Overall, this amounts to 52.93% of the model's total size.

Latency Analysis. Using a window size of 4 in our model requires accessing 3.1% of the Feed Forward Network (FFN) neurons for each token. In a 32-bit model, this equates to a data chunk size of 35.5 KiB per read (calculated as 2dmodel × 4 bytes). On an M1 Max device, the time taken to load this data from flash memory is approximately 161ms, and the memory management process adds another 90ms, leading to a total latency of 250ms per token. In comparison, the baseline latency is around 2330 milliseconds, making our method approximately 9 to 10 times faster.

5 Related Works

Efficient Inference for Large Language Models. As LLMs grow in size, reducing their computational and memory requirements for inference has become an active area of research. Approaches broadly fall into two categories: model compression techniques like pruning and quantization (Han et al., 2016b; Sun et al., 2023; Jaiswal et al., 2023; Xia et al., 2023; Zhang et al., 2022a; Xu et al., 2023; Shao et al., 2023; Lin et al., 2023; Hoang et al., 2023; Zhao et al., 2023; Ahmadian et al., 2023; Liu et al., 2023a; Li et al., 2023), and selective execution like sparse activations (Liu et al., 2023b; Mirzadeh et al., 2023) or conditional computation (Graves, 2016; Baykal et al., 2023). Our work is complementary, focusing on minimizing data transfer from flash memory during inference.

Selective Weight Loading. Most related to our approach is prior work on selective weight loading. SparseGPU (Narang et al., 2021) exploits activation sparsity to load a subset of weights for each layer; however, it still requires loading from RAM. FlexGen (Sheng et al., 2023) offloads the weights and KV cache from GPU memory to DRAM and from DRAM to flash memory; in contrast, we consider only the cases where the full model cannot reside in the combined DRAM and GPU memory of the edge device. FlexGen is theoretically bound by the slow throughput of flash to DRAM in such scenarios. Firefly (Narang et al., 2022) shares our goal of direct flash access but relies on a hand-designed schedule for loading. In contrast, we propose a cost model to optimize weight loading. Similar techniques have been explored for CNNs (Parashar et al., 2017; Rhu et al., 2013). Concurrently, Adapt (Subramani et al., 2022) has proposed adaptive weight loading for vision transformers. We focus on transformer-based LLMs and introduce techniques like neuron bundling tailored to LLMs.

To hide flash latency, we build on speculative execution techniques like SpAtten (Dai et al., 2021; Bae et al., 2023), but we introduce lightweight speculation tailored to adaptive weight loading.
Hardware Optimizations. There is a rich body of work on hardware optimizations for efficient LLM inference, including efficient memory architectures (Agrawal et al., 2022; Gao et al., 2022), dataflow optimizations (Han et al., 2016a; Shao et al., 2022), hardware evaluation frameworks (Zhang2023AHE), and flash optimizations (Ham et al., 2016; Meswani et al., 2015). We focus on algorithmic improvements, but these could provide additional speedups.

Speculative Execution. Speculative decoding (Leviathan et al., 2022; Zhang et al., 2023; He et al., 2023) is a technique that uses a draft model for generation and uses the larger model to verify those tokens. This technique is orthogonal to ours and can be used for further improvement; in the case of speculative decoding, the window in our method should be updated with multiple tokens rather than one.

Mixture of Experts. Mixture of Experts models (Yi et al., 2023) have a sparse structure in their feed-forward layer and can leverage our method for enabling larger models on device.

In summary, we propose algorithmic techniques to minimize weight loading from flash memory during LLM inference. By combining cost modeling, sparsity prediction, and hardware awareness, we demonstrate 4-5x and 20-25x speedups on CPU and GPU, respectively.

6 Conclusion and Discussion

In this study, we have tackled the significant challenge of running large language models (LLMs) on devices with constrained memory capacities. Our approach, deeply rooted in the understanding of flash memory and DRAM characteristics, represents a novel convergence of hardware-aware strategies and machine learning. By developing an inference cost model that aligns with these hardware constraints, we have introduced two innovative techniques: 'windowing' and 'row-column bundling'. These methods collectively contribute to a significant reduction in the data load and an increase in the efficiency of memory usage. Weight bundling and windowing are two very basic techniques aimed at showcasing the potential to increase chunk size and read sequentiality while reducing data transfer through sparsity. Numerous opportunities exist for developing smarter and more efficient methods to achieve these objectives.

The practical outcomes of our research are noteworthy. We have demonstrated the ability to run LLMs up to twice the size of the available DRAM, achieving an acceleration in inference speed of 4-5x compared to traditional loading methods on CPU, and 20-25x on GPU. This innovation is particularly crucial for deploying advanced LLMs in resource-limited environments, thereby expanding their applicability and accessibility. The PyTorch-based implementation of the forward pass has only undergone algorithmic (as opposed to systems) optimization. Significant additional gains are expected from a custom lower-level implementation.

Our work not only provides a solution to a current computational bottleneck but also sets a precedent for future research. It underscores the importance of considering hardware characteristics in the development of inference-optimized algorithms, suggesting a promising direction for further exploration in this domain. We believe that as LLMs continue to grow in size and complexity, approaches like this work will be essential for harnessing their full potential in a wide range of devices and applications.

Our study represents an initial endeavor in the pursuit of democratizing Large Language Model (LLM) inference, making it accessible to a wider array of individuals and devices. We recognize that this early effort has its limitations, which, in turn, open up compelling avenues for future research. A critical aspect for future exploration is the analysis of power consumption and thermal limitations inherent in the methods we propose, particularly for on-device deployment. Currently, our focus is on single-batch inference. However, expanding this to include scenarios like prompt processing, multi-batch inference, and speculative decoding presents itself as a valuable area for further investigation. In our initial proof of concept, we operated under the assumption of memory availability being half the size of the model. Exploring the dynamics of working with varying memory sizes, both larger and smaller, introduces a fascinating balance between latency and accuracy, and is a compelling area for future exploration. In conclusion, our methodology is constructed on the foundation of sparsified networks. Nonetheless, the underlying concept holds potential for broader applications. It can be adapted to selectively load weights in non-sparse networks or to dynamically retrieve model weights from flash storage, contingent on the specific requirements of the input prompt or the contextual parameters provided. Such an approach suggests a versatile strategy for managing model weights, optimizing performance based on the nature of the input, thereby enhancing the efficiency, usefulness, and applicability of the proposed scheme in various scenarios dealing with Large Language Models (LLMs).
Acknowledgements

We would like to thank Itay Sagron, Lailin Chen, Mahyar Najibi, Qichen Fu, Moin Nabi, Peter Zatloukal, Arsalan Farooq, Sachin Mehta, Mohammad Samragh, Matt Johnson, Etai Zaltsman, Lin Chang, Dominic Giampaolo, Taal Uliel, Hadi Pouransari, Fartash Faghri, Oncel Tuzel, Samy Bengio, Ruoming Pang, Chong Wang, Ronan Collobert, David Grangier, and Aftab Munshi for the valuable feedback and discussions.

References

Udit Agrawal, Rangharajan Venkatesan, Brucek Khailany, Stephen W Keckler, and William J Dally. 2022. Atomlayer: minimizing dram data movement for ultra-sparse models on gpus. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 223–238.

Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, A. Ustun, and Sara Hooker. 2023. Intriguing properties of quantization at scale. ArXiv, abs/2305.19268.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Maitha Alhammadi, Mazzotta Daniele, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The falcon series of language models: Towards open frontier models.

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15. IEEE.

Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding. ArXiv, abs/2310.05424.

Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, and Xin Wang. 2023. Alternating updates for efficient transformers. ArXiv, abs/2301.13310.

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Maksim Riabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. 2023. Petals: Collaborative inference and fine-tuning of large models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 558–568, Toronto, Canada. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Han Dai, Yi Zhang, Ziyu Gong, Nanqing Yang, Wei Dai, Eric Song, and Qiankun Xie. 2021. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In Advances in Neural Information Processing Systems, volume 34.

Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, and Arushi Somani. 2023. Releasing Persimmon-8B.

Mingyu Gao, Jie Yu, Wentai Li, Michael C Dai, Nam Sung Kim, and Krste Asanovic. 2022. Computedram: In-memory compute using off-the-shelf dram. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1065–1079.

Alex Graves. 2016. Adaptive computation time for recurrent neural networks. In International Conference on Machine Learning, pages 3500–3509. PMLR.

Jongmin Ham, Jinha Kim, Jinwoong Choi, Cheolwoo Cho, Seulki Hong, Kyeongsu Han, and Taejoo Chung. 2016. Graphssd: a high performance flash-based storage system for large-scale graph processing. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 243–256.

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016a. Eie: efficient inference engine on compressed deep neural network. arXiv preprint arXiv:1602.01528.

Song Han, Huizi Mao, and William J Dally. 2016b. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR).

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. 2023. Rest: Retrieval-based speculative decoding. ArXiv, abs/2311.08252.

Duc Nien Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, and Zhangyang Wang. 2023. (Dynamic) prompting might be all you need to repair compressed llms. ArXiv, abs/2310.00867.
Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, and Yinfei Yang. 2023. Compressing llms: The truth is rarely pure and never simple. ArXiv, abs/2310.01382.

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2022. Fast inference from transformers via speculative decoding.

Jiaxi Li and Wei Lu. 2023. Contextual distortion reveals constituency: Masked language models are implicit parsers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5208–5222, Toronto, Canada. Association for Computational Linguistics.

Liang Li, Qingyuan Li, Bo Zhang, and Xiangxiang Chu. 2023. Norm tweaking: High-performance low-bit quantization of large language models. ArXiv, abs/2309.02784.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. Awq: Activation-aware weight quantization for llm compression and acceleration. ArXiv, abs/2306.00978.

Zechun Liu, Barlas Oğuz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023a. Llm-qat: Data-free quantization aware training for large language models. ArXiv, abs/2305.17888.

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023b. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR.

Moinuddin K Meswani, Sergey Blagodurov, David Roberts, John Slice, Mike Ignatowski, and Gabriel Loh. 2015. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 383–394. IEEE.

Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. 2023. Relu strikes back: Exploiting activation sparsity in large language models.

Sharan Narang, Logan Feistel, Erich Elsen Undersander, Cindy Song, and Gregory Diamos. 2022. Firefly: A lightweight system for running multi-billion parameter models on commodity hardware. In 2022 ACM/IEEE 49th Annual International Symposium on Computer Architecture (ISCA), pages 757–771. IEEE.

Sharan Narang, Erich Elsen Undersander, and Gregory Diamos. 2021. Sparse gpu kernels for deep learning. In International Conference on Learning Representations.

Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. 2017. Timeloop: A systematic approach to dnn accelerator evaluation. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 241–251. IEEE.

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14.

Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. 2013. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), page Article 13. IEEE Computer Society.

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqiang Li, Kaipeng Zhang, Peng Gao, Yu Jiao Qiao, and Ping Luo. 2023. Omniquant: Omnidirectionally calibrated quantization for large language models. ArXiv, abs/2308.13137.

Yifan Shao, Mengjiao Li, Wenhao Cai, Qi Wang, Dhananjay Narayanan, and Parthasarathy Ranganathan. 2022. Hotpot: Warmed-up gigascale inference with tightly-coupled compute and reuse in flash. In Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, pages 335–349.

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 31094–31116. PMLR.

Vedant Subramani, Marios Savvides, Li Ping, and Sharan Narang. 2022. Adapt: Parameter adaptive token-wise inference for vision transformers. In Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture.

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. ArXiv, abs/2306.11695.

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-llm: Enabling low-cost and highly-efficient large generative model inference with unstructured sparsity. Proc. VLDB Endow., 17:211–224.
Zhaozhuo Xu, Zirui Liu, Beidi Chen, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, and Anshumali Shrivastava. 2023. Compress, then prompt: Improving accuracy-efficiency trade-off of llm inference with transferable prompt. ArXiv, abs/2305.11186.

Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. 2023. Edgemoe: Fast on-device inference of moe-based large language models. ArXiv, abs/2308.14352.

Jinchao Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. 2023. Draft & verify: Lossless large language model acceleration via self-speculative decoding. ArXiv, abs/2309.08168.

Shizhao Zhang, Han Dai, Tian Sheng, Jiawei Zhang, Xiaoyong Li, Qun Xu, Mengjia Dai, Yunsong Xiao, Chao Ma, Rui Tang, et al. 2022a. Llm quantization: Quantization-aware training for large language models. In Advances in Neural Information Processing Systems, volume 35.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022b. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068.

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2023. Atom: Low-bit quantization for efficient and accurate llm serving. ArXiv, abs/2310.19102.
