
arXiv:2409.04833v1 [cs.CL] 7 Sep 2024

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2023.1120000

Achieving Peak Performance for Large Language Models: A Systematic Review

ZHYAR RZGAR K ROSTAM 1 (Member, IEEE), SÁNDOR SZÉNÁSI 2,3 (Member, IEEE), and GÁBOR KERTÉSZ 2,4 (Senior Member, IEEE)

1 Doctoral School of Applied Informatics and Applied Mathematics, Óbuda University, Budapest, Hungary
2 John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary
3 Faculty of Economics and Informatics, J. Selye University, Komarno, Slovakia
4 Laboratory of Parallel and Distributed Systems, Institute for Computer Science and Control (SZTAKI), Hungarian Research Network (HUN-REN), Budapest, Hungary

Corresponding author: Zhyar Rzgar K Rostam (e-mail: zhyar.rostam@stud.uni-obuda.hu).

ABSTRACT In recent years, large language models (LLMs) have achieved remarkable success in natural language processing (NLP). LLMs require an extremely large number of parameters to attain high performance. As models grow into the trillion-parameter range, computational and memory costs increase significantly. This makes it difficult for many researchers to access the resources needed to train or apply these models. Optimizing LLM performance involves two main approaches: fine-tuning pre-trained models for specific tasks to achieve state-of-the-art performance, and reducing costs or improving training time while maintaining similar performance. This paper presents a systematic literature review (SLR) following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. We reviewed 65 publications out of 983 from 2017 to December 2023, retrieved from 5 databases. The study presents methods to optimize and accelerate LLMs while achieving cutting-edge results without sacrificing accuracy. We begin with an overview of the development of language modeling, followed by a detailed explanation of commonly used frameworks and libraries, and a taxonomy for improving and speeding up LLMs based on three classes: LLM training, LLM inference, and system serving. We then delve into recent optimization and acceleration strategies such as training optimization, hardware optimization, and scalability and reliability, accompanied by the taxonomy and categorization of these strategies. Finally, we provide an in-depth comparison of each class and strategy, with two case studies on optimizing model training and enhancing inference efficiency. These case studies showcase practical approaches to address LLM resource limitations while maintaining performance.

INDEX TERMS Distributed training, GPU acceleration, Large Language Model, LLM, LLM Acceleration, LLM frameworks, LLM Optimization.

I. INTRODUCTION
In recent years, dense deep learning models have seen an extraordinary growth in the number of parameters [1]–[3]. The Transformer, as an effective deep learning architecture, has been widely used over recent years, and transformer-based models have achieved notable success and recognition in various fields, including language modeling, compared to existing models [4]–[13].

To achieve significant accuracy in deep learning, large models with billions to trillions of parameters are essential. Therefore, deep learning models continue to grow in complexity, with an array of large-scale models ranging from Bidirectional Encoder Representations from Transformers (BERTlarge, 340 million parameters) [8] and Generative Pre-trained Transformer-3 (GPT-3, 175 billion parameters) [14] to General Language Model (GLM-3, 1.75 trillion parameters) [15]. With models now reaching trillions of parameters, even the most powerful GPUs are struggling to keep up [1]. This resource-intensive requirement is making it difficult for many researchers to access the computational resources they need to train these models [1], [4], [16]. Also, handling, managing, and fitting these models into device memory is a daunting challenge due to memory limitations, and this tremendous size of data brings complexity and requires high-end computing resources with significant memory requirements to process [5], [17]–[19]. Training large-scale models effectively requires significant adjustments [20]–[24], especially in terms of increasing training throughput and loading these kinds of large models into GPU memory [18].

As a result, developing frameworks and libraries and proposing new techniques to overcome the mentioned challenges has become an essential task. Many studies have explored possibilities for optimization and acceleration with large models, using various techniques to achieve state-of-the-art (SOTA) results without sacrificing accuracy. These remarkable advancements in the field of language models (LMs) call for a systematic review of recent LM optimization and acceleration techniques. To address these challenges and guide future research, this SLR paper aims to:
• Analyze recent optimization and acceleration techniques for LLMs.
• Identify challenges associated with training, inference, and system serving for LLMs (billions/trillions of parameters).
• Develop a structured taxonomy to categorize LLM optimization techniques.
• Review and evaluate recent libraries and frameworks designed for LLM optimization.
• Identify promising areas for future research in LLM development, focusing on efficiency, scalability, and flexibility.

In this SLR we make the following contributions:
• Comprehensive overview: We offer a comprehensive overview of the development of language modeling (Section II), detailing commonly used frameworks and libraries (Section IV), and recently used techniques and strategies (Sections V, VI, VII). This serves as a valuable resource for understanding the current landscape of LLM optimization.
• Taxonomy of optimization strategies: We categorize optimization strategies into three classes: training optimization, hardware optimization, and scalability and reliability. This taxonomy helps clarify the various approaches and their specific applications (presented in Fig. 4, Sections V, VI, VII).
• Detailed analysis of techniques: Our analysis explores recent optimization and acceleration strategies. We provide two comparative analyses regarding performance, cost, and scalability for the reviewed strategies (presented in Tables 6 and 7) and their core categories: training optimization, hardware optimization, and scalability and reliability (presented in Table 5). In the latter analysis, we also consider the focus of the classes.
• Case studies: We include two in-depth case studies that demonstrate practical approaches to optimizing model training and enhancing inference efficiency. These case studies highlight how resource limitations can be addressed while maintaining performance (Sections VIII-A, VIII-B).
• Future direction: We explore a range of promising future directions for LLM development. These areas, detailed in specific sections, focus on enhancing efficiency, scalability, and flexibility for LLMs (Section X).

This review paper is organized as follows: an overview of language modeling development (Section II), followed by an in-depth explanation of the most commonly utilized frameworks and libraries specifically designed for optimizing and accelerating large language models (LLMs) (Section IV, Tables 3 and 4), accompanied by taxonomy and categorization. Additionally, it delves into recent optimization and acceleration strategies employed within LLMs, including the taxonomy and categorization of these strategies (presented in Fig. 1) (Sections V, VI, VII); Table 8 summarizes the reviewed papers, excluding those already covered in Tables 3 and 4 or the main text. Moreover, we present an individual comparison in terms of performance, cost, and scalability for the reviewed strategies, discussed in Tables 6 and 7, and the classes (training optimization, hardware optimization, scalability and reliability) presented in Table 5. In addition to the mentioned factors, we consider the classes' focus in this comparison. Finally, we illustrate these concepts with two real-world examples: optimizing model training and improving inference efficiency through case studies (Section VIII).

A. RELATED WORKS
In this section, we present related studies that investigate optimization and acceleration of dense deep learning models and LLMs. Jahan et al. [25] present a systematic literature review (SLR) comparing 31 language models inspired by BERT, published between 2018 and 2020, to help researchers choose the best model based on their requirements. By analyzing each model's performance against RoBERTa, the study identified seven models that performed better, while the remaining studies were investigated with different parameter settings. The outperforming models varied in dataset size, suggesting that both large and small datasets can be effective depending on the model's architecture. Ultimately, this research provides valuable insights for researchers seeking the optimal language model for their specific tasks. Yu et al. [26] conduct a survey that explores the growing challenges and opportunities for optimizing large-scale deep learning systems. By highlighting recent advances in optimization techniques, it proposes a new way to categorize and explain the different computing approaches used. Zhao et al. [27] carry out a survey that focuses on recent advancements in LLMs. The study concentrates on four major dimensions of LLMs: pre-training, adaptation tuning, utilization, and capacity evaluation. The survey emphasizes the techniques or discoveries essential for LLMs' success. Additionally, it provides an overview of available development resources and offers valuable guidelines for successful LLM implementation, drawing from the latest research. Bai et al. [28] provide a systematic survey that gives an overview of LLM resource efficiency. It focuses on LLMs' significant resource consumption in computational, memory, energy, and financial aspects. It categorizes techniques aimed at improving LLMs' resource efficiency. Standardized evaluation metrics and datasets are also proposed to facilitate fair comparisons. The survey offers insights into current advancements and guides future developments toward more sustainable and efficient LLMs. Wang et al. [29] explore new methods to achieve comparable accuracy while reducing training costs. They highlight optimized algorithms for quicker learning, distributed architectures leveraging widespread computing resources, and hardware acceleration with communication optimization for collaborative training. While challenges remain, these advancements pave the way for more affordable and accessible AI in the future. Min et al. [30] present a survey that explores recent studies on using powerful pre-trained language models (PLMs) in natural language processing (NLP) tasks through the analysis of three popular approaches. The first approach trains on a massive dataset for general language understanding, then specializes the model for a specific task with focused training. The second approach prompts the PLM to treat the desired task as similar to its pre-training tasks, allowing for efficient "few-shot" learning with just a few examples. The third approach reformulates NLP tasks as text generation to maximize the utilization of knowledge embedded within a generative language model. Qiu et al. [31] provide a comprehensive overview of pre-trained models (PTMs), from fundamental knowledge and model architectures to diverse pre-training tasks, extensions, and real-world applications. It proposes a clear taxonomy for easy navigation and provides abundant resources like code, tools, corpora, and reading lists. Recognizing current limitations, the survey also presents a discussion on promising future directions to shape the NLP landscape. The survey in [32] explores techniques for building efficient LLMs. It categorizes approaches into three groups: model-centric, data-centric, and LLM frameworks. The model-centric group focuses on optimizing LLMs through techniques including compression, efficient training, and specialized architectures. The data-centric group focuses on improving data quality and using prompts to guide the model efficiently. LLM frameworks provide specialized software to handle LLMs. Overall, the survey aims to provide a comprehensive understanding of how to make LLMs more efficient and accessible.

In this SLR, we examine research published between 2017 and December 2023, filling a gap in existing surveys by specifically focusing on optimizing and speeding up LLMs. Following the PRISMA approach, we reviewed 65 articles. The study starts with an overview of language modeling's development and then dives deep into the most popular frameworks and libraries for optimizing and accelerating LLMs. It organizes these models with a clear taxonomy and categorizes them effectively. The research also investigates recent approaches for optimizing and accelerating LLMs, offering a classification system along with a summary and comparison of the reviewed papers containing the latest optimization techniques. Moreover, resource limitations and the impact of various optimization techniques in LLMs were addressed through two in-depth case studies. These studies delve into practical approaches for optimizing training and enhancing inference efficiency, demonstrating how these techniques can be applied effectively without excessive resources.

FIGURE 1. LLM optimization techniques and taxonomy:
• Training optimization
  - Model optimization: algorithmic optimization (FlexGen; SwapAdvisor; NLP-Fast; ByteTransformer; Sheared LLaMA; GrowLength; PagedAttention); layer-specific kernels (LightSeq; LightSeq2); model partition (NLP-Fast; GPipe; Megatron-LM); fine-tuning (AlphaTuning; QFT); scheduler optimization (TurboTransformers; PetS); size reduction optimization (Cramming); model compression and quantization (FlexGen; QMoE; SWARM Parallelism; AlphaTuning; GPTQ; FPTQ; Norm Tweaking; FineQuant; PETALS; QFT; QuantEase; LLM-Pruner); pruning (SparseGPT; Sheared LLaMA; LLM-Pruner); hyperparameter optimization (Cramming).
  - Distributed training: data parallelism; model parallelism, including tensor parallelism (Megatron-LM), pipeline parallelism (FlexGen; PETALS; GPipe; DFX), and combined parallelism (Narayanan et al.; ZeRO; GPipe); ZeRO; sequence parallelism (Sequence parallelism; Colossal-AI); automatic parallelization (FlexFlow; Alpa); heterogeneous training (ZeRO-Offload; ZeRO-Infinity; SWARM Parallelism; NLP-Fast; Splitwise).
• Hardware optimization: Splitwise; memory optimization and memory management (TurboTransformers; LightSeq2; EET); hardware-aware optimization (HAO); offloading (FlexGen; ZeRO-Offload; Eliseev and Mazur); mixed precision (Micikevicius et al.; Cramming; LightSeq2; FP8-LM).
• Scalability and reliability: fault tolerance (SWARM; PETALS); scalability (ZeRO-Offload).

B. RESEARCH METHODOLOGY
In this study, we have followed the PRISMA statement to ensure a systematic and transparent methodology. PRISMA provides a comprehensive set of guidelines for conducting systematic reviews. Our approach included a detailed search strategy across multiple databases, explicit inclusion and exclusion criteria, and a thorough study selection process. We documented each step meticulously, including the study selection and exclusion procedures (presented in Fig. 2).

Eligibility criteria: This review focuses on the optimization and acceleration of LLMs, examining the most recent and widely utilized libraries, frameworks, and techniques in this field. To ensure focused analysis, strict eligibility criteria are applied. Only studies published between 2017 and December 2023 are considered, excluding publications not written in English and retracted papers. Additionally, studies are excluded if they are irrelevant to our SLR, or do not explicitly address "Optimization", "Acceleration", or "Large Language Models" in their titles, abstracts, or keywords.
Information sources: To ensure a comprehensive search for authentic studies, a variety of sources, including databases, websites, and tools, were employed. Digital libraries like IEEE Xplore, Web of Science, and Scopus, alongside open access libraries like arXiv and dedicated tools like Zotero, facilitated the data collection and reference management. The last search was conducted on May 25th, 2024. Additionally, Rayyan and the researchrabbit.ai websites were utilized for data exploration and study selection.

Search strategy: This systematic review leveraged two web-based AI tools, ResearchRabbit [33] and Rayyan [34], for both data collection and study selection. In all databases and websites, we were particularly interested in finding studies that focused on language modeling, particularly those that focused on LLM optimization and acceleration. We employed various queries in each source (see Table 1) and exported the retrieved studies for import into Rayyan. Rayyan's AI capabilities facilitated both the selection of desired studies and the exclusion of irrelevant ones.

TABLE 1. Research queries executed
No. | Research Query
RQ 1 | LLM GPU Acceleration
RQ 2 | LLM GPU Optimization
RQ 3 | LLM Acceleration
RQ 4 | LLM Optimization (1)
RQ 5 | Large Language Model GPU Acceleration
RQ 6 | Large Language Model GPU Optimization
(1) The initial search query in arXiv with this RQ was broad, returning 440 studies, many irrelevant to our research. To refine the results, minimize the risk of bias, and ensure retrieval of high-quality, relevant papers, we employed the AND operator along with the title field within the search query specifically on arXiv.

Selection process: The process of selecting which works to review in this study employed strict inclusion criteria. In this SLR we explore techniques and methods that were primarily examined based on their focus on large-scale language modeling, including transformer-based models such as PLMs, LLMs, and even general NLP models. The Rayyan platform facilitated the selection process. Two stages were involved: initial screening using eligibility and inclusion criteria, followed by author selection of the most relevant and impactful studies. Finally, the "compute rating" function in Rayyan was used, and the authors double-checked excluded studies for accuracy.

Data extraction: In this stage, we focused on extracting relevant data from selected studies. Our aim was to collect information on two key aspects:
Outcomes: We were particularly interested in outcomes related to LLM optimization and acceleration. Specifically, we sought data on:
• Performance metrics: This could include metrics like perplexity [19], BLEU score, ROUGE score [35], or task-specific accuracy measures depending on the study's focus.
• Training time reduction: We looked for data on how different techniques impacted the time required to train LLMs.
• Resource usage: If studies reported resource (memory) usage changes with different optimization techniques, we collected that data.
We aimed to collect all relevant results within these outcome domains whenever possible. This included considering data from different measures, time points, and analyses reported by the study authors.
Additional variables: In addition to the main outcomes, we also extracted data on the following aspects of the studies:
• LLM architecture: The specific type of LLM architecture used in the study.
• Optimization techniques: Detailed description of the optimization techniques employed in the study.
• Hardware/Software platforms: The hardware and software platforms used for training, inference, serving, and evaluation.

Data collection process: ResearchRabbit is a web-based tool powered by AI that helps and guides researchers to find relevant studies in a variety of digital libraries and allows researchers to export retrieved results in a collection to reference manager tools (similar to Zotero). ResearchRabbit's search is powered by SemanticScholar and shows only the top 50 search results for a single query, aiming to maintain the research focus effectively [33]. Initially, we applied our queries to the ResearchRabbit website and then added the most relevant retrieved results to our collection. Following that, we applied the same queries in digital libraries like IEEE Xplore, Web of Science, Scopus, and arXiv (see Table 2). The papers were reviewed on a case-by-case basis. Then, a precise summary of each paper was written. Finally, the interesting data that directly addressed the issues the papers attempted to address were extracted from the summaries.

Study risk of bias assessment: In this SLR, we followed a meticulous process to assess the risk of bias in the included studies, adhering to best practices for ensuring the reliability and validity of our findings.
Automation tools:
• We utilized Rayyan, an AI-powered tool, to facilitate the initial screening and selection process. Rayyan's AI capabilities helped in identifying potential biases and categorizing studies based on relevance and quality.
• ResearchRabbit was used for gathering relevant studies, which provided a focused list of top search results, aiding in maintaining the research scope effectively.
Reviewer process:
• Each study was assessed by three independent reviewers. This approach helps to minimize subjective bias and ensures a more balanced evaluation.
• The reviewers independently examined each study based on predefined criteria including selection bias, performance bias, detection bias, attrition bias, and reporting bias.
Independent review and consensus:
• The reviewers worked independently during the initial assessment phase to ensure unbiased evaluations.
• After the independent assessments, the reviewers compared their findings. Any discrepancies or disagreements were resolved through discussion and consensus.
We adhered to a rigorous and systematic approach to assess the risk of bias, which involved multiple independent reviewers and the use of validated tools. Automation tools such as Rayyan and ResearchRabbit played a crucial role in streamlining the screening and selection process, thereby enhancing the efficiency and accuracy of our assessments. By combining independent reviews, consensus discussions, and advanced AI tools, we ensured a robust and unbiased evaluation of the included studies.

FIGURE 2. The PRISMA 2020 flow diagram of the performed search. Records identified from 5 databases (n = 983 in total): IEEE Xplore (n = 158), Web of Science (n = 224), Scopus (n = 170), researchrabbitapp.com (n = 163), arXiv (n = 268). Records removed before screening: duplicate records (n = 41), records marked as ineligible by automation tools (n = 0), records removed for other reasons (n = 2). Records screened (n = 940); records excluded (n = 745). Reports sought for retrieval (n = 195); reports not retrieved (n = 0). Reports assessed for eligibility (n = 195); reports excluded for focusing on different aims (n = 130). New studies included in the review (n = 65).

Synthesis methods: To enable a comprehensive and insightful analysis of LLM optimization techniques across diverse contexts, a three-tiered categorization scheme will be employed. The initial categorization will consist of grouping studies based on the utilized LLM libraries/frameworks and the optimization techniques investigated. Subgroups within these categories will be further established based on the specific type of LLM or the NLP task addressed by the studies. This method enables a highly detailed examination of how the effectiveness of optimization techniques varies across different LLM and NLP task configurations. Additionally, key findings from each individual study will be summarized in tables, including details like the optimization technique used, LLM type, NLP task addressed, achieved performance metrics, and the study's aims. Finally, a narrative synthesis will be conducted to analyze recurring themes across the studies. This thematic analysis will focus on the effectiveness of LLM libraries and optimization techniques in achieving performance improvements while considering resource constraints. It will also explore potential explanations for observed variations in effectiveness, with particular attention paid to factors like LLM size, resources used, and the NLP task addressed.
Reporting bias assessment and certainty assessment: To minimize the risk of bias in our systematic review, we implemented a multifaceted strategy. First, to address reporting bias, we utilized Rayyan and ResearchRabbit, as AI-powered tools, during the initial screening and selection process. These tools can categorize studies based on relevance and quality and can help flag studies with characteristics suggestive of reporting bias, such as those focusing solely on positive outcomes. Second, to strengthen the certainty of our findings and minimize subjective bias, we implemented a multi-reviewer approach. Each study underwent independent assessment by three reviewers based on predefined criteria. This approach ensures a more balanced evaluation and reduces the influence of individual reviewer bias.

TABLE 2. Studies retrieved per database / search engine
Database / Search Engine | Total
IEEE Xplore | 158
Web of Science | 224
Scopus | 170
ResearchRabbit | 163
arXiv | 268
Total | 983

II. LANGUAGE MODELING DEVELOPMENT
Language modeling is a fundamental approach to enhancing the ability of machines to understand and process human language. A language model is a computational model that can learn and predict the probabilities of incoming (or missing) tokens [24]. The development of language models can be classified as follows (see Fig. 3):
• N-gram language models, like bigrams and trigrams, are basic methods that learn from the frequency of word sequences in text [36], [37]; a minimal bigram example is sketched below, after this list. However, their limited context window restricts their ability to capture long-range dependencies and understand the deeper semantic relationships between words.
• Markov assumption language models refer to those models that predict the next word based on the most recent words in the context [24]. Both n-gram and Markov assumption language models are commonly used to improve task performance in NLP and information retrieval (IR) [38].
• Machine learning models: these models apply machine learning algorithms to enhance language comprehension. They are trained on extensive text corpora to discern patterns and relationships [39]. The adoption of machine learning in NLP introduced a more advanced methodology, enabling the creation of applications such as spam detection [40] and sentiment analysis [41].
• Neural language models: these models are developed based on neural networks for working with sequences of data. They have a special ability to learn effective features for words or sentences. These studies [42]–[44] initiated the use of language models for representation learning (beyond word sequence modeling) and show that these models have an important impact on the field of NLP [24], [45].
• Transformer language models refer to those models that leverage the capabilities of a deep learning architecture called the Transformer to process and understand human language [46], [47]. These models achieved remarkable results by using a special attention mechanism to understand the relationship between words and sentences. These models capture context-aware representations instead of learning fixed word representations, first pre-training and then fine-tuning according to specific downstream tasks [2], [8], [24], [48]. The Transformer architecture has been used to build PLMs such as BERT [8], GPT-2 [49], and BART [50]. These models underwent training using bidirectional language models and specifically designed pre-training tasks applied to extensive unlabeled datasets. The growth in model size and data size has revolutionized the way we approach downstream tasks, enabling large-sized PLMs to achieve remarkable performance gains. These models exhibit unique characteristics compared to smaller PLMs, such as 330M-BERT and 1.5B-GPT-2, demonstrating exceptional abilities in solving complex tasks. As a result, LLM is the term used to refer to large-sized PLMs [48], [51], [52].

FIGURE 3. Language model development: (1) n-gram LMs (bigram and trigram), 1990s; (2) Markov assumption LMs (HMMs); (3) machine learning LMs (decision tree, random forest, naive Bayes, and SVM), late 1990s; (4) neural network LMs (RNN, CNN, and LSTM), late 2000s; (5) Transformer LMs (BERT, GPT; LLMs: GPT-3.5, PaLM2, LLaMA-2), 2017.
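To make the n-gram idea above concrete, the following is a minimal, self-contained sketch (our own illustration, not taken from any reviewed paper) of a bigram model estimated from raw word-frequency counts; the toy corpus and function name are placeholders.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate P(next_word | word) from bigram counts in a toy corpus."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])          # contexts (everything but the end marker)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1
    # Maximum-likelihood estimate: count(prev, next) / count(prev)
    return {prev: {nxt: c / unigrams[prev] for nxt, c in nxts.items()}
            for prev, nxts in bigrams.items()}

corpus = ["the model predicts the next word",
          "the next word depends on the previous word"]
probs = train_bigram(corpus)
print(probs["the"])   # distribution over words that follow "the"
print(probs["next"])  # {"word": 1.0} in this toy corpus
```

The limited context is visible directly in the estimate: each prediction depends only on the single preceding word, which is exactly the restriction discussed in the bullet above.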
III. MACHINE LEARNING MODELS
The process of building, deploying, and managing a machine learning model involves three distinct phases: training, inference, and system serving. Training is the foundation of machine learning, where a vast dataset of labeled data is used to develop a model that can identify patterns and relationships within the data. Inference is the application of the trained model, where new, unseen data is fed into the model to obtain predictions or classifications based on the learned patterns. System serving ensures the model's longevity and effectiveness in real-world applications, handling large volumes of requests, monitoring the model's performance, and providing continuous updates or modifications as needed [11], [19], [53]. In Section IV, we provide a categorization of the most recent frameworks and libraries utilized for LLM optimization, structured into three primary classes: training, inference, and deployment and system serving (presented in Fig. 4). However, certain studies can be classified into two categories simultaneously, owing to their ability to handle multiple tasks, such as LightSeq2 (Section IV-A4), TurboTransformers (Section IV-B4), and PetS (Section IV-B5).

IV. FRAMEWORKS AND LIBRARIES
Most LLMs are designed based on Transformers, a powerful type of neural network that has achieved SOTA results on a wide range of applications. To achieve these results, the models are required to have a huge model size with hundreds of billions, even trillions, of parameters. Training LLMs requires distributed training algorithms, which employ parallel processing techniques to efficiently train these massive models. To streamline distributed training, various optimization frameworks have been developed, providing tools and infrastructure for implementing and deploying parallel algorithms [24], [54], [56]. In this section, we present the most recent frameworks and libraries designed to overcome those limitations.

FIGURE 4. LLM frameworks and libraries, grouped into LLM training (e.g., GPipe, ByteTransformer, Megatron-LM, CoLLiE, LightSeq2), LLM inference (e.g., DeepSpeed Inference, FlexGen, NLP-Fast, EET, Splitwise, PETALS, LLMCompass, PowerInfer, LightSeq, TurboTransformers, PetS), and deployment and serving (e.g., vLLM), with some frameworks spanning more than one class.

A. LLM TRAINING FRAMEWORKS AND LIBRARIES
This section delves into the objectives and outcomes of LLM frameworks and libraries employed in the training phase. Additionally, a summary of each framework/library is provided individually (see Table 3).

TABLE 3. Summary of LLM training frameworks and libraries
• GPipe [3]. Aims: a new pipeline parallelism library based on batch splitting; it is efficient, task independent, and works with different NN architectures. Outcomes: a single 6-billion-parameter, 128-layer Transformer model on a dataset with 103 languages achieved better results than individually trained 350-million-parameter models.
• ByteTransformer [4]. Aims: a transformer framework for GPU acceleration, optimized for variable-length inputs in NLP problems. Outcomes: evaluated on an NVIDIA A100 GPU for BERT-like transformers; boosted fused MHA by 6.13x compared to PyTorch attention; also outperformed PyTorch, TensorFlow, Tencent TurboTransformers, Microsoft DeepSpeed, and NVIDIA FasterTransformer by 87%, 131%, 138%, 74%, and 55%, respectively.
• Megatron-LM [19]. Aims: a deep learning library for training LLMs with billions of parameters; offers a set of optimization methods for distributed training. Outcomes: with 8.3 billion parameters trained on 512 NVIDIA V100 GPUs, achieved 15.1 PetaFLOPs throughput; also achieved a perplexity of 10.8 on the WikiText103 benchmark and an accuracy of 66.5% on the LAMBADA dataset.
• LightSeq2 [54]. Aims: a software library that accelerates the training of transformer-based models within GPUs. Outcomes: achieves significant speedups on a variety of NLP tasks; a speedup of 308% on the WMT14 English-German machine translation task compared to PyTorch.
• CoLLiE [55]. Aims: a library for collaborative training of massive LMs; explores memory usage and throughput under different optimization methods, and investigates training techniques to improve the ability of an LM (LLaMA-65B) to follow user instructions. Outcomes: improves training efficiency for LLMs; techniques like LoRA and AdaLomo specifically helped a large model (LLaMA-65B) follow instructions better, with an average score of 56.9, all without sacrificing overall performance.
In the original table, bold text in the "Aims" column indicates the framework's primary area of specialization or the range of tasks it is designed to address.

1) GPipe
GPipe [3] introduces a novel pipeline parallelism framework based on batch partitioning. It divides each mini-batch into smaller micro-batches, which are subsequently executed in sequence across the cells. During training, it employs synchronous mini-batch gradient descent, where gradients from all micro-batches are aggregated and applied to the model at the end of the mini-batch. GPipe has been shown to train two large-scale models: a convolutional model for image classification and a transformer model for machine translation. The convolutional model, AmoebaNet, was trained on 480x480 inputs from the ImageNet 2012 dataset. To enhance its performance, the model width was expanded and its parameters were scaled up to 557 million. The model achieved a top-1 validation accuracy of 84.4%. Meanwhile, the transformer model, a single 128-layer, 6-billion-parameter multilingual model trained across 103 languages, was also evaluated. GPipe achieved superior performance compared to training 350-million-parameter bilingual Transformer Big models individually across 100 language pairs. The framework demonstrates its efficiency by boosting performance on a variety of devices, supporting flexibility for any deep network architecture, utilizing synchronous gradient descent, and ensuring consistent training regardless of the number of partitions.
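As a minimal PyTorch sketch of the micro-batching idea described above, the snippet below splits one mini-batch into micro-batches whose gradients are accumulated before a single synchronous update. It illustrates only the batch-splitting aspect, not GPipe's actual pipeline schedule across partitioned devices; the tiny model and sizes are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in for a partitioned network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, num_micro_batches=4):
    """Split one mini-batch into micro-batches and accumulate their gradients
    before a single synchronous parameter update (GPipe-style batch splitting)."""
    optimizer.zero_grad()
    for mx, my in zip(x.chunk(num_micro_batches), y.chunk(num_micro_batches)):
        loss = loss_fn(model(mx), my) / num_micro_batches  # scale so the sum matches the full-batch loss
        loss.backward()   # gradients from each micro-batch accumulate in .grad
    optimizer.step()      # one update at the end of the mini-batch

x = torch.randn(64, 32)
y = torch.randint(0, 10, (64,))
train_step(x, y)
```

In GPipe the micro-batches additionally flow through model partitions placed on different accelerators, which keeps every device busy while preserving the same synchronous update semantics shown here.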
2) ByteTransformer
ByteTransformer [4] is an efficient, high-performance transformer framework for GPU acceleration, optimized for variable-length inputs in NLP problems. The framework uses an algorithm to avoid redundant computations on zero-padding tokens under variable input lengths. Furthermore, it proposes a fused Multi-Head Attention (MHA) to reduce the memory overhead of the intermediate matrix. The framework also manually optimizes the memory footprint of layer normalization, bias, and activation to maximize overall system performance. It has been used by some well-known applications, including TikTok and Douyin of ByteDance. The model was evaluated on an NVIDIA A100, focusing on the forward pass of BERT-like transformers, including BERT [8], ALBERT [57], DistilBERT, and DeBERTa. It showcased a significant improvement, enhancing the fused MHA mechanism by 6.13x compared to PyTorch attention. Additionally, ByteTransformer outperformed PyTorch, TensorFlow, Tencent TurboTransformers [11], Microsoft DeepSpeed [5], and NVIDIA FasterTransformer by 87%, 131%, 138%, 74%, and 55%, respectively, in terms of the end-to-end performance of a standard BERT transformer.
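ByteTransformer's padding-free algorithm is implemented in fused CUDA kernels; the short PyTorch sketch below only illustrates the underlying idea under simplified assumptions: padded positions are packed out before a token-wise sub-layer (here a feed-forward block) and scattered back afterwards, so no compute is spent on zero-padding tokens.

```python
import torch
import torch.nn as nn

ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))  # any token-wise sub-layer

def ffn_without_padding(hidden, mask):
    """hidden: (batch, seq, dim); mask: (batch, seq), True for real tokens.
    Runs the feed-forward block only on real tokens, skipping zero-padding."""
    packed = hidden[mask]           # (num_real_tokens, dim): padding removed
    out = torch.zeros_like(hidden)
    out[mask] = ffn(packed)         # scatter results back to the padded layout
    return out

hidden = torch.randn(2, 5, 16)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]], dtype=torch.bool)  # sequence lengths 3 and 5
out = ffn_without_padding(hidden, mask)  # only 8 of 10 positions are computed
```

The real framework keeps the packed layout across fused kernels instead of repacking around every operation, which is where most of the reported speedup comes from.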
3) Megatron-LM
Megatron-LM [19] is a deep learning library for training LLMs efficiently and effectively. It enables the training of very large transformer models with billions of parameters. It offers a set of optimization methods for distributed training, including strategies like intra-layer model parallelism and mixed-precision training. These optimization techniques significantly enhance training efficiency and speed, facilitating effective distributed training across multiple GPUs. Megatron-LM operates independently without requiring new compiler or library changes. This makes it orthogonal and complementary to pipeline model parallelism, allowing for seamless integration and flexibility within existing NLP frameworks.

The library has been shown to be highly effective for training LLMs. A Megatron-LM model with 8.3 billion parameters was trained on 512 NVIDIA V100 GPUs using 8-way model parallelism and achieved sustained performance of up to 15.1 PetaFLOPs across the entire application. This is significantly faster than previous approaches to training LLMs. Additionally, it has been shown to achieve SOTA results on several NLP benchmarks. A Megatron-LM model with 8.3 billion parameters achieved a perplexity of 10.8 on the WikiText103 benchmark. It also achieved an accuracy of 66.5% on the LAMBADA dataset, which outperforms the previous SOTA of 63.2%.
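The intra-layer model parallelism used by Megatron-LM splits the weight matrices of each Transformer layer across GPUs. The sketch below is our own single-process, CPU-only illustration of that column/row split for a two-layer MLP, using tensor slices in place of separate devices and a plain sum in place of the all-reduce; it is not the library's implementation.

```python
import torch

torch.manual_seed(0)
d_model, d_ff, world_size = 8, 16, 2

# Full (unsplit) MLP weights: y = gelu(x @ W1) @ W2
W1 = torch.randn(d_model, d_ff)
W2 = torch.randn(d_ff, d_model)

# Megatron-style split: W1 by columns, W2 by rows, one shard per "GPU".
W1_shards = W1.chunk(world_size, dim=1)
W2_shards = W2.chunk(world_size, dim=0)

x = torch.randn(4, d_model)

# Each rank computes its partial output independently; the partials are summed
# at the end (this sum is a single all-reduce in a real multi-GPU setup).
partials = [torch.nn.functional.gelu(x @ w1) @ w2
            for w1, w2 in zip(W1_shards, W2_shards)]
y_parallel = sum(partials)

y_reference = torch.nn.functional.gelu(x @ W1) @ W2
print(torch.allclose(y_parallel, y_reference, atol=1e-5))  # True
```

Because the nonlinearity is applied independently to each column shard, only one communication step per MLP block is needed, which is the property that makes this split attractive for multi-GPU training.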
4) LightSeq2
LightSeq2 [54] proposes a software library that accelerates the training of transformer-based models within GPUs. It is a system-level optimization that maintains accuracy and training behavior. The system works with BERT (encoder), GPT (decoder), Transformer (encoder-decoder), and vision transformers. The system uses three techniques to improve training speed and efficiency. First (layer-specific kernels): after analyzing Transformer-specific layers in detail, it rewrites the kernels with dependencies and other techniques to improve parallelism, and uses small kernels to improve GPU utilization. Second (mixed-precision trainer): instead of applying batch updates to many individual full-precision parameters, it applies batch updates to reduced-precision parameters. Finally, it introduces an efficient memory management technique to minimize the need for frequent allocation and release calls. This strategy involves recycling the memory space of tensors that remain unused during the backward pass. The system accelerates the entire training process for transformer models. LightSeq2 achieves significant performance improvements on a variety of NLP tasks, including machine translation: on the WMT14 English-German machine translation task, it achieved a 308% speedup compared to PyTorch.
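LightSeq2's mixed-precision trainer relies on its own fused kernels; as a generic illustration of the mixed-precision idea (reduced-precision compute with loss scaling), here is a standard PyTorch automatic mixed precision (AMP) loop. It is a common way to obtain similar benefits on a CUDA GPU, not LightSeq2's code, and the model and sizes are placeholders.

```python
import torch
import torch.nn as nn

device = "cuda"  # AMP autocast to float16 assumes a CUDA-capable GPU
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # loss scaling to avoid FP16 gradient underflow
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # forward/backward math runs in FP16 where safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then updates FP32 master weights
    scaler.update()
    return loss.item()

x = torch.randn(32, 512, device=device)
y = torch.randint(0, 10, (32,), device=device)
print(train_step(x, y))
```

The half-precision tensors roughly halve activation memory and map onto Tensor Cores, which is the same trade-off LightSeq2 exploits with its custom reduced-precision update kernels.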
5) CoLLiE
CoLLiE [55] introduces a library designed to efficiently facilitate the collaborative training of LLMs using 3D parallelism [24] (Sections V-C4, V-C1, V-C2), parameter-efficient fine-tuning (PEFT) methods, and optimizers. The library demonstrated significantly improved training efficiency compared to existing solutions. The study empirically evaluates the correlation between model size and GPU memory consumption under different optimization methods and analyzes throughput. Additionally, the study investigates training methods to improve the abilities of the LLaMA-65B model, specifically focusing on following user instructions. Techniques like LoRA [35], LOMO [58] (Section V-A4), AdaLomo [59], and AdamW demonstrated success in boosting the model's instruction-following capabilities without sacrificing its overall performance. Notably, LoRA and AdaLomo achieved impressive results, enabling the model to achieve an average score of 56.9.
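Among the PEFT methods CoLLiE evaluates, LoRA adds a trainable low-rank update to a frozen weight matrix. The following is a minimal PyTorch sketch of that idea (our own simplified module, not CoLLiE's implementation); the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 trainable parameters versus roughly one million in the frozen base
```

Because only the small A and B matrices receive gradients and optimizer state, memory use during fine-tuning drops sharply, which is why such methods fit the resource-constrained setting the study targets.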
6) LLM Training Frameworks and Libraries: Challenges and Key Findings
This section explores five prominent frameworks and libraries: GPipe [3], ByteTransformer [4], Megatron-LM [19], LightSeq2 [54], and CoLLiE [55]. Each offers unique functionalities to overcome limitations in LLM training.
Addressing Training Challenges:
• Distributed training: As LLMs grow complex, training them on a single device becomes impractical. Frameworks like Megatron-LM [19] and CoLLiE [55] employ distributed training algorithms that split the model across multiple GPUs, enabling parallel processing and faster training.
• Efficiency and speed: LightSeq2 [54] tackles training speed through system-level optimizations. It utilizes techniques like layer-specific kernels and mixed-precision training to enhance GPU utilization and reduce memory usage. Similarly, ByteTransformer [4] accelerates transformer models for variable-length inputs in NLP tasks.
• Memory management: Efficient memory allocation is crucial for LLM training. CoLLiE [55] overcomes memory constraints in LLM training by utilizing 3D parallelism to efficiently distribute memory across training machines and GPUs, enabling the training of large models even in resource-limited environments.
• Fine-tuning and performance: CoLLiE [55] investigates methods to improve specific LLM capabilities, such as following user instructions. It explores parameter-efficient fine-tuning methods that enhance model performance in targeted areas without compromising overall functionality.
Key Findings:
• GPipe [3] demonstrates successful training of a large multilingual transformer model, achieving superior results compared to training individual smaller models.
• ByteTransformer [4] significantly outperforms existing frameworks in terms of performance for BERT-like transformers on various benchmarks.
• Megatron-LM [19] facilitates training of LLMs with billions of parameters, achieving SOTA on NLP tasks while offering high throughput.
• LightSeq2 [54] accelerates transformer model training by up to 308%, showcasing substantial performance improvements.
• CoLLiE [55] introduces a library for collaborative LLM training, demonstrating improved efficiency and effectiveness in training large models like LLaMA-65B. It explores methods to enhance specific functionalities without impacting overall performance.

B. LLM INFERENCE FRAMEWORKS AND LIBRARIES
This section introduces the LLM frameworks and libraries designed particularly for inference tasks, followed by a summary of each one (see Table 4).

TABLE 4. Summary of LLM inference frameworks and libraries
• DeepSpeed Inference [5]. Aims: a GPU-only, powerful and versatile system for efficient transformer inference at scale. Outcomes: boosts throughput by 1.5x, reduces latency by 7.3x, and enables inference of 25x-larger models at 84 TFLOPS.
• FlexGen [17]. Aims: an offloading engine designed for high-throughput LLM inference by efficiently utilizing limited resources from GPU, CPU, and disk, employing various techniques to enhance efficiency. Outcomes: (1) achieved 40x throughput speedup with 5000-second latency for a batch size of 64 (2048 tokens); (2) achieved 79x throughput speedup with 12000-second latency for a batch size of 256 (8192 tokens); (3) achieved 100x throughput speedup with 4000-second latency for a batch size of 144 (4608 tokens) using 4-bit quantization compression.
• NLP-Fast [60]. Aims: accelerates the performance of large-scale heterogeneous NLP models. Outcomes: evaluated on a variety of NLP models and hardware platforms, including CPU, GPU, and FPGA; throughput improved by up to 2.92x, 1.59x, and 4.47x over the baseline performance.
• TurboTransformers [11]. Aims: a lightweight, easy-to-use system that enables efficient deployment of transformer models for online services. Outcomes: introduces three innovative features: (1) efficient GPU-based batch reduction kernels for Softmax and LayerNorm; (2) a sequence-length-aware memory allocation algorithm; (3) a new batch scheduler employing dynamic programming for optimal throughput on variable-length requests.
• PetS [61]. Aims: a unified framework for multitask PET serving in a single system. Outcomes: enables 26x more concurrent tasks and enhances serving throughput by 1.53x on desktop GPUs and 1.63x on server GPUs.
• PETALS [62]. Aims: a collaborative platform for distributed inference and fine-tuning of LLMs over the internet. Outcomes: with an optimal hardware setup involving CPU RAM offloading via PCIe 4.0 and GPU pairs connected through PCIe switches, offloading 176B parameters takes 5.5 seconds in a regular setup and 11 seconds in a multi-GPU setup, with each GPU having 1 GB of memory per billion parameters and PCIe 4.0 throughput of 256 Gbit/s (or 128 Gbit/s behind a PCIe switch for two GPUs).
• LightSeq [63]. Aims: an inference library that addresses the need for efficient and convenient deployment of Transformer models in online services. Outcomes: in machine translation benchmarks, consistently outperforms TensorFlow and FasterTransformer (FT), achieving up to 14x and 1.4x speedups, respectively.
• Easy and Efficient Transformer (EET) [64]. Aims: a library to accelerate transformer inference. Outcomes: compared against Fairseq, LightSeq, and FasterTransformer (FT) on 2080Ti and A100, EET achieves speedups of 4.48-20.27x and 4.30-27.43x, respectively, over Fairseq. On 2080Ti, EET achieves a speedup of 0.82-2.46x over LightSeq for model sizes of 768 and 1024. Additionally, EET achieves speedups of 1.21-6.30x and 1.62-8.12x over FT v3.1 on 2080Ti and A100, respectively, and a speedup of 1.40-4.20x over FT v4.0 on A100.
• Splitwise [65]. Aims: improve LLM inference efficiency by separating compute-intensive and memory-intensive phases onto different machines with hardware specialization. Outcomes: achieves up to 1.4x higher throughput at 20% lower cost, or 2.35x higher throughput with the same cost and power consumption.
• Zhang et al. (LLMCompass) [66]. Aims: evaluate hardware designs for LLMs using the LLMCompass library for recommending cost-effective hardware designs. Outcomes: accurate (average error of 10.4% for task time, 4.1% for LLM tasks), simulates GPT-3 175B on 4x A100 GPUs in 16 minutes, and identifies more affordable hardware designs.
• PowerInfer [67]. Aims: build a faster LLM inference engine for consumer-grade GPUs. Outcomes: achieves significant speedups (up to 11.69x faster) by placing hot neurons on the GPU and cold neurons on the CPU, while maintaining accuracy and approaching the performance of high-end GPUs.
In the original table, bold text in the "Aims" column indicates the framework's primary area of specialization or the range of tasks it is designed to address.
1) DeepSpeed Inference
DeepSpeed Inference [5] presents a comprehensive system solution for efficient transformer inference. It has the potential to enable new and innovative applications of transformer models in cloud datacenters and other resource-constrained environments. The system consists of two main parts: DeepSpeed Transformer and ZeRO-Inference [1]. DeepSpeed Transformer is a GPU-only solution that leverages a variety of optimizations to achieve SOTA (minimized) latency and (maximized) throughput for transformer models of all sizes. Specifically, DeepSpeed Transformer first uses tensor-slicing and inference-optimized pipeline parallelism to scale dense transformer models across GPUs. For sparse transformer models, it has developed a massive-GPU sparse transformer layer that can extend the scalability of Mixture-of-Experts (MoE) transformer layers to hundreds of GPUs. This is achieved through a combination of parallelism techniques and optimization strategies for communication. DeepSpeed Transformer then employs optimized sparse kernels to reduce the computational burden on a single GPU. ZeRO-Inference [1] is a heterogeneous solution that leverages GPU, CPU, and NVMe memory to enable massive transformer inference with limited GPU resources. It is particularly useful for inferring models that are too large to fit in GPU memory. It works by partitioning the model weights across multiple GPUs and offloading unused weights to CPU and NVMe memory. This allows ZeRO-Inference to infer models that are much larger than would be possible with GPU-only solutions. As a result, DeepSpeed Inference boosts throughput by more than 1.5x for throughput-oriented scenarios and reduces latency by more than 7.3x compared to existing solutions for latency-oriented scenarios. It facilitates real-time inference at a trillion-parameter scale by utilizing hundreds of GPUs, marking an unparalleled achievement in terms of inference scale. This technology allows for the inference of models that are 25 times larger than what GPU-only solutions can handle, achieving a substantial throughput of 84 TFLOPS, which is over 50% of the A6000 peak performance.

2) FlexGen
Accelerating LLM inference is typically achieved by using multiple high-end accelerators, due to LLMs' high computational and memory requirements. The FlexGen [17] study proposes an offloading engine which focuses on using limited resources (resource-constrained devices) to reach high-throughput LLM inference. The engine can be flexibly configured with different hardware resources by aggregating memory and computation from the GPU, CPU, and disk. To optimize throughput within the search space, the researchers developed a linear programming-based search algorithm through which the model can find efficient patterns for storing and accessing tensors. It has a larger space of batch size options to choose from without sacrificing accuracy, by using 4-bit quantization to compress weights and the attention cache without the need for retraining or calibration. The model's efficiency was evaluated by using NVIDIA T4 (16 GB) GPUs to run OPT-175B. It significantly outperforms DeepSpeed ZeRO-Inference [1], [5] and Hugging Face Accelerate by enabling significantly larger batch sizes, often reaching orders of magnitude higher than its competitors. As a result, it can achieve significant speedups in throughput. On a single T4 GPU equipped with 208 GB CPU DRAM and a 1.5 TB SSD, with input sequence length 512 and output sequence length 32: with a latency of 5,000 seconds, FlexGen (effective batch size 64) surpasses DeepSpeed ZeRO-Inference (batch size 1) by over 40x, whereas Hugging Face Accelerate fails to complete a single batch. Furthermore, it can reach 69x higher throughput with a higher latency of 12,000 seconds compared to baselines (effective batch size 256, or 8192 tokens in total). Finally, the model can achieve 100x higher maximum throughput (effective batch size 144, or 4608 tokens in total) with 4-bit quantization compression and a latency of 4,000 seconds, by holding all weights in the CPU and eliminating disk offloading. The model achieved these results by aggregating memory and computation from the GPU, CPU, and disk, and by using a number of techniques to improve efficiency, such as I/O task scheduling, possible compression techniques, and distributed pipeline parallelism. FlexGen is a significant advancement in LLM inference, as it enables high-throughput generation on resource-constrained devices. This opens new possibilities for deploying and using LLMs in a wider range of applications.
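FlexGen's placement policy is chosen by a linear-programming search over GPU, CPU, and disk; the sketch below shows only the basic mechanism any such policy builds on, namely keeping weights in CPU RAM and streaming one layer at a time onto the GPU for its forward pass. It is a deliberately naive illustration (no overlap of I/O and compute, no KV-cache or disk tier) and assumes a CUDA device; the layer stack is a stand-in, not an actual LLM.

```python
import torch
import torch.nn as nn

# Stand-in "LLM": a stack of large linear layers kept in CPU RAM.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])  # resides on the CPU

@torch.no_grad()
def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    """Run the forward pass with only one layer resident on the GPU at a time."""
    x = x.to("cuda")
    for layer in layers:
        layer.to("cuda")          # stream this layer's weights into GPU memory
        x = torch.relu(layer(x))
        layer.to("cpu")           # evict it to make room for the next layer
    return x.to("cpu")

out = offloaded_forward(torch.randn(16, 4096))
print(out.shape)  # torch.Size([16, 4096])
```

Because weight transfers dominate the cost, systems like FlexGen amortize them over very large effective batches and overlap transfers with compute, which is how the throughput numbers above are reached despite the added latency.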
3) NLP-Fast
NLP-Fast [60] is a system that accelerates the performance of large-scale heterogeneous NLP models by identifying performance-critical operations and applying holistic model partitioning, cross-operation zero skipping, and model/config-adaptive hardware reconfiguration. NLP-Perf, a performance analysis tool, collects performance data for NLP models and identifies performance-critical operations. Holistic model partitioning is a comprehensive optimization technique which integrates three model partitioning approaches (partial-head update, column-based algorithm, and feed-forward splitting) to facilitate end-to-end model partitioning. Cross-operation zero skipping skips zero or near-zero values across multiple operations, which can significantly reduce the amount of computation required; these two optimizations can be executed on different hardware platforms. Model/config-adaptive hardware reconfiguration reconfigures the model architecture for the specific hardware platform it is running on, which can further improve performance. NLP-Fast has been evaluated on a variety of NLP models and hardware platforms, including CPU, GPU, and FPGA. The evaluation results show that NLP-Fast can improve throughput by up to 2.92x, 1.59x, and 4.47x over the baseline performance on each platform.
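Cross-operation zero skipping is realized in NLP-Fast with hardware-specific kernels; as a rough, framework-agnostic illustration of the arithmetic it exploits, the sketch below drops near-zero activations after a ReLU so the following matrix multiplication only touches the surviving entries. The threshold and shapes are arbitrary, and real systems apply the idea far more aggressively across fused operations.

```python
import torch

torch.manual_seed(0)
x = torch.randn(256)                     # one token's hidden state
W1 = torch.randn(256, 1024)
W2 = torch.randn(1024, 256)

h = torch.relu(x @ W1)                   # roughly half the entries are exactly zero
active = h.abs() > 1e-6                  # indices worth computing with
y_skipped = h[active] @ W2[active, :]    # multiply only the non-zero activations

y_dense = h @ W2                         # reference dense computation
print(torch.allclose(y_skipped, y_dense, atol=1e-3))
print(f"{int(active.sum())} of {h.numel()} activations used")  # about half the work skipped
```

Skipping the zero terms leaves the result unchanged because they contribute nothing to the sum; the savings come from never loading or multiplying the corresponding weight rows.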
4) TurboTransformers
TurboTransformers [11] is a lightweight, easy-to-use system that enables efficient deployment of transformer models for online services. It achieves SOTA performance on GPU platforms by proposing three innovative features that distinguish it from other similar systems. Firstly, it proposes an efficient, parallel GPU-based batch reduction kernel algorithm for Softmax and LayerNorm. Secondly, it proposes a sequence-length-aware algorithm for memory allocation to efficiently balance memory allocation and deallocation; this algorithm overcomes the problem of variability in the input sentences. Finally, the framework employs a novel batch scheduler that leverages dynamic programming to achieve optimal throughput on variable-length requests. It is a lightweight and easy-to-use system that can be integrated into PyTorch code with just a few lines of code. This makes it a very accessible option for researchers and practitioners who want to use transformer models for online services.
els for online services. It achieves SOTA performance on its capability to support a substantial number of users and
GPU platforms by proposing three innovative features that accommodate large-scale LLMs. This adaptability is comple-
distinguish it from other similar models: Firstly, proposes mented by the provision of a flexible API, allowing users
an efficient and parallel GPU-based batch reduction kernels to tailor the inference and fine-tuning processes according
algorithm for Softmax and LayerNorm. Secondly, proposes to their specific requirements. The PETALS key feature is
a sequence-length-aware algorithm for memory allocation to its emphasis on collaboration, providing a framework that
efficiently balance memory allocation and deallocation, this enables multiple participants to actively engage in LLM infer-
7) LightSeq
LightSeq [63] is a lightweight inference library that addresses the need for efficient and convenient deployment of Transformer models in online services. It utilizes a combination of GPU optimization techniques, including coarse-grained fused kernel functions, hierarchical auto-regressive search, and a dynamic GPU memory reuse strategy, to achieve significant performance gains compared to TensorFlow and FasterTransformer (FT). It supports a wide range of models and search algorithms, encompassing BERT, GPT, Transformer, and Variational Autoencoders (VAEs), while seamlessly integrating with popular models such as BERT [8], RoBERTa [68], GPT, VAEs, MT Transformer, and Speech Transformer. The library is user-friendly, with a serving system and CUDA implementations enabling easy online deployment of popular models without code modification. It addresses the deployment challenges of resource-intensive sequence models, narrowing the performance gap between large models and the demands of online services. In machine translation benchmarks, it consistently outperforms TensorFlow and FasterTransformer (FT), achieving up to 14× and 1.4× speedups, respectively.

8) EET
Easy and Efficient Transformer (EET) [64] offers a library designed to accelerate transformer inference. It encompasses a range of optimizations for transformer inference, spanning both algorithmic and implementation aspects. To address the inefficiencies of explicit matrix addition and masked attention, the study implements custom CUDA kernels. In addition, to extend all kernels to support larger model sizes (up to 12288) and longer sequences (above 4096), the research proposes a new method called thread block folding. Furthermore, the study introduces a CUDA memory management mechanism aimed at minimizing memory usage for models of this size. EET was evaluated against Fairseq, LightSeq, and FasterTransformer (FT). On a 2080Ti and an A100, EET achieves speedups of 4.48-20.27× and 4.30-27.43×, respectively, compared to Fairseq. On a 2080Ti, EET outperforms LightSeq [63] with a speedup of 0.82-2.46× for model sizes of 768 and 1024. EET attains speedups of 1.21-6.30× and 1.62-8.12× over FT v3.1 on a 2080Ti and an A100, respectively, and a speedup of 1.40-4.20× over FT v4.0 on an A100.

9) Splitwise
Splitwise [65] investigates inefficiencies in LLM inference, which relies on expensive GPUs. The analysis reveals two distinct phases in LLM inference: a compute-intensive prompt computation phase and a memory-intensive token generation phase. While existing methods optimize batching and scheduling, they underutilize compute resources during token generation. To address this, the study proposes separating these phases across different machines. This allows hardware to be optimized for each phase: powerful machines for prompt computation and potentially older, more cost-effective machines for token generation. Splitwise facilitates communication between these machines using fast interconnects. This approach enables the design of clusters optimized for throughput, cost, or power consumption. The model achieves up to 1.4× higher throughput at 20% lower cost, or 2.35× higher throughput with the same cost and power consumption. This approach improves LLM inference efficiency by leveraging hardware specialization, leading to more cost-effective and power-efficient deployments.

10) LLMCompass
Zhang et al. [66] propose LLMCompass, a library that efficiently evaluates hardware designs for LLMs. LLMCompass considers various hardware options and identifies the optimal configuration for a specific task. The study also uses a cost model to recommend the most economical design. The library demonstrates high accuracy, with an average error of 10.4% for predicting task execution time and 4.1% for LLM tasks, compared to real hardware. Notably, the model can simulate running a massive LLM like GPT-3 175B on a powerful computer setup with 4× A100 GPUs in just 16 minutes. Leveraging LLMCompass, the study identified hardware designs that are more affordable than current options (e.g., using less powerful components or cheaper memory) while still offering good performance. These designs could make LLMs more accessible to a wider range of users.

11) PowerInfer
PowerInfer [67] is a high-performance inference engine designed to run LLMs efficiently on consumer-grade GPUs. It leverages the power-law distribution of neuron activation in LLMs, assigning frequently activated (hot) neurons to the GPU and input-specific (cold) neurons to the CPU. This hybrid approach significantly reduces the pressure on GPU memory and minimizes data transfers between the CPU and GPU. Furthermore, PowerInfer incorporates adaptive predictors and neuron-aware sparse operators to optimize performance and maintain model accuracy. Evaluations demonstrate that PowerInfer on an NVIDIA RTX 4090 GPU achieves inference up to 11.69× faster than systems like llama.cpp. It delivers an average token generation rate of 13.20 tokens per second, rivaling the performance of top-tier server-grade GPUs.
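As a rough illustration of the hot/cold split, the sketch below ranks neurons by how often they fired on profiling data and assigns the most active fraction to a "GPU" partition and the rest to a "CPU" partition. The Zipf-distributed counts, the 20% hot fraction, and the plain NumPy matmuls are assumptions made for illustration; they are not PowerInfer's predictors or kernels.

import numpy as np

def split_hot_cold(activation_counts, hot_fraction=0.2):
    # The most frequently activated neurons become GPU-resident ("hot"); the rest stay on the CPU.
    n_hot = max(1, int(len(activation_counts) * hot_fraction))
    order = np.argsort(activation_counts)[::-1]
    return order[:n_hot], order[n_hot:]

rng = np.random.default_rng(0)
counts = rng.zipf(2.0, size=1024)                     # power-law-like activation profile
hot, cold = split_hot_cold(counts)

w = rng.standard_normal((1024, 512)).astype(np.float32)
x = rng.standard_normal(512).astype(np.float32)
y = np.empty(1024, dtype=np.float32)
y[hot] = w[hot] @ x                                   # would run on the GPU
y[cold] = w[cold] @ x                                 # would run on the CPU
print(len(hot), "hot neurons out of", len(counts))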


12) LLM Inference Frameworks and Libraries: Challenges and Key Findings
This section presented various frameworks and libraries designed to improve the efficiency of running LLMs. The following paragraphs discuss the challenges and key findings of the reviewed studies.
Challenges of LLM Inference:
• LLMs are computationally expensive due to their massive size and complex architecture.
• Traditional inference methods struggle to handle large models on resource-constrained devices.
• Balancing speed, accuracy, and resource utilization is crucial for deploying LLMs in real-world applications.
Key Findings:
• Hardware specialization: Splitwise [65] proposes separating compute-intensive and memory-intensive phases onto different machines with specialized hardware.
• Resource optimization: FlexGen [17] utilizes techniques like I/O scheduling, compression, and distributed processing to efficiently use resources from the CPU, GPU, and disk.
• Algorithmic optimizations: Libraries like EET [64] and LightSeq [63] implement custom algorithms and memory management techniques to accelerate inference on GPUs.
• Heterogeneous platforms: NLP-Fast [60] leverages different hardware platforms (CPU, GPU, FPGA) by identifying performance-critical operations and applying targeted optimizations.
• Distributed inference: PETALS [62] facilitates collaborative inference and fine-tuning of LLMs across a network, enabling scalability and efficient resource utilization.
• Efficiency gains: Several frameworks achieve significant performance improvements. DeepSpeed Inference [5] boasts throughput boosts of 1.5× and latency reductions of 7.3×. FlexGen demonstrates even greater throughput gains, particularly on resource-constrained devices. Other frameworks like NLP-Fast [60], TurboTransformers [11], LightSeq [63], and EET [64] show promising results in accelerating inference.

C. LLM DEPLOYMENT AND SERVING LIBRARIES
As mentioned in Section IV, some of the frameworks and libraries are utilized for multiple purposes. Besides vLLM [69] (Section IV-C1), the models used for deployment and serving are discussed in Sections IV-A4 (LightSeq2), IV-B4 (TurboTransformers), and IV-B5 (PetS).

1) vLLM
vLLM [69] is a high-performance system that efficiently handles LLMs at a large scale. The model tackles the memory limitations of existing LLM serving systems through a novel algorithm called PagedAttention (Section V-A1). PagedAttention splits the KV cache into manageable blocks, minimizing wasted memory and enabling efficient sharing across requests. vLLM is a distributed system that supports popular LLMs and even models exceeding single-GPU memory. Evaluations show that vLLM significantly improves throughput, by 2-4× compared to existing systems, especially for complex tasks involving long sequences, large models, and intricate decoding algorithms. This makes vLLM a significant advancement for efficient LLM processing, enabling faster and more scalable LLM applications.
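The bookkeeping idea behind PagedAttention can be sketched with a toy block allocator: the KV cache for a request grows one fixed-size block at a time, so at most one partially filled block is ever wasted per request. The block size of 16 tokens and the tiny API below are illustrative assumptions, not vLLM's implementation.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class PagedKVCache:
    """Toy block table mirroring the idea of paging the KV cache."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # request id -> list of physical block ids

    def add_request(self, req_id):
        self.block_tables[req_id] = []

    def append_token(self, req_id, token_pos):
        # A new physical block is needed only every BLOCK_SIZE tokens.
        if token_pos % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables[req_id].append(self.free_blocks.pop())

    def free_request(self, req_id):
        self.free_blocks.extend(self.block_tables.pop(req_id))

cache = PagedKVCache(num_blocks=8)
cache.add_request("r1")
for pos in range(40):                             # 40 generated tokens occupy only 3 blocks
    cache.append_token("r1", pos)
print(len(cache.block_tables["r1"]), "blocks in use")
cache.free_request("r1")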
2) LLM Deployment and Serving Libraries: Challenges and Key Findings
As explored in previous sections (Sections IV-A6 and IV-B12), a variety of LLM frameworks exist that hold promise for deployment and serving applications. This section discusses the key challenges and findings associated with LLM deployment and serving.
Challenges of LLM Deployment and Serving:
• Memory limitations: Large LLMs can easily overwhelm the memory capacity of a single GPU, which limits their deployment and serving for real-world applications.
• Scalability: Effectively handling multiple user requests simultaneously with large LLMs requires efficient scaling solutions.
• Variability of input: LLM performance can suffer when dealing with input sequences of varying lengths, requiring dynamic memory allocation strategies.
• Ease of deployment: Integrating complex LLM serving systems into existing workflows can be challenging for researchers and practitioners.
Key Findings:
• PagedAttention: This algorithm (introduced by vLLM [69]) breaks down the KV cache into manageable blocks, minimizing wasted memory and enabling efficient sharing across requests. This is a significant improvement for processing large LLMs.
• Efficient GPU utilization: TurboTransformers [11] utilizes techniques like parallel GPU kernels and dynamic batch scheduling to optimize performance on GPUs. This translates to faster inference for transformer-based models.
• System-level optimizations: LightSeq2 [54] demonstrates how system-level optimizations within the training process can significantly improve training speed and efficiency for transformer models. This translates to faster deployment of LLMs in general.
These findings from vLLM [69], TurboTransformers [11], and LightSeq2 [54] offer promising solutions for overcoming challenges in LLM deployment and serving by focusing on memory management, efficient GPU utilization, user-friendly tools, and co-optimization.

V. TRAINING OPTIMIZATION
Training optimization in LLMs involves improving the efficiency and effectiveness of the training process. This encompasses a range of techniques and strategies aimed at improving factors such as convergence speed, model generalization, and resource utilization. The goal of training optimization is to achieve the desired model performance with faster training times, reduced resource requirements, and improved overall training effectiveness. In this section, we focus on model optimization, size reduction, distributed training, and heterogeneous training (Fig. 5).

FIGURE 5. Training optimization techniques (taxonomy: model optimization, covering algorithmic optimization, model partition, layer-specific kernels, fine-tuning, and scheduler optimization; size reduction, covering model compression and quantization, pruning, and hyperparameter optimization; distributed training, covering data, model (tensor and pipeline), combined, sequence, and automatic parallelism, plus ZeRO; and heterogeneous optimization)

A. MODEL OPTIMIZATION
Model optimization in LLMs refers to the process of improving the model's architecture, structure, or parameters to enhance its overall performance. We present various techniques aimed at achieving better accuracy, efficiency, or both. Common model optimization strategies for LLMs include algorithmic optimization (Section V-A1), layer-specific kernels (Section V-A2), model partition (Section V-A3), fine-tuning (Section V-A4), and scheduler optimization (Section V-A5).

1) Algorithmic Optimization
FlexGen [17] devised a linear programming-based search algorithm to optimize throughput within the search space. This model can identify efficient patterns for tensor saving and access.
Building on techniques for efficient model execution, SwapAdvisor [70] proposes a novel approach to deep learning memory management that enables the training and serving of large models despite limited GPU memory. Through smart swapping between CPU and GPU memory, it optimizes scheduling, memory allocation, and swap planning to maximize computational efficiency. This approach allows training models up to 12× beyond the usual GPU memory limit while maintaining substantial performance. It stands as an innovative solution for deep learning with limited GPU resources.
NLP-Fast [60] employs algorithmic optimization techniques to enhance the performance of large-scale heterogeneous NLP models. One of these techniques is cross-operation zero skipping, which eliminates unnecessary computations by skipping zero or near-zero values across multiple operations. By leveraging these techniques, NLP-Fast can significantly improve the overall performance of NLP models on various hardware platforms.
ByteTransformer [4] was developed to address the challenges of redundant computations and memory overhead in transformer models. It employs a combination of algorithmic optimizations and memory-efficient techniques, including a padding-free algorithm, fused MHA, and manually optimized memory sizes for layer normalization. These techniques effectively eliminate unnecessary computations, minimize memory footprint, and reduce the cost of accessing GPU global memory, leading to significant performance gains compared to other transformer frameworks.

TABLE 5. Comparative analysis between different categories

Training Optimization
Focus: Accelerating the training process and minimizing resource usage within LLMs. This is accomplished through various techniques that enhance the efficiency of the training workflow. The aim is to achieve the same level of model performance in a reduced timeframe and with less computational power.
Performance: SwapAdvisor [70] enables the training of models up to 12× larger than the standard GPU memory capacity while preserving substantial performance; ZeRO [18] achieves a 10× speedup and trains trillion-parameter models (8× larger than existing models); Cramming [16] enables single-GPU LLM training in one day; and Megatron-LM [71] outperforms ZeRO-3, achieving a 70% improvement in performance for models with 175 billion and 530 billion parameters.
Cost: Lower training costs due to faster training times, reduced memory usage, or enabling training on less powerful hardware.
Scalability: ZeRO [18] can scale to trillions of parameters; SparseGPT [72] processes very large models (OPT-175B, BLOOM-176B) efficiently; FlexFlow [73] improves parallelism efficiency by 2.5-10×; ZeRO-Offload [20] enables training 10× larger models on the same hardware; and ZeRO-Infinity [1] is highly scalable for training models with trillions of parameters.

Hardware Optimization
Focus: Systematically enhancing the performance, efficiency, and functionality of computer hardware by addressing bottlenecks in hardware architecture, software, and operating systems. This approach can increase overall speed, reduce power consumption, and improve hardware reliability. Additionally, it enables more efficient use of hardware resources and allows for deployment on less powerful devices.
Performance: FlexGen [17], LightSeq2 [54], and TurboTransformers [11] improve inference performance (throughput, latency), potentially reducing operational costs.
Cost: Lower deployment costs by enabling inference on resource-constrained devices (CPUs, FPGAs) or requiring fewer servers for the same workload.
Scalability: TurboTransformers [11] and Splitwise [65] can potentially scale well on different hardware configurations.

Scalability and Reliability
Focus: Improving the ability to train and run large models on distributed systems and to handle potential hardware issues.
Performance: PETALS [62] achieves faster inference for large language models, with an optimal setup inferring a 176-billion-parameter model in 5.5 seconds.
Cost: SWARM Parallelism [10] trains large models on unreliable, heterogeneous devices with low network bandwidth.
Scalability: ZeRO-Offload [20] enables large-model training on single GPUs and scales to larger systems using model parallelism.

The Sheared LLaMA [74] model introduces dynamic batch loading. This algorithm efficiently adjusts the composition of sampled data within each training batch based on the varying losses observed across different domains. The primary objective is to dynamically update the batch loading process to maximize learning efficiency, ensuring that the model reaches the reference loss at approximately the same time across all domains. All training experiments were run on a maximum of 16 NVIDIA A100 GPUs (80 GB).
GrowLength [75] enhances the pre-training of LLMs by gradually increasing the training length; it introduces a method inspired by the principles of extending context windows during training. This approach accelerates the pre-training phase of LLMs by dynamically and gradually extending the length of the training sentences. The primary benefit of this method lies in its adaptability and efficient resource utilization: it uses computational resources effectively, allowing models to process more tokens within a restricted time frame. Throughout the training process, the model incrementally increases the training length, resulting in reduced computational expenses and enhanced training efficiency. The experiments were run in three different setups: 1) LLM128, in which the sentence length was fixed at 128 tokens, totaling 0.36B tokens; 2) LLM1024, in which the sentence length was set to 1024 tokens with the same total tokens as LLM128, allowing direct runtime comparison; and 3) GrowLength, in which the method progressively grew the length from 128 to 1024 tokens, saving time at the shorter lengths and reaching full performance at 1024 tokens. As a result, with equivalent tokens, LLM1024 required longer pre-training than LLM128, while GrowLength led to a significant decrease in loss and highlighted its computational efficiency and practical value in resource-constrained configurations.
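A progressive length schedule in this spirit can be sketched as follows; the doubling stages, the 128-to-1024 range, and the equal step counts are illustrative assumptions rather than the exact schedule used by GrowLength.

def growlength_schedule(total_steps, start_len=128, end_len=1024, n_stages=4):
    # Train on short sentences first, then progressively double the length.
    lengths = [min(start_len * 2 ** i, end_len) for i in range(n_stages)]
    steps_per_stage = total_steps // n_stages
    schedule = []
    for length in lengths:
        schedule += [length] * steps_per_stage
    schedule += [lengths[-1]] * (total_steps - len(schedule))   # remainder at full length
    return schedule

sched = growlength_schedule(total_steps=1000)
print(sched[0], sched[500], sched[-1])    # 128, 512, 1024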
PagedAttention [69] introduces another innovative approach to improving efficiency. This novel attention algorithm is inspired by the virtual memory used in operating systems: it splits the KV cache into fixed-size blocks, similar to memory pages, reducing fragmentation and enabling efficient sharing across requests. This approach significantly improves memory utilization and allows for larger batch sizes.

2) Layer-Specific Kernels
LightSeq [63] (Section IV-B7) is a lightweight inference library. Instead of using a straightforward combination of the fine-grained GPU kernel functions found in TensorFlow or PyTorch implementations, it utilizes a method known as coarse-grained fusion. This strategy mitigates the significant time costs associated with numerous kernel function launches and GPU memory I/O operations for intermediate results. As a result, it achieves a significant reduction in the number of atomic kernel functions, leading to a remarkable performance boost compared to conventional TensorFlow approaches.
LightSeq2 [54] (Section IV-A4) proposed a software library that accelerates the training of transformer-based models on GPUs. It provides system-level optimizations while maintaining accuracy and training behavior. The library works with BERT (encoder), GPT (decoder), Transformer (encoder-decoder), and vision transformer models. LightSeq2 uses three techniques to improve training speed and efficiency. The first technique, used to increase GPU utilization, is layer-specific kernels: after analyzing Transformer-specific layers in detail, the kernels are rewritten with their dependencies and other techniques to improve parallelism, and small kernels are used to improve GPU utilization.

3) Model Partition
NLP-Fast [60] (Section IV-B3) accelerates the performance of large-scale heterogeneous NLP models by applying several techniques. It proposes holistic model partitioning as a solution for optimizing every operation in NLP models. This technique breaks the model down into smaller, more efficient submodels that can be tailored to different hardware platforms.
GPipe [3] (Section IV-A1) is efficient, task-independent, and supports any deep neural network architecture that can be expressed as a sequence of layers. It can use different accelerators, each of which supports re-materialization. GPipe partitions the model across the accelerators, with each accelerator responsible for a sequence of layers (called a cell).
Megatron-LM [19] introduces a new method for training LLMs, which enables the training of exceptionally large transformer models with billions of parameters on GPUs. Megatron-LM uses intra-layer model parallelism, a strategy that subdivides the model into smaller submodels capable of being trained separately.

4) Fine-tuning
AlphaTuning [76] is a novel method specifically designed for large-scale pre-trained language models (PLMs). It combines the quantization of PLMs with fine-tuning: only a subset of the quantized parameters is fine-tuned for the target task. This selective approach significantly decreases the overall memory footprint and the number of parameters to be trained. Despite these reductions, it maintains performance levels comparable to full fine-tuning across a diverse range of downstream tasks.
QFT [77] proposes a novel framework designed for memory-efficient fine-tuning of LLMs. The model utilizes quantization techniques to significantly reduce memory usage during fine-tuning while preserving model performance. The framework adopts the Lion optimizer, known for its memory efficiency and compatibility with quantization, and converts all model states into integers to minimize the memory footprint. The study also features a specialized gradient flow and parameter update scheme tailored for quantized weights. Extensive evaluations show the framework's effectiveness, allowing fine-tuning of large LLaMA-7B models with less than 30 GB of memory on a single A6000 GPU, a substantial reduction compared to standard methods, while maintaining similar performance across various benchmarks.
LOMO [58] is a novel technique for training LLMs on machines with limited GPU capacity. LOMO proposes a memory-efficient update method that greatly lowers memory consumption compared to traditional methods. This enables fine-tuning large models, such as those with 65 billion parameters, on consumer-grade GPUs like the RTX 3090. The study validates LOMO's efficiency through analyses of memory usage, performance testing, and benchmark task evaluations.
Existing techniques like LOMO reduce memory usage but compromise performance. AdaLomo [59] offers a better solution: it incorporates a key feature of the powerful AdamW optimizer (adaptive learning rates) but uses careful techniques to stay memory-friendly. This allows AdaLomo to match AdamW's performance on various tasks, making LLM training more accessible with less memory. On average, AdaLomo achieved scores of 30.8, 39.7, 51.0, and 56.9 on the LLaMA benchmark for models with 7B, 13B, 30B, and 65B parameters, respectively.
LoRA [35] is a method designed to adapt LLMs, such as GPT-3, for specific tasks, addressing the challenges of traditional fine-tuning. Instead of adjusting all pre-trained model weights, LoRA introduces trainable rank decomposition matrices into each layer of the Transformer architecture, significantly reducing the number of trainable parameters needed for downstream tasks. This approach reduces the number of trainable parameters by 10,000× and reduces GPU memory requirements by 3× compared to GPT-3 175B fine-tuned with Adam, while maintaining or improving model quality on benchmarks like RoBERTa, DeBERTa, GPT-2, and GPT-3. LoRA achieves higher training throughput with no added inference latency and facilitates efficient task switching by sharing the pre-trained model and optimizing only the small low-rank matrices, thereby reducing storage and hardware costs. It is versatile, can be combined with other methods, and is applicable to any neural network with dense layers. For GPT-3 175B, LoRA with 4.7M parameters achieves 73.4% accuracy on WikiSQL, 91.7% on MNLI-m, and ROUGE-1/2/L scores of 53.8/29.8/45.9 on SAMSum, demonstrating its superior performance and efficiency.
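The core of LoRA, a frozen weight matrix plus a trainable low-rank update, fits in a few lines. The rank, the alpha/r scaling, and the zero initialization of B below follow the common convention but are an illustrative NumPy sketch, not the authors' exact training setup.

import numpy as np

class LoRALinear:
    # Output is x @ (W + (alpha / r) * B @ A).T with W frozen; only A and B are trained.
    def __init__(self, w_frozen, rank=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = w_frozen.shape
        self.w = w_frozen                                    # frozen pre-trained weight
        self.A = 0.01 * rng.standard_normal((rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                     # trainable; zero init leaves the start point unchanged
        self.scaling = alpha / rank

    def forward(self, x):
        return x @ self.w.T + self.scaling * (x @ self.A.T) @ self.B.T

w = np.random.default_rng(1).standard_normal((512, 512))
layer = LoRALinear(w, rank=8)
x = np.random.default_rng(2).standard_normal((4, 512))
print(layer.forward(x).shape)    # (4, 512); only 2 * 8 * 512 values are trainable instead of 512 * 512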
5) Scheduler Optimization
TurboTransformers [11] (Section IV-B4) introduces a novel sequence-length-aware batch scheduler that utilizes dynamic programming (DP) to optimize response throughput. This approach overcomes the limitations of traditional batch schedulers that struggle with varying input lengths, because the scheduler considers sequence length in its batching decisions. The scheduler's core algorithm operates in O(n²) time complexity, making it efficient for real-time applications.
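The benefit of length-aware batching can be illustrated with a small dynamic program that packs length-sorted requests into batches while trading padded tokens against a fixed per-batch overhead. The cost model, the overhead constant, and the O(n²) recurrence below only mirror the spirit of the idea; they are not TurboTransformers' actual scheduler.

def batch_by_length(lengths, max_batch=8, batch_overhead=64):
    # dp[i] = minimal cost of serving the first i (length-sorted) requests,
    # where a batch over requests j+1..i costs a fixed overhead plus its padded size.
    lengths = sorted(lengths)
    n = len(lengths)
    dp = [0.0] + [float("inf")] * n
    split = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_batch), i):
            cost = dp[j] + batch_overhead + (i - j) * lengths[i - 1]
            if cost < dp[i]:
                dp[i], split[i] = cost, j
    batches, i = [], n                     # recover the chosen batch boundaries
    while i > 0:
        batches.append(lengths[split[i]:i])
        i = split[i]
    return batches[::-1]

print(batch_by_length([5, 7, 8, 30, 32, 31, 100, 96]))   # short, medium, and long requests land in separate batches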
PetS [61] (Section IV-B5) introduces a unified framework aimed at enhancing multi-task PET serving efficiency. It comprises two main components, a flexible PET task management mechanism and a specialized PIE, which together facilitate both inter-task and inter-algorithm query-batching and streamline the processing of PET tasks. This approach optimizes resource utilization and enhances the efficiency of PET serving. The PET task scheduler efficiently schedules PET operations to run in parallel on the GPU, maximizing hardware utilization and performance. It dynamically assigns PET tasks to CUDA streams, considering both PET operator characteristics and system resource constraints. This lightweight online scheduling strategy effectively balances computational and memory-intensive tasks, leading to improved throughput and reduced latency in multi-task PET serving scenarios.

B. SIZE REDUCTION OPTIMIZATION
Minimizing the size or complexity of LLMs is a crucial optimization technique known as size reduction optimization. This approach is essential for addressing challenges associated with memory demands, computational efficiency, and storage limitations. Size reduction optimization encompasses various techniques, including model compression and quantization (Section V-B1), pruning (Section V-B2), and hyperparameter optimization (Section V-B3).
Cramming [16] investigates the trade-offs involved in scaling down language model training and examines different parts of the training pipeline to identify the modifications that have the biggest impact on performance in a scaled-down setting. The research found that, even under customized and constrained settings, the scaling laws [78] largely held, just as observed for performance in large-compute settings. As a predictable outcome of these laws, downscaling is a challenging task. However, a smaller model architecture requires less computation and allows gradient computations to be sped up, so the rate of improvement over time remains nearly unchanged. The study showed that modifications to the training methodology can leverage the scaling laws to bring about enhancements by increasing the effective rate of gradient computations without sacrificing model size. Two setups were analyzed: one utilizing a classical RTX 2080 Ti GPU, and the other employing a modern RTX A4000 or RTX A6000 GPU; each setup was configured with 4 CPU cores and 32 GB of RAM. The paper proposes several modifications to the standard training pipeline that make it possible to train a language model on a single GPU in one day, each of which has a direct impact on size reduction: a smaller model architecture, a shorter training schedule, a lower learning rate, mixed-precision training, and a specialized training library.

1) Model Compression and Quantization
FlexGen [17] (Sections IV-B2 and V-A1), through a linear programming-based search algorithm, identifies optimal patterns for tensor storage and retrieval. Furthermore, it employs 4-bit quantization to compress weights and the attention cache without compromising accuracy, significantly reducing model size and memory footprint. These optimizations enable it to achieve impressive throughput gains compared to existing LLM inference systems.
SWARM parallelism [10] proposes a model for training large models on unreliable, heterogeneous devices with low network bandwidth by using dynamically generated, randomized pipelines instead of static ones. The study incorporates 8-bit compression to minimize model size and facilitate training on resource-constrained devices with limited network bandwidth. This compression technique significantly reduces the amount of data that needs to be transferred between nodes during training, leading to improved efficiency and throughput.
QMoE [79] is a compression and execution framework that reduces memory usage significantly. This is achieved through a scalable compression algorithm that shrinks trillion-parameter MoEs down to less than 1 bit per parameter. This impressive compression is facilitated by a custom format specifically designed to work with bespoke GPU kernels, enabling efficient processing with minimal slowdowns. QMoE can compress the SwitchTransformer-c2048 model to under 160 GB (20× compression, 0.8 bits per parameter) with minimal impact on accuracy, achievable within a day on a single GPU. This enables the execution of trillion-parameter models on affordable commodity hardware, such as a single server with 4× NVIDIA A6000 or 8× NVIDIA 3090 GPUs, with less than 5% runtime overhead compared to uncompressed inference. The framework reduces the model size from 3.2 TB in bfloat16 to less than 160 GB, allowing efficient execution on commodity hardware and enhancing the practical adoption and research of MoE architectures.
AlphaTuning [76] is a compression-aware parameter-efficient adaptation method for large-scale PLMs. It combines the quantization of PLMs with fine-tuning, but only a subset of the quantized parameters is fine-tuned for the target task. This significantly reduces the total memory footprint and the number of trainable parameters, while still achieving performance comparable to full fine-tuning on a variety of downstream tasks. It relies on binary-coding quantization, a technique that decomposes full-precision parameters into binary parameters alongside a distinct set of scaling factors. The model was evaluated across various PLMs and downstream tasks and achieves performance comparable to full fine-tuning, even at low bitwidths. Applied to GPT-2 and OPT, it achieved a compression ratio of over 10× under 4-bit quantization and a reduction in the number of trainable parameters of over 1,000-fold, while still achieving competitive performance on a variety of downstream tasks.
GPTQ [80] proposes a new, highly accurate and highly efficient post-training quantization method based on approximate second-order information, presented as a new one-shot weight quantization. The method can precisely quantize models to 3 or 4 bits per parameter with acceptable accuracy, and it requires at most a few hours to run on a model with hundreds of billions of parameters. Experiments on OPT-175B and BLOOM-176B took approximately 4 GPU hours to reduce the bitwidth down to 3 or 4 bits per weight, with minimal loss of accuracy compared to the uncompressed baseline. Compared to previous one-shot quantization methods, the model achieves more than twice the compression without sacrificing accuracy. In addition, with this method, models with 175 billion parameters can for the first time execute inside a single GPU for generative inference. The results show that these enhancements can boost performance by up to 3.25× on high-end GPUs (NVIDIA A100) over FP16 and by up to 4.5× on more cost-effective GPUs (NVIDIA A6000). The model can also achieve robust accuracy even under an extreme quantization regime, with weights quantized to 2-bit or ternary levels.

FPTQ [81] also proposes a novel post-training quantization technique to address the challenge of deploying LLMs. This technique effectively compresses LLMs into a format using 4-bit weights and 8-bit activations (W4A8). The approach achieves SOTA performance on popular LLMs like BLOOM [82], LLaMA [14], and LLaMA-2 without requiring further fine-tuning. FPTQ offers a significant advantage by optimizing both memory usage and computational efficiency during the inference stage without sacrificing accuracy, which simplifies the deployment process for LLMs and makes them more practical for real-world use. The model was validated on various datasets, including LAMBADA, MMLU, and a set of Common Sense QA tasks. The researchers compared the model's performance to an existing technique called LLM-QAT (LLM Quantization-Aware Training); however, limited data availability for LLM-QAT restricted the comparison to the Common Sense QA dataset. On this task, FPTQ achieved results closer to FP16 than LLM-QAT did. While the analysis was only possible for the 7B and 13B parameter LLaMA models due to data limitations, FPTQ consistently performed better across all subsets of the dataset, as evidenced by average scores of 73.38 and 76.81 for LLaMA-7B and 13B, respectively. These findings suggest that FPTQ is an effective approach for LLM quantization.
Norm Tweaking [56] introduces a novel quantization technique specifically for LLMs. While existing quantization methods like GPTQ [80] achieve acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often lead to significant performance degradation. Norm Tweaking introduces a strategy to rectify the quantized activation distribution, restoring accuracy for LLMs. The method involves generating calibration data and applying channel-wise distance constraints to the normalization layer weights. Experiments show significant improvements in both weight-only quantization and joint quantization of weights and activations, achieving high accuracy even at 2-bit quantization. It offers a practical solution for reducing computational and storage costs in LLMs while maintaining performance.
FineQuant [83] introduces an innovative weight-only quantization technique that significantly decreases memory usage and speeds up LLM inference with minimal quality loss. Key features of this technique include utilizing pre-trained model weights without further fine-tuning, applying adaptive-granularity quantization to minimize accuracy loss, and implementing an efficient GPU processing approach. Tested on large-scale models such as OPT-175B, FineQuant demonstrates minimal accuracy loss, achieves up to 3.65× higher throughput with the same number of GPUs, and reduces resource demands.
PETALS [62] (Section IV-B6) is a collaborative platform for distributed inference and fine-tuning of LLMs over the internet. To enhance efficiency, quantization techniques are employed to store a higher number of parameters per GPU, thereby decreasing the number of consecutive devices and communication rounds required, and 8-bit precision is used to compress the weights, reducing the number of nodes needed to store all layers. To achieve more efficient data transfer between pipeline stages, dynamic blockwise quantization is utilized. Using 8-bit mixed matrix decomposition for matrix multiplication allows the model to quantize the weights to 8-bit precision, significantly reducing the memory footprint compared to 16-bit weights.
QFT [77] (Section V-A4) addresses memory limitations during LLM fine-tuning by introducing a novel quantization framework. It converts all model states into integers to minimize the memory footprint and employs the Lion optimizer for its memory efficiency and compatibility with quantization. Additionally, the framework incorporates a specialized scheme for handling quantized weights during training.
QuantEase [84] is a framework for post-training quantization of LLMs that enhances their deployment efficiency. The framework addresses the challenge of layer-wise quantization by optimizing each layer individually, utilizing Coordinate Descent (CD) to achieve high-quality solutions efficiently without complex matrix operations. The framework includes an outlier-aware variant that keeps crucial "outlier" weights in full precision to enhance accuracy. Demonstrating SOTA performance, QuantEase significantly improves perplexity and zero-shot accuracy compared to existing methods like GPTQ [80], with up to 15% relative improvement. Efficient linear algebra optimizations allow for the quantization of large models such as Falcon-180B on a single GPU in under 3 hours. The outlier-aware variant supports near- or sub-3-bit quantization with minimal accuracy loss, outperforming methods like SpQR by up to two times in perplexity reduction.
LLM-Pruner [85] (Section V-B2) compresses LLMs by removing non-essential parts based on gradient information while preserving their functionality. This significantly reduces model size with minimal accuracy loss, achieved through fine-tuning with a small amount of data.

2) Pruning
The SparseGPT [72] framework provides an efficient and precise post-training pruning technique for significantly reducing the size of large-scale GPT-family models. This method achieves at least 50% sparsity in a single step, without requiring retraining. Remarkably, it enables the processing of the largest open-source models, such as OPT-175B and BLOOM-176B, in less than 4.5 hours. It allows models to reach 50-60% unstructured sparsity with a negligible increase in perplexity, removing more than 100 billion weights with minimal impact on accuracy. The study demonstrates that the parameterization of massive GPT models enables pruning without relying on gradient information, and it highlights that sparse models with accuracy comparable to dense models can be identified within the "close neighborhood" of the dense models. The study's findings reveal that sparse models achieve performance very similar to the dense models. The study also shows that it is easier to prune larger models: for a fixed sparsity level, the accuracy drop for larger sparse models is smaller, to the point where reaching 50% sparsity does not result in any noticeable accuracy decrease on the largest models.
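For intuition about what 50% unstructured sparsity means, the snippet below applies naive layer-wise magnitude pruning; SparseGPT itself relies on an approximate second-order, one-shot procedure rather than this simple criterion.

import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Zero out the smallest-magnitude weights until the requested fraction is removed.
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.default_rng(0).standard_normal((1024, 1024)).astype(np.float32)
w_sparse, mask = magnitude_prune(w, sparsity=0.5)
print("achieved sparsity:", 1.0 - mask.mean())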

Sheared LLaMA [74] is used to reduce the LLaMA2-7B model to 1.3B and 2.7B parameters, and the resulting models performed better than other open-source models of the same size on a variety of downstream and instruction-tuning evaluations. LLM-shearing also requires only 3% of the compute needed to train the same models from scratch. One of the main steps of Sheared LLaMA is a novel pruning algorithm, an extended version of CoFiPruning, that can prune a source model to any specified target architecture based on the desired model size and performance requirements. Pre-trained models are typically well optimized to balance expressivity and inference efficiency, so these configurations are used as the target architectures.
LLM-Pruner [85] introduces a framework for compressing LLMs in a task-agnostic way while minimizing the need for the original training corpus. The framework uses structural pruning to remove non-critical parts of the model based on gradient information. The pruned models' performance is recovered using LoRA [35] tuning, which takes just 3 hours and 50K data samples. Experiments on LLaMA [14], Vicuna [86], and ChatGLM [87] show that the compressed models maintain 94.97% of their original performance even after removing 20% of the parameters. However, higher pruning rates lead to significant performance drops and incoherent sentence generation.

3) Hyperparameter Optimization
Selecting the right hyperparameters is essential for developing effective LLMs, as these parameters significantly influence the model's convergence speed, generalization ability, and overall performance in various language tasks. While often an iterative and computationally demanding process, hyperparameter optimization is crucial for achieving optimal model performance. Cramming [16] employs a lower learning rate to stabilize the training process and prevent overfitting, enabling effective model training within limited computational resources.

C. DISTRIBUTED TRAINING
Distributed training refers to the process of training LLMs across multiple computing devices or processing units. This approach harnesses the power of parallelism to distribute the computational burden, enabling faster training of large models with millions or even billions of parameters. Distributed training is crucial for managing the massive datasets and computational demands associated with cutting-edge LLMs.

1) Data Parallelism
Data parallelism is a parallel training technique that replicates the entire model across multiple GPUs or devices and distributes the training data among them. Each device handles a portion of the data, performs forward and backward propagation, and computes gradients independently. These gradients are then aggregated across all devices to update the global model parameters. It is a fundamental and widely used technique for improving the training throughput of deep learning models; its simplicity, scalability, and effectiveness make it a valuable tool for researchers and practitioners in the field of machine learning [15], [24], [71], [82].
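A toy simulation of a single data-parallel step is shown below: each "device" computes a gradient on its own shard, and the plain average stands in for the all-reduce that real frameworks perform with collectives such as NCCL. The least-squares model is an assumption made purely for illustration.

import numpy as np

def data_parallel_step(weights, data_shards, lr=0.1):
    grads = []
    for x, y in data_shards:                            # one (x, y) shard per device
        pred = x @ weights
        grads.append(2 * x.T @ (pred - y) / len(x))     # local gradient of the mean squared error
    avg_grad = np.mean(grads, axis=0)                   # stands in for the all-reduce
    return weights - lr * avg_grad                      # every replica applies the identical update

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 1))
shards = [(rng.standard_normal((32, 16)), rng.standard_normal((32, 1))) for _ in range(4)]
w = data_parallel_step(w, shards)
print(w.shape)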
2) Model Parallelism
Model parallelism can be classified into two groups: tensor parallelism (Section V-C2a) and pipeline parallelism (Section V-C2b).

a: Tensor Parallelism
Tensor parallelism involves partitioning a tensor across an array of devices, necessitating a distributed matrix-matrix multiplication algorithm for the mathematical computations. Using tensor parallelism reduces the response time for individual queries [15], [17]. Megatron-LM introduced 1D tensor parallelism (Section IV-A3), which partitions the linear layers within the Transformer architecture along either the row or column dimension; within Megatron-LM, tensors are split along a single dimension [15], [19].
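A minimal sketch of the 1D idea for a single linear layer is given below: the weight's output columns are split across hypothetical devices, each device computes its slice, and the concatenation stands in for the all-gather. This illustrates the general technique, not Megatron-LM's implementation.

import numpy as np

def column_parallel_linear(x, w, world_size):
    # Split the output columns of w across devices and concatenate the partial results.
    cols = np.array_split(np.arange(w.shape[1]), world_size)
    partials = [x @ w[:, c] for c in cols]         # one smaller matmul per device
    return np.concatenate(partials, axis=1)        # stands in for the all-gather

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
w = rng.standard_normal((512, 2048))
assert np.allclose(column_parallel_linear(x, w, world_size=4), x @ w)
print("column-parallel result matches the single-device matmul")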
b: Pipeline Parallelism
FlexGen [17] (Section IV-B2) utilizes pipeline parallelism to distribute an l-layer LLM evenly across m GPUs, enabling parallel execution of all layers. Each GPU executes the same sequence of operations, essentially reducing the problem to running an l/m-layer transformer on a single GPU, which allows the existing policy search algorithm developed for the single-GPU case to be reused. To implement micro-batch pipelining, a new repetition statement (for-loop) is added to the algorithm, effectively merging the iteration-level pipeline-parallel execution schedule with the single-device offloading runtime.
PETALS [62] (Section IV-B6) utilizes pipeline parallelism to efficiently distribute the computation of LLMs among multiple servers. Servers are organized into a chain, with each server responsible for executing a portion of the model pipeline. This approach enables efficient parallel processing, improving the overall performance of inference and fine-tuning tasks.
GPipe [3] (Section IV-A1) employs a novel pipeline parallelism algorithm based on batch splitting, where mini-batch examples are divided into smaller micro-batches and sequentially executed across cells during training. The model utilizes synchronous mini-batch gradient descent, accumulating gradients from all micro-batches and applying them to the model at the end of the mini-batch. The efficiency of the model is demonstrated through the successful training of large-scale models, including a convolutional model (AmoebaNet) for image classification and a transformer model for machine translation. The model showcases its flexibility across various deep network architectures, achieving superior results and consistent training performance on diverse devices.
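The fill-and-drain forward schedule produced by micro-batching can be visualized with a short simulation; the stage and micro-batch counts are arbitrary, and the backward pass and gradient accumulation are omitted, so this is only a sketch of the scheduling pattern, not GPipe's runtime.

def pipeline_forward_schedule(num_stages, num_microbatches):
    # Stage s processes micro-batch m at tick m + s: the classic fill-and-drain pattern.
    ticks = []
    for t in range(num_microbatches + num_stages - 1):
        ticks.append([(m, t - m) for m in range(num_microbatches) if 0 <= t - m < num_stages])
    return ticks

for t, active in enumerate(pipeline_forward_schedule(num_stages=4, num_microbatches=8)):
    print(f"tick {t:2d}: " + ", ".join(f"mb{m}@stage{s}" for m, s in active))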

DFX [88] is a low-latency multi-FPGA appliance for accelerating transformer-based text generation. It uses model parallelism to split the transformer model across multiple FPGAs, which allows each FPGA to process a different part of the model in parallel, thereby accelerating the overall text generation process. It also uses an efficient, ring-topology network to interconnect the FPGAs and minimize communication overhead. The system utilized four Xilinx Alveo U280 FPGAs and was evaluated on the GPT-2 language model, demonstrating a 5.58× acceleration in speed and a 3.99× improvement in energy efficiency compared to four NVIDIA V100 GPUs. In addition to its performance and energy-efficiency benefits, this solution proves to be more cost-effective than GPU-based alternatives, offering an 8.21× cost advantage over a GPU appliance delivering similar performance.

3) Combined Parallelism
Narayanan et al. [71] proposed a new technique called PTD-P for training LLMs on GPU clusters. PTD-P combines pipeline parallelism, tensor parallelism, and data parallelism to achieve high computational performance and graceful scaling. Data parallelism divides the training data into smaller batches, which are then processed in parallel on all the GPU servers; this allows PTD-P to achieve faster training by leveraging the parallel computing capabilities of the GPU cluster. GPipe [3] and ZeRO [18] (Sections IV-A1 and V-C4, respectively) are other examples of combined parallelism.

4) ZeRO
ZeRO [18] proposes solutions to overcome the limitations of existing methods and efficiently train large models. With existing systems, memory consumption can be classified into two main parts: model states and residual states. When working with large models, most of the memory capacity is used by model states (such as the momentum and variance in Adam, gradients, and parameters), while the rest of the memory is occupied by residual states (such as activations, temporary buffers, and unusable fragmented memory). Optimizing both model state memory and residual state memory is crucial for efficiently training models of such colossal sizes as they grow from billions to trillions of parameters. The study introduces a novel memory optimization technique aimed at substantially improving training speed; this approach enables scaling the model size in proportion to the number of devices while maintaining high efficiency. Leveraging the latest hardware, the model can scale to over 1 trillion parameters by carefully evaluating communication volume and memory capacity requirements, which boosts memory efficiency for model states. For optimizing model state memory, which occupies most of the memory during training, the study introduces ZeRO-DP, ZeRO-powered data parallelism, which has three main optimization stages: in the first stage, only the optimizer states are partitioned; in the second stage, both optimizer states and gradients are partitioned; and in the final stage, all three model states are partitioned. This results in a significant boost in memory efficiency. The remaining memory, consumed by residual states, can become a secondary memory bottleneck. The study overcomes this problem through three measures. First, activation partitioning optimizes activation memory by locating and removing activation replication in existing model parallelism (MP) and offloads activations to the CPU when appropriate. Second, temporary buffers of appropriate size are introduced to strike a balance between memory and computational efficiency. Finally, because tensors have varying lifetimes, memory becomes fragmented during training, and the lack of contiguous memory can lead to allocation failures even when sufficient free memory is available; to address this, ZeRO-R proactively manages memory based on the distinct lifetimes of tensors, thereby preventing memory fragmentation. Remarkably, this model achieves a throughput of 15 petaflops when training models with over 100 billion parameters, demonstrating super-linear speedup on 400 GPUs, and it delivers an 8× increase in model size and a 10× increase in performance compared to recent SOTA models.
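A toy sketch of the first optimization stage, in which each data-parallel rank keeps only its 1/N shard of the Adam optimizer state, is shown below with all communication omitted; the sizes and layout are illustrative assumptions, not ZeRO's implementation.

import numpy as np

def partition_optimizer_state(num_params, world_size):
    # ZeRO stage-1 idea: every rank owns only a 1/world_size slice of the optimizer state.
    shard = -(-num_params // world_size)              # ceiling division
    shards = {}
    for rank in range(world_size):
        lo, hi = rank * shard, min((rank + 1) * shard, num_params)
        shards[rank] = {
            "param_slice": (lo, hi),
            "momentum": np.zeros(hi - lo, dtype=np.float32),   # Adam moments stored only on this rank
            "variance": np.zeros(hi - lo, dtype=np.float32),
        }
    return shards

shards = partition_optimizer_state(num_params=10_000_000, world_size=8)
per_rank = shards[0]["momentum"].nbytes + shards[0]["variance"].nbytes
print(f"optimizer state per rank: {per_rank / 2**20:.1f} MiB "
      f"(vs {per_rank * 8 / 2**20:.1f} MiB without partitioning)")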
5) Sequence Parallelism
Sequence parallelism [15], [89] is a novel approach proposed to efficiently train Transformers with longer sequences on GPUs. It addresses the quadratic memory requirements of self-attention in Transformer models. Unlike traditional methods, it does not require a single device to handle the entire sequence: by splitting sequences into chunks and distributing them across devices, it achieves effective training with sequences of effectively unlimited length. It introduces Ring Self-Attention to enhance the process, demonstrating superior performance in batch size and sequence length compared to tensor parallelism and handling sequences over 27× longer than existing methods.

6) Automatic Parallelism
Automatic selection of parallelization strategies is among the latest advances in parallel training, demonstrated by FlexFlow [73] and Alpa [24], [90]. Alpa is an automated system that generates execution plans for distributed model-parallel training; its architecture can automatically derive efficient parallel execution plans at each parallelism level. It differs from specialized systems in that it can handle models with heterogeneous architectures and models without manually designed plans. However, it is not hardware-aware and does not consider network topology; it also does not search over activation checkpointing, which can lead to suboptimal results. Alpa has been evaluated on training large models with billions of parameters, and its performance has been compared with SOTA systems such as Megatron-LM [19] and DeepSpeed [5] on an Amazon EC2 cluster with 64 GPUs. It delivers training performance similar to Megatron-LM on GPT models and outperforms DeepSpeed on GShard MoE models with up to a 9.7× speedup. Moreover, it generalizes well to models without manual strategies and demonstrates 80% linear scaling efficiency on Wide-ResNet. These results highlight Alpa's ability to train large models efficiently and to generalize to diverse models.
with billions of parameters. The model’s performance has models by pinpointing the most resource-intensive operations
been compared with the SOTA systems such as Megatron- and employing a combination of techniques: holistic model
LM [19] and DeepSpeed [5], on an Amazon EC2 cluster partitioning, cross-operation zero skipping, and model/config
with 64 GPUs. It presents the similar training performance as adaptive hardware reconfiguration. Splitwise [65] (section
Megatron-LM on GPT models and outperforms DeepSpeed IV-B9) improves LLM inference by separating workload onto
on GShared MoE models with up to 9.7× speedup. Moreover, different machines for high throughput, cost, or power effi-
it generalized well to models without manual strategies and ciency. It allows for building both homogeneous and hetero-
demonstrated 80% liner scaling efficiency on Wide-ResNet. geneous clusters depending on the optimization goal.
The results presented that Alpa’s performance in training
large models efficiently and its ability to generalize to diverse E. TRAINING OPTIMIZATION: CHALLENGES AND KEY
models. FINDINGS
In the previous sections we have offered a comprehensive
D. HETEROGENEOUS TRAINING overview of training optimization (Section V) which includes
ZeRO-Offload [20], aims to democratize large-scale model model optimization (Section V-A), size reduction optimiza-
training, making it accessible to a wider audience. It achieves tion (Section V-B), distributed training (Section V-C), and
this by using a single GPU to train models with over 13 billion heterogeneous training (Section V-D). In this section and the
parameters, eliminating the need for data scientists to modify following paragraphs, we will discuss training optimization’s
the models or sacrifice computational efficiency. The study challenges and key findings.
introduces ZeRO-Offload, a novel heterogeneous deep learn- Challenges of Model Optimization:
ing (DL) training technology. The model leverages both CPU
• Resource constraints: LMs demand significant memory
memory and compute for offloading and offers an efficient
and computational power, limiting training and deploy-
scaling path on multiple GPUs through collaboration with
ment on single devices.
ZeRO-powered data parallelism [18]. Through first-principle
• Balancing efficiency and accuracy: Optimizing LLMs
analysis, the study asserts that the model provides an opti-
requires finding a balance between efficient resource
mal solution, maximizing memory savings while minimizing
utilization and maintaining model performance.
communication and CPU compute overhead for large model
• Memory bottlenecks: Distributing LMs across devices
training.
introduces memory limitations on each device.
ZeRO-Infinity [1] introduces an innovative system tech-
• Communication overhead: Data exchange between de-
nology that enables the model scaling on constrained re-
vices during training can become a bottleneck, slowing
sources. It achieves this without the need for extensive model
down the process.
code modifications by harnessing the power of GPU, CPU,
• Hardware heterogeneity: Efficiently utilizing devices
and NVMe memory. The model made up of five innovative
with varying memory capacities and processing speeds
technologies: 1) infinity offload engine, this technique uses
in a distributed setting is challenging.
simultaneous exploitation of GPU, CPU, and NVMe memory,
• Scalability limitation: Traditional methods might not
as well as GPU and CPU compute to fully leverage hetero-
scale well with increasing device numbers due to mem-
geneous architecture on modern clusters, 2) memory-centric
ory and communication constraints.
tiling, handle extensive operators without necessity of model
parallelism, 3) bandwidth-centric partitioning, is employed Key Findings:
to make the most of the aggregate memory bandwidth across • Algorithmic: Techniques like FlexGen [17], LightSeq
all parallel devices, 4) overlap-centric design, is implemented [63], and NLP-Fast [60] improve efficiency by optimiz-
to enable the simultaneous execution of compute and com- ing computations, memory access, and utilizing special-
munication tasks, 5) ease-inspired implementation, to pre- ized hardware kernels.
vent the need for extensive model code refactoring. SWARM • Model partitioning: Techniques like GPipe [3] and
Parallelism [10] (section V-B1) introduced a model aimed at Megatron-LM [19] partition models for efficient pro-
training large models efficiently, particularly on unreliable cessing across multiple devices.
heterogeneous devices with limited network bandwidth. In- • Fine-tuning for efficiency: Techniques like AlphaTuning
stead of employing static pipelines, the model utilizes dynam- [76] and LoRA [35] enable fine-tuning large models on
ically generated and randomized pipelines to adapt to varying limited memory by reducing the number of parameters
conditions. This allows each device to share its results with requiring adjustment.
any other device that is responsible for the next stage of the • Scheduler optimization: Techniques like TurboTrans-
pipeline. This enables devices with high performance to pro- formers [11] improve response throughput and task ex-
cess inputs from multiple predecessors, distribute their results ecution on GPUs.
across multiple weaker peers, and rebalance the workload in • Size reduction optimization: This approach focuses on
case of failure to improve utilization. reducing model complexity through techniques like
NLP-Fast [60] (Section IV-B3) is a system designed to quantization (reducing storage bits) and pruning (remov-
enhance the performance of large-scale heterogeneous NLP ing non-essential parts).
VOLUME 11, 2023 21
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

A. MEMORY OPTIMIZATION
Hardware
In the process of training deep learning models, memory us-
Optimization
age is primarily attributed to various factors, including model
Memory Hardware-Aware
Optimization Optimization parameters, layer activations, gradients, and optimizer states,
such as momentum and variances in the Adam algorithm
Memory
Management
Mixed Precision Offloading
[15], [18]. The terms ‘‘model states’’ [18] or ‘‘model data’’
[15] encompass model parameters, gradients, and optimizer
FIGURE 6. Hardware optimization states collectively, while ‘‘residual states’’ [18] or ‘‘non-
model data’’ [15] refer to layer activations, temporary buffers,
and unusable fragmented memory collectively.
• Parallelism strategies: 1) Data parallelism: Distributes In this section, we will explain the common and recent ap-
training data across devices for faster training. 2) Model proaches that have been used for increasing training through-
parallelism: Splits the model across devices for parallel put and loading larger models into GPU memory while train-
computations (tensor, pipeline, sequence parallelism). ing deep learning models.
3) Combined parallelism: Combines data and model
parallelism for even faster training (PTD-P, ZeRO [18], 1) Memory Management
GPipe [3]). TurboTransformers [11] (section IV-B4), proposed a se-
• Memory optimization: ZeRO [18] optimizes memory quence length aware algorithm for memory allocation to
for trillions of parameters, Activation Partitioning deals efficiently balance memory allocation and deallocation, this
with activation memory efficiently, and ZeRO-Offload algorithm overcomes the problem of variability of input sen-
[20] and ZeRO-Infinity [1], which allow training on tence. LightSeq2 [54] introduces an innovative memory man-
single GPUs or limited resources by utilizing CPU and agement approach, specifically designed for the Transformer
NVMe memory. structure. This strategy efficiently reduces peak memory con-
• Heterogeneous optimization: SWARM Parallelism [10] sumption and minimizes the need for frequent allocation and
tackles unreliable devices with limited bandwidth by release calls. Notably, LightSeq2 stands out as the pioneer in
adapting workloads, NLP-Fast [60] optimizes execu- accelerating the entire training process of Transformers. In
tion on mixed platforms by pinpointing resource-heavy real-time applications where response time is crucial, model
operations, and Splitwise [65] distributes work across parallelism and pipeline parallelism can introduce signifi-
heterogeneous machines considering different goals like cant delays due to the extra communication overhead caused
throughput, cost, and power consumption. by splitting tensors or layers, even with technologies like
• Automatic parallelism: Alpa [90] automatically gener- NVLink and GPUDirect. EET [64] (section IV-B8) focuses
ates execution plans for distributed model parallel train- on minimizing memory usage for loading large models in
ing, applicable to diverse models. online services. The proposed solution involves dynamic
memory management, specifically targeting the reduction of
Overcoming these challenges and leveraging these tech- memory consumption for activation caches and operation
niques, model training can be made more efficient, scalable, result buffers, as weights and certain pre-allocated caches
and accessible, paving the way for even more powerful and are inherently difficult to compress. They introduce a dy-
versatile LLMs. namic CUDA memory management mechanism specifically
designed to reduce CUDA memory usage for the same model
VI. HARDWARE OPTIMIZATION size, unlike the manual memory allocation required by FT.
Hardware optimization is a systematic approach to improv- B. HARDWARE-AWARE OPTIMIZATION
ing the performance, efficiency, and functionality of com-
Hardware-aware optimization (HAO) is the process of opti-
puter hardware. By identifying and addressing bottlenecks
mizing the hardware utilization of deep learning models to
in hardware architecture [18], software, and the operating
achieve maximum performance on specific hardware plat-
system, hardware optimization can enhance overall speed,
forms [93]. In this section, we will explain offloading and
reduce power consumption, and improve the reliability of
mixed precision optimization.
hardware components (Fig. 6). Splitwise [65] (Section IV-B9)
is a technique to optimize hardware utilization by separating 1) Offloading
the prompt computation and token generation phases onto
FlexGen [17] (SectionIV-B2) presents an offloading frame-
different machines. This approach allows designing clusters
work for LLMs that optimizes I/O efficiency and throughput
optimized for cost, throughput, and power consumption. The
by considering computation schedule, tensor placement, and
model achieves up to 1.4× higher throughput at 20% lower
computation delegation. It utilizes a linear programming-
cost or 2.35× higher throughput with the same cost and
based search algorithm and unifies the placement of weights,
power.
activations, and the KV cache, enabling significantly larger
batch sizes compared to existing methods.
22 VOLUME 11, 2023
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

ZeRO-Offload [20] model facilitates the training of large rameters and gradients are in FP16 during forward and back-
model heterogeneous on GPU + CPU systems, enabling the ward propagation, maintaining FP32 copies is necessary for
handling of models up to 10× larger on a single GPU without accuracy during the update values calculation. Typically, a
sacrificing efficiency by using a unique optimal offload strat- system copies each piece of gradients, parameters to/from its
egy. Also, the design achieves a highly scalable multi-GPU FP32 counterpart in one training step, ensuring the accurate
configuration by integrating the offload strategy with ZeRO- update of FP32 parameters with the loaded FP32 gradient by
powered data parallelism, enabling ZeRO-Offload to achieve the trainer kernel.
nearly linear scalability, and smooth integration with model- FP8-LM [92] introduces a novel FP8 automatic mixed-
parallel training. This combination allows for the training of precision framework for training LLMs, optimizing mixed-
even larger models than using ZeRO-Offload or model par- precision and distributed parallel training through three levels
allelism independently. Moreover, the model enhances CPU of FP8 utilization. By gradually incorporating 8-bit gradients,
performance by introducing a high-performance Adam opti- optimizer states, and distributed learning, the framework sig-
mizer, achieving a 6× improvement over SOTA Adam im- nificantly enhances training efficiency. During the training
plementations. It also employs a one-step delayed parameter of a GPT-175B model on the H100 GPU platform, the FP8
update strategy to overlap GPU forward and backward passes framework reduced memory usage by 39% and increased
with CPU parameter updates. Additionally, the model’s size training speed by 75% compared to the BF16 framework,
has increased by a factor of 10 compared to widely used outperforming Nvidia’s Transformer Engine by 37%. This
frameworks such as PyTorch. To maintain computational effi- advancement leads to substantial cost reductions for training
ciency, the model minimizes data traffic to and from the GPU, large models and is adaptable to various tasks such as instruc-
increases GPU memory utilization, and allows offloading tion tuning and reinforcement learning with human feedback.
data and computation to the CPU. On a single NVIDIA V100
GPU, the model can achieve 40 TFlops/GPU for 10 billion C. HARDWARE OPTIMIZATION: CHALLENGES AND KEY
parameters, and it can scale up to 128 GPUs when avail- FINDINGS
able. The model also supports model parallelism, enabling Challenges of Hardware Optimization:
training models with more than 70 billion parameters on a • Memory limitation: Deep learning models can require
single DGX-2 box, resulting in a 4.5× increase in model size vast amounts of memory to store parameters, activations,
compared to employing model parallelism alone. and gradients. This limits the size and complexity of
Eliseev and Mazur [91] propose a model to efficiently run models that can be trained on a single device.
large sparse MoE language models on hardware with limited • Limited hardware utilization: Traditional training meth-
GPU memory. Using parameter offloading and leveraging ods may not fully utilize the capabilities of modern
the properties of MoE models enabled Mixtral-8x7B with hardware like GPUs.
mixed quantization to operate on desktop hardware and free- • Balancing speed and accuracy: Techniques like mixed
tier Google Colab instances. The study showed that some precision training aim to improve training speed by re-
experts are reused between adjacent tokens, and early layers ducing memory usage, but this can potentially compro-
can predict subsequent experts. This led to an MoE-specific mise model accuracy.
offloading strategy employing an LRU (Least Recently Used) Key Findings:
cache and advanced prediction of needed experts. The model • Memory management: Techniques like sequence length
significantly improves speed, achieving 2-3 tokens per second aware allocation and dynamic memory management can
on various consumer GPUs, and offers a practical solution for significantly reduce memory usage during training.
running large MoE models on limited hardware. • Hardware-aware optimization: Offloading computations
to CPUs or leveraging mixed precision training can im-
2) Mixed Precision prove hardware utilization and training speed.
Mixed precision training [94] proposes a method for train- • Model parallelism: Splitting models across multiple de-
ing deep neural networks using half-precision floating-point vices can handle larger models but can introduce com-
numbers, aiming to reduce memory requirements by almost munication overhead, impacting training speed.
half and accelerate arithmetic on modern GPUs without com- • Large model training: Frameworks like ZeRO-Offload
promising model accuracy or requiring adjustments to hyper- [20] enable training models significantly larger than
parameters. what a single GPU can handle.
Cramming [16] conducts all experiments and ablation stud- In the domain of hardware optimization, a continuous
ies using a consistent setup that employs automated mixed stream of novel methodologies is emerging, demonstrably
precision for both standard 16-bit and 32-bit floating-point expanding the frontiers of feasibility within the training
precision. paradigm.
LightSeq2 [54] (section IV-A4) optimizes the training pro-
cess by implementing batched updates on reduced-precision VII. SCALABILITY AND RELIABILITY OPTIMIZATION
parameters instead of numerous individual updates on full- Scalability optimization focuses on improving hardware sys-
precision parameters. In mixed precision training, where pa- tems’ capacity to flexibly handle varying workloads, en-
VOLUME 11, 2023 23
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

when multiple GPUs are available.

Scalability and C. SCALABILITY AND RELIABILITY OPTIMIZATION:


Reliability CHALLENGES AND KEY FINDINGS
Challenges of Scalability and Reliability:
• In the context of optimizing LLMs, a trade-off exists
between achieving high scalability and maintaining re-
Fault Tolerance Scalability
liability. Scalability, which involves handling increased
workloads, often necessitates the integration of more
FIGURE 7. Scalability and reliability optimization
complex components. However, this added complexity
can introduce new potential points of failure, thereby im-
pacting the system’s overall reliability. Balancing these
abling smooth scaling adjustments to meet evolving demands two objectives is crucial to ensure both effective per-
[1], [5], [18]–[20], [71], and reliability optimization aims formance and robustness in large-scale deep learning
to strengthen the dependability and stability of hardware systems.
infrastructure, reducing the likelihood of failures, errors, or
Key Findings:
disruptions [10], [62] (Fig. 7).
• Fault tolerance: This approach involves creating mech-
anisms to handle failures gracefully. Two notable tech-
A. FAULT TOLERANCE
niques are SWARM Parallelism [10] and PETALS [62].
SWARM Parallelism [10] (section V-D) allows high- SWARM Parallelism distributes the workload across
performance devices to handle inputs from several preceding multiple devices and compensates for failures by redis-
sources, share their outcomes with less powerful peers, and tributing tasks if a device fails. Similarly, PETALS, a
adjust the workload distribution in the event of a failure, distributed Transformer model, employs load balancing
enhancing resource utilization. The model ensures continuous and routing strategies to maintain smooth operation even
training and boosts overall efficiency by redistributing work- in the event of server failures.
load in case of device failure or premature termination. • Scalability techniques: Technique like ZeRO-Offload
PETALS [62] (section IV-B6) is a distributed Transformer [20] achieve high scalability for training large models.
model that can be easily scaled and fault-tolerant. It uses a This method combines data parallelism with an offload-
load-balancing algorithm to distribute servers evenly among ing strategy, minimizing data traffic and maximizing
Transformer blocks and a routing algorithm to find the fastest resource utilization.
path for inference. It also stores past inputs to each server
in case one fails, so that the client can quickly continue VIII. CASE STUDIES
with a replacement server. PETALS is a reliable and scalable The following case studies delve into the practical application
Transformer model that can be used for both inference and of advanced optimization strategies on LLMs. With the rapid
training. It uses a combination of load balancing, routing, and growth and increasing complexity of LLMs, efficient deploy-
fault tolerance to ensure that it can handle network disruptions ment and execution have become critical challenges. These
and server failures without impacting performance. case studies illustrate how cutting-edge techniques in model
compression, pruning, and inference optimization can signif-
B. SCALABILITY icantly enhance the performance and feasibility of deploying
ZeRO-Offload [20] is a highly scalable multi-GPU design these massive models on more accessible hardware. By exam-
achieved through an integrated offload strategy and ZeRO- ining specific implementations and outcomes, these examples
powered data parallelism. This combination leads to nearly provide valuable insights into overcoming the computational
linear scalability, allowing for the training of significantly and resource constraints associated with large-scale language
larger models than when using ZeRO-Offload or model par- models, thereby promoting their broader adoption and utility
allelism independently. The model further optimizes CPU in real-world applications.
execution with a high-performance Adam optimizer, resulting
in a 6 time higher than SOTA Adam implementation. Despite A. OPTIMIZING MODEL TRAINING WITH SPARSEGPT
a growth in model size by a factor of 10, the approach Background: LLMs like GPT-3 have billions of parameters,
minimizes data traffic to and from the GPU, maximizes which pose significant challenges in terms of storage, compu-
GPU memory utilization, and facilitates offloading data and tational requirements, and energy consumption. Pruning, or
computation to the CPU. ZeRO-Offload maintains a single removing less important parameters, can help mitigate these
copy of optimizer states in CPU memory, ensuring constant issues, but traditional pruning methods often require multiple
communication volume and CPU computation, regardless of iterations of fine-tuning, which is computationally expensive.
data parallelism. This design choice enables excellent scal- This approach (SparseGPT [72]) proposes a one-shot pruning
ability on up to 128 GPUs, and ZeRO-Offload can also be method that significantly reduces the number of parameters
combined with model parallelism for higher memory savings without the need for extensive retraining.
24 VOLUME 11, 2023
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

Context and Problem: In this case study, the focus is • Accuracy maintenance: The pruned models exhibited
on training a LLM with billions of parameters on limited negligible increases in perplexity and retained perfor-
hardware. The initial challenge was the high computational mance levels very similar to their dense counterparts.
and memory requirements that exceeded the capabilities of • Scalability: The study revealed that larger models are
available resources, making it difficult to efficiently train the easier to prune, with practically no accuracy decrease
model within a reasonable timeframe and budget. observed at 50% sparsity.
Optimization Strategy: The primary optimization strategies This case study demonstrates the efficacy of SparseGPT’s
involved in SparseGPT are: one-shot pruning approach for reducing the size of mas-
One-Shot Pruning: To achieve significant sparsity in the sive language models. By leveraging unstructured sparsity
LLM in a single pruning step, eliminating the need for iter- and parametrization strategies without gradient dependence,
ative pruning and retraining. One-Shot Pruning: SparseGPT SparseGPT achieves substantial reductions in model size and
implements its pruning strategy through a streamlined pro- resource requirements while maintaining high levels of per-
cess. First, a thorough model analysis is conducted to pinpoint formance. This approach enables more efficient and accessi-
parameters that can be removed without significant impact. ble deployment of large language models in various applica-
This analysis leverages pruning criteria that assess parameter tions, making them more practical for real-world use.
importance without requiring gradient calculations, saving
on computational resources. Finally, SparseGPT employs a
B. ENHANCING INFERENCE EFFICIENCY WITH QMOE
single step pruning approach, achieving substantial sparsity
(at least 50% for massive models) in a single step. This one- Background: LLMs with trillions of parameters are becom-
shot approach significantly reduces the time and complexity ing increasingly common. However, training and deploying
compared to iterative pruning methods. these models is challenging due to their immense compu-
Unstructured Sparsity: To reduce the number of parame- tational and memory demands. Existing compression tech-
ters while maintaining model accuracy through unstructured niques struggle to handle such large models effectively.
pruning, where individual weights are removed based on their QMoE [79] framework addresses this challenge by introduc-
importance. This approach focuses on eliminating individual ing novel compression methods to make these models more
weights within the model that are deemed less important. By practical for real-world use.
analyzing the model’s internal structure, SparseGPT achieves Strategy Selection: QMoE was chosen as the optimization
impressive sparsity levels of 50-60%, significantly reducing strategy. This approach allows for the compression of large
model size. This aggressive pruning strategy is remarkable models by quantizing their parameters to extremely low pre-
because it achieves this with minimal impact on the model’s cision, which drastically reduces the model size while main-
ability to perform language modeling tasks accurately. For taining its performance. This strategy is particularly useful for
instance, SparseGPT can remove over 100 billion weights handling the large parameter counts typical of MoE models.
from massive models like OPT-175B and BLOOM-176B Optimization Strategy: The core optimization strategies
without compromising their performance on language mod- involved in QMoE are:
eling tasks. Scalable Compression Algorithm: QMoE tackles the chal-
Parametrization without Gradient Dependence: To lever- lenge of massive model sizes with a scalable compres-
age the parametrization of massive GPT models to enable sion algorithm. This innovative technique achieves im-
pruning without relying on gradient information. This method pressive sub-1-bit compression for trillion-parameter MoE
allows the identification of sparse counterparts within a close models, without requiring retraining. In the case of the
range of the original dense model, ensuring these sparse mod- SwitchTransformer-c2048 model, this translates to a dramatic
els maintain similar performance. Interestingly, the strategy size reduction from 3.2 TB to a mere 160 GB (roughly 0.8 bits
highlights that larger models are even easier to prune using per parameter). Remarkably, this is achieved with minimal
this approach. They experience minimal accuracy drops even compromise on accuracy, as measured by performance on
at significant sparsity levels (e.g., 50%). This observation pretraining validation tasks and zero-shot data.
underscores the effectiveness of the parametrization tech- Customized Compression Format and GPU Kernels:
nique in enabling aggressive pruning while preserving model QMoE takes advantage of custom designed GPU kernels to
performance. unlock the potential of its compressed format. These special-
Outcomes: The application of SparseGPT led to remark- ized kernels enable swift, on-the-fly decoding of the model,
able results: ensuring efficient processing during use. This allows the com-
• Model size reduction: SparseGPT achieved 50-60% pressed model to run seamlessly on common hardware like 8
sparsity, significantly reducing the model size by remov- NVIDIA RTX 3090 or 4× NVIDIA A6000 GPUs. Even with
ing more than 100 billion weights in models like OPT- this readily available hardware, the runtime overhead stays
175B and BLOOM-176B. below 5% compared to an uncompressed model, which would
• Processing time: The pruning process was completed in require a staggering 20 times more GPUs.
less than 4.5 hours for the largest open-source models, Outcomes: The implementation of QMoE resulted in sig-
demonstrating high efficiency. nificant improvements:
VOLUME 11, 2023 25
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

• Compression ratio: The model size was reduced by improve GPU utilization and reduce memory usage. Simi-
approximately 95%, allowing the SwitchTransformer- larly, ByteTransformer [4] is designed to accelerate trans-
c2048 model to fit within the memory constraints of former models, particularly for variable-length inputs in NLP
standard hardware. This reduction from 3.2 TB to less tasks, thereby improving performance and reducing latency.
than 160 GB translates to a compression ratio of around Memory Management: Efficient memory allocation is cru-
0.8 bits per parameter. cial for training large models. CoLLiE [55] addresses mem-
• Inference speed: The QMoE framework enables the ory constraints in LLM training through a comprehensive
efficient execution of massive MoE models on com- strategy. It implements 3D parallelism to effectively distribute
modity hardware with a runtime overhead of less memory across training machines and GPUs. This approach
than 5%. This efficiency allows the trillion-parameter allows CoLLiE to train large language models even in envi-
SwitchTransformer-c2048 model to run on a single com- ronments with limited resources.
modity GPU server. Fine-Tuning and Performance: CoLLiE [55] also focuses
• Accuracy: Despite the substantial compression, the on enhancing specific capabilities of LLMs through PEFT
model maintains high performance on pretraining val- methods. These methods allow models to be fine-tuned for
idation tasks and zero-shot data, with only a minor de- particular tasks or user instructions without compromising
cline in accuracy. their overall performance. This targeted improvement is vital
This case study demonstrates the feasibility of deploying for developing models that can adapt to specific application
trillion-parameter models in real-world applications through needs while maintaining high general performance.
the use of advanced compression techniques. The QMoE
approach not only reduces resource requirements but also B. LLM TRAINING KEY FINDINGS
enhances the deployability of cutting-edge language mod- The advancements in these frameworks have led to several
els across various environments. By leveraging a scalable significant findings:
compression algorithm, a customized compression format, GPipe: Demonstrates the successful training of a large
and bespoke GPU kernels, QMoE achieves significant im- multilingual transformer model, achieving superior results
provements in model efficiency and performance. This makes compared to smaller, individually trained models [3].
large-scale models more accessible and practical for real- ByteTransformer: Outperforms existing frameworks in
world applications. It addresses key limitations of MoE archi- terms of performance for BERT-like transformers on various
tectures and promotes their wider adoption, paving the way benchmarks [4].
for further research and advancements in this field. Megatron-LM: Enabled the training of LLMs with billions
of parameters, achieving SOTA results on numerous NLP
IX. DISCUSSION tasks while providing high throughput [19].
This section examines optimization and acceleration tech- LightSeq2: Accelerates transformer model training by up
niques for LLMs. We will discuss the relevant libraries to 308%, showcasing substantial performance improvements
and frameworks that facilitate these advancements, alongside [54].
challenges and key findings of various optimization strate- CoLLiE: Introduces collaborative training methodologies
gies. that improved efficiency and effectiveness in training large
models like LLaMA-65B, exploring ways to enhance specific
A. LLM TRAINING CHALLENGES functionalities without impacting overall performance [55].
Training LLMs poses significant challenges due to their com-
plexity and resource requirements. Recent advancements in C. LLM INFERENCE CHALLENGES
frameworks like GPipe [3], ByteTransformer [4], Megatron- Efficient inference of LLMs is critical for their practical
LM [19], LightSeq2 [54], and CoLLiE [55] have made sig- application, as these models are computationally expensive
nificant strides in addressing these challenges: due to their size and complexity. In this section, we will
Distributed training: As LLMs become increasingly com- discuss and explore the challenges and key findings of various
plex, training them on a single device becomes impracti- frameworks and libraries designed to enhance the efficiency
cal. Megatron-LM [19] and CoLLiE [55] address this by of LLM inference.
employing distributed training algorithms that partition the Computational expense: The massive size and complex
model across multiple GPUs. This approach enables paral- architecture of LLMs make traditional inference methods
lel processing and significantly accelerates training times. inefficient, especially on resource-constrained devices.
By distributing the workload, these frameworks mitigate the Balancing speed, accuracy, and resource utilization:
memory bottlenecks that arise when trying to train massive Achieving an optimal balance between these factors are cru-
models on single devices. cial for real-world deployment of LLMs.
Efficiency and speed: Efficiency and speed are critical for
the practical deployment of LLMs. LightSeq2 [54] enhances D. LLM INFERENCE KEY FINDINGS
training speed through system-level optimizations such as Hardware specialization: Frameworks like Splitwise [65]
layer-specific kernels and mixed-precision training, which improve inference by separating compute-intensive and
26 VOLUME 11, 2023
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

memory-intensive phases onto different machines with spe- System-level optimizations (LightSeq2): Demonstrates
cialized hardware. This targeted approach optimizes resource how system-level optimizations within the training process
usage and enhances performance. can significantly improve training speed and efficiency, trans-
Resource optimization: FlexGen [17] employs techniques lating to faster deployment of LLMs [54].
such as I/O scheduling, compression, and distributed process-
ing to efficiently utilize resources across CPUs, GPUs, and G. HARDWARE OPTIMIZATION IN LLM
disk storage. This holistic resource management approach Optimizing hardware for LLM involves overcoming memory
significantly improves inference efficiency. limitations and improving utilization. Key findings include
Algorithmic optimizations: Libraries like EET [64] and efficient memory management, hardware-aware optimiza-
LightSeq [63] implement custom algorithms and advanced tion, and model parallelism. Future research should focus on
memory management techniques to accelerate inference on efficient offloading strategies and advanced mixed precision
GPUs. These optimizations reduce latency and improve training.
throughput, making LLM inference more practical for real-
time applications. H. SCALABILITY AND RELIABILITY OPTIMIZATION IN
Heterogeneous platforms NLP-Fast [60] leverages differ- HARDWARE SYSTEMS
ent hardware platforms, including CPUs, GPUs, and FPGAs, Achieving scalable and reliable hardware systems re-
by identifying performance-critical operations and applying quires balancing complexity with reliability. Techniques like
targeted optimizations. This flexibility allows for efficient SWARM parallelism and ZeRO-Offload [20] improve fault
inference across various hardware configurations. tolerance and scalability. Future research should develop ad-
Distributed Inference PETALS [62] facilitates collabora- vanced fault tolerance mechanisms and optimize for new
tive inference and fine-tuning of LLMs across a network, hardware.
enabling scalable and efficient resource utilization. This ap- These advancements collectively enhance the efficiency,
proach allows for distributed processing, which is essential scalability, and accessibility of LLM training, inference, de-
for handling large-scale inference tasks. ployment, and serving, paving the way for more powerful
language models.
E. LLM DEPLOYMENT AND SERVING CHALLENGES
X. CONCLUSION AND FUTURE DIRECTIONS
Deploying and serving LLMs in real-world applications
This SLR investigated optimization and acceleration tech-
presents several challenges. This section explores these chal-
niques for LLMs. We identified the challenges associated
lenges, key findings from recent advancements, and future
with training, inference, and system serving for LLM with
directions for making LLM deployment and serving more
billion or trillion parameters. We presented a structured tax-
efficient and accessible.
onomy of optimization techniques alongside a comprehensive
Memory limitation: LLMs often exceed the memory ca-
analysis of recent libraries and frameworks. Following the
pacity of a single GPU, complicating their deployment and
PRISMA statement, we meticulously analyzed 65 relevant
serving in practical applications.
studies published between 2017 and December 2023. Our
Scalability: Handling multiple user requests simultane- proposed taxonomy provides a roadmap for researchers to
ously requires efficient scaling solutions to manage the large navigate the diverse landscape of optimization strategies and
and complex models effectively. select the most suitable approaches for their specific tasks.
Variability of input: LLM performance can be inconsistent Additionally, the review of libraries and frameworks empow-
when dealing with input sequences of varying lengths, ne- ers researchers to efficiently train and deploy LLMs, accel-
cessitating dynamic memory allocation strategies to maintain erating progress in real-world applications. Furthermore, the
efficiency. inclusion of two in-depth case studies demonstrates practical
Ease of deployment: Integrating complex LLM serving approaches to optimizing model training and enhancing in-
systems into existing workflows can be challenging, par- ference efficiency, highlighting how resource limitations can
ticularly for researchers and practitioners without extensive be addressed while maintaining performance.
expertise in the field. While recent advancements in LLM frameworks and opti-
mization techniques are promising, further research is crucial
F. LLM DEPLOYMENT AND SERVING KEY FINDINGS to unlock their full potential. We identified several key areas
PagedAttention (vLLM) : This algorithm breaks down the KV for future exploration, focusing on enhanced efficiency, scal-
cache into manageable blocks, minimizing wasted memory ability, and flexibility for LLMs.
and enabling efficient sharing across requests. This is a sig-
nificant improvement for processing large LLMs [69]. A. OPTIMIZATION FOR RESOURCE-CONSTRAINED
Efficient GPU utilization (TurboTransformers): Utilizes ENVIRONMENTS
techniques like parallel GPU kernels and dynamic batch Hybrid processing: Develop hybrid processing techniques,
scheduling to optimize performance on GPUs, resulting in where computation is split between GPUs and CPUs to op-
faster inference for transformer-based models [11]. timize memory usage and computational load.
VOLUME 11, 2023 27
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

Efficient offloading mechanisms: Extend the capabilities of Custom kernel implementations: Continue to develop and
models like FlexGen [17] and DeepSpeed Inference [5] by re- refine custom kernel implementations for key operations like
fining offloading techniques. This includes better utilization Softmax and LayerNorm to achieve better performance, as
of CPU, GPU, and NVMe memory to handle larger models seen in TurboTransformers [11]. This could also involve
with fewer resources. hardware-specific optimizations for different GPU architec-
Resource-aware scheduling: Implement intelligent schedul- tures.
ing mechanisms that consider the specific resource con-
straints of the hardware, optimizing the allocation of GPU,
F. ADVANCED COMPRESSION AND QUANTIZATION
CPU, and memory resources for different types of tasks.
Sophisticated compression techniques: To reduce model size
B. MEMORY AND COMPUTATION OPTIMIZATION without significant accuracy loss instigate new methods for
Advanced memory management: Implement various tech- both lossless and lossy compression going beyond FlexGen’s
niques like dynamic catching, memory recycling, and effi- 4-bit quantization [17].
cient layer normalization (as presented in ByteTransformer Dynamic quantization: Develop dynamic quantization
[4] and LightSeq2 [54]) to overcome the memory overhead techniques that adjust the precision of weights and activations
problem. in real time based on the computational requirements and
Mixed-precision Training In order to significantly reduce available resources.
training time and resource consumption without sacrific-
ing accuracy, develop robust mixed-precision methods (like
Megatron-LM [19] and LightSeq2 [54]). XI. LIMITATIONS
Dynamic input handling: Focusing on variable-length in- m In this section, we will present the limitations of our SLR.
puts, like ByteTransformer [4], is seen as a promising area for Here, we acknowledge that while our review offers valuable
improvement in ML, especially for NLP tasks that often deal insights, it is essential to consider its scope and boundaries.
with data of varying lengths. By developing more advanced The limitations of our SLR can be stated as follows:
algorithms to handle these inputs and minimize unnecessary Timeframe: This SLR focused on studies published be-
computations, frameworks could achieve significant perfor- tween 2017 and December 2023. While this timeframe de-
mance gains in NLP. liberately captured a period of significant advancement in
LLM optimization techniques, it is acknowledged that rel-
C. PARALLELISM AND DISTRIBUTION evant research published before 2017 or after December
Adaptive parallelism: Develop more advanced techniques 2023 might have been excluded. This could potentially limit
that can dynamically adapt the parallelism strategy based on the comprehensiveness of the analysis, particularly regarding
the model size and hardware configuration. This includes foundational concepts or emerging advancements outside the
both data and model parallelism that can be adjusted on-the- chosen timeframe.
fly to optimize performance.
Search strategy: The chosen search queries might not have
Distributed training and inference: Improve frameworks
encompassed all possible relevant terminology used in LLM
like PETALS [62] and CoLLiE [55] to better leverage dis-
optimization research. This limitation could result in missing
tributed and heterogeneous hardware resources for efficient
out on studies that use different terminologies or keywords to
training and inference.
describe similar concepts and techniques.
D. SCALABLE AND MODULAR ARCHITECTURE Database coverage: If the search excluded specific
Composable frameworks: Design frameworks with modular databases that are highly relevant to LLM research, signif-
components, similar to NLP-Fast [60]. These components act icant studies might have been overlooked. Comprehensive
like building blocks for inference pipelines. Users can easily database coverage is crucial to ensure the inclusion of all
swap or optimize individual components independently, al- pertinent research.
lowing for greater flexibility and customization.
Flexible APIs: Create user-friendly APIs, like those in ACKNOWLEDGMENT
PETALS [62]. These APIs allow users to customize inference
The authors are grateful to the members of the Applied Ma-
and fine-tuning processes according to their specific needs
chine Learning Research Group of Óbuda University John
without having to make extensive changes to the underlying
von Neumann Faculty of Informatics for constructive com-
framework. This provides greater control and adaptability for
ments and suggestions. The authors would also like to ac-
different use cases.
knowledge the support of the Doctoral School of Applied
Informatics and Applied Mathematics of Óbuda University.
E. PERFORMANCE OPTIMIZATION TECHNIQUES
Adaptive algorithms: Develop algorithms that can adapt to
varying input sizes and sequences, optimizing both memory LIST OF ABBREVIATIONS
allocation and computational load dynamically.
28 VOLUME 11, 2023
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

AdaLomo Low-Memory Optimization with Adaptive


Learning Rate of deep bidirectional transformers for language understanding,’’ arXiv
BART Bidirectional and Auto-Regressive Transformers preprint arXiv:1810.04805, 2018.
BERT Bidirectional Encoder Representations from Transformers [9] L. Torbarina, T. Ferkovic, L. Roguski, V. Mihelcic, B. Sarlija, and Z. Kral-
BLOOM BigScience Large Open-science Open-access Multilingual jevic, ‘‘Challenges and opportunities of using transformer-based multi-
Language Model task learning in nlp through ml lifecycle: A survey,’’ arXiv preprint
CD Coordinate Descent arXiv:2308.08234, 2023.
EET Easy and Efficient Transformer [10] M. Ryabinin, T. Dettmers, M. Diskin, and A. Borzunov, ‘‘Swarm par-
FPGA Field Programmable Gate Arrays allelism: Training large models can be surprisingly communication-
FPTQ Fine-Grained Post-Training Quantization efficient,’’ 2023.
FT Faster Transformer [11] J. Fang, Y. Yu, C. Zhao, and J. Zhou, ‘‘Turbotransformers: an efficient
GLM General Language Model gpu serving system for transformer models,’’ in Proceedings of the 26th
GPT Generative Pre-trained Transformer ACM SIGPLAN Symposium on Principles and Practice of Parallel Pro-
GPU Graphical Processing Unit gramming, PPoPP ’21, (New York, NY, USA), p. 389–402, Association
HAO Hardware-Aware Optimization for Computing Machinery, 2021.
IR Information Retrieval [12] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shak-
KV Key Value eri, E. Taropa, P. Bailey, and Z. Chen, ‘‘Palm 2 technical report,’’ arXiv
LAMBADA LAnguage Modeling Broadened to Account preprint arXiv:2305.10403, 2023.
for Discourse Aspects [13] L. J. Laki and Z. G. Yang, ‘‘Sentiment analysis with neural models for
LLaMA Large Language Model Meta AI hungarian,’’ Acta Polytechnica Hungarica, vol. 20, no. 5, 2023.
LLM-QAT LLM-Quantization-Aware Training [14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix,
LM Language Model B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., ‘‘Llama: Open and
LOMO Low-Memory Optimization efficient foundation language models,’’ arXiv preprint arXiv:2302.13971,
LoRA Low-Rank Adaptation 2023.
MHA Multi-Head Attention [15] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You,
MoE Mixture-of-Experts ‘‘Colossal-ai: A unified deep learning system for large-scale parallel train-
MMLU Massive Multitask Language Understanding ing,’’ in Proceedings of the 52nd International Conference on Parallel
NLP Natural Language Processing Processing, ICPP ’23, (New York, NY, USA), p. 766–775, Association for
NN Neural Network Computing Machinery, 2023.
OPT Open Pre-trained Transformer [16] J. Geiping and T. Goldstein, ‘‘Cramming: Training a language model on a
PET Parameter Efficient Transformers single gpu in one day,’’ in International Conference on Machine Learning,
PetS Parameter-Efficient Transformers Serving p. 11117–11143, PMLR, 2023.
PEFT Parameter-Efficient Fine-Tuning [17] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, D. Y. Fu, Z. Xie,
PIE PET Inference Engine B. Chen, C. Barrett, J. E. Gonzalez, et al., ‘‘High-throughput generative
PLM Pre-trained Language Model inference of large language models with a single gpu,’’ arXiv preprint
PRISMA Preferred Reporting Items for Systematic Reviews arXiv:2303.06865, 2023.
and Meta-Analyses [18] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, ‘‘Zero: Memory optimiza-
PTM Pre-Trained Model tions toward training trillion parameter models,’’ in SC20: International
PTQ Post-Training Quantization Conference for High Performance Computing, Networking, Storage and
SLR Systematic Literature Review Analysis, pp. 1–16, IEEE, 2020.
SWARM Stochastically Wired Adaptively Rebalanced Model [19] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro,
VAE Variational Autoencoder ‘‘Megatron-lm: Training multi-billion parameter language models using
W4A8 4-bit weights and 8-bit activations model parallelism,’’ arXiv preprint arXiv:1909.08053, 2019.
[20] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang,
D. Li, and Y. He, ‘‘{ZeRO-Offload}: Democratizing {Billion-Scale}
REFERENCES model training,’’ in 2021 USENIX Annual Technical Conference (USENIX
ATC 21), pp. 551–564, 2021.
[1] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, ‘‘Zero-infinity:
breaking the gpu memory wall for extreme scale deep learning,’’ in Pro- [21] T. Chen, B. Xu, C. Zhang, and C. Guestrin, ‘‘Training deep nets with
ceedings of the International Conference for High Performance Comput- sublinear memory cost,’’ arXiv preprint arXiv:1604.06174, 2016.
ing, Networking, Storage and Analysis, SC ’21, (New York, NY, USA), [22] B. Yuan, Y. He, J. Davis, T. Zhang, T. Dao, B. Chen, P. S. Liang,
Association for Computing Machinery, 2021. C. Re, and C. Zhang, ‘‘Decentralized training of foundation models in
heterogeneous environments,’’ Advances in Neural Information Processing
[2] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, ‘‘Pre-trained models
Systems, vol. 35, pp. 25464–25477, 2022.
for natural language processing: A survey,’’ Science China Technological
[23] M. Ryabinin and A. Gusev, ‘‘Towards crowdsourced training of large neu-
Sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
ral networks using decentralized mixture-of-experts,’’ Advances in Neural
[3] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, Information Processing Systems, vol. 33, pp. 3659–3672, 2020.
J. Ngiam, Q. V. Le, Y. Wu, et al., ‘‘Gpipe: Efficient training of giant neural
[24] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang,
networks using pipeline parallelism,’’ Advances in neural information
J. Zhang, Z. Dong, et al., ‘‘A survey of large language models,’’ arXiv
processing systems, vol. 32, 2019.
preprint arXiv:2303.18223, 2023.
[4] Y. Zhai, C. Jiang, L. Wang, X. Jia, S. Zhang, Z. Chen, X. Liu, and Y. Zhu, [25] M. Shah Jahan, H. U. Khan, S. Akbar, M. Umar Farooq, S. Gul, and A. Am-
‘‘Bytetransformer: A high-performance transformer boosted for variable- jad, ‘‘Bidirectional language modeling: A systematic literature review,’’
length inputs,’’ in 2023 IEEE International Parallel and Distributed Pro- Scientific Programming, vol. 2021, pp. 1–15, 2021.
cessing Symposium (IPDPS), (Los Alamitos, CA, USA), pp. 344–355, [26] F. Yu, D. Wang, L. Shangguan, M. Zhang, X. Tang, C. Liu, and X. Chen,
IEEE Computer Society, may 2023. ‘‘A survey of large-scale deep learning serving system optimization: Chal-
[5] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, lenges and opportunities,’’ arXiv preprint arXiv:2111.14247, 2021.
O. Ruwase, S. Smith, M. Zhang, J. Rasley, and Y. He, ‘‘Deepspeed- [27] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang,
inference: enabling efficient inference of transformer models at unprece- J. Zhang, Z. Dong, et al., ‘‘A survey of large language models,’’ arXiv
dented scale,’’ in Proceedings of the International Conference on High preprint arXiv:2303.18223, 2023.
Performance Computing, Networking, Storage and Analysis, SC ’22, IEEE [28] G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi, Z. Yu,
Press, 2022. M. Zhu, Y. Zhang, C. Yang, Y. Cheng, and L. Zhao, ‘‘Beyond efficiency:
[6] Y. Gong, ‘‘Multilevel large language models for everyone,’’ arXiv preprint A systematic survey of resource-efficient large language models,’’ 2024.
arXiv:2307.13221, 2023. [29] H. Wang, Z. Qu, Q. Zhou, H. Zhang, B. Luo, W. Xu, S. Guo, and R. Li, ‘‘A
[7] B. Spector and C. Re, ‘‘Accelerating llm inference with staged speculative comprehensive survey on training acceleration for large machine learning
decoding,’’ arXiv preprint arXiv:2308.04623, 2023. models in iot,’’ IEEE Internet of Things Journal, vol. 9, no. 2, pp. 939–963,
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘Bert: Pre-training 2021.

VOLUME 11, 2023 29


Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

[30] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, [55] K. Lv, S. Zhang, T. Gu, S. Xing, J. Hong, K. Chen, X. Liu, Y. Yang, H. Guo,
E. Agirre, I. Heintz, and D. Roth, ‘‘Recent advances in natural language T. Liu, et al., ‘‘Collie: Collaborative training of large language models in an
processing via large pre-trained language models: A survey,’’ ACM Com- efficient way,’’ in Proceedings of the 2023 Conference on Empirical Meth-
puting Surveys, vol. 56, no. 2, pp. 1–40, 2023. ods in Natural Language Processing: System Demonstrations, pp. 527–
[31] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, ‘‘Pre-trained models 542, 2023.
for natural language processing: A survey,’’ Science China Technological [56] L. Li, Q. Li, B. Zhang, and X. Chu, ‘‘Norm tweaking: High-
Sciences, vol. 63, no. 10, pp. 1872–1897, 2020. performance low-bit quantization of large language models,’’ arXiv
[32] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, Z. Qu, S. Yan, Y. Zhu, preprint arXiv:2309.02784, 2023.
Q. Zhang, M. Chowdhury, et al., ‘‘Efficient large language models: A [57] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, ‘‘Al-
survey,’’ arXiv preprint arXiv:2312.03863, vol. 1, 2023. bert: A lite bert for self-supervised learning of language representations,’’
[33] V. Cole and M. Boutet, ‘‘Researchrabbit,’’ The Journal of the Canadian arXiv preprint arXiv:1909.11942, 2019.
Health Libraries Association, vol. 44, no. 2, p. 43, 2023. [58] K. Lv, Y. Yang, T. Liu, Q. Gao, Q. Guo, and X. Qiu, ‘‘Full parameter fine-
[34] M. Ouzzani, H. Hammady, Z. Fedorowicz, and A. Elmagarmid, tuning for large language models with limited resources,’’ arXiv preprint
‘‘Rayyan—a web and mobile app for systematic reviews,’’ Systematic arXiv:2306.09782, 2023.
reviews, vol. 5, pp. 1–10, 2016. [59] K. Lv, H. Yan, Q. Guo, H. Lv, and X. Qiu, ‘‘Adalomo: Low-memory op-
[35] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and timization with adaptive learning rate,’’ arXiv preprint arXiv:2310.10195,
W. Chen, ‘‘Lora: Low-rank adaptation of large language models,’’ arXiv 2023.
preprint arXiv:2106.09685, 2021. [60] J. Kim, S. Hur, E. Lee, S. Lee, and J. Kim, ‘‘Nlp-fast: A fast, scalable, and
[36] J. Gao and C.-Y. Lin, ‘‘Introduction to the special issue on statistical flexible system to accelerate large-scale heterogeneous nlp models,’’ arXiv
language modeling,’’ 2004. preprint arXiv:1712.06139, 2017.
[37] A. Pauls and D. Klein, ‘‘Faster and smaller n-gram language models,’’ in [61] Z. Zhou, X. Wei, J. Zhang, and G. Sun, ‘‘{PetS}: A unified framework
Proceedings of the 49th annual meeting of the Association for Computa- for {Parameter-Efficient} transformers serving,’’ in 2022 USENIX Annual
tional Linguistics: Human Language Technologies, pp. 258–267, 2011. Technical Conference (USENIX ATC 22), pp. 489–504, 2022.
[38] S. M. Thede and M. Harper, ‘‘A second-order hidden markov model for [62] A. Borzunov, D. Baranchuk, T. Dettmers, M. Ryabinin, Y. Belkada,
part-of-speech tagging,’’ in Proceedings of the 37th annual meeting of the A. Chumachenko, P. Samygin, and C. Raffel, ‘‘Petals: Collaborative infer-
Association for Computational Linguistics, pp. 175–182, 1999. ence and fine-tuning of large models,’’ arXiv preprint arXiv:2209.01188,
[39] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, 2022.
N. Akhtar, J. Wu, S. Mirjalili, et al., ‘‘Large language models: a com- [63] X. Wang, Y. Xiong, Y. Wei, M. Wang, and L. Li, ‘‘Lightseq: A
prehensive survey of its applications, challenges, limitations, and future high performance inference library for transformers,’’ arXiv preprint
prospects,’’ Authorea Preprints, 2023. arXiv:2010.13887, 2020.
[40] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. Al Na- [64] G. Li, Y. Xi, J. Ding, D. Wang, B. Liu, C. Fan, X. Mao, and Z. Zhao, ‘‘Easy
jada, ‘‘Survey of review spam detection using machine learning tech- and efficient transformer: Scalable inference solution for large nlp model,’’
niques,’’ Journal of Big Data, vol. 2, no. 1, pp. 1–24, 2015. arXiv preprint arXiv:2104.12470, 2021.
[41] A. López-Chau, D. Valle-Cruz, and R. Sandoval-Almazán, "Sentiment analysis of twitter data through machine learning techniques," Software Engineering in the Era of Cloud Computing, pp. 185–209, 2020.
[42] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[43] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, vol. 26, 2013.
[44] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[45] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," Advances in Neural Information Processing Systems, vol. 13, 2000.
[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[47] T. A. Chang and B. K. Bergen, "Language model behavior: A comprehensive survey," arXiv preprint arXiv:2303.11504, 2023.
[48] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al., "Emergent abilities of large language models," arXiv preprint arXiv:2206.07682, 2022.
[49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[50] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
[51] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," arXiv preprint arXiv:2211.09085, 2022.
[52] M. Shanahan, "Talking about large language models," Communications of the ACM, vol. 67, no. 2, pp. 68–79, 2024.
[53] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "Tensorflow-serving: Flexible, high-performance ml serving," arXiv preprint arXiv:1712.06139, 2017.
[54] X. Wang, Y. Wei, Y. Xiong, G. Huang, X. Qian, Y. Ding, M. Wang, and L. Li, "Lightseq2: Accelerated training for transformer-based models on gpus," in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14, IEEE, 2022.
[65] P. Patel, E. Choukse, C. Zhang, Í. Goiri, A. Shah, S. Maleki, and R. Bianchini, "Splitwise: Efficient generative llm inference using phase splitting," arXiv preprint arXiv:2311.18677, 2023.
[66] H. Zhang, A. Ning, R. Prabhakar, and D. Wentzlaff, "A hardware evaluation framework for large language model inference," arXiv preprint arXiv:2312.03134, 2023.
[67] Y. Song, Z. Mi, H. Xie, and H. Chen, "Powerinfer: Fast large language model serving with a consumer-grade gpu," arXiv preprint arXiv:2312.12456, 2023.
[68] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[69] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with pagedattention," in Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626, 2023.
[70] C.-C. Huang, G. Jin, and J. Li, "Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1341–1355, 2020.
[71] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al., "Efficient large-scale language model training on gpu clusters using megatron-lm," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15, 2021.
[72] E. Frantar and D. Alistarh, "Sparsegpt: Massive language models can be accurately pruned in one-shot," in International Conference on Machine Learning, pp. 10323–10337, PMLR, 2023.
[73] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 553–564, 2017.
[74] M. Xia, T. Gao, Z. Zeng, and D. Chen, "Sheared llama: Accelerating language model pre-training via structured pruning," arXiv preprint arXiv:2310.06694, 2023.
[75] H. Jin, X. Han, J. Yang, Z. Jiang, C.-Y. Chang, and X. Hu, "Growlength: Accelerating llms pretraining by progressively growing training length," arXiv preprint arXiv:2310.00576, 2023.
[76] S. J. Kwon, J. Kim, J. Bae, K. M. Yoo, J.-H. Kim, B. Park, B. Kim, J.-W. Ha, N. Sung, and D. Lee, "Alphatuning: Quantization-aware parameter-efficient adaptation of large-scale pre-trained language models," arXiv preprint arXiv:2210.03858, 2022.
[77] Z. Li, X. Liu, B. Zhu, Z. Dong, Q. Gu, and K. Keutzer, "Qft: Quantized full-parameter tuning of llms with affordable resources," arXiv preprint arXiv:2310.07147, 2023.
[78] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.
[79] E. Frantar and D. Alistarh, "Qmoe: Practical sub-1-bit compression of trillion-parameter models," arXiv preprint arXiv:2310.16795, 2023.
[80] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "Gptq: Accurate post-training quantization for generative pre-trained transformers," arXiv preprint arXiv:2210.17323, 2022.
[81] Q. Li, Y. Zhang, L. Li, P. Yao, B. Zhang, X. Chu, Y. Sun, L. Du, and Y. Xie, "Fptq: Fine-grained post-training quantization for large language models," arXiv preprint arXiv:2308.15987, 2023.
[82] T. Le Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, et al., "Bloom: A 176b-parameter open-access multilingual language model," 2023.
[83] Y. J. Kim, R. Henry, R. Fahim, and H. H. Awadalla, "Finequant: Unlocking efficiency with fine-grained weight-only quantization for llms," arXiv preprint arXiv:2308.09723, 2023.
[84] K. Behdin, A. Acharya, A. Gupta, Q. Song, S. Zhu, S. Keerthi, and R. Mazumder, "Quantease: Optimization-based quantization for language models," arXiv e-prints, 2023.
[85] X. Ma, G. Fang, and X. Wang, "Llm-pruner: On the structural pruning of large language models," Advances in Neural Information Processing Systems, vol. 36, pp. 21702–21720, 2023.
[86] B. Peng, C. Li, P. He, M. Galley, and J. Gao, "Instruction tuning with gpt-4," arXiv preprint arXiv:2304.03277, 2023.
[87] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al., "Glm-130b: An open bilingual pre-trained model," arXiv preprint arXiv:2210.02414, 2022.
[88] S. Hong, S. Moon, J. Kim, S. Lee, M. Kim, D. Lee, and J.-Y. Kim, "Dfx: A low-latency multi-fpga appliance for accelerating transformer-based text generation," in 2022 IEEE Hot Chips 34 Symposium (HCS), pp. 1–17, 2022.
[89] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, "Sequence parallelism: Long sequence training from system perspective," 2022.
[90] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, et al., "Alpa: Automating inter- and intra-operator parallelism for distributed deep learning," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 559–578, 2022.
[91] A. Eliseev and D. Mazur, "Fast inference of mixture-of-experts language models with offloading," arXiv preprint arXiv:2312.17238, 2023.
[92] H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu, et al., "Fp8-lm: Training fp8 large language models," arXiv preprint arXiv:2310.18313, 2023.
[93] Z. Dong, Y. Gao, Q. Huang, J. Wawrzynek, H. H. So, and K. Keutzer, "Hao: Hardware-aware neural architecture optimization for efficient inference," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Los Alamitos, CA, USA, pp. 50–59, IEEE Computer Society, May 2021.
[94] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," CoRR, vol. abs/1710.03740, 2017.

ZHYAR RZGAR K ROSTAM (Member, IEEE) received the B.Sc. and M.Sc. degrees in computer science from the University of Sulaimani, KRG-Iraq, in 2013 and 2019. He is currently pursuing a Ph.D. degree in Information Science and Technology at Óbuda University, Budapest, Hungary. His current research interests include large language models, scientific text classification, deep learning techniques, machine learning algorithms, and artificial intelligence.

SÁNDOR SZÉNÁSI (Member, IEEE) received his PhD in 2013 from the Doctoral School of Applied Informatics and Applied Mathematics of Óbuda University, Budapest, Hungary. Currently, he is a professor at the John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary. His research areas are (data) parallel algorithms, GPU programming, and medical image processing. He engages both in theoretical fundamentals and in algorithmic issues with respect to the realization of practical requirements and given constraints.

GÁBOR KERTÉSZ (Senior Member, IEEE) received his PhD in 2019 in Information Science and Technology; the main areas of his research were computer vision, parallel processing, and deep machine learning. His current research interests include distributed deep learning, metric learning, and applied machine intelligence. He is an associate professor and the vice-dean for research at the John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary, and also a part-time research fellow at the HUN-REN SZTAKI (Institute for Computer Science and Control). He is the leader of the Applied Machine Learning Research Group at the John von Neumann Faculty of Informatics. Dr. Kertész is the founding president of the High Performance Computing division of the John von Neumann Computer Society and the president of the IEEE Computational Intelligence Hungary Chapter.

TABLE 6. Comparative analysis between different strategies [A]. Each entry lists the technique's Performance, Cost, and Scalability.

FlexGen [17]
  Performance: 1) Batch size of 64 or 2048 tokens: throughput speedup of 40× with a latency of 5000 seconds. 2) Batch size of 256 or 8192 tokens: throughput speedup of 79× with a latency of 12000 seconds. 3) Batch size of 144 or 4608 tokens: throughput speedup of 100× with a latency of 4000 seconds (using 4-bit quantization compression).
  Cost: Enables high-throughput LLM inference on resource-constrained devices, minimizing the need for multiple high-end GPUs.
  Scalability: Efficiently runs the OPT-175B model on NVIDIA T4 (16 GB) GPUs, showcasing its ability to handle LMs on resource-constrained hardware.

SwapAdvisor [70]
  Performance: Memory allocation: achieves up to 4× reduction in serving latency and boosts training throughput by 20% to 1100%.
  Cost: Trains models up to 12× beyond the GPU memory limit.
  Scalability: Supports efficient training and inference of LMs on standard GPUs, significantly extending their capability.

NLP-Fast [60]
  Performance: Up to 2.92×, 1.59×, and 4.47× higher throughput on CPU, GPU, and FPGA respectively.
  Cost: Reduces the need for high-end, resource-intensive hardware.
  Scalability: Scales to different hardware platforms (CPU, GPU, FPGA).

ByteTransformer [4]
  Performance: Up to 87% better performance compared to other frameworks (PyTorch JIT).
  Cost: Reduces redundant computations and improves the efficiency of running inference on transformer models, potentially reducing the cost of deployment.
  Scalability: Scales to different sequence lengths and transformer architectures.

Sheared LLaMA [74]
  Performance: Superior performance compared to other open-source models of similar size.
  Cost: Reduces the size of the LLaMA2-7B model to 1.3B and 2.7B parameters; significantly reduced training cost (3% of original).
  Scalability: Potentially efficient for deployment on resource-limited devices.

GrowLength [75]
  Performance: Lower loss compared to fixed-length training (LLM128).
  Cost: Potentially lower training cost (faster training).
  Scalability: Dynamically increases the training sentence length from 128 to 1024 tokens, enhancing efficiency in handling diverse text data while maintaining lower loss.

PagedAttention [69]
  Performance: PagedAttention and vLLM achieve 2-4× higher throughput in LLM serving compared to existing systems, especially for large models, long sequences, and complex decoding algorithms.
  Cost: vLLM's efficiency improvements can potentially reduce deployment costs by requiring fewer servers for the same workload.
  Scalability: Handles large LLMs, and vLLM's memory management scales well with diverse LLM architectures.

LightSeq [63]
  Performance: Up to 14× and 1.4× speedups compared to TensorFlow and FasterTransformer respectively.
  Cost: Reduced operational costs during deployment due to lower computational demands (inference on resource-constrained devices).
  Scalability: Well-suited for various transformer architectures.

LightSeq2 [54]
  Performance: Achieves 1.4-3.5× faster training compared to previous systems across various models and benchmarks, and a 308% speedup on WMT14 English-German machine translation compared to PyTorch.
  Cost: Potentially lower cost (by enabling faster training).
  Scalability: Supports various transformer architectures, including BERT, GPT, and vision transformers.

GPipe [3]
  Performance: A 557-million-parameter AmoebaNet model achieves 84.4% top-1 accuracy on the ImageNet-2012 dataset.
  Cost: Potentially lower cost (reduced hardware requirements).
  Scalability: Handles large and complex models across multiple accelerators, achieving better quality than all bilingual models.

Megatron-LM [19]
  Performance: Achieves SOTA results on NLP tasks (perplexity of 10.8 on WikiText103, 66.5% accuracy on LAMBADA).
  Cost: Allows utilizing fewer training instances or smaller model sizes to achieve similar performance.
  Scalability: Scales to train models with billions of parameters using multiple GPUs (demonstrated with an 8.3B parameter model on 512 V100 GPUs).

AlphaTuning [76]
  Performance: Maintains competitive performance on various tasks with over 1000× fewer trainable parameters compared to full fine-tuning.
  Cost: Reduces deployment costs by enabling less powerful hardware for inference.
  Scalability: Works with a wide range of LLMs, and its efficiency increases with even larger models due to quantization.

QFT [77]
  Performance: Maintains similar performance across various benchmarks.
  Cost: Potentially reduces deployment costs due to lower memory requirements; allows utilizing less powerful hardware, potentially leading to lower acquisition and maintenance costs.
  Scalability: Handles large models with efficient memory management; demonstrates successful fine-tuning of a 7B parameter LLaMA model, suggesting scalability for working with large language models.

LOMO [58]
  Performance: Enables full-parameter fine-tuning (65 billion parameters) of LLMs on limited-resource GPUs.
  Cost: Potentially reduces deployment costs through lower hardware acquisition and maintenance expenses.
  Scalability: Especially suited for handling very large models.

AdaLomo [59]
  Performance: Achieves scores of 30.8, 39.7, 51.0, and 56.9 on the LLaMA benchmark for models with 7B, 13B, 30B, and 65B parameters (performance comparable to AdamW).
  Cost: Potentially lower training cost (due to reduced memory requirements).
  Scalability: Enables LLM training in resource-constrained environments by significantly reducing the memory footprint while achieving performance comparable to AdamW, especially for models with a large number of parameters.

LoRA [35]
  Performance: Maintains competitive performance on various tasks compared to full fine-tuning despite significantly fewer trainable parameters (reduction by 10,000× for GPT-3); a minimal low-rank-adapter sketch follows this table.
  Cost: Reduces deployment and training costs (3× reduction for GPT-3).
  Scalability: Applicable to various Transformer models (e.g., RoBERTa, DeBERTa, GPT-2, GPT-3); designed to be efficient even for extremely large LMs like GPT-3 (175B parameters).

TurboTransformers [11]
  Performance: Enhances latency and throughput for transformer models, achieving better speed than PyTorch and comparable performance to TensorFlow-XLA, TensorRT, and FasterTransformers.
  Cost: Reduces operational costs by optimizing memory usage through a sequence-length-aware memory allocation algorithm, ensuring efficient resource utilization.
  Scalability: Highly scalable, efficiently handling variable-length inputs with a sequence-length-aware batch scheduler to maximize throughput in diverse deployment scenarios.

PetS [61]
  Performance: Enhances serving throughput by 1.53× on desktop GPUs and 1.63× on server GPUs.
  Cost: Reduces cost by requiring less storage space owing to the smaller size of the served parameter-efficient models compared to traditional transformers.
  Scalability: Supports up to 26× more concurrent tasks compared to existing serving systems.

QMoE [79]
  Performance: Compresses a 1.6-trillion-parameter SwitchTransformer model to 160 GB (0.8 bits per parameter), resulting in a 20× reduction in size with minimal accuracy loss.
  Cost: With a 20× compression ratio, QMoE significantly reduces storage requirements.
  Scalability: Enables running trillion-parameter models on readily available hardware (e.g., a single server with 4× NVIDIA A6000 GPUs) due to its compressed size.

SWARM Parallelism [10]
  Performance: Trains large models on unreliable, heterogeneous devices with low network bandwidth.
  Cost: Enables training on preemptible cloud instances or pooled resources from various regions, potentially reducing training costs compared to dedicated high-performance computing clusters.
  Scalability: Designed for heterogeneous and unreliable devices, making it scalable to large deployments with varying computational power and network connectivity.
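
As noted in the LoRA [35] row above, low-rank adaptation freezes the pre-trained weights and trains only a small low-rank update, which is why the number of trainable parameters drops so sharply. The following is a minimal, self-contained sketch of that idea in PyTorch; it is an illustration under simplifying assumptions, not the reference implementation, and the class name LoRALinear plus the choices r = 8 and alpha = 16 are ours.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen base projection; in practice this weight would come from a pre-trained checkpoint.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: delta_W = B @ A with A (r x in) and B (out x r).
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # zero init: training starts at the base model
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (alpha / r) * B (A x)
        return self.base(x) + self.scaling * ((x @ self.lora_a.T) @ self.lora_b.T)

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable share: {trainable / total:.2%}")  # roughly 0.39% of the layer's parameters

For a 4096×4096 projection with rank 8, only about 0.4% of the layer's parameters receive gradients and optimizer state, which is the mechanism behind the cost and memory reductions reported in the table.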

TABLE 7. Comparative analysis between different strategies [B]. Each entry lists the technique's Performance, Cost, and Scalability.

GPTQ [80]
  Performance: Achieves high accuracy with post-training quantization to 3 or 4 bits per parameter (a simplified weight-only quantization sketch follows this table).
  Cost: Reduces the bit width of the weights (down to 3 or 4 bits), significantly reducing the model size.
  Scalability: Allows inference of large GPT models on a single GPU due to the compressed size.

FPTQ [81]
  Performance: Achieves SOTA performance on popular LLMs (BLOOM, LLaMA) with 4-bit weights and 8-bit activations (W4A8).
  Cost: Utilizes a 4-bit weight quantization strategy, reducing the model size compared to full-precision models.
  Scalability: Enables deployment of LLMs on resource-constrained devices by achieving high-performance W4A8 quantization (low memory footprint) without sacrificing accuracy.

Norm Tweaking [56]
  Performance: Achieves high accuracy for large language models (GLM-130B, OPT-66B) even at 2-bit quantization.
  Cost: Enables effective quantization down to even 2 bits, significantly reducing the model size compared to full precision.
  Scalability: Allows deploying LLMs on devices with limited memory or computational power.

FineQuant [83]
  Performance: Up to 3.65× higher throughput on LLM inference with minimal accuracy loss for large models (OPT-175B).
  Cost: Focuses on weight-only quantization and reduces model size efficiently, potentially enabling deployment on less powerful hardware.
  Scalability: Enables deployment of massive LLMs (like OPT-175B) in resource-constrained environments by achieving efficient weight-only quantization with high throughput and minimal accuracy loss.

PETALS [62]
  Performance: Achieves faster inference for large language models, with an optimal setup inferring a 176-billion-parameter model in 5.5 seconds.
  Cost: 8-bit compression reduces resource requirements.
  Scalability: Scales by distributing computations across a network, enabling it to handle even larger models or more inference requests simultaneously.

QuantEase [84]
  Performance: Up to 15% better accuracy (perplexity, zero-shot) than GPTQ; sub-3-bit quantization with minimal accuracy loss.
  Cost: Enables effective quantization down to 3-bit or even lower precision, significantly reducing the model size.
  Scalability: Quantizes large models (Falcon-180B) on one GPU in 3 hours.

LLM-Pruner [85]
  Performance: Up to 95% performance retention with 20% parameter reduction (LLaMA, Vicuna, ChatGLM).
  Cost: Potentially lower training cost due to less data needed for fine-tuning (3 hours, 50K samples).
  Scalability: Applicable to various LLM architectures (LLaMA, Vicuna, ChatGLM).

SparseGPT [72]
  Performance: Up to 50% sparsity (weight reduction) with minimal accuracy loss (perplexity); larger models prune more easily (with less accuracy drop).
  Cost: Potentially lower computational cost due to single-shot pruning (no retraining).
  Scalability: Processes very large models (OPT-175B, BLOOM-176B) efficiently.

Cramming [16]
  Performance: Achieves reasonable performance by training on a single GPU in one day (trade-off between model size and training time).
  Cost: Lower computational cost due to single-GPU training.
  Scalability: Not designed for large-scale training, but explores trade-offs for resource-constrained settings.

DFX [88]
  Performance: 5.58× speedup in text generation compared to four NVIDIA V100 GPUs.
  Cost: 8.21× more cost-effective than a GPU appliance with similar performance.
  Scalability: Designed for model parallelism across multiple FPGAs (scalability not explicitly quantified).

Narayanan et al. [71]
  Performance: High speed via pipeline, tensor, and data parallelism.
  Cost: Potentially cost-efficient due to parallel processing, but not explicitly quantified.
  Scalability: Megatron-LM enables training LLMs (like trillion-parameter models) on thousands of GPUs by combining data, pipeline, and tensor parallelism (PTD-P) for efficient scaling and high throughput.

ZeRO [18]
  Performance: 10× speedup; trains trillion-parameter models (8× larger than previous models).
  Cost: Potentially reduces memory requirements for training large models.
  Scalability: Scales to trillion-parameter models on large GPU clusters.

Colossal-AI [15]
  Performance: Up to 2.76× faster training with various parallelism methods.
  Cost: Potentially lower cost due to faster training and improved hardware utilization.
  Scalability: Modular design for customization; supports distributed training.

FlexFlow [73]
  Performance: Achieves 2-10× speedup for CNN workloads compared to existing architectures.
  Cost: Improves power efficiency by 2.5-10× compared to existing architectures.
  Scalability: Highly scalable with growing computing engine size.

Alpa [90]
  Performance: Matches Megatron-LM on GPT models and surpasses DeepSpeed on GShard MoE models (up to 9.7× speedup).
  Cost: Automates efficient model-parallel training for large deep learning models, potentially reducing development and infrastructure costs.
  Scalability: Designed for distributed deep learning.

ZeRO-Offload [20]
  Performance: Trains 10× larger models on single GPUs (40 TFlops/GPU for 10B parameters); supports models over 70B parameters with model parallelism.
  Cost: Potentially lower training cost due to efficient single-GPU or smaller-system training.
  Scalability: Enables large-model training on single GPUs and scales to larger systems using model parallelism.

ZeRO-Infinity [1]
  Performance: Trains models with trillions of parameters on GPU clusters; enables fine-tuning on a single DGX-2 node; achieves over 25 petaflops (exceeds peak performance by 40%).
  Cost: Potentially reduces memory requirements for training large models.
  Scalability: Highly scalable for training models with trillions of parameters.

Splitwise [65]
  Performance: Up to 1.4× higher throughput for LLM inference compared to existing methods.
  Cost: Potentially lower cost due to a 20% reduction in resource requirements for inference.
  Scalability: Scales well using homogeneous or heterogeneous machines for the prompt computation and token generation phases.

Easy and Efficient Transformer (EET) [64]
  Performance: Up to 27.43× speedup in transformer inference compared to Fairseq on A100 GPUs; significant speedups over LightSeq and FasterTransformer as well.
  Cost: Potentially lower cost due to reduced inference time (potentially leading to lower resource usage).
  Scalability: The library is designed to work with large model sizes and potentially scales well on different hardware configurations (2080Ti and A100 GPUs are mentioned).

Eliseev and Mazur [91]
  Performance: Achieves 2-3 tokens per second inference speed on consumer GPUs for large sparse MoE models (Mixtral-8x7B).
  Cost: Enables running large MoE models on limited hardware (consumer GPUs and free-tier Google Colab), potentially reducing costs.
  Scalability: Enables running large MoE language models (like Mixtral-8x7B) on resource-constrained hardware by leveraging MoE-specific optimizations.

FP8-LM [92]
  Performance: Achieves 75% faster training and a 39% memory reduction compared to BF16 training for a GPT-175B model on H100 GPUs (outperforms NVIDIA's Transformer Engine by 37%).
  Cost: Significantly reduces training costs for large models due to faster training and lower memory usage.
  Scalability: Reduces memory usage by 39% and speeds up LLM training (e.g., GPT-175B) by 75% through an FP8 mixed-precision framework, enabling training in resource-constrained environments.
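
Several entries above (GPTQ [80], FPTQ [81], FineQuant [83], QuantEase [84]) build on weight-only low-bit quantization. The sketch below shows only the simplest baseline of that family, symmetric round-to-nearest quantization with one scale per group of weights; the cited methods add error-compensating updates or finer-grained schemes on top of this. The function names and the group size of 128 are illustrative choices, not taken from any of the cited implementations.

import torch

def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Round-to-nearest weight quantization with per-group scales (assumes in_features % group_size == 0)."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    q_max = 2 ** (n_bits - 1) - 1                      # symmetric signed range, e.g. [-7, 7] for 4 bits
    scale = w_groups.abs().amax(dim=-1, keepdim=True) / q_max
    q = torch.clamp(torch.round(w_groups / scale), -q_max - 1, q_max)
    return q.to(torch.int8), scale                     # int8 storage stands in for packed 4-bit codes

def dequantize(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)                            # stands in for one pre-trained weight matrix
q, scale = quantize_weight_rtn(w)
err = (dequantize(q, scale, w.shape) - w).abs().mean()
print(f"mean absolute quantization error: {err:.4f}")

Storing 4-bit codes plus one scale per 128 weights cuts weight memory by roughly 4× relative to FP16, which is the kind of footprint reduction the Cost column above refers to; methods such as GPTQ then adjust the remaining weights to compensate for rounding error instead of rounding each weight independently.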

TABLE 8. Summary of reviewed papers excluding those already covered in Tables 3 and 4 or the main text. Each entry lists the study's Aims and Outcomes.

ZeRO-Infinity [1]
  Aims: Effectively breaks the GPU memory barrier, making large-scale model training accessible on constrained resources.
  Outcomes: Models scaled to trillions of parameters on GPU clusters, with fine-tuning possible on a single NVIDIA DGX-2 node. Consistently reaches over 25 petaflops, exceeding peak performance by 40%.

ZeRO [18]
  Aims: Efficiently train large models, overcoming limitations of existing methods.
  Outcomes: Achieved 15 petaflops during training with 100B-parameter models on 400 GPUs, showing super-linear speedup. This means an 8× larger model size and a 10× performance boost compared to prior benchmarks.

SwapAdvisor [70]
  Aims: An approach for deep learning memory management that enables training and serving of large models despite limited GPU memory.
  Outcomes: Trains models up to 12× beyond the usual GPU memory limit.

Sheared LLaMA [74]
  Aims: Dynamic batch loading, which efficiently adjusts the composition of sampled data within each training batch based on varying losses observed across different domains.
  Outcomes: The LLaMA2-7B model is reduced to 1.3B and 2.7B parameters, needing only 3% of the usual computing resources for training. Tests were conducted using a maximum of 16 NVIDIA A100 GPUs (80 GB).

GrowLength [75]
  Aims: Accelerates the pre-training process of LLMs by dynamically and progressively growing the training sentence length (a sequence-length schedule sketch follows this table).
  Outcomes: Three setups were investigated: LLM128 with 128-token sentences and 0.36B parameters; LLM1024 with longer sentences but the same total tokens; and GrowLength, which grows from 128 to 1024 tokens. GrowLength shows lower loss than LLM128, emphasizing its computational efficiency and practicality in resource-limited scenarios.

AlphaTuning [76]
  Aims: Combines quantization of PLMs with fine-tuning; only a subset of quantized parameters is fine-tuned for the target task.
  Outcomes: Applied to GPT-2 and OPT, achieved over 10× compression under 4-bit quantization and reduced trainable parameters by over 1,000-fold, maintaining competitive performance on various tasks.

Cramming [16]
  Aims: Investigates the trade-offs involved in scaling down language model training to a single GPU in one day.
  Outcomes: Explored two setups: one with a classical RTX 2080Ti GPU and another with modern RTX A4000 or RTX A6000 GPUs, each paired with 4 CPU cores and 32 GB RAM.

SWARM Parallelism [10]
  Aims: Train a large model with unreliable heterogeneous devices and low network bandwidth by using dynamically generated, randomized pipelines.
  Outcomes: Trained a large transformer language model with 1B shared parameters using a compression strategy on preemptible T4 GPUs with network bandwidth below 200 Mb/s.

GPTQ [80]
  Aims: A highly accurate and efficient post-training quantization method, framed as a new one-shot weight quantization.
  Outcomes: Precisely quantized models to 3 or 4 bits per parameter, taking a few hours on models with hundreds of billions of parameters. Experiments on OPT-175B and BLOOM-176B took around 4 GPU hours with minimal loss of accuracy compared to the uncompressed baseline.

Norm Tweaking [56]
  Aims: Presents a strategy to minimize computational and storage demands in large language models without compromising their performance.
  Outcomes: Achieves high accuracy; GLM-130B and OPT-66B maintain accuracy even at 2-bit quantization. Improvements in weight-only and joint quantization surpass existing post-training quantization (PTQ) methods.

SparseGPT [72]
  Aims: A post-training pruning method to prune massive GPT-family models efficiently and accurately.
  Outcomes: Ran on the open-source models OPT-175B and BLOOM-176B in under 4.5 hours. Achieved 50-60% unstructured sparsity with minimal impact on perplexity and removed over 100 billion weights with negligible accuracy loss.

ZeRO-Offload [20]
  Aims: Democratize large-scale model training, making it accessible to a wider audience.
  Outcomes: Trained large models heterogeneously on GPU + CPU systems, achieving 10× greater model size on a single GPU without sacrificing efficiency. Achieved 40 TFlops/GPU for 10 billion parameters on a single NVIDIA V100 GPU, scalable up to 128 GPUs. Supports model parallelism, enabling training of models with over 70 billion parameters on a single DGX-2 box, a 4.5× increase in model size.

Alpa [90]
  Aims: An automated system that generates execution plans for distributed model-parallel training.
  Outcomes: Achieves training performance comparable to Megatron-LM on GPT models and surpasses DeepSpeed on GShard MoE models with up to 9.7× speedup.

Efficient large-scale language model training on GPU clusters using Megatron-LM [71]
  Aims: Introduces PTD-P, a novel technique for training LLMs on GPU clusters, combining pipeline, tensor, and data parallelism for high computational performance and scalable training.
  Outcomes: Offers significant performance enhancements over ZeRO-3, delivering a 70% improvement for models with 175 and 530 billion parameters, mainly due to reduced cross-node communication overhead.

DFX [88]
  Aims: A low-latency multi-FPGA appliance for accelerating transformer-based text generation.
  Outcomes: Utilized four Xilinx Alveo U280 FPGAs to evaluate performance on the GPT-2 language model, achieving a 5.58× speedup and 3.99× energy-efficiency improvement over four NVIDIA V100 GPUs. Demonstrated to be 8.21× more cost-effective than a GPU appliance with similar performance.

Colossal-AI [15]
  Aims: A unified deep learning system that streamlines the training of complex models with billions of parameters on multi-GPU clusters.
  Outcomes: Colossal-AI is a user-friendly system that offers various parallel training techniques and integrates with advanced methods for enhanced performance. Notably, it achieved training speedups of up to 2.76× for large models compared to traditional methods.

LoRA [35]
  Aims: A method that improves LLM adaptation by reducing the number of trainable parameters during fine-tuning.
  Outcomes: Reduces trainable parameters by 10,000× and memory usage by 3× compared to traditional methods. It maintained or improved model performance while offering faster training and efficient task switching.

AdaLomo [59]
  Aims: Addresses the memory limitations of existing optimizers like AdamW by using memory-efficient techniques while retaining the benefits of adaptive learning rates.
  Outcomes: An optimizer that achieves performance comparable to AdamW on various tasks, allowing LLMs to be trained with significantly less memory and making large-scale LLM training more accessible.

Kwon et al. (PagedAttention) [69]
  Aims: A novel attention algorithm inspired by virtual memory, aiming to improve memory efficiency in LLM serving by reducing fragmentation and enabling efficient sharing.
  Outcomes: Achieves significant throughput improvements (2-4×) for LLM serving, particularly for large models, complex decoding algorithms, and long sequences.

FlexFlow [73]
  Aims: A novel dataflow architecture that leverages complementary parallelism effects to achieve improved resource utilization within CNN accelerators.
  Outcomes: A significant improvement in performance (2-10× speedup) and power efficiency (2.5-10×) compared to existing architectures on various CNN workloads.

QFT [77]
  Aims: Develops a memory-efficient framework (QFT) for fine-tuning LLMs.
  Outcomes: Achieves significant memory reduction during fine-tuning by utilizing quantization techniques, the Lion optimizer, and integer-based model states. This allows fine-tuning of large models on a single GPU with minimal performance loss compared to traditional methods.

QMoE [79]
  Aims: A framework that significantly reduces memory usage for large MoE models.
  Outcomes: Achieves significant memory reduction through a custom compression algorithm that shrinks models to less than 1 bit per parameter, enabling execution on affordable hardware such as single servers with multiple GPUs.

FPTQ [81]
  Aims: A post-training quantization technique for compressing LLMs.
  Outcomes: Achieves significant memory reduction and computational efficiency during inference with minimal accuracy loss.

FineQuant [83]
  Aims: A method to improve the efficiency of LLM inference.
  Outcomes: Achieves significant memory reduction and faster inference with minimal accuracy loss for LLMs.

QuantEase [84]
  Aims: A framework for efficiently deploying LLMs by making them smaller.
  Outcomes: Achieved SOTA performance in quantizing LLMs, improving perplexity and zero-shot accuracy by up to 15% compared to existing methods. It quantizes large models like Falcon-180B on a single GPU in 3 hours.

LLM-Pruner [85]
  Aims: A task-agnostic framework for compressing large LLMs with minimal reliance on the original training data.
  Outcomes: Compressed LLMs (LLaMA, Vicuna, ChatGLM) by 20% while maintaining 94.97% of their original performance.

Li et al. [89]
  Aims: A memory-efficient method called sequence parallelism to train Transformers with much longer sequences on GPUs.
  Outcomes: Achieved a substantially longer maximum sequence length and a 13.7× larger batch size compared to SOTA tensor parallelism on 64 GPUs. With sparse attention, it can handle sequences over 27× longer than existing methods.

FP8-LM [92]
  Aims: A framework called FP8-LM for training LLMs using mixed precision to improve efficiency.
  Outcomes: Reduced memory usage by 39% and increased training speed by 75% compared to the BF16 framework. It outperformed NVIDIA's Transformer Engine by 37% in training speed.

LOMO [58]
  Aims: A memory-efficient training method for LLMs on limited GPU resources.
  Outcomes: Enables fine-tuning of massive LLMs (65 billion parameters) on consumer-grade GPUs (RTX 3090) by significantly reducing memory usage compared to traditional methods.

Eliseev and Mazur [91]
  Aims: A method to run large, sparse MoE LMs in limited GPU memory.
  Outcomes: Uses parameter offloading and MoE properties to run on desktop hardware and free Google Colab instances. It leverages expert reuse and early-layer prediction to achieve an MoE-specific offloading strategy, significantly improving speed (2-3 tokens per second) on various consumer GPUs and making large MoE models practical on limited hardware.

Note: Bold text in the "Aims" column indicates the framework's primary area of specialization or the range of tasks it is designed to address.
This table summarizes the reviewed papers excluding those already covered in Tables 3 and 4 or the main text.
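
As referenced in the GrowLength [75] row, the underlying idea is a curriculum over sequence length: early pre-training steps use short, cheap sequences, and the length is progressively grown to the full context. Below is a minimal sketch of such a schedule, assuming a simple linear growth from 128 to 1024 tokens; get_batch, model, and optimizer are hypothetical placeholders for a real training setup, and the original paper may use a different growth rule.

def length_schedule(step: int, total_steps: int, min_len: int = 128, max_len: int = 1024) -> int:
    """Linearly grow the training sequence length from min_len to max_len over total_steps."""
    frac = step / max(total_steps - 1, 1)
    return min(max_len, int(min_len + frac * (max_len - min_len)))

# How the schedule would drive a training loop (commented lines mark the hypothetical pieces):
for step in range(0, 10000, 2000):
    seq_len = length_schedule(step, total_steps=10000)
    print(f"step {step:5d}: sequence length {seq_len}")
    # tokens = get_batch(seq_len)               # hypothetical loader returning (batch, seq_len) token ids
    # loss = model(tokens, labels=tokens).loss  # shorter sequences make attention cheaper early on
    # loss.backward(); optimizer.step(); optimizer.zero_grad()

Because self-attention cost grows quadratically with sequence length, spending the early part of training at 128 tokens rather than 1024 reduces per-step compute substantially, which is where the lower training cost noted in Table 6 comes from.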