

A Survey on Auto-Parallelism of Neural Networks Training

Peng Liang, Yu Tang, Xiaoda Zhang, Youhui Bai, Teng Su, Zhiquan Lai†, Linbo Qiao†, Dongsheng Li†

Abstract—Deep learning (DL) has gained great success in recent years, leading to state-of-the-art performance in research communities and industrial fields like computer vision and natural language processing. One of the reasons for this success is the huge number of parameters adopted in DL models. However, it is impractical to train a moderately large model with a large number of parameters on a typical single device. It is necessary to train DL models in clusters with novel parallel and distributed training algorithms, yet traditional training algorithms are unable to train large-scale neural networks in heterogeneous computing clusters. Nowadays, auto-parallelism is promising to handle this issue. Auto-parallelism makes large-scale DL model training efficient and practical on various computing clusters. In this survey, we perform a broad and thorough investigation of the challenges, basis, and strategy searching methods of auto-parallelism in DL training. First, we abstract the basic parallelism schemes with their communication cost and memory consumption in DL training. Further, we analyze and compare a series of current auto-parallelism works and investigate the strategies and searching methods that are commonly used in practice. At last, we discuss several trends in auto-parallelism that are promising for further research.

Index Terms—auto-parallelism, large-scale neural networks, training technique, parallel and distributed training

I. INTRODUCTION

Deep learning [1] has drawn a lot of attention for its superior performance in tasks like speech recognition [2], machine translation [3], object detection [4], recommendation [5], and so on. The success of deep learning is highly related to the availability of large, labeled datasets and the ability to train huge networks on them. So far, the largest models have trillions of parameters [6]–[8]. Furthermore, the volume of training corpora has reached the terabyte level. For instance, Wudao 2.0 [6] is trained with 4.9 TB of high-quality Chinese and English corpus from the WuDaoCorpora [9] and Pile [10] datasets.

While the number of model parameters increases exponentially, the storage capacity of a computing device has only increased from a few GBs to 80 GBs (e.g., NVIDIA A100, H100) in the last decade, which results in a memory-wall bottleneck. Thus, a moderate computing device can no longer hold an entire model. In order to train a large-scale model, we may need thousands of computing devices to work cooperatively, and deliberately managing these devices to effectively and efficiently train large-scale models gains more and more attention from the research community and industrial fields. Distributed training [11] jointly makes use of multiple devices to train the model and achieves reasonable training speedup. In 2012, [12] trained AlexNet with 2 GPUs in parallel, which was an initial and successful attempt at training a model with multiple devices. Then, Jeffrey Dean et al. proposed the first-generation distributed deep learning system DistBelief [13], introducing the concept of distributed computation to deep neural network training. Moreover, they systematically designed the policies of parallelism and the way of synchronization so that the training process can scale to large clusters. At present, the research on accelerating distributed training mainly focuses on the design of parallelism strategies and how to select them.

In this work, we draw conclusions on parallelism works available in publications and make a comprehensive analysis of these algorithms. Parallelism strategies can be divided into two categories: intra-operator parallelism [14] and inter-operator parallelism [15]. Intra-operator parallelism includes data parallelism (DP) and tensor parallelism (TP). TP is also known as intra-layer model parallelism and has several varieties, such as Row-TP, Column-TP, 2D-TP [16], and 3D-TP [17]. Inter-operator parallelism includes inter-layer model parallelism and pipeline parallelism (PP). All these strategies help accelerate the training of models, but the best performance cannot be achieved by using only a few of them. To gain better performance, researchers propose hybrid parallelism, which uses a combination of data, model, and pipeline parallelism to partition the model in a fine-grained way and increase throughput. Representative works include Megatron-LM [18] and 3D-parallelism from DeepSpeed [19].

However, manually applying intra-operator or inter-operator parallelism to a model is difficult, since manual partitioning requires the engineer to be an expert in communication and computation. They are required to be aware of the execution time and memory state in every sub-procedure of the training process. Moreover, strategies vary according to the device topology and the model structure. Once the devices or the model change, experts may need to redesign a modified strategy from scratch. Although an excellent parallelism strategy can perform better than using only data or model parallelism, repeated manual designing is empirical, often sub-optimal, and sometimes even impractical in real-world applications. Thus, it is essential to search for the optimal hybrid parallel strategy automatically to save time, energy, and money.

Peng Liang, Yu Tang, Zhiquan Lai, Linbo Qiao and Dongsheng Li are with the College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China, 410073. Xiaoda Zhang, Youhui Bai and Teng Su are with Huawei Technologies Co. Ltd. E-mail: dsli@nudt.edu.cn.
†: corresponding authors.
Manuscript received Mar. 31, 2022; revised xx xx, xxxx.

Auto-parallelism is a technique that automatically generates parallelism strategies for a given model on a specific cluster. It tries to obtain the optimal or a decent parallelism strategy by, for example, optimizing a cost model built for a given neural network model and a cluster. Then it maps the strategy to devices using a well-designed runtime system. Auto-parallelism is the ultimate goal of distributed training: it could liberate engineers from manually designing strategies and give industrial departments the ability to efficiently train large-scale models on various computing infrastructures. Representative works include OptCNN [14], FlexFlow [20], PipeDream [15], DAPPLE [21], Double-Recursive [22], etc. [23]–[36]. However, currently, all practicable works consider only a few combinations of parallelism schemes or otherwise have weak scalability due to their high arithmetical complexity [36]. What is more, researchers have proposed many emerging parallelism technologies, such as ZeRO [37], sequence parallelism [38], token-level parallelism [39], and so on [16], [17], [40], [41]. However, most of the existing auto-parallelism methods fail to involve all of them, while the optimal strategy needs to consider using them all. Therefore, auto-parallelism is still an insufficiently explored field for us to step into. On the one hand, the automatic parallelization search space can be further expanded, which may implicitly include better solutions. On the other hand, auto-parallelism should comprehensively consider heterogeneous computing devices, communication pace, and topology.

There are a few related surveys in the research field of distributed machine learning. Mayer and Jacobsen [11] published a detailed survey about scalable deep learning on distributed infrastructures. Verbraeken [42] discusses the techniques used for distributed machine learning. They give a comprehensive understanding of deep learning systems and related machine-learning algorithms but do not provide clear illustrations of how to select strategies. Unlike these two surveys, we explore more of the basis and details of auto-parallelism and how they are used to accelerate a model's training.

In this survey, we introduce the definition, challenges, basis, classification, and existing works of auto-parallelism. We first present a unified auto-parallelism definition in Sec. II that abstracts a wide range of traditional and current auto-parallelism methods. Then, in Sec. III, we conclude the challenges of auto-parallelism, which are the problems that researchers should be concerned about. Thirdly, we comprehensively analyze the basis of a broad class of auto-parallelism methods in Sec. IV. Specifically, we analyze the communication and memory consumption of data parallelism, model parallelism, and pipeline parallelism. We also summarize some extended parallelism methods in this section. Fourthly, we give a precise classification of strategy searching methods for auto-parallelism and discuss their advantages and disadvantages in Sec. V. We divide the strategy searching methods into two categories: machine-learning-based and classic-algorithm-based methods. Lastly, we discuss and suggest some research hotspots for the development of auto-parallelism in Sec. VI, including the drawbacks of existing auto-parallelism methods and possible solutions.

[Fig. 1. A part of a training computation graph: the forward propagation, backward propagation, and parameter update around a MatMul operator; preceding and succeeding details are ignored.]

[Fig. 2. Device topology graph of DGX-1 (8 GPUs and 2 CPUs; link bandwidths of 16 GB/s, 160 GB/s, and 20 GB/s).]

II. PROBLEM DEFINITION

Auto-parallelism, also known as auto-parallelization or automatic parallelization, refers to automatically converting sequential code into multi-threaded or vectorized code in order to make use of available computing devices. Nowadays, auto-parallelism is frequently used in the deep learning community for the training and inference of deep neural networks. In the DL field, auto-parallelism refers to automatically generating computation tasks for computing devices by means of splitting, merging, or re-formalizing the network's computation graph. Auto-parallelism usually generates parallelism strategies by automatically determining partitions of each tensor in the computation graph, inserting communication operations, and scheduling the whole computation process.

We abstract the computation of neural network training or inference as a directed acyclic graph (DAG), i.e., a computation graph. Suppose there is a computation graph G = (V, E), where each node vi ∈ V is an operator (e.g., matrix multiplication, Softmax, etc.) or a tensor (i.e., an n-dimensional array). A tensor could be an input or output of the model, an intermediate result, or a model state (parameter weight, gradient, or optimizer state). Each edge eij(vi, vj) ∈ E in the DAG indicates that there is a data transport between vi and vj.

For example, if vj is an operator, then vi is one of the inputs of this operator, which could be a global input, an output generated by another operator, or a model state like a parameter weight. Fig. 1 shows an extracted sub-computation graph, which shows the details of training a matrix multiplication operator, ignores its preceding and succeeding details, and uses an optimizer without optimizer states.

As for devices, the device topology can be modeled as an undirected graph D = (VD, ED), where each node di ∈ VD is a device (e.g., CPU, GPU, etc.), and each edge bij(di, dj) ∈ ED is labeled with a bandwidth and represents a hardware connection (e.g., PCI-e, NVLink, InfiniBand, etc.) between devices di and dj. An auto-parallelism algorithm A takes G and D as inputs, and then outputs a partition set P for all vi ∈ G, a sub-graph set Gd for all di ∈ D, and pipeline schedules. P records the partitions of all nodes in G.

For example, pi = (index, g) ∈ P is a partition of vi, where index is a 2D array recording the split indices of each axis, from which the sizes of all sub-blocks after partitioning can be inferred, and g is a 2D device-group array in which the i-th array holds the ids of the devices that hold the i-th sub-block. By applying P to G and inserting the corresponding communication and tensor redistribution operators, the auto-parallelism algorithm generates a sub-graph set G, where G1d0 ∈ G represents the sub-graph that would be deployed on d0 with pipeline stage number 1. Finally, the auto-parallelism algorithm inserts the corresponding control flows to arrange the execution of the pipeline (i.e., the schedule of executing sub-graphs). Figure 3 illustrates a toy example of a partition using p = ([[0, 4], [0, 8]], [[2, 6], [3, 7], [0, 4], [1, 5]]) on a 17 × 8 matrix. Taking the first sub-block for illustration, the index array indicates the range of this sub-block, which is from the 0-th row to the 4-th row and from the 0-th column to the 8-th column; g indicates that this sub-block is held by devices 2 and 6.

[Fig. 3. Illustration of partitioning a 17×8 matrix into 4 sub-blocks on 8 devices (index: [[0,4],[0,8]]; g: [[2,6],[3,7],[0,4],[1,5]]; sub-block shapes (4, 8) and (4, 9)).]

III. CHALLENGES OF AUTO-PARALLELISM

There are five main challenges in auto-parallelism.
• The first and most important one is the detailed analysis of different parallelism schemes, which is the foundation of auto-parallelism.
• The second challenge is considering the trade-offs between different parallelism schemes, on which most auto-parallelism methods work.
• The third one is the load-balance problem across heterogeneous devices. The goal is to organize the program well so that each device in a heterogeneous cluster has a similar computation time.
• The fourth one is the optimization of network communication on a specific device topology. A good communication arrangement often brings less communication time and thus increases the computation/communication ratio.
• The last is the trade-off between runtime and strategy performance when finding a strategy. Profiling every strategy is time-consuming, and thus many works try to use a cost-model-based method to reduce runtime.

In the following subsections, we present a detailed analysis of each challenge.

A. Detailed Analysis on Parallelism Schemes

Auto-parallelism needs to consider the computation, communication, and memory cost of different parallelism schemes. A good auto-parallelism strategy set S often has minor computation and communication costs while having an acceptable memory cost. Based on this analysis, auto-parallelism searching methods decide the appropriate parallelism strategy for each operator. We give our thorough analysis in Sec. IV.

B. Trade-offs between Different Parallelism Schemes

Different parallelism schemes may bring different computation, communication, and memory costs. In a homogeneous cluster, the computation resources on each device are usually the same. Tofu [43], HyPar [44] and D-Rec [22] utilize this property and consider communication cost only to produce DP and TP strategies for each vi ∈ V. Intuitively, we tend to select strategies with less communication cost and less replication to improve throughput and scalability. As analyzed in Section IV, most of the communication happens in the redundant part of the model, especially for vanilla DP and 1D-TP (Row-TP and Column-TP). It seems to be enough to choose the strategy with the least communication amount for each node, as it also has the least memory cost. However, the technique of check-pointing reduces the memory of intermediate results (e.g., Tin, Tout) to a sub-linear degree, which gives Row-TP and Column-TP a lower memory cost so that we can increase the batch size and model size. Some researchers therefore prefer to reduce memory cost by applying 1D-TP strategies and check-pointing instead of DP, although DP theoretically has a smaller communication cost in some cases. Applying check-pointing requires us to analyze both communication cost and memory cost in order to finally improve throughput.

Applying PP to the training can also remove a large amount of intra-operator communication cost because PP creates stages held by corresponding subsets of devices: devices only need to do intra-operator communication within their communication group. Nevertheless, it may bring a little performance degradation due to unavoidable bubbles in the pipelines.

To achieve the highest throughput, we need to choose appropriate parallelism strategies for each vi ∈ V, which is usually decided after deep consideration by humans or a long search by algorithms. Many auto-parallelism methods apply algorithms to search for trade-off strategies automatically. We will further discuss these strategy searching methods in Sec. V.
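As a rough illustration of the DP-versus-1D-TP trade-off discussed above, the sketch below compares, for a single fully-connected layer, the per-device communication volume of vanilla DP and Row-TP together with the weight and activation memory each scheme keeps resident. The formulas follow the intra-layer analysis summarized later in Tab. IV; the helper itself is only an illustrative back-of-the-envelope calculation written for this survey, not part of any surveyed system.

```python
def layer_costs(b, w_in, w_out, p, bytes_per_elem=2):
    """Rough per-device costs of one MatMul layer (Tout = Tin @ W) on p devices.

    Communication follows the intra-layer analysis in Tab. IV: vanilla DP
    all-reduces the weight gradient, Row-TP all-reduces Tout.  Memory counts
    only the weight replica/shard and the layer's output activation.
    """
    ring = 2 * (p - 1) / p  # ring-allreduce factor
    costs = {
        "vanilla DP": {
            "comm_elems": ring * w_in * w_out,
            "weight_elems": w_in * w_out,          # full replica per device
            "activation_elems": (b // p) * w_out,  # local micro-batch output
        },
        "Row-TP": {
            "comm_elems": ring * b * w_out,
            "weight_elems": (w_in // p) * w_out,   # weight shard only
            "activation_elems": b * w_out,         # full output per device
        },
    }
    for c in costs.values():
        c["comm_MB"] = c["comm_elems"] * bytes_per_elem / 2**20
    return costs

# Example: a GPT-like FC layer, batch 8, hidden 4096 -> 16384, on 8 devices.
for name, c in layer_costs(b=8, w_in=4096, w_out=16384, p=8).items():
    print(name, c)
```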

C. Load-Balance in Heterogeneous Topology

A heterogeneous topology here represents a device graph that contains different types of computing devices (e.g., CPUs and GPUs of multiple types). Using heterogeneous clusters to train a model is a more environmentally friendly choice: buying new devices does not mean the old ones cannot work anymore, because they can still participate in some parts of the work. Different types of devices usually have different computing performance. The goal is to arrange the computation properly to achieve load balance on each device. However, only a few works involve load-balance analysis in the strategy searching task. DeepSpeed [19] heuristically uses the CPU to execute the parameter updates, because the updates are less complicated compared to forward and backward propagation, and the computation of updates on the CPU can overlap the computation on the GPU in certain situations. Paddle-HeterPS [28] uses reinforcement learning to decide the computing device for every layer, but it only supports DP and PP. AccPar [23] introduces a method that solves the partition ratios for each kind of device and then partitions the model layer by layer, after which the computation time on each device is similar. However, it only supports DP and 1D-TP without check-pointing. Auto-parallelism on heterogeneous topologies is still explorable, and it would save much of the money spent on buying more devices.

D. Topology-aware Communication Optimization

Auto-parallelism algorithms need to consider topology-aware communication strategies to further reduce communication time and increase throughput. Due to the limited size of the motherboard of a node, a large number of computing devices are distributed across different nodes, which results in bandwidth differences between intra-node and inter-node communication. Since intra-node bandwidth is usually higher than inter-node bandwidth, making full use of intra-node bandwidth can optimize communication and thus reduce overall execution time. [45], [46] divide an all-reduce operation among all devices in a cluster into several all-reduce operations among subgroups of devices to achieve better performance. Inspired by this, P² [47] can generate DP and 1D-TP partition strategies and utilizes the system hierarchy to synthesize the best reduction strategies, which consist of sequences of common collective communication operations and are proven to be faster than a single all-reduce operation among all devices in many cases. These works on all-reduce optimize the intra-layer communication we mentioned above. Another line of communication optimization, on tensor redistribution, reduces the inter-layer communication cost. [48] reduces the communication amount of tensor redistribution by replacing the original All-to-All operations with sequences of portable collective communication operations, including All-Gather, Dynamic-Slice, All-Permute, and All-to-All.

E. Trade-off of Runtime and Strategy Performance in Finding Strategy

Strategy searching algorithms are time-consuming for two reasons. The first is that partitioning a DAG for optimal performance is an NP-hard problem [49]–[52]. The second is the need to evaluate every strategy that the algorithm finds.

To handle the NP-hard problem, researchers have tried to use machine-learning algorithms [1], [53] and classic algorithms like dynamic programming [54]. Some of the works [43], [55] additionally use heuristic assumptions that help shorten the searching runtime but may sacrifice the performance of the strategy. We will discuss more details of this in Sec. V.

To evaluate the performance of searched strategies, we can use a cost model that calculates the cost of a strategy, or profile the execution time of a strategy by deploying it to the model and running it. Some auto-parallelism works [22], [56], [57] use a symbolic cost model to analyze the performance of strategies. However, most accelerators (e.g., GPU, NPU, FPGA) do the computation in parallel, while a symbolic cost model only reflects the serial amount of computation. Meanwhile, different types of devices may have different performance and implementations for some specific tasks like convolution, which is hard for the system to be aware of and thus needs more manual annotation work to tune the cost model. Moreover, it is hard for symbolic cost models to be aware of the overlap between computation and communication. All of this makes it hard for a symbolic cost model to accurately reflect the actual performance of the found strategies, though it has a shorter runtime than profiling. On the other hand, profiling every schedule that the algorithm generates is too time-costly [15], [25], [58], although it can accurately tell us the difference between any two strategies. Using a profiling-based cost model [55], [59] seems to be a more reasonable decision, whose costs are the actual times obtained by running each operator in the computation graph. We can then find a near-optimal parallelism strategy by minimizing this cost.
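The profiling-based middle ground can be sketched as follows: measure each operator once per candidate partition, cache the measurement, and score a whole strategy by summing the cached per-operator times (communication terms would be added from a separate model). This is only a conceptual sketch under the assumption that per-operator times compose additively; the build_op callable and the strategy format are hypothetical, not the API of any surveyed system.

```python
import time

def profile_once(op_runner, repeats=10):
    """Run an operator under one partition a few times and return the mean time."""
    op_runner()  # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        op_runner()
    return (time.perf_counter() - start) / repeats

class ProfilingCostModel:
    """Score a strategy as the sum of measured per-operator times."""

    def __init__(self, build_op):
        # build_op(op_name, partition) -> zero-argument callable running the op.
        self.build_op = build_op
        self.cache = {}

    def op_cost(self, op_name, partition):
        key = (op_name, partition)
        if key not in self.cache:
            self.cache[key] = profile_once(self.build_op(op_name, partition))
        return self.cache[key]

    def strategy_cost(self, strategy):
        # strategy: iterable of (op_name, partition) pairs covering the graph.
        return sum(self.op_cost(op, part) for op, part in strategy)
```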

IV. THE ANALYSIS ON DIFFERENT PARALLELISM SCHEMES

A detailed analysis of the communication, computation, and memory cost of every parallelism scheme is the basis of auto-parallelism, since different partition strategies bring different amounts of cost. Auto-parallelism methods try every combination of the parallelism schemes they can handle and select the one with minimum cost as the final decision. This section discusses the partition and communication in every parallelism scheme. To simplify the illustration in this section, we assume that we are using homogeneous clusters: in homogeneous clusters, devices have the same computation capacity, which means that partitions can be executed evenly, and thus each device has the same computation cost when given the partitioned task. This assumption helps us focus on evaluating communication costs under different strategies. It should be noted that the communication amount is the most crucial factor to consider when generating strategies, and computation is another critical factor for balancing work across devices.

We divide parallelism schemes into two categories: intra-operator parallelism and inter-operator parallelism. Intra-operator parallelism shards the tensors (i.e., vi ∈ V) along their axes, while inter-operator parallelism divides a computation graph G into several sub-graphs by nodes. Tab. I shows the intra-operator partitions of a matrix multiplication Tout = Tin W, where Tout is an output of shape (b, wout), Tin is an input of shape (b, win), and W is a weight matrix of shape (win, wout). It shows that different partitions represent different intra-operator parallelism schemes, including DP, ZeRO-powered DP, row-wise TP (Row-TP), column-wise TP (Column-TP), and 2D, 2.5D, and 3D-TP. As Tab. II shows with some examples of strategies under a specific number of devices, the partitions of intra-operator parallelism can be modeled by pi ∈ P as described in Section II. Note that tensor parallelism is also known as intra-layer model parallelism since it partitions the model weight. Inter-operator parallelism includes inter-layer model parallelism and pipeline parallelism (PP). They both partition the model into several stages that consist of layers and do not change the partitions within any tensors, and we can use sub-graphs to describe this kind of partition.

We now move on to analyze the parallelism schemes mentioned above.

[Fig. 4. Typical centralized and decentralized architectures: (a) a Parameter Server architecture with 2 Servers and 4 Workers; (b) a Ring-Allreduce architecture with 8 Workers.]

A. Data Parallelism

Data parallelism (DP) is the earliest parallelism scheme and is still one of the most commonly used schemes in distributed training. In DP, training data samples are split into several parts along the batch-size axis, and each worker computes on its corresponding part. Each worker conducts an independent training process through stochastic gradient algorithms while using communication to synchronize the models.

1) Vanilla Data Parallelism: Vanilla data parallelism only partitions the data-related tensors. Training samples are divided into several parts, and the entire model is duplicated on each worker. PyTorch DDP [60] is an auto-parallelism method that only supports vanilla DP. It automatically inserts all-reduce operators for gradient synchronization after backward propagation. Its usability attracts many researchers to use it to train networks. Vanilla data parallelism is functional when training a small neural network. However, it is very limited in training large-scale models because of the memory redundancy of storing replicas of the model.

2) ZeRO-Powered Data Parallelism: To address the above redundancy problem, the DeepSpeed [19] team from Microsoft developed the Zero Redundancy Optimizer (ZeRO) [37], which can evenly partition the model states, including parameters, gradients, and optimizer states, across all the computing devices (workers). Under the ZeRO-powered data parallelism (ZeRO-DP) strategy, each computing device also trains on a different part of the input data and only maintains its own partition of the model states. Therefore, ZeRO-DP is essentially a kind of data parallelism. ZeRO-DP requires a worker to gather a subset of the model from other workers only when needed, eliminating the redundancy of model states in vanilla DP. DeepSpeed has implemented an auto-parallelism runtime system that helps determine the stage of ZeRO-DP, the batch size, and other ZeRO optimization configurations. Users can use DeepSpeed to launch ZeRO-DP training with only a few lines of change.

ZeRO-DP has three stages: stage 1 only partitions the optimizer states, stage 2 partitions the gradients and optimizer states, and stage 3 additionally partitions the parameters. When using ZeRO-DP stage 3 in a communication group with N devices, each worker only needs to maintain 1/N of the model states; this means that when training a model with 1 billion parameters with the Adam optimizer, each device only needs 16 GB/N of memory to store the model states. However, ZeRO-DP stage 3 needs an extra all-gather communication operation in backward propagation to collect parameters. ZeRO-Infinity [61] points out that, using ZeRO and tricks like memory offloading, we can train a dense model with over 30 trillion parameters on 512 NVIDIA V100 Tensor Core GPUs, which is 107x larger than the current biggest singleton model, Gopher [62], with 280 billion parameters. However, training such a model in practice requires an ultra-high communication bandwidth due to the highly increased communication volume.

3) Communication of DP: The communication of DP is the communication of the model parameters. There are two topology architectures for synchronizing parameters: the centralized architecture and the decentralized architecture. The centralized architecture refers to a system that has one or more master workers, which send and receive parameters or gradients to/from each slave worker, such as Parameter Server [63]. Currently, the most representative work is BytePS [64], which claims to use additional CPU parameter servers to increase the bandwidth between master workers and slave workers. In a parameter server, the main characteristic is that only Server devices connect to all other devices, called Workers. Fig. 4a shows a parameter server architecture with 2 Servers and 4 Workers. The Servers receive gradients from all the Workers after backward propagation, update the model weights (parameters) with the collected gradients, and send the updated parameters back to the Workers to finish an iteration step. This process results in a communication volume of two times the parameters. We can increase the bandwidth of the parameter server by adding more Servers. BytePS tries to use the same number of CPUs as GPUs as the Servers and thus alleviate the burden of each Server device.
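For reference, vanilla DP as described above takes only a few lines to enable in practice; the following is a minimal, schematic PyTorch DDP launch (single process group, NCCL backend), where the model, data loader, and hyper-parameters are placeholders rather than anything from the surveyed papers.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(local_rank: int, model: torch.nn.Module, loader, epochs: int = 1):
    # One process per GPU; a launcher such as torchrun sets the env vars read here.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # DDP replicates the model and registers hooks that all-reduce gradients
    # after backward propagation, i.e. vanilla DP.
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # gradient all-reduce happens inside the DDP hooks
            opt.step()       # every replica applies the same update
    dist.destroy_process_group()
```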
TABLE I
PARTITIONS OF DIFFERENT INTRA-OPERATOR PARALLELISM SCHEMES (p DEVICES)

Scheme Name | Input (Tin) | Weight (W) | Output (Tout) | Weight Gradient (δW) | Optimizer States (Os)
Vanilla DP | (b/p, win) | (win, wout) | (b/p, wout) | (win, wout) | (win, wout)
ZeRO-powered DP stage 1 [37] | (b/p, win) | (win, wout) | (b/p, wout) | (win, wout) | (win/p, wout) or (win, wout/p)
ZeRO-powered DP stage 2 [37] | (b/p, win) | (win, wout) | (b/p, wout) | (win/p, wout) or (win, wout/p) | (win/p, wout) or (win, wout/p)
ZeRO-powered DP stage 3 [37] | (b/p, win) | (win/p, wout) or (win, wout/p) | (b/p, wout) | (win/p, wout) or (win, wout/p) | (win/p, wout) or (win, wout/p)
Row-TP | (b, win/p) | (win/p, wout) | (b, wout) | (win/p, wout) | (win/p, wout)
Column-TP | (b, win) | (win, wout/p) | (b, wout/p) | (win, wout/p) | (win, wout/p)
2D-TP [16] | (b/√p, win/√p) | (win/√p, wout/√p) | (b/√p, wout/√p) | (win/√p, wout/√p) | (win/√p, wout/√p)
2.5D-TP [41] | (b/√(pd), win/√(p/d)) | (win/√(p/d), wout/√(p/d)) | (b/√(pd), wout/√(p/d)) | (win/√(p/d), wout/√(p/d)) | (win/√(p/d), wout/√(p/d))
3D-TP [17] | (b/p^(2/3), win/p^(1/3)) | (win/p^(1/3), wout/p^(2/3)) | (b/p^(2/3), wout/p^(1/3)) | (win/p^(1/3), wout/p^(2/3)) | (win/p^(1/3), wout/p^(2/3))
TABLE II
EXAMPLES OF THE 2D-ARRAY index OF INTRA-OP PARALLELISM SCHEMES

Scheme | Tin | W | Tout | δW | Os
Vanilla DP^1,4 | [[0, b/2], [−1]] | [[−1], [−1]] | [[0, b/2], [−1]] | [[−1], [−1]] | [[−1], [−1]]
ZeRO stage 1^1 | [[0, b/2], [−1]] | [[−1], [−1]] | [[0, b/2], [−1]] | [[−1], [−1]] | [[0, win/2], [−1]]
ZeRO stage 2^1 | [[0, b/2], [−1]] | [[−1], [−1]] | [[0, b/2], [−1]] | [[0, win/2], [−1]] | [[0, win/2], [−1]]
ZeRO stage 3^1 | [[0, b/2], [−1]] | [[0, win/2], [−1]] | [[0, b/2], [−1]] | [[0, win/2], [−1]] | [[0, win/2], [−1]]
Row-TP^1 | [[−1], [0, win/2]] | [[0, win/2], [−1]] | [[−1], [−1]] | [[0, win/2], [−1]] | [[0, win/2], [−1]]
Column-TP^1 | [[−1], [−1]] | [[−1], [0, wout/2]] | [[−1], [0, wout/2]] | [[−1], [0, wout/2]] | [[−1], [0, wout/2]]
2D-TP^2 | [[0, b/2], [0, win/2]] | [[0, win/2], [0, wout/2]] | [[0, b/2], [0, wout/2]] | [[0, win/2], [0, wout/2]] | [[0, win/2], [0, wout/2]]
3D-TP^3 | [[0, b/4, b/2, 3b/4], [0, win/2]] | [[0, win/2], [0, wout/4, wout/2, 3wout/4]] | [[0, b/4, b/2, 3b/4], [0, wout/2]] | [[0, win/2], [0, wout/4, wout/2, 3wout/4]] | [[0, win/2], [0, wout/4, wout/2, 3wout/4]]

1 Partitioned on 2 devices; the device group g is [0, 1] as an example.
2 Partitioned on 4 devices; the device group g is [0, 1, 2, 3] as an example.
3 Partitioned on 8 devices; the device group g is [0, 1, 2, 3, 4, 5, 6, 7] as an example.
4 −1 represents a non-partition along this axis.

For the decentralized architecture, the most representative works are PyTorch DDP [60], Horovod [65] and DeepSpeed [19], which form the DP group with a ring topology to perform collective operations [66] effectively. As Fig. 4b shows, there is no specific server worker in the decentralized architecture, as each worker serves as Server and Worker at the same time. Take Horovod as an example: a Worker in Horovod only communicates with its neighboring Workers using the ring-allreduce algorithm. The ring-based allreduce algorithm is proven to be bandwidth-optimal [67]. A decentralized architecture with a ring topology uses ring-allreduce to exchange parameters and gradients, and it has been widely used in large-scale model training. A ring-allreduce consists of two steps: a reduce-scatter and an all-gather collective operation. In each step, each worker sends only 1/N of the data to its neighbor, N − 1 times. The total volume of data sent by each worker is therefore (N − 1)K/N, where K represents the volume of the data.

The communication of ZeRO-DP is based on ring-allreduce. However, there is a subtle difference between vanilla ring-allreduce DP and ZeRO-DP. Vanilla DP reduce-scatters and all-gathers only the gradients, after which all workers update the parameters simultaneously, which results in redundant computation in the parameter updates of vanilla DP. For stage 1 and stage 2, ZeRO-DP uses a reduce-scatter to get the accumulated gradients for each worker, which then updates its own parameters. Finally, it uses an all-gather operation to synchronize the updated parameters on all workers. This subtle change in communication enables ZeRO-DP to update parameters with no redundant computation while maintaining correctness. Based on ZeRO-DP stage 2, ZeRO-DP stage 3 needs an extra all-gather communication before executing the backward propagation of the corresponding weight matrix. This all-gather operation results in 50% more communication volume.

Tab. III compares the minimum communication volume (bandwidth cost) of the vanilla allreduce algorithm in a parameter server and the ring-allreduce algorithm in a decentralized architecture in one iteration, where P represents the number of parameters, N the number of Worker devices, and n the number of Server devices. When the number of Worker devices increases, the communication volume of the ring-allreduce algorithm converges to 2P, indicating high scalability potential. For parameter servers, by adding Server devices, the sending volume of the Servers tends to converge to P, and hence all Server and Worker devices only need to communicate P data each; this is why BytePS claims that it can theoretically be up to 2x faster than ring-allreduce. However, this may incur extra costs in maintaining clusters and designing the device topology. In addition, the parameter server currently does not solve the redundancy problem that ZeRO solves. So researchers prefer to use ring-allreduce-based DP to train large-scale models.

The biggest flaw of DP is the redundancy of storing replicas. Though ZeRO [37], [61] has tried to optimize this redundancy in DP, it introduces extra communication cost into the system. An effective way to alleviate the redundancy and communication cost is to use model parallelism.

B. Model Parallelism

Model parallelism arose because device memory is insufficient to hold an entire replica of the model in vanilla DP. Since ZeRO-DP can address this flaw well, the goal of model parallelism now shifts to reducing the amount of data transferred between devices and the memory cost of temporary activation values. We divide MP into two categories: inter-layer and intra-layer model parallelism. Intra-layer model parallelism is also known as tensor parallelism (TP) since it partitions the weight tensor. To simplify our illustration, we denote inter-layer model parallelism as MP and intra-layer model parallelism as TP in the following sections.

1) Inter-Layer Wise Model Parallelism (MP) and Pipeline Parallelism: In inter-layer model parallelism, the model is partitioned into several stages that usually consist of contiguous layers. A stage is executed only when the computation of the previous stage finishes. Devices only need to transfer intermediate activation values and gradients between each other, so the amount of transferred data is smaller than in data parallelism when applied to a fully-connected (FC) network, which usually has a vast weight matrix but a small output tensor. Moreover, MP reduces the storage of temporary activation values on each device due to the partition of the model, and thus a device can train the model with a larger batch size. However, as shown in Fig. 7.a, each computing device (worker) holds only one stage in MP, resulting in data dependence among computing devices. Due to this data dependence, only one device is running at any time of training when using inter-layer model parallelism, which leads to a low utilization rate of the computing devices.

2) Intra-Layer Wise Model Parallelism: Intra-layer model parallelism starts from partitioning a matrix that performs multiplication. It splits the large weight matrix [18], [69] to execute it efficiently. A weight matrix has two axes: row and column. Therefore, we have Row-TP and Column-TP, where the weight matrix is divided along the row dimension and the column dimension, respectively. Intra-layer model parallelism has excellent performance in accelerating matrix multiplication but brings some communication overheads. We further explore these overheads in Section IV-B3.

Intra-layer model parallelism has some varieties for different device topologies. Optimus [16], SUMMA2.5 [41] and the 3D parallel transformer model [17] apply 2D, 2.5D and 3D tensor parallelism, respectively, to help further reduce activation memory and communication while increasing throughput. They split the weight matrix along both the row axis and the column axis and prove that this helps reduce each worker's activation memory and communication volume. Suppose we have p devices. 2D-TP partitions both the input tensor and the weight matrix into p sub-blocks along both the row and column axes (e.g., the weight matrix is partitioned into sub-matrices of shape [win/√p, wout/√p] among p devices). 2.5D-TP partitions the input tensor into sub-blocks of shape [b/√(pd), win/√(p/d)] and the weight matrix into sub-blocks of shape [win/√(p/d), wout/√(p/d)], where d is the depth of the processor group. 3D-TP partitions the input tensor into sub-blocks of shape [b/p^(2/3), win/p^(1/3)] and the weight matrix into sub-blocks of shape [win/p^(1/3), wout/p^(2/3)].
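The need for the intra-layer synchronization discussed in the following subsections can be seen in a few lines of NumPy: if W is split row-wise and Tin is split along its inner dimension, every device produces a full-shape but partial output, and the partial outputs must be summed (the all-reduce of Row-TP); if W is split column-wise, each device produces a disjoint slice of the output instead. This is only a toy illustration on one process, with the number of devices p fixed to 2.

```python
import numpy as np

rng = np.random.default_rng(0)
b, w_in, w_out, p = 4, 6, 8, 2
T_in = rng.standard_normal((b, w_in))
W = rng.standard_normal((w_in, w_out))

# Row-TP: W split along rows, T_in split along the inner dimension.
partials = [
    T_in[:, k * w_in // p:(k + 1) * w_in // p]
    @ W[k * w_in // p:(k + 1) * w_in // p, :]
    for k in range(p)
]
# Each partial has shape (b, w_out) but holds only a partial sum; summing them
# plays the role of the all-reduce in Row-TP.
assert np.allclose(sum(partials), T_in @ W)

# Column-TP: W split along columns; each device owns a disjoint output slice,
# so the forward pass needs no reduction (the synchronization moves to backward).
slices = [T_in @ W[:, k * w_out // p:(k + 1) * w_out // p] for k in range(p)]
assert np.allclose(np.concatenate(slices, axis=1), T_in @ W)
```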

TABLE III
COMPARISON OF THE MINIMUM COMMUNICATION VOLUME OF DATA PARALLELISM

Method | Volume on Servers | Volume on Workers | Total Volume
Parameter Server [63] | (N − 1)P | P | 2(N − 1)P
BytePS [64] | (N − 1)P/n | P | 2(N − 1)P
Ring-Allreduce [68] | None | 2(N − 1)P/N | 2(N − 1)P
ZeRO stage 1&2 [37] | None | 2(N − 1)P/N | 2(N − 1)P
ZeRO stage 3 [37] | None | 3(N − 1)P/N | 3(N − 1)P

[Fig. 5. Computation graphs of intra-operator parallelism for a MatMul training step, showing the inserted collective operations (All-Reduce, Broadcast, Reduce, Reduce-Scatter, All-Gather): (a) DP, (b) Row-TP, (c) Column-TP, (d) ZeRO-DP stage 3, (e) 2D-TP, (f) 3D-TP.]



2D-TP and 2.5D-TP use broadcast and reduce communication operations to help calculate the output results, while 3D-TP uses all-gather and reduce-scatter collective operations. It is worth emphasizing that the communication amounts of these three methods are smaller than 1D-TP while having no redundancy in saving intermediate activation values. Currently, 2D, 2.5D and 3D intra-layer model parallelism have been integrated into the deep learning framework Colossal-AI [70] from HPC-AI Tech.

3) Communication of MP and TP: Inter-layer model parallelism requires the communication of intermediate results between two layers. Suppose that MP splits the model after a layer; then we need to transfer Tout to the device that holds the next layer to continue the computation, resulting in a communication volume of bwout.

Now let us discuss intra-layer model parallelism. OWT [71] experimentally finds that using DP for the CNN part and TP for the FC part yields higher throughput than using only one of them when training AlexNet [12]. OWT inspired researchers to discover the reasons. It turns out that this is due to the communication difference between DP and TP: a CNN has a small weight matrix and a huge output tensor, which results in less communication when DP is applied, and an FC layer is the opposite. The communication of TP consists of intra-layer and inter-layer communication. Intra-layer communication happens when a tensor is partitioned and brings inconsistency during execution. These are often synchronous operations, such as all-reduce, broadcast, and reduce, or a combination of reduce-scatter and all-gather. Inter-layer communication is responsible for redistributing a tensor partitioned under strategy A into new partitions under strategy B. We can use an all-gather or all-to-all communication operator to handle the redistribution. AccPar [23] is a strategy searching method that makes good use of the analysis of intra-layer model parallelism communication to find the best heterogeneous parallelism strategies for clusters. In this survey, we extend their work in analyzing communication.

We first illustrate intra-layer communication. For Row-TP on p devices, we divide the weight matrix into p sub-matrices of shape (win/p, wout). In the meanwhile, we partition the input tensor into p parts of shape (b, win/p). Since all workers generate an output tensor of shape (b, wout) but with different partial values, we must synchronize the output tensor using an all-reduce operation. Similarly, for Column-TP on p devices, we need to synchronize the error (i.e., gradient) of Tin, whose shape is also (b, win), which is generated in backward propagation. Obviously, the data that needs to be synchronized here is also the redundant (or temporarily redundant) memory cost of each method. For multi-dimensional TP, both 2D-TP and 3D-TP need to transfer the intermediate activation values and model weights in every matrix multiplication. However, their communication volumes are smaller than 1D-TP in most cases. Tab. IV shows more details of the intra-layer communication in the training process, including the replicated tensors, the tensors that need communication, the communication operators used, and the maximum communication cost on each worker. We ignore 2.5D-TP here because 2.5D-TP is effectively a 2D-TP replicated d times.

Inter-layer communication is more complicated than intra-layer communication. Inter-layer communication is also known as tensor redistribution [48], [57] since it reorganizes the distribution of tensors. The combinations of DP (including ZeRO-DP), Row-TP, Column-TP, 2D-TP, and 3D-TP generate 25 kinds of tensor redistribution strategies. Inter-layer communication happens in forward and backward propagation, where we calculate the intermediate result Tout and the error of the input tensor Ein, respectively. Tab. V shows the communication volume of inter-layer communication. To simplify the illustration, we suppose that we have p devices and that layer l and layer l + 1 both run on them. Layer l generates Tout and forward-propagates it to layer l + 1, and layer l + 1 backward-propagates the error of Tout, Eout, to layer l to calculate Ein. We use all-gather or all-to-all for inter-layer TP communication and involve DP in the table since DP is also an intra-op partition strategy. All-gather happens from a DP layer to a Column-TP layer; from a Row-TP layer to a DP, Row-TP, 2D-TP or 3D-TP layer; from a Column-TP layer to a Column-TP layer; from a 2D-TP layer to a Column-TP layer; and from a 3D-TP layer to a Column-TP layer. No communication happens from a DP layer to a DP layer, from a Row-TP layer to a Column-TP layer, from a Column-TP layer to a Row-TP layer, from a 2D-TP layer to a 2D-TP layer, and from a 3D-TP layer to a 3D-TP layer. All-to-all happens in the remaining circumstances.

We find a possible optimization under several circumstances when considering intra-layer communication and inter-layer communication together. We can replace the all-reduce operation in the intra-layer communication of Row-TP with a reduce-scatter inter-layer communication operation when its next layer is a DP or Row-TP partition, and replace the one of Column-TP when its preceding layer is a DP or Column-TP partition. This replacement saves a communication cost of (p − 1)bwout/p. Fig. 6 shows an example of this optimization for a Row-TP layer followed by a Row-TP layer.

[Fig. 6. Optimization on Row-TP to Row-TP: (a) without optimization, the first MatMul's partial outputs are all-reduced and then sliced for the next Row-TP MatMul; (b) with optimization, the all-reduce is replaced by a reduce-scatter.]

MP and TP both offer ways to reduce communication pressure and activation memory. However, with its unavoidable data-dependence problem, the efficiency of MP is not high enough. Therefore, MP usually works with other parallelism strategies like DP or PP. Like DP, there is some redundancy and communication in TP, so TP is more of a trade-off choice. It is evident that using TP for a CNN, which has a small weight matrix but tremendous intermediate results, is not acceptable. TP is more suitable for a model with big weight matrices in order to gain performance improvement [71].
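A hedged torch.distributed sketch of the Fig. 6 optimization: when a Row-TP layer feeds another Row-TP layer, the partial outputs can be reduce-scattered directly into the next layer's column-sharded input instead of being all-reduced and then sliced. The tensor names and the assumption that the same p ranks hold both layers are ours; this is not code from any surveyed system.

```python
import torch
import torch.distributed as dist

def row_tp_forward_unoptimized(partial_out: torch.Tensor, rank: int, p: int):
    """All-reduce the partial output, then slice the column block this rank
    needs as the Row-TP input of the next layer (Fig. 6a)."""
    dist.all_reduce(partial_out)              # (b, w_out), full copy on every rank
    cols = partial_out.shape[1] // p
    return partial_out[:, rank * cols:(rank + 1) * cols].contiguous()

def row_tp_forward_optimized(partial_out: torch.Tensor, p: int):
    """Reduce-scatter the partial output so each rank directly receives its
    summed column block, skipping the redundant all-reduce (Fig. 6b)."""
    chunks = [c.contiguous() for c in partial_out.chunk(p, dim=1)]
    out = torch.empty_like(chunks[0])
    dist.reduce_scatter(out, chunks)          # each rank keeps one reduced block
    return out
```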

TABLE IV
INTRA-OP PARALLELISM SCHEMES: INTRA-LAYER COMMUNICATION COST OF MATMUL Tout = Tin W IN A TRAINING STEP

Method | Replicated Tensor | Communication Tensor | Communication Pattern | Communication Cost
Vanilla DP | W, δW, Os | δW | All-Reduce | 2(p − 1)win wout/p
ZeRO-DP 1 [37] | W, δW | W, δW | Reduce-Scatter, All-Gather | 2(p − 1)win wout/p
ZeRO-DP 2 [37] | W | W, δW | Reduce-Scatter, All-Gather | 2(p − 1)win wout/p
ZeRO-DP 3 [37] | None | W, δW | Reduce-Scatter, All-Gather | 3(p − 1)win wout/p
Row-TP | Tout, Eout | Tout | All-Reduce | 2(p − 1)bwout/p
Column-TP | Tin, Ein | Ein | All-Reduce | 2(p − 1)bwin/p
2D-TP [16] | None | W, δW, Tin, Ein | Broadcast, Reduce | 3 log p (bwin + win wout)/(2√p)
3D-TP [17] | None | W, δW, Tin, Ein, Tout, Eout | Reduce-Scatter, All-Gather | 3(p^(1/3) − 1)(bwin + win wout + bwout)/p

TABLE V
TENSOR REDISTRIBUTION: INTER-LAYER COMMUNICATION VOLUME (LAYER l → LAYER l + 1)

Layer l \ Layer l+1 | DP | Row-TP | Column-TP | 2D-TP | 3D-TP
DP | 0 | 2(p − 1)bwout/p² | (p − 1)bwout/p | 2(1 − p^(−1/2))bwout/p | 2(1 − p^(−1/3))bwout/p
Row-TP | (p − 1)bwout/p | (p − 1)bwout/p | 0 | (p − 1)bwout/p | (p − 1)bwout/p
Column-TP | 2(p − 1)bwout/p² | 0 | (p − 1)bwout/p | 2(1 − p^(−1/2))bwout/p | 2(1 − p^(−1/3))bwout/p
2D-TP | 2(1 − p^(−1/2))bwout/p | 2(1 − p^(−1/2))bwout/p | (p − 1)bwout/p | 0 | 2bwout/p
3D-TP | 2(1 − p^(−1/3))bwout/p | 2(1 − p^(−1/3))bwout/p | (p − 1)bwout/p | 2bwout/p | 0
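The entries of Tab. V can be turned into a small lookup helper; the sketch below encodes our reading of the reconstructed formulas (volumes in elements of Tout) so that a searching method could query the redistribution cost between two candidate layer strategies. It only illustrates how such a table is typically consumed and is not code from any surveyed work.

```python
def redistribution_volume(src: str, dst: str, p: int, b: int, w_out: int) -> float:
    """Inter-layer redistribution volume (elements of Tout), per Tab. V."""
    shard = (p - 1) * b * w_out / p           # all-gather of a 1/p-sharded tensor
    a2a   = 2 * (p - 1) * b * w_out / p**2    # all-to-all between 1D layouts
    to2d  = 2 * (1 - p**(-1 / 2)) * b * w_out / p
    to3d  = 2 * (1 - p**(-1 / 3)) * b * w_out / p
    swap  = 2 * b * w_out / p                 # between 2D-TP and 3D-TP
    table = {
        "DP":        {"DP": 0,     "Row-TP": a2a,   "Column-TP": shard, "2D-TP": to2d, "3D-TP": to3d},
        "Row-TP":    {"DP": shard, "Row-TP": shard, "Column-TP": 0,     "2D-TP": shard, "3D-TP": shard},
        "Column-TP": {"DP": a2a,   "Row-TP": 0,     "Column-TP": shard, "2D-TP": to2d, "3D-TP": to3d},
        "2D-TP":     {"DP": to2d,  "Row-TP": to2d,  "Column-TP": shard, "2D-TP": 0,    "3D-TP": swap},
        "3D-TP":     {"DP": to3d,  "Row-TP": to3d,  "Column-TP": shard, "2D-TP": swap, "3D-TP": 0},
    }
    return table[src][dst]

print(redistribution_volume("DP", "Column-TP", p=8, b=32, w_out=4096))
```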

C. Pipeline Parallelism

MP is relatively more complicated than DP to design and reproduce because MP often requires a good balance of model scaling capacity, flexibility, and efficiency among devices. Most pipeline parallelism (PP) methods can automatically partition models into balanced stages on top of MP [15], [25], [72]. The partition pattern of PP is the same as that of MP; this is why some emerging popular frameworks like MindSpore [57] and OneFlow [31], as well as some researchers, directly name inter-layer model parallelism pipeline parallelism. The only difference between PP and MP is that PP is a well-scheduled, pipelined MP, which can overlap the computation of different batches. PP was proposed to solve the low-utilization problem of MP. With the development of TP, researchers tend to replace MP/PP with TP in most cases. However, PP is still effective in training large-scale models because it introduces less communication than DP and TP, and it is beneficial for enlarging the batch size.

Typical works on pipeline parallelism include the asynchronous pipeline PipeDream [15] from Microsoft and the synchronous pipeline GPipe [72] from Google. Synchronicity here refers to whether the weight versions match between forward and backward propagation, which guarantees convergence. Based on these two works, researchers have proposed many varieties. Fig. 7 shows the scheduling details of the pipeline varieties mentioned below.

1) Asynchronous Pipeline Parallelism: The most representative work of asynchronous pipeline parallelism is PipeDream. PipeDream pursues higher throughput and utilization of devices. It automatically partitions layers into load-balanced stages that have similar computation times. Each device on a different stage processes a different micro-batch of data simultaneously, avoiding the data-dependence problem. PipeDream eliminates bubbles in the pipeline by storing multiple versions of the parameters. However, this brings extra memory costs and a staleness problem due to its asynchronous updates, resulting in a possible convergence problem. PipeDream-2BW [73] optimizes the memory usage of PipeDream and needs only two buffers to store the generated weights of different versions.

Although the experiments in PipeDream and PipeDream-2BW show that using asynchronous pipeline schedules does not hurt convergence, we still need to be cautious about the convergence problem that may occur due to delayed updates in such schedules [74]–[77]. Currently, the PipeDream-based pipeline schedule has been widely used in Megatron-LM [78] (i.e., PipeDream-1F1B) and in the Microsoft Fiddle project's recent works [55], [58].

[Fig. 7. Inter-layer wise MP and its pipelined variants, shown as schedules on 4 workers: (a) MP, (b) GPipe, (c) DAPPLE, 1F1B, (d) Chimera, (e) PipeDream-2BW. Legend: forward pass of micro-batch x, backward pass of micro-batch y, parameter updates, idle.]

2) Synchronous Pipeline Parallelism: GPipe was first designed for inter-layer MP, as shown in Fig. 7.b. GPipe first partitions the model into N stages and puts the n-th stage on the n-th accelerator. Based on this partition, GPipe divides the training batch B into M micro-batches. These M micro-batches are executed on the N accelerators in a pipelined form. Unlike asynchronous PP methods, in GPipe the gradients of each micro-batch are computed during back-propagation with the same model parameters that were used in the forward pass. Finally, the gradients from all M micro-batches are accumulated and used to update the model parameters of this mini-batch across all devices. The schedule of GPipe requires devices to store all the intermediate activation values before the execution of backward propagation. This storage can become a memory bottleneck in training big models. To address this problem, DAPPLE [21] modifies the order of backward propagation in GPipe to reduce the peak memory usage, and it is also faster than GPipe.

Absorbing ideas from GPipe, DAPPLE, and GEMS [79], Chimera [35] proposes a bidirectional pipeline that further reduces the bubble ratio of the pipeline. However, the memory cost of storing the weight matrix is twice that of GPipe and DAPPLE. Furthermore, this redundancy requires synchronization of the model weights, which increases the communication volume. However, if we treat Chimera as a PP method with a natural DP degree of 2 and compare it with DAPPLE with a DP degree of 2, which also keeps a replica of the model weights, we find that Chimera has a lower bubble ratio and is thus more effective.

There exist bubbles in synchronous PP methods due to the synchronization before updating parameters, which may lead to insufficient utilization of devices. However, the convergence of synchronous PP is guaranteed since it is mathematically equivalent to the vanilla training process.

3) Discussions and Comparisons of Pipeline Parallelism Methods: In this subsection, we discuss the communication of PP and compare the intermediate activation memory, parameter weight memory, and bubble ratio of different PP methods.

Communication volume. All PP methods above except Chimera have the same communication volume when given a specific batch size, micro-batch size, and stage number. The communication form of PP is the same as that of MP. As mentioned in Section IV-B, the communication amount of MP is Bwout for each stage (except for the last stage), where wout here is the column size of the last weight matrix of the stage and B is the batch size of the data. In PP, a batch is divided into M micro-batches. A micro-batch therefore results in Bwout/M communication volume in a stage, and the total volume of a batch in this stage sums up to Bwout. Suppose we now have S stages and the output dimension of stage i is Wi; the total communication volume of an iteration is then the sum of BWi over i = 0, ..., S − 2, which is the same as MP. In addition to this communication volume, Chimera needs to all-reduce gradients between workers, which results in a communication volume equal to the size of the model weights.
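The per-iteration communication volume above and the bubble ratios quoted in Tab. VI below are easy to evaluate directly; the helper here is a convenience sketch of those closed-form expressions written for this survey, not part of any surveyed system.

```python
def pp_costs(stage_out_dims, batch_size, num_microbatches):
    """Closed-form PP costs for S stages and N micro-batches.

    stage_out_dims: output column sizes W_0 ... W_{S-1}; only the first S-1
    stages send activations forward, hence the slice below.
    """
    S, N = len(stage_out_dims), num_microbatches
    comm_volume = sum(batch_size * w for w in stage_out_dims[:-1])  # sum of B*W_i, i = 0..S-2
    bubbles = {
        "PipeDream / PipeDream-2BW": 0.0,            # asynchronous, approximately no bubbles
        "GPipe / DAPPLE": (S - 1) / (N + S - 1),
        "Chimera": (S - 2) / (2 * N + S - 2),
    }
    return comm_volume, bubbles

volume, bubbles = pp_costs([4096, 4096, 4096, 1024], batch_size=32, num_microbatches=8)
print(volume, bubbles)
```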

to this communication volume, Chimera needs to all-reduces


gradients between workers, which results in a communication
volume equal to the size of model weights.
We have mentioned above that PP can reduce more com- GPU0 GPU0
munication volume than DP and TP since the communications
in DP and TP happen in every layer while communications in GPU0 GPU8
GPU2 GPU16 GPU2
GPU24
PP happen only between stages. Moreover, synchronous PP DP
methods have no redundancy tensor memory cost like DP and
GPU4 GPU12 GPU20 GPU28
TP do. TP
Comparison of PP methods. We use the table in [35]
to help illustrate the difference of PP methods, where Mw
PP
represents for the memory of parameter weights of the model,
S represents for the number of stages, N represents for the Fig. 8. The over view of 3D Parallelism, including Data Parallelism, Tensor
number of micro-batches. As Tab. VI shows, the asynchronous Parallelism and Pipeline Parallelism, each of which lies in a independent axis.
pipeline has the lowest bubble ratio that is approximate 0
but may cost redundancy in weights memory, and it may Details Ignored
harm the convergence. The synchronous pipeline has a higher
bubble ratio than the asynchronous pipeline but guarantees x Update

convergence. DAPPLE highly optimizes the memory peak in


GPipe, and this is very helpful in training large-scale models.
Chimera further reduces the bubble ratio and the lower bound x W δW Ex

of activation memory and includes implicit data parallelism in


the pipeline.
MatMul MatMul MatMul MatMul
D. Hybrid Parallelism
Hybrid parallelism uses a combination of data, model, or
y y Ey
pipeline parallelism to partition the model in a fine-grained
way. OWT [71] (one weird trick) is the most classic hybrid
Details Ignored
parallelism scheme. It is a heuristic convolution neural network
(CNN) training method that uses data parallelism for CNN Loss
and model parallelism for a fully connected network (FC).
Other representative work includes Megatron-LM [18] and 3D-
Fig. 9. Check-pointing Computation Graph
parallelism [19] from DeepSpeed.
In Megatron-LM, domain experts manually partition a large-scale model like GPT-2 using Row-TP and Column-TP and apply a 1F1B PipeDream pipeline to improve throughput. Although Megatron-LM has an excellent training speed, it is hard for non-experts to reproduce the code or manually apply its ideas to other models since it is designed specifically. Like GPipe, DeepSpeed automatically partitions models into pipeline stages effortlessly and then uses ZeRO-DP on each stage to enlarge the throughput. However, this kind of partition is inter-layer-wise, which does not consider the communication and the computation of each stage well, and it can only handle sequential models. Moreover, DeepSpeed proposes 3D parallelism, a new hybrid scheme for large-scale training. It involves DP, TP, and PP at the same time. The details of 3D parallelism can be found in Fig. 8. Assuming there are three axes, the x axis, y axis, and z axis, these three axes represent PP, TP, and DP, respectively.

E. Other Methods

1) Check-pointing: Also known as recomputation, check-pointing [80], [81] drops the activation values generated by forward propagation and recomputes them in backward propagation, which reduces the memory of activation values to a sub-linear degree. Using check-pointing enables us to enlarge the batchsize in training a large-scale model. Check-pointing is a practical way to train neural networks, which has been adopted by frameworks like MindSpore, OneFlow, PyTorch, PaddlePaddle, and TensorFlow.
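As a concrete illustration of the recomputation idea, the sketch below wraps a block of layers with PyTorch's torch.utils.checkpoint so that the block's intermediate activations are discarded in the forward pass and recomputed during the backward pass; the toy module and tensor sizes are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """A block whose inner activations are recomputed during the backward pass."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # checkpoint() does not keep the intermediate activations of self.block;
        # it replays the forward computation when gradients are required.
        return checkpoint(self.block, x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)   # assumed toy input
loss = CheckpointedBlock(1024)(x).sum()
loss.backward()                                # triggers recomputation of the block
```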
2) Experts Parallelism: Different from the above parallelism schemes, Expert Parallelism is specific to MoE-based models. There are several experts in an MoE layer, and each expert is just a Feed-Forward Network that does matrix multiplication, which can be allocated to a single computing device. In other words, Experts Parallelism could also be viewed as a variant of TP that partitions a weight matrix. Moreover, it could be utilized with DP and MP simultaneously.

3) Token-level parallelism: Token-level parallelism (TeraPipe) [39] is a variant of pipeline parallelism. Instead of feeding data to the pipeline in units of micro-batches, TeraPipe splits data along the token axis (i.e., the sequence-length axis) unevenly and then feeds them into the pipeline. It makes good use of the property of Transformers [82] that longer sequences require a longer time to compute. TeraPipe is orthogonal to MP and TP, which may be helpful in training large-scale language models.
TABLE VI
COMPARISON OF DIFFERENT PP METHODS

PP method          | Bubble Ratio    | Weights Memory | Activations Memory  | Synchronous or not | Convergence
PipeDream [15]     | ≈ 0             | [Mw, S·Mw]     | [Ma, S·Ma]          | No                 | Unstable
PipeDream-2BW [73] | ≈ 0             | 2Mw            | [Ma, S·Ma]          | No                 | Unstable
GPipe [72]         | (S−1)/(N+S−1)   | Mw             | N·Ma                | Yes                | Stable
DAPPLE [21]        | (S−1)/(N+S−1)   | Mw             | [Ma, S·Ma]          | Yes                | Stable
Chimera [35]       | (S−2)/(2N+S−2)  | 2Mw            | [(S/2+1)·Ma, S·Ma]  | Yes                | Stable
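To illustrate how the bubble-ratio formulas in Tab. VI behave, the following sketch simply plugs an assumed number of stages S and micro-batches N into the ratios listed above; the configuration is illustrative and not taken from any experiment.

```python
# Minimal sketch: evaluate the bubble ratios of Tab. VI for an assumed configuration.
def bubble_gpipe_dapple(S, N):
    return (S - 1) / (N + S - 1)

def bubble_chimera(S, N):
    return (S - 2) / (2 * N + S - 2)

S, N = 8, 32   # assumed pipeline depth and number of micro-batches
print(f"GPipe/DAPPLE bubble ratio: {bubble_gpipe_dapple(S, N):.3f}")
print(f"Chimera bubble ratio:      {bubble_chimera(S, N):.3f}")
# Asynchronous pipelines (PipeDream, PipeDream-2BW) keep every stage busy, so their
# bubble ratio is approximately 0, at the price of stale weights.
```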

4) Sequence parallelism: Sequence parallelism [38] also uses the properties of Transformers. It is a ring-pipeline in which each worker holds the same parameters and then computes different parts of the inputs. The inputs are chunked along the sequence-length axis. By transmitting the computing results of each device, it finally outputs the complete results. Sequence parallelism can support a larger batchsize in training than TP and, in the meanwhile, has better throughput.

V. STRATEGY SEARCHING METHODS FOR AUTO-PARALLELISM

As mentioned before, strategy searching is the key to auto-parallelism and, in the meanwhile, is an NP-hard problem. Researchers have proposed many methods [14], [15], [20]–[26], [30]–[32], [34]–[36], [43], [55]–[59], [73], [83]–[88] for auto-parallelism to find a near-optimal strategy. We divide existing strategy searching methods into two categories: classic-algorithm-based methods and machine-learning-based methods. Classic-algorithm-based methods include the recursive algorithm [89], the dynamic programming algorithm [54], the integer linear programming algorithm [90], as well as the breadth-first-search (BFS) algorithm [91]. We summarize the following analysis in Tab. VII. Machine-learning-based methods include methods like the Monte-Carlo Markov Chain (MCMC) [92] and Monte-Carlo Tree Search (MCTS) [93], which help search for strategies, and reinforcement learning [94], which helps predict strategies for each operator, and so on.

A. Machine-Learning-Based Methods

1) Reinforcement-Learning-Based Methods: We first start with reinforcement-learning-based methods. ColocRL [24] is the first work that uses reinforcement learning to do auto-parallelism. It uses an attentional sequence-to-sequence model, trained with the Adam optimizer based on policy gradients computed via the REINFORCE equation [95], to predict the placements of operators. However, it is a coarse-grained method that only does model parallelism, and it is too expensive for the recurrent neural network (RNN) [96] policy to learn when the number of operations is enormous. It took 27 hours over a cluster of 160 workers to find a placement that outperforms an existing heuristic. Moreover, the standard policy gradient method is known to be inefficient, as it performs one gradient update for each data sample [97]. The authors of ColocRL then propose a long short-term memory (LSTM) [98] reinforcement-learning-based hierarchical device placement strategy (HDP) [84], which can support neural networks that have tens of thousands of operations. Spotlight [86] models the problem as a multi-stage Markov decision process (MDP) [99] and applies proximal policy optimization on a two-layer sequence-to-sequence RNN with LSTM cells and a content-based attention mechanism. However, HDP and Spotlight both rely too much on LSTM controllers that are hard to train (i.e., Spotlight takes 9 hours on five worker machines to find a better placement than ColocRL). Moreover, LSTM performs poorly at capturing long-distance dependencies over large computation graphs. To alleviate this problem, Placeto [85] and GDP [83] use a Graph Neural Network (GNN) [100] to make embedding information for the nodes in the computation graph G. Placeto models the problem as an MDP, relies on hierarchical grouping, and only generates the placement for one operator at each time step. Instead, GDP pre-trains and fine-tunes a Transformer-based [82] attentive network to generate whole-graph operator placements at once and is 16.7x faster than HDP when finding strategies for an 8-layer Transformer model. In addition, GDP can support partitions for large hold-out graphs with over 50k nodes. REGAL (Reinforced Genetic Algorithm Learning) [88] uses a GNN policy to predict node-specific non-uniform proposal distribution choices, which are parameterized as beta distributions over [0, 1]. REGAL then uses a biased random-key genetic algorithm (BRKGA) [101] to run with those choices and outputs the best solution found within its iteration limit. REGAL can generalize to a broad set of previously unseen computation graphs, which saves lots of training time, and it can produce an MP strategy for a graph with 1k nodes in only a few seconds. However, REGAL only considers peak memory minimization, while GDP focuses on model throughput and scalability. HeterPS [28] applies a parameter-server architecture on CPUs and a ring-allreduce architecture on GPUs/XPUs to fully exploit heterogeneous computing devices. It uses a reinforcement-learning-based LSTM model to predict the device type for each layer of Click-Through Rate (CTR) models. Their experiments show that HeterPS is exponentially faster than brute-force search (i.e., 10 seconds to find the best strategy) as the number of device types grows, and the generated schedule plan on heterogeneous computing resources has higher throughput than on homogeneous computing resources.

The above reinforcement learning methods all focus on DP and MP. We will discuss methods related to TP or PP below. TAPP (Task allocation in pipeline parallelism) focuses on partitioning the model into several stages with reinforcement learning. It uses a feed-forward neural network (FFN), where the last layer is a Softmax operator that predicts the stage number for each layer. It then uses a reinforcement-learning attention-based sequence-to-sequence model to predict which device a stage should be placed on.

Auto-MAP [32] from Alibaba leverages Deep Q-Networks (DQN) [102] with task-specific pruning strategies to efficiently explore the search space of DP, TP, or PP over XLA High Level Operations (HLO), with the device and network interconnect topology specified. They choose the HLO Intermediate Representation (IR) [103] produced by Accelerated Linear Algebra (XLA) [104] from TensorFlow [105] as the operational level of Auto-MAP, because exploring distributed plans on HLO IR can achieve better performance, benefiting from its finer granularity than operators. Moreover, IR is a kind of expression of the computation graph, which fits our problem definition. Auto-MAP sets rewards, states, and actions for all three of DP, TP, and PP to instruct the DQN to search strategies. Given a cluster of 4 servers with 8 V100 GPUs, Auto-MAP can search TP strategies for the 11-billion-parameter T5 [106] model within 1.5 hours, DP strategies within 17 minutes, and PP strategies within 280 seconds. However, Auto-MAP can currently only give a single parallelism strategy automatically, which may result in sub-optimal runtime performance in large-scale distributed training. The authors are considering supporting a hybrid of these strategies in the future.

2) Other Methods: FlexFlow proposes four possible parallelizable dimensions based on OptCNN [14]: SOAP, which represents sample, operation, attribute, and parameter, respectively. Among SOAP, sample-dimension division corresponds to DP, operation-dimension division corresponds to MP, attribute-dimension division corresponds to each attribute dimension of the input tensor (such as height and width), and parameter-dimension division corresponds to TP. The partitioning of attributes and parameters corresponds to model parallelism. Unlike OptCNN, which only supports linear models like AlexNet [12], FlexFlow can parallelize all kinds of computation graphs. They use a random MCMC algorithm to find the optimal partition configuration and determine the appropriate parallelism strategy for each operator in a neural network. However, the MCMC tries to enumerate strategies in the search space randomly, which results in an unacceptable time for solving the optimal solution for large-scale models: it requires 37 minutes to search strategies for the NMT [107] model on 16 servers with 4 P100 GPUs. To support the partition of GNN models, the author of FlexFlow, Zhihao Jia, implements ROC [34] on top of FlexFlow. They design a cost model, which could predict the execution time of a GNN on an arbitrary graph, and then use an online linear regression model to learn the cost model. The learned cost model enables the graph partitioner to discover balanced and effective partitioning for GNN training and inference. In addition, ROC uses a dynamic programming algorithm to minimize communication costs between devices. Automap [33] from DeepMind is performed on MHLO, which is an MLIR [108] encoding of XLA HLO. It applies a Search and Learning method to annotate Megatron-like [18], [78] strategies for all operators. They implement MCTS to help search and propagate strategies when traversing the program. To reduce the search space and improve the quality of strategy propagation, Automap uses a learned interaction network [109] to compute per-node relevance scores, and the top-k will be considered first in the search space. Automap shows that using MCTS with a learned filter can find strategies similar to Megatron.

B. Classic-Algorithm Based Methods

OptCNN proposes the layer-wise parallelism strategy, which is an auto-parallelism solution under the parameter server architecture [63]. OptCNN can only handle TP partition. Taking the computer vision task as an example, OptCNN considers the input dimensions of every layer in the model, including batchsize, width, height, and the number of channels. All dimensions can be divided onto various devices. They first build a computation graph G of the model and a device graph D of the cluster, and then build a cost model to estimate the cost under any TP strategy. Using a dynamic programming graph search algorithm, OptCNN can determine combinations of partition dimensions for every layer. However, OptCNN can only solve computation graphs with a linear structure. To address this problem, Tofu [43] coarsens the computation graph G by grouping some nodes in V to make it linear, after which they use the dynamic programming algorithm in OptCNN to produce strategies. What is more, Tofu considers only communication cost to reduce the search space, under the observation that different strategies of an operator like matrix multiplication have the same arithmetic complexity. Tofu uses the dynamic programming algorithm in OptCNN and applies some techniques to make it more practical. In addition to the coarsening technique, Tofu accelerates the dynamic programming algorithm by applying it recursively. Compared to dynamic programming with coarsening, which takes 8 hours to get the best strategy set P for 8 workers, using recursion to search the strategy for WResNet-152 [110] only takes 8.3 seconds. Instead of coarsening, TensorOpt [26] extends the dynamic programming algorithm in OptCNN and names it FT-Elimination (Frontier Tracking Elimination) to make it executable on a computation graph with a non-linear structure. However, the runtime is not efficient enough, so TensorOpt also tries to group operators and applies an FT-LDP (Frontier Tracking Linear Dynamic Programming) algorithm to help reduce the time complexity. In addition, the FT-LDP algorithm can be parallelized by multi-threading to generate the computation of different parallelization strategies efficiently. For WResNet, FT-LDP with multi-threading can find the best strategy in 22 minutes, while FT-Elimination needs 5.5 hours. Though the search time is longer than Tofu's, the throughput of TensorOpt's generated strategy is much higher than Tofu's, since Tofu tries to search for strategies that use less memory. However, these memory-saving strategies may have smaller throughput.
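To make the dynamic programming idea used by OptCNN-style planners concrete, the sketch below searches per-layer strategies on a linear graph by keeping, for every candidate strategy of a layer, the cheapest total cost so far, adding the layer's compute cost and the re-partitioning (communication) cost between consecutive strategies. The cost tables are made-up placeholders, not measurements from any of the cited systems.

```python
# Minimal sketch: layer-wise strategy search by dynamic programming on a linear graph.
# comp[l][s] is the compute cost of layer l under strategy s; comm[p][s] is the cost of
# re-partitioning between strategy p and strategy s of consecutive layers.
def layerwise_dp(comp, comm):
    num_layers, num_strats = len(comp), len(comp[0])
    best = list(comp[0])              # best[s]: cheapest cost of a plan ending in strategy s
    choice = [[0] * num_strats]       # choice[l][s]: best predecessor strategy at layer l
    for l in range(1, num_layers):
        new_best, new_choice = [], []
        for s in range(num_strats):
            prev = min(range(num_strats), key=lambda p: best[p] + comm[p][s])
            new_best.append(best[prev] + comm[prev][s] + comp[l][s])
            new_choice.append(prev)
        best, choice = new_best, choice + [new_choice]
    s = min(range(num_strats), key=lambda t: best[t])
    plan = [s]                        # backtrack the per-layer strategies of the best plan
    for l in range(num_layers - 1, 0, -1):
        s = choice[l][s]
        plan.append(s)
    return list(reversed(plan)), min(best)

comp = [[4, 6], [8, 5], [7, 7]]       # assumed costs of 3 layers under 2 strategies
comm = [[0, 3], [3, 0]]               # assumed re-partitioning costs between strategies
print(layerwise_dp(comp, comm))       # prints the per-layer plan and its total cost
```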

PipeDream [15] provides auto-parallelism solutions that support asynchronous pipeline training. In order to accurately obtain the execution time of each layer, PipeDream first profiles the model that needs to be partitioned to obtain the execution time, activation size, and model parameter size of each layer. Then, according to the obtained results, they create a profiling-based cost model and design a dynamic programming algorithm that divides pipeline stages and determines the DP degree of each stage to meet the load-balance need. Based on PipeDream, PipeDream-2BW [73] optimizes the memory consumption by applying activation recomputation and reducing the number of parameter weight buffers that store different versions of computed gradients to 2. To accelerate the partition, PipeDream-2BW exploits the repetitive structure of models (e.g., transformer layers in BERT) by grouping them and only considering configurations where all model stages replicate an equal number of times. However, PipeDream only supports linear graphs. To address this problem, researchers from project Fiddle propose dnn-partitioning [59], which extends the dynamic programming algorithm in PipeDream to support partition for arbitrary DAGs, and also proposes an integer programming solution for the partition problem. However, like PipeDream and PipeDream-2BW, these methods do not consider tensor parallelism.

Also from project Fiddle, Piper [55] uses a two-level dynamic programming approach to search DP, TP, and PP strategies. The outer dynamic programming algorithm generates hundreds of NP-hard knapsack sub-problems, each of which calculates the throughput of a sub-graph under given hyper-parameters. Piper uses a bang-per-buck heuristic to accelerate the solving procedure of the generated knapsack sub-problems, reducing the computation complexity. The computation complexity of the Piper algorithm is $O(|V|^2 N |V_D|^2)$, where $|V|$ is the number of vertices in the computing graph, $|V_D|$ is the number of devices, and $N \le |V_D|$ is the maximum sum of DP degrees. Piper can partition a 64-layer BERT [111] on 2048 devices within only 2 hours, which is a relatively short time compared to its training time. However, the current implementation of the algorithm is serial and inefficient. A potential advantage of Piper is that some procedures in Piper can be executed in parallel, which can scale the runtime of this algorithm linearly on a multi-core CPU server and further reduce the searching time.

The Double Recursion Algorithm (D-Rec) [22] uses the observation that DP and TP have the same communication cost per worker, and thus only considers communication cost to do DP and TP partitions, and the combination of them. D-Rec builds its cost model statically, which asymptotically and statically analyzes the communication cost based on the shape of the tensor and the type of operator using the formulations in Tables V and IV. Based on this analysis, D-Rec automatically determines strategies for each operator within a linear-complexity short time (28 seconds to search strategies for a 24-layer BERT on 8 devices). MindSpore implements D-Rec as a choice of strategy searching algorithm due to its speed advantage. However, ignoring the computation analysis limits D-Rec from supporting heterogeneous clusters and PP in the future.

PaSE [56] also uses a static-analysis-based cost model to generate DP and TP strategies, and a combination of them. It makes good use of the sparsity of the computing graph to form a DP-based strategy searching algorithm, whose overall computational complexity is $O(|V|^2 K^{M+1})$, where $|V|$ is the total number of $v_i \in V$, $K$ is the maximum number of configurations of an operator (vertex), and $M$ is the size of the largest dependent set (i.e., the difference set of computing sub-graphs from two iterations). According to their experiments, PaSE can generate strategies for a Transformer NMT model on 16 devices and 64 devices in 2.2 minutes and 31.4 minutes, respectively. The throughput of the generated strategies outperforms Mesh-TensorFlow [69]. PaSE can be applied on heterogeneous clusters, but currently, it does not include heterogeneity in the cost model, which may result in unbalanced partitions. What is more, PaSE is not good at handling a graph G whose |E| is tremendous, as M may be significantly large. The runtime overhead of solving strategies for models like DenseNet [112] is unacceptable since their computing graphs are uniformly dense. Both D-Rec and PaSE use static-analysis-based cost models to generate strategies. However, an asymptotic analysis may not be as accurate as profiling, which may result in some performance deterioration, because static analysis usually cannot capture some low-level details like cache effects and the overlapping of computation and communication, which may be necessary for analyzing execution time.

AccPar [23] analyzes the intra-layer and inter-layer communication cost for all situations between DP and 1D-TP and does DP and 1D-TP partitions of the model. It simplifies the DAG partition problem by deciding strategies layer by layer using dynamic programming, whose arithmetic complexity is $O(|V|)$. By introducing a partition ratio, AccPar can support heterogeneous clusters. Its experiments show that the performance of AccPar outperforms OWT and HyPar [44], which only do DP and Row-TP on homogeneous clusters.

DistIR [58] is an efficient IR for the explicit representation of distributed DNN computation. It uses a linear regression model to simulate the cost of operators (e.g., MatMul and AllReduce), and the simulated throughput has a strong correlation with the real one for both MLP training and GPT-2 inference for all model sizes. DistIR uses a simple grid search to find the minimum-cost strategy that consists of DP, Row-TP, and 1F1B-PP. Although DistIR is very efficient in finding the best strategy in its search space, the optimal strategy may be too coarse to use compared to others.

Alpa [36] uses a two-level hierarchical planning algorithm to search strategies and is the first auto-parallelism method that supports DP, 1D-TP, and 1F1B-PP as well as ZeRO-DP. Alpa works on arbitrary DAGs. It formalizes the intra-operator parallelism problem as integer linear programming (ILP) and formalizes the inter-operator parallelism as dynamic programming. The dynamic programming algorithm is built on top of that in TeraPipe [39], but additionally considers device mesh slicing. During the dynamic programming calculation for finding the best inter-operator parallelism strategy, Alpa uses ILP to find the best DP and TP strategy for each stage (i.e., sub-graph). However, the overall complexity of this algorithm is $O(|V|^5 |V_D|(|V_D|/d + \log d)^2)$, where $d$ is the number of device nodes in the cluster. To reduce the complexity, Alpa tries to use early pruning to reduce the search space and uses another dynamic programming algorithm to group operators, whose computation complexity is $O(|V|^2 L)$. Here L represents the number of layers after grouping. Alpa can find the best strategy within 40 minutes for GPT-39B on 64 GPU devices. However, considering the complexity cost of this method, searching strategies for GPT-39B on 2048 GPU devices may require thousands of hours, which shows poor scalability. Still, Alpa is good enough to search near-optimal strategies for small models.
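As a toy illustration of casting intra-operator strategy selection as an ILP (in the spirit of, but far simpler than, Alpa's formulation), the sketch below picks one strategy per operator with the open-source PuLP solver, charging a fixed resharding penalty when adjacent operators disagree; all operator names, costs, and the penalty are assumptions.

```python
# Toy ILP sketch: choose exactly one strategy per operator, minimizing compute cost plus
# a fixed resharding penalty whenever the two adjacent operators pick different strategies.
import pulp

ops, strategies = ["matmul1", "matmul2"], ["row", "col"]
comp = {("matmul1", "row"): 4, ("matmul1", "col"): 6,
        ("matmul2", "row"): 7, ("matmul2", "col"): 3}
reshard = 5  # assumed penalty on the edge matmul1 -> matmul2 when strategies differ

prob = pulp.LpProblem("intra_op_strategy", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (ops, strategies), cat="Binary")
mismatch = pulp.LpVariable.dicts("mismatch", strategies, cat="Binary")

for o in ops:                                   # exactly one strategy per operator
    prob += pulp.lpSum(x[o][s] for s in strategies) == 1
for s in strategies:                            # linearize |x1 - x2| on the edge
    prob += mismatch[s] >= x["matmul1"][s] - x["matmul2"][s]
    prob += mismatch[s] >= x["matmul2"][s] - x["matmul1"][s]

prob += (pulp.lpSum(comp[o, s] * x[o][s] for o in ops for s in strategies)
         + reshard * 0.5 * pulp.lpSum(mismatch[s] for s in strategies))
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({o: next(s for s in strategies if x[o][s].value() == 1) for o in ops})
```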

Some works use the breadth-first search (BFS) algorithm to propagate strategies. BFS-based algorithms require users to annotate some parallelism strategies of tensors or operators, after which the deep learning framework will automatically propagate strategies based on set rules. GSPMD [29] is the first work to do this. It proposes sharding propagation, and the corresponding algorithm has been integrated into TensorFlow's XLA compiler [105]. It uses a priority-queue-based heuristic method to arrange the parallelism strategies of the rest of the operators in the compute graph. More specifically, it gives the element-wise operators top priority when propagating strategies.

Inspired by GSPMD, frameworks like MindSpore [57], OneFlow [31] and PaddlePaddle [87] absorb sharding propagation and create their own semi-auto-parallelism methods. Currently, they all use the BFS algorithm in propagating annotations. Among them, OneFlow shows that using split-broadcast-partial (SBP) parallelism and an actor-based runtime can further accelerate the training of large-scale models.

VI. CONCLUSIONS AND DISCUSSIONS

Large-scale models are becoming increasingly important in industry and academia, and they also greatly promote the development of scalable distributed training systems that involve auto-parallelism methods. In this survey, we took a deeper look into distributed training from the perspective of auto-parallelism. We investigate the main challenges in making auto-parallelism methods more practical and have reviewed existing methods that tackle those challenges. We give a detailed analysis of the foundations of auto-parallelism, including the problem definition and parallelism strategies. Finally, we provide an overview and analyze the existing auto-parallelism methods.

Looking into the future, we suggest a few trends that may be important in the following years, which are the acceleration of strategy searching, the optimization of the strategies found, and combinations of more parallelism schemes.

A. Accelerating Strategy Searching

1) Grouping: There are two ways of grouping to accelerate searching. The first way is to apply the same partitions on modules with the same architectures. The second way is to group some operators to form a layer and apply partitions to it.

The first way of grouping is based on the fact that many neural networks have regular structures. For example, ResNet consists of many residual blocks, and BERT consists of many transformer layers. Researchers have found that using the regularity of models can help us accelerate the auto-partition of computation graphs, because modules with the same architecture often have the same parallelism strategies. Thus, we only need to find a strategy for one layer and then broadcast it to the others, which can reduce the time to a sub-linear degree. The second way is designed for models that do not have such a regular architecture, but it can also be used together with the first way. By grouping operators in the second way, we could transform a non-linear computation graph into a linear one [43], and thus extend the usability of some algorithms like OptCNN. Moreover, we can accelerate searching by simultaneously deciding strategies for several operators (e.g., a MatMul and its succeeding ReLU).

We recommend using grouping as a heuristic method to help reduce the runtime of auto-parallelism methods. We could control the size of groups and the method to generate groups to explore the influence and effectiveness that grouping brings us.
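A minimal sketch of the first grouping trick follows: search a strategy once per distinct module signature and broadcast it to all identical copies. The layer descriptions and the search_strategy callback are assumed stand-ins for whatever per-layer search an auto-parallelism system actually uses.

```python
# Minimal sketch: group layers by an identical "signature" and reuse one searched strategy.
from collections import defaultdict

def plan_with_grouping(layers, search_strategy):
    groups = defaultdict(list)
    for idx, layer in enumerate(layers):
        groups[layer["signature"]].append(idx)            # e.g. "transformer_block"
    plan = {}
    for signature, members in groups.items():
        strategy = search_strategy(layers[members[0]])    # search once per group
        for idx in members:                               # broadcast to identical copies
            plan[idx] = strategy
    return plan

layers = [{"signature": "embedding"},
          *({"signature": "transformer_block"} for _ in range(24)),
          {"signature": "lm_head"}]
print(plan_with_grouping(layers, lambda layer: "tensor-parallel-2"))
```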
2) Profiling-based Cost Model: As mentioned in Section III, although using the symbolic cost model is very fast in evaluating strategies, it is unable to tell the difference between different devices, and it ignores many optimizations like caching and the overlap between computation and communication. Furthermore, profiling is too time-costly to evaluate every strategy for a large-scale model. We recommend using a profiling-based cost model, which holds the actual runtime of an operation on a specific device and can be further fine-tuned to gain better performance (e.g., by applying a linear regression model).
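As a sketch of the recommendation above, the snippet below fits a simple linear-regression cost model on a handful of profiled operator runtimes and then scores an unprofiled operator; the features and timings are made up for illustration, and scikit-learn is just one convenient choice of fitting tool.

```python
# Minimal sketch of a profiling-based cost model: map simple operator features
# (here, FLOPs and bytes moved) to measured runtimes, then predict unprofiled cases.
import numpy as np
from sklearn.linear_model import LinearRegression

# Profiled samples: [FLOPs, bytes]  ->  measured runtime in milliseconds (assumed values).
features = np.array([[1e9, 4e6], [2e9, 8e6], [4e9, 1.6e7], [8e9, 3.2e7]])
runtimes_ms = np.array([0.9, 1.7, 3.4, 6.9])

cost_model = LinearRegression().fit(features, runtimes_ms)
predicted = cost_model.predict(np.array([[3e9, 1.2e7]]))   # an unprofiled operator
print(f"predicted runtime: {predicted[0]:.2f} ms")
```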
3) Using Heuristics: Heuristics help reduce the search space while keeping a good enough output. For example, Alpa uses early pruning to ignore strategies with costs over the threshold; Piper uses greedy heuristics to solve the knapsack problem.

B. Optimizing Parallelism Strategies

Given a specific device topology, an auto-parallelism method should optimize the parallelism strategies by organizing computation among devices and designing a good communication pace and pattern.

1) Topology-aware Computation: Only a few existing auto-parallelism methods handle topology-aware computation, especially on heterogeneous clusters. AccPar distributes computation tasks according to the devices' computation capacity; DeepSpeed and PaddlePaddle let the CPU participate in part of the computation to alleviate the pressure on the GPU. Although most DL training is deployed on homogeneous clusters, we suggest developing auto-parallelism methods that support heterogeneous partitions.

2) Topology-aware Communication: Auto-parallelism strategy searching methods need to consider topology-aware communication strategies to further reduce communication time and increase throughput. BytePS proposes that using more CPUs as parameter servers can reduce the communication amount of synchronizing parameters. However, most of the current auto-parallelism methods fail to be aware of this possible option. [47] and [48] propose ways to reduce intra-node and inter-node communication. We suggest involving their work in generating new strategies.

C. Supporting more Parallelism Schemes

Emerging methods, including multidimensional TP [16], [17], [41], TeraPipe [39], and sequence-level parallelism [38], as well as ZeRO [37], can bring huge enhancements to training large-scale models. However, almost no auto-parallelism methods consider the above strategies in their implementation. We expect new algorithms that make the most of these emerging parallelism methods.
TABLE VII
COMPARISON OF DIFFERENT STRATEGY SEARCHING METHODS FOR AUTO-PARALLELISM

Name Supported Scheme Detail Evaluation Method Scheduling Time


ColocRL [24] Training RNN RL NMT: 27 hours on 4 K80 GPUs
HDP [84] MP Training LSTM RL NMT: 3 hours on 8 K40 GPUs
Profiling
GDP [83] Transformer RL (pre-train and fine-tune) NMT: 7.35x faster than HDP
Spotlight [86] Training LSTM+Attention RL CNN: 9 hours on 40 K80 GPUs
DP+MP
Placeto [85] MDP & Graph Embedding NMT: 49 hours
REGAL [88] MP BRKGA & GNN & RL Graphs whose |V | < 1000: seconds
HeterPS [28] LSTM RL Profiling-based cost model CTR model: 20 Seconds on 8 V100
DP+PP
FlexFlow [20] MCMC NMT: 0.6 hour on 64 K80 GPUs
Auto-MAP [32] DP or TP or PP DQN with pruning Bert-48: 262 seconds on 32*V100
Automap [33] DP+TP MCTS & interaction Network Cost Model A few minutes
Pesto [113] MP ILP Profiling-based cost model NMT: 51 minutes on 2 V100 GPUs
vPipe [114] PP Dynamic Programming (KL) Profiling $O(|V|^2 \log |V|)$
PipeDream [15] Dynamic Programming Profiling-based cost model $\sum_{k=1}^{L} O(|V|^3 m_k^2)$
RaNNC [25] Dynamic Programming Profiling Not Given
Chimera [35] DP+PP Grid-Search Profiling-based cost model $\sum_{k=1}^{L} O(|V|^3 m_k^2)$
DAPPLE [21] Dynamic Programming Not Given
DNN-Partitioning [59] Dynamic Programming+ILP Profiling-based cost model $O(I^2(|V_D^{gpu}||V_D^{cpu}| + |V| + |E|))$
OptCNN [14] Dynamic Programming (Graph Elimination and Regeneration) $O(|V|K^3)$
Tofu [43] Dynamic Programming (Graph Elimination and Regeneration) $O(|V|K^3)$
TensorOpt [26] Dynamic Programming (Graph Elimination and Regeneration) $O(|V|^2 K^3 \log(K)(\log(|V|) + \log(K)))$
DP+TP
D-Rec [22] Double Recursive Programming Symbolic cost model O(|V |)
AccPar [23] Dynamic Programming $O(|V|)$
PaSE [56] Dynamic Programming with GenerateSeq $O(|V|^2 K^{M+1})$
GSPMD [29] Sharding Propagation None O(|V |)
Neo [30] Greedy+Karmarkar-Karp algorithm Symbolic cost model Not Given
Alpa [36] DP+TP+PP ILP+Dynamic Programming $O(|V|^5 |V_D|(|V_D|/d + \log d)^2)$
DistIR [58] Grid-Search Profiling-based cost model Not Given
Piper [55] 2-level Dynamic Programming $O(|V|^2 N |V_D|^2)$
1 mk : the device number of k-th hierarchy in device topology.
2 I: number of already-partitioned region.
3 K: the number of configurable strategies.
4 M : the size of largest dependent set.
5 d: the number of device nodes (i.e, depth).
6 N : the maximum sum of DP degrees.

R EFERENCES Symposium on Principles and Practice of Parallel Programming,


Conference Proceedings, pp. 431–445.
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, [22] H. Wang, “Freeing hybrid distributed ai training configuration,” in
no. 7553, pp. 436–444, 2015. Proceedings of the 29th ACM Joint Meeting on European Software
[2] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, Engineering Conference and Symposium on the Foundations of Soft-
M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi ware Engineering, 2021, pp. 1620–1624.
speech recognition toolkit,” in IEEE 2011 workshop on automatic [23] L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Accpar:
speech recognition and understanding, no. CONF. IEEE Signal Tensor partitioning for heterogeneous deep learning accelerators,” in
Processing Society, 2011. 2020 IEEE International Symposium on High Performance Computer
[3] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Tho- Architecture (HPCA). IEEE, Conference Proceedings, pp. 342–355.
rat, F. Viégas, M. Wattenberg, and G. Corrado, “Google’s multilingual [24] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou,
neural machine translation system: Enabling zero-shot translation,” N. Kumar, M. Norouzi, S. Bengio, and J. Dean, “Device placement
Transactions of the Association for Computational Linguistics, vol. 5, optimization with reinforcement learning,” CoRR, vol. abs/1706.04972,
pp. 339–351, 2017. 2017. [Online]. Available: http://arxiv.org/abs/1706.04972
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification [25] M. Tanaka, K. Taura, T. Hanawa, and K. Torisawa, “Automatic
with deep convolutional neural networks,” Advances in neural infor- graph partitioning for very large-scale deep learning,” arXiv preprint
mation processing systems, vol. 25, pp. 1097–1105, 2012. arXiv:2103.16063, 2021.
[5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural [26] Z. Cai, X. Yan, K. Ma, Y. Wu, Y. Huang, J. Cheng, T. Su, and
collaborative filtering,” in Proceedings of the 26th International Confer- F. Yu, “Tensoropt: Exploring the tradeoffs in distributed dnn training
ence on World Wide Web. International World Wide Web Conferences with auto-parallelism,” IEEE Transactions on Parallel and Distributed
Steering Committee, 2017, Conference Proceedings, p. 173–182. Systems, 2021.
[6] BAAI, “Release of wudao2.0,” 2021, https://2021.baai.ac.cn/schedule. [27] J. H. Park, G. Yun, C. M. Yi, N. T. Nguyen, S. Lee,
[7] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling J. Choi, S. H. Noh, and Y. ri Choi, “HetPipe: Enabling large
to trillion parameter models with simple and efficient sparsity,” arXiv DNN training on (whimpy) heterogeneous GPU clusters through
preprint arXiv:2101.03961, 2021. integration of pipelined model parallelism and data parallelism,”
[8] J. Lin, A. Yang, J. Bai, C. Zhou, L. Jiang, X. Jia, A. Wang, J. Zhang, in 2020 USENIX Annual Technical Conference (USENIX ATC 20).
Y. Li, W. Lin, J. Zhou, and H. Yang, “M6-10t: A sharing-delinking USENIX Association, Jul. 2020, pp. 307–321. [Online]. Available:
paradigm for efficient multi-trillion parameter pretraining,” 2021. https://www.usenix.org/conference/atc20/presentation/park
[Online]. Available: https://arxiv.org/abs/2110.03888 [28] J. Liu, Z. Wu, D. Yu, Y. Ma, D. Feng, M. Zhang, X. Wu, X. Yao,
[9] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, and Z. Yang, “Wudao- and D. Dou, “Heterps: Distributed deep learning with reinforcement
corpora: A super large-scale chinese corpora for pre-training language learning based scheduling in heterogeneous environments,” CoRR,
models,” Preprint, 2021. vol. abs/2111.10635, 2021. [Online]. Available: https://arxiv.org/abs/
[10] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Fos- 2111.10635
ter, J. Phang, H. He, A. Thite, and N. Nabeshima, “The pile: An [29] Y. Xu, H. Lee, D. Chen, B. Hechtman, Y. Huang, R. Joshi, M. Krikun,
800gb dataset of diverse text for language modeling,” arXiv preprint D. Lepikhin, A. Ly, M. Maggioni et al., “Gspmd: General and
arXiv:2101.00027, 2020. scalable parallelization for ml computation graphs,” arXiv preprint
[11] R. Mayer and H.-A. Jacobsen, “Scalable deep learning on distributed arXiv:2105.04663, 2021.
infrastructures: Challenges, techniques, and tools,” ACM Computing [30] D. Mudigere, Y. Hao, J. Huang, A. Tulloch, and S. Sridharan,
Surveys (CSUR), vol. 53, no. 1, pp. 1–37, 2020. “High-performance, distributed training of large-scale deep learning
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification recommendation models,” CoRR, vol. abs/2104.05158, 2021. [Online].
with deep convolutional neural networks,” Advances in neural infor- Available: https://arxiv.org/abs/2104.05158
mation processing systems, vol. 25, pp. 1097–1105, 2012. [31] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, F. Yang, X. Yi,
[13] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. C. Wu, H. Zhang, and J. Zhao, “Oneflow: Redesign the distributed
Mao, M. Ranzato, A. Senior, and P. Tucker, “Large scale distributed deep learning framework from scratch,” CoRR, vol. abs/2110.15032,
deep networks,” 2012. 2021. [Online]. Available: https://arxiv.org/abs/2110.15032
[14] Z. Jia, S. Lin, C. R. Qi, and A. Aiken, “Exploring hidden dimensions [32] S. Wang, Y. Rong, S. Fan, Z. Zheng, L. Diao, G. Long, J. Yang, X. Liu,
in accelerating convolutional neural networks,” in International Con- and W. Lin, “Auto-map: A DQN framework for exploring distributed
ference on Machine Learning. PMLR, Conference Proceedings, pp. execution plans for DNN workloads,” CoRR, vol. abs/2007.04069,
2274–2283. 2020. [Online]. Available: https://arxiv.org/abs/2007.04069
[15] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, [33] M. Schaarschmidt, D. Grewe, D. Vytiniotis, A. Paszke, G. S.
G. R. Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: generalized Schmid, T. Norman, J. Molloy, J. Godwin, N. A. Rink, V. Nair, and
pipeline parallelism for dnn training,” in Proceedings of the 27th ACM D. Belov, “Automap: Towards ergonomic automated parallelism for
Symposium on Operating Systems Principles, Conference Proceedings, ML models,” CoRR, vol. abs/2112.02958, 2021. [Online]. Available:
pp. 1–15. https://arxiv.org/abs/2112.02958
[16] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient 2d method for training [34] Z. Jia, S. Lin, M. Gao, M. Zaharia, and A. Aiken, “Improving the
super-large deep learning models,” CoRR, vol. abs/2104.05343, 2021. accuracy, scalability, and performance of graph neural networks with
[Online]. Available: https://arxiv.org/abs/2104.05343 roc,” Proceedings of Machine Learning and Systems, vol. 2, pp. 187–
[17] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism 198, 2020.
in distributed training for huge neural networks,” CoRR, vol. [35] S. Li and T. Hoefler, “Chimera: Efficiently training large-scale
abs/2105.14450, 2021. [Online]. Available: https://arxiv.org/abs/2105. neural networks with bidirectional pipelines,” in Proceedings of
14450 the International Conference for High Performance Computing,
[18] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan- Networking, Storage and Analysis, ser. SC ’21. New York, NY, USA:
zaro, “Megatron-lm: Training multi-billion parameter language models Association for Computing Machinery, 2021. [Online]. Available:
using model parallelism,” arXiv preprint arXiv:1909.08053, 2019. https://doi.org/10.1145/3458817.3476145
[19] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: Sys- [36] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang,
tem optimizations enable training deep learning models with over Y. Xu, D. Zhuo, J. E. Gonzalez, and I. Stoica, “Alpa: Automating inter-
100 billion parameters,” in Proceedings of the 26th ACM SIGKDD and intra-operator parallelism for distributed deep learning,” 2022.
International Conference on Knowledge Discovery & Data Mining, [37] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory
Conference Proceedings, pp. 3505–3506. optimizations toward training trillion parameter models,” in SC20: In-
[20] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible ternational Conference for High Performance Computing, Networking,
dataflow accelerator architecture for convolutional neural networks,” in Storage and Analysis. IEEE, Conference Proceedings, pp. 1–16.
2017 IEEE International Symposium on High Performance Computer [38] S. Li, F. Xue, Y. Li, and Y. You, “Sequence parallelism: Making 4d
Architecture (HPCA). IEEE, Conference Proceedings, pp. 553–564. parallelism possible,” arXiv preprint arXiv:2105.13120, 2021.
[21] S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, [39] Z. Li, S. Zhuang, S. Guo, D. Zhuo, H. Zhang, D. Song, and I. Stoica,
J. Yang, and L. Xia, “Dapple: A pipelined data parallel approach for “Terapipe: Token-level pipeline parallelism for training large-scale
training large models,” in Proceedings of the 26th ACM SIGPLAN language models,” arXiv preprint arXiv:2102.07988, 2021.
20

[40] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, [63] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josi-
N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with fovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed
conditional computation and automatic sharding,” arXiv preprint machine learning with the parameter server,” in 11th USENIX Sym-
arXiv:2006.16668, 2020. posium on Operating Systems Design and Implementation (OSDI 14),
[41] B. Wang, Q. Xu, Z. Bian, and Y. You, “2.5-dimensional distributed Conference Proceedings, pp. 583–598.
model training,” CoRR, vol. abs/2105.14500, 2021. [Online]. Available: [64] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo, “A unified
https://arxiv.org/abs/2105.14500 architecture for accelerating distributed DNN training in heterogeneous
[42] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and gpu/cpu clusters,” in 14th USENIX Symposium on Operating Systems
J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Design and Implementation (OSDI 20), Conference Proceedings, pp.
Computing Surveys (CSUR), vol. 53, no. 2, pp. 1–33, 2020. 463–479.
[43] M. Wang, C.-c. Huang, and J. Li, “Supporting very large models [65] A. Sergeev and M. Del Balso, “Horovod: fast and easy distributed deep
using automatic dataflow graph partitioning,” in Proceedings of the learning in tensorflow,” arXiv preprint arXiv:1802.05799, 2018.
Fourteenth EuroSys Conference 2019, 2019, pp. 1–17. [66] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective
[44] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Hypar: communication operations in mpich,” The International Journal of
Towards hybrid parallelism for deep learning accelerator array,” in High Performance Computing Applications, vol. 19, no. 1, pp. 49–66,
2019 IEEE International Symposium on High Performance Computer 2005.
Architecture (HPCA). IEEE, 2019, pp. 56–68. [67] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms
[45] Y. Ueno and R. Yokota, “Exhaustive study of hierarchical allreduce for clusters of workstations,” Journal of Parallel and Distributed
patterns for large messages between gpus,” in 2019 19th IEEE/ACM Computing, vol. 69, no. 2, pp. 117–124, 2009.
International Symposium on Cluster, Cloud and Grid Computing (CC- [68] A. Gibiansky, “Bringing hpc techniques to deep learning,” Baidu
GRID). IEEE, 2019, pp. 430–439. Research, Tech. Rep., 2017.
[46] M. Cho, U. Finkler, D. Kung, and H. Hunter, “Blueconnect: Decompos- [69] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanan-
ing all-reduce for deep learning on heterogeneous network hierarchy,” takool, P. Hawkins, H. Lee, M. Hong, and C. Young, “Mesh-tensorflow:
Proceedings of Machine Learning and Systems, vol. 1, pp. 241–251, Deep learning for supercomputers,” arXiv preprint arXiv:1811.02084,
2019. 2018.
[47] N. Xie, T. Norman, D. Grewe, and D. Vytiniotis, “Synthesizing [70] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang, F. Cui,
optimal parallelism placement and reduction strategies on hierarchical and Y. You, “Colossal-ai: A unified deep learning system for large-
systems for deep learning,” CoRR, vol. abs/2110.10548, 2021. scale parallel training,” CoRR, vol. abs/2110.14883, 2021. [Online].
[Online]. Available: https://arxiv.org/abs/2110.10548 Available: https://arxiv.org/abs/2110.14883
[48] N. A. Rink, A. Paszke, D. Vytiniotis, and G. S. Schmid, “Memory- [71] A. Krizhevsky, “One weird trick for parallelizing convolutional neural
efficient array redistribution through portable collective communica- networks,” arXiv preprint arXiv:1404.5997, 2014.
tion,” 2021. [72] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen,
[49] K. Kennedy and U. Kremer, “Automatic data layout for distributed- H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training
memory machines,” ACM Transactions on Programming Languages of giant neural networks using pipeline parallelism,” arXiv preprint
and Systems (TOPLAS), vol. 20, no. 4, pp. 869–916, 1998. arXiv:1811.06965, 2018.
[73] D. Narayanan, A. Phanishayee, K. Shi, X. Chen, and M. Za-
[50] J. L. M. Chen and J. Li, “Index domain alignment: Minimizing cost
haria, “Memory-efficient pipeline-parallel dnn training,” arXiv preprint
of cross-referencing between distributed arrays,” 1989.
arXiv:2006.09503, 2020.
[51] U. Kremer, “Np-completeness of dynamic remapping,” in Proceedings
[74] M. Assran, N. Loizou, N. Ballas, and M. G. Rabbat, “Stochastic gra-
of the Fourth Workshop on Compilers for Parallel Computers, Delft,
dient push for distributed deep learning,” CoRR, vol. abs/1811.10792,
The Netherlands, 1993.
2018. [Online]. Available: http://arxiv.org/abs/1811.10792
[52] J. Li and M. Chen, “The data alignment phase in compiling programs
[75] X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized
for distributed-memory machines,” Journal of parallel and distributed
parallel stochastic gradient descent,” 2018.
computing, vol. 13, no. 2, pp. 213–221, 1991.
[76] G. Nadiradze, A. Sabour, D. Alistarh, A. Sharma, I. Markov, and V. Ak-
[53] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine senov, “Swarmsgd: Scalable decentralized sgd with local updates.”
learning. MIT press, 2018. arXiv: Learning, 2020.
[54] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. [77] Z. Tang, S. Shi, X. Chu, W. Wang, and B. Li, “Communication-efficient
34–37, 1966. distributed deep learning: A comprehensive survey,” 2020.
[55] J. Tarnawski, D. Narayanan, and A. Phanishayee, “Piper: Multidimen- [78] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary,
sional planner for dnn parallelization,” in NeurIPS 2021, December V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, and
2021. [Online]. Available: https://www.microsoft.com/en-us/research/ B. Catanzaro, “Efficient large-scale language model training on gpu
publication/piper-multidimensional-planner-for-dnn-parallelization/ clusters,” arXiv preprint arXiv:2104.04473, 2021.
[56] V. Elango, “Pase: Parallelization strategies for efficient dnn training,” [79] A. Jain, A. A. Awan, A. M. Aljuhani, J. M. Hashmi, Q. G. Anthony,
in 2021 IEEE International Parallel and Distributed Processing Sym- H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, “Gems:
posium (IPDPS), 2021, pp. 1025–1034. Gpu-enabled memory-aware model-parallelism system for distributed
[57] Huawei, “Mindspore,” https://www.mindspore.cn/en, 2020. dnn training,” in SC20: International Conference for High Performance
[58] K. Santhanam, S. Krishna, R. Tomioka, T. Harris, and M. Zaharia, Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–15.
“Distir: An intermediate representation and simulator for efficient [80] A. Griewank and A. Walther, “Algorithm 799: Revolve: An
neural network distribution,” CoRR, vol. abs/2111.05426, 2021. implementation of checkpointing for the reverse or adjoint mode
[Online]. Available: https://arxiv.org/abs/2111.05426 of computational differentiation,” ACM Trans. Math. Softw., vol. 26,
[59] J. Tarnawski, A. Phanishayee, N. R. Devanur, D. Mahajan, and F. N. no. 1, p. 19–45, mar 2000. [Online]. Available: https://doi.org/10.
Paravecino, “Efficient algorithms for device placement of DNN graph 1145/347837.347846
operators,” CoRR, vol. abs/2006.16423, 2020. [Online]. Available: [81] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with
https://arxiv.org/abs/2006.16423 sublinear memory cost,” arXiv preprint arXiv:1604.06174, 2016.
[60] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, [82] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, Gomez, L. Kaiser, and I. Polosukhin.
“Pytorch distributed: Experiences on accelerating data parallel [83] Y. Zhou, S. Roy, A. Abdolrashidi, D. L. Wong, P. C. Ma,
training,” CoRR, vol. abs/2006.15704, 2020. [Online]. Available: Q. Xu, M. Zhong, H. Liu, A. Goldie, A. Mirhoseini, and
https://arxiv.org/abs/2006.15704 J. Laudon, “GDP: generalized device placement for dataflow
[61] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero- graphs,” CoRR, vol. abs/1910.01578, 2019. [Online]. Available:
infinity: Breaking the gpu memory wall for extreme scale deep learn- http://arxiv.org/abs/1910.01578
ing,” arXiv preprint arXiv:2104.07857, 2021. [84] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean,
[62] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, and J. H. and, “A hierarchical model for device placement,” in International Confer-
“Scaling language models: Methods, analysis & insights from training ence on Learning Representations, 2018.
gopher,” CoRR, vol. abs/2112.11446, 2021. [Online]. Available: [85] R. Addanki, S. B. Venkatakrishnan, S. Gupta, H. Mao, and M. Al-
https://arxiv.org/abs/2112.11446 izadeh, “Placeto: Learning generalizable device placement algorithms
21

for distributed machine learning,” CoRR, vol. abs/1906.08879, 2019. [110] S. Zagoruyko and N. Komodakis, “Wide residual networks,” CoRR,
[Online]. Available: http://arxiv.org/abs/1906.08879 vol. abs/1605.07146, 2016. [Online]. Available: http://arxiv.org/abs/
[86] Y. Gao, L. Chen, and B. Li, “Spotlight: Optimizing device placement 1605.07146
for training deep neural networks,” in International Conference on [111] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
Machine Learning. PMLR, 2018, pp. 1676–1684. training of deep bidirectional transformers for language understanding,”
[87] Y. Ao, Z. Wu, D. Yu, W. Gong, Z. Kui, M. Zhang, Z. Ye, L. Shen, in Proceedings of the 2019 Conference of the North American Chapter
Y. Ma, T. Wu et al., “End-to-end adaptive distributed training on of the Association for Computational Linguistics: Human Language
paddlepaddle,” arXiv preprint arXiv:2112.02752, 2021. Technologies, Volume 1 (Long and Short Papers). Minneapolis,
[88] A. Paliwal, F. Gimeno, V. Nair, Y. Li, M. Lubin, P. Kohli, and Minnesota: Association for Computational Linguistics, Jun. 2019, pp.
O. Vinyals, “Reinforced genetic algorithm learning for optimizing 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/
computation graphs,” 2019. N19-1423
[89] E. W. Dijkstra, “Recursive programming,” Numerische Mathematik, [112] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected
vol. 2, no. 1, pp. 312–318, 1960. convolutional networks,” CoRR, vol. abs/1608.06993, 2016. [Online].
[90] A. Schrijver, Theory of linear and integer programming. John Wiley Available: http://arxiv.org/abs/1608.06993
& Sons, 1998. [113] U. U. Hafeez, X. Sun, A. Gandhi, and Z. Liu, “Towards optimal
placement and scheduling of dnn operations with pesto,” in Proceedings
[91] L. Luo, M. Wong, and W. Hwu, “An effective gpu implementation of
of the 22nd International Middleware Conference, ser. Middleware ’21.
breadth-first search,” in Design Automation Conference, 2010.
New York, NY, USA: Association for Computing Machinery, 2021, p.
[92] W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov chain Monte 39–51. [Online]. Available: https://doi.org/10.1145/3464298.3476132
Carlo in practice. CRC press, 1995. [114] S. Zhao, F. Li, X. Chen, X. Guan, J. Jiang, D. Huang, Y. Qing, S. Wang,
[93] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Wang, G. Zhang et al., “v pipe: A virtualized acceleration system for
P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, achieving efficient and scalable pipeline parallel dnn training,” IEEE
“A survey of monte carlo tree search methods,” IEEE Transactions on Transactions on Parallel and Distributed Systems, vol. 33, no. 3, pp.
Computational Intelligence and AI in games, vol. 4, no. 1, pp. 1–43, 489–506, 2021.
2012.
[94] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement
learning: A survey,” Journal of artificial intelligence research, vol. 4,
pp. 237–285, 1996.
[95] R. J. Williams, “Simple statistical gradient-following algorithms for
connectionist reinforcement learning,” Machine learning, vol. 8, no. 3,
pp. 229–256, 1992.
[96] B. and Hammer, “Learning with recurrent neural networks,” Assembly
Automation, 1980.
[97] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347,
2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[98] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,”
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 11 1997. [Online].
Available: https://doi.org/10.1162/neco.1997.9.8.1735
[99] F. P. Miller, A. F. Vandome, and J. Mcbrewster, “Markov decision
process,” Springer London, 1985.
[100] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation
learning on large graphs,” CoRR, vol. abs/1706.02216, 2017. [Online].
Available: http://arxiv.org/abs/1706.02216
[101] J. Gonçalves and M. Resende, “Biased random-key genetic algorithms
for combinatorial optimization,” Journal of Heuristics, vol. 17, no. 5,
pp. 487–525, 2011.
[102] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. A. Riedmiller, “Playing atari with deep
reinforcement learning,” CoRR, vol. abs/1312.5602, 2013. [Online].
Available: http://arxiv.org/abs/1312.5602
[103] M. Li, Y. Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan,
G. Yang, and D. Qian, “The deep learning compiler: A comprehensive
survey,” 2020.
[104] A. Sabne, “Xla : Compiling machine learning for peak performance,”
2020.
[105] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system
for large-scale machine learning,” in 12th {USENIX} symposium on
operating systems design and implementation ({OSDI} 16), 2016, pp.
265–283.
[106] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning
with a unified text-to-text transformer,” CoRR, vol. abs/1910.10683,
2019. [Online]. Available: http://arxiv.org/abs/1910.10683
[107] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural
machine translation system: Bridging the gap between human and
machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[108] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. A.
Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, “Mlir:
Scaling compiler infrastructure for domain specific computation,” in
CGO 2021, 2021.
[109] P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu,
“Interaction networks for learning about objects, relations and physics,”
2016.
