Auto Parallel

Abstract—Deep learning (DL) has gained great success in recent years, leading to state-of-the-art performance in research communities and industrial fields like computer vision and natural language processing. One of the reasons for this success is the huge number of parameters adopted in DL models. However, it is impractical to train a moderately large model with a large number of parameters on a typical single device. It is therefore necessary to train DL models on clusters with novel parallel and distributed training algorithms. However, traditional training algorithms are unable to train large-scale neural networks in heterogeneous computing clusters. Nowadays, auto-parallelism is promising to handle this issue: it makes large-scale DL model training efficient and practical on various computing clusters. In this survey, we perform a broad and thorough investigation of the challenges, basis, and strategy searching methods of auto-parallelism in DL training. Firstly, we abstract the basic parallelism schemes with their communication cost and memory consumption in DL training. Further, we analyze and compare a series of current auto-parallelism works and investigate the strategies and searching methods that are commonly used in practice. At last, we discuss several trends in auto-parallelism that are promising for future research.

Index Terms—auto-parallelism, large-scale neural networks, training technique, parallel and distributed training

I. INTRODUCTION

Deep learning [1] has drawn a lot of attention for its superior performance in tasks like speech recognition [2], machine translation [3], object detection [4], recommendation [5], and so on. The success of deep learning is highly related to the availability of large, labeled databases and the ability to train these huge networks. So far, the largest models have trillions of parameters [6]–[8]. Furthermore, the volume of training corpora has reached the terabyte scale. For instance, Wudao 2.0 [6] is trained with 4.9 TB of high-quality Chinese and English corpus from the WuDaoCorpora [9] and Pile [10] datasets.

While the number of model parameters increases exponentially, the storage capacity of a computing device has only increased from a few GBs to 80 GBs (e.g., NVIDIA A100, H100) in the last decade, which results in a memory wall bottleneck. Thus, a moderate computing device can no longer hold an entire model. In order to train a large-scale model, we may need thousands of computing devices to work cooperatively, and deliberately managing these devices to effectively and efficiently train large-scale models gains more and more research attention from the research community and industry. Distributed training [11] jointly makes use of multiple devices to train the model and achieves a reasonable training speedup. In 2012, [12] trained AlexNet with 2 GPUs in parallel, which was an early and successful attempt at training a model with multiple devices. Then, Jeffrey Dean et al. proposed the first-generation distributed deep learning system DistBelief [13], introducing the concept of distributed computation into deep neural network training. Moreover, they systematically designed the parallelism policies and the way of synchronization so that the training process could apply to large-scale clusters. At present, research on accelerating distributed training mainly focuses on the design of parallelism strategies and how to select them.

In this work, we summarize the parallelism works available in publications and analyze these algorithms comprehensively. Parallelism strategies can be divided into two categories: intra-operator parallelism [14] and inter-operator parallelism [15]. Intra-operator parallelism includes data parallelism (DP) and tensor parallelism (TP). TP is also known as intra-layer model parallelism and has several varieties, such as Row-TP, Column-TP, 2D-TP [16], and 3D-TP [17]. Inter-operator parallelism includes inter-layer model parallelism and pipeline parallelism (PP). All these strategies help accelerate the training of models, but the best performance cannot be achieved by using only a few of them. To gain better performance, researchers propose hybrid parallelism, which combines data, model, and pipeline parallelism to partition the model in a fine-grained way and increase throughput. Representative work includes Megatron-LM [18] and the 3D-parallelism from DeepSpeed [19].

However, manually applying intra-operator or inter-operator parallelism to a model is difficult, since manual partitioning requires the engineer to be an expert in communication and computing. They must be aware of the execution time and memory state in every sub-procedure of the training process. Moreover, strategies vary with the device topology and the model structure. Once the devices or the model change, experts may need to redesign the strategy from scratch. Although an excellent parallelism strategy can perform better than using only data or model parallelism, repeated manual design is empirical, often sub-optimal, and sometimes even impractical in real-world applications. Thus, it is essential to search for the optimal hybrid parallel strategy automatically to save time, energy, and money.

Peng Liang, Yu Tang, Zhiquan Lai, Linbo Qiao, and Dongsheng Li are with the College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China, 410073. Xiaoda Zhang, Youhui Bai, and Teng Su are with Huawei Technologies Co. Ltd. E-mail: dsli@nudt.edu.cn.
†: corresponding authors.
Manuscript received Mar. 31, 2022; revised xx xx, xxxx.
automatically. We will further discuss these strategy searching methods in Sec. V.

C. Load-Balance in Heterogeneous Topology

Heterogeneous topology here represents a device graph with different types of computing devices in it (e.g., CPUs and GPUs of multiple types). Using heterogeneous clusters to train a model is a more environmentally friendly choice: buying new devices does not mean the old ones can no longer work, because they can still participate in some parts of the work. Different types of devices usually have different computing performance. The goal is to arrange the computation properly to achieve load-balance on each device. However, only a few works involve load-balance analysis in the strategy searching task. DeepSpeed [19] heuristically uses the CPU to execute the parameter updates, because the updates are less complicated than forward and backward propagation and the update computation on the CPU can overlap with the computation on the GPU in certain situations. Paddle-HeterPS [28] uses reinforcement learning to decide the computing device for every layer, but it only supports DP and PP. AccPar [23] introduces a method that solves the partition ratios for each kind of device and then partitions the model layer by layer, after which the computation time on each device is similar. However, it only supports DP and 1D-TP without check-pointing. Auto-parallelism on heterogeneous topology is still explorable, and it can save much of the money spent on buying more devices.

D. Topology-aware Communication Optimization

Auto-parallelism algorithms need to consider topology-aware communication strategies to further reduce communication time and increase throughput. Due to the limited size of the motherboard of a node, a large number of computing devices are distributed across different nodes, which results in bandwidth differences between intra-node and inter-node communication. Since intra-node bandwidth is usually higher than inter-node bandwidth, making full use of intra-node bandwidth can optimize communication and thus reduce overall execution time. [45], [46] divide an all-reduce operation among all devices in a cluster into several all-reduce operations among subgroups of devices to achieve better performance. Inspired by this, P2 [47] can generate DP and 1D-TP partition strategies and utilizes the system hierarchy to synthesize the best reduction strategies, which consist of sequences of common collective communication operations and are proven to be faster than a single All-Reduce operation among all devices in many cases. These works on all-reduce optimize the intra-layer communication we mentioned above. Another line of communication optimization, on tensor distribution, reduces the inter-layer communication cost. [48] reduces the communication amount of tensor redistribution by replacing the original All-to-All operations with sequences of portable collective communication operations, including All-Gather, Dynamic-Slice, All-Permute, and All-to-All.

E. Trade-off of Runtime and Strategy Performance in Finding Strategies

Strategy searching algorithms are time-consuming for two reasons. The first is that partitioning a DAG for optimal performance is an NP-hard problem [49]–[52]. The second is that every strategy the algorithm finds has to be evaluated.

To solve the NP-hard problem, researchers have tried to use machine-learning algorithms [1], [53] and classic algorithms like dynamic programming [54]. Some works [43], [55] additionally use heuristic assumptions that help shorten the searching runtime but may sacrifice the performance of the strategy. We will discuss more details of this in Sec. V.

To evaluate the performance of searched strategies, we can use a cost model that calculates the cost of a strategy, or profile the execution time of a strategy by deploying it to the model and running it. Some auto-parallelism works [22], [56], [57] use a symbolic cost model to analyze the performance of strategies. However, most accelerators (e.g., GPU, NPU, FPGA) do the computation in parallel, while a symbolic cost model only reflects the serial amount of computation. Meanwhile, different types of devices may have different performance and implementations for some specific tasks like convolution, which is hard for the system to be aware of and thus needs more manual annotation work on tuning the cost model. Moreover, it is hard for symbolic cost models to be aware of the overlap between computation and communication. All of this makes it hard for a symbolic cost model to accurately reflect the actual performance of found strategies, though it has a shorter runtime than profiling. On the other hand, profiling every schedule that the algorithm generates is too time-costly [15], [25], [58], although it can accurately tell us the difference between any two strategies. Using a profiling-based cost model [55], [59] seems to be a more reasonable decision; its costs are the actual times obtained by running each operator in the computation graph, and we can then find a near-optimal parallelism strategy by minimizing the cost.

IV. THE ANALYSIS ON DIFFERENT PARALLELISM SCHEMES

Detailed analysis of the communication, computation, and memory cost of every parallelism scheme is the basis of auto-parallelism, since different partition strategies bring different amounts of cost. Auto-parallelism methods try every combination of parallelism schemes they can handle and select the one with the minimum cost as the final decision. This section discusses the partition and communication in every parallelism scheme. To simplify the illustration in this section, we assume that we are using homogeneous clusters, because in homogeneous clusters devices have the same computation capacity, which means that partitions can be executed evenly and each device has the same computation cost when given the partitioned task. This assumption helps us focus on evaluating communication costs in different strategies. It should be noted that the communication amount is the most crucial factor we need to consider in generating strategies, and computation is another critical factor for balancing work on every device.
Scheme Name                  | Input (Tin)  | Weight (W)                     | Output (Tout) | Weight Gradient (δW)           | Optimizer States (Os)
Vanilla DP                   | (b/p, win)   | (win, wout)                    | (b/p, wout)   | (win, wout)                    | (win, wout)
ZeRO-powered DP stage 1 [37] | (b/p, win)   | (win, wout)                    | (b/p, wout)   | (win, wout)                    | (win/p, wout) or (win, wout/p)
ZeRO-powered DP stage 2 [37] | (b/p, win)   | (win, wout)                    | (b/p, wout)   | (win/p, wout) or (win, wout/p) | (win/p, wout) or (win, wout/p)
ZeRO-powered DP stage 3 [37] | (b/p, win)   | (win/p, wout) or (win, wout/p) | (b/p, wout)   | (win/p, wout) or (win, wout/p) | (win/p, wout) or (win, wout/p)
Row-TP                       | (b, win/p)   | (win/p, wout)                  | (b, wout)     | (win/p, wout)                  | (win/p, wout)
Column-TP                    | (b, win)     | (win, wout/p)                  | (b, wout/p)   | (win, wout/p)                  | (win, wout/p)
Scheme         | Tin                                | W                                          | Tout                                | δW                                         | Os
Vanilla DP¹ ⁴  | [[0, b/2], [−1]]                   | [[−1], [−1]]                               | [[0, b/2], [−1]]                    | [[−1], [−1]]                               | [[−1], [−1]]
ZeRO stage 1¹  | [[0, b/2], [−1]]                   | [[−1], [−1]]                               | [[0, b/2], [−1]]                    | [[−1], [−1]]                               | [[0, win/2], [−1]]
ZeRO stage 2¹  | [[0, b/2], [−1]]                   | [[−1], [−1]]                               | [[0, b/2], [−1]]                    | [[0, win/2], [−1]]                         | [[0, win/2], [−1]]
ZeRO stage 3¹  | [[0, b/2], [−1]]                   | [[0, win/2], [−1]]                         | [[0, b/2], [−1]]                    | [[0, win/2], [−1]]                         | [[0, win/2], [−1]]
Row-TP¹        | [[−1], [0, win/2]]                 | [[0, win/2], [−1]]                         | [[−1], [−1]]                        | [[0, win/2], [−1]]                         | [[0, win/2], [−1]]
Column-TP¹     | [[−1], [−1]]                       | [[−1], [0, wout/2]]                        | [[−1], [0, wout/2]]                 | [[−1], [0, wout/2]]                        | [[−1], [0, wout/2]]
2D-TP²         | [[0, b/2], [0, win/2]]             | [[0, win/2], [0, wout/2]]                  | [[0, b/2], [0, wout/2]]             | [[0, win/2], [0, wout/2]]                  | [[0, win/2], [0, wout/2]]
3D-TP³         | [[0, b/4, b/2, 3b/4], [0, win/2]]  | [[0, win/2], [0, wout/4, wout/2, 3wout/4]] | [[0, b/4, b/2, 3b/4], [0, wout/2]]  | [[0, win/2], [0, wout/4, wout/2, 3wout/4]] | [[0, win/2], [0, wout/4, wout/2, 3wout/4]]

¹ Partitioned on 2 devices; device group g is [0, 1] as an example.
² Partitioned on 4 devices; device group g is [0, 1, 2, 3] as an example.
³ Partitioned on 8 devices; device group g is [0, 1, 2, 3, 4, 5, 6, 7] as an example.
⁴ −1 represents a non-partition along this axis.
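As a sanity check on the Row-TP and Column-TP rows of the tables above, the following NumPy sketch (illustrative only; p = 2 devices as in the tables, with small hypothetical dimensions) partitions the weight of a single MatMul and verifies that both schemes reproduce the un-partitioned output:

```python
import numpy as np

b, w_in, w_out, p = 4, 6, 8, 2            # batch size, input width, output width, #devices
X = np.random.rand(b, w_in)
W = np.random.rand(w_in, w_out)
Y = X @ W                                  # reference output of the un-partitioned MatMul

# Row-TP: W is split along its rows into (w_in/p, w_out) shards; the input is split
# along its last axis to match.  Each device produces a partial (b, w_out) result,
# and the partial results are summed (an AllReduce in a real system).
X_shards = np.split(X, p, axis=1)
W_row_shards = np.split(W, p, axis=0)
Y_row = sum(x @ w for x, w in zip(X_shards, W_row_shards))

# Column-TP: W is split along its columns into (w_in, w_out/p) shards; every device
# sees the full input and produces a (b, w_out/p) slice, concatenated by an AllGather.
W_col_shards = np.split(W, p, axis=1)
Y_col = np.concatenate([X @ w for w in W_col_shards], axis=1)

assert np.allclose(Y, Y_row) and np.allclose(Y, Y_col)
```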
TABLE III: Comparison of the minimum communication volume of data parallelism.
into sub-blocks with shape [win/p^(1/3), wout/p^(2/3)]. 2D-TP and 2.5D-TP use broadcast and reduce communication operations to help calculate the output results, while 3D-TP uses all-gather and reduce-scatter collective operations. It is emphasized that the communication amounts of these three methods are smaller than that of 1D-TP while having no redundancy in saving intermediate results.
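For intuition only, the sketch below emulates the block layout behind 2D-TP on a 2×2 device grid with plain NumPy. The batch-dimension split shown in the table and the broadcast/reduce steps of a real implementation are omitted; all names and sizes are illustrative assumptions.

```python
import numpy as np

b, w_in, w_out, q = 4, 6, 8, 2            # q x q = 4 devices arranged in a 2-D grid
X = np.random.rand(b, w_in)
W = np.random.rand(w_in, w_out)

# Each device (i, j) in the grid owns one (w_in/q, w_out/q) block of W,
# matching the 2D-TP row of the table above (the batch split is omitted here).
W_blocks = [[W[i*w_in//q:(i+1)*w_in//q, j*w_out//q:(j+1)*w_out//q]
             for j in range(q)] for i in range(q)]
X_blocks = [X[:, i*w_in//q:(i+1)*w_in//q] for i in range(q)]

# Blocked MatMul: output column-block j is the sum over i of X_i @ W_ij.
# In a real 2-D TP implementation this sum is realized with broadcast/reduce
# (or all-gather/reduce-scatter) steps inside the device grid.
Y_blocked = np.concatenate(
    [sum(X_blocks[i] @ W_blocks[i][j] for i in range(q)) for j in range(q)], axis=1)

assert np.allclose(Y_blocked, X @ W)
```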
TABLE IV: Intra-op parallelism schemes: intra-layer communication cost of MatMul Tout = Tin W in a training step.
TABLE V: Tensor redistribution: inter-layer communication volume.
small weight matrix but tremendous intermediate results, is not acceptable. TP is more suitable for a model with big weight matrices in order to gain a performance improvement [71].

C. Pipeline Parallelism

MP is relatively more complicated than DP to design and reproduce, because MP often requires a good balance of model scaling capacity, flexibility, and efficiency among devices. Most pipeline parallelism (PP) methods can automatically partition models into balanced stages on top of MP [15], [25], [72]. The partition pattern of PP is the same as that of MP; this is why some emerging popular frameworks like MindSpore [57] and OneFlow [31], as well as researchers, name inter-layer model parallelism as pipeline parallelism directly. The only difference between PP and MP is that PP is well-scheduled, pipelined MP, which can overlap the computation of different batches. PP was proposed to solve the low-utilization problem of MP. With the development of TP, researchers tend to replace MP/PP with TP in most cases. However, PP is still effective in training large-scale models because it introduces less communication than DP and TP, and it is beneficial for enlarging the batch size.

Typical works on pipeline parallelism include the asynchronous pipeline PipeDream [15] from Microsoft and the synchronous pipeline GPipe [72] from Google. Synchronicity here represents the matching of weight versions between forward and backward propagation, which guarantees convergence. From these two works, researchers have proposed many varieties. Fig. 7 shows the scheduling details of the pipeline varieties mentioned below.

1) Asynchronous Pipeline Parallelism: The most representative work of asynchronous pipeline parallelism is PipeDream. PipeDream pursues higher throughput and utilization of devices. It automatically partitions layers into load-balanced stages which have similar computation times. Each device on a different stage processes a different micro-batch of data simultaneously, avoiding the data dependence problem. PipeDream eliminates bubbles in the pipeline by storing multiple versions of the parameters. However, this brings extra memory costs and a staleness problem due to its asynchronous updates, resulting in a possible convergence problem. PipeDream-2BW [73] optimizes the memory usage of PipeDream, needing only two buffers to store the generated weights of different versions.

Although the experiments in PipeDream and PipeDream-2BW show that using asynchronous pipeline schedules does not hurt convergence, we still need to be cautious about the convergence problem that may occur due to delayed updates in such schedules [74]–[77]. Currently, the PipeDream-based pipeline schedule has been widely used in Megatron-LM [78]
(i.e., PipeDream-1F1B) and in the Microsoft Fiddle project's recent works [55], [58].

[Fig. 7: scheduling diagrams of different pipeline parallelism methods; panels include (b) GPipe and (c) DAPPLE (1F1B). Details omitted.]

2) Synchronous Pipeline Parallelism: GPipe was first designed for inter-layer MP, as shown in Fig. 7(b). GPipe first partitions the model into N stages and puts the n-th stage on the n-th accelerator. Based on this partition, GPipe divides the training batch B into M micro-batches, which are executed on the N accelerators in a pipelined form. Unlike asynchronous PP methods, in GPipe the gradients of each micro-batch are computed during back-propagation with the same model parameters used in the forward pass. Finally, the gradients from all M micro-batches are accumulated and used to update the model parameters of this mini-batch across all devices. The schedule of GPipe requires devices to store all the intermediate activation values before the execution of backward propagation. This storage can become a memory bottleneck in training big models. To address this problem, DAPPLE [21] modifies the order of backward propagation in GPipe to reduce the peak memory usage, and it is also faster than GPipe.

Absorbing the ideas from GPipe, DAPPLE, and GEMS [79], Chimera [35] proposes a bidirectional pipeline that further reduces the bubble ratio of the pipeline. However, the memory cost of storing the weight matrix is twice that of GPipe and DAPPLE. Furthermore, the redundancy requires synchronization of the model weights, which increases the communication volume. However, if we treat Chimera naturally as a PP method with a DP degree of 2 and compare it with DAPPLE with a DP degree of 2, which also holds a replica of the model weights, we find that Chimera has a lower bubble ratio and is thus more effective.

There exist bubbles in synchronous PP methods due to the synchronization before updating parameters, which may lead to insufficient utilization of devices. However, the convergence of synchronous PP is guaranteed, since it is mathematically equivalent to the vanilla training process.

3) Discussions and Comparisons of Pipeline Parallelism Methods: In this subsection, we discuss the communication of PP and compare the intermediate activation memory, parameter weight memory, and bubble ratio of different PP methods.

Communication volume. All of the above PP methods except Chimera have the same communication volume for a given batch size, micro-batch size, and stage number. The communication form of PP is the same as that of MP. As mentioned in Section IV-B, the communication amount of MP is Bw_out for each stage (except for the last stage), where w_out here is the column size of the last weight matrix of the stage. Suppose that in PP the batch size of the data is B and a batch is divided into M micro-batches. A micro-batch therefore results in Bw_out/M communication volume in a stage, and the total volume of a batch in this stage sums up to Bw_out. Suppose we now have S stages and the output dimension of stage i is W_i; the total communication volume of an iteration is then $\sum_{i=0}^{S-2} B W_i$, which is the same as MP.
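The statement above can be checked with a few lines of Python. This is a simplified model that counts only forward activation traffic; the symbols follow the text, and the stage widths are hypothetical.

```python
# Per-iteration activation traffic of a pipeline, following the formula above:
# every stage i < S-1 sends a (B, W_i) activation to its successor, whether the
# batch is sent whole or as M micro-batches of size B/M.
def pp_comm_volume(B, stage_out_dims, M=1):
    per_micro = sum(B // M * w for w in stage_out_dims[:-1])   # last stage sends nothing
    return M * per_micro

dims = [1024, 1024, 4096, 1024]          # hypothetical output widths of S = 4 stages
assert pp_comm_volume(256, dims, M=1) == pp_comm_volume(256, dims, M=8)
```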
TABLE VI: Comparison of different PP methods.

PP method          | Bubble Ratio    | Weights Memory | Activations Memory | Synchronous or not | Convergence
PipeDream [15]     | ≈ 0             | [Mw, SMw]      | [Ma, SMa]          | No                 | Unstable
PipeDream-2BW [73] | ≈ 0             | 2Mw            | [Ma, SMa]          | No                 | Unstable
GPipe [72]         | (S−1)/(N+S−1)   | Mw             | N Ma               | Yes                | Stable
DAPPLE [21]        | (S−1)/(N+S−1)   | Mw             | [Ma, SMa]          | Yes                | Stable
Chimera [35]       | (S−2)/(2N+S−2)  | 2Mw            | [(S/2+1)Ma, SMa]   | Yes                | Stable
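The bubble-ratio formulas in Table VI can be evaluated directly. The following small Python sketch compares GPipe/DAPPLE with Chimera for an illustrative configuration (N micro-batches, S stages):

```python
def bubble_gpipe_dapple(N, S):
    """Bubble ratio of GPipe and DAPPLE from Table VI: (S-1)/(N+S-1)."""
    return (S - 1) / (N + S - 1)

def bubble_chimera(N, S):
    """Bubble ratio of Chimera from Table VI: (S-2)/(2N+S-2)."""
    return (S - 2) / (2 * N + S - 2)

# With 4 stages and 8 micro-batches, Chimera roughly halves the bubble ratio.
print(bubble_gpipe_dapple(8, 4))   # ~0.273
print(bubble_chimera(8, 4))        # ~0.111
```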
each worker holds the same parameters and then computes different parts of the inputs. The inputs are chunked along the sequence-length axis. By transmitting the computing results of each device, it finally outputs the complete results. Sequence parallelism can support a larger batch size in training than TP and, meanwhile, has better throughput.

V. STRATEGY SEARCHING METHODS FOR AUTO-PARALLELISM

As mentioned before, strategy searching is the key to auto-parallelism and, at the same time, an NP-hard problem. Researchers have proposed many methods [14], [15], [20]–[26], [30]–[32], [34]–[36], [43], [55]–[59], [73], [83]–[88] for auto-parallelism to find a near-optimal strategy. We divide existing strategy searching methods into two categories: classic-algorithm-based methods and machine-learning-based methods. Classic-algorithm-based methods include the recursive algorithm [89], the dynamic programming algorithm [54], the integer linear programming algorithm [90], as well as the breadth-first-search (BFS) algorithm [91]. We summarize the following analysis in Tab. VII. Machine-learning-based methods include methods like Monte-Carlo Markov Chain (MCMC) [92] and Monte-Carlo Tree Search (MCTS) [93], which help search strategies, and reinforcement learning [94], which helps predict strategies for each operator, and so on.

A. Machine-Learning-Based Methods

1) Reinforcement-Learning-Based Methods: We first start with reinforcement-learning-based methods. ColocRL [24] is the first work that uses reinforcement learning to do auto-parallelism. It uses an attentional sequence-to-sequence model, trained with the Adam optimizer on policy gradients computed via the REINFORCE equation [95], to predict the placements of operators. However, it is a coarse-grained method that only does model parallelism, and it is too expensive for the recurrent neural network (RNN) [96] policy to learn when the number of operations is enormous: it took 27 hours over a cluster of 160 workers to find a placement that outperforms an existing heuristic. Moreover, the standard policy gradient method is known to be inefficient, as it performs one gradient update for each data sample [97]. The authors of ColocRL then propose a long short-term memory (LSTM) [98] reinforcement-learning-based hierarchical device placement strategy (HDP) [84], which can support neural networks that have tens of thousands of operations. Spotlight [86] models the problem as a multi-stage Markov decision process (MDP) [99] and applies proximal policy optimization to a two-layer sequence-to-sequence RNN with LSTM cells and a content-based attention mechanism. However, HDP and Spotlight both rely too much on LSTM controllers that are hard to train (i.e., Spotlight takes 9 hours on five worker machines to find a better placement than ColocRL). Moreover, LSTMs perform poorly at capturing long-distance dependencies over large computation graphs. To alleviate this problem, Placeto [85] and GDP [83] use a Graph Neural Network (GNN) [100] to build embedding information for the nodes in the computation graph G. Placeto models the problem as an MDP, relies on hierarchical grouping, and only generates a placement for one operator at each time step. Instead, GDP pre-trains and fine-tunes a Transformer-based [82] attentive network to generate whole-graph operator placements at once and is 16.7x faster than HDP when finding strategies for an 8-layer Transformer model. In addition, GDP can support partitions for large held-out graphs with over 50k nodes. REGAL (Reinforced Genetic Algorithm Learning) [88] uses a GNN policy to predict node-specific non-uniform proposal distribution choices, which are parameterized as beta distributions over [0, 1]. REGAL then uses a biased random-key genetic algorithm (BRKGA) [101] to run with those choices and outputs the best solution found within its iteration limit. REGAL can generalize to a broad set of previously unseen computation graphs, which saves a lot of training time, and it can produce an MP strategy for a graph with 1k nodes in only a few seconds. However, REGAL only considers peak memory minimization, while GDP focuses on model throughput and scalability. HeterPS [28] applies a parameter-server architecture on CPUs and a ring-allreduce architecture on GPUs/XPUs to fully exploit heterogeneous computing devices. It uses a reinforcement-learning-based LSTM model to predict the device type for each layer of Click-Through Rate (CTR) models. Their experiments show that HeterPS finds the best strategy exponentially faster than brute-force search (i.e., within 10 seconds) as the number of device types grows, and that the generated schedule plan on heterogeneous computing resources has higher throughput than on homogeneous computing resources.

The above reinforcement learning methods all focus on DP and MP. We will discuss methods related to TP or PP below. TAPP (task allocation in pipeline parallelism) focuses on partitioning the model into several stages with reinforcement learning. It uses a feed-forward neural network (FFN) whose last layer is a Softmax operator that predicts the stage number for each layer. It then uses a reinforcement-learning attention-based sequence-to-sequence model to predict which device a stage should be placed on. Auto-MAP [32] from Alibaba leverages Deep Q-Networks (DQN) [102] with task-specific
pruning strategies to help efficiently explore the search space of DP, TP, or PP over XLA High Level Operations (HLO), with the device and network interconnect topology specified. They choose the HLO Intermediate Representation (IR) [103] produced by Accelerated Linear Algebra (XLA) [104] from TensorFlow [105] as the operational level of Auto-MAP, because exploring distributed plans on HLO IR can achieve better performance thanks to its finer granularity compared with operators. Moreover, the IR is an expression of the computation graph, which fits our problem definition. Auto-MAP sets rewards, states, and actions for all three of DP, TP, and PP to instruct the DQN to search strategies. Given a cluster of 4 servers with 8 V100 GPUs, Auto-MAP can search TP strategies for the 11-billion-parameter T5 [106] model within 1.5 hours, DP strategies within 17 minutes, and PP strategies within 280 seconds. However, Auto-MAP can currently only give a single parallelism strategy automatically, which may result in sub-optimal runtime performance in large-scale distributed training. The authors are considering supporting a hybrid of these strategies in the future.

2) Other Methods: FlexFlow proposes four possible parallelizable dimensions based on OptCNN [14]: SOAP, which stands for sample, operation, attribute, and parameter, respectively. Among SOAP, sample-dimension division corresponds to DP, operation-dimension division corresponds to MP, attribute-dimension division corresponds to each attribute dimension of the input tensor (such as height and width), and parameter-dimension division corresponds to TP. The partitioning of attributes and parameters corresponds to model parallelism. Unlike OptCNN, which only supports linear models like AlexNet [12], FlexFlow can parallelize all kinds of computation graphs. They use a random MCMC algorithm to find the optimal partition configuration and determine the appropriate parallelism strategy for each operator in a neural network. However, the MCMC algorithm tries to enumerate strategies in the search space randomly, which results in an unacceptable time for finding the optimal solution for large-scale models: it requires 37 minutes to search strategies for the NMT [107] model on 16 servers with 4 P100 GPUs. To support the partitioning of GNN models, the author of FlexFlow, Zhihao Jia, implements ROC [34] on top of FlexFlow. They design a cost model that can predict the execution time of a GNN on an arbitrary graph and then use an online linear regression model to learn the cost model. The learned cost model enables the graph partitioner to discover balanced and effective partitioning for GNN training and inference. In addition, ROC uses a dynamic programming algorithm to minimize communication costs between devices. Automap [33] from DeepMind operates on MHLO, which is an MLIR [108] encoding of XLA HLO. It applies a search-and-learning method to annotate Megatron-like [18], [78] strategies for all operators. They implement MCTS to help search and propagate strategies when traversing the program. To reduce the search space and improve the quality of strategy propagation, Automap uses a learned interaction network [109] to compute per-node relevance scores, and the top-k nodes are considered first in the search space. Automap shows that using MCTS with a learned filter can find strategies similar to Megatron.

B. Classic-Algorithm-Based Methods

OptCNN proposes the layer-wise parallelism strategy, which is an auto-parallelism solution under the parameter-server architecture [63]. OptCNN can only handle TP partitioning. Taking the computer vision task as an example, OptCNN considers the input dimensions of every layer in the model, including batch size, width, height, and the number of channels. All dimensions can be divided across various devices. They first build a computation graph G of the model and a device graph D of the cluster, and then build a cost model to estimate the cost under any TP strategy. Using a dynamic programming graph search algorithm, OptCNN can determine the combination of partition dimensions for every layer. However, OptCNN can only solve computation graphs with a linear structure. To address this problem, Tofu [43] coarsens the computation graph G by grouping some nodes in V to make it linear, after which it uses the dynamic programming algorithm in OptCNN to produce strategies. What is more, Tofu considers only communication cost to reduce the search space, under the observation that different strategies of an operator like matrix multiplication have the same arithmetic complexity. Tofu uses the dynamic programming algorithm in OptCNN and applies some techniques to make it more practical. In addition to the coarsening technique, Tofu accelerates the dynamic programming algorithm by applying it recursively. Compared to dynamic programming with coarsening, which takes 8 hours to get the best strategy set P for 8 workers, using recursion to search a strategy for WResNet-152 [110] takes only 8.3 seconds.

Instead of coarsening, TensorOpt [26] extends the dynamic programming algorithm in OptCNN and names it FT-Elimination (Frontier Tracking Elimination) to make it executable on a computation graph with a non-linear structure. However, the runtime is not efficient enough, so TensorOpt also tries to group operators and applies an FT-LDP (Frontier Tracking Linear Dynamic Programming) algorithm to help reduce the time complexity. In addition, the FT-LDP algorithm can be parallelized by multi-threading to compute the costs of different parallelization strategies efficiently. For WResNet, FT-LDP with multi-threading can find the best strategy in 22 minutes, while FT-Elimination needs 5.5 hours. Though the search time is longer than Tofu's, the throughput of TensorOpt's generated strategy is much higher than Tofu's, since Tofu tries to search for strategies that use less memory, and these memory-saving strategies may have lower throughput.

PipeDream [15] provides auto-parallelism solutions that support asynchronous pipeline training. In order to accurately obtain the execution time of each layer, PipeDream first profiles the model that needs to be partitioned to obtain the execution time, activation size, and model parameter size of each layer. Then, according to the obtained results, they create a profiling-based cost model and design a dynamic programming algorithm that divides the pipeline stages and determines the DP degree of each stage to meet the load-balance need. Based on PipeDream, PipeDream-2BW [73] optimizes the memory consumption by applying activation recomputation and reducing the number of parameter weight buffers that store different versions of computed gradients to 2. To accelerate the
partition, PipeDream-2BW exploits the repetitive structure of models (e.g., the transformer layers in BERT) by grouping them and only considering configurations where all model stages are replicated an equal number of times. However, PipeDream only supports linear graphs. To address this problem, researchers from Fiddle propose dnn-partitioning [59], which extends the dynamic programming algorithm in PipeDream to support partitioning for arbitrary DAGs, and also proposes an integer programming solution to the partition problem. However, like PipeDream and PipeDream-2BW, these methods do not consider tensor parallelism.

Also from project Fiddle, Piper [55] uses a two-level dynamic programming approach to search DP, TP, and PP strategies. The outer dynamic programming algorithm generates hundreds of NP-hard knapsack sub-problems, each of which calculates the throughput of a sub-graph under given hyper-parameters. Piper uses a bang-per-buck heuristic to accelerate the solving of the generated knapsack sub-problems, reducing the computation complexity. The computation complexity of the Piper algorithm is O(|V|^2 N |V_D|^2), where |V| is the number of vertices in the computing graph, |V_D| is the number of devices, and N ≤ |V_D| is the maximum sum of DP degrees. Piper can partition a 64-layer BERT [111] on 2048 devices within only 2 hours, which is a relatively short time compared to its training time. However, the current implementation of the algorithm is serial and inefficient. A potential advantage of Piper is that some of its procedures can be executed in parallel, which could scale the runtime of the algorithm linearly on a multi-core CPU server and further reduce the searching time.

The Double Recursion Algorithm (D-Rec) [22] uses the observation that DP and TP have the same communication cost per worker, and thus considers only communication cost when generating DP and TP partitions and their combinations. D-Rec builds its cost model statically: it asymptotically analyzes the communication cost based on the shape of the tensor and the type of the operator using the formulations in Tables V and IV. Based on this analysis, D-Rec automatically determines strategies for each operator with linear complexity in a short time (28 seconds to search strategies for a 24-layer BERT on 8 devices). MindSpore implements D-Rec as one choice of its strategy searching algorithms due to its speed advantage. However, ignoring the computation analysis limits D-Rec from supporting heterogeneous clusters and PP in the future.

PaSE [56] also uses a static analysis-based cost model to generate DP and TP strategies and combinations of them. It makes good use of the sparsity of the computing graph to form a DP-based strategy searching algorithm, whose overall computational complexity is O(|V|^2 K^(M+1)), where |V| is the total number of vertices v_i ∈ V, K is the maximum number of configurations of an operator (vertex), and M is the size of the largest dependent set (i.e., the difference set of computing sub-graphs from two iterations). According to their experiments, PaSE can generate strategies for a Transformer NMT model on 16 devices and 64 devices in 2.2 minutes and 31.4 minutes, respectively. The throughput of the generated strategies outperforms Mesh-TensorFlow [69]. PaSE can be applied on heterogeneous clusters, but currently it does not include heterogeneity in the cost model, which may result in unbalanced partitions. What is more, PaSE is not good at handling graphs G whose |E| is tremendous, as M may become significantly large; the runtime overhead of solving strategies for models like DenseNet [112] is unacceptable since their computing graphs are uniformly dense. Both D-Rec and PaSE use static analysis-based cost models to generate strategies. However, an asymptotic analysis may not be as accurate as profiling, which may result in some performance deterioration, because static analysis usually cannot capture low-level details like cache effects and the overlapping of computation and communication, which may be necessary for analyzing execution time.

AccPar [23] analyzes the intra-layer and inter-layer communication cost for all situations between DP and 1D-TP and applies DP and 1D-TP partitions to the model. It simplifies the DAG partition problem by deciding strategies layer by layer using dynamic programming, whose arithmetic complexity is O(|V|). By introducing a partition ratio, AccPar can support heterogeneous clusters. Its experiments show that the performance of AccPar outperforms OWT and HyPar [44], which only do DP and Row-TP on homogeneous clusters.

DistIR [58] is an efficient IR for the explicit representation of distributed DNN computation. It uses a linear regression model to simulate the cost of operators (e.g., MatMul and AllReduce), and the simulated throughput has a strong correlation with the measured throughput for both MLP training and GPT-2 inference across all model sizes. DistIR uses a simple grid search to find the minimum-cost strategy consisting of DP, Row-TP, and 1F1B-PP. Although DistIR is very efficient in finding the best strategy in its search space, the found strategy may be too coarse compared to others.

Alpa [36] uses a two-level hierarchical planning algorithm to search strategies and is the first auto-parallelism method that supports DP, 1D-TP, and 1F1B-PP as well as ZeRO-DP. Alpa works on arbitrary DAGs. It formalizes the intra-operator parallelism problem as integer linear programming (ILP) and the inter-operator parallelism problem as dynamic programming. The dynamic programming algorithm is built on top of that in TeraPipe [39], but additionally considers device mesh slicing. During the dynamic programming calculation for finding the best inter-operator parallelism strategy, Alpa uses ILP to find the best DP and TP strategy for each stage (i.e., sub-graph). However, the overall complexity of this algorithm is O(|V|^5 |V_D| (|V_D|/d + log d)^2), where d is the number of device nodes in the cluster. To reduce the complexity, Alpa uses early pruning to shrink the search space and uses another dynamic programming algorithm to group operators, whose computation complexity is O(|V|^2 L), where L represents the number of layers after grouping. Alpa can find the best strategy within 40 minutes for GPT-39B on 64 GPU devices. However, considering the complexity of this method, searching strategies for GPT-39B on 2048 GPU devices may require thousands of hours, which shows poor scalability. Still, Alpa is adequate for searching near-optimal strategies for small models.

Some works use the breadth-first search (BFS) algorithm to propagate strategies. BFS-based algorithms require users to annotate parallelism strategies for some tensors or
operators, after which the deep learning framework will automatically propagate strategies based on preset rules. GSPMD [29] is the first work to do this. It proposes sharding propagation, and the corresponding algorithm has been integrated into TensorFlow's XLA compiler [105]. It uses a priority-queue-based heuristic method to arrange the parallelism strategies of the remaining operators in the compute graph. More specifically, it gives the element-wise operators top priority when propagating strategies.

Inspired by GSPMD, frameworks like MindSpore [57], OneFlow [31], and PaddlePaddle [87] absorb sharding propagation and create their own semi-auto-parallelism methods. Currently, they all use the BFS algorithm for propagating annotations. Among them, OneFlow shows that using split-broadcast-partial (SBP) parallelism and an actor-based runtime can further accelerate the training of large-scale models.

VI. CONCLUSIONS AND DISCUSSIONS

Large-scale models are becoming increasingly important in industry and academia, and they also greatly promote the development of scalable distributed training systems that involve auto-parallelism methods. In this survey, we took a deeper look into distributed training from the perspective of auto-parallelism. We investigated the main challenges in making auto-parallelism methods more practical and reviewed existing methods that tackle those challenges. We gave a detailed analysis of the foundations of auto-parallelism, including the problem definition and the parallelism strategies. Finally, we provided an overview and analysis of the existing auto-parallelism methods.

Looking into the future, we suggest a few trends that may be important in the following years: acceleration of strategy searching, optimization of found strategies, and combinations of more parallelism schemes.

A. Accelerating Strategy Searching

1) Grouping: There are two ways of grouping to accelerate searching. The first is to apply the same partitions to modules with the same architecture. The second is to group some operators to form a layer and apply partitions to it.

The first way of grouping is based on the fact that many neural networks have regular structures. For example, ResNet consists of many residual blocks, and BERT consists of many transformer layers. Researchers have found that exploiting the regularity of models can help accelerate the automatic partitioning of computation graphs, because modules with the same architecture often have the same parallelism strategies. Thus, we only need to find a strategy for one layer and then broadcast it to the others, which can reduce the time to a sub-linear degree.

The second way is designed for models that do not have such repeated architecture, but it can also be used together with the first way. By grouping operators in the second way, we can transform a non-linear computation graph into a linear one [43] and thus extend the usability of some algorithms like OptCNN. Moreover, we can accelerate searching by simultaneously deciding strategies for several operators (e.g., a MatMul and its succeeding ReLU).

We recommend using grouping as a heuristic method to help reduce the runtime of auto-parallelism methods. We could control the size of the groups and the method used to generate them to explore the influence and effectiveness that grouping brings.

2) Profiling-based Cost Model: As mentioned in Section III, although using a symbolic cost model is very fast for evaluating strategies, it is unable to tell the difference between different devices, and it ignores many optimizations like caching and the overlap between computation and communication. Furthermore, profiling is too time-costly to evaluate every strategy for a large-scale model. We recommend using a profiling-based cost model, which holds the actual runtime of an operation on a specific device and can be further fine-tuned to gain better performance (e.g., by applying a linear regression model).

3) Using Heuristics: Heuristics help reduce the search space while keeping a good enough output. For example, Alpa uses early pruning to ignore strategies with costs over a threshold; Piper uses a greedy heuristic to solve the knapsack problem.

B. Optimizing Parallelism Strategies

Given a specific device topology, an auto-parallelism method should optimize the parallelism strategies by organizing computation among devices and designing a good communication pace and pattern.

1) Topology-aware Computation: Only a few existing auto-parallelism methods handle topology-aware computation, especially on heterogeneous clusters. AccPar distributes computation tasks according to devices' computation capacity; DeepSpeed and PaddlePaddle let the CPU participate in part of the computation to alleviate the pressure on the GPU. Although much DL training is deployed on homogeneous clusters, we suggest developing auto-parallelism methods that support heterogeneous partitions.

2) Topology-aware Communication: Auto-parallelism strategy searching methods need to consider topology-aware communication strategies to further reduce communication time and increase throughput. BytePS proposes that using more CPUs as parameter servers can reduce the communication amount of synchronizing parameters. However, most current auto-parallelism methods fail to be aware of this possible option. [47] and [48] propose ways to reduce intra-node and inter-node communication. We suggest involving their work in generating new strategies.

C. Supporting More Parallelism Schemes

Emerging methods, including multidimensional TP [16], [17], [41], TeraPipe [39], and sequence-level parallelism [38], as well as ZeRO [37], can bring huge enhancements in training large-scale models. However, almost no auto-parallelism methods consider the above strategies in their implementation. We expect new algorithms that make the most of these emerging parallelism methods.
TABLE VII: Comparison of different strategy searching methods for auto-parallelism.
[40] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, "Gshard: Scaling giant models with conditional computation and automatic sharding," arXiv preprint arXiv:2006.16668, 2020.
[41] B. Wang, Q. Xu, Z. Bian, and Y. You, "2.5-dimensional distributed model training," CoRR, vol. abs/2105.14500, 2021. [Online]. Available: https://arxiv.org/abs/2105.14500
[42] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, "A survey on distributed machine learning," ACM Computing Surveys (CSUR), vol. 53, no. 2, pp. 1–33, 2020.
[43] M. Wang, C.-c. Huang, and J. Li, "Supporting very large models using automatic dataflow graph partitioning," in Proceedings of the Fourteenth EuroSys Conference 2019, 2019, pp. 1–17.
[44] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "Hypar: Towards hybrid parallelism for deep learning accelerator array," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 56–68.
[45] Y. Ueno and R. Yokota, "Exhaustive study of hierarchical allreduce patterns for large messages between gpus," in 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 2019, pp. 430–439.
[46] M. Cho, U. Finkler, D. Kung, and H. Hunter, "Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy," Proceedings of Machine Learning and Systems, vol. 1, pp. 241–251, 2019.
[47] N. Xie, T. Norman, D. Grewe, and D. Vytiniotis, "Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning," CoRR, vol. abs/2110.10548, 2021. [Online]. Available: https://arxiv.org/abs/2110.10548
[48] N. A. Rink, A. Paszke, D. Vytiniotis, and G. S. Schmid, "Memory-efficient array redistribution through portable collective communication," 2021.
[49] K. Kennedy and U. Kremer, "Automatic data layout for distributed-memory machines," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 20, no. 4, pp. 869–916, 1998.
[50] J. L. M. Chen and J. Li, "Index domain alignment: Minimizing cost of cross-referencing between distributed arrays," 1989.
[51] U. Kremer, "Np-completeness of dynamic remapping," in Proceedings of the Fourth Workshop on Compilers for Parallel Computers, Delft, The Netherlands, 1993.
[52] J. Li and M. Chen, "The data alignment phase in compiling programs for distributed-memory machines," Journal of Parallel and Distributed Computing, vol. 13, no. 2, pp. 213–221, 1991.
[53] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.
[54] R. Bellman, "Dynamic programming," Science, vol. 153, no. 3731, pp. 34–37, 1966.
[55] J. Tarnawski, D. Narayanan, and A. Phanishayee, "Piper: Multidimensional planner for dnn parallelization," in NeurIPS 2021, December 2021. [Online]. Available: https://www.microsoft.com/en-us/research/publication/piper-multidimensional-planner-for-dnn-parallelization/
[56] V. Elango, "Pase: Parallelization strategies for efficient dnn training," in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 1025–1034.
[57] Huawei, "Mindspore," https://www.mindspore.cn/en, 2020.
[58] K. Santhanam, S. Krishna, R. Tomioka, T. Harris, and M. Zaharia, "Distir: An intermediate representation and simulator for efficient neural network distribution," CoRR, vol. abs/2111.05426, 2021. [Online]. Available: https://arxiv.org/abs/2111.05426
[59] J. Tarnawski, A. Phanishayee, N. R. Devanur, D. Mahajan, and F. N. Paravecino, "Efficient algorithms for device placement of DNN graph operators," CoRR, vol. abs/2006.16423, 2020. [Online]. Available: https://arxiv.org/abs/2006.16423
[60] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, "Pytorch distributed: Experiences on accelerating data parallel training," CoRR, vol. abs/2006.15704, 2020. [Online]. Available: https://arxiv.org/abs/2006.15704
[61] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, "Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning," arXiv preprint arXiv:2104.07857, 2021.
[62] J. W. Rae, S. Borgeaud, T. Cai, K. Millican et al., "Scaling language models: Methods, analysis & insights from training gopher," CoRR, vol. abs/2112.11446, 2021. [Online]. Available: https://arxiv.org/abs/2112.11446
[63] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, "Scaling distributed machine learning with the parameter server," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Conference Proceedings, pp. 583–598.
[64] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo, "A unified architecture for accelerating distributed DNN training in heterogeneous gpu/cpu clusters," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Conference Proceedings, pp. 463–479.
[65] A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in tensorflow," arXiv preprint arXiv:1802.05799, 2018.
[66] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of collective communication operations in mpich," The International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005.
[67] P. Patarasuk and X. Yuan, "Bandwidth optimal all-reduce algorithms for clusters of workstations," Journal of Parallel and Distributed Computing, vol. 69, no. 2, pp. 117–124, 2009.
[68] A. Gibiansky, "Bringing hpc techniques to deep learning," Baidu Research, Tech. Rep., 2017.
[69] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, and C. Young, "Mesh-tensorflow: Deep learning for supercomputers," arXiv preprint arXiv:1811.02084, 2018.
[70] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang, F. Cui, and Y. You, "Colossal-ai: A unified deep learning system for large-scale parallel training," CoRR, vol. abs/2110.14883, 2021. [Online]. Available: https://arxiv.org/abs/2110.14883
[71] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv preprint arXiv:1404.5997, 2014.
[72] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., "Gpipe: Efficient training of giant neural networks using pipeline parallelism," arXiv preprint arXiv:1811.06965, 2018.
[73] D. Narayanan, A. Phanishayee, K. Shi, X. Chen, and M. Zaharia, "Memory-efficient pipeline-parallel dnn training," arXiv preprint arXiv:2006.09503, 2020.
[74] M. Assran, N. Loizou, N. Ballas, and M. G. Rabbat, "Stochastic gradient push for distributed deep learning," CoRR, vol. abs/1811.10792, 2018. [Online]. Available: http://arxiv.org/abs/1811.10792
[75] X. Lian, W. Zhang, C. Zhang, and J. Liu, "Asynchronous decentralized parallel stochastic gradient descent," 2018.
[76] G. Nadiradze, A. Sabour, D. Alistarh, A. Sharma, I. Markov, and V. Aksenov, "Swarmsgd: Scalable decentralized sgd with local updates," arXiv: Learning, 2020.
[77] Z. Tang, S. Shi, X. Chu, W. Wang, and B. Li, "Communication-efficient distributed deep learning: A comprehensive survey," 2020.
[78] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, and B. Catanzaro, "Efficient large-scale language model training on gpu clusters," arXiv preprint arXiv:2104.04473, 2021.
[79] A. Jain, A. A. Awan, A. M. Aljuhani, J. M. Hashmi, Q. G. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, "Gems: Gpu-enabled memory-aware model-parallelism system for distributed dnn training," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–15.
[80] A. Griewank and A. Walther, "Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation," ACM Trans. Math. Softw., vol. 26, no. 1, pp. 19–45, Mar. 2000. [Online]. Available: https://doi.org/10.1145/347837.347846
[81] T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," arXiv preprint arXiv:1604.06174, 2016.
[82] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[83] Y. Zhou, S. Roy, A. Abdolrashidi, D. L. Wong, P. C. Ma, Q. Xu, M. Zhong, H. Liu, A. Goldie, A. Mirhoseini, and J. Laudon, "GDP: generalized device placement for dataflow graphs," CoRR, vol. abs/1910.01578, 2019. [Online]. Available: http://arxiv.org/abs/1910.01578
[84] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean, "A hierarchical model for device placement," in International Conference on Learning Representations, 2018.
[85] R. Addanki, S. B. Venkatakrishnan, S. Gupta, H. Mao, and M. Alizadeh, "Placeto: Learning generalizable device placement algorithms for distributed machine learning," CoRR, vol. abs/1906.08879, 2019. [Online]. Available: http://arxiv.org/abs/1906.08879
[86] Y. Gao, L. Chen, and B. Li, "Spotlight: Optimizing device placement for training deep neural networks," in International Conference on Machine Learning. PMLR, 2018, pp. 1676–1684.
[87] Y. Ao, Z. Wu, D. Yu, W. Gong, Z. Kui, M. Zhang, Z. Ye, L. Shen, Y. Ma, T. Wu et al., "End-to-end adaptive distributed training on paddlepaddle," arXiv preprint arXiv:2112.02752, 2021.
[88] A. Paliwal, F. Gimeno, V. Nair, Y. Li, M. Lubin, P. Kohli, and O. Vinyals, "Reinforced genetic algorithm learning for optimizing computation graphs," 2019.
[89] E. W. Dijkstra, "Recursive programming," Numerische Mathematik, vol. 2, no. 1, pp. 312–318, 1960.
[90] A. Schrijver, Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[91] L. Luo, M. Wong, and W. Hwu, "An effective gpu implementation of breadth-first search," in Design Automation Conference, 2010.
[92] W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. CRC Press, 1995.
[93] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of monte carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
[94] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[95] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[96] B. Hammer, "Learning with recurrent neural networks," Assembly Automation, 1980.
[97] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[98] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[99] F. P. Miller, A. F. Vandome, and J. Mcbrewster, "Markov decision process," Springer London, 1985.
[100] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," CoRR, vol. abs/1706.02216, 2017. [Online]. Available: http://arxiv.org/abs/1706.02216
[101] J. Gonçalves and M. Resende, "Biased random-key genetic algorithms for combinatorial optimization," Journal of Heuristics, vol. 17, no. 5, pp. 487–525, 2011.
[102] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
[103] M. Li, Y. Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, "The deep learning compiler: A comprehensive survey," 2020.
[104] A. Sabne, "Xla: Compiling machine learning for peak performance," 2020.
[105] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[106] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," CoRR, vol. abs/1910.10683, 2019. [Online]. Available: http://arxiv.org/abs/1910.10683
[107] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[108] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. A. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, "Mlir: Scaling compiler infrastructure for domain specific computation," in CGO 2021, 2021.
[109] P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu, "Interaction networks for learning about objects, relations and physics," 2016.
[110] S. Zagoruyko and N. Komodakis, "Wide residual networks," CoRR, vol. abs/1605.07146, 2016. [Online]. Available: http://arxiv.org/abs/1605.07146
[111] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423
[112] G. Huang, Z. Liu, and K. Q. Weinberger, "Densely connected convolutional networks," CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
[113] U. U. Hafeez, X. Sun, A. Gandhi, and Z. Liu, "Towards optimal placement and scheduling of dnn operations with pesto," in Proceedings of the 22nd International Middleware Conference, ser. Middleware '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 39–51. [Online]. Available: https://doi.org/10.1145/3464298.3476132
[114] S. Zhao, F. Li, X. Chen, X. Guan, J. Jiang, D. Huang, Y. Qing, S. Wang, P. Wang, G. Zhang et al., "vPipe: A virtualized acceleration system for achieving efficient and scalable pipeline parallel dnn training," IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 3, pp. 489–506, 2021.