
TorchBench: Benchmarking PyTorch with High API Surface Coverage

Yueming Hao (yhao24@ncsu.edu), North Carolina State University, Raleigh, North Carolina, USA
Xu Zhao (xzhao9@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
Bin Bao (binbao@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
David Berard (dberard@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
Will Constable (whc@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
Adnan Aziz (adnanaziz@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
Xu Liu (xliu88@ncsu.edu), North Carolina State University, Raleigh, North Carolina, USA

arXiv:2304.14226v3 [cs.LG] 24 Jun 2023

ABSTRACT

Deep learning (DL) has been a revolutionary technique in various domains. To facilitate model development and deployment, many deep learning frameworks have been proposed, among which PyTorch is one of the most popular solutions. The performance of the ecosystem around PyTorch is critically important: it saves the cost of training models and reduces the response time of model inference. In this paper, we propose TorchBench, a novel benchmark suite to study the performance of the PyTorch software stack. Unlike existing benchmark suites, TorchBench encloses many representative models, covering a large PyTorch API surface. TorchBench is able to comprehensively characterize the performance of the PyTorch software stack, guiding performance optimization across models, the PyTorch framework, and GPU libraries. We show two practical use cases of TorchBench. (1) We profile TorchBench to identify GPU performance inefficiencies in PyTorch. We are able to fix many performance bugs and upstream patches to the official PyTorch repository. (2) We integrate TorchBench into the PyTorch continuous integration system. We are able to identify performance regressions in multiple daily code check-ins to prevent the PyTorch repository from introducing performance bugs. TorchBench is open source and keeps evolving.

1 INTRODUCTION

Deep learning (DL) has been the most transformative technology over the past decade, given its wide applicability in global climate projections [1, 2], image processing [3-5], speech recognition [6, 7], content recommendation [8-10], and pattern classification [11, 12], and its great potential in emerging domains such as autonomous driving [13, 14], augmented/virtual reality [15, 16], and the sciences [17, 18]. To facilitate the development and deployment of DL models, DL framework practitioners have proposed a number of DL frameworks, such as Caffe [19], TVM [20], ONNX [21], MXNet [22], JAX [23], PyTorch [24], and TensorFlow [25]. These frameworks provide rich APIs and support performance tuning.

PyTorch is one of the most popular deep learning frameworks. As an open-source project, PyTorch has more than 2,000 contributors and had received more than 61,000 stars by the end of 2022. PyTorch is still actively evolving; it received over 110k commits in 2022 [26]. The share of machine learning research papers based on PyTorch has increased from 29% to 65% in the past four years [27].

One major challenge of deep learning tasks is the extensive amount of computation for both training and inference, which translates to huge time and money costs. For example, the best version of AlphaGo [28] needs weeks to train with 280 GPUs, and its estimated cost is 35 million dollars [29]. In addition, real-time inference of deep learning models on edge devices requires fast response and low latency. Thus, it is critically important for PyTorch to achieve bare-metal performance to boost productivity and reduce costs.

However, understanding and optimizing PyTorch performance is challenging because of PyTorch's complex and rapidly evolving code bases. On the one hand, the PyTorch software stack consists of three major components: the acceleration libraries (e.g., cuDNN [30]), the PyTorch framework [24] (including optimizers and compilers), and deep learning models; performance inefficiencies can exist in or across any of these components. Without a systematic evaluation, inefficiencies can be difficult to identify. On the other hand, as an active open-source project, PyTorch receives many patches for new functionality or performance improvement. Without a thorough evaluation, it is difficult to understand how these patches affect the entire PyTorch software stack for various models in different domains.

Benchmarking is a well-known technique to understand the performance of different workloads, programming languages, compilers, and architectures. For example, SPEC provides many standard benchmark suites for various purposes; Renaissance [31] is a Java benchmark suite; DeathStar [32] is a benchmark suite for microservices; NPB [33] is a benchmark suite for different parallel models in scientific computing; and Rodinia [34] is a GPU benchmark suite.

These benchmarks provide indispensable performance insights for both software and hardware evolution. MLPerf [35] is the state-of-the-art benchmark suite for deep learning workloads. However, existing deep learning benchmark suites, including MLPerf, aim to compare the performance of deep learning models running on different hardware and frameworks. They usually include a small number of deep learning models (e.g., MLPerf has eight models only) and cover a small PyTorch API surface, which fails to identify many PyTorch performance bugs or fairly evaluate the performance impact of patches.

1.1 Motivating Examples

We show two examples to motivate the necessity of a comprehensive benchmark suite for PyTorch.

Misleading performance characterization. Understanding the performance difference across various architectures is one of the major tasks of benchmarking. However, without a comprehensive benchmark suite, such characterization can result in misleading conclusions. For example, some studies [36] show that AMD GPUs outperform NVIDIA GPUs on PyTorch, while other studies [37, 38] reach the opposite conclusion. Thus, a comprehensive benchmark suite is necessary for performance characterization. Unlike existing studies, we show a different conclusion and unique insights in Section 3.3.

Missing performance bugs. Performance bugs can be buried deep in complex PyTorch code bases, especially in cold execution paths. We identified a recent performance bug in PyTorch caused by inappropriate error handling. PyTorch implements an error handling mechanism named c10_Exception to deal with runtime errors. It prints out backtraces for all errors and uses std::string to generate error messages. Since error handling is usually on the cold path, it does not incur any performance degradation in the models of existing benchmark suites. However, we find that this error handling slows down quantized models by 10x. Our further investigation shows that quantized models heavily call the torch.ops API, which frequently throws a benign error, "Not Implemented Functions", resulting in a large error-handling overhead. This issue has been confirmed by the PyTorch team, and c10_Exception has been reverted to the previous implementation. Thus, a benchmark suite that covers a large PyTorch API surface is necessary to expose such performance bugs.
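To see why a benign but frequently thrown error can dominate execution time, consider the following illustrative Python micro-benchmark (our own sketch, not PyTorch code). Raising and catching an exception on every call is far more expensive than an ordinary branch, and the C++ implementation was costlier still because each throw also captured a backtrace and built a std::string message.

    import timeit

    def fallback_via_exception():
        try:
            raise NotImplementedError("Not Implemented Functions")
        except NotImplementedError:
            return 0  # take the fallback path

    def fallback_via_branch():
        implemented = False
        if not implemented:
            return 0  # take the fallback path

    n = 100_000
    print("exception:", timeit.timeit(fallback_via_exception, number=n))
    print("branch:   ", timeit.timeit(fallback_via_branch, number=n))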
1.2 Paper Contribution

In this paper, we develop TorchBench, a novel benchmark suite for PyTorch that addresses the aforementioned challenges. Unlike existing approaches, TorchBench embraces a large number of models, which cover a large PyTorch API surface. Given the voluminous deep learning models, we carefully include representative models in TorchBench for both fairness and generality. Additionally, we develop a set of tools associated with TorchBench that enable it to (1) run benchmarks with different configurations, (2) collect various performance statistics, and (3) be ready for any continuous integration system. TorchBench has helped fix many performance bugs in the PyTorch ecosystem, and many of the fixes have been accepted by the official PyTorch repository. In summary, TorchBench makes the following contributions.

• TorchBench is the first PyTorch benchmark suite that consists of rich models in different domains. It covers 2.3x more PyTorch API surface compared to the state-of-the-art MLPerf benchmark suite.
• TorchBench integrates a set of built-in tools, which configure execution environments and collect performance statistics. TorchBench is able to report multiple metrics to thoroughly characterize PyTorch performance.
• TorchBench demonstrates two use cases in practice. First, TorchBench exposes performance bugs in the PyTorch software stack and provides insights for bug fixing. Second, TorchBench is configured into the continuous integration of the PyTorch repository to catch performance regressions in daily committed patches.

The rest of the paper is organized as follows. Section 2 describes the models, benchmark adaptation, and configurations in our benchmark suite. Section 3 characterizes TorchBench from different aspects, including execution time breakdown, different PyTorch compilers, and different GPU platforms. Section 4 shows the practical usage of TorchBench in detecting performance inefficiencies in the PyTorch software stack. Section 5 reviews related work and distinguishes our approach. Section 6 presents conclusions.

2 TORCHBENCH SUITE

There is a huge number of deep learning models developed by the PyTorch community. For example, the Hugging Face platform has more than 10,000 models [39]. Thus, it is challenging for TorchBench to enclose representative models that cover the performance behaviors of typical PyTorch use cases. We work closely with machine learning engineers and set the following criteria to select models for TorchBench.

• Classic models. TorchBench includes models that were developed many years ago but have proven both useful and impactful, such as ResNet [4], VGG16 [40], and MobileNet [5, 41]. These models serve as the foundation of many state-of-the-art models.
• Popular models. TorchBench includes popular models released in recent years. These models attract extensive attention in the community, enabling many academic research papers and applications. These models include pig2 [42], T5 [43], Yolo [44], and docTR [45].
• Important models in industry. Industrial companies such as Meta and Google release important models that are used in their products. These models include Detectron2 [46] and Bert [47].
• Diverse models. TorchBench includes models from different domains to ensure a fair comparison. Moreover, models with different weight layers and different implementations are included.

In the rest of this section, we describe the benchmarks, set the study scope, and distinguish TorchBench from existing approaches.

2.1 Benchmark Description

Table 1 overviews all the benchmarks in TorchBench, which consists of 84 deep learning models and covers six domains.

Table 1: TorchBench consists of 84 models from various domains to benchmark PyTorch.

Domain: Computer Vision
  Image Classification: alexnet, densenet121, mnasnet1_0, mobilenet_v2, mobilenet_v2_quantized_qat, mobilenet_v3_large, phlippe_densenet, phlippe_resnet, resnet18, resnet50, resnet50_quantized_qat, resnet152, resnext50_32x4d, shufflenet_v2_x1_0, squeezenet1_1, timm_efficientnet, timm_nfnet, timm_regnet, timm_resnest, vgg16
  Object Detection: detectron2_fasterrcnn_r_50_c4, detectron2_fasterrcnn_r_50_dc5, detectron2_fasterrcnn_r_50_fpn, detectron2_fasterrcnn_r_101_c4, detectron2_fasterrcnn_r_101_dc5, detectron2_fasterrcnn_r_101_fpn, detectron2_maskrcnn, doctr_det_predictor, doctr_reco_predictor, timm_efficientdet, timm_vovnet, vision_maskrcnn
  Image Generation: dcgan, public_image_generator1, public_image_generator2, pytorch_CycleGAN_and_pix2pix, timm_vision_transformer, timm_vision_transformer_large
  Image Segmentation: detectron2_fcos_r_50_fpn, detectron2_maskrcnn_r_50_c4, detectron2_maskrcnn_r_50_fpn, detectron2_maskrcnn_r_101_c4, detectron2_maskrcnn_r_101_fpn, pytorch_unet, yolov3
  Pattern Recognition: Background_Matting
  Video Interpolation: Super_SloMo

Domain: NLP
  Language Modeling: BERT_pytorch, fambench_xlmr, hf_Albert, hf_Bart, hf_Bert, hf_Bert_large, hf_BigBird, hf_DistilBert, hf_Longformer, hf_public_text_generator1, hf_public_text_generator1_large, hf_Reformer, hf_T5, hf_T5_base, hf_T5_large
  Translation: attention_is_all_you_need_pytorch
  Other: fastNLP_Bert

Domain: Recommendation
  (no subtask): dlrm, nvidia_deeprecommender

Domain: Reinforcement Learning
  (no subtask): drq, LearningToPaint, soft_actor_critic

Domain: Speech
  Recognition: speech_transformer
  Audio Source Separation: demucs
  Synthesis: tacotron2, tts_angular

Domain: Other
  (no subtask): functorch_dp_cifar10, functorch_maml_omniglot, lennard_jones, maml, maml_omniglot, moco, opacus_cifar10, pyhpc_equation_of_state, pyhpc_isoneutral_mixing, pyhpc_turbulent_kinetic_energy, pytorch_struct

For a given model name, it may consist of a prefix and/or a suffix. The prefix, such as d2 (Detectron2), hf (Hugging Face), or timm, indicates the collection or platform the model comes from. The suffix denotes different configurations. Specifically, c4 means using a conv4 backbone; fpn means using a Feature Pyramid Network backbone; dc5 means using a conv5 backbone with dilations in conv5; large means that the model takes more parameters. Due to potential privacy concerns, we rename several models, including hf_public_text_generator1 (hf_ptg1), hf_public_text_generator1_large (hf_ptg1_large), public_image_generator1 (pig1), and public_image_generator2 (pig2).

Computer Vision. Computer vision is one of the most important domains to embrace deep learning. We further categorize models into subareas.

• Image classification, which categorizes and labels images. TorchBench includes 20 models in this domain, such as ResNet and its variants, MobileNet_v2, and various models from the model collection timm [48].
• Object detection, which detects all instances of predefined classes and provides axis-aligned boxes to locate the detected objects. TorchBench includes 12 models in this domain, including Faster-RCNN and MaskRCNN atop Detectron2 and many other models.
• Image generation, which takes texts or images as inputs and generates new images. TorchBench includes pig2 [42], a state-of-the-art diffusion model that can create realistic images and art from a description in natural language, and multiple GAN-based [49] models such as pig1 [50] and CycleGAN [51, 52].
• Image segmentation, which partitions an image into multiple segments to locate objects and their boundaries. TorchBench includes YOLOv3 [44] and a series of MaskRCNN models [53] implemented atop Detectron2.
• Pattern recognition and video interpolation. TorchBench includes background matting [54] in pattern recognition, which separates the foreground elements from the background of an image or video and composites them onto a new background. TorchBench also includes Super SloMo [55], which reconstructs high-resolution slow-motion videos by alignment and appearance estimation.

Natural Language Processing (NLP). NLP is a series of algorithms and techniques enabling computers to understand human language. TorchBench includes NLP models for the tasks of language modeling and translation.

• Language modeling, which predicts the probability distribution of words in a language and models the relationships and dependencies among the words. TorchBench includes the open-sourced hf_ptg1 [56] from the Hugging Face platform and hf_ptg1_large with more parameters. Other models such as Bart [57], Bert [47], T5 [43], and BigBird [58] are included to cover NLP tasks such as natural language generation, translation, and comprehension.

• Translation, which converts text written in one language into text in another language while preserving its meaning as much as possible. TorchBench includes the transformer model attention_is_all_you_need_pytorch, which is from a classic paper [59].

Recommendation. Recommendation systems are used to give suggestions on personalized videos, advertisements, and text content. TorchBench includes DLRM [60], open-sourced by Facebook Research, which combines dense and sparse features to represent user-item interactions and generates recommendations based on these representations. TorchBench also includes a PyTorch implementation of NVIDIA's DeepRecommender [61], which is based on a deep autoencoder with six layers and is trained end-to-end without any layer-wise pre-training.

Reinforcement Learning. Reinforcement learning focuses on the capability of making decisions from unstructured input data without manual engineering of the state space. TorchBench includes three representative models: drq [62], soft actor critic [63], and learning to paint [64].

Speech. The speech domain focuses on text and audio transformation and augmentation, such as speech recognition, synthesis, and audio source separation. TorchBench includes four models: speech_transformer [65], tacotron2 [66] from NVIDIA, tts_angular [67] from Mozilla, and demucs [68].

Others. TorchBench includes ten models from miscellaneous domains, such as pyhpc_isoneutral_mixing [69] from high-performance computing and pytorch_struct [70] from core structured prediction algorithms for deep learning applications.

2.2 Benchmark Adaptation and Configuration

We adapt the off-the-shelf models to fulfill the requirements of benchmarking. We configure the computation of each model to exclude the data loading and postprocessing phases. We also select the optimal batch size and configure the precision to obtain a fair performance analysis.

    def main():
        init_distributed()
        model = create_model("resnet50")
        train_dataloader = create_dataloader(TRAIN_DATASET_DIR)
        val_dataloader = create_dataloader(VAL_DATASET_DIR)
        optimizer = create_optimizer(model)
        lr_scheduler = create_scheduler(optimizer)
        loss_fn = torch.nn.Loss()
        # start training loop
        for epoch in range(start_epoch, num_epochs):
            for batch in train_dataloader:
                # --- computation-intensive region sliced by TorchBench ---
                inputs, targets = batch
                inputs, targets = inputs.cuda(), targets.cuda()
                outputs = model(inputs)
                loss = loss_fn(outputs, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                # --- end of measured region ---
            validate(model, val_dataloader)
            lr_scheduler.step()
            save_checkpoint()  # save model checkpoint after each epoch

Listing 1: An example of a ResNet50 deep learning model training task. In our benchmark, we only measure the TFLOPS of the marked program segment and set both num_epochs and len(train_dataloader) to 1.

Computation configuration. To focus on computation, TorchBench does not run the original end-to-end model code. Instead, it slices the original model code to run on a single GPU and only keeps the computation-intensive part, where most of the GPU computation happens. Listing 1 shows the original model code of a ResNet50 model performing an image classification training task and marks the computation-intensive part sliced by TorchBench. The remaining parts, such as distributed setup, data preprocessing, model checkpointing, and data loading, are out of scope. Specifically, at the beginning of every model's benchmark test, we always assume that the input data has already been preprocessed and prefetched to the GPU device. To further simplify our analysis, we limit the TorchBench tests to run only one iteration repeatedly in the model code. We run each model ten times and report performance statistics of the run with the median execution time.
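The measurement protocol can be sketched as follows; run_one_iteration is a placeholder for a model's sliced computation with inputs already resident on the GPU, and the sketch is ours rather than TorchBench's internal harness.

    import time
    import statistics
    import torch

    def benchmark(run_one_iteration, repeats=10):
        """Run the sliced computation `repeats` times; report the median."""
        times = []
        for _ in range(repeats):
            torch.cuda.synchronize()      # drain any pending GPU work
            start = time.perf_counter()
            run_one_iteration()
            torch.cuda.synchronize()      # wait for the iteration to finish
            times.append(time.perf_counter() - start)
        return statistics.median(times)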
Batch size configuration. Besides limiting the number of iterations, we also configure the model inputs by carefully specifying batch_size, an important parameter in many deep learning tasks that defines the input size per iteration. After data preprocessing, the input data is organized as a list of batches, where each batch corresponds to batch_size of the original data samples. The lower bound for batch_size is 1, and the upper bound is limited by the capacity of GPU memory. For training, we use the default batch_size value in the original model code, because the training batch_size could affect model convergence. For inference, the original model code usually does not specify an optimal batch_size. Thus, we run a set of tests that exhaustively enumerate batch_size values (i.e., starting with one and doubling the size in each test) to determine the optimal value that yields the highest GPU utilization, as sketched below. We believe our configuration enables optimal performance for each model in TorchBench.
Precision configuration. We use the 32-bit floating point opera-
Reinforcement Learning. Reinforcement learning focuses on the
tions (FP32 or TF32) for all the benchmarks unless models have
capability of making decisions from unstructured input data with-
specific requirements. Although TorchBench supports other data
out manual engineering of the state space. TorchBench includes
precisions such as FP16 (half-precision), BF16 (Brain Floating Point
three representative models, drq [62], soft actor critic [63], and
Format), and AMP (automatic mixed precision) [71], FP32 and TF32
learning to paint [64].
are the most representative and recommended precisions. On the
Speech. The speech domain focuses on text and audio trans- one hand, GPU vendors such as NVIDIA publish concrete roofline
formation and augmentation, such as speech recognition, syn- performance numbers for FP32 operations in TFLOPS (Tera Floating
thesis, and audio source separation. TorchBench includes four Point Operations per Second) as part of the hardware specification,
models, speech_transformer [65], tactron2 [66] from NVIDIA, which facilitates our characterization efforts. For example, the the-
tts_angular [67] from Mozilla, and demucs[68]. oretical peaks of an NVIDIA Tesla A100 GPU and an AMD Instinct
MI210 GPU are 19.5 TFLOPS [72] and 22.6 TFLOPS [73], respec-
Others. TorchBench includes ten models from miscellaneous tively. On the other hand, PyTorch uses TF32 for cuDNN by default,
domains, such as pyhpc_isoneutral_mixing [69] from high- as TF32 is newly developed and typically yields better performance
performance computing and pytorch_struct[70] from core struc- than FP32.
tured prediction algorithms for deep learning applications.
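For reference, whether PyTorch is allowed to use TF32 on Ampere-class GPUs is controlled by two framework flags:

    import torch

    # Allow TF32 in cuDNN convolutions (PyTorch's default) and in matmuls.
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    # Set both flags to False to force full FP32 when accuracy is a concern.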

2.3 TorchBench vs. MLPerf

The design goals of TorchBench and MLPerf are different. TorchBench aims to give a comprehensive and deep analysis of the PyTorch software stack, while MLPerf aims to compare models running atop different frameworks. Thus, TorchBench differs from MLPerf in three aspects.

[Figure 1: The execution time breakdown for training models in TorchBench. One stacked bar per model shows the portions of execution time spent in GPU idleness, data movement, and GPU active computation.]

[Figure 2: The execution time breakdown for inference models in TorchBench. One stacked bar per model shows the portions of execution time spent in GPU idleness, data movement, and GPU active computation.]

• TorchBench benchmarks the computation phase of DL models, while MLPerf benchmarks the end-to-end execution of the models.
• TorchBench benchmarks PyTorch only, while MLPerf benchmarks different deep learning frameworks.
• TorchBench includes 84 DL models in six domains, while MLPerf has only five models in five domains with PyTorch. TorchBench covers 2.3x more PyTorch APIs than MLPerf.

Recently, TorchBench has evolved to embrace end-to-end models and to support frameworks beyond PyTorch (e.g., JAX). However, this evolution is in a preliminary stage and out of the scope of this paper.

3 TORCHBENCH CHARACTERIZATION

TorchBench enables comprehensive characterization of PyTorch. Given the page limit, we show the insights obtained from three characterization efforts: (1) characterizing PyTorch performance on NVIDIA GPUs (Section 3.1), (2) characterizing PyTorch performance for different compiler backends (Section 3.2), and (3) comparing PyTorch performance between NVIDIA and AMD GPUs (Section 3.3).

We benchmark TorchBench with the PyTorch 2.0-20230102 nightly release linked with CUDA 11.7 [24]. Our experiments are done on one NVIDIA A100 GPU with 40 GB memory. Experiments in Section 3.3 also include data obtained from an AMD MI210 GPU with 64 GB memory.

3.1 Characterizing PyTorch Computation on GPU

We choose execution time as our main metric for GPU utilization because it is the most straightforward and common metric for measuring model performance.

Figures 1 and 2 show the characterization results for the training and inference tasks of TorchBench models, respectively. Each bar in the figures is composed of three segments: blue for the time the GPU is active for computation, red for the time used for data movement between CPU and GPU, and grey for the time the GPU is idle. We normalize them as portions of the total execution time of the models. From the figures, we observe that PyTorch keeps the GPU busy for only 56.8% and 55.4% of total execution time for training and inference, respectively. GPU idleness and CPU-GPU data movement account for a substantial portion of time, preventing PyTorch from achieving full GPU usage. Table 2 further quantifies the time decomposition averaged across models in different domains. We obtain the following insights for further investigation.
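As an illustration of this breakdown, the following rough sketch (our own, not TorchBench's internal tooling) approximates the data-movement and GPU-active buckets with the PyTorch profiler; the idle share is the remainder of the wall-clock window.

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024)              # CPU tensor: forces an HtoD copy

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x.cuda())
        torch.cuda.synchronize()

    copy_us = kernel_us = 0.0
    for evt in prof.key_averages():
        if "Memcpy" in evt.key:            # CPU-GPU data movement
            copy_us += evt.cuda_time_total
        else:                              # GPU active computation
            kernel_us += evt.cuda_time_total
    print(f"data movement: {copy_us:.0f} us, GPU active: {kernel_us:.0f} us")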

Table 2: The breakdown ratios (%) of model execution time for different deep learning tasks.

                       |            Train                 |           Inference
Task                   | GPU active | Data move | GPU idle | GPU active | Data move | GPU idle
Computer Vision        |    53.1    |    2.1    |   44.8   |    62.8    |    1.4    |   35.7
NLP                    |    84.9    |    1.3    |   13.8   |    64.7    |    0.8    |   34.5
Recommendation         |    75.4    |    0.4    |   24.2   |    51.4    |    0.1    |   48.5
Reinforcement Learning |    10.2    |    5.0    |   84.8   |    19.3    |    8.4    |   72.3
Speech                 |    28.8    |    0.3    |   70.9   |    50.3    |    0.3    |   49.4

[Figure 3: The comparisons of execution time (T), CPU memory usage (CM), and GPU memory usage (GM) for training between the original PyTorch compiler and TorchInductor. Each bar plots the TorchInductor-to-original ratio; <1 means TorchInductor performs better, while >1 means the original PyTorch compiler performs better.]

[Figure 4: The comparisons of execution time (T), CPU memory usage (CM), and GPU memory usage (GM) for inference between the original PyTorch compiler and TorchInductor. Each bar plots the TorchInductor-to-original ratio; <1 means TorchInductor performs better, while >1 means the original PyTorch compiler performs better.]

Insights on execution time decomposition. For both training and inference, models in computer vision, NLP, and recommendation yield over 50% GPU active time, among which NLP models achieve over 80% in training. Models in these domains usually have large input sizes and intensive computation to be offloaded to the GPU. In contrast, reinforcement learning (RL) models achieve the smallest GPU active time for both training and inference. RL models need to interact with an environment, a component not based on PyTorch, which limits the parallelism of RL models. Compared with NLP models, RL models have smaller inputs and less intensive computation in each batch, so they usually incur more GPU idleness.

Insights on the performance difference between training and inference. Some models perform better in training, while others perform better in inference. There are three reasons. First, training may use different input sizes from inference, so PyTorch invokes different GPU kernels. Second, some functions in training require higher precision than in inference. Third, PyTorch may invoke different GPU kernels for training and inference even when they have the same input. For example, the GPU active time of fambench_xlmr is 98.0% in training but only 44.7% in inference with the same input. With further investigation, we found that this model uses FP32 for training but FP16 for inference by default.

For the same forward phase, FP16 GPU kernels run faster, so the GPU finishes its computation earlier, resulting in a larger portion of idleness.

Discussion of individual models. pig2 is one of the outliers; it spends 52% of execution time on data movement. With further investigation, we find that, in order to save GPU memory, it always keeps one neural network structure on the GPU and offloads all other structures to the CPU. After the computations, it copies all structures back to the GPU. This kind of ping-pong data movement wastes a lot of time.

These insights motivate the necessity of optimizing PyTorch to remedy the performance losses caused by GPU idleness and data movement. We elaborate on our optimization efforts with the help of TorchBench in Section 4.1. It is worth noting that high GPU active time does not mean there is no room for performance improvement. For example, the GPU active ratio for vgg16 is 98.3% while its achieved TFLOPS is about 10.07, which still has a gap from the peak performance. This is caused by GPU kernel inefficiencies, such as device memory access delays, instruction dependency delays, shared memory bank conflicts, and many pipeline stall reasons. Further characterization of TFLOPS is out of scope due to the page limit.

3.2 Characterizing PyTorch Compilers

Besides the default model interpreter in eager mode, PyTorch provides multiple model compilers (also known as model backends) in graph mode, such as TorchScript [74], Torch-TensorRT [75], TorchDynamo [76], and TorchInductor [77]. TorchScript is a classic Just-In-Time (JIT) model compiler that traces and optimizes model code. Torch-TensorRT is an Ahead-of-Time (AOT) compiler, which utilizes the NVIDIA TensorRT [78] deep learning optimizer and runtime to compile models before deployment. TorchDynamo compiles arbitrary Python code into graphs, which can be further compiled. TorchInductor compiles the graphs generated by TorchDynamo into optimized C++/Triton kernels. The combination of TorchDynamo and TorchInductor is the latest and recommended JIT compiler for PyTorch 2.0 [79]. We use TorchInductor to denote this combination in this paper.
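In PyTorch 2.0 this combination is exposed through the torch.compile API; a minimal usage sketch with an arbitrary toy model:

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
    ).cuda()
    x = torch.randn(64, 1024, device="cuda")

    compiled = torch.compile(model)  # TorchDynamo + TorchInductor backend
    compiled(x)                      # first call triggers JIT compilation
    compiled(x)                      # later calls reuse the compiled kernels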
Figures 3 and 4 compare the performance between TorchInductor and the default PyTorch compiler. We measure three metrics: execution time, CPU memory consumption, and GPU memory consumption. It is worth noting that we do not show every benchmark in TorchBench because TorchInductor is still in its early stage and does not fully support all PyTorch APIs. From the figures, we observe that TorchInductor typically improves execution time over the default compiler, with 1.30x and 1.46x speedups for training and inference on average (geomean). Moreover, TorchInductor significantly reduces CPU memory consumption, by 71.2% for training and 73.7% for inference, but generally increases the demand for GPU memory, by 31.2% for training and 51.1% for inference. Many models suffer GPU memory bloat as high as >5x compared to the default PyTorch compiler.

TorchInductor obtains speedups mainly from three techniques. First, it fuses GPU kernels and utilizes Triton to generate faster kernels; for example, fusing two subsequent functions can eliminate intermediate computations and memory load and store operations. Second, it reorders the resource-intensive nodes in the graph to trade off performance against resource usage. Third, it determines which buffers can be reused and when each node should be executed according to the data dependencies. To apply these techniques, TorchInductor has to use its own memory cache allocators and record graph details, which results in larger GPU memory footprints. Since many intermediate operations have been removed, the CPU memory usage is reduced significantly. We have reported the large GPU memory usage to the TorchInductor developers, who confirm this performance issue and promise to fix it in the next release.

Outlier discussion. Running inference on yolov3 and hf_Reformer shows significant slowdown with TorchInductor. This is because of the high just-in-time (JIT) compilation overhead introduced by TorchInductor. For most models, TorchInductor only needs to JIT-compile once and can reuse the jitted model in the following iterations. However, models such as hf_Reformer incur many guard checks in TorchInductor, which guarantee correct execution but cause high overhead. For example, hf_Reformer incurs 2,699 guard checks, and 30% of them are heavy guard checks such as dictionary key checks. We confirmed this performance issue with the TorchInductor developers.

Insights. TorchInductor typically accelerates both training and inference while consuming more GPU memory compared to the default compiler. However, TorchInductor is not suitable for models running on GPUs with limited memory unless further configurations are applied, such as changing batch sizes or using quantization.

Table 3: The peak theoretical TFLOPS for various floating point number formats on NVIDIA A100 and AMD MI210 GPUs.

GPU         | FP32 | TF32 | FP32-Matrix | FP64 | FP64-Matrix | FP64-Tensor Core
NVIDIA A100 | 19.5 | 156  |      -      | 9.7  |      -      | 19.5
AMD MI210   | 22.6 |  -   |    45.3     | 22.6 |    45.3     |  -

3.3 PyTorch on NVIDIA vs. AMD GPUs

PyTorch supports various types of GPUs. In this section, we compare the performance of the NVIDIA A100 and AMD MI210, which are competing products in the market. We compare their performance on TorchBench with their respective software stacks: ROCm 5.4.2 [80] from AMD and CUDA 11.8 [81] from NVIDIA. We run the default 32-bit configuration of TorchBench on both GPUs for a fair comparison, and we test PyTorch stable 2.0.1 on both platforms. Table 3 compares the peak theoretical TFLOPS for different floating point number formats on these two GPUs. Theoretically, MI210 has higher performance than A100 in FP32 and FP64 computation. However, both A100 and MI210 have unique features that make the TorchBench comparison uncertain. For example, FP32-Matrix and FP64-Matrix are optimized matrix operations for FP32 and FP64 unique to AMD GPUs, yielding high TFLOPS. TF32 is the 32-bit floating point format unique to the A100, which yields high TFLOPS but with some accuracy loss; FP64-Tensor Core denotes FP64 operations uniquely accelerated by NVIDIA Tensor Cores.

Figure 5 shows the comparison of execution time obtained from the AMD MI210 (T_AMD) and NVIDIA A100 (T_NVIDIA).

[Figure 5: Comparing the execution time for training and inference obtained from NVIDIA A100 and AMD MI210 GPUs. Each bar represents the ratio T_NVIDIA/T_AMD; <1 means A100 performs better, while >1 means MI210 performs better.]

The ratio each bar represents in the figure is T_NVIDIA/T_AMD; <1 means the NVIDIA A100 performs better, and >1 means the AMD MI210 performs better. Overall, no single GPU is best for all TorchBench models. For model inference, the AMD MI210 can achieve 1.46x the performance of the NVIDIA A100 on dlrm. Conversely, the NVIDIA A100 can yield a 12.57x speedup over the AMD MI210 on hf_ptg1. A similar situation appears in the training phase as well.

Insights. Our further investigation shows that models typically benefit from the A100 if most of their GPU kernels can use the TF32 format for computation, because TF32 can achieve much higher TFLOPS than FP32 and FP32-Matrix according to Table 3. However, not all models can use TF32, as TF32 incurs accuracy losses. For example, training most NLP models invokes the aten::matmul operator, which has required the use of FP32 since PyTorch 1.12. Similar operators include elementwise_add and elementwise_div with FP32 precision. In this case, the AMD MI210 performs better because it has a higher FP32 TFLOPS than the NVIDIA A100.

4 APPLYING TORCHBENCH IN PRACTICE

We have applied TorchBench to guide performance optimization in the entire PyTorch software stack, including DL models, the PyTorch framework, and GPU acceleration libraries. We describe two ways to use TorchBench. First, we analyze each model in TorchBench to understand and optimize GPU idleness and data movement at the source code level. We have identified five performance issues; three optimization patches have been upstreamed to PyTorch or model repositories, and two optimization patches are confirmed and under discussion for upstreaming. Second, we have extended the continuous integration (CI) service of the PyTorch repository to include TorchBench. We perform a daily sanity check on performance regressions for every nightly release. We have identified seven commits that incurred unexpected slowdowns in multiple TorchBench benchmarks. Among these problematic commits, five were reverted and two were merged after optimization. In the remainder of this section, we elaborate on both use cases.

4.1 PyTorch Optimization with TorchBench

We optimize the entire PyTorch software stack to improve the GPU utilization characterized in Section 3.1. We use PyTorch Profiler [82] to understand GPU idleness and data movement.

4.1.1 Minimizing GPU Idleness. A GPU is said to be idle when there is no work scheduled on it. This is the grossest of inefficiencies because it means the precious GPU computation resource is wasted. As shown in Figures 1 and 2, TorchBench exposes significant GPU idleness in a majority of models. With the help of PyTorch Profiler, we are able to pinpoint the GPU idleness in both the model and framework layers of the PyTorch software stack.

    def zero_grad():
        ...
        # pddg: per_device_and_dtype_grads
    +   pddg = defaultdict(lambda: defaultdict(list))
        for group in self.param_groups:
            for p in group['params']:
                ...
    -           if (not foreach or p.grad.is_sparse):
    -               p.grad.zero_()
    -           else:
    -               pddg[p.grad.device][p.grad.dtype].append(p.grad)
    +           if not p.grad.is_sparse and p.grad.is_cuda:
    +               pddg[p.grad.device][p.grad.dtype].append(p)
    +           else:
    +               p.grad.zero_()
    -   if foreach:
    +   if foreach or pddg:
            for _, per_dtype_grads in pddg.items():
                for grads in per_dtype_grads.values():
                    torch._foreach_zero_(grads)

Listing 2: zero_grad sets all gradients to zero serially; the patch batches the zeroing.

Listing 2 shows an example in the zero_grad method of the PyTorch optimizer, which sets all gradients to zero in each training iteration. In this method, p.grad.zero_ is invoked in a loop nest to zero each gradient, which incurs a series of tiny GPU kernels. A significant amount of GPU idleness occurs between these kernels. This inefficiency has been confirmed by PyTorch developers. We propose a fix as shown in Listing 2: we create a temporary list to maintain the references to all the gradients and utilize torch._foreach_zero_ to set them to zero with one GPU kernel. This optimization avoids GPU idleness due to waiting for kernel launches. Since the optimization is in the PyTorch framework, multiple models benefit from it.
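The effect can be reproduced with a minimal sketch (ours) that contrasts the two zeroing strategies; the tensor shapes are arbitrary.

    import torch

    grads = [torch.ones(1024, 1024, device="cuda") for _ in range(100)]

    # Serial zeroing: one tiny kernel per gradient, with idle gaps in between.
    for g in grads:
        g.zero_()

    # Batched zeroing as in the fix: one (or a few) fused kernel launches.
    torch._foreach_zero_(grads)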

    def _len_and_dim_norm(self, vectors):
        vectors = self._len_norm(vectors)
        vectors = vectors * torch.rsqrt(
            torch.tensor(self.attention_head_size,
                         device=vectors.device, dtype=vectors.dtype)
        )
        return vectors

Listing 3: Model hf_reformer calls the function torch.rsqrt() to calculate the reciprocal of the square root of the variable self.attention_head_size.

[Figure 6: Speedups of models in the training phase obtained by applying our different optimizations. Only models with speedups over 5% are shown; other models do not show obvious performance changes due to our optimizations.]

Moreover, we investigate TorchBench models that incur significant GPU idleness. For example, hf_BigBird [58], a recently proposed transformer-based model, has more than 50% GPU idleness. We find that substantial computation is done on the CPU, rather than the GPU, resulting in the high GPU idleness. Our optimization is either to reduce the CPU execution time, which reduces the GPU waiting time, or to offload some CPU work to the GPU to keep it busy.

4.1.2 Reducing Data Movement. Excessive data movement between CPUs and GPUs is a well-known bottleneck in GPU applications. Some TorchBench models show nontrivial data movement. For example, training hf_reformer spends 2.9% of execution time on data movement. Listing 3 shows the problematic method, _len_and_dim_norm. This method calls torch.rsqrt(), which takes a tensor as input to compute the reciprocal of the square root of the scalar self.attention_head_size. As torch.rsqrt() launches a GPU kernel for the computation, it needs to copy this scalar from the CPU to the GPU, which incurs a large overhead surpassing the benefit of GPU acceleration.
Thus, we use numpy.sqrt() instead to perform the computation on the CPU. PyTorch then generates one single kernel for a division between the tensor and the scalar. This optimization yields a 27x speedup for the function _len_and_dim_norm. This optimization patch has been upstreamed to the PyTorch Transformer repository [83].
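A sketch of the patched method, following the description above (the actual upstreamed patch may differ in detail):

    import numpy as np

    # The scalar reciprocal square root is computed on the CPU with NumPy,
    # so PyTorch emits a single GPU kernel for the tensor-scalar multiply
    # instead of an HtoD copy plus an rsqrt kernel.
    def _len_and_dim_norm(self, vectors):
        vectors = self._len_norm(vectors)
        vectors = vectors * (1.0 / np.sqrt(self.attention_head_size))
        return vectors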
Moreover, we found that pig2 spends 52.7% of execution time on CPU-GPU memory copies during inference. As described in Section 3.1, to save GPU memory, it keeps only one neural network structure on the GPU, offloads all other neural network structures to the CPU, and copies them back to the GPU when needed. However, for GPUs with large memory capacity like the NVIDIA A100, these offloads waste a lot of time. After a discussion with the pig2 developers, they added an option to disable such memory offloading, which yields a 10.1x speedup.

4.1.3 Optimization Speedups. We run each TorchBench model 20 times and report the arithmetic mean of the speedups obtained with our optimizations. In summary, 41 of the 84 models yield nontrivial speedups in the training phase: 1.34x on average and up to 10.1x. Our optimizations yield speedups over 1.03x for 15 of the 84 models in inference, with up to a 10.3x speedup for pig2. The remaining models do not show obvious improvement because our optimizations do not impact their execution paths.

It is worth noting that the PyTorch ecosystem has already been thoroughly optimized for high performance by machine learning engineers and PyTorch system engineers. That TorchBench can still help improve performance significantly proves its value.

4.2 PyTorch CI with TorchBench

Continuous integration (CI) of a project repository automates the integration of code commits from multiple contributors. PyTorch leverages the GitHub CI service for pull request sanity checks, workflow checks, programmatic and stylistic error checks, OS and compiler checks, and many others. However, PyTorch lacked continuous performance checks. We integrate TorchBench into the PyTorch CI service for performance regression testing. This is the first such effort for the PyTorch repository. In this section, we first describe how we set up PyTorch CI with TorchBench and then present case studies of performance regressions due to various reasons.

4.2.1 PyTorch CI Setup with TorchBench. Based on GitHub Actions, we create a series of GitHub workflows to continuously test for performance regressions with TorchBench. We measure the following two metrics.

• Execution time. TorchBench can be configured to run on CPU only or on CPU+GPU for model training or inference. CI measures the execution time of each TorchBench benchmark in all four configurations: training on CPU, training on CPU+GPU, inference on CPU, and inference on CPU+GPU.
• Memory usage. For each TorchBench benchmark, CI measures the peak CPU and GPU memory consumption and checks for memory leaks. As with the time measurement, the memory measurement also covers all four configurations.

Naively, CI collects these metrics for every commit and checks whether a commit (also known as a pull request, or PR) significantly bloats execution time or memory usage. From our experience, we define the threshold as a 7% increase in execution time or memory usage. If at least one TorchBench benchmark exceeds the threshold, PyTorch CI automatically submits a GitHub issue with the detailed performance report and the problematic commit for further investigation.

However, there is a large overhead for CI to perform these checks on each commit, because more than 70 commits can be submitted every day. To reduce the overhead, PyTorch CI performs the checks only on PyTorch's nightly version, which is automatically built from the latest commit at the end of the day. If any performance regressions are found, CI uses binary search over the commits submitted that day, ordered by their submission timestamps, to locate the culprit. This optimization significantly reduces the CI overhead. The two sketches below illustrate the threshold check and the bisection step.
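The threshold check reduces to a comparison against the previous nightly's numbers. The following is an illustrative sketch, not the actual workflow code, and the metric dictionary layout is assumed.

    # 7% increase in time or memory flags a regression.
    THRESHOLD = 0.07

    def find_regressions(baseline, nightly):
        """baseline/nightly: benchmark -> {'time': ..., 'cpu_mem': ..., 'gpu_mem': ...}."""
        flagged = []
        for bench, base in baseline.items():
            for metric, old in base.items():
                new = nightly[bench][metric]
                if new > old * (1 + THRESHOLD):
                    flagged.append((bench, metric, old, new))
        return flagged  # a non-empty list triggers an automatic GitHub issue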

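The bisection step can be sketched as a standard binary search over the day's commits; is_regressed, which would build a commit and rerun the benchmark check, is a hypothetical helper.

    def first_bad_commit(commits, is_regressed):
        """Find the earliest regressing commit in submission order."""
        lo, hi = 0, len(commits) - 1   # commits[hi] is known to regress
        while lo < hi:
            mid = (lo + hi) // 2
            if is_regressed(commits[mid]):
                hi = mid               # regression introduced at or before mid
            else:
                lo = mid + 1           # regression introduced after mid
        return commits[lo]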
4.2.2 Performance Regression Case Studies. With the help of TorchBench, PyTorch CI is able to identify many submitted commits with performance issues. Table 4 overviews some problematic commits and their issues. All these issues were reported to the PyTorch team and fixed immediately with either a patch update or a patch reversion. We elaborate on several typical case studies.

Table 4: Seven issues found in PyTorch development by TorchBench.

PR#   | Issue                               | Performance Impact | Resolution
85447 | Break-chain API change              | Memory bloat       | Fixed
61056 | Duplicate error check               | Runtime inflation  | Fixed
65594 | Optimization's device compatibility | Runtime inflation  | Fixed
72148 | Suboptimal library configuration    | Runtime inflation  | Fixed
71904 | Redundant bound checks              | Runtime inflation  | Fixed
65839 | Template mismatch                   | Runtime inflation  | Reverted
87855 | Misused error handling              | Runtime inflation  | Reverted

Runtime inflation. Most problematic PRs fall into this category. Listing 4 shows PR #65839, which results in a 6.82x slowdown on average for training and a 24.47x slowdown for inference. These slowdowns are observed across six models, as shown in Table 5. This PR updates the C++ template from the type scalar_t to opmath_t in the gemm function. The opmath_t type is supposed to be compiled to faster oneDNN-accelerated functions. However, the resulting template match in argument deduction introduces slower type conversions generated by the compiler. This performance issue has been confirmed by PyTorch developers, and the PR was reverted.

    - template <typename scalar_t>
    + template <typename scalar_t, typename opmath_t = at::opmath_type<scalar_t>>
      void gemm(
          TransposeType transa, TransposeType transb,
          int64_t m, int64_t n, int64_t k,
    -     scalar_t alpha,
    +     opmath_t alpha,
          const scalar_t *a, int64_t lda,
          const scalar_t *b, int64_t ldb,
    -     scalar_t beta,
    +     opmath_t beta,
          scalar_t *c, int64_t ldc)

Listing 4: PR #65839 causes inefficient template matching. This commit replaces the original scalar_t with opmath_t but causes an inefficient template match in argument deduction.

Table 5: Six models have a 15.64x slowdown on average for CPU testing. The slowdown is up to 51.37x.

Mode      | Model           | Slowdown
Train     | pytorch_stargan | 20.45x
Train     | vision_maskrcnn | 3.68x
Train     | maml_omniglot   | 1.96x
Train     | timm_regnet     | 1.19x
Inference | pytorch_stargan | 51.37x
Inference | demucs          | 43.22x
Inference | vision_maskrcnn | 2.11x
Inference | mnasnet1_0      | 1.16x

Listing 5 shows part of PR #61056, which adds a validity check on value in the torch.distributions API. This API generates parameterizable probability distributions. This PR incurs an 11% slowdown in the model soft_actor_critic. With further investigation, we find that line 7 uses valid.all to check all values in the array, which incurs high overhead. However, this validity check is unnecessary, as line 8 performs the same check in another way. We reported our findings to PyTorch developers, and this performance issue was fixed by removing the redundant validity checks.

    1 class Distribution(object):
    2     def __init__(...):
    3         ...
    4 +       value = getattr(self, param)
    5 +       valid = constraint.check(value)
    6 +       if not valid.all():
    7 +           raise ValueError(...)
    8         if not constraint.check(getattr(self, param)).all():
    9             raise ValueError("The parameter {} has invalid values".format(param))

Listing 5: PR #61056: lines 4-7 are added to check the values' validity, which is redundant.

PR #65594 introduces a Conv-Bias-Relu fusion to fuse possible function combinations. However, it results in a 21% slowdown on average for nine models on an NVIDIA M60 GPU with cuDNN 7.6.5. The fix is a patch to bypass the optimization on devices such as the M60 GPU. PR #72148 introduces optimized bias fusions but causes a 7.8% slowdown in the model nvidia_deeprecommender for some input sizes. The fix is to set the correct workspace size related to the bias fusions. PR #87855 was introduced in Section 1.1. PR #71904 introduces extra bound checks, resulting in a 14.9% slowdown for the model dlrm. The fix is to remove the bound checks.

Memory bloat. PR #85447 raises a red flag in the CI when measuring the memory consumption metric. It introduces a potential memory leak in which part of the memory is not reclaimed after the computation. We investigated the PR and found that PyTorch uses a different memory management scheme to interact with cuBLAS. Instead of having cuBLAS manage its own memory, this PR lets PyTorch preallocate memory for the cuBLAS workspace automatically. However, this PR does not free the workspace memory. We reported this finding to the PR authors, who added the memory reclamation torch._C._cuda_clearCublasWorkspaces() to avoid the memory leak.

5 RELATED WORK

There are many existing approaches [84-92] to characterizing, analyzing, and optimizing general GPU applications. In this section, we review the most related solutions that analyze and optimize the performance (i.e., execution efficiency, not model accuracy/robustness) of deep learning frameworks and models.

5.1 Benchmarking Deep Learning Worloads computations and memory usage as much as possible. TASO [117]
MLPerf [35] is one of the most popular deep learning bench- reduces the strength of the computation by transforming it to an
mark suites with contributors across major industrial stakeholders. equivalence of higher efficiencies.
MLPerf embraces eight models to benchmark end-to-end execution TorchBench complements these optimization techniques by
of training and inferences. MLPerf is used to evaluate the model per- providing a platform to evaluate all these existing optimization tech-
formance with different deep learning frameworks running on dif- niques and identify new optimization opportunities. TorchBench
ferent hardware. DAWNBench [93] and AIBench [94] follow a sim- provides unique insights to enable optimization for performant
ilar design and goal with their end-to-end benchmarking. Addition- PyTorch code bases.
ally, microbenchmark suites such as Fathom [95], DeepBench [96],
DNNMark [97], and AIBench, include multiple computation ker- 6 CONCLUSIONS
nels (aka operators) widely used in deep learning workloads. These This paper describes TorchBench, which is the first comprehen-
microbenchmarks configure the operators with different inputs to sive benchmark suite for PyTorch. We show the unique insights
understand how these operators behave on different hardware. obtained from TorchBench benchmarking the entire PyTorch soft-
TorchBench has a completely different design goal from existing benchmark suites. TorchBench aims to expose performance issues in the PyTorch repository as the project evolves. It obtains deep insights into the PyTorch code base rather than characterizing workloads across different deep learning frameworks.
5.2 Profiling Deep Learning Workloads

There are many profilers [86–88] that can identify performance inefficiencies in CPU-GPU applications, including deep learning frameworks. Since they are not specialized for deep learning applications, they require significant manual effort and domain knowledge to understand the inefficiencies and devise actionable optimizations. Domain-specific profilers that target deep learning frameworks include PyTorch Profiler [82], Tensorflow Profiler [98], DLPerf [99], and MXNet Profiler [100], to name a few. These profilers pinpoint hotspots in both CPU and GPU computation kernels and associate them with deep learning operators. These profilers can be used together with TorchBench for better performance insights.
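As an illustration, the sketch below runs one PyTorch Profiler session that records CPU and GPU activity for a single inference step and ranks operators by GPU time; the model choice (torchvision's ResNet-50) is illustrative.

```python
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

model = models.resnet50().cuda().eval()
x = torch.randn(8, 3, 224, 224, device="cuda")

# Record both CPU-side operator calls and the GPU kernels they launch.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# Aggregate by operator and rank by total GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```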
5.3 Optimizing Deep Learning Workloads

To obtain bare-metal performance for deep learning models on GPUs, a number of optimization methods have been proposed. These methods focus on two directions. One is domain- or task-specific, highly tuned implementations. For example, the CUDA Deep Neural Network library (cuDNN) [30], released by NVIDIA, is a GPU-accelerated library of primitives for deep neural networks; it has been widely adopted by different deep learning frameworks. Intel OneAPI [101], Google's XNNPACK [102], and Android's NNAPI [103], among others, serve similar purposes. TorchScript [74] and Grappler [104] are two common frameworks that optimize deep learning graphs by applying compiler optimization techniques such as constant folding. There are also other optimization techniques such as pruning [105–107], quantization [108–110], and operator fusion [111–114].
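As a concrete instance of one such technique, the sketch below applies post-training dynamic quantization with the stock PyTorch API, storing Linear weights in int8; the toy model is illustrative and not one of the paper's benchmarks.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: weights of the selected module types are quantized
# to int8; activations are quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(qmodel(x).shape)  # same interface as the float model
```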
The other direction is optimizing compilers. XLA (Accelerated Linear Algebra) [115] and TorchDynamo [76] are linear algebra compilers for TensorFlow and PyTorch, respectively. They can accelerate deep learning models with potentially no source code changes. JAX [23] was recently released by Google to provide composable transformations of deep learning models. TVM [20] is a deep learning compiler that can autotune deep learning models for the given hardware. AITemplate [116] fuses different operations at various hierarchy levels to remove unnecessary intermediate computations and memory usage as much as possible. TASO [117] reduces computation strength by transforming a computation into a more efficient equivalent.
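As an illustration of this no-source-change path, the sketch below compiles a toy model with torch.compile, the PyTorch 2.0 [79] entry point that drives TorchDynamo; the model is illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

# torch.compile captures graphs with TorchDynamo and, by default,
# generates code with TorchInductor; the model code itself is unchanged.
compiled = torch.compile(model)

x = torch.randn(64, 1024)
print(torch.allclose(model(x), compiled(x), atol=1e-4))  # same results, compiled execution
```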
TorchBench complements these optimization techniques by providing a platform to evaluate all the existing optimizations and identify new optimization opportunities. TorchBench provides unique insights that guide optimization toward performant PyTorch code bases.

6 CONCLUSIONS

This paper describes TorchBench, the first comprehensive benchmark suite for PyTorch. We show the unique insights obtained by using TorchBench to benchmark the entire PyTorch software stack on mainstream NVIDIA and AMD GPUs. Moreover, we show real use cases in which TorchBench guides code optimization and supports regression testing. With the help of TorchBench, we have devised many optimization patches for PyTorch, most of which have been upstreamed to the official PyTorch repository.
REFERENCES

[1] M. Haggag, A. S. Siam, W. El-Dakhakhni, P. Coulibaly, and E. Hassini, “A deep learning model for predicting climate-induced disasters,” Natural Hazards, vol. 107, pp. 1009–1034, 2021.
[2] B. Y. El-Habil and S. S. Abu-Naser, “Global climate prediction using deep learning,” Journal of Theoretical and Applied Information Technology, vol. 100, no. 24, 2022.
[3] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” 2016.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[5] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., “Searching for MobileNetV3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324, 2019.
[6] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2011.
[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[8] X. Wang and Y. Wang, “Improving content-based and hybrid music recommendation using deep learning,” in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 627–636, 2014.
[9] J. Wei, J. He, K. Chen, Y. Zhou, and Z. Tang, “Collaborative filtering and deep learning based recommendation system for cold start items,” Expert Systems with Applications, vol. 69, pp. 29–39, 2017.
[10] A. M. Elkahky, Y. Song, and X. He, “A multi-view deep learning approach for cross domain user modeling in recommendation systems,” in Proceedings of the 24th International Conference on World Wide Web, pp. 278–288, 2015.
[11] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2094–2107, 2014.
[12] M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou, “Lung pattern classification for interstitial lung diseases using a deep convolutional neural network,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1207–1216, 2016.
[13] Q. Rao and J. Frtunikj, “Deep learning for self-driving cars: Chances and challenges,” in Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 35–38, 2018.
[14] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani, “Visual SLAM for automated driving: Exploring the applications of deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 247–257, 2018.
[15] T. Zhou, Q. Zhu, and J. Du, “Intuitive robot teleoperation for civil engineering operations with virtual reality and deep learning scene reconstruction,” Advanced Engineering Informatics, vol. 46, p. 101170, 2020.
[16] T. Karácsony, J. P. Hansen, H. K. Iversen, and S. Puthusserypady, “Brain computer interface for neuro-rehabilitation with deep learning classification and virtual reality feedback,” in Proceedings of the 10th Augmented Human International Conference 2019, pp. 1–8, 2019.
[17] K. M. Ruff and R. V. Pappu, “AlphaFold and implications for intrinsically disordered proteins,” Journal of Molecular Biology, vol. 433, no. 20, p. 167208, 2021.
[18] B. Ramsundar, P. Eastman, P. Walters, and V. Pande, Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More. O’Reilly Media, 2019.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678, 2014.
[20] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, 2018.
[21] “ONNX.” [Accessed April 11, 2023].
[22] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
[23] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: Composable transformations of Python+NumPy programs,” 2018.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 8024–8035, Curran Associates, Inc., 2019.
[25] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
[26] PyTorch, “PyTorch commit statistics.” [Accessed April 11, 2023].
[27] M. A. Research, “PapersWithCode.” [Accessed April 11, 2023].
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green AI,” Commun. ACM, vol. 63, pp. 54–63, Nov. 2020.
[30] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
[31] A. Prokopec, A. Rosà, D. Leopoldseder, G. Duboscq, P. Tůma, M. Studener, L. Bulej, Y. Zheng, A. Villazón, D. Simon, et al., “Renaissance: Benchmarking suite for parallel applications on the JVM,” in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 31–47, 2019.
[32] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, et al., “An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 3–18, 2019.
[33] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, et al., “The NAS parallel benchmarks,” The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991.
[34] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54, IEEE, 2009.
[35] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, et al., “MLPerf training benchmark,” Proceedings of Machine Learning and Systems, vol. 2, pp. 336–349, 2020.
[36] J. Yin, A. Tsaris, S. Dash, R. Miller, F. Wang, and M. A. Shankar, “Comparative evaluation of deep learning workloads for leadership-class systems,” BenchCouncil Transactions on Benchmarks, Standards and Evaluations, vol. 1, no. 1, p. 100005, 2021.
[37] Y. Wang, Q. Wang, S. Shi, X. He, Z. Tang, K. Zhao, and X. Chu, “Benchmarking the performance and energy efficiency of AI accelerators for AI training,” in 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 744–751, IEEE, 2020.
[38] N. Otterness and J. H. Anderson, “AMD GPUs as an alternative to NVIDIA for supporting real-time workloads,” in 32nd Euromicro Conference on Real-Time Systems (ECRTS 2020), Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
[39] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, (Online), pp. 38–45, Association for Computational Linguistics, Oct. 2020.
[40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[41] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[42] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” 2022.
[43] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[44] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[45] Mindee, “doctr: Document text recognition.” https://github.com/mindee/doctr, 2021.
[46] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2.” https://github.com/facebookresearch/detectron2, 2019.
[47] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[48] R. Wightman, “PyTorch image models.” https://github.com/rwightman/pytorch-image-models, 2019.
[49] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[50] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797, 2018.
[51] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
[52] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
[53] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.
[54] S. Sengupta, V. Jayaram, B. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Background matting: The world is your green screen,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2291–2300, 2020.
[55] A. Paliwal and N. K. Kalantari, “Deep slow motion video reconstruction with hybrid imaging system,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 7, pp. 1557–1569, 2020.
[56] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[57] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
[58] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al., “Big Bird: Transformers for longer sequences,” Advances in Neural Information Processing Systems, vol. 33, pp. 17283–17297, 2020.
[59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[60] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, “Deep learning recommendation model for personalization and recommendation systems,” CoRR, vol. abs/1906.00091, 2019.
[61] O. Kuchaiev and B. Ginsburg, “Training deep autoencoders for collaborative filtering,” arXiv preprint arXiv:1708.01715, 2017.
[62] D. Yarats, I. Kostrikov, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” in International Conference on Learning Representations, 2021.
[63] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
[64] Z. Huang, W. Heng, and S. Zhou, “Learning to paint with model-based deep reinforcement learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8709–8718, 2019.
[65] J. Li, X. Wang, Y. Li, et al., “The SpeechTransformer for large-scale Mandarin Chinese speech recognition,” in ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7095–7099, IEEE, 2019.
[66] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, IEEE, 2018.
[67] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, IEEE, 2018.
[68] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in ICASSP 23, 2023.
[69] D. Häfner, “Isoneutral mixing.” [Accessed April 11, 2023].
[70] A. M. Rush, “Torch-Struct: Deep structured prediction library,” 2020.
[71] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” CoRR, vol. abs/1710.03740, 2017.
[72] NVIDIA Corporation, “NVIDIA Tesla A100 GPU,” 2023. [Accessed April 11, 2023].
[73] AMD, Inc., “AMD Instinct MI210 GPU,” 2023. [Accessed April 11, 2023].
[74] PyTorch, “TorchScript,” 2023. [Accessed April 11, 2023].
[75] PyTorch, “Torch-TensorRT,” 2023. [Accessed April 11, 2023].
[76] PyTorch, “TorchDynamo,” 2023. [Accessed April 11, 2023].
[77] PyTorch, “TorchInductor,” 2023. [Accessed April 11, 2023].
[78] NVIDIA Corporation, “NVIDIA TensorRT,” 2023. [Accessed April 11, 2023].
[79] PyTorch, “PyTorch 2.0,” 2023. [Accessed April 11, 2023].
[80] AMD, “AMD ROCm,” 2023. [Accessed April 11, 2023].
[81] NVIDIA, “NVIDIA CUDA Toolkit,” 2023. [Accessed April 11, 2023].
[82] PyTorch, “PyTorch Profiler,” 2023. [Accessed April 11, 2023].
[83] HuggingFace, “HuggingFace Transformers,” 2023. [Accessed April 11, 2023].
[84] NVIDIA Corporation, “NVIDIA Nsight Compute,” 2022. [Accessed April 11, 2023].
[85] NVIDIA Corporation, “NVIDIA Nsight Systems,” 2022. [Accessed April 11, 2023].
[86] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, “HPCToolkit: Tools for performance analysis of optimized parallel programs,” Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685–701, 2010.
[87] K. Zhou, Y. Hao, J. Mellor-Crummey, X. Meng, and X. Liu, “GVProf: A value profiler for GPU-based clusters,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, IEEE, 2020.
[88] K. Zhou, Y. Hao, J. Mellor-Crummey, X. Meng, and X. Liu, “ValueExpert: Exploring value patterns in GPU-accelerated applications,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 171–185, 2022.
[89] AMD, “Radeon GPU Profiler,” 2022. [Accessed April 11, 2023].
[90] S. S. Shende and A. D. Malony, “The TAU parallel performance system,” The International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.
[91] A. Knüpfer, C. Rössel, D. Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer, M. Gerndt, D. Lorenz, A. Malony, W. Nagel, Y. Oleynik, P. Philippen, P. Saviankou, D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf, “Score-P: A joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir,” in Competence in High Performance Computing 2011, pp. 79–91, Springer Berlin Heidelberg, 2012.
[92] Y. Hao, N. Jain, R. Van der Wijngaart, N. Saxena, Y. Fan, and X. Liu, “DrGPU: A top-down profiler for GPU applications,” in Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering, pp. 43–53, 2023.
[93] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “DAWNBench: An end-to-end deep learning benchmark and competition,” Training, vol. 100, no. 101, p. 102, 2017.
[94] W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng, J. Dai, Z. Cao, et al., “AIBench: An industry standard internet service AI benchmark suite,” arXiv preprint arXiv:1908.08998, 2019.
[95] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, “Fathom: Reference workloads for modern deep learning methods,” in 2016 IEEE International Symposium on Workload Characterization (IISWC), pp. 1–10, IEEE, 2016.
[96] Baidu Research, “DeepBench,” 2023. [Accessed April 11, 2023].
[97] S. Dong and D. Kaeli, “DNNMark: A deep neural network benchmark suite for GPUs,” in Proceedings of the General Purpose GPUs, pp. 63–72, 2017.
[98] The TensorFlow team, “TensorFlow Profiler,” 2022. [Accessed April 11, 2023].
[99] OneFlow Inc., “DLPerf,” 2022. [Accessed April 11, 2023].
[100] The Apache MXNet team, “MXNet Profiler,” 2022. [Accessed April 11, 2023].
[101] Intel, “Intel oneAPI,” 2022. [Accessed April 11, 2023].
[102] Google, “Google XNNPACK,” 2022. [Accessed April 11, 2023].
[103] Google, “Google Neural Networks API,” 2022. [Accessed April 11, 2023].
[104] R. M. Larsen and T. Shpeisman, “TensorFlow graph optimizations,” 2019.
[105] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[106] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[107] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[108] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neural networks with low precision multiplications,” arXiv preprint arXiv:1412.7024, 2014.
[109] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
[110] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
[111] W. Niu, J. Guan, Y. Wang, G. Agrawal, and B. Ren, “DNNFusion: Accelerating deep neural networks execution with advanced operator fusion,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 883–898, 2021.
[112] A. Ashari, S. Tatikonda, M. Boehm, B. Reinwald, K. Campbell, J. Keenleyside, and P. Sadayappan, “On optimizing machine learning workloads via kernel fusion,” ACM SIGPLAN Notices, vol. 50, no. 8, pp. 173–182, 2015.
[113] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss, “Resource elasticity for large-scale machine learning,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 137–152, 2015.
[114] A. Sujeeth, H. Lee, K. Brown, T. Rompf, H. Chafi, M. Wu, A. Atreya, M. Odersky, and K. Olukotun, “OptiML: An implicitly parallel domain-specific language for machine learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 609–616, Citeseer, 2011.
[115] A. Sabne, “XLA: Compiling machine learning for peak performance,” 2020.
[116] Facebook Research, “AITemplate,” 2022. [Accessed April 11, 2023].
[117] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken, “TASO: Optimizing deep learning computation with automatic generation of graph substitutions,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 47–62, 2019.