TorchBench: Benchmarking PyTorch with High API Surface Coverage

Yueming Hao (yhao24@ncsu.edu), North Carolina State University, Raleigh, North Carolina, USA
Xu Zhao (xzhao9@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
Bin Bao (binbao@meta.com), Meta Platforms, Inc., Menlo Park, California, USA
David Berard, Will Constable, and Adnan Aziz, Meta Platforms, Inc., Menlo Park, California, USA
Xu Liu (xliu88@ncsu.edu), North Carolina State University, Raleigh, North Carolina, USA
both software and hardware evolution. MLPerf [35] is the state-of-the-art benchmark suite for deep learning workloads. However, existing deep learning benchmark suites, including MLPerf, aim to compare the performance of deep learning models running on different hardware and frameworks. They usually include a small number of deep learning models (e.g., MLPerf has only eight models) and cover a small PyTorch API surface, which fails to identify many PyTorch performance bugs or to fairly evaluate the performance impact of patches.

1.1 Motivating Examples
We show two examples to motivate the necessity of a comprehensive benchmark suite for PyTorch.

Misleading performance characterization. Understanding the performance difference across various architectures is one of the major tasks for benchmarking. However, without a comprehensive benchmark suite, such characterization can result in misleading conclusions. For example, some studies [36] show that AMD GPUs outperform NVIDIA GPUs on PyTorch, and other studies [37, 38] show the opposite conclusion. Thus, a comprehensive benchmark suite is necessary for performance characterization. Unlike existing studies, we show a different conclusion and unique insights in Section 3.3.

Missing performance bugs. Performance bugs can be buried deep in complex PyTorch code bases, especially on cold execution paths. We identify a recent performance bug in PyTorch caused by inappropriate error handling. PyTorch implements an error handling mechanism named c10_Exception to deal with runtime errors. It prints out backtraces for all errors and uses std::string to generate error messages. Since error handling is usually on the cold path, it does not incur any performance degradation in the models of existing benchmark suites. However, we find that this error handling slows down quantized models by 10x. Our further investigation shows that quantized models heavily call the torch.ops API, which frequently throws a benign error, "Not Implemented Functions", resulting in a large error-handling overhead. This issue has been confirmed by the PyTorch team, and c10_Exception has been reverted to the previous implementation. Thus, a benchmark suite that covers a large PyTorch API surface is necessary to expose performance bugs.
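The cost pattern behind this bug can be reproduced outside of PyTorch. The following standalone Python sketch is ours, not the c10 implementation, and the operator and flag names are hypothetical; it contrasts a fallback path selected by raising and catching a benign exception on every call (as the quantized models effectively did through torch.ops) with the same fallback selected by a cheap check.

import time

class NotImplementedOp(Exception):
    """Stand-in for a benign 'not implemented' error raised on a hot path."""

def op_without_fast_path(x):
    # Simulates an operator whose specialized implementation is missing:
    # it always raises, and the caller falls back after catching the error.
    raise NotImplementedOp("no fast path")

def dispatch_with_exception(x):
    try:
        return op_without_fast_path(x)
    except NotImplementedOp:
        return x + 1  # generic fallback

def dispatch_with_check(x, has_fast_path=False):
    # Same generic fallback, but selected by a cheap flag check instead of an exception.
    if has_fast_path:
        return x
    return x + 1

def bench(fn, iters=1_000_000):
    start = time.perf_counter()
    acc = 0
    for _ in range(iters):
        acc = fn(acc)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"exception-based fallback: {bench(dispatch_with_exception):.3f} s")
    print(f"check-based fallback:     {bench(dispatch_with_check):.3f} s")

The gap widens further when, as in the reported bug, each raised error also builds a backtrace and formats an error string.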
1.2 Paper Contribution
In this paper, we develop TorchBench, a novel benchmark suite for PyTorch, to address the aforementioned challenges. Unlike existing approaches, TorchBench embraces a large number of models, which cover a large PyTorch API surface. Given the voluminous body of deep learning models, we carefully include representative models in TorchBench for both fairness and generality. Additionally, we develop a set of tools associated with TorchBench that enable TorchBench to (1) run benchmarks with different configurations, (2) collect various performance statistics, and (3) be ready for any continuous integration system. TorchBench helps fix many performance bugs in the PyTorch ecosystem, and many of the fixes have been accepted by the official PyTorch repository. In summary, TorchBench makes the following contributions.
• TorchBench is the first PyTorch benchmark suite that consists of rich models in different domains. It covers 2.3x more of the PyTorch API surface compared to the state-of-the-art MLPerf benchmark suite.
• TorchBench integrates a set of built-in tools, which configure execution environments and collect performance statistics. TorchBench is able to report multiple metrics to thoroughly characterize PyTorch performance.
• TorchBench demonstrates two use cases in practice. First, TorchBench exposes performance bugs in the PyTorch software stack and provides insights for bug fixing. Second, TorchBench is configured as part of the continuous integration of the PyTorch repository to understand the performance regression of daily committed patches.
The rest of the paper is organized as follows. Section 2 describes the models, benchmark adaptation, and configurations in our benchmark suite. Section 3 characterizes TorchBench from different aspects, including execution time breakdown, different PyTorch compilers, and different GPU platforms. Section 4 shows the practical usage of TorchBench in detecting performance inefficiencies in the PyTorch software stack. Section 5 reviews related work and distinguishes our approach. Section 6 presents our conclusions.

2 TORCHBENCH SUITE
A huge number of deep learning models have been developed by the PyTorch community. For example, the Hugging Face platform alone hosts more than 10,000 models [39]. Thus, it is challenging for TorchBench to include representative models that cover the performance behaviors of typical PyTorch use cases. We work closely with machine learning engineers and set the following criteria to select models for TorchBench.
• Classic models. TorchBench includes models that were developed many years ago but have been proven both useful and impactful, such as ResNet [4], VGG16 [40], and MobileNet [5, 41]. These models serve as the foundation of many state-of-the-art models.
• Popular models. TorchBench includes popular models released in recent years. These models attract extensive attention in the community, enabling many academic research papers and applications. They include pig2 [42], T5 [43], Yolo [44], and docTR [45].
• Important models in industry. Industrial companies such as Meta and Google release important models which are used in their products. These models include Detectron2 [46] and Bert [47].
• Diverse models. TorchBench includes models from different domains to ensure a fair comparison. Moreover, models with different weight layers and different implementations are included.
In the rest of this section, we describe the benchmarks, set the study scope, and distinguish TorchBench from existing approaches.

2.1 Benchmark Description
Table 1 overviews all the benchmarks in TorchBench, which consists of 84 deep learning models and covers six domains. A model name may consist of a prefix and/or a suffix. The prefix, such as d2 (Detectron2), hf (Hugging Face),
and timm, indicates the collection or platform the model comes from. The suffix indicates different configurations. Specifically, c4 means using a conv4 backbone; FPN means using a Feature Pyramid Network backbone; dc5 means using a conv5 backbone with dilations in conv5; large means that the model takes more parameters. Due to potential privacy concerns, we rename several models, including hf_public_text_generator1 (hf_ptg1), hf_public_text_generator1_large (hf_ptg1_large), public_image_generator1 (pig1), and public_image_generator2 (pig2).

Computer Vision. Computer vision is one of the most important domains that embrace deep learning. We further categorize models into different subareas.
• Image classification, which categorizes and labels images. TorchBench includes 20 models in this domain, such as ResNet and its variants, MobileNet_v2, and various models from the model collection timm [48].
• Object detection, which detects all instances of predefined classes and provides axis-aligned boxes to locate the detected objects. TorchBench includes 12 models in this domain, including FasterRCNN and MaskRCNN atop Detectron2 and many other models.
• Image generation, which takes texts or images as inputs and generates new images. TorchBench includes pig2 [42], a state-of-the-art diffusion model that can create realistic images and art from a description in natural language, and multiple GAN [49]-based models such as pig1 [50] and CycleGAN [51, 52].
• Image segmentation, which partitions an image into multiple segments to locate objects and their boundaries. TorchBench includes YOLOv3 [44] and a series of MaskRCNN models [53] implemented atop Detectron2.
• Pattern Recognition and Video Interpolation. TorchBench includes background matting [54] in pattern recognition, which separates the foreground elements from the background of an image or video and composites them into a new background. TorchBench also includes Super SloMo [55], which reconstructs high-resolution slow-motion videos by alignment and appearance estimation.

Natural Language Processing (NLP). NLP is a series of algorithms and techniques enabling computers to understand human language. TorchBench includes NLP models for the tasks of language modeling and translation.
• Language Modeling, which predicts the probability distribution of words in a language and models the relationships and dependencies among the words. TorchBench includes the open-sourced hf_ptg1 [56] from the Hugging Face platform and hf_ptg1_large with more parameters. Other models such as Bart [57], Bert [47], T5 [43], and BigBird [58] are included to cover NLP domains such as natural language generation, translation, and comprehension.
[Figure 1 (chart omitted): per-model stacked bars showing the execution time breakdown for training of TorchBench models; y-axis: Ratio; segments: GPU Idleness, Data Movement, GPU Active.]
[Figure 2 (chart omitted): the same per-model execution time breakdown for inference of TorchBench models; y-axis: Ratio.]
• TorchBench benchmarks the computation phase of DL models, while MLPerf benchmarks the end-to-end execution of the models.
• TorchBench benchmarks PyTorch only, while MLPerf benchmarks different deep learning frameworks.
• TorchBench includes 84 DL models in six domains, while MLPerf has only five models in five domains with PyTorch. TorchBench covers 2.3x more PyTorch APIs than MLPerf.
Recently, TorchBench has evolved to embrace end-to-end models and to support frameworks beyond PyTorch (e.g., JAX). However, this evolution is in a preliminary stage and out of the scope of this paper.

3 TORCHBENCH CHARACTERIZATION
TorchBench enables comprehensive characterization of PyTorch. Given the page limit, we show the insights obtained from three characterization efforts: (1) characterizing PyTorch performance on NVIDIA GPUs (Section 3.1), (2) characterizing PyTorch performance for different compiler backends (Section 3.2), and (3) comparing PyTorch performance between NVIDIA and AMD GPUs (Section 3.3).
We benchmark TorchBench with the PyTorch 2.0-20230102 nightly release linked with CUDA 11.7 [24]. Our experiments are done on one NVIDIA A100 GPU with 40 GB memory. Experiments in Section 3.3 also include data obtained from an AMD MI210 GPU with 64 GB memory.

3.1 Characterizing PyTorch Computation on GPU
We choose execution time as our main metric for GPU utilization because it is the most straightforward and common metric to measure model performance.
Figures 1 and 2 show the characterization results for the training and inference tasks of TorchBench models, respectively. Each bar in the figures is composed of three segments: blue for the time that the GPU is active for computation, red for the time used in data movement between CPU and GPU, and grey for the time that the GPU is idle. We normalize them as portions of the total execution time of the models. From the figures, we observe that PyTorch keeps the GPU busy for only 56.8% and 55.4% of total execution time for training and inference, respectively. GPU idleness and CPU-GPU data movement account for a substantial portion of the time, preventing PyTorch from achieving full GPU usage. Table 2 further quantifies the time decomposition averaged across models in different domains. We obtain the following insights for further investigation.
Table 2: The breakdown ratios (%) of model execution time for different deep learning tasks.

Task                   | Train: GPU activeness | Train: Data movement | Train: GPU idleness | Inference: GPU activeness | Inference: Data movement | Inference: GPU idleness
Computer Vision        | 53.1 | 2.1 | 44.8 | 62.8 | 1.4 | 35.7
NLP                    | 84.9 | 1.3 | 13.8 | 64.7 | 0.8 | 34.5
Recommendation         | 75.4 | 0.4 | 24.2 | 51.4 | 0.1 | 48.5
Reinforcement Learning | 10.2 | 5.0 | 84.8 | 19.3 | 8.4 | 72.3
Speech                 | 28.8 | 0.3 | 70.9 | 50.3 | 0.3 | 49.4
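The GPU-active and data-movement components of such a breakdown can be approximated with PyTorch Profiler [82]. The sketch below is ours and uses a toy model rather than a TorchBench benchmark; it sums self GPU time over kernel events and over Memcpy events for a timed window and treats the remaining wall-clock time as a rough estimate of idleness.

import time
import torch
from torch.profiler import profile, ProfilerActivity

def breakdown(model, cpu_inputs, steps=20):
    torch.cuda.synchronize()
    wall_start = time.perf_counter()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            for _ in range(steps):
                model(cpu_inputs.cuda())   # host-to-device copy + GPU kernels
        torch.cuda.synchronize()
    wall_us = (time.perf_counter() - wall_start) * 1e6
    kernel_us, copy_us = 0.0, 0.0
    for evt in prof.key_averages():
        if "memcpy" in evt.key.lower():
            copy_us += evt.self_cuda_time_total     # CPU<->GPU data movement
        else:
            kernel_us += evt.self_cuda_time_total   # GPU compute kernels
    idle_us = max(wall_us - kernel_us - copy_us, 0.0)  # rough estimate only
    return kernel_us / wall_us, copy_us / wall_us, idle_us / wall_us

if __name__ == "__main__":
    assert torch.cuda.is_available()
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
    x = torch.randn(64, 1024)                        # kept on the CPU on purpose
    active, move, idle = breakdown(model, x)
    print(f"GPU active {active:.1%}, data movement {move:.1%}, idle {idle:.1%}")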
[Figure 3 (chart omitted): per-model bars of the ratios TInductor/TOrigin (execution time, T), CMInductor/CMOrigin (CPU memory, CM), and GMInductor/GMOrigin (GPU memory, GM) for training; y-axis: Comparison.]
Figure 3: The comparisons of execution time (T), CPU memory usage (CM), and GPU memory usage (GM) for training between
original PyTorch and PyTorch compiled by TorchInductor. < 1 means TorchInductor performs better, while > 1 means original
PyTorch compiler performs better.
[Figure 4 (chart omitted): per-model bars of the ratios TInductor/TOrigin, CMInductor/CMOrigin, and GMInductor/GMOrigin for inference; y-axis: Comparison.]
Figure 4: The comparisons of execution time (T), CPU memory usage (CM), and GPU memory usage (GM) for inference between
original PyTorch and PyTorch compiled by TorchInductor. < 1 means TorchInductor performs better, while > 1 means original
PyTorch compiler performs better.
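The CPU and GPU memory metrics in Figures 3 and 4 can be collected per run with standard facilities; the sketch below is ours and only illustrates one way to do it (torch.cuda.max_memory_allocated covers only PyTorch's caching allocator, and ru_maxrss is the process-wide peak RSS, reported in kilobytes on Linux).

import resource
import torch

def peak_memory(run_fn):
    # Reset the CUDA peak counter, run the workload, and read back both peaks.
    torch.cuda.reset_peak_memory_stats()
    run_fn()
    torch.cuda.synchronize()
    gpu_peak_bytes = torch.cuda.max_memory_allocated()
    cpu_peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # Linux: KiB
    return gpu_peak_bytes, cpu_peak_kb

if __name__ == "__main__":
    assert torch.cuda.is_available()
    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(256, 4096, device="cuda")
    gpu_peak, cpu_kb = peak_memory(lambda: model(x).sum().backward())
    print(f"peak GPU memory: {gpu_peak / 2**20:.1f} MiB, peak RSS: {cpu_kb / 1024:.1f} MiB")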
Insights for execution time decomposition. For both training and inference, models in computer vision, NLP, and recommendation yield over 50% GPU active time, among which NLP models achieve >80% in training. Models in these domains usually have large input sizes and intensive computation to be offloaded to the GPU. In contrast, Reinforcement Learning (RL) models achieve the smallest GPU active time for both training and inference. RL models need to interact with an environment, a component not based on PyTorch, which limits the parallelism of RL models. Compared with NLP models, RL models have smaller inputs and less intensive computation in each batch, so they usually incur more GPU idleness.
Insights for performance difference between training and inference. Some models perform better on training, while others perform better on inference. There are three reasons. First, training may use different input sizes from inference, so PyTorch invokes different GPU kernels. Second, some functions in training require higher precision than inference. Third, PyTorch may invoke different GPU kernels for training and inference even when they have the same input. For example, the GPU active time of fambench_xlmr in training is 98.0% but only 44.7% for inference with the same input. With further investigation, we found that this model uses FP32 for training but FP16 for inference by default. For the same forward
phase, FP16 GPU kernels run faster, so the GPU finishes its computation earlier, resulting in a larger portion of idleness.
Discussion for individual models. pig2 is one of the outliers; it spends 52% of its execution time on data movement. With further investigation, we find that, in order to save GPU memory, it always keeps one neural network structure on the GPU and offloads all other structures to the CPU. After the computations, it copies all structures back to the GPU. This kind of ping-pong data movement wastes a lot of time.
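The cost of this pattern is easy to see in isolation. The sketch below is ours and does not reproduce pig2's actual offloading logic; it simply times the same matrix multiplication with the weight resident on the GPU versus moved to the CPU and back around every use.

import time
import torch

def run(offload, steps=50):
    weight = torch.randn(4096, 4096, device="cuda")
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        if offload:
            weight = weight.cpu()    # offload to host to "save" GPU memory
            weight = weight.cuda()   # copy back right before use
        y = x @ weight
    torch.cuda.synchronize()
    return time.perf_counter() - start

if __name__ == "__main__":
    assert torch.cuda.is_available()
    print(f"weight resident on GPU: {run(offload=False):.3f} s")
    print(f"CPU<->GPU ping-pong:    {run(offload=True):.3f} s")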
These insights motivate the necessity of optimizing PyTorch to remedy the performance losses caused by GPU idleness and data movement. We elaborate on our optimization efforts with the help of TorchBench in Section 4.1. It is worth noting that high GPU active time does not mean there is no room for performance improvement. For example, the GPU active ratio for vgg16 is 98.3%, while its achieved TFLOPS is about 10.07, which still has a gap from the peak performance. This is caused by GPU kernel inefficiencies, such as device memory access delays, instruction dependency delays, shared memory bank conflicts, and many other pipeline stall reasons. Further characterization of TFLOPS is out of scope due to the page limit.

3.2 Characterizing PyTorch Compilers
Besides the default model interpreter in eager mode, PyTorch provides multiple model compilers (also known as model backends) in graph mode, such as TorchScript [74], Torch-TensorRT [75], TorchDynamo [76], and TorchInductor [77]. TorchScript is a classic Just-In-Time (JIT) model compiler that traces and optimizes model code. Torch-TensorRT is an Ahead-of-Time (AOT) compiler, which utilizes the NVIDIA TensorRT [78] deep learning optimizer and runtime to compile models before deployment. TorchDynamo compiles arbitrary Python code into graphs, which can be further compiled. TorchInductor compiles the graphs generated by TorchDynamo into optimized C++/Triton kernels. The combination of TorchDynamo and TorchInductor is the latest and recommended JIT compiler for PyTorch 2.0 [79]. We use TorchInductor to denote this combination in this paper.
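For reference, PyTorch 2.0 exposes this TorchDynamo+TorchInductor combination through the torch.compile entry point [79]. The sketch below is ours and uses a toy model rather than a TorchBench benchmark; the first call triggers JIT compilation and later calls reuse the compiled kernels.

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
).cuda()

# TorchDynamo captures the Python code into graphs; TorchInductor (the default
# "inductor" backend) lowers those graphs into fused Triton/C++ kernels.
compiled_model = torch.compile(model, backend="inductor")

x = torch.randn(64, 1024, device="cuda")
with torch.no_grad():
    y_eager = model(x)               # eager-mode reference
    y_compiled = compiled_model(x)   # compiled execution
print(torch.allclose(y_eager, y_compiled, atol=1e-4))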
Figures 3 and 4 compare the performance between TorchInductor and the default PyTorch compiler. We measure three metrics: execution time, CPU memory consumption, and GPU memory consumption. It is worth noting that we do not show every benchmark in TorchBench because TorchInductor is still in its early stage and does not fully support all PyTorch APIs. From the figures, we observe that TorchInductor typically improves the execution time over the default compiler, with 1.30x and 1.46x speedups for training and inference on average (geomean). Moreover, TorchInductor significantly reduces the CPU memory consumption, by 71.2% and 73.7% for training and inference, but generally increases the demand for GPU memory, by 31.2% and 51.1% for training and inference. Specifically, many models suffer GPU memory bloat as high as >5x compared to the default PyTorch compiler.
TorchInductor obtains speedups mainly from three techniques. First, it fuses GPU kernels and utilizes Triton to generate faster kernels. For example, fusing two subsequent functions can eliminate intermediate computations and memory load and store operations. Second, it reorders the resource-intensive nodes in the graph to make the tradeoff between performance and resource usage. Third, it determines which buffers can be reused and when each node should be executed according to the data dependencies. To apply these techniques, TorchInductor has to use its own memory cache allocators and record graph details, which results in larger GPU memory footprints. Since many intermediate operations have been removed, the CPU memory usage is reduced significantly. We have reported the large GPU memory usage to TorchInductor developers, who confirm this performance issue and promise to fix it in the next release.

Outlier discussion. Inferencing yolov3 and hf_Reformer shows a significant slowdown with TorchInductor because of the high just-in-time (JIT) compilation overhead introduced by TorchInductor. For most models, TorchInductor only needs to JIT compile once and then uses the jitted model in the following iterations. However, models such as hf_Reformer incur many guard checks in TorchInductor, which guarantee correct execution but cause high overhead. For example, hf_Reformer incurs 2699 guard checks, and 30% of them are heavy guard checks such as dictionary key checks. We have confirmed this performance issue with TorchInductor developers.

Insights. TorchInductor typically accelerates both training and inference while consuming more GPU memory compared to the default compiler. However, TorchInductor is not suitable for models running on GPUs with limited memory unless further configurations are applied, such as changing batch sizes or using quantization.

3.3 PyTorch on NVIDIA vs. AMD GPUs
PyTorch supports various types of GPUs. In this section, we compare the performance of the NVIDIA A100 and the AMD MI210, which are competing products in the market. We compare their performance on TorchBench with their respective software stacks: ROCm 5.4.2 [80] from AMD and CUDA 11.8 [81] from NVIDIA. We run the default 32-bit configuration of TorchBench on both GPUs for a fair comparison. We test PyTorch stable 2.0.1 on both platforms. Table 3 compares the peak theoretical TFLOPS for different floating point number formats on these two GPUs. Theoretically, the MI210 has higher peak performance than the A100 in FP32 and FP64 computation. However, both the A100 and the MI210 have unique features that add uncertainty to the TorchBench comparison. For example, FP32-Matrix and FP64-Matrix are optimized matrix operations for FP32 and FP64 that are unique to AMD GPUs, yielding high TFLOPS. TF32 is a 32-bit floating point format unique to the A100, which yields high TFLOPS but with some accuracy loss; FP64-Tensor Core denotes FP64 operations uniquely accelerated by NVIDIA Tensor Cores.

Table 3: The peak theoretical TFLOPS for various floating point number formats on NVIDIA A100 and AMD MI210 GPUs.

GPU         | FP32 | TF32 | FP32-Matrix | FP64 | FP64-Matrix | FP64-Tensor Core
NVIDIA A100 | 19.5 | 156  | -           | 9.7  | -           | 19.5
AMD MI210   | 22.6 | -    | 45.3        | 22.6 | 45.3        | -

Figure 5 shows the comparison of execution time obtained from the AMD MI210 (T_AMD) and the NVIDIA A100 (T_NVIDIA). The ratio each bar represents in the figure is T_NVIDIA/T_AMD; <1 means NVIDIA A100 performs better, and >1 means AMD MI210 performs better.
[Figure 5 (chart omitted): per-model bars of the ratio T_NVIDIA/T_AMD for training and inference; y-axis: Comparison.]
Figure 5: Comparing the execution time for training and inference obtained from NVIDIA A100 and AMD MI210 GPUs. Each bar represents the ratio T_NVIDIA/T_AMD. Note that <1 means A100 performs better, while >1 means MI210 performs better.
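The per-model execution times behind ratios such as T_NVIDIA/T_AMD can be collected with a simple harness like the one below (ours, with a toy model; the torch.cuda namespace is also used by ROCm builds of PyTorch, so the same code runs on both GPUs). It warms up first so that one-time costs such as allocator growth are excluded, and synchronizes before reading the clock.

import time
import torch

def measure(model, inputs, warmup=5, iters=30):
    with torch.no_grad():
        for _ in range(warmup):
            model(inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(inputs)
        torch.cuda.synchronize()   # wait for all queued GPU work before stopping the clock
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    assert torch.cuda.is_available()
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU()).cuda()
    x = torch.randn(32, 3, 224, 224, device="cuda")
    print(f"mean iteration time: {measure(model, x) * 1e3:.2f} ms")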
Overall, we can find no single GPU that is best for all TorchBench models. For model inference, the AMD MI210 can achieve 1.46x the performance of the NVIDIA A100 on dlrm. At the opposite end, the NVIDIA A100 can yield a 12.57x speedup over the AMD MI210 on hf_ptg1. A similar situation appears in the training phase as well.

Insights. Our further investigation shows that models typically benefit from the A100 if most of their GPU kernels can use the TF32 format for computation, because TF32 can achieve much higher TFLOPS than FP32 and FP32-Matrix according to Table 3. However, not all models can use TF32, as TF32 incurs accuracy losses. For example, training most NLP models invokes the aten::matmul operator, which requires the use of FP32 since PyTorch 1.12. Similar operators include elementwise_add and elementwise_div with FP32 precision. In this case, the AMD MI210 performs better because it has a higher TFLOPS than the NVIDIA A100 on FP32.
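Whether a given FP32 workload may use TF32 is controlled by PyTorch flags; the sketch below is ours and only shows the knobs (since PyTorch 1.12, TF32 is disabled by default for matmul, which matches the aten::matmul behavior discussed above).

import torch

# Opt FP32 matmuls and cuDNN convolutions in or out of TF32 tensor-core execution.
torch.backends.cuda.matmul.allow_tf32 = True   # affects aten::matmul and friends
torch.backends.cudnn.allow_tf32 = True         # affects cuDNN convolutions
# High-level equivalent for matmul precision (PyTorch >= 1.12):
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c_tf32 = a @ b                                  # may run on TF32 tensor cores (e.g., A100)

torch.backends.cuda.matmul.allow_tf32 = False
c_fp32 = a @ b                                  # strict FP32 on CUDA cores
print((c_tf32 - c_fp32).abs().max())            # the small accuracy loss TF32 trades for speed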
4 APPLYING TORCHBENCH IN PRACTICE
We have applied TorchBench to guide performance optimization across the entire PyTorch software stack, including DL models, the PyTorch framework, and GPU acceleration libraries. We describe two ways to use TorchBench. First, we analyze each model in TorchBench to understand and optimize GPU idleness and data movement at the source code level. We have identified five performance issues; three optimization patches have been upstreamed to PyTorch or model repositories, and two optimization patches are confirmed and under discussion for upstreaming. Second, we have configured the continuous integration (CI) service of the PyTorch repository to include TorchBench. We perform a daily sanity check on the performance regression of every nightly release. We have identified seven commits that incur unexpected slowdowns to multiple TorchBench benchmarks. Among these problematic commits, five are reverted and two are merged with optimizations. In the remainder of this section, we elaborate on both use cases.

4.1 PyTorch Optimization with TorchBench
We optimize the entire PyTorch software stack to improve the GPU utilization characterized in Section 3.1. We use PyTorch Profiler [82] to understand GPU idleness and data movement.

4.1.1 Minimizing GPU Idleness. A GPU is said to be idle when there is no work scheduled on it. This is the grossest of inefficiencies because it means the precious GPU computation resource is wasted. As shown in Figures 1 and 2, TorchBench exposes significant GPU idleness in a majority of models. With the help of PyTorch Profiler, we are able to pinpoint GPU idleness in both the model and framework layers of the PyTorch software stack.

def zero_grad():
    ...
    # pddg: per_device_and_dtype_grads
+   pddg = defaultdict(lambda: defaultdict(list))
    for group in self.param_groups:
        for p in group['params']:
            ...
-           if (not foreach or p.grad.is_sparse):
-               p.grad.zero_()
-           else:
-               pddg[p.grad.device][p.grad.dtype].append(p.grad)
+           if not p.grad.is_sparse and p.grad.is_cuda:
+               pddg[p.grad.device][p.grad.dtype].append(p)
+           else:
+               p.grad.zero_()
-   if foreach:
+   if foreach or pddg:
        for _, per_dtype_grads in pddg.items():
            for grads in per_dtype_grads.values():
                torch._foreach_zero_(grads)
Listing 2: zero_grad sets all gradients to zeros serially.

Listing 2 shows an example in the zero_grad method of the PyTorch optimizer, which sets all gradients to zeros in each training iteration. In this method, p.grad.zero_ is invoked in a loop nest to set zeros for each gradient, which incurs a series of tiny GPU kernels. A significant amount of GPU idleness occurs in between these kernels. This inefficiency has been confirmed by PyTorch developers. We propose a fix as shown in Listing 2. We create a temporary list to maintain references to all the gradients and utilize torch._foreach_zero_ to set them to zeros with one GPU kernel. This optimization avoids GPU idleness due to waiting for kernel launches.
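The effect of replacing many tiny zeroing kernels with one batched call, as in the Listing 2 patch, can be reproduced in isolation. The sketch below is ours; it uses the same private torch._foreach_zero_ API that the patch uses, on a synthetic list of gradient-sized tensors.

import time
import torch

def zero_one_by_one(grads):
    for g in grads:
        g.zero_()                    # one small kernel launch per tensor

def zero_batched(grads):
    torch._foreach_zero_(grads)      # a single batched (foreach) launch for all tensors

def bench(fn, grads, iters=100):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(grads)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

if __name__ == "__main__":
    assert torch.cuda.is_available()
    grads = [torch.randn(1024, device="cuda") for _ in range(500)]  # many small "gradients"
    print(f"per-tensor zero_: {bench(zero_one_by_one, grads):.2f} ms/iter")
    print(f"_foreach_zero_:   {bench(zero_batched, grads):.2f} ms/iter")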
[Bar chart (residue only; its caption was not recovered): y-axis Speedup, roughly 1.0 to 1.4, over a set of TorchBench models.]

[Listing 3, partially recovered:]
        torch.tensor(self.attention_head_size,
                     device=vectors.device, dtype=vectors.dtype)
    )
    return vectors
Listing 3: Model hf_reformer calls the function torch.rsqrt() to calculate the reciprocal of the square-root of the variable self.attention_head_size.
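The optimization named in the caption, multiplying by torch.rsqrt(d) instead of dividing by torch.sqrt(d), can be checked in isolation. The sketch below is ours and does not reproduce the hf_reformer code; the tensor shapes are arbitrary.

import torch

attention_head_size = 64
vectors = torch.randn(8, 128, attention_head_size, device="cuda")

d = torch.tensor(attention_head_size, device=vectors.device, dtype=vectors.dtype)
scaled_div = vectors / torch.sqrt(d)        # divide by sqrt(d)
scaled_rsqrt = vectors * torch.rsqrt(d)     # multiply by the reciprocal square root

print(torch.allclose(scaled_div, scaled_rsqrt, atol=1e-6))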
5.1 Benchmarking Deep Learning Workloads
MLPerf [35] is one of the most popular deep learning benchmark suites, with contributors across major industrial stakeholders. MLPerf embraces eight models to benchmark end-to-end execution of training and inference. MLPerf is used to evaluate model performance with different deep learning frameworks running on different hardware. DAWNBench [93] and AIBench [94] follow a similar design and goal with their end-to-end benchmarking. Additionally, microbenchmark suites such as Fathom [95], DeepBench [96], DNNMark [97], and AIBench include multiple computation kernels (aka operators) widely used in deep learning workloads. These microbenchmarks configure the operators with different inputs to understand how these operators behave on different hardware.
TorchBench has a completely different design goal from existing benchmark suites. TorchBench aims to expose performance issues in the PyTorch repository along with the project's evolution. It obtains deep insights into PyTorch code bases, but does not focus on characterization across different deep learning frameworks.

5.2 Profiling Deep Learning Workloads
There are many profilers [86-88] that are able to understand performance inefficiencies in CPU-GPU applications, including deep learning frameworks. Since they are not specialized for deep learning applications, they require significant manual effort and domain knowledge to understand inefficiencies and devise actionable optimizations. Domain-specific profilers that target deep learning frameworks include PyTorch Profiler [82], TensorFlow Profiler [98], DLPerf [99], and MXNet Profiler [100], to name a few. These profilers pinpoint hotspots in both CPU and GPU computation kernels and associate them with deep learning operators. These profilers can be used together with TorchBench for better performance insights.

5.3 Optimizing Deep Learning Workloads
To obtain bare-metal performance for deep learning models on GPUs, a number of optimization methods have been proposed. Those optimization methods focus on two directions. One is domain- or task-specific highly tuned implementations. For example, the CUDA Deep Neural Network library (cuDNN) [30] released by NVIDIA is a GPU-accelerated library of primitives for deep neural networks. It has been widely applied in different deep learning frameworks. Intel oneAPI [101], Google's XNNPACK [102], and Android's NNAPI [103], among others, serve similar purposes. TorchScript [74] and Grappler [104] are two common Python frameworks that optimize deep learning graph passes by utilizing compiler optimization techniques such as constant folding. There are also other optimization techniques such as pruning [105-107], quantization [108-110], and operator fusion [111-114].
The other direction is optimizing compilers. XLA (Accelerated Linear Algebra) [115] and TorchDynamo [76] are two linear algebra code compilers for TensorFlow and PyTorch, respectively. They can accelerate deep learning models with potentially no source code changes. JAX [23] was recently released by Google to provide composable transformations of deep learning models. TVM [20] is a deep learning compiler that can autotune deep learning models for given hardware. AITemplate [116] fuses different operations in various hierarchies to remove unnecessary intermediate computations and memory usage as much as possible. TASO [117] reduces the strength of the computation by transforming it into an equivalent form of higher efficiency.
TorchBench complements these optimization techniques by providing a platform to evaluate all these existing optimization techniques and identify new optimization opportunities. TorchBench provides unique insights to enable optimization for performant PyTorch code bases.

6 CONCLUSIONS
This paper describes TorchBench, the first comprehensive benchmark suite for PyTorch. We show the unique insights obtained from TorchBench benchmarking the entire PyTorch software stack running on mainstream NVIDIA and AMD GPUs. Moreover, we show real use cases of applying TorchBench to guide code optimization and support regression testing. With the help of TorchBench, we are able to devise many optimization patches for PyTorch, and most of them have been upstreamed to the official PyTorch repository.

REFERENCES
[1] M. Haggag, A. S. Siam, W. El-Dakhakhni, P. Coulibaly, and E. Hassini, "A deep learning model for predicting climate-induced disasters," Natural Hazards, vol. 107, pp. 1009-1034, 2021.
[2] B. Y. El-Habil and S. S. Abu-Naser, "Global climate prediction using deep learning," Journal of Theoretical and Applied Information Technology, vol. 100, no. 24, 2022.
[3] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," 2016.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[5] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., "Searching for MobileNetV3," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314-1324, 2019.
[6] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2011.
[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[8] X. Wang and Y. Wang, "Improving content-based and hybrid music recommendation using deep learning," in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 627-636, 2014.
[9] J. Wei, J. He, K. Chen, Y. Zhou, and Z. Tang, "Collaborative filtering and deep learning based recommendation system for cold start items," Expert Systems with Applications, vol. 69, pp. 29-39, 2017.
[10] A. M. Elkahky, Y. Song, and X. He, "A multi-view deep learning approach for cross domain user modeling in recommendation systems," in Proceedings of the 24th International Conference on World Wide Web, pp. 278-288, 2015.
[11] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, "Deep learning-based classification of hyperspectral data," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2094-2107, 2014.
[12] M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou, "Lung pattern classification for interstitial lung diseases using a deep convolutional neural network," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1207-1216, 2016.
[13] Q. Rao and J. Frtunikj, "Deep learning for self-driving cars: Chances and challenges," in Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 35-38, 2018.
[14] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani, "Visual SLAM for automated driving: Exploring the applications of deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 247-257, 2018.
[15] T. Zhou, Q. Zhu, and J. Du, "Intuitive robot teleoperation for civil engineering operations with virtual reality and deep learning scene reconstruction," Advanced Engineering Informatics, vol. 46, p. 101170, 2020.
[16] T. Karácsony, J. P. Hansen, H. K. Iversen, and S. Puthusserypady, "Brain computer interface for neuro-rehabilitation with deep learning classification and virtual reality feedback," in Proceedings of the 10th Augmented Human International Conference 2019, pp. 1-8, 2019.
[17] K. M. Ruff and R. V. Pappu, "Alphafold and implications for intrinsically disordered proteins," Journal of Molecular Biology, vol. 433, no. 20, p. 167208, 2021.
[18] B. Ramsundar, P. Eastman, P. Walters, and V. Pande, Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More. O'Reilly Media, 2019.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678, 2014.
[20] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., "TVM: An automated end-to-end optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594, 2018.
[21] "ONNX." [Accessed April 11, 2023].
[22] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[23] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, "JAX: composable transformations of Python+NumPy programs," 2018.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 8024-8035, Curran Associates, Inc., 2019.
[25] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[26] PyTorch, "PyTorch commit statistics." [Accessed April 11, 2023].
[27] M. A. Research, "Paperswithcode." [Accessed April 11, 2023].
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[29] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, "Green AI," Commun. ACM, vol. 63, pp. 54-63, Nov. 2020.
[30] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[31] A. Prokopec, A. Rosà, D. Leopoldseder, G. Duboscq, P. Tma, M. Studener, L. Bulej, Y. Zheng, A. Villazón, D. Simon, et al., "Renaissance: Benchmarking suite for parallel applications on the JVM," in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 31-47, 2019.
[32] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, et al., "An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 3-18, 2019.
[33] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, et al., "The NAS parallel benchmarks," The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63-73, 1991.
[34] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44-54, IEEE, 2009.
[35] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, et al., "MLPerf training benchmark," Proceedings of Machine Learning and Systems, vol. 2, pp. 336-349, 2020.
[36] J. Yin, A. Tsaris, S. Dash, R. Miller, F. Wang, and M. A. Shankar, "Comparative evaluation of deep learning workloads for leadership-class systems," BenchCouncil Transactions on Benchmarks, Standards and Evaluations, vol. 1, no. 1, p. 100005, 2021.
[37] Y. Wang, Q. Wang, S. Shi, X. He, Z. Tang, K. Zhao, and X. Chu, "Benchmarking the performance and energy efficiency of AI accelerators for AI training," in 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 744-751, IEEE, 2020.
[38] N. Otterness and J. H. Anderson, "AMD GPUs as an alternative to NVIDIA for supporting real-time workloads," in 32nd Euromicro Conference on Real-Time Systems (ECRTS 2020), Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
[39] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, "Transformers: State-of-the-art natural language processing," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, (Online), pp. 38-45, Association for Computational Linguistics, Oct. 2020.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[41] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018.
[42] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with CLIP latents," 2022.
[43] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485-5551, 2020.
[44] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[45] Mindee, "docTR: Document text recognition." https://github.com/mindee/doctr, 2021.
[46] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2." https://github.com/facebookresearch/detectron2, 2019.
[47] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[48] R. Wightman, "PyTorch image models." https://github.com/rwightman/pytorch-image-models, 2019.
[49] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139-144, 2020.
[50] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789-8797, 2018.
[51] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
[52] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
[53] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, 2017.
[54] S. Sengupta, V. Jayaram, B. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman, "Background matting: The world is your green screen," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2291-2300, 2020.
[55] A. Paliwal and N. K. Kalantari, "Deep slow motion video reconstruction with hybrid imaging system," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 7, pp. 1557-1569, 2020.
[56] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[57] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
[58] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al., "Big Bird: Transformers for longer sequences," Advances in Neural Information Processing Systems, vol. 33, pp. 17283-17297, 2020.
[59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[60] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, "Deep learning recommendation model for personalization and recommendation systems," CoRR, vol. abs/1906.00091, 2019.
[61] O. Kuchaiev and B. Ginsburg, "Training deep autoencoders for collaborative filtering," arXiv preprint arXiv:1708.01715, 2017.
[62] D. Yarats, I. Kostrikov, and R. Fergus, "Image augmentation is all you need: Regularizing deep reinforcement learning from pixels," in International Conference on Learning Representations, 2021.
[63] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[64] Z. Huang, W. Heng, and S. Zhou, "Learning to paint with model-based deep reinforcement learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8709-8718, 2019.
[65] J. Li, X. Wang, Y. Li, et al., "The SpeechTransformer for large-scale Mandarin Chinese speech recognition," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7095-7099, IEEE, 2019.
[66] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783, IEEE, 2018.
[67] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783, IEEE, 2018.
[68] S. Rouard, F. Massa, and A. Défossez, "Hybrid transformers for music source separation," in ICASSP 23, 2023.
[69] D. Häfner, "Isoneutral mixing." [Accessed April 11, 2023].
[70] A. M. Rush, "Torch-struct: Deep structured prediction library," 2020.
[71] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," CoRR, vol. abs/1710.03740, 2017.
[72] NVIDIA Corporation, "NVIDIA Tesla A100 GPU," 2023. [Accessed April 11, 2023].
[73] AMD, Inc., "AMD Instinct MI210 GPU," 2023. [Accessed April 11, 2023].
[74] PyTorch, "TorchScript," 2023. [Accessed April 11, 2023].
[75] PyTorch, "Torch-TensorRT," 2023. [Accessed April 11, 2023].
[76] PyTorch, "TorchDynamo," 2023. [Accessed April 11, 2023].
[77] PyTorch, "TorchInductor," 2023. [Accessed April 11, 2023].
[78] NVIDIA Corporation, "NVIDIA TensorRT," 2023. [Accessed April 11, 2023].
[79] PyTorch, "PyTorch 2.0," 2023. [Accessed April 11, 2023].
[80] AMD, "AMD ROCm," 2023. [Accessed April 11, 2023].
[81] NVIDIA, "NVIDIA CUDA Toolkit," 2023. [Accessed April 11, 2023].
[82] PyTorch, "PyTorch Profiler," 2023. [Accessed April 11, 2023].
[83] HuggingFace, "HuggingFace Transformer," 2023. [Accessed April 11, 2023].
[84] NVIDIA Corporation, "NVIDIA Nsight Compute," 2022. [Accessed April 11, 2023].
[85] NVIDIA Corporation, "NVIDIA Nsight Systems," 2022. [Accessed April 11, 2023].
[86] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685-701, 2010.
[87] K. Zhou, Y. Hao, J. Mellor-Crummey, X. Meng, and X. Liu, "GVProf: A value profiler for GPU-based clusters," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16, IEEE, 2020.
[88] K. Zhou, Y. Hao, J. Mellor-Crummey, X. Meng, and X. Liu, "ValueExpert: Exploring value patterns in GPU-accelerated applications," in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 171-185, 2022.
[89] AMD, "Radeon GPU Profiler," 2022. [Accessed April 11, 2023].
[90] S. S. Shende and A. D. Malony, "The TAU parallel performance system," The International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-311, 2006.
[91] A. Knüpfer, C. Rössel, D. Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer, M. Gerndt, D. Lorenz, A. Malony, W. Nagel, Y. Oleynik, P. Philippen, P. Saviankou, D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf, "Score-P: A joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir," in Competence in High Performance Computing 2011, pp. 79-91, Springer Berlin Heidelberg, 2012.
[92] Y. Hao, N. Jain, R. Van der Wijngaart, N. Saxena, Y. Fan, and X. Liu, "DrGPU: A top-down profiler for GPU applications," in Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering, pp. 43-53, 2023.
[93] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "DAWNBench: An end-to-end deep learning benchmark and competition," Training, vol. 100, no. 101, p. 102, 2017.
[94] W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng, J. Dai, Z. Cao, et al., "AIBench: An industry standard internet service AI benchmark suite," arXiv preprint arXiv:1908.08998, 2019.
[95] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, "Fathom: Reference workloads for modern deep learning methods," in 2016 IEEE International Symposium on Workload Characterization (IISWC), pp. 1-10, IEEE, 2016.
[96] Baidu Research, "DeepBench," 2023. [Accessed April 11, 2023].
[97] S. Dong and D. Kaeli, "DNNMark: A deep neural network benchmark suite for GPUs," in Proceedings of the General Purpose GPUs, pp. 63-72, 2017.
[98] The TensorFlow team, "TensorFlow Profiler," 2022. [Accessed April 11, 2023].
[99] OneFlow Inc., "DLPerf," 2022. [Accessed April 11, 2023].
[100] The Apache MXNet team, "MXNet Profiler," 2022. [Accessed April 11, 2023].
[101] Intel, "Intel oneAPI," 2022. [Accessed April 11, 2023].
[102] Google, "Google XNNPACK," 2022. [Accessed April 11, 2023].
[103] Google, "Google Neural Networks API," 2022. [Accessed April 11, 2023].
[104] R. M. Larsen and T. Shpeisman, "TensorFlow graph optimizations," 2019.
[105] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," Advances in Neural Information Processing Systems, vol. 28, 2015.
[106] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," arXiv preprint arXiv:1608.08710, 2016.
[107] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," Advances in Neural Information Processing Systems, vol. 28, 2015.
[108] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.
[109] R. Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," arXiv preprint arXiv:1806.08342, 2018.
[110] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv preprint arXiv:1612.01064, 2016.
[111] W. Niu, J. Guan, Y. Wang, G. Agrawal, and B. Ren, "DNNFusion: Accelerating deep neural networks execution with advanced operator fusion," in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 883-898, 2021.
[112] A. Ashari, S. Tatikonda, M. Boehm, B. Reinwald, K. Campbell, J. Keenleyside, and P. Sadayappan, "On optimizing machine learning workloads via kernel fusion," ACM SIGPLAN Notices, vol. 50, no. 8, pp. 173-182, 2015.
[113] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss, "Resource elasticity for large-scale machine learning," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 137-152, 2015.
[114] A. Sujeeth, H. Lee, K. Brown, T. Rompf, H. Chafi, M. Wu, A. Atreya, M. Odersky, and K. Olukotun, "OptiML: An implicitly parallel domain-specific language for machine learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 609-616, Citeseer, 2011.
[115] A. Sabne, "XLA: Compiling machine learning for peak performance," 2020.
[116] Facebook Research, "AITemplate," 2022. [Accessed April 11, 2023].
[117] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken, "TASO: Optimizing deep learning computation with automatic generation of graph substitutions," in Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 47-62, 2019.