
Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms

Abid Malik (corresponding author), Micheal Lu, Nathenial Wang, Yeiwei Lin, and Shinjae Yoo
Computer Science Initiative
Brookhaven National Laboratory
amalik@bnl.gov

Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-SC with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. U.S. Government work not protected by U.S. copyright.

Abstract—Long training times for building high-accuracy deep neural networks (DNNs) are impeding research into new DNN architectures. For example, training GoogleNet with the ImageNet dataset on a single Nvidia K20 GPU takes almost 25 days. Therefore, there is a great need in the AI community to speed up the training phase, especially when using a large dataset. For this, we need Distributed Deep Neural Networks (DDNNs) that can scale well with more computation resources. However, this involves two challenges. First, the deep learning framework or training library must support inter-node communication. Second, the user must modify the code to take advantage of the inter-node communication. The changes to the code can be minimal or significant, depending upon the user's expertise in distributed systems. Current DNN frameworks support distributed learning using MPI. However, these frameworks come with poorly understood overheads associated with communication and data management. Tensorflow provides APIs for distributed learning using the MPI programming model and gRPC. These APIs are not easy for a domain expert to use for designing an efficient distributed learning model. Recently, Uber Inc. provided the Horovod framework, which gives a fast and easy way to support distributed learning using Tensorflow, Pytorch, and Keras. In this paper we provide a detailed performance analysis of distributed Tensorflow using Horovod. We implemented distributed learning for AlexNet, GoogleNet, and ResNet50 using Horovod. We used Nvidia K40, K80, and P100 GPUs for our experimentation. We used synthetic image data with different runtime variables (batch size and number of GPUs). Our results show that the Horovod framework gives almost linear throughput (images/sec) scalability up to 256 GPUs.

Index Terms—High Performance Computing, Tensorflow, Deep Learning, Distributed Learning, Performance Analysis

Fig. 1. Accuracy performance of DNNs and ordinary NNs using large data.

I. INTRODUCTION AND MOTIVATION

AlexNet [5] came to the limelight in 2012 during the ImageNet Large Scale Visual Recognition Challenge [4], when it beat the runner-up by a huge margin in accuracy (by more than 15%). Since then, many DNNs have been developed, e.g., GoogleNet [9] and ResNet50/101 [8]. Every year, we see new DNN architectures with better accuracy and convergence performance on a given training dataset. As the training data is fixed, it is up to a DNN developer to improve the accuracy performance. In other words, the race to improve accuracy has become a race in the development of new DNN architectures.

A. What is slowing down the development of new architectures?

The development of new DNN architectures needs creativity and novel ideas. This includes changing the number of layers, the filter dimensions, the approach used to initialize the weights of each connection, and so forth. These changes need to be tested for accuracy and rate of convergence on a given training dataset. This is a time-consuming phase. A high-accuracy deep neural network model such as GoogleNet can take weeks to train on a modern GPU [9]. This is true even when using highly optimized deep neural network libraries like cuDNN [3], maxDNN [11], or fbfft [12], all of which operate near the theoretical peak computation per second achievable on GPUs. Thus, training time is a key challenge at the root of the development of new DNN architectures. We would like to reproduce the words of Jeffrey Dean of Google [13]:
1) DNN researchers want results of experiments quickly.
2) There is a patience threshold: no one wants to wait more than a few days or a week for a result.
3) This significantly affects the scale of problems that can be tackled.
4) Researchers sometimes optimize for experiment turnaround time, rather than absolute minimal system resources for performing the experiment.
These points clearly show the importance of High Performance Computing (HPC) in the field of distributed deep learning.

B. Impact of BigData:

It is now a well-known fact that if we have big data, we need deep learning to extract or process the information. Traditional neural networks do not scale well with large data; however, deep neural networks scale well for accuracy with a large dataset, as shown in Figure 1 (the figure is taken from the lectures on "Deep Learning" by Andrew Ng). Training DNNs is time consuming because of the high computation requirement. Long training times are limiting the pace of DNN research and production. It is a known fact that several Internet companies have internal databases containing billions of images with hundreds of thousands of different category labels. Due to long training times, these companies are facing serious delays in bringing DNN-based solutions to market. Accelerated DDNN training solutions would help these companies bring new DNN-based solutions to market quickly.
C. Impact of training time on real-time processing:

There are a number of situations where it is crucial to incorporate new data into a DNN model in real time. For example, reinforcement learning (RL) enables robots to learn things themselves with minimal supervision. A recent study by Levine [14] applied state-of-the-art DNN-based RL techniques to enable a robot to teach itself how to build Lego structures and screw on bottle caps. This technique is effective, and the robot does indeed learn to screw on bottle caps. However, it takes 3-4 hours for the robot to learn this task, and the majority of this time is spent on DNN training. Faster DNN training would enable reinforcement learning to be applied in real time.

Therefore, there is a great need in the AI community to speed up the training phase, especially when using a large dataset. For this, we need DDNNs that can scale linearly with more computation resources. DDNNs using a GPU cluster can give much faster training. However, this entails two challenges:
• Portability – Most implementations require a data analyst to modify their code significantly.
• Scalability – Several distributed Deep Learning implementations are geared towards cloud computing systems, which is inadequate for execution on massively parallel systems such as supercomputers.
Recently, Uber presented the Horovod framework. Horovod is a fast and efficient framework for porting Deep Learning algorithms written in Tensorflow, Keras, and Pytorch to GPU/CPU clusters using the MPI programming model.

In this paper, we present a detailed performance analysis of the Horovod framework for scalability under various runtime parameters. We used well-known convolutional neural networks with synthetic data. Our results show that the framework scales well up to 256 GPUs, showing almost linear scalability for throughput performance (images/sec). However, the framework does not show linear scalability for latency performance (epochs/sec) beyond 128 GPUs.

The rest of the paper is organized as follows: Section II discusses some recent work on DDNNs. Section III gives the reader some necessary background in the field of distributed learning. Section VI gives details about our experimentation work. Section VII gives the conclusion and some future work.
II. RELATED WORK

In this section, we review previous work on scaling deep neural networks on parallel or distributed systems.

Fig. 2. Parameter Server. Figure taken from the work [25].

Figure 2 shows a typical parameter server model, or Asynchronous Stochastic Gradient Descent (Async SGD) model [13]. In this framework, each worker machine has a copy of the weights W. The dataset is partitioned across all the worker machines. At each step, the i-th worker computes a sub-gradient ∆W_i from its own data and weights. The i-th worker then sends ∆W_i to the master (i ∈ {1, 2, ..., P}). The master receives ∆W_i, conducts the weight update, and sends the weights back to the i-th worker machine. All the workers finish this step asynchronously, using first come first serve (FCFS). Most DDNNs use this approach as it scales well with the number of computing nodes. However, it has a low convergence rate because of the high global variance in weights among the models running on different nodes.
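The paper gives no code for this model, but the update loop it describes maps naturally onto a few MPI calls. Below is a minimal sketch (not the authors' implementation) of the asynchronous parameter-server loop using mpi4py and NumPy: rank 0 acts as the master and applies sub-gradients in whatever order they arrive (FCFS), while the other ranks act as workers. The gradient routine, buffer size, learning rate, and step count are placeholder assumptions.

```python
# Minimal sketch of an asynchronous parameter server (Async SGD), assuming
# rank 0 is the master and ranks 1..P are workers.
# Run with something like: mpirun -np <P+1> python ps_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
DIM, STEPS, ETA = 1000, 100, 0.01            # illustrative sizes, not from the paper

def compute_subgradient(w):
    # Placeholder for backpropagation on the worker's local data shard.
    return np.random.randn(*w.shape).astype(np.float32)

if rank == 0:                                # master: FCFS processing of sub-gradients
    w = np.zeros(DIM, dtype=np.float32)
    status = MPI.Status()
    for _ in range(STEPS * (size - 1)):
        delta_w = comm.recv(source=MPI.ANY_SOURCE, status=status)  # whichever worker arrives first
        w -= ETA * delta_w                   # W <- W - eta * dW_i
        comm.send(w, dest=status.Get_source())                     # return the updated weights
else:                                        # workers: compute dW_i on local data, send to master
    w = np.zeros(DIM, dtype=np.float32)
    for _ in range(STEPS):
        delta_w = compute_subgradient(w)
        comm.send(delta_w, dest=0)
        w = comm.recv(source=0)              # proceeds independently of the other workers
```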
The Hogwild method [21] can be viewed as a variant of Async SGD. The master machine is a shared memory system. For Async SGD, if the sub-gradient from the j-th worker arrives while the master is interacting with the i-th worker, then the update W ← W − η∆W_j cannot be started before W ← W − η∆W_i is finished (i, j ∈ {1, 2, ..., P}). This means that there is a lock to avoid weight-update conflicts on the shared memory system (the master machine). The lock makes sure the master only processes one sub-gradient at a time. The Hogwild method, however, removes the lock and allows the master to process multiple sub-gradients at the same time.

The Elastic Averaging SGD (EASGD) method [23] can also be viewed as a variant of Async SGD. Async SGD uses an FCFS strategy for processing the sub-gradients asynchronously. EASGD uses a round-robin strategy for ordered updates, i.e., W ← W − η∆W_j cannot be started before W ← W − η∆W_{j−1} is finished (j ∈ {2, 3, ..., n}). Also, EASGD requires the workers to conduct the update locally. Before all the workers conduct the local update, the master updates the center (or global) weights.
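For readers who want to see the round-robin elastic update in a runnable form, here is a toy single-process NumPy simulation. The elastic coefficient, step size, and quadratic per-worker objectives are illustrative assumptions; the elastic-difference update itself follows the EASGD formulation in [23] rather than anything spelled out in this paper.

```python
# Toy single-process simulation of round-robin EASGD [23]: each worker keeps a local
# copy of the weights, takes local SGD steps, and exchanges an "elastic difference"
# with the center (global) weights in a fixed round-robin order.
import numpy as np

P, DIM, ROUNDS = 4, 10, 50
ETA, ALPHA = 0.05, 0.1                        # step size and elastic coefficient (illustrative)
rng = np.random.default_rng(0)
targets = rng.normal(size=(P, DIM))           # each worker's toy objective: ||x - target_i||^2

center = np.zeros(DIM)                        # center (global) weights kept by the master
local = np.zeros((P, DIM))                    # per-worker local weights

for _ in range(ROUNDS):
    for i in range(P):                        # round-robin: worker i+1 waits for worker i
        grad = 2.0 * (local[i] - targets[i])  # local sub-gradient
        local[i] -= ETA * grad                # local update, conducted by the worker itself
        elastic = ALPHA * (local[i] - center)
        local[i] -= elastic                   # pull the worker towards the center ...
        center += elastic                     # ... and the center towards the worker

print("center weights after training:", center.round(3))
```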
Work by Li [20] is focused on single-node memory optimization; the idea is included in the work from [25]. There is some work [26], [27] on scaling up deep neural networks by the model parallelism method. Low-precision representation of neural networks is another direction of research. The idea is to use low-precision floating point to reduce the computation and communication while keeping acceptable accuracy ([15], [17], [18], [22]).

III. BACKGROUND

In this section, we describe the background for distributed neural networks.

A. Deep Neural Networks

We use the following three deep neural networks for our work.

1) AlexNet: AlexNet is a Convolutional Neural Network (CNN), originally written with CUDA to run with GPU support, which competed in the ImageNet Large Scale Visual Recognition Challenge in 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points ahead of the runner-up. AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever. AlexNet has five convolution layers, three pooling layers, and two fully-connected layers. This CNN architecture requires about 1.4 M activations per image and has 60 M parameters.

2) GoogLeNet: GoogleNet is a more complex model than AlexNet. GoogleNet has two convolution layers, two pooling layers, and nine inception layers. Each inception layer consists of six convolution layers and one pooling layer. The concept of the inception layer is to cover a bigger area of the image while maintaining fine resolution for small pieces of information in the image. The inception module of GoogLeNet concatenates filters of different sizes into a single new filter, which avoids parameter explosion with the use of inception layers. GoogLeNet performs significantly better than AlexNet on the ImageNet and the recent ILSVRC challenge datasets. This CNN architecture has about 5.5 M parameters. GoogLeNet, in relation to AlexNet, has (i) more layers; (ii) fewer features per layer; and (iii) more activations. GoogleNet has 10.8 M activations per image.

3) ResNet/x: The Deep Residual Learning Network (ResNet) [24] introduced the concept of a residual block. Each block consists of two convolution layers along with a connection adding the output of the second block to the input of the first. Residual blocks are designed to allow the training of substantially deeper models than had been trained previously. By adding the input of the block to its output, the residual block learns the residual function and forwards the activations to deeper layers than before. One advantage of ResNet is that it can improve the accuracy of the model while avoiding parameter explosion. That is, the ResNet blocks increase the depth (and inner layers) of the network instead of its width.

B. Issues with Parallel Stochastic Gradient Descent (SGD)

We now explain the main issues with parallel SGD.

1) Network Contention: One of the major issues faced with Synchronous SGD is network congestion due to the high communication cost between the parameter server and the worker nodes. One solution to this problem is to use multiple parameter servers. However, the number of workers still poses a challenge. Recently, a hierarchical tree approach has been used to aggregate the gradients from the workers [25]. However, these approaches do not utilize all the underlying communication links and are heavily dependent on the network topology of the machine. Instead, MPI solves the problem by deploying state-of-the-art parallel algorithms which can adapt to any underlying topology. Thus, grouping workers into logical MPI groups can significantly reduce contention on the parameter server.

2) Parameter Staleness: As the number of workers increases, asynchronous forms of SGD face the issue of staleness, which inhibits a fast rate of convergence. One way to improve this is to use hierarchical clustering. Clustering workers into MPI clients potentially offers two immediate advantages:
• It reduces the variance of the gradient updates by effectively increasing the mini-batch size; the number of iterations to converge can be halved as the mini-batch size is doubled.
• It reduces the total number of workers. Depending on the algorithm and the distribution of data, one or both of these factors improve the convergence rate.
For example, the MPI elastic averaging algorithms studied in the work [25] benefit from both. Such models offer the potential to scale to a full-scale machine comprising thousands of GPUs.

3) Memory Pressure and Batch Size: One of the main issues in the implementation of Deep Learning (DL) systems is memory pressure, which keeps growing as the number of levels of the network increases. This restricts the choice of batch size of a DL worker to smaller values, as the total memory per worker is dictated by the hardware. There are inefficiencies in using smaller batch sizes. Grouping workers into larger batches should improve performance as long as the batch sizes fall within algorithmically stipulated limits. Moreover, the new framework also allows the flexibility to decouple the dependency between the model mini-batch size and the memory limits per worker, allowing for the possibility of porting models across different hardware architectures.

C. Data Parallelism and Model Parallelism

Fig. 3. Parallelism strategies for Deep Learning.

There are two major parallelism strategies for Deep Learning: Data Parallelism and Model Parallelism (see Figure 3). All the later parallel methods are variants of these two methods.

1) Data Parallelism: The dataset is partitioned into P parts and each machine only gets one part. Each machine has a copy of the neural network, and hence of the weights W. The communication includes the sum of all the gradients ∆W_i and the broadcast of W. The worst part of the communication is conducted between backward propagation and the weight update. The master updates W with the ∆W_i after it gets all the sub-gradients ∆W_i from the workers. Then the master machine broadcasts W to all the worker machines, which is the second part of the communication. Figure 3 shows an example of data parallelism on four machines.

2) Model Parallelism: Data parallelism replicates the neural network itself on each machine, while model parallelism partitions the neural network into P pieces. Partitioning the neural network means parallelizing the matrix operations on the partitioned network. Thus, model parallelism can get the same solution as the single-machine case. Figure 3 shows model parallelism on three machines. These three machines partition the matrix operation of each layer. However, because both the batch size and the picture size are typically relatively small, the matrix operations are not large. State-of-the-art methods often use data parallelism.
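To make the data-parallel communication pattern concrete, the following sketch (not taken from the paper) implements one data-parallel SGD step with mpi4py: every rank computes a sub-gradient on its own shard, the sub-gradients are summed with an allreduce, and each rank applies the same averaged update so that the replicated weights W stay identical. The model, data, and learning rate are placeholders.

```python
# Minimal data-parallel SGD step, assuming each MPI rank holds one shard of the data
# and a full replica of the weights. Gradients are combined with a sum-allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()
DIM, ETA = 1000, 0.01                               # illustrative, not from the paper

w = np.zeros(DIM, dtype=np.float32)                 # identical replica of W on every rank

def local_subgradient(w):
    # Placeholder for forward/backward propagation on this rank's data shard.
    rng = np.random.default_rng(rank)
    return rng.normal(size=w.shape).astype(np.float32)

for step in range(100):
    grad = local_subgradient(w)
    summed = np.empty_like(grad)
    comm.Allreduce(grad, summed, op=MPI.SUM)        # sum of all dW_i across the P workers
    w -= ETA * (summed / P)                         # same update everywhere: replicas stay in sync
```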
IV. TENSORFLOW

Google released TensorFlow in November 2015 as a platform for building and developing DL implementations. TensorFlow is capable of utilizing multiple threads, such that multi-core systems can be used effectively. It also provides implementations to leverage GPUs (using the NVIDIA CUDA-based DNN library, cuDNN), such that one (or more) GPUs on a single node may be utilized effectively.

1) TensorFlow Graph: The fundamental model of computation within TensorFlow is a computational graph. A graph contains vertices, representing operations, and edges, representing tensors (arbitrary-dimensional arrays). Each operation can take multiple inputs and generate multiple outputs, with tensors created and passed from one operation to another. Edges also act as control flow objects in the computational graph, which enforces the dependencies that naturally arise in DL implementations.

2) Tensors: There are several special types of tensors in TensorFlow. An important tensor is the variable. Variables are persistent tensors that can be accessed outside the computational graph. In DL implementations, the weights and biases of a model are stored as variables and updated by operations when a computational graph is executed. Another type of tensor is the placeholder. Placeholders are input points into a computational graph. Outside of placeholders, the computational graph is self-contained.

3) Session: In TensorFlow, a session controls the graph. It stores the values of variables and is used to run the computations described by the graph. After the creation of a session, an initializer must be run to give values to the variables to be used within the session. Subsequent computations, such as the computation of gradients, must be managed through the session to ensure that the correct values of variables are used. The session makes use of a scheduler, which maintains a record of which operations have been completed and enqueues those whose dependencies are all satisfied.

4) Device Scheduling: In addition to its use by the session to keep track of which operations are ready to execute, the TensorFlow scheduler also handles device scheduling when multiple devices are available. Before executing a graph as desired by the user, the scheduler runs a simulation of the graph to determine the execution time and the order of the operations. It then uses this information to create the dependency lists that the session requires and to assign each operation to a device. These assignments first depend on whether there is an implementation of the operation for a given device (for instance, GPU implementations may sometimes be unavailable) and then upon the expected execution speed, taking into account inter-device communication times for the relevant tensors.
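As a concrete illustration of the graph, variable, placeholder, and session concepts described in this section (this code is not from the paper), the TensorFlow 1.x sketch below builds a tiny linear model as a computational graph, runs the variable initializer, and executes one training step inside a session. The model shape, learning rate, and random batch are assumptions made only for the example.

```python
# TensorFlow 1.x sketch: placeholder (graph input), variables (persistent weights),
# and a session that initializes the variables and executes the graph.
import numpy as np
import tensorflow as tf   # written against the TF 1.x API used in the paper (TF 1.8)

x = tf.placeholder(tf.float32, shape=[None, 32], name="features")   # input point into the graph
y = tf.placeholder(tf.float32, shape=[None, 1], name="labels")
w = tf.Variable(tf.zeros([32, 1]), name="weights")                  # persistent tensor
b = tf.Variable(tf.zeros([1]), name="bias")

pred = tf.matmul(x, w) + b                                           # operations = graph vertices
loss = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())                      # give values to the variables
    batch_x = np.random.rand(64, 32).astype(np.float32)              # toy batch, not real data
    batch_y = np.random.rand(64, 1).astype(np.float32)
    _, l = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
    print("loss after one step:", l)
```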
V. HOROVOD FRAMEWORK

Uber adopted Baidu's implementation [2] of the TensorFlow ring-allreduce algorithm for the Horovod framework. Here is a brief introduction to the framework:
• Uber's AI team converted the code from Baidu into a stand-alone Python package called Horovod, named after a traditional Russian folk dance in which performers dance with linked arms in a circle, much like how distributed TensorFlow processes use Horovod to communicate with each other.
• The framework also provides a ring-allreduce implementation with NCCL. NCCL is NVIDIA's library for collective communication that provides a highly optimized version of ring-allreduce. NCCL 2 introduced the ability to run ring-allreduce across multiple machines, enabling the end user to take advantage of its many performance-boosting optimizations.
• The framework provides support for models that fit inside a single server, potentially on multiple GPUs.
• The framework provides several API improvements inspired by feedback received from a number of initial users. In particular, the framework implements a broadcast operation that enforces consistent initialization of the model on all workers. The new API cuts down the number of operations a user has to introduce into their single-GPU program to four.
Whereas the parameter server paradigm for distributed TensorFlow training often requires careful implementation of significant boilerplate code, Horovod needs just a few new lines. Figure 4 gives an example of a TensorFlow program distributed using Horovod.

Fig. 4. Horovod simple APIs to move single-GPU code to multi-GPU code.
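Figure 4 itself is not reproduced here. The sketch below shows the handful of Horovod additions the figure refers to, following the documented horovod.tensorflow pattern: initialize Horovod, pin one GPU per process, wrap the optimizer in hvd.DistributedOptimizer, and broadcast the initial variables from rank 0. The toy model, learning-rate scaling, step count, and checkpoint path are illustrative assumptions, not the benchmark code used for the experiments.

```python
# Sketch of distributing a TF 1.x training loop with Horovod: the additions relative
# to single-GPU code are marked with "Horovod:" comments.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                                    # Horovod: initialize the MPI-based runtime

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # Horovod: pin one GPU per process

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), 0.9)      # Horovod: scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                           # Horovod: allreduce-averaged gradients
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),                 # Horovod: consistent initialization
         tf.train.StopAtStepHook(last_step=1000)]
ckpt_dir = "/tmp/ckpts" if hvd.rank() == 0 else None          # only rank 0 writes checkpoints

with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        batch_x = np.random.rand(32, 784).astype(np.float32)  # placeholder data, not ImageNet
        batch_y = np.random.randint(0, 10, size=32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

Such a script is typically launched with one process per GPU, for example with mpirun or horovodrun, depending on the installation.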
In the ring-allreduce algorithm, shown in Figure 5, each of the N nodes communicates with two of its peers 2 × (N − 1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N − 1 iterations, received values are added to the values in the node's buffer. In the second N − 1 iterations, received values replace the values held in the node's buffer. Patarasuk and Yuan [28] suggest that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network. In addition to being network-optimal, the allreduce approach is much easier to understand and adopt. Users utilize a Message Passing Interface (MPI) implementation such as OpenMPI to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an allreduce() operation.

Fig. 5. MPI Ring All-Reduce.
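As an illustration of the communication pattern just described (a sketch under simplifying assumptions, not Horovod's or NCCL's actual implementation), the mpi4py code below performs a ring allreduce on a NumPy buffer split into N chunks: N − 1 reduce-scatter steps in which received chunks are added to the local buffer, followed by N − 1 allgather steps in which received chunks replace the local ones. The buffer length is assumed to be divisible by the number of ranks.

```python
# Ring allreduce sketch with mpi4py: reduce-scatter (add), then allgather (replace).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, N = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % N, (rank + 1) % N

def ring_allreduce(buf):
    chunks = np.array_split(buf, N)                     # N chunks of the local buffer
    recv = np.empty_like(chunks[0])
    # Phase 1: N-1 reduce-scatter steps -- received values are ADDED.
    for step in range(N - 1):
        send_idx = (rank - step) % N
        recv_idx = (rank - step - 1) % N
        comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
        chunks[recv_idx] += recv
    # Phase 2: N-1 allgather steps -- received values REPLACE the local ones.
    for step in range(N - 1):
        send_idx = (rank - step + 1) % N
        recv_idx = (rank - step) % N
        comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
        chunks[recv_idx][:] = recv
    return np.concatenate(chunks)

local = np.full(4 * N, float(rank))                     # toy gradient buffer
print(rank, ring_allreduce(local)[:4])                  # every rank ends with the same sums
```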
VI. EXPERIMENTATION

In this section we talk about our experimentation and results. The main information about the experimentation environment is shown in Table I. We use Nvidia K20, K40, K80, and P100 GPUs for our work. However, we only present our results with K80 GPUs in this section.

TABLE I: EXPERIMENTAL ENVIRONMENT
Nodes:      256 (4 K80 per node)
GPUs:       Nvidia K20, K40, K80, P100
Data:       Synthetic data
Memory:     250 GB
Compiler:   GCC
CUDA:       9.1
Tensorflow: 1.80
Python:     2.7

Figure 6 shows the scalability performance of Horovod using AlexNet. The running time for each epoch decreases as we increase the number of GPU nodes. The rate of decrease is high up to 16 GPU nodes and then dies down because of the high communication cost. Figure 7 shows the scalability performance of gRPC using AlexNet. Compared to Horovod, it shows poor performance. The gRPC implementation uses the parameter server model, which has a high communication cost compared to the ring-allreduce implementation in Horovod.

Figure 8 shows the throughput (images/sec) performance of Horovod using AlexNet. The performance is almost linear with all three batch sizes. Figure 9 shows the throughput performance for the gRPC implementation. The scalability performance dies down after 64 GPU nodes.

Figure 10 and Figure 11 show the throughput performance for Horovod and gRPC, respectively, using GoogleNet. Again, Horovod shows better scalability performance than gRPC under different workloads.

Figure 12 and Figure 13 show the throughput performance for Horovod and gRPC, respectively, using ResNet50. Again, Horovod shows better scalability than gRPC under different workloads.

Fig. 6. Scalability analysis with AlexNet using Horovod.
Fig. 7. Scalability analysis with AlexNet using gRPC.
Fig. 8. Throughput performance of Horovod using AlexNet.
Fig. 9. Throughput performance of gRPC using AlexNet.
Fig. 10. Throughput performance of Horovod using GoogleNet.
Fig. 11. Throughput performance of gRPC using GoogleNet.
Fig. 12. Throughput performance of Horovod using ResNet50.
Fig. 13. Throughput performance of gRPC using ResNet50.
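For contrast with the Horovod code path, the sketch below shows the kind of boilerplate the gRPC-based parameter-server setup in TensorFlow 1.x requires: a cluster specification, a per-process server, and variable placement through replica_device_setter. The host names, ports, toy model, and hard-coded job name are placeholder assumptions; this is not the configuration used in the paper's experiments.

```python
# Sketch of TF 1.x distributed training over gRPC with a parameter server.
# In a real deployment each process receives its own job_name/task_index (hard-coded here).
import tensorflow as tf

job_name, task_index = "worker", 0
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],                               # placeholder hosts
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                              # the parameter server just serves variables
else:
    # Variables are placed on the ps job, compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.int64, [None])
        logits = tf.layers.dense(x, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    hooks = [tf.train.StopAtStepHook(last_step=1000)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            # feed real batches here; a single dummy example keeps the sketch self-contained
            sess.run(train_op, feed_dict={x: [[0.0] * 784], y: [0]})
```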

VII. CONCLUSION AND FUTURE WORK

Deep Neural Networks, or Deep Learning algorithms, have become a popular choice for data analysis. Several Deep Learning implementations, such as AlexNet, GoogleNet, and ResNet, have become readily available for use with little modification. Distributed Deep Learning implementations capable of execution on large-scale systems are becoming important to address the computational needs of the large data produced by scientific simulations and experiments, and by social media such as Facebook, Youtube, etc. However, the adoption of distributed Deep Learning faces many significant challenges. The top two challenges are: 1) Portability – most implementations require a data analyst to modify their code significantly; and 2) Scalability – several distributed Deep Learning implementations are geared towards cloud computing systems, which is inadequate for execution on massively parallel systems such as supercomputers. Recently, Uber presented the Horovod framework, which is a fast and efficient framework for porting Deep Learning algorithms written in Tensorflow, Keras, and Pytorch to GPU/CPU clusters using the MPI programming model.

In this paper, we presented a detailed performance analysis of the Horovod framework for scalability under various runtime parameters. We used well-known convolutional neural networks with synthetic data for our experimentation. Our results show that the framework scales well up to 256 GPUs, showing almost linear scalability for throughput performance (images/sec). However, the framework does not show linear scalability for latency performance (epochs/sec) beyond 128 GPUs.

For future work, we are planning to do a detailed analysis of the framework with a real dataset, such as ImageNet, using a performance analysis tool. This fine-level performance analysis will help us pinpoint the main performance bottlenecks for the Horovod framework.

VIII. ACKNOWLEDGEMENT

The authors would like to thank Alexander Sergeev from Uber Technologies, Inc. for a thoughtful discussion.

REFERENCES

[1] A. Awan, H. Subramoni, and D. K. Panda. An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures. In 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017.
[2] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. 2018. https://arxiv.org/abs/1802.05799
[3] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: efficient primitives for deep learning. arXiv:1410.0759, 2014.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[6] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
[9] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[10] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[11] A. Lavin. maxDNN: an efficient convolution kernel for deep learning with Maxwell GPUs. arXiv:1501.06633, 2015.
[12] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: a GPU performance evaluation. arXiv:1412.7580, 2014.
[13] J. Dean. Keynote: Large scale deep learning. In GPU Technology Conference, 2015.
[14] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv:1504.00702, 2015.
[15] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv:1412.7024, 2014.
[16] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223-1231, 2012.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737-1746, 2015.
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: training neural networks with low precision weights and activations. arXiv:1609.07061, 2016.
[19] Q. V. Le. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8595-8598, 2013.
[20] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou. Optimizing memory efficiency for deep convolutional neural networks on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, Article 54, 2016.
[21] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.
[22] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech, pages 1058-1062, 2014.
[23] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685-693, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[25] Y. You, A. Buluç, and J. Demmel. Scaling deep learning on GPU and Knights Landing clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, Article 9, 12 pages, 2017. DOI: https://doi.org/10.1145/3126908.3126912
[26] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng. Deep learning with COTS HPC systems. In International Conference on Machine Learning, pages 1337-1345, 2013.
[27] Q. V. Le. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8595-8598, 2013.
[28] P. Patarasuk and X. Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117-124, February 2009.
