
Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms

Abid Malik (corresponding author), Micheal Lu, Nathenial Wang, Yeiwei Lin, and Shinjae Yoo
Computer Science Initiative
Brookhaven National Laboratory
amalik@bnl.gov

Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-SC with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. U.S. Government work not protected by U.S. copyright.

Abstract—Long training times for building high-accuracy deep neural networks (DNNs) are impeding research into new DNN architectures. For example, training GoogleNet with the ImageNet dataset on a single Nvidia K20 GPU takes almost 25 days. Therefore, there is a great need in the AI community to speed up the training phase, especially when using a large dataset. For this, we need Distributed Deep Neural Networks (DDNNs) that can scale well with more computation resources. However, this involves two challenges. First, the deep learning framework or training library must support inter-node communication. Second, the user must modify the code to take advantage of the inter-node communication. The changes to the code can be minimal or significant, depending upon the user's expertise in distributed systems. Current DNN frameworks support distributed learning using MPI. However, these frameworks come with poorly understood overheads associated with communication and data management. Tensorflow provides APIs for distributed learning using the MPI programming model and gRPC. These APIs are not easy for a domain expert to use for designing an efficient distributed learning model. Recently, Uber Inc. provided the Horovod framework, which gives a fast and easy way to support distributed learning using Tensorflow, Pytorch, and Keras. In this paper we provide a detailed performance analysis of distributed Tensorflow using Horovod. We implemented distributed learning for AlexNet, GoogleNet, and ResNet50 using Horovod. We used Nvidia K40, K80, and P100 GPUs for our experimentation. We used synthetic image data with different runtime variables (batch size and number of GPUs). Our results show that the Horovod framework gives almost linear throughput (images/sec) scalability up to 256 GPUs.

Index Terms—High Performance Computing, Tensorflow, Deep Learning, Distributed Learning, Performance Analysis

Fig. 1. Accuracy performance of DNNs and ordinary NNs using large data.

I. INTRODUCTION AND MOTIVATION

AlexNet [5] came to the limelight in 2012 during the ImageNet Large Scale Visual Recognition Challenge [4], when it beat the runner-up by a huge margin in accuracy (by more than 15%). Since then, many DNNs have been developed, e.g., GoogleNet [9] and ResNet50/101 [8]. Every year, we see new DNN architectures with better accuracy and convergence performance on a given training dataset. As the training data is fixed, it is up to a DNN developer to improve the accuracy performance. In other words, the race to improve accuracy has become a race in the development of new DNN architectures.

A. What is slowing down the development of new architectures?

The development of new DNN architectures needs creativity and novel ideas. This includes changing the number of layers, the filter dimensions, the approach used to initialize the weights of each connection, and so forth. These changes need to be tested for accuracy and rate of convergence on a given training dataset. This is a time-consuming phase. A high-accuracy deep neural network model such as GoogleNet can take weeks to train on a modern GPU [9]. This is true even when using highly optimized deep neural network libraries like cuDNN [3], maxDNN [11], or fbfft [12], all of which operate near the theoretical peak computation per second achievable on GPUs. Thus, training time is a key challenge at the root of the development of new DNN architectures. We would like to reproduce the words of Jeffrey Dean of Google [13]:
1) DNN researchers want results of experiments quickly.
2) There is a patience threshold: no one wants to wait more than a few days or a week for a result.
3) This significantly affects the scale of problems that can be tackled.
4) Researchers sometimes optimize for experiment turnaround time, rather than absolute minimal system resources for performing the experiment.
These points clearly show the importance of High Performance Computing (HPC) in the field of distributed deep learning.

B. Impact of BigData:

It is now a well-known fact that if we have big data, we need deep learning to extract or process the information. Traditional neural networks do not scale well with large data; however, deep neural networks scale well for accuracy with a large dataset, as shown in Figure 1 (the figure is taken from the lectures on "Deep Learning" by Andrew Ng). Training DNNs is time consuming because of the high computation requirement. Long training times are limiting the pace of DNN research and production. It is a known fact that several Internet companies have internal databases containing billions of images with hundreds of thousands of different category labels. Due to long training times, these companies are facing serious delays in bringing DNN-based solutions to market. Accelerated DDNN training solutions would help these companies bring new DNN-based solutions to market quickly.
C. Impact of training time on real-time processing:

There are a number of situations where it is crucial to incorporate new data into a DNN model in real time. For example, reinforcement learning (RL) enables robots to learn things themselves with minimal supervision. A recent study by Levine [14] applied state-of-the-art DNN-based RL techniques to enable a robot to teach itself how to build Lego structures and screw on bottle caps. This technique is effective, and the robot does indeed learn to screw on bottle caps. However, it takes 3-4 hours for the robot to learn this task, and the majority of this time is spent on DNN training. Faster DNN training would enable reinforcement learning to be applied in real time.

Therefore, there is a great need in the AI community to speed up the training phase, especially when using a large dataset. For this, we need DDNNs that can scale linearly with more computation resources. DDNNs using a GPU cluster can give much faster training. However, this entails two challenges:
• Portability – Most implementations require a data analyst to modify their code significantly.
• Scalability – Several distributed Deep Learning implementations are geared towards cloud computing systems, which is inadequate for execution on massively parallel systems such as supercomputers.
Recently, Uber presented the Horovod framework. Horovod is a fast and efficient framework for porting Deep Learning algorithms written in Tensorflow, Keras, and Pytorch to GPU/CPU clusters using the MPI programming model.

In this paper, we present a detailed performance analysis of the Horovod framework for scalability under various runtime parameters. We used well-known convolutional neural networks with synthetic data. Our results show that the framework scales well up to 256 GPUs, showing almost linear scalability for throughput performance (images/sec). However, the framework does not show linear scalability for latency performance (epochs/sec) beyond 128 GPUs.

The rest of the paper is organized as follows: Section II discusses some recent work on DDNNs. Section III gives the reader some necessary background in the field of distributed learning. Section VI gives details about our experimentation work. Section VII gives the conclusion and some future work.
II. RELATED WORK

In this section, we review previous work on scaling deep neural networks on parallel or distributed systems.

Fig. 2. Parameter Server. Figure taken from the work [25].

Figure 2 shows a typical parameter server model, or Asynchronous Stochastic Gradient Descent (Async SGD) model [13]. In this framework, each worker machine has a copy of the weights W. The dataset is partitioned across all the worker machines. At each step, the i-th worker computes a sub-gradient ∆W_i from its own data and weights. The i-th worker then sends ∆W_i to the master (i ∈ {1, 2, ..., P}). The master receives ∆W_i, conducts the weight update, and sends the weights back to the i-th worker machine. All the workers finish this step asynchronously, using first come first serve (FCFS). Most DDNNs use this approach as it scales well with the number of computing nodes. However, it has a low convergence rate because of the high global variance in weights among the models running on different nodes.
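The paper gives no code for this model, but the update loop it describes maps naturally onto a few MPI calls. Below is a minimal sketch (not the authors' implementation) of the asynchronous parameter-server loop using mpi4py and NumPy: rank 0 acts as the master and applies sub-gradients in whatever order they arrive (FCFS), while the other ranks act as workers. The gradient routine, buffer size, learning rate, and step count are placeholder assumptions.

```python
# Minimal sketch of an asynchronous parameter server (Async SGD), assuming
# rank 0 is the master and ranks 1..P are workers.
# Run with something like: mpirun -np <P+1> python ps_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
DIM, STEPS, ETA = 1000, 100, 0.01            # illustrative sizes, not from the paper

def compute_subgradient(w):
    # Placeholder for backpropagation on the worker's local data shard.
    return np.random.randn(*w.shape).astype(np.float32)

if rank == 0:                                # master: FCFS processing of sub-gradients
    w = np.zeros(DIM, dtype=np.float32)
    status = MPI.Status()
    for _ in range(STEPS * (size - 1)):
        delta_w = comm.recv(source=MPI.ANY_SOURCE, status=status)  # whichever worker arrives first
        w -= ETA * delta_w                   # W <- W - eta * dW_i
        comm.send(w, dest=status.Get_source())                     # return the updated weights
else:                                        # workers: compute dW_i on local data, send to master
    w = np.zeros(DIM, dtype=np.float32)
    for _ in range(STEPS):
        delta_w = compute_subgradient(w)
        comm.send(delta_w, dest=0)
        w = comm.recv(source=0)              # proceeds independently of the other workers
```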
The Hogwild method [21] can be viewed as a variant of Async SGD. The master machine is a shared memory system. For Async SGD, if the sub-gradient from the j-th worker arrives while the master is interacting with the i-th worker, then the update W ← W − η∆W_j cannot be started before W ← W − η∆W_i is finished (i, j ∈ {1, 2, ..., P}). This means that there is a lock to avoid weight-update conflicts on the shared memory system (the master machine). The lock makes sure the master only processes one sub-gradient at a time. The Hogwild method, however, removes the lock and allows the master to process multiple sub-gradients at the same time.

The Elastic Averaging SGD (EASGD) method [23] can also be viewed as a variant of Async SGD. Async SGD uses an FCFS strategy for processing the sub-gradients asynchronously. EASGD uses a round-robin strategy for ordered updates, i.e., W ← W − η∆W_j cannot be started before W ← W − η∆W_{j−1} is finished (j ∈ {2, 3, ..., n}). Also, EASGD requires the workers to conduct the update locally. Before all the workers conduct the local update, the master updates the center (or global) weights.
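For readers who want to see the round-robin elastic update in a runnable form, here is a toy single-process NumPy simulation. The elastic coefficient, step size, and quadratic per-worker objectives are illustrative assumptions; the elastic-difference update itself follows the EASGD formulation in [23] rather than anything spelled out in this paper.

```python
# Toy single-process simulation of round-robin EASGD [23]: each worker keeps a local
# copy of the weights, takes local SGD steps, and exchanges an "elastic difference"
# with the center (global) weights in a fixed round-robin order.
import numpy as np

P, DIM, ROUNDS = 4, 10, 50
ETA, ALPHA = 0.05, 0.1                        # step size and elastic coefficient (illustrative)
rng = np.random.default_rng(0)
targets = rng.normal(size=(P, DIM))           # each worker's toy objective: ||x - target_i||^2

center = np.zeros(DIM)                        # center (global) weights kept by the master
local = np.zeros((P, DIM))                    # per-worker local weights

for _ in range(ROUNDS):
    for i in range(P):                        # round-robin: worker i+1 waits for worker i
        grad = 2.0 * (local[i] - targets[i])  # local sub-gradient
        local[i] -= ETA * grad                # local update, conducted by the worker itself
        elastic = ALPHA * (local[i] - center)
        local[i] -= elastic                   # pull the worker towards the center ...
        center += elastic                     # ... and the center towards the worker

print("center weights after training:", center.round(3))
```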
Work by Li [20] is focused on single-node memory optimization; the idea is included in the work from [25]. There is some work [26], [27] on scaling up deep neural networks by the model parallelism method. Low-precision representation of neural networks is another direction of research. The idea is to use low-precision floating point to reduce the computation and communication while keeping acceptable accuracy ([15], [17], [18], [22]).

III. BACKGROUND

In this section, we describe the background for distributed neural networks.

A. Deep Neural Networks

We use the following three deep neural networks for our work.

1) AlexNet: AlexNet is a Convolutional Neural Network (CNN), originally written with CUDA to run with GPU support, which competed in the ImageNet Large Scale Visual Recognition Challenge in 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points ahead of the runner-up. AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever. AlexNet has five convolution layers, three pooling layers, and two fully-connected layers. This CNN architecture requires about 1.4 M activations per image and has 60 M parameters.

2) GoogLeNet: GoogleNet is a more complex model than AlexNet. GoogleNet has two convolution layers, two pooling layers, and nine inception layers. Each inception layer consists of six convolution layers and one pooling layer. The concept of the inception layer is to cover a bigger area of the image while maintaining fine resolution for small pieces of information in the image. The inception module of GoogLeNet concatenates filters of different sizes into a single new filter, which avoids parameter explosion with the use of inception layers. GoogLeNet performs significantly better than AlexNet on the ImageNet and the recent ILSVRC challenge datasets. This CNN architecture has about 5.5 M parameters. GoogLeNet, in relation to AlexNet, has (i) more layers; (ii) fewer features per layer; and (iii) more activations. GoogleNet has 10.8 M activations per image.

3) ResNet/x: The Deep Residual Learning Network (ResNet) [24] introduced the concept of a residual block. Each block consists of two convolution layers along with a connection adding the output of the second block to the input of the first. Residual blocks are designed to allow the training of substantially deeper models than had been trained previously. By adding the input of the block to its output, the residual block learns the residual function and forwards the activations to deeper layers than before. One advantage of ResNet is that it can improve the accuracy of the model while avoiding parameter explosion. That is, the ResNet blocks increase the depth (and inner layers) of the network instead of its width.

B. Issues with Parallel Stochastic Gradient Descent (SGD)

We now explain the main issues with parallel SGD.

1) Network Contention: One of the major issues faced with Synchronous SGD is network congestion due to the high communication cost between the parameter server and the worker nodes. One solution to this problem is to use multiple parameter servers. However, the number of workers still poses a challenge. Recently, a hierarchical tree approach has been used to aggregate the gradients from the workers [25]. However, these approaches do not utilize all the underlying communication links and are heavily dependent on the network topology of the machine. Instead, MPI solves the problem by deploying state-of-the-art parallel algorithms which can adapt to any underlying topology. Thus, grouping workers into logical MPI groups can significantly reduce contention on the parameter server.

2) Parameter Staleness: As the number of workers increases, asynchronous forms of SGD face the issue of staleness, which inhibits a fast rate of convergence. One way to improve this is to use hierarchical clustering. Clustering workers into MPI clients potentially offers two immediate advantages:
• It reduces the variance of the gradient updates by effectively increasing the mini-batch size; the number of iterations to converge can be halved as the mini-batch size is doubled.
• It reduces the total number of workers. Depending on the algorithm and the distribution of data, one or both of these factors improve the convergence rate.
For example, the MPI elastic averaging algorithms studied in the work [25] benefit from both. Such models offer the potential to scale to a full-scale machine comprising thousands of GPUs.

3) Memory Pressure and Batch Size: One of the main issues in the implementation of Deep Learning (DL) systems is memory pressure, which keeps growing as the number of levels of the network increases. This restricts the choice of batch size of a DL worker to smaller values, as the total memory per worker is dictated by the hardware. There are inefficiencies in using smaller batch sizes. Grouping workers into larger batches should improve performance as long as the batch sizes fall within algorithmically stipulated limits. Moreover, the new framework also allows the flexibility to decouple the dependency between the model mini-batch size and the memory limits per worker, allowing for the possibility of porting models across different hardware architectures.

C. Data Parallelism and Model Parallelism

Fig. 3. Parallelism strategies for Deep Learning.

There are two major parallelism strategies for Deep Learning: Data Parallelism and Model Parallelism (see Figure 3). All the later parallel methods are variants of these two methods.

1) Data Parallelism: The dataset is partitioned into P parts and each machine only gets one part. Each machine has a copy of the neural network, and hence of the weights W. The communication includes the sum of all the gradients ∆W_i and the broadcast of W. The worst part of the communication is conducted between backward propagation and the weight update. The master updates W with the ∆W_i after it gets all the sub-gradients ∆W_i from the workers. Then the master machine broadcasts W to all the worker machines, which is the second part of the communication. Figure 3 shows an example of data parallelism on four machines.

2) Model Parallelism: Data parallelism replicates the neural network itself on each machine, while model parallelism partitions the neural network into P pieces. Partitioning the neural network means parallelizing the matrix operations on the partitioned network. Thus, model parallelism can get the same solution as the single-machine case. Figure 3 shows model parallelism on three machines. These three machines partition the matrix operation of each layer. However, because both the batch size and the picture size are typically relatively small, the matrix operations are not large. State-of-the-art methods often use data parallelism.
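To make the data-parallel communication pattern concrete, the following sketch (not taken from the paper) implements one data-parallel SGD step with mpi4py: every rank computes a sub-gradient on its own shard, the sub-gradients are summed with an allreduce, and each rank applies the same averaged update so that the replicated weights W stay identical. The model, data, and learning rate are placeholders.

```python
# Minimal data-parallel SGD step, assuming each MPI rank holds one shard of the data
# and a full replica of the weights. Gradients are combined with a sum-allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()
DIM, ETA = 1000, 0.01                               # illustrative, not from the paper

w = np.zeros(DIM, dtype=np.float32)                 # identical replica of W on every rank

def local_subgradient(w):
    # Placeholder for forward/backward propagation on this rank's data shard.
    rng = np.random.default_rng(rank)
    return rng.normal(size=w.shape).astype(np.float32)

for step in range(100):
    grad = local_subgradient(w)
    summed = np.empty_like(grad)
    comm.Allreduce(grad, summed, op=MPI.SUM)        # sum of all dW_i across the P workers
    w -= ETA * (summed / P)                         # same update everywhere: replicas stay in sync
```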
IV. TENSORFLOW

Google released TensorFlow in November 2015 as a platform for building and developing DL implementations. TensorFlow is capable of utilizing multiple threads, such that multi-core systems can be used effectively. It also provides implementations to leverage GPUs (using the NVIDIA CUDA-based DNN library, cuDNN), such that one (or more) GPUs on a single node may be utilized effectively.

1) TensorFlow Graph: The fundamental model of computation within TensorFlow is a computational graph. A graph contains vertices, representing operations, and edges, representing tensors (arbitrary-dimensional arrays). Each operation can take multiple inputs and generate multiple outputs, with tensors created and passed from one operation to another. Edges also act as control flow objects in the computational graph, which enforces the dependencies that naturally arise in DL implementations.

2) Tensors: There are several special types of tensors in TensorFlow. An important tensor is the variable. Variables are persistent tensors that can be accessed outside the computational graph. In DL implementations, the weights and biases of a model are stored as variables and updated by operations when a computational graph is executed. Another type of tensor is the placeholder. Placeholders are input points into a computational graph. Outside of placeholders, the computational graph is self-contained.

3) Session: In TensorFlow, a session controls the graph. It stores the values of variables and is used to run the computations described by the graph. After the creation of a session, an initializer must be run to give values to the variables to be used within the session. Subsequent computations, such as the computation of gradients, must be managed through the session to ensure that the correct values of variables are used. The session makes use of a scheduler, which maintains a record of which operations have been completed and enqueues those whose dependencies are all satisfied.

4) Device Scheduling: In addition to its use by the session to keep track of which operations are ready to execute, the TensorFlow scheduler also handles device scheduling when multiple devices are available. Before executing a graph as desired by the user, the scheduler runs a simulation of the graph to determine the execution time and the order of the operations. It then uses this information to create the dependency lists that the session requires and to assign each operation to a device. These assignments first depend on whether there is an implementation of the operation for a given device (for instance, GPU implementations may sometimes be unavailable) and then upon the expected execution speed, taking into account inter-device communication times for the relevant tensors.
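As a concrete illustration of the graph, variable, placeholder, and session concepts described in this section (this code is not from the paper), the TensorFlow 1.x sketch below builds a tiny linear model as a computational graph, runs the variable initializer, and executes one training step inside a session. The model shape, learning rate, and random batch are assumptions made only for the example.

```python
# TensorFlow 1.x sketch: placeholder (graph input), variables (persistent weights),
# and a session that initializes the variables and executes the graph.
import numpy as np
import tensorflow as tf   # written against the TF 1.x API used in the paper (TF 1.8)

x = tf.placeholder(tf.float32, shape=[None, 32], name="features")   # input point into the graph
y = tf.placeholder(tf.float32, shape=[None, 1], name="labels")
w = tf.Variable(tf.zeros([32, 1]), name="weights")                  # persistent tensor
b = tf.Variable(tf.zeros([1]), name="bias")

pred = tf.matmul(x, w) + b                                           # operations = graph vertices
loss = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())                      # give values to the variables
    batch_x = np.random.rand(64, 32).astype(np.float32)              # toy batch, not real data
    batch_y = np.random.rand(64, 1).astype(np.float32)
    _, l = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
    print("loss after one step:", l)
```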
V. HOROVOD FRAMEWORK

Uber adopted Baidu's implementation [2] of the TensorFlow ring-allreduce algorithm for the Horovod framework. Here is a brief introduction to the framework:
• Uber's AI team converted the code from Baidu into a stand-alone Python package called Horovod, named after a traditional Russian folk dance in which performers dance with linked arms in a circle, much like how distributed TensorFlow processes use Horovod to communicate with each other.
• The framework also provides a ring-allreduce implementation with NCCL. NCCL is NVIDIA's library for collective communication that provides a highly optimized version of ring-allreduce. NCCL 2 introduced the ability to run ring-allreduce across multiple machines, enabling the end user to take advantage of its many performance-boosting optimizations.
• The framework provides support for models that fit inside a single server, potentially on multiple GPUs.
• The framework provides several API improvements inspired by feedback received from a number of initial users. In particular, the framework implements a broadcast operation that enforces consistent initialization of the model on all workers. The new API cuts down the number of operations a user has to introduce into their single-GPU program to four.
Whereas the parameter server paradigm for distributed TensorFlow training often requires careful implementation of significant boilerplate code, Horovod needs just a few new lines. Figure 4 gives an example of a TensorFlow program distributed using Horovod.

Fig. 4. Horovod simple APIs to move single-GPU code to multi-GPU code.
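Figure 4 itself is not reproduced here. The sketch below shows the handful of Horovod additions the figure refers to, following the documented horovod.tensorflow pattern: initialize Horovod, pin one GPU per process, wrap the optimizer in hvd.DistributedOptimizer, and broadcast the initial variables from rank 0. The toy model, learning-rate scaling, step count, and checkpoint path are illustrative assumptions, not the benchmark code used for the experiments.

```python
# Sketch of distributing a TF 1.x training loop with Horovod: the additions relative
# to single-GPU code are marked with "Horovod:" comments.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                                    # Horovod: initialize the MPI-based runtime

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # Horovod: pin one GPU per process

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), 0.9)      # Horovod: scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                           # Horovod: allreduce-averaged gradients
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),                 # Horovod: consistent initialization
         tf.train.StopAtStepHook(last_step=1000)]
ckpt_dir = "/tmp/ckpts" if hvd.rank() == 0 else None          # only rank 0 writes checkpoints

with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        batch_x = np.random.rand(32, 784).astype(np.float32)  # placeholder data, not ImageNet
        batch_y = np.random.randint(0, 10, size=32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

Such a script is typically launched with one process per GPU, for example with mpirun or horovodrun, depending on the installation.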
In the ring-allreduce algorithm, shown in Figure 5, each of the N nodes communicates with two of its peers 2 × (N − 1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N − 1 iterations, received values are added to the values in the node's buffer. In the second N − 1 iterations, received values replace the values held in the node's buffer. Patarasuk and Yuan [28] suggest that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network. In addition to being network-optimal, the allreduce approach is much easier to understand and adopt. Users utilize a Message Passing Interface (MPI) implementation such as OpenMPI to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an allreduce() operation.

Fig. 5. MPI Ring All-Reduce.
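As an illustration of the communication pattern just described (a sketch under simplifying assumptions, not Horovod's or NCCL's actual implementation), the mpi4py code below performs a ring allreduce on a NumPy buffer split into N chunks: N − 1 reduce-scatter steps in which received chunks are added to the local buffer, followed by N − 1 allgather steps in which received chunks replace the local ones. The buffer length is assumed to be divisible by the number of ranks.

```python
# Ring allreduce sketch with mpi4py: reduce-scatter (add), then allgather (replace).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, N = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % N, (rank + 1) % N

def ring_allreduce(buf):
    chunks = np.array_split(buf, N)                     # N chunks of the local buffer
    recv = np.empty_like(chunks[0])
    # Phase 1: N-1 reduce-scatter steps -- received values are ADDED.
    for step in range(N - 1):
        send_idx = (rank - step) % N
        recv_idx = (rank - step - 1) % N
        comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
        chunks[recv_idx] += recv
    # Phase 2: N-1 allgather steps -- received values REPLACE the local ones.
    for step in range(N - 1):
        send_idx = (rank - step + 1) % N
        recv_idx = (rank - step) % N
        comm.Sendrecv(chunks[send_idx], dest=right, recvbuf=recv, source=left)
        chunks[recv_idx][:] = recv
    return np.concatenate(chunks)

local = np.full(4 * N, float(rank))                     # toy gradient buffer
print(rank, ring_allreduce(local)[:4])                  # every rank ends with the same sums
```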
VI. EXPERIMENTATION

In this section we talk about our experimentation and results. The main information about the experimentation environment is shown in Table I. We use Nvidia K20, K40, K80, and P100 GPUs for our work. However, we only present our results with K80 GPUs in this section.

TABLE I: EXPERIMENTAL ENVIRONMENT
Nodes:      256 (4 K80 per node)
GPUs:       Nvidia K20, K40, K80, P100
Data:       Synthetic data
Memory:     250 GB
Compiler:   GCC
CUDA:       9.1
Tensorflow: 1.80
Python:     2.7

Figure 6 shows the scalability performance of Horovod using AlexNet. The running time for each epoch decreases as we increase the number of GPU nodes. The rate of decrease is high up to 16 GPU nodes and then dies down because of the high communication cost. Figure 7 shows the scalability performance of gRPC using AlexNet. Compared to Horovod, it shows poor performance. The gRPC implementation uses the parameter server model, which has a high communication cost compared to the ring-allreduce implementation in Horovod.

Figure 8 shows the throughput (images/sec) performance of Horovod using AlexNet. The performance is almost linear with all three batch sizes. Figure 9 shows the throughput performance for the gRPC implementation. The scalability performance dies down after 64 GPU nodes.

Figure 10 and Figure 11 show the throughput performance for Horovod and gRPC, respectively, using GoogleNet. Again, Horovod shows better scalability performance than gRPC under different workloads.

Figure 12 and Figure 13 show the throughput performance for Horovod and gRPC, respectively, using ResNet50. Again, Horovod shows better scalability than gRPC under different workloads.

Fig. 6. Scalability analysis with AlexNet using Horovod.
Fig. 7. Scalability analysis with AlexNet using gRPC.
Fig. 8. Throughput performance of Horovod using AlexNet.
Fig. 9. Throughput performance of gRPC using AlexNet.
Fig. 10. Throughput performance of Horovod using GoogleNet.
Fig. 11. Throughput performance of gRPC using GoogleNet.
Fig. 12. Throughput performance of Horovod using ResNet50.
Fig. 13. Throughput performance of gRPC using ResNet50.
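For contrast with the Horovod code path, the sketch below shows the kind of boilerplate the gRPC-based parameter-server setup in TensorFlow 1.x requires: a cluster specification, a per-process server, and variable placement through replica_device_setter. The host names, ports, toy model, and hard-coded job name are placeholder assumptions; this is not the configuration used in the paper's experiments.

```python
# Sketch of TF 1.x distributed training over gRPC with a parameter server.
# In a real deployment each process receives its own job_name/task_index (hard-coded here).
import tensorflow as tf

job_name, task_index = "worker", 0
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],                               # placeholder hosts
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                              # the parameter server just serves variables
else:
    # Variables are placed on the ps job, compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.int64, [None])
        logits = tf.layers.dense(x, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    hooks = [tf.train.StopAtStepHook(last_step=1000)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            # feed real batches here; a single dummy example keeps the sketch self-contained
            sess.run(train_op, feed_dict={x: [[0.0] * 784], y: [0]})
```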

VII. CONCLUSION AND FUTURE WORK

Deep Neural Networks, or Deep Learning algorithms, have become a popular choice for data analysis. Several Deep Learning implementations, such as AlexNet, GoogleNet, and ResNet, have become readily available for use with little modification. Distributed Deep Learning implementations capable of execution on large-scale systems are becoming important to address the computational needs of the large data produced by scientific simulations and experiments, and by social media such as Facebook, Youtube, etc. However, the adoption of distributed Deep Learning faces many significant challenges. The top two challenges are: 1) Portability – most implementations require a data analyst to modify their code significantly; and 2) Scalability – several distributed Deep Learning implementations are geared towards cloud computing systems, which is inadequate for execution on massively parallel systems such as supercomputers. Recently, Uber presented the Horovod framework, which is a fast and efficient framework for porting Deep Learning algorithms written in Tensorflow, Keras, and Pytorch to GPU/CPU clusters using the MPI programming model.

In this paper, we presented a detailed performance analysis of the Horovod framework for scalability under various runtime parameters. We used well-known convolutional neural networks with synthetic data for our experimentation. Our results show that the framework scales well up to 256 GPUs, showing almost linear scalability for throughput performance (images/sec). However, the framework does not show linear scalability for latency performance (epochs/sec) beyond 128 GPUs.

For future work, we are planning to do a detailed analysis of the framework with a real dataset, such as ImageNet, using a performance analysis tool. This fine-level performance analysis will help us pinpoint the main performance bottlenecks for the Horovod framework.

VIII. ACKNOWLEDGEMENT

The authors would like to thank Alexander Sergeev from Uber Technologies, Inc. for a thoughtful discussion.

REFERENCES

[1] A. Awan, H. Subramoni, and D. K. Panda. An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures. In 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017.
[2] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. 2018. https://arxiv.org/abs/1802.05799
[3] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: efficient primitives for deep learning. arXiv:1410.0759, 2014.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[6] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
[9] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[10] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[11] A. Lavin. maxDNN: an efficient convolution kernel for deep learning with Maxwell GPUs. arXiv:1501.06633, 2015.
[12] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: a GPU performance evaluation. arXiv:1412.7580, 2014.
[13] J. Dean. Keynote: Large scale deep learning. In GPU Technology Conference, 2015.
[14] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv:1504.00702, 2015.
[15] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv:1412.7024, 2014.
[16] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223-1231, 2012.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737-1746, 2015.
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: training neural networks with low precision weights and activations. arXiv:1609.07061, 2016.
[19] Q. V. Le. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8595-8598, 2013.
[20] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou. Optimizing memory efficiency for deep convolutional neural networks on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, Article 54, 2016.
[21] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.
[22] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech, pages 1058-1062, 2014.
[23] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685-693, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[25] Y. You, A. Buluç, and J. Demmel. Scaling deep learning on GPU and Knights Landing clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, Article 9, 12 pages, 2017. DOI: https://doi.org/10.1145/3126908.3126912
[26] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng. Deep learning with COTS HPC systems. In International Conference on Machine Learning, pages 1337-1345, 2013.
[27] Q. V. Le. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8595-8598, 2013.
[28] P. Patarasuk and X. Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117-124, February 2009.
