Fig. 7. Scalability analysis with AlexNet using gRPC.
Fig. 8. Throughput performance of Horovod using AlexNet.
Fig. 10. Throughput performance of Horovod using GoogleNet.
Fig. 11. Throughput performance of gRPC using GoogleNet.
VII. CONCLUSION AND FUTURE WORK

Deep Neural Networks, or Deep Learning algorithms, have become a popular choice for data analysis. Several Deep Learning implementations, such as AlexNet, GoogleNet, and ResNet, have become readily available for use with little modification. Distributed Deep Learning implementations capable of execution on large-scale systems are becoming important to address the computational needs of the large data produced by scientific simulations and experiments, and by social media such as Facebook and YouTube. However, the adoption of distributed Deep Learning faces many significant challenges. The top two challenges are: 1) Portability: most implementations require a data analyst to modify their code significantly, and 2) Scalability: several distributed Deep Learning implementations are geared towards cloud computing systems, which is inadequate for execution on massively parallel systems such as supercomputers. Recently, Uber introduced the Horovod framework, a fast and efficient framework for porting Deep Learning algorithms written in TensorFlow, Keras, and PyTorch to GPU/CPU clusters using the MPI programming model.
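To make this portability point concrete, the sketch below shows the handful of additions Horovod typically requires to turn a single-process TensorFlow 1.x training script into an MPI-style data-parallel one. It is only an illustration under assumed settings: the tiny synthetic model, batch size, learning rate, and step count are placeholders rather than the benchmark configuration used in our experiments.

# Minimal Horovod/TensorFlow porting sketch (illustrative only; the model,
# batch size, and learning rate below are placeholders, not our benchmark setup).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                            # initialize Horovod (MPI)

# Pin each MPI process to one local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder synthetic model standing in for AlexNet/GoogleNet/ResNet.
features = tf.random_uniform([32, 784])
labels = tf.random_uniform([32], maxval=10, dtype=tf.int64)
logits = tf.layers.dense(features, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across workers via allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast the initial variable state from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)

A script of this form is typically launched with one MPI process per GPU, for example with mpirun -np 16 python train_hvd.py (a hypothetical script name).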
In this paper, we presented a detailed performance analysis of the Horovod framework for scalability under various runtime parameters. We used well-known convolutional neural networks with synthetic data for our experimentation. Our results show that the framework scales well up to 256 GPUs, with almost linear scalability for throughput performance (images/sec). However, the framework does not show linear scalability for latency performance (epochs/sec) beyond 128 GPUs.
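To clarify how we read the throughput numbers, scaling efficiency follows directly from images/sec relative to a single-GPU baseline. The snippet below only illustrates the arithmetic; the throughput values in it are hypothetical placeholders, not our measured results.

# Illustration of how throughput scaling efficiency is derived from images/sec.
# The numbers below are hypothetical placeholders, NOT the measured results.
single_gpu = 200.0                                   # images/sec on 1 GPU
throughput = {8: 1550.0, 64: 11800.0, 256: 44000.0}  # GPUs -> images/sec

for n, t in sorted(throughput.items()):
    speedup = t / single_gpu
    efficiency = speedup / n          # 1.0 would mean perfectly linear scaling
    print("%4d GPUs: speedup %7.1f, efficiency %.2f" % (n, speedup, efficiency))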
For future work, we are planning to perform a detailed analysis of the framework with a real data set, such as ImageNet, using a performance analysis tool. This fine-grained analysis will help us pinpoint the main performance bottlenecks of the Horovod framework.
VIII. ACKNOWLEDGEMENT
The authors would like to thank Alexander Sergeev from
Uber Technologies, Inc. for a thoughtful discussion.
Fig. 12. Throughput performance of Horovod using ResNet50.
Fig. 13. Throughput performance of gRPC using ResNet50.