Abstract
Convolution is one of the most time-consuming operations in training deep neural networks. Existing convolution algorithms, such as FFT-, GEMM-, and Winograd-based methods and their variants, differ in both execution time and memory footprint, and no single algorithm is best for all convolution configurations (the parameters of a convolutional operation). This paper addresses the problem of selecting a convolution algorithm for a given configuration and proposes ConvDarts, a fast and exact selector. We propose an informed cache that is preset with common convolution configurations and their optimal algorithm indices, together with a lightweight machine-learning model that predicts the optimal algorithm for configurations that miss the cache. Compared with the heuristic and profiling approaches used in cuDNN, ConvDarts reduces both the training time and the memory footprint of classical deep learning networks, opening more possibilities for training network models in resource-constrained environments.
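The cache-then-predict flow described above can be illustrated with a minimal sketch. The names below (ConvConfig, ConvAlgoSelector, the particular configuration features) are hypothetical and do not reproduce the authors' implementation; the sketch assumes a preset cache built offline and any scikit-learn-style predictor (for example, a small gradient-boosted tree) trained to map configuration features to an algorithm index.

```python
# Illustrative sketch (not the authors' implementation): consult a preset cache
# of (convolution configuration -> algorithm index) first, and fall back to a
# lightweight learned model only on a cache miss.
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ConvConfig:
    # Parameters of a 2-D convolution; field names are illustrative.
    batch: int
    in_channels: int
    out_channels: int
    height: int
    width: int
    kernel: int
    stride: int
    padding: int


class ConvAlgoSelector:
    def __init__(self, preset_cache, model):
        # preset_cache: dict mapping ConvConfig to an algorithm index
        # (e.g. GEMM / FFT / Winograd variants), assumed to be built offline.
        # model: any predictor exposing predict(features) -> algorithm indices,
        # e.g. a small gradient-boosted tree.
        self.cache = dict(preset_cache)
        self.model = model

    def select(self, cfg: ConvConfig) -> int:
        algo = self.cache.get(cfg)
        if algo is None:
            # Cache miss: predict the algorithm from the configuration's
            # features and memoize the result for later look-ups.
            algo = int(self.model.predict([list(astuple(cfg))])[0])
            self.cache[cfg] = algo
        return algo
```

The intent of the two-level design, as the abstract describes it, is that common configurations are answered exactly from the preset cache, while a miss costs only one model inference rather than an exhaustive cuDNN-style profiling pass over all candidate algorithms.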
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Cite this article
Bai, L., Ji, W., Li, Q. et al. ConvDarts: a fast and exact convolutional algorithm selector for deep learning frameworks. CCF Trans. HPC 6, 32–44 (2024). https://doi.org/10.1007/s42514-023-00167-7