Abstract
This paper proposes an efficient algorithm-mapping method for accelerating deep convolutional neural networks, which includes: (1) an efficient transformation method that converts the computations of a CNN's convolutional and fully connected layers into efficient large-scale matrix multiplications, and converts pooling-layer computations into efficient matrix row computations; (2) a set of general and efficient vectorization methods for the convolutional, fully connected, and pooling layers on the vector accelerator. Experimental results on the accelerator show that the average computational efficiency of the convolutional and fully connected layers of AlexNet, VGG-19, GoogLeNet, and ResNet-50 is 93.3% and 93.4%, respectively, and the average data-access efficiency of the pooling layer is 70%.
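The abstract's first contribution, lowering a convolutional layer to one large matrix multiplication, is commonly realized with an im2col-style unfolding. The sketch below is an illustration of that general idea, not the paper's exact mapping; the function names `im2col` and `conv2d_as_matmul` and the layer shapes are assumptions for the example.

```python
import numpy as np

def im2col(x, k, stride=1):
    """Unfold a (C, H, W) input into a (C*k*k, out_h*out_w) matrix so
    that convolution becomes a single large matrix multiplication."""
    C, H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, idx] = patch.ravel()  # one column per output position
            idx += 1
    return cols

def conv2d_as_matmul(x, weights, stride=1):
    """weights: (F, C, k, k). Returns the (F, out_h, out_w) output of a
    valid convolution, computed as a (F, C*k*k) x (C*k*k, positions) GEMM."""
    F, C, k, _ = weights.shape
    cols = im2col(x, k, stride)
    out = weights.reshape(F, -1) @ cols  # the large matrix multiply
    out_h = (x.shape[1] - k) // stride + 1
    out_w = (x.shape[2] - k) // stride + 1
    return out.reshape(F, out_h, out_w)
```

On a wide-SIMD vector accelerator this formulation is attractive because the inner GEMM maps directly onto the hardware's matrix-multiply pipelines, which is precisely the property the paper's transformation exploits.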
This work is supported by the National Natural Science Foundation of China (No. 61572025).
Copyright information
© 2021 IFIP International Federation for Information Processing
About this paper
Cite this paper
Liu, Z., Ma, S., Li, C., Chen, H. (2021). Accelerating Large-Scale Deep Convolutional Neural Networks on Multi-core Vector Accelerators. In: He, X., Shao, E., Tan, G. (eds) Network and Parallel Computing. NPC 2020. Lecture Notes in Computer Science(), vol 12639. Springer, Cham. https://doi.org/10.1007/978-3-030-79478-1_6
DOI: https://doi.org/10.1007/978-3-030-79478-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79477-4
Online ISBN: 978-3-030-79478-1