Abstract
Efficient utilization of processors in heterogeneous CPU–GPU systems is crucial for improving overall application performance by reducing workload completion time. This article introduces a framework for scheduling sparse matrix-based applications on a heterogeneous CPU–GPU system with maximum performance. The framework splits the matrix into chunks and employs machine learning to determine the chunk size that maximizes scheduling efficiency, treating the number of GPU streams as a critical factor. The proposed scheduling algorithm is inspired by the statistical concept of quartiles and operates in real time, imposing minimal overhead on the system. The framework was evaluated on the SpMV (sparse matrix–vector multiplication) kernel, which is essential for applications such as matrix-based graph processing, using a system equipped with an NVIDIA GTX 1070 GPU. On real-world sparse matrices, the proposed scheduling algorithm significantly outperforms no offloading, full offloading, and the Alternate Assignment method.
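The abstract describes the approach only at a high level. The sketch below illustrates one way a quartile-guided CPU–GPU split of matrix chunks could look; it is a minimal illustration under stated assumptions, not the paper's algorithm. In particular, the chunking into row blocks, the use of per-chunk nonzero counts, and the Q1/Q3 thresholds are all assumptions made for demonstration.

```python
# Hypothetical sketch of a quartile-guided CPU/GPU chunk assignment for SpMV.
# Assumptions (not from the paper): chunks are row blocks of a sparse matrix,
# each summarized by its nonzero count (nnz); chunks above the upper quartile
# (Q3) go to the GPU, chunks below the lower quartile (Q1) stay on the CPU,
# and the middle half is assigned to whichever side currently has less load.

import statistics

def quartile_schedule(chunk_nnz):
    """Partition chunk indices into CPU and GPU work lists."""
    q1, _, q3 = statistics.quantiles(chunk_nnz, n=4)
    cpu, gpu = [], []
    for i, nnz in enumerate(chunk_nnz):
        if nnz >= q3:      # dense chunks favor GPU throughput
            gpu.append(i)
        elif nnz <= q1:    # very sparse chunks avoid transfer overhead
            cpu.append(i)
        else:              # middle half: send to the lighter side
            cpu_load = sum(chunk_nnz[j] for j in cpu)
            gpu_load = sum(chunk_nnz[j] for j in gpu)
            (cpu if cpu_load <= gpu_load else gpu).append(i)
    return cpu, gpu

# Example: nonzero counts for 8 row chunks of a sparse matrix.
cpu_chunks, gpu_chunks = quartile_schedule([120, 4500, 80, 3900, 650, 700, 5200, 90])
print("CPU:", cpu_chunks, "GPU:", gpu_chunks)
```

In a real system the per-chunk loads would feed asynchronous GPU streams, and the thresholds would come from the learned chunk-size model rather than fixed quartile cut points.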
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
ASB contributed to methodology, algorithms, software, validation, writing—original draft, and editing. AS contributed to methodology, review and editing, and supervision. MN contributed to review and editing.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
No datasets were generated or analyzed during the current study.
Code availability
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shokrani Baigi, A., Savadi, A. & Naghibzadeh, M. A high-performance dynamic scheduling for sparse matrix-based applications on heterogeneous CPU–GPU environment. J Supercomput 80, 25071–25098 (2024). https://doi.org/10.1007/s11227-024-06394-1