
A high-performance dynamic scheduling for sparse matrix-based applications on heterogeneous CPU–GPU environment

The Journal of Supercomputing

Abstract

Efficient utilization of processors in heterogeneous CPU–GPU systems is crucial for improving overall application performance by reducing workload completion time. This article introduces a framework for scheduling sparse matrix-based applications on heterogeneous CPU–GPU systems with the goal of maximizing performance. The framework splits the matrix into chunks and employs machine learning to find the chunk size that yields the best scheduling efficiency, treating the number of GPU streams as a critical factor. The proposed scheduling algorithm, inspired by the statistical concept of quartiles, operates in real time and is designed to impose minimal overhead on the system. The framework was evaluated on the SpMV (Sparse Matrix–Vector Multiplication) kernel, which is essential to applications such as matrix-based graph processing, using a system equipped with an NVIDIA GTX 1070 GPU. Tests on real-world sparse matrices show that the proposed scheduling algorithm significantly outperforms no offloading, full offloading, and the Alternate Assignment method.
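To make the chunk-based scheduling concrete, below is a minimal CUDA sketch, not the authors' implementation: it assumes the matrix is stored in CSR format, partitions its rows into fixed-size chunks, and issues each chunk's SpMV on one of two CUDA streams. The kernel name spmv_csr_chunk, the chunk size, and the stream count are illustrative assumptions; in the proposed framework the chunk size would be chosen by the learned model, and chunks could also be assigned to the CPU rather than the GPU.

// Minimal sketch (not the authors' code): one CUDA thread per matrix row,
// restricted to a chunk [row_begin, row_end) of a CSR matrix, so that
// different chunks can be issued on different streams (or kept on the CPU).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spmv_csr_chunk(int row_begin, int row_end,
                               const int *row_ptr, const int *col_idx,
                               const float *val, const float *x, float *y) {
    int row = row_begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (row < row_end) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}

int main() {
    // 4x4 example matrix in CSR form:
    // [1 0 2 0; 0 3 0 0; 4 0 5 6; 0 0 0 7]
    const int n = 4, nnz = 7;
    int   h_row_ptr[] = {0, 2, 3, 6, 7};
    int   h_col_idx[] = {0, 2, 1, 0, 2, 3, 3};
    float h_val[]     = {1, 2, 3, 4, 5, 6, 7};
    float h_x[]       = {1, 1, 1, 1};
    float h_y[n];

    int *d_row_ptr, *d_col_idx;
    float *d_val, *d_x, *d_y;
    cudaMalloc(&d_row_ptr, (n + 1) * sizeof(int));
    cudaMalloc(&d_col_idx, nnz * sizeof(int));
    cudaMalloc(&d_val, nnz * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_row_ptr, h_row_ptr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, h_col_idx, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    // Issue one kernel per row chunk, alternating between two streams.
    // The chunk size is fixed at 2 here for illustration; the framework
    // described in the paper would pick it with a learned model instead.
    const int chunk = 2, threads = 128;
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
    for (int begin = 0, i = 0; begin < n; begin += chunk, ++i) {
        int end = (begin + chunk < n) ? begin + chunk : n;
        int blocks = (end - begin + threads - 1) / threads;
        spmv_csr_chunk<<<blocks, threads, 0, s[i % 2]>>>(
            begin, end, d_row_ptr, d_col_idx, d_val, d_x, d_y);
    }
    cudaDeviceSynchronize();

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("y[%d] = %g\n", i, h_y[i]);   // expected: 3 3 15 7

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_val);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}

In a fuller version, per-chunk cudaMemcpyAsync calls issued on the same streams would let data transfers overlap kernel execution, which is one reason the number of streams becomes a tuning factor for chunked SpMV.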





Funding

No funding was received to assist with the preparation of this manuscript.

Author information


Contributions

ASB contributed to methodology, algorithms, software, validation, writing (original draft), and editing. AS contributed to methodology, review and editing, and supervision. MN contributed to review and editing.

Corresponding author

Correspondence to Abdorreza Savadi.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

No datasets were generated or analyzed during the current study.

Code availability

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shokrani Baigi, A., Savadi, A. & Naghibzadeh, M. A high-performance dynamic scheduling for sparse matrix-based applications on heterogeneous CPU–GPU environment. J Supercomput 80, 25071–25098 (2024). https://doi.org/10.1007/s11227-024-06394-1


