Aggressive GPU cache bypassing with monolithic 3D-based NoC

The Journal of Supercomputing

A Correction to this article was published on 08 November 2022

This article has been updated

Abstract

Cache bypassing is widely employed to alleviate cache contention and pollution in GPUs. However, cache bypassing often puts more pressure on the network-on-chip (NoC) since the bypassed requests need to traverse the NoC to reach the lower-level memories, thus worsening the NoC congestion. In this paper, we propose an aggressive GPU cache bypassing technique (called SC-Table) to alleviate cache contention and pollution. The SC-Table relies on 2-bit saturating counters (SCs) to store the bypass history of warps. Memory requests issued by a warp are allowed to bypass the L1D when the corresponding SC’s value reaches the bypass threshold. In addition, we adopt the monolithic 3D-based NoC (M3D NoC) to provide better NoC throughput and latency. The combination of the SC-Table and the M3D NoC improves GPU performance by 34.6%, on average, over the baseline where there is no cache bypassing and the traditional 2D NoC is adopted.
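The per-warp bypass decision described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the counters are updated or what the bypass threshold is, so the update rule here (increment on an L1D miss, decrement on a hit) and the threshold of 3 are assumptions for illustration, and the names `SCTable`, `record`, and `should_bypass` are hypothetical. The training period mentioned in the notes is not modeled.

```python
from dataclasses import dataclass, field

SC_MAX = 3  # a 2-bit saturating counter holds values 0..3


@dataclass
class SCTable:
    """One 2-bit saturating counter per warp (hypothetical model)."""
    num_warps: int
    bypass_threshold: int = SC_MAX  # assumed value; not given in the abstract
    counters: list = field(default_factory=list)

    def __post_init__(self):
        # Every warp starts with no bypass history.
        self.counters = [0] * self.num_warps

    def should_bypass(self, warp_id: int) -> bool:
        # Requests from this warp skip the L1D once its counter
        # reaches the bypass threshold.
        return self.counters[warp_id] >= self.bypass_threshold

    def record(self, warp_id: int, l1d_hit: bool) -> None:
        # Assumed update rule: a miss moves the counter toward bypassing,
        # a hit moves it away; the value saturates at 0 and SC_MAX.
        c = self.counters[warp_id]
        if l1d_hit:
            self.counters[warp_id] = max(0, c - 1)
        else:
            self.counters[warp_id] = min(SC_MAX, c + 1)


table = SCTable(num_warps=4)
for _ in range(3):
    table.record(0, l1d_hit=False)
print(table.should_bypass(0), table.should_bypass(1))
```

Saturation keeps the storage cost at two bits per warp, and it means a single hit after a long run of misses is enough to drop a warp back below the assumed threshold.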

Data availability

All data generated or analyzed during this study are included in this published article.

Notes

  1. We use the NVIDIA terminology to describe the GPU architecture.

  2. In this work, the training period length is set to 10K clock cycles. In experiments with various training period lengths, we found that performance is not very sensitive to this parameter.

  3. Compared to the M3D NoC, the TSV3D NoC provides a smaller speedup. Therefore, we do not show results for the combination of cache bypassing and the TSV3D NoC.


Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C2003500, No. 2020R1A6A3A13064398), Samsung Electronics, College of Information, Korea University, and School of Information and Communications Technology, Hanoi University of Science and Technology. We would also like to thank Dr. Young Seo Lee and Mr. Ji Heon Lee for their help with thermal simulation and anonymous reviewers for their helpful feedback.

Author information

Corresponding authors

Correspondence to Cheol Hong Kim or Sung Woo Chung.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: Cong Thuan Do was incorrectly denoted as the corresponding author, but it should have been Cheol Hong Kim. Cheol Hong Kim and Sung Woo Chung are the co-corresponding authors of this article.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Do, C.T., Kim, C.H. & Chung, S.W. Aggressive GPU cache bypassing with monolithic 3D-based NoC. J Supercomput 79, 5421–5442 (2023). https://doi.org/10.1007/s11227-022-04878-6

  • DOI: https://doi.org/10.1007/s11227-022-04878-6
