Abstract
Cache bypassing is widely employed to alleviate cache contention and pollution in GPUs. However, cache bypassing often puts more pressure on the network-on-chip (NoC), since bypassed requests must traverse the NoC to reach the lower-level memories, which worsens NoC congestion. In this paper, we propose an aggressive GPU cache bypassing technique (called SC-Table) to alleviate cache contention and pollution. The SC-Table relies on 2-bit saturating counters (SCs) to store the bypass history of warps. Memory requests issued by a warp are allowed to bypass the L1D when the corresponding SC's value reaches the bypass threshold. In addition, we adopt the monolithic 3D-based NoC (M3D NoC) to provide better NoC throughput and latency. The combination of the SC-Table and the M3D NoC improves GPU performance by 34.6%, on average, over a baseline that uses no cache bypassing and a traditional 2D NoC.
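The bypass decision described above can be illustrated with a minimal sketch. The abstract specifies only that each warp has a 2-bit saturating counter and that a warp's requests bypass the L1D once the counter reaches a threshold. Everything else here is an assumption made for illustration: the class and method names, indexing the table by warp ID, a threshold of 3, and the update rule (increment on an L1D miss, decrement on a hit). The update policy in the paper itself may differ.

```python
class SCTable:
    """Illustrative per-warp table of 2-bit saturating counters.

    Hypothetical sketch: the names, threshold, and update policy are
    assumptions, not the design evaluated in the paper.
    """

    MAX_VALUE = 3  # a 2-bit counter saturates at 0b11

    def __init__(self, num_warps, bypass_threshold=3):
        # One counter per warp, all starting at 0 (no bypassing yet).
        self.counters = [0] * num_warps
        self.bypass_threshold = bypass_threshold

    def should_bypass(self, warp_id):
        # Requests from this warp skip the L1D once its counter
        # reaches the bypass threshold.
        return self.counters[warp_id] >= self.bypass_threshold

    def record_access(self, warp_id, l1d_hit):
        # Assumed update rule: a miss pushes the counter toward
        # bypassing, a hit pulls it back; both ends saturate.
        value = self.counters[warp_id]
        if l1d_hit:
            self.counters[warp_id] = max(0, value - 1)
        else:
            self.counters[warp_id] = min(self.MAX_VALUE, value + 1)


table = SCTable(num_warps=4)
for _ in range(5):
    table.record_access(warp_id=0, l1d_hit=False)
bypass_warp0 = table.should_bypass(0)  # counter saturated at 3
bypass_warp1 = table.should_bypass(1)  # counter still 0
```

With these assumed settings, a warp that keeps missing in the L1D starts bypassing after three misses, and a single hit is enough to bring it back below the threshold.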










Data availability
All data generated or analyzed during this study are included in this published article.
Change history
08 November 2022
A Correction to this paper has been published: https://doi.org/10.1007/s11227-022-04917-2
Notes
We use the NVIDIA terminology to describe the GPU architecture.
In this work, the training period length is set to 10 K clock cycles. After experimenting with various training period lengths, we found that performance is not very sensitive to this parameter.
Compared to the M3D NoC, the TSV3D NoC provides a smaller performance speedup. Therefore, we do not show the results for the combination of cache bypassing and the TSV3D NoC.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C2003500, No. 2020R1A6A3A13064398), Samsung Electronics, College of Information, Korea University, and School of Information and Communications Technology, Hanoi University of Science and Technology. We would also like to thank Dr. Young Seo Lee and Mr. Ji Heon Lee for their help with thermal simulation and anonymous reviewers for their helpful feedback.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: Cong Thuan Do was incorrectly denoted as the corresponding author; it should have been Cheol Hong Kim. Cheol Hong Kim and Sung Woo Chung are the co-corresponding authors of this article.
About this article
Cite this article
Do, C.T., Kim, C.H. & Chung, S.W. Aggressive GPU cache bypassing with monolithic 3D-based NoC. J Supercomput 79, 5421–5442 (2023). https://doi.org/10.1007/s11227-022-04878-6
DOI: https://doi.org/10.1007/s11227-022-04878-6