Aggressive GPU cache bypassing with monolithic 3D-based NoC

Do, Cong Thuan; Kim, Cheol Hong; Chung, Sung Woo

doi:10.1007/s11227-022-04878-6

Published: 21 October 2022

Volume 79, pages 5421–5442, (2023)

The Journal of Supercomputing

A Correction to this article was published on 08 November 2022

This article has been updated

Abstract

Cache bypassing is widely employed to alleviate cache contention and pollution in GPUs. However, cache bypassing often puts more pressure on the network-on-chip (NoC) since the bypassed requests need to traverse the NoC to reach the lower-level memories, thus worsening the NoC congestion. In this paper, we propose an aggressive GPU cache bypassing technique (called SC-Table) to alleviate cache contention and pollution. The SC-Table relies on 2-bit saturating counters (SCs) to store the bypass history of warps. Memory requests issued by a warp are allowed to bypass the L1D when the corresponding SC’s value reaches the bypass threshold. In addition, we adopt the monolithic 3D-based NoC (M3D NoC) to provide better NoC throughput and latency. The combination of the SC-Table and the M3D NoC improves GPU performance by 34.6%, on average, over the baseline where there is no cache bypassing and the traditional 2D NoC is adopted.
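The bypass decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that each warp has a 2-bit saturating counter and that requests bypass the L1D once the counter reaches a threshold. The class name `SCTable`, the threshold value of 3, and the update rule (increment on an L1D miss, decrement on a hit) are assumptions made purely for illustration.

```python
SC_MAX = 3  # a 2-bit saturating counter holds values 0..3


class SCTable:
    """Illustrative per-warp table of 2-bit saturating counters."""

    def __init__(self, num_warps, bypass_threshold=SC_MAX):
        # bypass_threshold is an assumed value; the abstract does not give it.
        self.bypass_threshold = bypass_threshold
        self.counters = [0] * num_warps

    def should_bypass(self, warp_id):
        # Requests from this warp skip the L1D once its counter
        # has reached the bypass threshold.
        return self.counters[warp_id] >= self.bypass_threshold

    def record_l1d_outcome(self, warp_id, hit):
        # Assumed update rule (not taken from the paper):
        # a miss moves the counter toward bypassing, a hit moves it away.
        c = self.counters[warp_id]
        if hit:
            self.counters[warp_id] = max(0, c - 1)
        else:
            self.counters[warp_id] = min(SC_MAX, c + 1)


table = SCTable(num_warps=4)
for _ in range(5):
    table.record_l1d_outcome(0, hit=False)  # warp 0 keeps missing
print(table.should_bypass(0))  # True: counter saturated at 3
print(table.should_bypass(1))  # False: counter still 0
```

Because the counter saturates, a warp that keeps missing stays at the maximum value instead of wrapping around, and under the assumed rule a single hit is enough to bring it back below the threshold.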

Data availability

All data generated or analyzed during this study are included in this published article.

Change history

08 November 2022
A Correction to this paper has been published: https://doi.org/10.1007/s11227-022-04917-2

Notes

We use the NVIDIA terminology to describe the GPU architecture.
In this work, the training period length is set to 10 K clock cycles. Experiments with various training period lengths showed that performance is not very sensitive to this length.
Compared to the M3D NoC, the TSV3D NoC provides less performance speedup. Therefore, we do not show the results for the combination of cache bypassing and the TSV3D NoC.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C2003500, No. 2020R1A6A3A13064398), Samsung Electronics, College of Information, Korea University, and School of Information and Communications Technology, Hanoi University of Science and Technology. We would also like to thank Dr. Young Seo Lee and Mr. Ji Heon Lee for their help with thermal simulation and anonymous reviewers for their helpful feedback.

Author information

Authors and Affiliations

School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, 100000, Vietnam
Cong Thuan Do
School of Computer Science and Engineering, Soongsil University, Seoul, 06978, South Korea
Cheol Hong Kim
Department of Computer Science and Engineering, Korea University, Seoul, 136-713, South Korea
Sung Woo Chung

Corresponding authors

Correspondence to Cheol Hong Kim or Sung Woo Chung.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: In this article, Cong Thuan Do was incorrectly denoted as the corresponding author, but it should have been Cheol Hong Kim. Cheol Hong Kim and Sung Woo Chung are the co-corresponding authors of this article.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Do, C.T., Kim, C.H. & Chung, S.W. Aggressive GPU cache bypassing with monolithic 3D-based NoC. J Supercomput 79, 5421–5442 (2023). https://doi.org/10.1007/s11227-022-04878-6

Accepted: 05 October 2022
Published: 21 October 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s11227-022-04878-6
