Abstract
The ever-increasing application footprint poses challenges for GPUs. As Moore's Law approaches its limit, improving single-GPU performance further is difficult; instead, multi-GPU systems have proven a promising alternative thanks to their GPU-level parallelism. In addition, memory virtualization in recent GPUs simplifies multi-GPU programming. Memory virtualization requires support for address translation, and translation overhead has a significant impact on system performance. Multi-GPU systems currently employ two common address translation architectures: distributed and centralized. We find that both suffer performance loss in certain cases. To address this issue, we propose GMMU Bypass, a technique that allows address translation requests to dynamically bypass the GMMU in order to reduce translation overhead. Simulation results show that our technique outperforms the distributed address translation architecture by 6% and the centralized address translation architecture by 106% on average.
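The abstract's core idea, letting a translation request dynamically skip the GMMU lookup, can be illustrated with a minimal sketch. The paper's actual bypass policy, latencies, and structures are not reproduced here; this sketch assumes a hypothetical hit-rate heuristic (the `BypassPredictor` class, its `window` and `threshold` parameters, and the cycle counts are all illustrative inventions).

```python
class BypassPredictor:
    """Track the recent GMMU hit rate; bypass when hits are rare.

    This heuristic is an assumption for illustration, not the paper's policy.
    """

    def __init__(self, window=64, threshold=0.25):
        self.window = window        # how many recent lookups to remember
        self.threshold = threshold  # bypass when hit rate falls below this
        self.history = []           # 1 = GMMU hit, 0 = GMMU miss

    def record(self, gmmu_hit):
        self.history.append(1 if gmmu_hit else 0)
        if len(self.history) > self.window:
            self.history.pop(0)

    def should_bypass(self):
        # Warm-up: with little evidence, keep using the GMMU.
        if len(self.history) < self.window // 4:
            return False
        return sum(self.history) / len(self.history) < self.threshold


def translate(vpn, gmmu_pages, predictor, gmmu_latency=50, iommu_latency=200):
    """Model one translation after a TLB miss; return (cycles, path taken).

    gmmu_pages: set of virtual page numbers resident in the local GMMU
    page table. Latencies are made-up cycle counts for illustration.
    """
    if predictor.should_bypass():
        # Skip the GMMU and walk the host page table directly.
        return iommu_latency, "bypass"
    hit = vpn in gmmu_pages
    predictor.record(hit)
    if hit:
        return gmmu_latency, "gmmu"
    # A GMMU miss pays for the failed local lookup *and* the host walk,
    # which is the overhead a well-timed bypass avoids.
    return gmmu_latency + iommu_latency, "gmmu-miss"
```

Under this toy model, a stream of requests that keeps missing the local GMMU quickly flips the predictor into bypass mode, trading the occasional local hit for never paying the combined miss penalty; that trade-off is the kind of dynamic decision the abstract describes.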
Acknowledgement
This work is partially supported by Research Project of NUDT ZK20-04, PDL Foundation 6142110180102, Science and Technology Innovation Project of Hunan Province 2018XK2102, and Advanced Research Program 31513010602-1.
Copyright information
© 2021 IFIP International Federation for Information Processing
Cite this paper
Wei, J., Lu, J., Yu, Q., Li, C., Zhao, Y. (2021). Dynamic GMMU Bypass for Address Translation in Multi-GPU Systems. In: He, X., Shao, E., Tan, G. (eds) Network and Parallel Computing. NPC 2020. Lecture Notes in Computer Science(), vol 12639. Springer, Cham. https://doi.org/10.1007/978-3-030-79478-1_13
DOI: https://doi.org/10.1007/978-3-030-79478-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79477-4
Online ISBN: 978-3-030-79478-1