DOI: 10.1145/3582016.3582037
Research Article

In-Network Aggregation with Transport Transparency for Distributed Training

Published: 25 March 2023

Abstract

Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks that replace the transport layer. Such an INA-oriented network stack cannot take advantage of state-of-the-art, performant transport-layer implementations, and it also complicates system development and operation.
We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs aggregation but preserves the data-transmission connections, while the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce therefore combines the gains of both INA and RoCE: linear scalability, traffic reduction, and freed-up bandwidth from INA; high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce based on NetReduce that uses intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch, compare it with existing programmable-switch-based solutions, and justify the FPGA-based design choice. We evaluate NetReduce's performance by training typical deep neural network models on single-GPU and multi-GPU testbeds. NetReduce interoperates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., by 70% for AlexNet), reduces CPU overhead (e.g., only one core is needed for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption deliver 12.3-57.9% more acceleration).
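To make the offloaded operation concrete, the toy, framework-free sketch below (not NetReduce's actual API, protocol, or code; all names are illustrative) shows the all-reduce semantics that in-network aggregation moves into the switch: every worker contributes a gradient vector, the switch sums the vectors element-wise, and every worker receives the same aggregated result.

```python
# Toy sketch of the all-reduce semantics offloaded by in-network aggregation.
# This is NOT NetReduce code; function and variable names are illustrative only.
import numpy as np

def switch_aggregate(gradients):
    # The "switch": one element-wise sum over all workers' gradients,
    # performed in the network instead of hop-by-hop on the hosts.
    return np.sum(gradients, axis=0)

def reference_sum(gradients):
    # The result any correct all-reduce (e.g., a host-side ring all-reduce)
    # must produce: the element-wise sum of every worker's gradient.
    total = np.zeros_like(gradients[0])
    for g in gradients:
        total += g
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Eight workers, each with a 1024-element float32 gradient shard.
    workers = [rng.standard_normal(1024).astype(np.float32) for _ in range(8)]
    assert np.allclose(switch_aggregate(workers), reference_sum(workers), atol=1e-4)
    print("switch aggregation matches the host-side all-reduce result")
```

Under standard collective-communication analysis (a general estimate, not a number from this paper), ring all-reduce sends roughly 2(N-1)/N times the gradient size per worker per iteration, whereas switch-side aggregation sends each gradient upstream once and returns the aggregated result once; this is the source of the traffic reduction and freed-up bandwidth noted above.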




    Published In

    ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
    March 2023
    820 pages
ISBN: 9781450399180
DOI: 10.1145/3582016


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 March 2023


    Author Tags

    1. Distributed Training
    2. FPGA
    3. In-Network Aggregation
    4. RDMA

    Qualifiers

    • Research-article

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%



    Article Metrics

• Downloads (last 12 months): 434
• Downloads (last 6 weeks): 29
Reflects downloads up to 20 Feb 2025


    Cited By

• (2024) OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs. In Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 75-83. DOI: 10.1145/3672198.3673804. Online: 4 Aug 2024.
• (2024) CollaSFC: An Intelligent Collaborative Approach for In-network SFC Failure Detection in Data Center for AI Computing. In Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 41-47. DOI: 10.1145/3672198.3673798. Online: 4 Aug 2024.
• (2024) Straggler-Aware Gradient Aggregation for Large-Scale Distributed Deep Learning System. IEEE/ACM Transactions on Networking 32(6), 4917-4930. DOI: 10.1109/TNET.2024.3441039. Online: Dec 2024.
• (2024) Releasing the Power of In-Network Aggregation With Aggregator-Aware Routing Optimization. IEEE/ACM Transactions on Networking 32(5), 4488-4502. DOI: 10.1109/TNET.2024.3423380. Online: Oct 2024.
• (2024) Accelerating Distributed Training With Collaborative In-Network Aggregation. IEEE/ACM Transactions on Networking 32(4), 3437-3452. DOI: 10.1109/TNET.2024.3387948. Online: Aug 2024.
• (2024) Lins: Reducing Communication Overhead of ZeRO for Efficient LLM Training. In 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 1-10. DOI: 10.1109/IWQoS61813.2024.10682856. Online: 19 Jun 2024.
• (2024) Leveraging SmartNIC for Ring AllReduce Offloading. In 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 173-180. DOI: 10.1109/ISPA63168.2024.00030. Online: 30 Oct 2024.
• (2024) Host-driven In-Network Aggregation on RDMA. In IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 1051-1060. DOI: 10.1109/INFOCOM52122.2024.10621230. Online: 20 May 2024.
• (2024) Zebra: Accelerating Distributed Sparse Deep Training With in-Network Gradient Aggregation for Hot Parameters. In 2024 IEEE 32nd International Conference on Network Protocols (ICNP), 1-11. DOI: 10.1109/ICNP61940.2024.10858501. Online: 28 Oct 2024.
• (2024) P4ce: Consensus over RDMA at Line Speed. In 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), 508-519. DOI: 10.1109/ICDCS60910.2024.00054. Online: 23 Jul 2024.
