Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Computing-aware network (CAN): a systematic design of computing and network convergence

Abstract: Network resources now cover an increasingly wide area, and computing resources have likewise become fundamental infrastructure that provides ubiquitous computing services. However, in wide area networks (WANs), the underlying network and computing resources have not been closely studied together or co-designed, and problems persist: slow computing service scheduling, inflexible data distribution, and inefficient data transmission. This paper proposes the architectural design of a computing-aware network (CAN), whose core contribution is an awareness plane that collects, manages, and synthesizes computing and network information. The awareness plane, control plane, and data plane thus form a closed-loop control system that improves the overall system's awareness capability, decision-making capability, and data forwarding functionality. To enable the CAN architecture, three key technologies are proposed: computing-aware traffic steering (CATS), elastic broadcast, and wide-area high-throughput transmission. The paper takes artificial intelligence (AI) model training, inference, and offline parameter transmission as examples to show the applicability of CAN, and it identifies some future research directions.
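The closed loop described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the `Instance` record, the two delay fields, the additive cost, and the sample values are all assumptions made for illustration. The awareness plane is modeled as a list of reports, and the control-plane decision is a simple choice of the service instance with the lowest combined network and computing delay.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One service instance as reported by a hypothetical awareness plane."""
    name: str
    network_delay_ms: float  # assumed metric: path delay to the site
    compute_delay_ms: float  # assumed metric: queueing plus processing delay


def steer(instances: list[Instance]) -> Instance:
    """Computing-aware choice: minimize network delay plus computing delay."""
    if not instances:
        raise ValueError("no service instances reported")
    return min(instances, key=lambda i: i.network_delay_ms + i.compute_delay_ms)


def nearest_only(instances: list[Instance]) -> Instance:
    """Baseline that ignores computing state and picks the closest site."""
    if not instances:
        raise ValueError("no service instances reported")
    return min(instances, key=lambda i: i.network_delay_ms)


# Illustrative reports; the values are invented for this example.
reports = [
    Instance("edge-a", network_delay_ms=5.0, compute_delay_ms=40.0),
    Instance("edge-b", network_delay_ms=12.0, compute_delay_ms=8.0),
    Instance("cloud", network_delay_ms=30.0, compute_delay_ms=3.0),
]

print("computing-aware choice:", steer(reports).name)
print("network-only choice:", nearest_only(reports).name)
```

With these sample reports, the network-only baseline selects the closest but heavily loaded site, while the computing-aware choice selects a slightly farther site with a lower total delay. A real deployment would need to define how such metrics are collected, normalized, and refreshed, which this sketch leaves out.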

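Chinese Summary: Computing-aware network: a system design for computing and network integration

Xiaoyun Wang1, Xiaodong Duan2, Kehan Yao2, Tao Sun2, Peng Liu2, Hongwei Yang2, Zhiqiang Li2
1China Mobile Communications Group Co., Ltd., Beijing 100032, China
2China Mobile Research Institute, Beijing 100053, China
Abstract: Network resources cover an ever wider area, and computing power resources have gradually become infrastructure capable of providing ubiquitous computing services. In wide area networks, however, the underlying network and computing resources lack close study or co-design, and problems remain, such as slow computing service scheduling, inflexible data distribution, and low data transmission efficiency. This paper proposes a system architecture design for the computing-aware network (CAN), whose core contribution is the introduction of an awareness plane to collect, manage, and synthesize computing and network information. The awareness plane, control plane, and data plane thus form a closed-loop control system that enhances the whole system's awareness capability, decision-making capability, and data forwarding functionality. To enable the CAN system, this paper proposes three key technologies: computing power routing, elastic broadcast, and wide-area high-throughput transmission. Taking artificial intelligence (AI) model training, inference, and offline parameter transmission as examples, the paper demonstrates the applicability of CAN and points out some future research directions.

Key words: Network architecture; Computing-aware network; Computing and network integration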


DOI: 10.1631/FITEE.2400098

On-line Access: 2024-06-04
Received: 2024-02-09
Revision Accepted: 2024-06-04
Crosschecked: 2024-03-17

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE