CLC number: TP306
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-05-07
Yining Qi, Chongrong Fang, Haoyu Liu, Daxiang Kang, Biao Lyu, Peng Cheng, Jiming Chen. A survey of cloud network fault diagnostic systems and tools[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000153
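A survey of cloud network fault diagnostic systems and tools
1 State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China
2 Alibaba Group, Hangzhou 310024, China
Abstract: In recent years, cloud networks have become a key infrastructure industry supporting everyday production and life. However, as cloud networks grow more complex, network faults occur more easily and cause huge economic losses. To safeguard cloud network performance and keep faults from causing severe impact, cloud network fault diagnosis has therefore become one of the key technologies studied by cloud service providers. Because of the characteristics of cloud networks (for example, virtualization and multi-tenancy), porting traditional network diagnostic tools to cloud networks faces many difficulties. Moreover, many existing tools cannot solve the problems unique to cloud networks. This paper summarizes the state-of-the-art cloud network fault diagnostic systems and tools proposed in recent years that can be used in cloud network production environments, and classifies them according to their features. Based on the characteristics of cloud networks, it also analyzes how cloud network fault diagnosis differs from traditional network fault diagnosis. Considering the practical production requirements of cloud networks, it proposes key points to attend to when designing cloud network fault diagnostic tools. Finally, it discusses the opportunities and challenges that cloud network fault diagnosis faces in its future development.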
Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2025 Journal of Zhejiang University-SCIENCE