CLC number: TP315
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-10-24
https://orcid.org/0000-0003-2368-4946
Mingtian SHAO, Kai LU, Wanqing CHI, Ruibo WANG, Yiqin DAI, Wenzhe ZHANG. TEES: topology-aware execution environment service for fast and agile application deployment in HPC[J]. Frontiers of Information Technology & Electronic Engineering, 2022, 23(11): 1631-1645.
@article{title="TEES: topology-aware execution environment service for fast and agile application deployment in HPC",
author="Mingtian SHAO, Kai LU, Wanqing CHI, Ruibo WANG, Yiqin DAI, Wenzhe ZHANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="23",
number="11",
pages="1631-1645",
year="2022",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2100284"
}
%0 Journal Article
%T TEES: topology-aware execution environment service for fast and agile application deployment in HPC
%A Mingtian SHAO
%A Kai LU
%A Wanqing CHI
%A Ruibo WANG
%A Yiqin DAI
%A Wenzhe ZHANG
%J Frontiers of Information Technology & Electronic Engineering
%V 23
%N 11
%P 1631-1645
%@ 2095-9184
%D 2022
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2100284
TY - JOUR
T1 - TEES: topology-aware execution environment service for fast and agile application deployment in HPC
A1 - Mingtian SHAO
A1 - Kai LU
A1 - Wanqing CHI
A1 - Ruibo WANG
A1 - Yiqin DAI
A1 - Wenzhe ZHANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 23
IS - 11
SP - 1631
EP - 1645
SN - 2095-9184
Y1 - 2022
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2100284
ER -
Abstract: High-performance computing (HPC) systems are about to reach a new height: exascale. Application deployment is becoming an increasingly prominent problem. Container technology solves the problems of encapsulating and migrating applications together with their execution environments. However, container images are large, and deploying an image to a large number of compute nodes is time-consuming. Although the peer-to-peer (P2P) approach brings higher transmission efficiency, it introduces a larger network load. All of these issues lead to high application startup latency. To solve these problems, we propose the topology-aware execution environment service (TEES) for fast and agile application deployment on HPC systems. TEES creates a more lightweight execution environment for users and uses a more efficient topology-aware P2P approach to reduce deployment time. Combined with a split-step transport and launch-in-advance mechanism, TEES reduces application startup latency. On the Tianhe HPC system, TEES deploys and starts a typical application on 17 560 compute nodes within 3 s. Compared with container-based application deployment, the speed is increased 12-fold and the network load is reduced by 85%.