Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system

Abstract: With the continuous improvement of supercomputer performance and the integration of artificial intelligence with traditional scientific computing, the scale of applications is gradually increasing from millions to tens of millions of computing cores, which poses great challenges to achieving high scalability and efficiency for parallel applications on super-large-scale systems. Taking the Sunway exascale prototype system as an example, we first analyze the challenges of high scalability and high efficiency for parallel applications in the exascale era. We then highlight the optimization technologies used in the parallel supporting environment software on the Sunway exascale prototype system to overcome these challenges, including the parallel operating system, input/output (I/O) optimization technology, ultra-large-scale parallel debugging technology, 10-million-core parallel algorithm, and mixed-precision method. The parallel operating system and I/O optimization technology mainly support large-scale system scaling, while the ultra-large-scale parallel debugging technology, 10-million-core parallel algorithm, and mixed-precision method mainly enhance the efficiency of large-scale applications. Finally, the contributions to various applications running on the Sunway exascale prototype system are introduced, verifying the effectiveness of the parallel supporting environment design.

Key words: Parallel computing; Sunway; Ultra-large-scale; Supercomputer

Authors: Xiaobin He, Xin Chen, Heng Guo, Xin Liu, Dexun Chen, Yuling Yang, Jie Gao, Yunlong Feng, Longde Chen, Xiaona Diao, Zuoning Chen

Affiliation: National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China


DOI: 10.1631/FITEE.2200412

CLC number: TP302

On-line Access: 2023-01-21

Received: 2022-09-25

Revision Accepted: 2023-01-21

Crosschecked: 2022-11-29

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE