Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2023 Vol.24 No.1 P.41-58
Scalability and efficiency challenges for the exascale supercomputing system: practice of a parallel supporting environment on the Sunway exascale prototype system
Abstract: With the continuous improvement of supercomputer performance and the integration of artificial intelligence with traditional scientific computing, the scale of applications is steadily increasing, from millions to tens of millions of computing cores, which poses great challenges to achieving high scalability and efficiency for parallel applications on super-large-scale systems. Taking the Sunway exascale prototype system as an example, in this paper we first analyze the challenges of high scalability and high efficiency for parallel applications in the exascale era. To overcome these challenges, we highlight the optimization technologies used in the parallel supporting environment software on the Sunway exascale prototype system, including the parallel operating system, input/output (I/O) optimization technology, ultra-large-scale parallel debugging technology, 10-million-core parallel algorithm, and mixed-precision method. The parallel operating system and I/O optimization technology mainly support large-scale system scaling, while the ultra-large-scale parallel debugging technology, 10-million-core parallel algorithm, and mixed-precision method mainly enhance the efficiency of large-scale applications. Finally, we introduce the contributions to various applications running on the Sunway exascale prototype system, which verify the effectiveness of the parallel supporting environment design.
Key words: Parallel computing; Sunway; Ultra-large-scale; Supercomputer
高洁,冯赟龙,陈龙得,刁晓娜,陈左宁
National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
DOI: 10.1631/FITEE.2200412
CLC number: TP302
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2022-11-29