CLC number: TP311
Crosschecked: 2018-10-15
Ji-dong Zhai, Wen-guang Chen. A vision of post-exascale programming[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(10): 1261-1266.
@article{Zhai2018postexascale,
title="A vision of post-exascale programming",
author="Ji-dong Zhai and Wen-guang Chen",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="10",
pages="1261-1266",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1800442"
}
%0 Journal Article
%T A vision of post-exascale programming
%A Ji-dong Zhai
%A Wen-guang Chen
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 10
%P 1261-1266
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1800442
TY - JOUR
T1 - A vision of post-exascale programming
A1 - Ji-dong Zhai
A1 - Wen-guang Chen
JO - Frontiers of Information Technology & Electronic Engineering
VL - 19
IS - 10
SP - 1261
EP - 1266
SN - 2095-9184
Y1 - 2018
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1800442
ER -
Abstract: Exascale systems have been under development for quite some time and will become available for use within a few years, so it is time to think about post-exascale systems. Such systems will face many challenges, including processor architecture, programming, storage, and interconnect. In this study, we discuss three significant programming challenges for future post-exascale systems: heterogeneity, parallelism, and fault tolerance. Drawing on our experience of programming current large-scale systems, we propose several potential solutions to these challenges; nevertheless, more research effort is needed to solve these problems.
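To make the fault-tolerance challenge concrete, the reference list points to algorithm-based fault tolerance (ABFT) work by Huang and Abraham (1984) [13] and Chen (2013) [7]. Below is a minimal sketch, in plain C, of the checksum idea behind ABFT for a matrix-vector product. It is only an illustration of the cited technique under our own simplifying assumptions (a small dense matrix, a single checksum row, a fixed relative tolerance); it is not code from the paper, and all identifiers are ours.

/* Minimal ABFT sketch: a checksum row appended to a matrix lets the result of
 * a matrix-vector product be verified after the computation (Huang and
 * Abraham, 1984 [13]). Illustrative code only; names and sizes are ours. */
#include <math.h>
#include <stdio.h>

#define N 4                        /* data matrix is N x N */

int main(void) {
    double A[N + 1][N], x[N], y[N + 1];

    /* Fill the data part of A and the vector x with arbitrary values. */
    for (int i = 0; i < N; i++) {
        x[i] = i + 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (double)(i * N + j + 1);
    }

    /* Encode: row N holds the column sums of the data rows. */
    for (int j = 0; j < N; j++) {
        A[N][j] = 0.0;
        for (int i = 0; i < N; i++)
            A[N][j] += A[i][j];
    }

    /* Compute y = A_encoded * x, including the checksum row. */
    for (int i = 0; i <= N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }

    /* Verify: without faults, y[N] equals the sum of y[0..N-1]. */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += y[i];
    if (fabs(sum - y[N]) > 1e-9 * fabs(sum))
        fprintf(stderr, "ABFT check failed: possible soft error\n");
    else
        printf("ABFT check passed\n");
    return 0;
}

The same checksum encoding extends to matrix-matrix operations, which is the setting of the original ABFT work [13].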
[1]Bahmani A, Mueller F, 2014. Scalable performance analysis of exascale MPI programs through signature-based clustering algorithms. 28th ACM Int Conf on Supercomputing, p.155-164.
[2]Balaji P, Snir M, Amer A, et al., 2013. Exascale MPI. https://www.exascaleproject.org/project/exascale-mpi/ [Accessed on Sept. 10, 2018].
[3]Bland W, Du P, Bouteiller A, et al., 2012. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. European Conf on Parallel Processing, p.477-488.
[4]Bland W, Bouteiller A, Herault T, et al., 2013. Post-failure recovery of MPI communication capability: design and rationale. Int J High Perform Comput Appl, 27(3):244-254.
[5]Bouteiller A, Cappello F, Herault T, et al., 2003. MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. ACM/IEEE Conf on Supercomputing, p.1-17.
[6]Cappello F, 2009. Fault tolerance in petascale/exascale systems: current knowledge, challenges, and research opportunities. Int J High Perform Comput Appl, 23(3):212-226.
[7]Chen Z, 2013. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. ACM SIGPLAN Not, 48(8):167-176.
[8]Dagum L, Menon R, 1998. OpenMP: an industry standard API for shared-memory programming. IEEE Comput Sci Eng, 5(1):46-55.
[9]Dean J, Ghemawat S, 2008. MapReduce: simplified data processing on large clusters. Commun ACM, 51(1):107-113.
[10]Dong X, Muralimanohar N, Jouppi N, et al., 2009. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems. Int Conf on High Performance Computing Networking, Storage, and Analysis, p.1-12.
[11]Fu H, Liao J, Yang J, et al., 2016. The Sunway TaihuLight supercomputer: system and applications. Sci China Inf Sci, 59(7):072001.
[12]Gropp W, 2009. MPI at exascale: challenges for data structures and algorithms. In: Ropo M, Westerholm J, Dongarra J (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface. Springer Berlin Heidelberg.
[13]Huang KH, Abraham JA, 1984. Algorithm-based fault tolerance for matrix operations. IEEE Trans Comput, C-33(6):518-528.
[14]Jeffers J, Reinders J, 2013. Intel Xeon Phi Coprocessor High Performance Programming. Morgan Kaufmann Publishers Inc., San Francisco, USA.
[15]Lee S, Vetter JS, 2012. Early evaluation of directive-based GPU programming models for productive exascale computing. Int Conf on High Performance Computing, Networking, Storage, and Analysis, p.1-11.
[16]Lin H, Tang X, Yu B, et al., 2017. Scalable graph traversal on Sunway TaihuLight with ten million cores. Int Parallel and Distributed Processing Symp, p.635-645.
[17]Munshi A, 2009. The OpenCL specification. 21st IEEE Hot Chips Symp, p.1-314.
[18]Ragan-Kelley J, Adams A, 2012. Halide. http://halide-lang.org [Accessed on Sept. 10, 2018].
[19]Ragan-Kelley J, Barnes C, Adams A, et al., 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not, 48(6):519-530.
[20]Schroeder B, Gibson G, 2010. A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Sec Comput, 7(4):337-350.
[21]Stone JE, Gohara D, Shi G, 2010. OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng, 12(3):66-73.
[22]Tang X, Zhai J, Yu B, et al., 2017. Self-checkpoint: an in-memory checkpoint method using less space and its practice on fault-tolerant HPL. 22nd ACM SIGPLAN Symp on Principles and Practice of Parallel Programming, p.401-413.
[23]Tang X, Zhai J, Qian X, et al., 2018. VSensor: leveraging fixed-workload snippets of programs for performance variance detection. 23rd ACM SIGPLAN Symp on Principles and Practice of Parallel Programming, p.124-136.
[24]Vetter JS, Glassbrook R, Dongarra J, et al., 2011. Keeneland: bringing heterogeneous GPU computing to the computational science community. Comput Sci Eng, 13(5):90-95.
[25]Xin RS, Gonzalez JE, Franklin MJ, et al., 2013. GraphX: a resilient distributed graph system on Spark. 1st Int Workshop on Graph Data Management Experiences and Systems, p.1-6.
[26]Yao E, Wang R, Chen M, et al., 2012. A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism. 26th Int Parallel and Distributed Processing Symp, p.438-448.
[27]Zhu X, Chen W, Zheng W, et al., 2016. Gemini: a computation-centric distributed graph processing system. 12th USENIX Symp on Operating Systems Design and Implementation, p.301-316.