JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering 2023 Vol.24 No.4 P.509-520

http://doi.org/10.1631/FITEE.2200359

Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000

Author(s): Jianbin FANG, Peng ZHANG, Chun HUANG, Tao TANG, Kai LU, Ruibo WANG, Zheng WANG
Affiliation(s): College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; more
Corresponding email(s): j.fang@nudt.edu.cn, zhangpeng13a@nudt.edu.cn, chunhuang@nudt.edu.cn
Key Words: Heterogeneous computing, Parallel programming models, Programmability, Compilers, Runtime systems


Jianbin FANG, Peng ZHANG, Chun HUANG, Tao TANG, Kai LU, Ruibo WANG, Zheng WANG. Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000[J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(4): 509-520.

@article{title="Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000",
author="Jianbin FANG, Peng ZHANG, Chun HUANG, Tao TANG, Kai LU, Ruibo WANG, Zheng WANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="24",
number="4",
pages="509-520",
year="2023",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200359"
}

%0 Journal Article
%T Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000
%A Jianbin FANG
%A Peng ZHANG
%A Chun HUANG
%A Tao TANG
%A Kai LU
%A Ruibo WANG
%A Zheng WANG
%J Frontiers of Information Technology & Electronic Engineering
%V 24
%N 4
%P 509-520
%@ 2095-9184
%D 2023
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2200359

TY  - JOUR
T1  - Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000
A1  - Jianbin FANG
A1  - Peng ZHANG
A1  - Chun HUANG
A1  - Tao TANG
A1  - Kai LU
A1  - Ruibo WANG
A1  - Zheng WANG
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 24
IS  - 4
SP  - 509
EP  - 520
SN  - 2095-9184
Y1  - 2023
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2200359
ER  - 


Abstract: As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. In this paper, we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal accelerators. Our programming models have been deployed in the production environment of an exascale prototype system.

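Heterogeneous multi-threaded programming models for bare-metal accelerators: a case study of Matrix-3000

Jianbin FANG¹, Peng ZHANG¹, Chun HUANG¹, Tao TANG¹, Kai LU¹, Ruibo WANG¹, Zheng WANG²
¹College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
²School of Computing, University of Leeds, Leeds LS2 9JT, UK
Abstract: As processor design shifts toward specialized heterogeneous many-core architectures to avoid the effects of the power wall, software developers find it difficult to handle the complexity of these processor systems. New processors such as Matrix-3000, a high-performance processor designed for next-generation exascale supercomputers, have a complex memory hierarchy and processor organization. This paper shares our experience of developing a parallel programming model for Matrix-3000, together with its supporting compiler and libraries. To assist software development, we built a software stack for Matrix-3000 from scratch, comprising a low-level programming interface and a high-level OpenCL compiler. The low-level programming model provides native programming support for the bare-metal accelerators of Matrix-3000, while the high-level model lets programmers use the OpenCL parallel programming standard. We detail the design choices behind the software stack and highlight the lessons learned from developing system software that enables efficient programming of bare-metal accelerators and unlocks their performance. Our programming models have been deployed in the production environment of an exascale prototype system.

Key words: Heterogeneous computing; Parallel programming models; Programmability; Compilers; Runtime systems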


