CLC number: TP315
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2022-10-19
https://orcid.org/0000-0003-3542-4869
Jianbin FANG, Peng ZHANG, Chun HUANG, Tao TANG, Kai LU, Ruibo WANG, Zheng WANG. Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000[J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(4): 509-520.
@article{Fang2023,
title="Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000",
author="Jianbin FANG and Peng ZHANG and Chun HUANG and Tao TANG and Kai LU and Ruibo WANG and Zheng WANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="24",
number="4",
pages="509-520",
year="2023",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200359"
}
%0 Journal Article
%T Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000
%A Jianbin FANG
%A Peng ZHANG
%A Chun HUANG
%A Tao TANG
%A Kai LU
%A Ruibo WANG
%A Zheng WANG
%J Frontiers of Information Technology & Electronic Engineering
%V 24
%N 4
%P 509-520
%@ 2095-9184
%D 2023
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2200359
TY - JOUR
T1 - Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000
A1 - Jianbin FANG
A1 - Peng ZHANG
A1 - Chun HUANG
A1 - Tao TANG
A1 - Kai LU
A1 - Ruibo WANG
A1 - Zheng WANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 24
IS - 4
SP - 509
EP - 520
SN - 2095-9184
Y1 - 2023
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2200359
ER -
Abstract: As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. In this paper, we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal accelerators. Our programming models have been deployed in the production environment of an exascale prototype system.