On-line Access: 2024-11-05
Received: 2024-05-28
Revision Accepted: 2024-08-25
Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, 1998, -1(-1): .
@article{FITEE.2400453,
title="Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator",
author="Yulong ZHAO and Chunzhi WU and Yizhuo WANG and Lufei ZHANG and Yaguang ZHANG and Wenyuan SHEN and Hao FAN and Hankang FANG and Yi QIN and Xin LIU",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="-1",
number="-1",
pages="",
year="1998",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2400453"
}
%0 Journal Article
%T Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
%A Yulong ZHAO
%A Chunzhi WU
%A Yizhuo WANG
%A Lufei ZHANG
%A Yaguang ZHANG
%A Wenyuan SHEN
%A Hao FAN
%A Hankang FANG
%A Yi QIN
%A Xin LIU
%J Frontiers of Information Technology & Electronic Engineering
%V -1
%N -1
%P
%@ 2095-9184
%D 1998
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2400453
TY - JOUR
T1 - Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
A1 - Yulong ZHAO
A1 - Chunzhi WU
A1 - Yizhuo WANG
A1 - Lufei ZHANG
A1 - Yaguang ZHANG
A1 - Wenyuan SHEN
A1 - Hao FAN
A1 - Hankang FANG
A1 - Yi QIN
A1 - Xin LIU
JO - Frontiers of Information Technology & Electronic Engineering
VL - -1
IS - -1
SP -
EP -
SN - 2095-9184
Y1 - 1998
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400453
ER -
Abstract: Transformer models have become a cornerstone of various natural language processing (NLP) tasks. However, the substantial computational overhead during inference remains a significant challenge, limiting their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of Transformer models using the controlling element on AI accelerators. Our work is anchored by four key contributions. First, we conducted a comprehensive analysis of the overhead composition within the Transformer inference process, identifying the primary bottlenecks. Second, we leveraged the Management Processing Element (MPE) of the Shenwei Artificial Intelligence (SWAI) accelerator, implementing a three-tier scheduling framework that reduced the number of host-device launches to roughly one ten-thousandth of that in the original PyTorch-GPU setup. Third, we introduced a zero-copy memory management technique using segment-page fusion, which significantly reduced memory access latency and improved overall inference efficiency. Finally, we developed a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22,128.31 milliseconds to 1,041.72 milliseconds. Together, these contributions enable more efficient and faster Transformer inference on AI accelerators.
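To put the loading-time figures from the abstract in perspective, the short Python sketch below derives the implied speedup factor and percentage reduction. Only the two millisecond values come from the abstract; the function and constant names are illustrative assumptions and do not reflect the authors' code.

```python
# Loading times quoted in the abstract, in milliseconds.
BASELINE_MS = 22128.31   # total loading time before the optimization
OPTIMIZED_MS = 1041.72   # total loading time after the optimization


def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster the optimized path is (hypothetical helper)."""
    if before_ms <= 0 or after_ms <= 0:
        raise ValueError("times must be positive")
    return before_ms / after_ms


def reduction_percent(before_ms: float, after_ms: float) -> float:
    """Return the relative reduction in time as a percentage (hypothetical helper)."""
    if before_ms <= 0 or after_ms < 0:
        raise ValueError("before_ms must be positive and after_ms non-negative")
    return 100.0 * (1.0 - after_ms / before_ms)


if __name__ == "__main__":
    print(f"speedup:   {speedup(BASELINE_MS, OPTIMIZED_MS):.2f}x")
    print(f"reduction: {reduction_percent(BASELINE_MS, OPTIMIZED_MS):.2f}%")
```

Under these figures, the fast loading method corresponds to a speedup of about 21 times, or a reduction of roughly 95% in total loading time.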