
On-line Access: 2024-11-05

Received: 2024-05-28

Revision Accepted: 2024-08-25


Frontiers of Information Technology & Electronic Engineering 

Accepted manuscript available online (unedited version)


Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator


Author(s):  Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU

Affiliation(s):  State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China

Corresponding email(s):  zhaoyl04@163.com, yyylx@263.net

Key Words:  Transformer inference optimization; Three-tier scheduling; Zero-copy memory management; Fast model loading



Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400453

@article{zhao2024transformer,
title="Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator",
author="Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2400453"
}

%0 Journal Article
%T Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
%A Yulong ZHAO
%A Chunzhi WU
%A Yizhuo WANG
%A Lufei ZHANG
%A Yaguang ZHANG
%A Wenyuan SHEN
%A Hao FAN
%A Hankang FANG
%A Yi QIN
%A Xin LIU
%J Frontiers of Information Technology & Electronic Engineering
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R https://doi.org/10.1631/FITEE.2400453

TY - JOUR
T1 - Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
A1 - Yulong ZHAO
A1 - Chunzhi WU
A1 - Yizhuo WANG
A1 - Lufei ZHANG
A1 - Yaguang ZHANG
A1 - Wenyuan SHEN
A1 - Hao FAN
A1 - Hankang FANG
A1 - Yi QIN
A1 - Xin LIU
JO - Frontiers of Information Technology & Electronic Engineering
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/FITEE.2400453
ER -


Abstract: 
Transformer models have become a cornerstone of natural language processing (NLP) tasks. However, the substantial computational overhead incurred during inference remains a significant challenge that limits their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of Transformer models using the controlling element on AI accelerators. Our work makes four key contributions. First, we conducted a comprehensive analysis of the overhead composition of the Transformer inference process and identified its primary bottlenecks. Second, we leveraged the Management Processing Element (MPE) of the Shenwei Artificial Intelligence (SWAI) accelerator to implement a three-tier scheduling framework that cuts the number of host-device launches to roughly 1/10,000 of that of the original PyTorch-GPU setup. Third, we introduced a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we developed a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22,128.31 ms to 1,041.72 ms. Together, these contributions enable more efficient and faster Transformer inference on AI accelerators.
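The abstract's launch-count and loading-time claims correspond to two recognizable optimization patterns, sketched below in PyTorch. The paper's MPE-based three-tier scheduler on the SWAI hardware is not public; as a rough analogue of collapsing thousands of per-operator host-device launches into a single device-driven replay, this first sketch uses CUDA Graph capture (a standard PyTorch API), with an illustrative 6-layer encoder standing in for the real model.

# Hedged sketch: amortizing per-operator launch overhead via graph capture.
# This is an analogy to the MPE scheduling idea, not the SWAI implementation.
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
).to(device).eval()

# Graph capture requires fixed shapes and addresses, so inference reuses
# these static buffers.
static_in = torch.randn(8, 128, 512, device=device)

# Warm up on a side stream so capture sees steady-state memory allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the entire forward pass once: every per-operator kernel launch is
# recorded into a single graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

def infer(x):
    static_in.copy_(x)  # stage the new input into the captured buffer
    graph.replay()      # one host-side call replays the whole network
    return static_out.clone()

Similarly, the fast-loading result (22,128.31 ms down to 1,041.72 ms, about a 21-fold speedup) comes from removing redundant verification and initialization work. A generally available version of the same idea, sketched here under the assumption of a placeholder checkpoint file "model.ckpt", is to build the module on the meta device (so no weights are eagerly initialized) and memory-map the checkpoint instead of deserializing and copying it:

# Hedged sketch: fast, copy-avoiding model loading with stock PyTorch APIs.
import torch
import torch.nn as nn

# Parameters live on the "meta" device: structure only, no initialization
# computation is performed.
with torch.device("meta"):
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6,
    )

# mmap=True pages tensor data in lazily from disk rather than reading and
# copying the whole file up front ("model.ckpt" is a placeholder path).
state = torch.load("model.ckpt", mmap=True, weights_only=True)

# assign=True adopts the mapped tensors directly instead of copying them
# into freshly initialized parameters.
model.load_state_dict(state, assign=True)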
