Affiliation(s):
State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China;
School of Non-Commissioned Officer, Space Engineering University, Beijing 100004, China;
National Supercomputing Center in Wuxi, Wuxi 214000, China;
Zhejiang Lab, Hangzhou 310000, China;
China National Research Centre of Parallel Computer Engineering and Technology, Beijing 100081, China
Abstract: Transformer models have become a cornerstone of various Natural Language Processing (NLP) tasks. However, the substantial computational overhead during inference remains a significant challenge, limiting their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of Transformer models using the controlling element of AI accelerators. Our work is anchored by four key contributions. First, we conducted a comprehensive analysis of the overhead composition of the Transformer inference process, identifying the primary bottlenecks. Second, we leveraged the Management Processing Element (MPE) of the Shenwei Artificial Intelligence (SWAI) accelerator, implementing a three-tier scheduling framework that reduces the number of host-device launches to roughly 1/10,000 of that in the original PyTorch-GPU setup. Third, we introduced a zero-copy memory management technique based on segment-page fusion, which significantly reduced memory access latency and improved overall inference efficiency. Finally, we developed a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22,128.31 milliseconds to 1,041.72 milliseconds. Together, these contributions significantly enhance the optimization of Transformer models, enabling more efficient and faster inference on AI accelerators.
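To make the launch-reduction idea concrete, the following minimal Python sketch shows how the many per-operator host-device launches of a Transformer forward pass can be collapsed into a single replayable unit using PyTorch's CUDA Graph capture on a GPU. This is an illustrative analogue only, not the MPE-based three-tier scheduling framework implemented on the SWAI accelerator; the model, tensor shapes, and buffer names are assumptions made for the example.

# Illustrative sketch only (not the paper's SWAI/MPE scheduler): collapse the
# per-operator host-device launches of a Transformer forward pass into a single
# replayable CUDA graph. Assumes a CUDA-capable PyTorch build; the model and
# shapes below are hypothetical.
import torch

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda().eval()
static_input = torch.randn(8, 32, 512, device="cuda")   # (seq, batch, d_model)

# Warm-up pass so one-time initialization is not captured into the graph.
with torch.no_grad():
    model(static_input)
torch.cuda.synchronize()

# Capture the whole forward pass once; each replay is then issued from the host
# as a single graph launch instead of one launch per operator.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

def infer(batch):
    static_input.copy_(batch)   # reuse the captured input buffer (no reallocation)
    graph.replay()              # one host-side call replays all captured kernels
    return static_output.clone()

Reusing pre-allocated static buffers across replays also echoes, in spirit, the zero-copy memory management described in the abstract, although the paper's segment-page fusion operates at the accelerator's memory-management level rather than in framework-level Python code.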