|
Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025, Vol. 26, No. 4, pp. 605-622
Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
Abstract: Transformer models have become a cornerstone of various natural language processing (NLP) tasks. However, the substantial computational overhead during inference remains a significant challenge, limiting their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of transformer models using the controlling element on artificial intelligence (AI) accelerators. Our work is anchored by four key contributions. First, we conduct a comprehensive analysis of the overhead composition within the transformer inference process, identifying the primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator, implementing a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10 000 of that of the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22 128.31 ms to 1041.72 ms. Together, these contributions enable more efficient and faster transformer inference on AI accelerators.
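The second contribution above amortizes host-device launch overhead by letting the device-side MPE schedule work. The SWAI runtime is not publicly available, so the following is only a minimal sketch of the same launch-amortization principle in stock PyTorch, using CUDA Graphs as a stand-in: the per-operator kernel launches of a transformer layer are recorded once and then replayed with a single host-side launch. All names here are illustrative; this is not the authors' implementation.

import torch

# Stand-in for MPE-side scheduling: CUDA Graphs record the many per-operator
# kernel launches of a forward pass once, then replay them as a single
# host-device launch.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda().eval()
static_input = torch.randn(16, 32, 512, device="cuda")  # (seq, batch, d_model)

# Warm up on a side stream, as the CUDA Graphs documentation recommends, so
# lazy initialization is not captured into the graph.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture: every kernel launched by the forward pass is recorded once.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

def infer(x: torch.Tensor) -> torch.Tensor:
    """One host launch replays the whole recorded graph."""
    static_input.copy_(x)  # graphs replay on fixed buffers, so copy in place
    graph.replay()
    return static_output.clone()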
Key words: Transformer inference optimization; Three-tier scheduling; Zero-copy memory management; Fast model loading
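The third and fourth contributions (zero-copy memory management via segment-page fusion, and fast model loading) live inside the SWAI device runtime, which is not public. As a host-side analogue only, the sketch below shows the same two ideas in stock PyTorch: memory-mapping a checkpoint so weights are paged in on demand rather than read, verified, and copied eagerly, and rebinding parameters directly to the mapped storage instead of copying. The checkpoint path and module are placeholders, not the paper's models.

import time
import torch

model_path = "transformer_checkpoint.pt"  # hypothetical checkpoint path

# Create something to load (stands in for the paper's large models).
torch.save(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).state_dict(),
    model_path,
)

t0 = time.perf_counter()
# mmap=True (PyTorch >= 2.1) maps the file instead of copying it into RAM;
# weights_only=True avoids general unpickling during checkpoint verification.
state = torch.load(model_path, map_location="cpu", mmap=True, weights_only=True)

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
# assign=True rebinds parameters to the mapped storages instead of copying
# into freshly allocated tensors, keeping the host-side load path zero-copy.
model.load_state_dict(state, assign=True)
print(f"load time: {(time.perf_counter() - t0) * 1e3:.2f} ms")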
Wenyuan SHEN3, Hao FAN1, Hankang FANG4, Yi QIN4, Xin LIU5
1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
2 NCO School, Space Engineering University, Beijing 100004, China
3 National Supercomputing Center in Wuxi, Wuxi 214000, China
4 Zhejiang Lab, Hangzhou 310000, China
5 National Research Center of Parallel Computer Engineering and Technology, Beijing 100081, China
DOI: 10.1631/FITEE.2400453
CLC number: TP181
On-line Access: 2025-05-06
Received: 2024-05-28
Revision Accepted: 2024-08-25
Crosschecked: 2025-05-06