Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2025 Vol.26 No.4 P.605-622

Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator

Author(s): Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU
Affiliation(s): 1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
Corresponding email(s): zhaoyl04@163.com, yyylx@263.net
Key Words: Transformer inference optimization, Three-tier scheduling, Zero-copy memory management, Fast model loading


Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(4): 605-622.

@article{Zhao2025,
title="Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator",
author="Yulong ZHAO and Chunzhi WU and Yizhuo WANG and Lufei ZHANG and Yaguang ZHANG and Wenyuan SHEN and Hao FAN and Hankang FANG and Yi QIN and Xin LIU",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="26",
number="4",
pages="605--622",
year="2025",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2400453"
}

%0 Journal Article
%T Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
%A Yulong ZHAO
%A Chunzhi WU
%A Yizhuo WANG
%A Lufei ZHANG
%A Yaguang ZHANG
%A Wenyuan SHEN
%A Hao FAN
%A Hankang FANG
%A Yi QIN
%A Xin LIU
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 4
%P 605-622
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2400453

TY  - JOUR
T1  - Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
A1  - Yulong ZHAO
A1  - Chunzhi WU
A1  - Yizhuo WANG
A1  - Lufei ZHANG
A1  - Yaguang ZHANG
A1  - Wenyuan SHEN
A1  - Hao FAN
A1  - Hankang FANG
A1  - Yi QIN
A1  - Xin LIU
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 26
IS  - 4
SP  - 605
EP  - 622
SN  - 2095-9184
Y1  - 2025
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2400453
ER  -


Abstract: Transformer models have become a cornerstone of natural language processing (NLP). However, the substantial computational overhead of inference remains a significant challenge that limits their deployment in practical applications. In this study, we address this challenge by using the controlling element on artificial intelligence (AI) accelerators to minimize the inference overhead of transformer models. Our work makes four key contributions. First, we conduct a comprehensive analysis of the overhead composition within the transformer inference process and identify the primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator to implement a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10 000 of that in the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22 128.31 ms to 1041.72 ms. Together, these contributions enable more efficient and faster transformer inference on AI accelerators.
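
To put the two quantitative claims in the abstract into more familiar terms, the short Python sketch below converts the reported loading times into a speedup factor and applies the quoted launch ratio to a workload. Only the three constants come from the abstract; the helper names `speedup` and `remaining_launches` and the workload size `example_launches` are hypothetical and serve as illustration only.

```python
# Figures quoted in the abstract (milliseconds).
LOAD_BEFORE_MS = 22128.31
LOAD_AFTER_MS = 1041.72
# Approximate host-device launch ratio quoted in the abstract.
LAUNCH_RATIO = 1 / 10_000


def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is than `before`."""
    if before <= 0 or after <= 0:
        raise ValueError("timings must be positive")
    return before / after


def remaining_launches(original_launches: int, ratio: float = LAUNCH_RATIO) -> int:
    """Estimate the launches left after applying the quoted ratio (rounded)."""
    return round(original_launches * ratio)


load_speedup = speedup(LOAD_BEFORE_MS, LOAD_AFTER_MS)
time_saved_pct = 100 * (1 - LOAD_AFTER_MS / LOAD_BEFORE_MS)

# HYPOTHETICAL workload size, chosen only to illustrate the ratio.
example_launches = 1_000_000

print(f"loading speedup: {load_speedup:.1f}x")
print(f"loading time saved: {time_saved_pct:.1f}%")
print(f"launches: {example_launches} -> {remaining_launches(example_launches)}")
```

Under these figures the loading time drops by roughly a factor of 21, and a workload that would issue a million host-device launches would issue on the order of a hundred.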

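Minimizing Transformer inference overhead using the controlling element of the Shenwei AI accelerator

Yulong ZHAO¹, Chunzhi WU¹,², Yizhuo WANG³, Lufei ZHANG¹, Yaguang ZHANG³,
Wenyuan SHEN³, Hao FAN¹, Hankang FANG⁴, Yi QIN⁴, Xin LIU⁵
¹State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
²Noncommissioned Officer School, Space Engineering University, Beijing 100004, China
³National Supercomputing Center in Wuxi, Wuxi 214000, China
⁴Zhejiang Lab, Hangzhou 310000, China
⁵National Research Center of Parallel Computer Engineering and Technology, Beijing 100081, China
Abstract: Models based on the Transformer architecture have become a cornerstone of natural language processing. However, the large computational overhead of inference remains a major challenge that limits the practical use of these models. This paper uses the controlling element on artificial intelligence (AI) accelerators to minimize the overhead of Transformer inference, and covers four aspects. First, we comprehensively analyze the composition of the overhead in Transformer inference and identify the main bottlenecks. Second, using the management processing element (MPE) of the Shenwei AI (SWAI) accelerator, we implement a three-tier scheduling framework that reduces the number of host-device launches to about one ten-thousandth of that in the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which markedly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computation during model verification and initialization, reducing the total loading time for large models from 22 128.31 ms to 1041.72 ms. This work substantially optimizes Transformer models, making their inference on AI accelerators more efficient and faster.

Key words: Transformer inference optimization; Three-tier scheduling; Zero-copy memory management; Fast model loading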


