CLC number: TP181
On-line Access: 2025-05-06
Received: 2024-05-28
Revision Accepted: 2024-08-25
Crosschecked: 2025-05-06
Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(4): 605-622.
Abstract: Transformer models have become a cornerstone of natural language processing (NLP) tasks. However, the substantial computational overhead incurred during inference remains a significant challenge that limits their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of transformer models using the controlling element on artificial intelligence (AI) accelerators. Our work makes four key contributions. First, we conduct a comprehensive analysis of the overhead composition of the transformer inference process and identify its primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator to implement a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10 000 of that in the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time of large models from 22 128.31 ms to 1041.72 ms. Together, these contributions enable faster and more efficient transformer inference on AI accelerators.
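To illustrate the launch-overhead problem targeted by the second contribution, the sketch below shows the general idea of recording many small device operations once and dispatching them as a unit. It is not the SWAI MPE three-tier scheduler described in the paper; instead, it uses PyTorch's CUDA graph API on the PyTorch-GPU baseline the abstract compares against. The model, tensor shapes, and buffer names are illustrative assumptions.

import torch

# Illustrative analogy only: amortize host-device launch overhead by capturing a
# whole forward pass and replaying it with a single host-side call. This is NOT
# the paper's MPE-based scheduler; the layer and shapes below are hypothetical.
device = torch.device("cuda")
layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).to(device).eval()
static_input = torch.randn(8, 32, 512, device=device)  # (seq_len, batch, d_model)

# Warm up on a side stream so graph capture starts from a quiescent state.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        layer(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture the forward pass: every kernel launched inside becomes part of one graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = layer(static_input)

# At inference time, refill the static input buffer and replay the captured graph.
# One replay() call replaces the many individual kernel launches recorded above.
static_input.copy_(torch.randn(8, 32, 512, device=device))
graph.replay()
print(static_output.shape)  # torch.Size([8, 32, 512])

The same principle, recording a sequence of device operations once and dispatching it as a single unit from the controlling element, is what allows the MPE-side scheduling framework to cut launch counts by roughly four orders of magnitude.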