CLC number: TP181
On-line Access: 2025-05-06
Received: 2024-05-28
Revision Accepted: 2024-08-25
Crosschecked: 2025-05-06
Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400453
Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator
Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN3, Hao FAN1, Hankang FANG4, Yi QIN4, Xin LIU5
1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
2 NCO School, Space Engineering University, Beijing 100004, China
3 National Supercomputing Center in Wuxi, Wuxi 214000, China
4 Zhejiang Lab, Hangzhou 310000, China
5 National Research Center of Parallel Computer Engineering and Technology, Beijing 100081, China
Abstract: Transformer-based models have become a cornerstone of natural language processing. However, the enormous computational overhead of the inference process remains a major challenge that limits the practical deployment of these models. This paper uses the controlling element of an artificial intelligence (AI) accelerator to minimize the overhead of transformer model inference, covering four main aspects. First, we present a comprehensive analysis of the overhead composition of the transformer inference process and identify the main bottlenecks. Second, using the management processing element (MPE) of the Shenwei AI (SWAI) accelerator, we implement a three-level scheduling framework that reduces the number of host-device launches to about one ten-thousandth of that of the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computation during model verification and initialization, reducing the total loading time of large models from 22,128.31 ms to 1041.72 ms. These optimizations make transformer model inference on the AI accelerator significantly more efficient and faster.
Key words:
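The abstract reports that the three-level scheduling framework cuts host-device launches to roughly one ten-thousandth of the PyTorch-GPU baseline. The page shows no code, so the sketch below only illustrates the general idea behind such a reduction, under the assumption that the device-side controller can replay a pre-recorded operator sequence: the host records the whole sequence once and submits it as a single command, instead of issuing one launch per operator as an eager loop would. All names here (DevicePlan, record, submit) are invented for illustration and are not the SWAI runtime API described in the paper.

```python
# Hypothetical sketch: collapsing per-operator launches into one submitted plan.
# This mimics the launch-counting effect only; it is not the paper's implementation.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DevicePlan:
    """A recorded sequence of operators handed to a device-side controller."""
    ops: List[Callable[[], None]] = field(default_factory=list)

    def record(self, op: Callable[[], None]) -> None:
        self.ops.append(op)          # host-side recording, no device launch yet

    def submit(self) -> int:
        for op in self.ops:          # whole sequence runs under one launch
            op()
        return 1                     # one host-device launch for the entire plan

def eager_inference(ops: List[Callable[[], None]]) -> int:
    launches = 0
    for op in ops:                   # one launch per operator: high overhead
        op()
        launches += 1
    return launches

def planned_inference(ops: List[Callable[[], None]]) -> int:
    plan = DevicePlan()
    for op in ops:
        plan.record(op)
    return plan.submit()             # launch count collapses to 1

if __name__ == "__main__":
    dummy_ops = [lambda: None] * 96  # e.g., operators of a transformer layer stack
    print("eager launches:  ", eager_inference(dummy_ops))    # 96
    print("planned launches:", planned_inference(dummy_ops))  # 1
```

In practice the recorded plan would be executed by the accelerator's controlling element (the MPE in the paper's terminology), so per-operator launch cost is paid once per inference step rather than once per operator.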