
Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator

Abstract: Transformer models have become a cornerstone of natural language processing (NLP). However, their substantial computational overhead during inference remains a significant challenge and limits their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of transformer models using the controlling element of artificial intelligence (AI) accelerators. Our work makes four key contributions. First, we present a comprehensive analysis of the overhead composition of the transformer inference process and identify the primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator to implement a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10 000 of that of the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time of large models from 22 128.31 ms to 1041.72 ms. Together, these contributions enable more efficient and faster transformer inference on AI accelerators.
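
The SWAI MPE and its three-tier scheduler are proprietary, so no public API reproduces them directly. As an analogy only, the PyTorch sketch below illustrates the general principle behind cutting host-device launches: capture every kernel of one transformer step once, then replay the whole step with a single host-side launch. CUDA Graphs stand in here for the MPE-side dispatch; the model shape and batch size are arbitrary illustrative choices, not the paper's configuration.

import torch

# Illustrative stand-in model and static input; shapes are arbitrary.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda().eval()
static_input = torch.randn(32, 16, 512, device="cuda")

# Warm up on a side stream so graph capture sees steady-state kernels.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture: every kernel in one inference step is recorded once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: a single host-side launch re-runs the whole step, instead of
# one launch per kernel -- the overhead class the paper attacks.
static_input.copy_(torch.randn_like(static_input))
graph.replay()  # results land in static_output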

Key words: Transformer inference optimization; Three-tier scheduling; Zero-copy memory management; Fast model loading
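
The segment-page fusion allocator and the fast loader likewise live inside the SWAI runtime and are not publicly available. The sketch below is a minimal, hypothetical illustration of the zero-copy and fast-loading ideas in plain Python: checkpoint pages are mapped straight into the process address space with no read()-and-memcpy staging, and no re-verification pass is run. The flat tensor-blob layout and the helper name load_tensor_zero_copy are assumptions for illustration, not the paper's checkpoint format.

import mmap

import numpy as np
import torch

def load_tensor_zero_copy(path, offset, shape, dtype=np.float16):
    # Map the whole file; pages fault in lazily on first touch, so weights
    # that are never read add no load time and are never copied.
    with open(path, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    count = int(np.prod(shape))
    arr = np.frombuffer(buf, dtype=dtype, count=count, offset=offset)
    # Wrap the mapped pages without a memcpy. PyTorch warns that the
    # mapping is read-only; inference never writes weights, so that is
    # benign here. arr keeps the mmap alive via its .base reference.
    return torch.from_numpy(arr.reshape(shape))

A loader built along these lines avoids both redundant verification work and per-tensor copies, the same overhead class behind the reported drop from 22 128.31 ms to 1041.72 ms.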

Authors: Yulong ZHAO1, Chunzhi WU1,2, Yizhuo WANG3, Lufei ZHANG1, Yaguang ZHANG3, Wenyuan SHEN3, Hao FAN1, Hankang FANG4, Yi QIN4, Xin LIU5

Affiliations:
1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
2. NCO School, Space Engineering University, Beijing 100004, China
3. National Supercomputing Center in Wuxi, Wuxi 214000, China
4. Zhejiang Lab, Hangzhou 310000, China
5. National Research Center of Parallel Computer Engineering and Technology, Beijing 100081, China






DOI: 10.1631/FITEE.2400453
CLC number: TP181


Downloaded: 1248
Clicked: 876
Cited: 0
On-line Access: 2025-05-06
Received: 2024-05-28
Revision Accepted: 2024-08-25
Crosschecked: 2025-05-06
