|
|
Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025 Vol.26 No.10 P.1879-1895
An adaptive outlier correction quantization method for vision Transformers
Abstract: Transformers have demonstrated considerable success across various domains but are constrained by their significant computational and memory requirements, which poses challenges for deployment on resource-constrained devices. Quantization, an effective model compression method, can significantly reduce the runtime of Transformers on edge devices. Notably, Transformer activations exhibit more substantial outliers than those of convolutional neural networks (CNNs), leading to uneven feature distributions across channels and tokens. To address this issue, we propose an adaptive outlier correction quantization (AOCQ) method for Transformers, which significantly alleviates the adverse effects of these outliers. AOCQ adjusts the notable discrepancies across channels and tokens at three levels: the operator level, the framework level, and the loss level. We introduce a new operator that equivalently balances the activations across different channels, and insert an extra stage at the framework level to optimize the activation quantization step. Additionally, at the loss level, we transfer the imbalanced activations across tokens and channels to the optimization of the model weights. Theoretical analysis shows that our method reduces quantization error. The effectiveness of the proposed method is verified on various benchmark models and tasks. Surprisingly, DeiT-Base with 8-bit post-training quantization (PTQ) achieves 81.57% accuracy, a drop of only 0.28 percentage points, while enjoying 4× faster runtime. Furthermore, the weights of Swin and DeiT on several tasks, including classification and object detection, can be post-quantized to ultra-low 4 bits with a minimal accuracy loss of 2%, while requiring nearly 8× less memory.
Key words: Transformer; model compression and acceleration; post-training quantization; outlier
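The operator-level balancing described in the abstract can be illustrated with a small sketch. The Python snippet below is illustrative only and is not the authors' AOCQ implementation: it shows the general idea of migrating channel-wise activation outliers into the weights through an equivalent per-channel rescaling before uniform quantization, and it mimics the extra step-size optimization stage with a simple grid search. The balancing rule for s is a SmoothQuant-style heuristic assumed for this example, and all names are hypothetical.

import numpy as np

def quantize(x, n_bits=8, n_grid=100):
    """Uniform symmetric quantizer. The step size is chosen by a small grid
    search minimizing mean squared error, loosely mirroring the abstract's
    extra stage for optimizing the activation quantization step."""
    qmax = 2 ** (n_bits - 1) - 1
    best_xq, best_err = x, np.inf
    for r in np.linspace(0.5, 1.0, n_grid):
        step = r * np.abs(x).max() / qmax
        xq = np.clip(np.round(x / step), -qmax - 1, qmax) * step
        err = ((xq - x) ** 2).mean()
        if err < best_err:
            best_xq, best_err = xq, err
    return best_xq

rng = np.random.default_rng(0)
# Toy activations X (tokens x channels) with a few outlier channels, and a
# weight matrix W (channels x out_features), standing in for one linear layer.
X = rng.normal(size=(128, 64))
X[:, :4] *= 30.0                       # inject channel-wise outliers
W = rng.normal(size=(64, 64)) * 0.02

# Equivalent per-channel balancing: (X / s) @ (s[:, None] * W) == X @ W.
# Scaling each activation channel down and the matching weight row up leaves
# the layer's output unchanged while evening out the ranges to be quantized.
# NOTE: this balancing rule is a SmoothQuant-style heuristic, not the paper's
# AOCQ operator.
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=1))
Xb, Wb = X / s, W * s[:, None]

ref = X @ W                            # full-precision reference output
err_plain    = np.abs(quantize(X) @ quantize(W) - ref).mean()
err_balanced = np.abs(quantize(Xb) @ quantize(Wb) - ref).mean()
print(f"mean |error| without balancing: {err_plain:.4f}")
print(f"mean |error| with balancing:    {err_balanced:.4f}")

Running the sketch shows a markedly smaller quantization error for the balanced pair, since the rescaling is lossless in full precision but spreads the dynamic range more evenly over the values the quantizer must cover.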
1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
2 Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou 310051, China
DOI: 10.1631/FITEE.2400994
CLC number: TP391
On-line Access: 2025-11-17
Received: 2024-11-11
Revision Accepted: 2025-11-18
Crosschecked: 2025-07-08