
CLC number: TP391
On-line Access: 2025-11-17
Received: 2024-11-11
Revision Accepted: 2025-11-18
Crosschecked: 2025-07-08
Zheyang LI, Chaoxiang LAN, Kai ZHANG, Wenming TAN, Ye REN, Jun XIAO. An adaptive outlier correction quantization method for vision Transformers[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(10): 1879-1895.
@article{Li2025AOCQ,
title="An adaptive outlier correction quantization method for vision Transformers",
author="Zheyang LI and Chaoxiang LAN and Kai ZHANG and Wenming TAN and Ye REN and Jun XIAO",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="10",
pages="1879-1895",
year="2025",
publisher="Zhejiang University Press & Springer",
issn="2095-9184",
doi="10.1631/FITEE.2400994"
}
Abstract: Transformers have demonstrated considerable success across various domains but are constrained by their significant computational and memory requirements, which poses challenges for deployment on resource-constrained devices. Quantization, as an effective model compression method, can significantly reduce the running time of Transformers on edge devices. Notably, Transformers exhibit more substantial outliers than convolutional neural networks, leading to uneven feature distributions across channels and tokens. To address this issue, we propose an adaptive outlier correction quantization (AOCQ) method for Transformers, which significantly alleviates the adverse effects of these outliers. AOCQ adjusts the notable discrepancies across channels and tokens at three levels: the operator level, the framework level, and the loss level. We introduce a new operator that equivalently balances the activations across different channels, and insert an extra stage to optimize the activation quantization step at the framework level. Additionally, we transfer the imbalance of activations across tokens and channels to the optimization of the model weights at the loss level. Theoretical analysis shows that our method reduces the quantization error. The effectiveness of the proposed method is verified on various benchmark models and tasks. Remarkably, DeiT-Base with 8-bit post-training quantization (PTQ) achieves 81.57% accuracy, a drop of only 0.28 percentage points, while running 4× faster. Furthermore, the weights of Swin and DeiT on several tasks, including classification and object detection, can be post-quantized to ultra-low 4 bits with a minimal accuracy loss of 2%, while requiring nearly 8× less memory.
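The operator-level and framework-level ideas summarized above can be illustrated with a small, self-contained example. The NumPy sketch below is only a rough illustration under assumed details, not the authors' implementation: function names such as balance_channels and search_step are hypothetical, and the scaling rule and grid search are simple stand-ins for whatever AOCQ actually uses. It shows (1) how per-channel rescaling of activations can be folded into the following layer's weights so the computation stays mathematically equivalent while outlier channels are flattened before quantization, and (2) how an activation quantization step can be chosen by minimizing the mean-squared quantization error.

import numpy as np

def quantize(x, step, num_bits=8):
    """Symmetric uniform quantization of x with a fixed step size."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / step), -qmax - 1, qmax) * step

def balance_channels(act, weight):
    """Rescale activation channels and fold the inverse scale into the next layer's weights.

    act:    (tokens, channels) activations feeding a linear layer
    weight: (channels, out_features) weights of that linear layer
    Returns (balanced activations, compensated weights); act @ weight is unchanged.
    """
    ch_max = np.abs(act).max(axis=0) + 1e-8          # per-channel magnitude
    s = ch_max / ch_max.mean()                       # flatten channels to a common range
    return act / s, weight * s[:, None]

def search_step(x, num_bits=8, grid=80):
    """Grid-search the step size that minimizes the MSE of quantizing x."""
    qmax = 2 ** (num_bits - 1) - 1
    candidates = np.linspace(0.3, 1.0, grid) * np.abs(x).max() / qmax
    errors = [np.mean((x - quantize(x, s, num_bits)) ** 2) for s in candidates]
    return candidates[int(np.argmin(errors))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    act = rng.normal(size=(197, 384))                # 197 tokens, 384 channels (ViT-like)
    act[:, :4] *= 30.0                               # a few synthetic outlier channels
    weight = rng.normal(size=(384, 384))

    ref = act @ weight                               # full-precision reference output
    bal_act, bal_w = balance_channels(act, weight)
    assert np.allclose(bal_act @ bal_w, ref)         # balancing is an equivalent transform

    for name, a, w in [("plain", act, weight), ("balanced", bal_act, bal_w)]:
        step = search_step(a)
        mse = np.mean((quantize(a, step) @ w - ref) ** 2)
        print(f"{name:9s} 8-bit activation quantization, output MSE: {mse:.4f}")

On this synthetic example, the balanced variant yields a markedly smaller output error at the same bit width, which is the intuition behind correcting channel-wise outliers before quantizing Transformer activations.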