
CLC number: TP391
On-line Access: 2025-11-17
Received: 2024-11-11
Revision Accepted: 2025-11-18
Crosschecked: 2025-07-08
Zheyang LI, Chaoxiang LAN, Kai ZHANG, Wenming TAN, Ye REN, Jun XIAO. An adaptive outlier correction quantization method for vision Transformers[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(10): 1879-1895.
@article{Li2025AOCQ,
title="An adaptive outlier correction quantization method for vision Transformers",
author="Zheyang LI and Chaoxiang LAN and Kai ZHANG and Wenming TAN and Ye REN and Jun XIAO",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="10",
pages="1879-1895",
year="2025",
publisher="Zhejiang University Press & Springer",
issn="2095-9184",
doi="10.1631/FITEE.2400994"
}
Abstract: Transformers have demonstrated considerable success across various domains but are constrained by their significant computational and memory requirements, which poses challenges for deployment on resource-constrained devices. Quantization, as an effective model compression method, can significantly reduce the running time of Transformers on edge devices. Notably, Transformers exhibit more substantial outliers than convolutional neural networks, leading to uneven feature distributions across channels and tokens. To address this issue, we propose an adaptive outlier correction quantization (AOCQ) method for Transformers, which significantly alleviates the adverse effects of these outliers. AOCQ adjusts the notable discrepancies across channels and tokens at three levels: the operator level, the framework level, and the loss level. We introduce a new operator that equivalently balances the activations across different channels, and insert an extra stage to optimize the activation quantization step at the framework level. Additionally, we transfer the imbalance of activations across tokens and channels to the optimization of the model weights at the loss level. Theoretical analysis shows that our method reduces the quantization error. The effectiveness of the proposed method is verified on various benchmark models and tasks. Remarkably, DeiT-Base with 8-bit post-training quantization (PTQ) achieves 81.57% accuracy, a drop of only 0.28 percentage points, while running 4× faster. Furthermore, the weights of Swin and DeiT on several tasks, including classification and object detection, can be post-quantized to ultra-low 4 bits with a minimal accuracy loss of 2%, while requiring nearly 8× less memory.
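The operator-level and framework-level ideas summarized above can be illustrated with a small, self-contained example. The NumPy sketch below is only a rough illustration under assumed details, not the authors' implementation: function names such as balance_channels and search_step are hypothetical, and the scaling rule and grid search are simple stand-ins for whatever AOCQ actually uses. It shows (1) how per-channel rescaling of activations can be folded into the following layer's weights so the computation stays mathematically equivalent while outlier channels are flattened before quantization, and (2) how an activation quantization step can be chosen by minimizing the mean-squared quantization error.

import numpy as np

def quantize(x, step, num_bits=8):
    """Symmetric uniform quantization of x with a fixed step size."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / step), -qmax - 1, qmax) * step

def balance_channels(act, weight):
    """Rescale activation channels and fold the inverse scale into the next layer's weights.

    act:    (tokens, channels) activations feeding a linear layer
    weight: (channels, out_features) weights of that linear layer
    Returns (balanced activations, compensated weights); act @ weight is unchanged.
    """
    ch_max = np.abs(act).max(axis=0) + 1e-8          # per-channel magnitude
    s = ch_max / ch_max.mean()                       # flatten channels to a common range
    return act / s, weight * s[:, None]

def search_step(x, num_bits=8, grid=80):
    """Grid-search the step size that minimizes the MSE of quantizing x."""
    qmax = 2 ** (num_bits - 1) - 1
    candidates = np.linspace(0.3, 1.0, grid) * np.abs(x).max() / qmax
    errors = [np.mean((x - quantize(x, s, num_bits)) ** 2) for s in candidates]
    return candidates[int(np.argmin(errors))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    act = rng.normal(size=(197, 384))                # 197 tokens, 384 channels (ViT-like)
    act[:, :4] *= 30.0                               # a few synthetic outlier channels
    weight = rng.normal(size=(384, 384))

    ref = act @ weight                               # full-precision reference output
    bal_act, bal_w = balance_channels(act, weight)
    assert np.allclose(bal_act @ bal_w, ref)         # balancing is an equivalent transform

    for name, a, w in [("plain", act, weight), ("balanced", bal_act, bal_w)]:
        step = search_step(a)
        mse = np.mean((quantize(a, step) @ w - ref) ** 2)
        print(f"{name:9s} 8-bit activation quantization, output MSE: {mse:.4f}")

On this synthetic example, the balanced variant yields a markedly smaller output error at the same bit width, which is the intuition behind correcting channel-wise outliers before quantizing Transformer activations.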