CLC number: TP391
On-line Access: 2025-07-02
Received: 2024-05-12
Revision Accepted: 2025-07-02
Crosschecked: 2024-09-18
Deng LI, Peng LI, Aming WU, Yahong HAN. Prototype-guided cross-task knowledge distillation[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(6): 912-929.
@article{li2025prototype,
title="Prototype-guided cross-task knowledge distillation",
author="Deng LI and Peng LI and Aming WU and Yahong HAN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="6",
pages="912-929",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400383"
}
%0 Journal Article
%T Prototype-guided cross-task knowledge distillation
%A Deng LI
%A Peng LI
%A Aming WU
%A Yahong HAN
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 6
%P 912-929
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400383
TY - JOUR
T1 - Prototype-guided cross-task knowledge distillation
A1 - Deng LI
A1 - Peng LI
A1 - Aming WU
A1 - Yahong HAN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 6
SP - 912
EP - 929
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400383
ER -
Abstract: Recently, large-scale pretrained models have demonstrated their benefits on a variety of tasks. However, owing to their enormous computational complexity and storage demands, it is challenging to deploy large-scale models in real-world scenarios. Most existing knowledge distillation methods require the teacher model and the student model to share the same label space, which restricts their applicability in practice. To alleviate the constraint of different label spaces, we propose a prototype-guided cross-task knowledge distillation (ProC-KD) method that transfers the intrinsic local-level object knowledge of the teacher network to various task scenarios. First, to better learn generalized knowledge in cross-task scenarios, we present a prototype learning module that learns invariant intrinsic local representations of objects from the teacher network. Second, for diverse downstream tasks, a task-adaptive feature augmentation module is proposed to enhance the student network features with the learned generalized prototype representations and to guide the learning of the student network, thereby improving its generalization ability. Experimental results on various visual tasks demonstrate the effectiveness of our approach in cross-task knowledge distillation scenarios.
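To make the two components described in the abstract more concrete, below is a minimal PyTorch sketch of the general idea: local prototypes are accumulated from teacher feature maps with an exponential moving average, and the student's feature map is augmented by attending over the prototype bank before a feature-level distillation loss is applied. All names (PrototypeBank, augment, distillation_step), the EMA update rule, the attention-based augmentation, and the MSE distillation loss are illustrative assumptions rather than the authors' implementation; the teacher and student are also assumed to share the same feature dimensionality for simplicity.

```python
# Illustrative sketch only: not the ProC-KD implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeBank(nn.Module):
    """Maintains K local prototypes updated from teacher feature maps (EMA)."""

    def __init__(self, num_prototypes: int, dim: int, momentum: float = 0.99):
        super().__init__()
        self.momentum = momentum
        # Prototypes are buffers: updated from data, not by gradients.
        self.register_buffer("prototypes", torch.randn(num_prototypes, dim))

    @torch.no_grad()
    def update(self, teacher_feat: torch.Tensor) -> None:
        # teacher_feat: (B, C, H, W) -> (B*H*W, C) local descriptors
        b, c, h, w = teacher_feat.shape
        locals_ = teacher_feat.permute(0, 2, 3, 1).reshape(-1, c)
        # Assign each local descriptor to its nearest prototype (cosine similarity).
        sim = F.normalize(locals_, dim=1) @ F.normalize(self.prototypes, dim=1).t()
        assign = sim.argmax(dim=1)
        for k in range(self.prototypes.size(0)):
            mask = assign == k
            if mask.any():
                mean_k = locals_[mask].mean(dim=0)
                self.prototypes[k] = (
                    self.momentum * self.prototypes[k] + (1 - self.momentum) * mean_k
                )

    def augment(self, student_feat: torch.Tensor) -> torch.Tensor:
        # Enrich student features with prototype information: each spatial
        # location attends over the prototype bank, and the attended context
        # is added back as a residual.
        b, c, h, w = student_feat.shape
        q = student_feat.permute(0, 2, 3, 1).reshape(b, h * w, c)   # queries
        kv = self.prototypes.unsqueeze(0).expand(b, -1, -1)         # keys/values
        attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)
        ctx = (attn @ kv).transpose(1, 2).reshape(b, c, h, w)
        return student_feat + ctx


def distillation_step(teacher_feat, student_feat, bank: PrototypeBank):
    """One feature-distillation step guided by the prototype bank."""
    bank.update(teacher_feat)                    # refresh prototypes (no grad)
    student_aug = bank.augment(student_feat)     # task-adaptive augmentation
    return F.mse_loss(student_aug, teacher_feat.detach())


if __name__ == "__main__":
    bank = PrototypeBank(num_prototypes=16, dim=64)
    t_feat = torch.randn(2, 64, 8, 8)   # stand-in teacher feature map
    s_feat = torch.randn(2, 64, 8, 8)   # stand-in student feature map
    print(distillation_step(t_feat, s_feat, bank).item())
```

In a cross-task setting, the prototype bank would be learned once from the teacher and then reused to augment student features on downstream tasks with different label spaces; the sketch above collapses this into a single step purely for illustration.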