CLC number: TP391
On-line Access: 2025-07-02
Received: 2024-05-12
Revision Accepted: 2025-07-02
Crosschecked: 2024-09-18
Deng LI, Peng LI, Aming WU, Yahong HAN. Prototype-guided cross-task knowledge distillation[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(6): 912-929.
@article{li2025prototype,
title="Prototype-guided cross-task knowledge distillation",
author="Deng LI and Peng LI and Aming WU and Yahong HAN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="6",
pages="912-929",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400383"
}
%0 Journal Article
%T Prototype-guided cross-task knowledge distillation
%A Deng LI
%A Peng LI
%A Aming WU
%A Yahong HAN
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 6
%P 912-929
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400383
TY - JOUR
T1 - Prototype-guided cross-task knowledge distillation
A1 - Deng LI
A1 - Peng LI
A1 - Aming WU
A1 - Yahong HAN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 6
SP - 912
EP - 929
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400383
ER -
Abstract: Recently, large-scale pretrained models have demonstrated their benefits on a variety of tasks. However, owing to their enormous computational complexity and storage demands, it is challenging to deploy large-scale models in real-world scenarios. Most existing knowledge distillation methods require the teacher model and the student model to share the same label space, which restricts their applicability in practice. To alleviate the constraint of different label spaces, we propose a prototype-guided cross-task knowledge distillation (ProC-KD) method that transfers the intrinsic local-level object knowledge of the teacher network to various task scenarios. First, to better learn generalized knowledge in cross-task scenarios, we present a prototype learning module that learns invariant intrinsic local representations of objects from the teacher network. Second, for diverse downstream tasks, a task-adaptive feature augmentation module is proposed to enhance the student network features with the learned generalized prototype representations and to guide the learning of the student network, thereby improving its generalization ability. Experimental results on various visual tasks demonstrate the effectiveness of our approach in cross-task knowledge distillation scenarios.
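To make the two components described in the abstract more concrete, below is a minimal PyTorch sketch of the general idea: local prototypes are accumulated from teacher feature maps with an exponential moving average, and the student's feature map is augmented by attending over the prototype bank before a feature-level distillation loss is applied. All names (PrototypeBank, augment, distillation_step), the EMA update rule, the attention-based augmentation, and the MSE distillation loss are illustrative assumptions rather than the authors' implementation; the teacher and student are also assumed to share the same feature dimensionality for simplicity.

```python
# Illustrative sketch only: not the ProC-KD implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeBank(nn.Module):
    """Maintains K local prototypes updated from teacher feature maps (EMA)."""

    def __init__(self, num_prototypes: int, dim: int, momentum: float = 0.99):
        super().__init__()
        self.momentum = momentum
        # Prototypes are buffers: updated from data, not by gradients.
        self.register_buffer("prototypes", torch.randn(num_prototypes, dim))

    @torch.no_grad()
    def update(self, teacher_feat: torch.Tensor) -> None:
        # teacher_feat: (B, C, H, W) -> (B*H*W, C) local descriptors
        b, c, h, w = teacher_feat.shape
        locals_ = teacher_feat.permute(0, 2, 3, 1).reshape(-1, c)
        # Assign each local descriptor to its nearest prototype (cosine similarity).
        sim = F.normalize(locals_, dim=1) @ F.normalize(self.prototypes, dim=1).t()
        assign = sim.argmax(dim=1)
        for k in range(self.prototypes.size(0)):
            mask = assign == k
            if mask.any():
                mean_k = locals_[mask].mean(dim=0)
                self.prototypes[k] = (
                    self.momentum * self.prototypes[k] + (1 - self.momentum) * mean_k
                )

    def augment(self, student_feat: torch.Tensor) -> torch.Tensor:
        # Enrich student features with prototype information: each spatial
        # location attends over the prototype bank, and the attended context
        # is added back as a residual.
        b, c, h, w = student_feat.shape
        q = student_feat.permute(0, 2, 3, 1).reshape(b, h * w, c)   # queries
        kv = self.prototypes.unsqueeze(0).expand(b, -1, -1)         # keys/values
        attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)
        ctx = (attn @ kv).transpose(1, 2).reshape(b, c, h, w)
        return student_feat + ctx


def distillation_step(teacher_feat, student_feat, bank: PrototypeBank):
    """One feature-distillation step guided by the prototype bank."""
    bank.update(teacher_feat)                    # refresh prototypes (no grad)
    student_aug = bank.augment(student_feat)     # task-adaptive augmentation
    return F.mse_loss(student_aug, teacher_feat.detach())


if __name__ == "__main__":
    bank = PrototypeBank(num_prototypes=16, dim=64)
    t_feat = torch.randn(2, 64, 8, 8)   # stand-in teacher feature map
    s_feat = torch.randn(2, 64, 8, 8)   # stand-in student feature map
    print(distillation_step(t_feat, s_feat, bank).item())
```

In a cross-task setting, the prototype bank would be learned once from the teacher and then reused to augment student features on downstream tasks with different label spaces; the sketch above collapses this into a single step purely for illustration.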