CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-07-04
Wujie SUN, Defang CHEN, Can WANG, Deshi YE, Yan FENG, Chun CHEN. Multi-exit self-distillation with appropriate teachers[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(4): 585-599.
@article{Sun2024MultiExit,
title="Multi-exit self-distillation with appropriate teachers",
author="Wujie SUN and Defang CHEN and Can WANG and Deshi YE and Yan FENG and Chun CHEN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="25",
number="4",
pages="585-599",
year="2024",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200644"
}
%0 Journal Article
%T Multi-exit self-distillation with appropriate teachers
%A Wujie SUN
%A Defang CHEN
%A Can WANG
%A Deshi YE
%A Yan FENG
%A Chun CHEN
%J Frontiers of Information Technology & Electronic Engineering
%V 25
%N 4
%P 585-599
%@ 2095-9184
%D 2024
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2200644
TY - JOUR
T1 - Multi-exit self-distillation with appropriate teachers
A1 - Wujie SUN
A1 - Defang CHEN
A1 - Can WANG
A1 - Deshi YE
A1 - Yan FENG
A1 - Chun CHEN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 25
IS - 4
SP - 585
EP - 599
SN - 2095-9184
Y1 - 2024
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2200644
ER -
Abstract: A multi-exit architecture allows early-stop inference to reduce computational cost, making it suitable for resource-constrained circumstances. Recent works combine the multi-exit architecture with self-distillation to simultaneously achieve high efficiency and decent performance at different network depths. However, existing methods mainly transfer knowledge from deep exits or a single ensemble to guide all exits, without considering that inappropriate learning gaps between students and teachers may degrade model performance, especially at shallow exits. To address this issue, we propose Multi-exit self-distillation with Appropriate TEachers (MATE), which provides diverse and appropriate teacher knowledge for each exit. In MATE, multiple ensemble teachers are obtained from all exits with different trainable weights. Each exit then receives knowledge from all teachers, while focusing mainly on its primary teacher to keep an appropriate gap for efficient knowledge transfer. In this way, MATE achieves diversity in knowledge distillation while ensuring learning efficiency. Experimental results on CIFAR-100, TinyImageNet, and three fine-grained datasets demonstrate that MATE consistently outperforms state-of-the-art multi-exit self-distillation methods with various network architectures.
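The abstract describes the method only at a high level; the sketch below illustrates one plausible way to wire up trainable ensemble teachers over all exits and exit-wise distillation with a favored primary teacher in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the module and function names (WeightedEnsembleTeachers, mate_style_loss), the softmax mixing of exit logits, and the hyper-parameters temperature and primary_weight are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedEnsembleTeachers(nn.Module):
    """Forms one ensemble teacher per exit as a trainable, convex combination
    of the logits from all exits (hypothetical module, not the paper's code)."""

    def __init__(self, num_exits: int):
        super().__init__()
        # One trainable weight vector per ensemble teacher, over all exits.
        self.mix_logits = nn.Parameter(torch.zeros(num_exits, num_exits))

    def forward(self, exit_logits):
        stacked = torch.stack(exit_logits, dim=0)      # (E, B, C)
        weights = F.softmax(self.mix_logits, dim=1)    # (E, E), each row sums to 1
        # Teacher t is a weighted combination of the logits from all E exits.
        teachers = torch.einsum("te,ebc->tbc", weights, stacked)
        return [teachers[t] for t in range(teachers.size(0))]


def mate_style_loss(exit_logits, teacher_logits, labels,
                    temperature=4.0, primary_weight=0.7):
    """Cross-entropy for every exit and teacher, plus KL distillation from all
    teachers to each exit, with the same-index ("primary") teacher weighted most.
    `temperature` and `primary_weight` are assumed hyper-parameters."""
    num_exits = len(exit_logits)
    loss = 0.0
    # Supervise the ensemble teachers directly so their mixing weights are trained.
    for teacher in teacher_logits:
        loss = loss + F.cross_entropy(teacher, labels)
    for i, student in enumerate(exit_logits):
        loss = loss + F.cross_entropy(student, labels)
        for t, teacher in enumerate(teacher_logits):
            kd = F.kl_div(
                F.log_softmax(student / temperature, dim=1),
                F.softmax(teacher.detach() / temperature, dim=1),
                reduction="batchmean",
            ) * temperature ** 2
            # The primary teacher dominates; the rest share the remaining weight.
            scale = primary_weight if t == i else (1.0 - primary_weight) / max(num_exits - 1, 1)
            loss = loss + scale * kd
    return loss / num_exits
```

In a training loop, exit_logits would be the list of logits produced by each exit of a multi-exit backbone for one mini-batch, and the ensemble module's mixing weights would be optimized jointly with the network; the exact weighting scheme and loss composition in the paper may differ from this sketch.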