CLC number: TP181
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-07-04
Wujie SUN, Defang CHEN, Can WANG, Deshi YE, Yan FENG, Chun CHEN. Multi-exit self-distillation with appropriate teachers[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2200644
Multi-exit self-distillation with appropriate teachers
College of Computer Science and Technology, Zhejiang University, Hangzhou 310000, China
Abstract: Multi-exit architectures allow early-exit inference to reduce computational cost, which makes them usable under resource-constrained conditions. Recent studies have combined multi-exit architectures with self-distillation to simultaneously achieve high efficiency and strong performance at different network depths. However, existing methods mainly transfer knowledge from deep exits or a single ensemble to guide all exits, without considering that an inappropriate learning gap between students and teachers may degrade model performance, especially for shallow exits. To address this issue, we propose multi-exit self-distillation with appropriate teachers, which provides diverse and appropriate teacher knowledge for each exit. In our method, multiple ensemble teachers are obtained from all exits according to different trainable ensemble weights. Each exit receives knowledge from all teachers while focusing on its corresponding primary teacher, so as to maintain an appropriate learning gap and achieve efficient knowledge transfer. In this way, our method achieves diverse knowledge distillation while ensuring learning efficiency. Experimental results on CIFAR-100, TinyImageNet, and three fine-grained datasets show that our method consistently outperforms state-of-the-art multi-exit self-distillation methods across various network architectures.
Key words:
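The abstract only outlines the training scheme at a high level. The following is a minimal PyTorch-style sketch of that idea, assuming a standard softened-KL distillation loss and softmax-normalized trainable ensemble weights; the class name MultiExitDistiller, the primary/other weight split, the temperature value, and the choice to train the ensemble weights with a cross-entropy term on each teacher are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch: multi-exit self-distillation with multiple ensemble teachers.
# Each exit is distilled from all teachers, with extra weight on its own
# primary teacher; teachers are trainable-weight ensembles of all exit logits.
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiExitDistiller(nn.Module):
    """Cross-entropy on every exit plus KD from multiple ensemble teachers (illustrative)."""

    def __init__(self, num_exits: int, temperature: float = 4.0,
                 primary_weight: float = 0.7):
        super().__init__()
        self.num_exits = num_exits
        self.temperature = temperature
        self.primary_weight = primary_weight
        # One row of trainable ensemble weights per teacher (one teacher per exit).
        self.ensemble_weights = nn.Parameter(torch.zeros(num_exits, num_exits))

    def build_teachers(self, exit_logits: List[torch.Tensor]) -> torch.Tensor:
        # Stack exit logits to (num_exits, batch, classes); detach so that the
        # teacher branch only updates the ensemble weights (an assumption here).
        stacked = torch.stack(exit_logits, dim=0).detach()
        w = F.softmax(self.ensemble_weights, dim=1)  # (teachers, exits)
        # Each teacher is a weighted combination of all exit logits.
        return torch.einsum('te,ebc->tbc', w, stacked)

    def forward(self, exit_logits: List[torch.Tensor], labels: torch.Tensor) -> torch.Tensor:
        T = self.temperature
        teachers = self.build_teachers(exit_logits)
        # Hard-label supervision for every exit.
        loss = sum(F.cross_entropy(z, labels) for z in exit_logits)
        # Supervise each teacher with the labels so the ensemble weights are trained
        # (assumed mechanism; the paper's actual objective may differ).
        loss = loss + sum(F.cross_entropy(teachers[t], labels)
                          for t in range(self.num_exits))
        # Each exit distills from all teachers, focusing on its primary teacher.
        other_weight = (1.0 - self.primary_weight) / max(self.num_exits - 1, 1)
        for i, z in enumerate(exit_logits):
            log_p = F.log_softmax(z / T, dim=1)
            for t in range(self.num_exits):
                q = F.softmax(teachers[t].detach() / T, dim=1)
                alpha = self.primary_weight if t == i else other_weight
                loss = loss + alpha * (T * T) * F.kl_div(log_p, q, reduction='batchmean')
        return loss


if __name__ == "__main__":
    # Illustrative usage with random logits standing in for a 4-exit network.
    torch.manual_seed(0)
    num_exits, batch, classes = 4, 8, 100
    logits = [torch.randn(batch, classes, requires_grad=True) for _ in range(num_exits)]
    labels = torch.randint(0, classes, (batch,))
    criterion = MultiExitDistiller(num_exits=num_exits)
    loss = criterion(logits, labels)
    loss.backward()
```

In this sketch the primary teacher of exit i is simply teacher i; how the primary teacher is assigned, and how strongly the remaining teachers are weighted, are design choices the abstract does not specify.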