
CLC number: TP391.41
On-line Access: 2025-06-04
Received: 2025-05-14
Revision Accepted: 2025-06-04
Crosschecked: 2025-09-04
Shaowu XU, Xibin JIA, Qianmei SUN, Jing CHANG. Temporal fidelity enhancement for video action recognition[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2500164
Temporal fidelity enhancement for video action recognition

1 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
2 Beijing Chaoyang Hospital, Capital Medical University, Beijing 100020, China

Abstract: Temporal attention mechanisms are crucial for video action recognition: they allow a model to focus on key segments that carry rich semantic information. However, such models often suffer from temporal distortion, i.e., a misalignment between attention weights and semantic content, caused by limited training diversity and the absence of fine-grained temporal supervision. Although video-level labels provide coarse-grained action guidance, the lack of detailed constraints allows attention noise to persist, especially in complex scenes with distracting spatial elements. To address this problem, we propose temporal fidelity enhancement (TFE), an adversarial learning paradigm grounded in the disentangled information bottleneck (DisenIB) theory. TFE mitigates temporal distortion by separating action-relevant semantics from spurious correlations through adversarial feature disentanglement. Initialized from pre-trained representations, TFE establishes an adversarial learning pipeline in which segments receiving high temporal attention compete against context whose action relevance has been weakened. The method ensures temporal consistency and improves the fidelity of the attention weights without requiring fine-grained supervision labels. Extensive experiments on the UCF101, HMDB-51, and Charades benchmarks validate its effectiveness, showing that TFE yields significant gains in action recognition accuracy.

Key words:
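The abstract only outlines the adversarial pipeline, so the following minimal PyTorch sketch illustrates one plausible reading of it: a temporal attention module pools high-attention segments for classification, while an adversarial head tries to recognize the action from the attention-weakened context through a gradient-reversal layer, pushing action-relevant evidence into the attended segments. All names (TemporalAttention, TFESketch, tfe_loss), the gradient-reversal trick, and the pooling choices are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class TemporalAttention(nn.Module):
    """Scores each of T segment features and normalizes the scores into attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                                   # feats: (B, T, D)
        return self.score(feats).squeeze(-1).softmax(dim=-1)    # weights: (B, T)

class TFESketch(nn.Module):
    """Attended segments feed the action head; the attention-weakened context
    feeds an adversarial head through gradient reversal."""
    def __init__(self, dim, num_classes, lam=0.1):
        super().__init__()
        self.lam = lam
        self.attn = TemporalAttention(dim)
        self.action_head = nn.Linear(dim, num_classes)
        self.context_head = nn.Linear(dim, num_classes)  # the adversary

    def forward(self, feats):                            # feats from a pre-trained backbone
        w = self.attn(feats)                             # (B, T)
        attended = (w.unsqueeze(-1) * feats).sum(dim=1)           # high-attention pooling
        context = ((1.0 - w).unsqueeze(-1) * feats).mean(dim=1)   # weakened context
        context = GradReverse.apply(context, self.lam)
        return self.action_head(attended), self.context_head(context)

def tfe_loss(model, feats, labels):
    """The adversary minimizes its own error on the context; the reversed gradients
    instead drive the attention to strip action evidence out of the context."""
    logits_action, logits_context = model(feats)
    return F.cross_entropy(logits_action, labels) + F.cross_entropy(logits_context, labels)

A quick smoke test under the same assumptions: with feats = torch.randn(4, 8, 512) (4 clips, 8 segments, 512-d backbone features) and labels = torch.randint(0, 101, (4,)), calling loss = tfe_loss(TFESketch(512, 101), feats, labels) followed by loss.backward() trains both heads jointly. The gradient-reversal layer is a standard adversarial-training device borrowed from domain-adversarial networks; it stands in here for whatever competition scheme the paper actually uses.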
References
[1] Aghaeipoor F, Sabokrou M, Fernández A, 2023. Fuzzy rule-based explainer systems for deep neural networks: from local explainability to global understanding. IEEE Trans Fuzzy Syst, 31(9):3069-3080.
[2] Carreira J, Zisserman A, 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.4724-4733.
[3] Cen J, Zhang SW, Wang X, et al., 2023. Enlarging instance-specific and class-specific information for open-set action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.15295-15304.
[4] Chen JB, Song L, Wainwright MJ, et al., 2018. Learning to explain: an information-theoretic perspective on model interpretation. Proc 35th Int Conf on Machine Learning, p.882-891.
[5] Chi HG, Ha MH, Chi S, et al., 2022. InfoGCN: representation learning for human skeleton-based action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.20154-20164.
[6] Dimitrov AG, Miller JP, 2001. Neural coding and decoding: communication channels and quantization. Netw Comput Neur Syst, 12(4):441-472.
[7] Fan HQ, Xiong B, Mangalam K, et al., 2021. Multiscale vision Transformers. Proc IEEE/CVF Int Conf on Computer Vision, p.6804-6815.
[8] Feichtenhofer C, 2020. X3D: expanding architectures for efficient video recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.200-210.
[9] Feichtenhofer C, Fan HQ, Malik J, et al., 2019. SlowFast networks for video recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.6201-6210.
[10] Gao SY, Chen Z, Chen G, et al., 2024. AVSegFormer: audio-visual segmentation with Transformer. Proc 38th AAAI Conf on Artificial Intelligence, p.12155-12163.
[11] Girdhar R, Ramanan D, Gupta A, et al., 2017. ActionVLAD: learning spatio-temporal aggregation for action classification. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3165-3174.
[12] Guo HJ, Wang HJ, Ji Q, 2022. Uncertainty-guided probabilistic Transformer for complex action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.20020-20029.
[13] He KM, Zhang XY, Ren SQ, et al., 2016. Deep residual learning for image recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.770-778.
[14] Hussein N, Gavves E, Smeulders AWM, 2019a. Timeception for complex action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.254-263.
[15] Hussein N, Gavves E, Smeulders AWM, 2019b. VideoGraph: recognizing minutes-long human activities in videos. https://arxiv.org/abs/1905.05143
[16] Ishikawa Y, Kondo M, Kataoka H, 2024. Learnable cube-based video encryption for privacy-preserving action recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.6988-6998.
[17] Jiang BY, Wang MM, Gan WH, et al., 2019. STM: spatiotemporal and motion encoding for action recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.2000-2009.
[18] Jiao JY, Tang YM, Lin KY, et al., 2023. DilateFormer: multi-scale dilated Transformer for visual recognition. IEEE Trans Multim, 25:8906-8919.
[19] Jiao LC, Song X, You C, et al., 2024. AI meets physics: a comprehensive survey. Artif Intell Rev, 57(9):256.
[20] Jiao LC, Ma MR, He P, et al., 2025. Brain-inspired learning, perception, and cognition: a comprehensive review. IEEE Trans Neur Netw Learn Syst, 36(4):5921-5941.
[21] Kuehne H, Jhuang H, Garrote E, et al., 2011. HMDB: a large video database for human motion recognition. Proc Int Conf on Computer Vision, p.2556-2563.
[22] Li XH, Zhu YH, Wang LM, 2023. ZeroI2V: zero-cost adaptation of pre-trained Transformers from image to video. Proc 18th European Conf on Computer Vision, p.425-443.
[23] Li Z, Zhang RQ, Zou DQ, et al., 2023. Robin: a novel method to produce robust interpreters for deep learning-based code classifiers. Proc 38th IEEE/ACM Int Conf on Automated Software Engineering, p.27-39.
[24] Liang J, Bai B, Cao YR, et al., 2020. Adversarial infidelity learning for model interpretation. Proc 26th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining, p.286-296.
[25] Lin J, Gan C, Han S, 2019. TSM: temporal shift module for efficient video understanding. Proc IEEE/CVF Int Conf on Computer Vision, p.7082-7092.
[26] Liu Y, Liu F, Jiao LC, et al., 2024. A knowledge-based hierarchical causal inference network for video action recognition. IEEE Trans Multim, 26:9135-9149.
[27] Liu Y, Liu F, Jiao LC, et al., 2025. Knowledge-driven compositional action recognition. Patt Recogn, 163:111452.
[28] Liu ZY, Wang LM, Wu W, et al., 2021. TAM: temporal adaptive module for video recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.13688-13698.
[29] Loshchilov I, Hutter F, 2019. Decoupled weight decay regularization. Proc 7th Int Conf on Learning Representations. https://arxiv.org/abs/1711.05101
[30] Mondal A, Nag S, Prada JM, et al., 2023. Actor-agnostic multi-label action recognition with multi-modal query. Proc IEEE/CVF Int Conf on Computer Vision Workshops, p.784-794.
[31] Pan ZQ, Niu L, Zhang JF, et al., 2021. Disentangled information bottleneck. Proc 35th AAAI Conf on Artificial Intelligence, p.9285-9293.
[32] Paszke A, Gross S, Massa F, et al., 2019. PyTorch: an imperative style, high-performance deep learning library. Proc 33rd Int Conf on Neural Information Processing Systems, Article 721.
[33] Sigurdsson GA, Varol G, Wang XL, et al., 2016. Hollywood in homes: crowdsourcing data collection for activity understanding. Proc 14th European Conf on Computer Vision, p.510-526.
[34] Sigurdsson GA, Divvala S, Farhadi A, et al., 2017. Asynchronous temporal fields for action recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.5650-5659.
[35] Soomro K, Zamir AR, Shah M, 2012. UCF101: a dataset of 101 human actions classes from videos in the wild. https://arxiv.org/abs/1212.0402
[36] Srivastava A, Dutta O, Gupta J, et al., 2021. A variational information bottleneck based method to compress sequential networks for human action recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.2744-2753.
[37] Tishby N, Pereira FC, Bialek W, 2000. The information bottleneck method. https://arxiv.org/abs/physics/0004057
[38] Tong Z, Song YB, Wang J, et al., 2022. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. Proc 36th Int Conf on Neural Information Processing Systems, Article 732.
[39] Tran D, Wang H, Torresani L, et al., 2018. A closer look at spatiotemporal convolutions for action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6450-6459.
[40] Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. Proc 31st Int Conf on Neural Information Processing Systems, p.6000-6010.
[41] Wang H, Liu F, Jiao LC, et al., 2024. ViLT-CLIP: video and language tuning CLIP with multimodal prompt learning and scenario-guided optimization. Proc 38th AAAI Conf on Artificial Intelligence, p.5390-5400.
[42] Wang LM, Li W, Li W, et al., 2018. Appearance-and-relation networks for video classification. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1430-1439.
[43] Wang LM, Tong Z, Ji B, et al., 2021. TDN: temporal difference networks for efficient action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1895-1904.
[44] Wang MM, Xing JZ, Mei JB, et al., 2023. ActionCLIP: adapting language-image pretrained models for video action recognition. IEEE Trans Neur Netw Learn Syst, 36(1):625-637.
[45] Wang R, Chen DD, Wu ZX, et al., 2023. Masked video distillation: rethinking masked feature modeling for self-supervised video representation learning. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6312-6322.
[46] Watson DS, O'Hara J, Tax N, et al., 2024. Explaining predictive uncertainty with information theoretic Shapley values. Proc 37th Int Conf on Neural Information Processing Systems, Article 320.
[47] Wu CY, Li YH, Mangalam K, et al., 2022. MeMViT: memory-augmented multiscale vision Transformer for efficient long-term video recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.13577-13587.
[48] Wu WH, He DL, Lin TW, et al., 2021. MVFNet: multi-view fusion network for efficient video recognition. Proc 35th AAAI Conf on Artificial Intelligence, p.2943-2951.
[49] Wu WH, Wang XH, Luo HP, et al., 2023. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6620-6630.
[50] Wu WH, Sun Z, Song YX, et al., 2024. Transferring vision-language models for visual recognition: a classifier perspective. Int J Comput Vis, 132(2):392-409.
[51] Xie SN, Sun C, Huang J, et al., 2018. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. Proc 15th European Conf on Computer Vision, p.318-335.
[52] Yamazaki K, Vo K, Truong QS, et al., 2023. VLTinT: visual-linguistic Transformer-in-Transformer for coherent video paragraph captioning. Proc 37th AAAI Conf on Artificial Intelligence, p.3081-3090.
[53] Yu TS, Li YK, Li BX, 2020. RhyRNN: rhythmic RNN for recognizing events in long and complex videos. Proc 16th European Conf on Computer Vision, p.127-144.
[54] Zhang J, Wan ZF, Hu LQ, et al., 2025. Collaboratively self-supervised video representation learning for action recognition. IEEE Trans Inform Forens Secur, 20:1895-1907.
[55] Zheng ZW, Yang L, Wang YL, et al., 2024. Dynamic spatial focus for efficient compressed video action recognition. IEEE Trans Circ Syst Video Technol, 34(2):695-708.
[56] Zhou BL, Andonian A, Oliva A, et al., 2018. Temporal relational reasoning in videos. Proc 15th European Conf on Computer Vision, p.831-846.
[57] Zhou JM, Lin KY, Li HX, et al., 2021. Graph-based high-order relation modeling for long-term action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8980-8989.
[58] Zhou JM, Lin KY, Qiu YK, et al., 2024. TwinFormer: fine-to-coarse temporal modeling for long-term action recognition. IEEE Trans Multim, 26:2715-2728.


