
CLC number: TP391.41
On-line Access: 2025-06-04
Received: 2025-05-14
Revision Accepted: 2025-06-04
Crosschecked: 2025-09-04
Shaowu XU, Xibin JIA, Qianmei SUN, Jing CHANG. Temporal fidelity enhancement for video action recognition[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2500164
Temporal fidelity enhancement for video action recognition

1 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
2 Beijing Chaoyang Hospital, Capital Medical University, Beijing 100020, China

Abstract: Temporal attention mechanisms are crucial for video action recognition: they allow a model to focus on key segments that carry rich semantic information. However, such models often suffer from temporal distortion, i.e., a misalignment between attention weights and semantic content, caused by limited training diversity and the absence of fine-grained temporal supervision. Although video-level labels provide coarse-grained action guidance, the lack of detailed constraints allows attention noise to persist, especially in complex scenes with distracting spatial elements. To address this problem, we propose temporal fidelity enhancement (TFE), an adversarial learning paradigm grounded in the disentangled information bottleneck (DisenIB) theory. TFE mitigates temporal distortion by separating action-relevant semantics from spurious correlations through adversarial feature disentanglement. Initialized from pre-trained representations, TFE establishes an adversarial learning pipeline in which segments receiving high temporal attention compete against context whose action relevance has been weakened. The method ensures temporal consistency and improves the fidelity of the attention weights without requiring fine-grained supervision labels. Extensive experiments on the UCF101, HMDB-51, and Charades benchmarks validate its effectiveness, showing that TFE yields significant gains in action recognition accuracy.

Key words:
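The abstract only outlines the adversarial pipeline, so the following minimal PyTorch sketch illustrates one plausible reading of it: a temporal attention module pools high-attention segments for classification, while an adversarial head tries to recognize the action from the attention-weakened context through a gradient-reversal layer, pushing action-relevant evidence into the attended segments. All names (TemporalAttention, TFESketch, tfe_loss), the gradient-reversal trick, and the pooling choices are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class TemporalAttention(nn.Module):
    """Scores each of T segment features and normalizes the scores into attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                                   # feats: (B, T, D)
        return self.score(feats).squeeze(-1).softmax(dim=-1)    # weights: (B, T)

class TFESketch(nn.Module):
    """Attended segments feed the action head; the attention-weakened context
    feeds an adversarial head through gradient reversal."""
    def __init__(self, dim, num_classes, lam=0.1):
        super().__init__()
        self.lam = lam
        self.attn = TemporalAttention(dim)
        self.action_head = nn.Linear(dim, num_classes)
        self.context_head = nn.Linear(dim, num_classes)  # the adversary

    def forward(self, feats):                            # feats from a pre-trained backbone
        w = self.attn(feats)                             # (B, T)
        attended = (w.unsqueeze(-1) * feats).sum(dim=1)           # high-attention pooling
        context = ((1.0 - w).unsqueeze(-1) * feats).mean(dim=1)   # weakened context
        context = GradReverse.apply(context, self.lam)
        return self.action_head(attended), self.context_head(context)

def tfe_loss(model, feats, labels):
    """The adversary minimizes its own error on the context; the reversed gradients
    instead drive the attention to strip action evidence out of the context."""
    logits_action, logits_context = model(feats)
    return F.cross_entropy(logits_action, labels) + F.cross_entropy(logits_context, labels)

A quick smoke test under the same assumptions: with feats = torch.randn(4, 8, 512) (4 clips, 8 segments, 512-d backbone features) and labels = torch.randint(0, 101, (4,)), calling loss = tfe_loss(TFESketch(512, 101), feats, labels) followed by loss.backward() trains both heads jointly. The gradient-reversal layer is a standard adversarial-training device borrowed from domain-adversarial networks; it stands in here for whatever competition scheme the paper actually uses.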
References
[1] Aghaeipoor F, Sabokrou M, Fernández A, 2023. Fuzzy rule-based explainer systems for deep neural networks: from local explainability to global understanding. IEEE Trans Fuzzy Syst, 31(9):3069-3080.
[2] Carreira J, Zisserman A, 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.4724-4733.
[3] Cen J, Zhang SW, Wang X, et al., 2023. Enlarging instance-specific and class-specific information for open-set action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.15295-15304.
[4] Chen JB, Song L, Wainwright MJ, et al., 2018. Learning to explain: an information-theoretic perspective on model interpretation. Proc 35th Int Conf on Machine Learning, p.882-891.
[5] Chi HG, Ha MH, Chi S, et al., 2022. InfoGCN: representation learning for human skeleton-based action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.20154-20164.
[6] Dimitrov AG, Miller JP, 2001. Neural coding and decoding: communication channels and quantization. Netw Comput Neur Syst, 12(4):441-472.
[7] Fan HQ, Xiong B, Mangalam K, et al., 2021. Multiscale vision Transformers. Proc IEEE/CVF Int Conf on Computer Vision, p.6804-6815.
[8] Feichtenhofer C, 2020. X3D: expanding architectures for efficient video recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.200-210.
[9] Feichtenhofer C, Fan HQ, Malik J, et al., 2019. SlowFast networks for video recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.6201-6210.
[10] Gao SY, Chen Z, Chen G, et al., 2024. AVSegFormer: audio-visual segmentation with Transformer. Proc 38th AAAI Conf on Artificial Intelligence, p.12155-12163.
[11] Girdhar R, Ramanan D, Gupta A, et al., 2017. ActionVLAD: learning spatio-temporal aggregation for action classification. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3165-3174.
[12] Guo HJ, Wang HJ, Ji Q, 2022. Uncertainty-guided probabilistic Transformer for complex action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.20020-20029.
[13] He KM, Zhang XY, Ren SQ, et al., 2016. Deep residual learning for image recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.770-778.
[14] Hussein N, Gavves E, Smeulders AWM, 2019a. Timeception for complex action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.254-263.
[15] Hussein N, Gavves E, Smeulders AWM, 2019b. VideoGraph: recognizing minutes-long human activities in videos. https://arxiv.org/abs/1905.05143
[16] Ishikawa Y, Kondo M, Kataoka H, 2024. Learnable cube-based video encryption for privacy-preserving action recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.6988-6998.
[17] Jiang BY, Wang MM, Gan WH, et al., 2019. STM: spatiotemporal and motion encoding for action recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.2000-2009.
[18] Jiao JY, Tang YM, Lin KY, et al., 2023. DilateFormer: multi-scale dilated Transformer for visual recognition. IEEE Trans Multim, 25:8906-8919.
[19] Jiao LC, Song X, You C, et al., 2024. AI meets physics: a comprehensive survey. Artif Intell Rev, 57(9):256.
[20] Jiao LC, Ma MR, He P, et al., 2025. Brain-inspired learning, perception, and cognition: a comprehensive review. IEEE Trans Neur Netw Learn Syst, 36(4):5921-5941.
[21] Kuehne H, Jhuang H, Garrote E, et al., 2011. HMDB: a large video database for human motion recognition. Proc Int Conf on Computer Vision, p.2556-2563.
[22] Li XH, Zhu YH, Wang LM, 2023. ZeroI2V: zero-cost adaptation of pre-trained Transformers from image to video. Proc 18th European Conf on Computer Vision, p.425-443.
[23] Li Z, Zhang RQ, Zou DQ, et al., 2023. Robin: a novel method to produce robust interpreters for deep learning-based code classifiers. Proc 38th IEEE/ACM Int Conf on Automated Software Engineering, p.27-39.
[24] Liang J, Bai B, Cao YR, et al., 2020. Adversarial infidelity learning for model interpretation. Proc 26th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining, p.286-296.
[25] Lin J, Gan C, Han S, 2019. TSM: temporal shift module for efficient video understanding. Proc IEEE/CVF Int Conf on Computer Vision, p.7082-7092.
[26] Liu Y, Liu F, Jiao LC, et al., 2024. A knowledge-based hierarchical causal inference network for video action recognition. IEEE Trans Multim, 26:9135-9149.
[27] Liu Y, Liu F, Jiao LC, et al., 2025. Knowledge-driven compositional action recognition. Patt Recogn, 163:111452.
[28] Liu ZY, Wang LM, Wu W, et al., 2021. TAM: temporal adaptive module for video recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.13688-13698.
[29] Loshchilov I, Hutter F, 2019. Decoupled weight decay regularization. Proc 7th Int Conf on Learning Representations. https://arxiv.org/abs/1711.05101
[30] Mondal A, Nag S, Prada JM, et al., 2023. Actor-agnostic multi-label action recognition with multi-modal query. Proc IEEE/CVF Int Conf on Computer Vision Workshops, p.784-794.
[31] Pan ZQ, Niu L, Zhang JF, et al., 2021. Disentangled information bottleneck. Proc 35th AAAI Conf on Artificial Intelligence, p.9285-9293.
[32] Paszke A, Gross S, Massa F, et al., 2019. PyTorch: an imperative style, high-performance deep learning library. Proc 33rd Int Conf on Neural Information Processing Systems, Article 721.
[33] Sigurdsson GA, Varol G, Wang XL, et al., 2016. Hollywood in homes: crowdsourcing data collection for activity understanding. Proc 14th European Conf on Computer Vision, p.510-526.
[34] Sigurdsson GA, Divvala S, Farhadi A, et al., 2017. Asynchronous temporal fields for action recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.5650-5659.
[35] Soomro K, Zamir AR, Shah M, 2012. UCF101: a dataset of 101 human actions classes from videos in the wild. https://arxiv.org/abs/1212.0402
[36] Srivastava A, Dutta O, Gupta J, et al., 2021. A variational information bottleneck based method to compress sequential networks for human action recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.2744-2753.
[37] Tishby N, Pereira FC, Bialek W, 2000. The information bottleneck method. https://arxiv.org/abs/physics/0004057
[38] Tong Z, Song YB, Wang J, et al., 2022. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. Proc 36th Int Conf on Neural Information Processing Systems, Article 732.
[39] Tran D, Wang H, Torresani L, et al., 2018. A closer look at spatiotemporal convolutions for action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6450-6459.
[40] Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. Proc 31st Int Conf on Neural Information Processing Systems, p.6000-6010.
[41] Wang H, Liu F, Jiao LC, et al., 2024. ViLT-CLIP: video and language tuning CLIP with multimodal prompt learning and scenario-guided optimization. Proc 38th AAAI Conf on Artificial Intelligence, p.5390-5400.
[42] Wang LM, Li W, Li W, et al., 2018. Appearance-and-relation networks for video classification. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1430-1439.
[43] Wang LM, Tong Z, Ji B, et al., 2021. TDN: temporal difference networks for efficient action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1895-1904.
[44] Wang MM, Xing JZ, Mei JB, et al., 2023. ActionCLIP: adapting language-image pretrained models for video action recognition. IEEE Trans Neur Netw Learn Syst, 36(1):625-637.
[45] Wang R, Chen DD, Wu ZX, et al., 2023. Masked video distillation: rethinking masked feature modeling for self-supervised video representation learning. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6312-6322.
[46] Watson DS, O'Hara J, Tax N, et al., 2024. Explaining predictive uncertainty with information theoretic Shapley values. Proc 37th Int Conf on Neural Information Processing Systems, Article 320.
[47] Wu CY, Li YH, Mangalam K, et al., 2022. MeMViT: memory-augmented multiscale vision Transformer for efficient long-term video recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.13577-13587.
[48] Wu WH, He DL, Lin TW, et al., 2021. MVFNet: multi-view fusion network for efficient video recognition. Proc 35th AAAI Conf on Artificial Intelligence, p.2943-2951.
[49] Wu WH, Wang XH, Luo HP, et al., 2023. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6620-6630.
[50] Wu WH, Sun Z, Song YX, et al., 2024. Transferring vision-language models for visual recognition: a classifier perspective. Int J Comput Vis, 132(2):392-409.
[51] Xie SN, Sun C, Huang J, et al., 2018. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. Proc 15th European Conf on Computer Vision, p.318-335.
[52] Yamazaki K, Vo K, Truong QS, et al., 2023. VLTinT: visual-linguistic Transformer-in-Transformer for coherent video paragraph captioning. Proc 37th AAAI Conf on Artificial Intelligence, p.3081-3090.
[53] Yu TS, Li YK, Li BX, 2020. RhyRNN: rhythmic RNN for recognizing events in long and complex videos. Proc 16th European Conf on Computer Vision, p.127-144.
[54] Zhang J, Wan ZF, Hu LQ, et al., 2025. Collaboratively self-supervised video representation learning for action recognition. IEEE Trans Inform Forens Secur, 20:1895-1907.
[55] Zheng ZW, Yang L, Wang YL, et al., 2024. Dynamic spatial focus for efficient compressed video action recognition. IEEE Trans Circ Syst Video Technol, 34(2):695-708.
[56] Zhou BL, Andonian A, Oliva A, et al., 2018. Temporal relational reasoning in videos. Proc 15th European Conf on Computer Vision, p.831-846.
[57] Zhou JM, Lin KY, Li HX, et al., 2021. Graph-based high-order relation modeling for long-term action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8980-8989.
[58] Zhou JM, Lin KY, Qiu YK, et al., 2024. TwinFormer: fine-to-coarse temporal modeling for long-term action recognition. IEEE Trans Multim, 26:2715-2728.


