CLC number: TP391.41
On-line Access: 2025-06-04
Received: 2025-05-14
Revision Accepted: 2025-06-04
Crosschecked: 2025-09-04
Citation:
Shaowu XU, Xibin JIA, Qianmei SUN, Jing CHANG. Temporal fidelity enhancement for video action recognition[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(8): 1293-1304.
Abstract: Temporal attention mechanisms are essential for video action recognition, enabling models to focus on semantically informative moments. However, these models frequently exhibit temporal infidelity, i.e., attention weights misaligned with action-relevant moments, caused by limited training diversity and the absence of fine-grained temporal supervision. While video-level labels provide coarse-grained action guidance, the lack of detailed constraints allows attention noise to persist, especially in complex scenarios with distracting spatial elements. To address this issue, we propose temporal fidelity enhancement (TFE), a competitive learning paradigm based on the disentangled information bottleneck (DisenIB) theory. TFE mitigates temporal infidelity by decoupling action-relevant semantics from spurious correlations through adversarial feature disentanglement. Using pre-trained representations for initialization, TFE establishes an adversarial process in which segments with elevated temporal attention compete against contexts with diminished action relevance. This mechanism ensures temporal consistency and enhances the fidelity of attention patterns without requiring explicit fine-grained supervision. Extensive experiments on the UCF101, HMDB-51, and Charades benchmarks validate the effectiveness of our method, yielding significant improvements in action recognition accuracy.
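To make the competitive mechanism described in the abstract more concrete, the sketch below shows one plausible instantiation in PyTorch (the framework listed in the references). It is not the authors' TFE implementation: the module names, the feature dimension, the use of gradient reversal as the adversarial mechanism, the choice of 101 classes (matching UCF101), and the loss weight beta are all illustrative assumptions. The idea captured here is that per-segment temporal attention weights split clip features into a high-attention action summary and a low-attention context summary, and an adversarial term penalizes the context branch for still carrying action-relevant evidence.

# Hedged sketch of attention-guided adversarial disentanglement; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass, negated gradient in the backward pass.
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class TemporalFidelitySketch(nn.Module):
    def __init__(self, feat_dim=768, num_classes=101):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)            # per-segment temporal attention score
        self.action_head = nn.Linear(feat_dim, num_classes)
        self.context_head = nn.Linear(feat_dim, num_classes)

    def forward(self, segment_feats):
        # segment_feats: (batch, num_segments, feat_dim), e.g. from a pre-trained backbone.
        w = torch.softmax(self.attn(segment_feats).squeeze(-1), dim=1)   # (batch, num_segments)
        # High-attention segments form the "action" summary ...
        action_repr = torch.einsum('bt,btd->bd', w, segment_feats)
        # ... low-attention segments form the competing "context" summary.
        context_repr = torch.einsum('bt,btd->bd', 1.0 - w, segment_feats)
        logits_action = self.action_head(action_repr)
        # Gradient reversal trains the context branch adversarially, so action-relevant
        # evidence is discouraged from hiding in low-attention segments.
        logits_context = self.context_head(GradReverse.apply(context_repr))
        return logits_action, logits_context

def tfe_style_loss(logits_action, logits_context, labels, beta=0.5):
    # Video-level classification loss plus an adversarial term on the context branch.
    return (F.cross_entropy(logits_action, labels)
            + beta * F.cross_entropy(logits_context, labels))

Minimizing this loss pushes the attention weights so that the high-attention summary alone explains the video-level label while, through the reversed gradient, the low-attention context becomes uninformative about it, which is one way to read the competition between attended segments and discarded contexts described in the abstract.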
[1]Aghaeipoor F, Sabokrou M, Fernández A, 2023. Fuzzy rule-based explainer systems for deep neural networks: from local explainability to global understanding. IEEE Trans Fuzzy Syst, 31(9):3069-3080.
[2]Carreira J, Zisserman A, 2017. Quo vadis, action recognition? A new model and the kinetics dataset. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.4724-4733.
[3]Cen J, Zhang SW, Wang X, et al., 2023. Enlarging instance-specific and class-specific information for open-set action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.15295-15304.
[4]Chen JB, Song L, Wainwright MJ, et al., 2018. Learning to explain: an information-theoretic perspective on model interpretation. Proc 35th Int Conf on Machine Learning, p.882-891.
[5]Chi HG, Ha MH, Chi S, et al., 2022. InfoGCN: representation learning for human skeleton-based action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.20154-20164.
[6]Dimitrov AG, Miller JP, 2001. Neural coding and decoding: communication channels and quantization. Netw Comput Neur Syst, 12(4):441-472.
[7]Fan HQ, Xiong B, Mangalam K, et al., 2021. Multiscale vision Transformers. Proc IEEE/CVF Int Conf on Computer Vision, p.6804-6815.
[8]Feichtenhofer C, 2020. X3D: expanding architectures for efficient video recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.200-210.
[9]Feichtenhofer C, Fan HQ, Malik J, et al., 2019. SlowFast networks for video recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.6201-6210.
[10]Gao SY, Chen Z, Chen G, et al., 2024. AVSegFormer: audio-visual segmentation with Transformer. Proc 38th AAAI Conf on Artificial Intelligence, p.12155-12163.
[11]Girdhar R, Ramanan D, Gupta A, et al., 2017. ActionVLAD: learning spatio-temporal aggregation for action classification. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.3165-3174.
[12]Guo HJ, Wang HJ, Ji Q, 2022. Uncertainty-guided probabilistic Transformer for complex action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.20020-20029.
[13]He KM, Zhang XY, Ren SQ, et al., 2016. Deep residual learning for image recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.770-778.
[14]Hussein N, Gavves E, Smeulders AWM, 2019a. Timeception for complex action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.254-263.
[15]Hussein N, Gavves E, Smeulders AWM, 2019b. VideoGraph: recognizing minutes-long human activities in videos. https://arxiv.org/abs/1905.05143
[16]Ishikawa Y, Kondo M, Kataoka H, 2024. Learnable cube-based video encryption for privacy-preserving action recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.6988-6998.
[17]Jiang BY, Wang MM, Gan WH, et al., 2019. STM: spatiotemporal and motion encoding for action recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.2000-2009.
[18]Jiao JY, Tang YM, Lin KY, et al., 2023. DilateFormer: multi-scale dilated Transformer for visual recognition. IEEE Trans Multim, 25:8906-8919.
[19]Jiao LC, Song X, You C, et al., 2024. AI meets physics: a comprehensive survey. Artif Intell Rev, 57(9):256.
[20]Jiao LC, Ma MR, He P, et al., 2025. Brain-inspired learning, perception, and cognition: a comprehensive review. IEEE Trans Neur Netw Learn Syst, 36(4):5921-5941.
[21]Kuehne H, Jhuang H, Garrote E, et al., 2011. HMDB: a large video database for human motion recognition. Proc Int Conf on Computer Vision, p.2556-2563.
[22]Li XH, Zhu YH, Wang LM, 2023. Zeroi2V: zero-cost adaptation of pre-trained Transformers from image to video. Proc 18th European Conf on Computer Vision, p.425-443.
[23]Li Z, Zhang RQ, Zou DQ, et al., 2023. Robin: a novel method to produce robust interpreters for deep learning-based code classifiers. Proc 38th IEEE/ACM Int Conf on Automated Software Engineering, p.27-39.
[24]Liang J, Bai B, Cao YR, et al., 2020. Adversarial infidelity learning for model interpretation. Proc 26th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining, p.286-296.
[25]Lin J, Gan C, Han S, 2019. TSM: temporal shift module for efficient video understanding. Proc IEEE/CVF Int Conf on Computer Vision, p.7082-7092.
[26]Liu Y, Liu F, Jiao LC, et al., 2024. A knowledge-based hierarchical causal inference network for video action recognition. IEEE Trans Multim, 26:9135-9149.
[27]Liu Y, Liu F, Jiao LC, et al., 2025. Knowledge-driven compositional action recognition. Patt Recogn, 163:111452.
[28]Liu ZY, Wang LM, Wu W, et al., 2021. TAM: temporal adaptive module for video recognition. Proc IEEE/CVF Int Conf on Computer Vision, p.13688-13698.
[29]Loshchilov I, Hutter F, 2019. Decoupled weight decay regularization. Proc 7th Int Conf on Learning Representations. https://arxiv.org/abs/1711.05101
[30]Mondal A, Nag S, Prada JM, et al., 2023. Actor-agnostic multi-label action recognition with multi-modal query. Proc IEEE/CVF Int Conf on Computer Vision Workshops, p.784-794.
[31]Pan ZQ, Niu L, Zhang JF, et al., 2021. Disentangled information bottleneck. Proc 35th AAAI Conf on Artificial Intelligence, p.9285-9293.
[32]Paszke A, Gross S, Massa F, et al., 2019. PyTorch: an imperative style, high-performance deep learning library. Proc 33rd Int Conf on Neural Information Processing Systems, Article 721.
[33]Sigurdsson GA, Varol G, Wang XL, et al., 2016. Hollywood in homes: crowdsourcing data collection for activity understanding. Proc 14th European Conf on Computer Vision, p.510-526.
[34]Sigurdsson GA, Divvala S, Farhadi A, et al., 2017. Asynchronous temporal fields for action recognition. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.5650-5659.
[35]Soomro K, Zamir AR, Shah M, 2012. UCF101: a dataset of 101 human actions classes from videos in the wild. https://arxiv.org/abs/1212.0402
[36]Srivastava A, Dutta O, Gupta J, et al., 2021. A variational information bottleneck based method to compress sequential networks for human action recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.2744-2753.
[37]Tishby N, Pereira FC, Bialek W, 2000. The information bottleneck method. https://arxiv.org/abs/physics/0004057
[38]Tong Z, Song YB, Wang J, et al., 2022. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. Proc 36th Int Conf on Neural Information Processing Systems, Article 732.
[39]Tran D, Wang H, Torresani L, et al., 2018. A closer look at spatiotemporal convolutions for action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6450-6459.
[40]Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. Proc 31st Int Conf on Neural Information Processing Systems, p.6000-6010.
[41]Wang H, Liu F, Jiao LC, et al., 2024. ViLT-CLIP: video and language tuning clip with multimodal prompt learning and scenario-guided optimization. Proc 38th AAAI Conf on Artificial Intelligence, p.5390-5400.
[42]Wang LM, Li W, Li W, et al., 2018. Appearance-and-relation networks for video classification. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1430-1439.
[43]Wang LM, Tong Z, Ji B, et al., 2021. TDN: temporal difference networks for efficient action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.1895-1904.
[44]Wang MM, Xing JZ, Mei JB, et al., 2023. ActionCLIP: adapting language-image pretrained models for video action recognition. IEEE Trans Neur Netw Learn Syst, 36(1):625-637.
[45]Wang R, Chen DD, Wu ZX, et al., 2023. Masked video distillation: rethinking masked feature modeling for self-supervised video representation learning. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6312-6322.
[46]Watson DS, O’Hara J, Tax N, et al., 2024. Explaining predictive uncertainty with information theoretic Shapley values. Proc 37th Int Conf on Neural Information Processing Systems, Article 320.
[47]Wu CY, Li YH, Mangalam K, et al., 2022. MeMViT: memory-augmented multiscale vision Transformer for efficient long-term video recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.13577-13587.
[48]Wu WH, He DL, Lin TW, et al., 2021. MVFNet: multi-view fusion network for efficient video recognition. Proc 35th AAAI Conf on Artificial Intelligence, p.2943-2951.
[49]Wu WH, Wang XH, Luo HP, et al., 2023. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6620-6630.
[50]Wu WH, Sun Z, Song YX, et al., 2024. Transferring vision-language models for visual recognition: a classifier perspective. Int J Comput Vis, 132(2):392-409.
[51]Xie SN, Sun C, Huang J, et al., 2018. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. Proc 15th European Conf on Computer Vision, p.318-335.
[52]Yamazaki K, Vo K, Truong QS, et al., 2023. VLTinT: visual-linguistic Transformer-in-Transformer for coherent video paragraph captioning. Proc 37th AAAI Conf on Artificial Intelligence, p.3081-3090.
[53]Yu TS, Li YK, Li BX, 2020. RhyRNN: rhythmic RNN for recognizing events in long and complex videos. Proc 16th European Conf on Computer Vision, p.127-144.
[54]Zhang J, Wan ZF, Hu LQ, et al., 2025. Collaboratively self-supervised video representation learning for action recognition. IEEE Trans Inform Forens Secur, 20:1895-1907.
[55]Zheng ZW, Yang L, Wang YL, et al., 2024. Dynamic spatial focus for efficient compressed video action recognition. IEEE Trans Circ Syst Video Technol, 34(2):695-708.
[56]Zhou BL, Andonian A, Oliva A, et al., 2018. Temporal relational reasoning in videos. Proc 15th European Conf on Computer Vision, p.831-846.
[57]Zhou JM, Lin KY, Li HX, et al., 2021. Graph-based high-order relation modeling for long-term action recognition. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8980-8989.
[58]Zhou JM, Lin KY, Qiu YK, et al., 2024. TwinFormer: fine-to-coarse temporal modeling for long-term action recognition. IEEE Trans Multim, 26:2715-2728.