On-line Access: 2025-08-15
Received: 2025-03-14
Revision Accepted: 2025-06-04
Shaowu XU, Xibin JIA, Qianmei SUN, Jing CHANG. Temporal fidelity enhancement for video action recognition[J]. Frontiers of Information Technology & Electronic Engineering, 2025. doi:10.1631/FITEE.2500164
@article{xu2025temporal,
title="Temporal fidelity enhancement for video action recognition",
author="Shaowu XU and Xibin JIA and Qianmei SUN and Jing CHANG",
journal="Frontiers of Information Technology & Electronic Engineering",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500164"
}
%0 Journal Article
%T Temporal fidelity enhancement for video action recognition
%A Shaowu XU
%A Xibin JIA
%A Qianmei SUN
%A Jing CHANG
%J Frontiers of Information Technology & Electronic Engineering
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2500164
TY - JOUR
T1 - Temporal fidelity enhancement for video action recognition
A1 - Shaowu XU
A1 - Xibin JIA
A1 - Qianmei SUN
A1 - Jing CHANG
JO - Frontiers of Information Technology & Electronic Engineering
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500164
ER -
Abstract: Temporal attention mechanisms are essential for video action recognition, enabling models to focus on semantically informative moments. However, these models frequently exhibit temporal infidelity, that is, misaligned attention weights caused by limited training diversity and the absence of fine-grained temporal supervision. While video-level labels provide coarse-grained action guidance, the lack of detailed constraints allows attention noise to persist, especially in complex scenarios with distracting spatial elements. To address this issue, we propose temporal fidelity enhancement (TFE), a competitive learning paradigm based on disentangled information bottleneck (DisenIB) theory. TFE mitigates temporal infidelity by decoupling action-relevant semantics from spurious correlations through adversarial feature disentanglement. Using pre-trained representations for initialization, TFE establishes an adversarial process in which segments with elevated temporal attention compete against contexts with diminished action relevance. This mechanism ensures temporal consistency and enhances the fidelity of attention patterns without requiring explicit fine-grained supervision. Extensive studies on the UCF-101, HMDB-51, and Charades benchmarks validate the effectiveness of our method, with significant improvements in action recognition accuracy.
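To make the competition described in the abstract more concrete, the following minimal PyTorch sketch shows one way an adversarial contest between high-attention segments and low-attention context could be set up over pre-extracted segment features. It is not the authors' implementation: the module names, the top-attention split ratio, and the uniform-prediction penalty on the context branch are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch (not the paper's code): temporal attention over segment
# features plus a competition loss in which attended segments must predict the
# action while the low-attention context alone is pushed to be uninformative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionHead(nn.Module):
    """Per-segment attention weights and a video-level classifier."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)              # one score per segment
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, seg_feats: torch.Tensor):
        # seg_feats: (batch, num_segments, feat_dim), e.g. from a frozen pre-trained backbone
        weights = torch.softmax(self.attn(seg_feats).squeeze(-1), dim=1)   # (B, T)
        pooled = torch.einsum("bt,btd->bd", weights, seg_feats)            # attention-pooled video feature
        return self.classifier(pooled), weights

def competition_loss(model, seg_feats, labels, top_ratio=0.25):
    """Attended segments carry the classification loss; the remaining
    low-attention context is penalized if it still predicts the action."""
    logits, weights = model(seg_feats)
    cls_loss = F.cross_entropy(logits, labels)

    # Split segments into "action" (top attention) and "context" (the rest).
    num_segments = seg_feats.size(1)
    k = max(1, int(top_ratio * num_segments))
    order = weights.argsort(dim=1, descending=True)
    ctx_idx = order[:, k:]                                # low-attention segments
    ctx_feats = torch.gather(
        seg_feats, 1, ctx_idx.unsqueeze(-1).expand(-1, -1, seg_feats.size(-1)))
    ctx_logits = model.classifier(ctx_feats.mean(dim=1))

    # Drive the context branch toward uniform predictions, i.e. no action evidence.
    uniform = torch.full_like(ctx_logits, 1.0 / ctx_logits.size(-1))
    ctx_loss = F.kl_div(F.log_softmax(ctx_logits, dim=-1), uniform, reduction="batchmean")
    return cls_loss + ctx_loss

# Toy usage with random features standing in for pre-trained representations.
model = TemporalAttentionHead(feat_dim=512, num_classes=101)
feats = torch.randn(4, 16, 512)           # 4 videos, 16 segments each
labels = torch.randint(0, 101, (4,))
loss = competition_loss(model, feats, labels)
loss.backward()

In this toy setup the attention head only benefits from assigning high weights to segments that genuinely support the label, since any action evidence left in the low-weight context is penalized; the paper's actual objective, built on DisenIB-style disentanglement, is more elaborate than this sketch.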