Affiliation(s): 1 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; 2 Beijing Chao-yang Hospital, Capital Medical University, Beijing 100020, China
Abstract: Temporal attention mechanisms are essential for video action recognition because they let models focus on semantically informative moments. However, these models frequently exhibit temporal infidelity, that is, misaligned attention weights caused by limited training diversity and the absence of fine-grained temporal supervision. Video-level labels provide only coarse-grained action guidance, and without more detailed constraints attention noise persists, especially in complex scenes with distracting spatial elements. To address this issue, we propose temporal fidelity enhancement (TFE), a competitive learning paradigm based on disentangled information bottleneck (DisenIB) theory. TFE mitigates temporal infidelity by decoupling action-relevant semantics from spurious correlations through adversarial feature disentanglement. Initialized from pre-trained representations, TFE sets up an adversarial process in which segments with elevated temporal attention compete against contexts with diminished action relevance. This mechanism enforces temporal consistency and improves the fidelity of attention patterns without requiring explicit fine-grained supervision. Extensive experiments on the UCF-101, HMDB-51, and Charades benchmarks validate the effectiveness of our method, showing significant improvements in action recognition accuracy.