Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2025 Vol.26 No.11 P.2204-2214
TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
Abstract: Video large language models (video-LLMs) have demonstrated impressive capabilities in multimodal understanding, but their potential as zero-shot evaluators for temporal consistency in video captions remains underexplored. Existing methods notably underperform in detecting critical temporal errors, such as missing, hallucinated, or misordered actions. To address this gap, we introduce two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs. It incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark featuring videos across four distinct complexity levels, specifically designed with fine-grained temporal error annotations to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we demonstrate that TimeJudge consistently yields substantial gains in terms of recall and F1-score without requiring any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs.
Key words: Video large language model (Video-LLM); Multimodal large language model (MLLM); MLLM-as-a-Judge; Video caption; Benchmark
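The abstract names the aggregation step (calibrated binary question pairs combined by consistency-weighted voting) but gives no formula. As a rough illustration only, the sketch below shows one way such a vote could be computed. The `PairAnswer` type, the scoring rule, the consistency weight, and the 0.5 threshold are all assumptions made for this example, not the formulation used in TimeJudge.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairAnswer:
    """Hypothetical container for one binary question pair.

    p_yes_pos: calibrated P("yes") for "does the caption's claim hold?"
    p_yes_neg: calibrated P("yes") for the negated form of the same question.
    """
    p_yes_pos: float
    p_yes_neg: float


def pair_vote(ans: PairAnswer) -> Tuple[float, float]:
    """Return (error_score, consistency_weight) for one pair (assumed rule)."""
    # Evidence of an error: the claim is denied and its negation is affirmed.
    error_score = 0.5 * ((1.0 - ans.p_yes_pos) + ans.p_yes_neg)
    # A coherent model answers the two forms in a complementary way,
    # so p_yes_pos + p_yes_neg should be close to 1.
    consistency = 1.0 - abs(ans.p_yes_pos + ans.p_yes_neg - 1.0)
    return error_score, consistency


def detect_temporal_error(answers: List[PairAnswer], threshold: float = 0.5) -> bool:
    """Flag a caption when the consistency-weighted mean error score exceeds threshold."""
    weighted_sum = 0.0
    total_weight = 0.0
    for ans in answers:
        score, weight = pair_vote(ans)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0.0:
        return False  # no usable votes, so nothing is flagged
    return weighted_sum / total_weight > threshold


if __name__ == "__main__":
    votes = [PairAnswer(0.1, 0.9), PairAnswer(0.9, 0.9)]
    print(detect_temporal_error(votes))
```

In this sketch, a pair whose two answers contradict each other (for example, "yes" to both the claim and its negation) receives a small weight, so it contributes little to the final decision.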
1Huazhong University of Science and Technology, Wuhan 430074, China
2La Trobe University, Melbourne 3086, Australia
DOI: 10.1631/FITEE.2500412
CLC number: TP391
On-line Access: 2026-01-08
Received: 2025-06-14
Revision Accepted: 2025-10-20
Crosschecked: 2026-01-08