
CLC number: TP391
On-line Access: 2026-01-08
Received: 2025-06-14
Revision Accepted: 2025-10-20
Crosschecked: 2026-01-08
Yangliu HU, Zikai SONG, Junqing YU, Yiping Phoebe CHEN, Wei YANG. TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(11): 2204-2214.
@article{hu2025timejudge,
title="TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions",
author="Yangliu HU and Zikai SONG and Junqing YU and Yiping Phoebe CHEN and Wei YANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="11",
pages="2204-2214",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500412"
}
%0 Journal Article
%T TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
%A Yangliu HU
%A Zikai SONG
%A Junqing YU
%A Yiping Phoebe CHEN
%A Wei YANG
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 11
%P 2204-2214
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2500412
TY - JOUR
T1 - TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
A1 - Yangliu HU
A1 - Zikai SONG
A1 - Junqing YU
A1 - Yiping Phoebe CHEN
A1 - Wei YANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 11
SP - 2204
EP - 2214
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500412
ER -
Abstract: Video large language models (video-LLMs) have demonstrated impressive multimodal understanding, but their potential as zero-shot evaluators of temporal consistency in video captions remains underexplored. Existing methods notably underperform in detecting critical temporal errors such as missing, hallucinated, or misordered actions. To address this gap, we introduce two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs; it incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark of videos spanning four distinct complexity levels, designed with fine-grained temporal error annotations to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we show that TimeJudge consistently yields substantial gains in recall and F1-score without any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs.
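To make the aggregation idea concrete, the following is a minimal Python sketch of consistency-weighted voting over calibrated binary question pairs, as described in the abstract. The abstract does not give the paper's exact calibration or weighting formulas, so this is an assumed illustration only: the names BinaryAnswer, pair_score, and has_temporal_error, the question-pair semantics, and the 0.5 penalty for self-contradictory pairs are hypothetical, not taken from the paper.

from dataclasses import dataclass

@dataclass
class BinaryAnswer:
    """One yes/no judgment from a video-LLM, with calibrated confidence."""
    says_yes: bool     # the model's verdict on the question
    confidence: float  # calibrated probability of that verdict, in [0, 1]

def pair_score(pos: BinaryAnswer, neg: BinaryAnswer) -> tuple[float, float]:
    """Score one binary question pair about the same caption event.
    `pos` is phrased so that "yes" means the caption is temporally
    consistent; `neg` is its negation, so "yes" means an error.
    Returns (probability the event is consistent, consistency weight)."""
    # Map both answers onto "probability the caption is consistent".
    p_pos = pos.confidence if pos.says_yes else 1.0 - pos.confidence
    p_neg = 1.0 - neg.confidence if neg.says_yes else neg.confidence
    # A self-contradictory pair (both "yes" or both "no") is down-weighted;
    # the 0.5 penalty is an assumed choice, not the paper's.
    agree = pos.says_yes != neg.says_yes
    weight = 1.0 if agree else 0.5
    return (p_pos + p_neg) / 2.0, weight

def has_temporal_error(pairs: list[tuple[BinaryAnswer, BinaryAnswer]]) -> bool:
    """Aggregate all question pairs for a caption and flag a temporal
    error when the weighted vote for consistency drops below 0.5."""
    if not pairs:
        return False
    scores = [pair_score(pos, neg) for pos, neg in pairs]
    num = sum(w * p for p, w in scores)
    den = sum(w for _, w in scores)
    return num / den < 0.5

# Example: one consistent pair ("yes"/"no") and one contradictory pair.
pairs = [
    (BinaryAnswer(True, 0.9), BinaryAnswer(False, 0.8)),   # answers agree
    (BinaryAnswer(False, 0.7), BinaryAnswer(False, 0.6)),  # contradicts itself
]
print(has_temporal_error(pairs))  # False: the weighted vote favors consistency

In this sketch, each caption event is probed with a positively phrased question and its negation; pairs whose two answers contradict each other contribute less before the calibrated confidences are averaged into a final vote.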