Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
Abstract
VBenchComp is an automated pipeline that categorizes video LLM benchmark questions into distinct domains, isolating temporal reasoning and exposing model weaknesses that overall scores hide.
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions rather than clearly isolating a model's temporal reasoning ability, the key capability that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when the video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The remaining questions are labeled Others. This categorization enables fine-grained evaluation of a video LLM's distinct capabilities. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
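To make the categorization concrete, here is a minimal Python sketch of the decision logic described above. The `query_llm_text_only` and `query_video_llm` callables, the single shuffled pass, and the multiple-choice setup are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch of a VBenchComp-style categorization, under the assumptions
# stated above. The query_* callables are hypothetical stand-ins for a
# text-only LLM and a video LLM, respectively.
import random

def categorize_question(question, choices, answer, frames,
                        query_llm_text_only, query_video_llm):
    """Assign a question to LLM-Answerable, Semantic, Temporal, or Others."""
    # 1) LLM-Answerable: a text-only LLM answers correctly without the video,
    #    i.e. the question is solvable from language priors alone.
    if query_llm_text_only(question, choices) == answer:
        return "LLM-Answerable"

    # 2) Semantic: the video LLM still answers correctly when the frame order
    #    is destroyed, so the correct temporal order is not actually needed.
    shuffled = random.sample(frames, k=len(frames))
    if query_video_llm(question, choices, shuffled) == answer:
        return "Semantic"

    # 3) Temporal: correct only when frames are shown in their original order.
    if query_video_llm(question, choices, frames) == answer:
        return "Temporal"

    # 4) Others: the model fails even with correctly ordered frames.
    return "Others"
```

Looping this categorization over every question in a benchmark yields a per-domain breakdown of what the benchmark actually measures, rather than a single overall score.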
Community
🚨🚨🚨 New video benchmarks are popping up every day — but are they truly evaluating video understanding? Can we develop a reliable protocol to evaluate the quality of these benchmarks themselves?
In our latest work, we identify two key issues with many existing benchmarks and propose our protocol (VBenchComp):
1️⃣ Language priors – models can often answer questions without even looking at the video.
2️⃣ Order insensitivity – questions can be answered without understanding the temporal sequence of frames.
These benchmarks are often too semantic and fail to test the core of what makes video understanding hard.
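As a rough illustration of such a protocol, the sketch below tallies per-question categories into a benchmark-level profile. The category labels follow the paper, while the function name and example data are assumptions for illustration.

```python
# Hedged sketch: given per-question labels from a VBenchComp-style
# categorization, report what fraction of a benchmark actually requires
# temporal understanding.
from collections import Counter

def benchmark_profile(categories):
    """Return the fraction of benchmark questions in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {label: counts[label] / total
            for label in ("LLM-Answerable", "Semantic", "Temporal", "Others")}

# A benchmark dominated by LLM-Answerable and Semantic questions mostly tests
# language priors and static perception rather than temporal reasoning.
print(benchmark_profile(
    ["LLM-Answerable", "Semantic", "Semantic", "Temporal", "Others"]))
```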
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models (2025)
- VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models (2025)
- TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos (2025)
- VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation (2025)
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? (2025)
- MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios (2025)
- VidText: Towards Comprehensive Evaluation for Video Text Understanding (2025)