Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
Abstract
VBenchComp is an automated pipeline that categorizes video LLM benchmark questions into distinct domains, isolating temporal reasoning and exposing model weaknesses that overall scores hide.
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions rather than clearly isolating a model's temporal reasoning ability, the key capability that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when the video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The remaining questions are labeled Others. This categorization enables fine-grained evaluation of a video LLM's distinct capabilities. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
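To make the categorization concrete, here is a minimal Python sketch of the decision logic described above. The `query_llm_text_only` and `query_video_llm` callables, the single shuffled pass, and the multiple-choice setup are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch of a VBenchComp-style categorization, under the assumptions
# stated above. The query_* callables are hypothetical stand-ins for a
# text-only LLM and a video LLM, respectively.
import random

def categorize_question(question, choices, answer, frames,
                        query_llm_text_only, query_video_llm):
    """Assign a question to LLM-Answerable, Semantic, Temporal, or Others."""
    # 1) LLM-Answerable: a text-only LLM answers correctly without the video,
    #    i.e. the question is solvable from language priors alone.
    if query_llm_text_only(question, choices) == answer:
        return "LLM-Answerable"

    # 2) Semantic: the video LLM still answers correctly when the frame order
    #    is destroyed, so the correct temporal order is not actually needed.
    shuffled = random.sample(frames, k=len(frames))
    if query_video_llm(question, choices, shuffled) == answer:
        return "Semantic"

    # 3) Temporal: correct only when frames are shown in their original order.
    if query_video_llm(question, choices, frames) == answer:
        return "Temporal"

    # 4) Others: the model fails even with correctly ordered frames.
    return "Others"
```

Looping this categorization over every question in a benchmark yields a per-domain breakdown of what the benchmark actually measures, rather than a single overall score.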
Community
🚨🚨🚨 New video benchmarks are popping up every day — but are they truly evaluating video understanding? Can we develop a reliable protocol to evaluate the quality of these benchmarks themselves?
In our latest work, we identify two key issues with many existing benchmarks and propose our protocol (VBenchComp):
1️⃣ Language priors – models can often answer questions without even looking at the video.
2️⃣ Order insensitivity – questions can be answered without understanding the temporal sequence of frames.
These benchmarks are often too semantic and fail to test the core of what makes video understanding hard.
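As a rough illustration of such a protocol, the sketch below tallies per-question categories into a benchmark-level profile. The category labels follow the paper, while the function name and example data are assumptions for illustration.

```python
# Hedged sketch: given per-question labels from a VBenchComp-style
# categorization, report what fraction of a benchmark actually requires
# temporal understanding.
from collections import Counter

def benchmark_profile(categories):
    """Return the fraction of benchmark questions in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {label: counts[label] / total
            for label in ("LLM-Answerable", "Semantic", "Temporal", "Others")}

# A benchmark dominated by LLM-Answerable and Semantic questions mostly tests
# language priors and static perception rather than temporal reasoning.
print(benchmark_profile(
    ["LLM-Answerable", "Semantic", "Semantic", "Temporal", "Others"]))
```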
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models (2025)
- VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models (2025)
- TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos (2025)
- VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation (2025)
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? (2025)
- MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios (2025)
- VidText: Towards Comprehensive Evaluation for Video Text Understanding (2025)