VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
Abstract
VideoReasonBench is a new benchmark for vision-centric, complex video reasoning; evaluations on it show that an extended thinking budget, which offers little benefit on existing video benchmarks, is crucial for strong performance.
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to showcase the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models must precisely recall multiple operations shown in the video and reason step by step to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs) and find that most perform poorly on complex video reasoning: GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little to no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
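To make the task setting concrete, here is a toy, hypothetical sketch of a latent state manipulated by a sequence of operations, together with the three question levels (recall, infer, predict). This example is invented for illustration and is not drawn from the actual benchmark data, which consists of videos rather than symbolic traces.

```python
# Toy, hypothetical illustration of the VideoReasonBench task setting.
# A latent state (here, which of three cups hides a coin) is revealed only
# at the start; a sequence of fine-grained operations then modifies it.

from typing import List, Tuple

def apply_swaps(initial_coin_pos: int, swaps: List[Tuple[int, int]]) -> int:
    """Track where the coin ends up after a sequence of cup swaps."""
    pos = initial_coin_pos
    for a, b in swaps:
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
    return pos

# The "video": the coin starts under cup 0 (visible), then 4 swaps are shown.
initial_pos = 0
swaps = [(0, 1), (1, 2), (0, 2), (1, 2)]

# Level 1 (recall): how many swap operations were shown?
print("recall :", len(swaps))                       # 4

# Level 2 (infer): which cup hides the coin at the end of the video?
print("infer  :", apply_swaps(initial_pos, swaps))  # step-by-step: 0->1->2->0->0

# Level 3 (predict): where would the coin be after one more swap (0, 2)?
print("predict:", apply_swaps(initial_pos, swaps + [(0, 2)]))
```

Answering the "infer" and "predict" questions requires replaying every operation in order, which is why the benchmark rewards long, step-by-step reasoning rather than shallow recognition.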
Community
Project Page: https://llyx97.github.io/video_reason_bench/
Arxiv: https://arxiv.org/pdf/2505.23359
Code: https://github.com/llyx97/video_reason_bench
Data: https://huggingface.co/datasets/lyx97/reasoning_videos
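For readers who want to inspect the data, a minimal loading sketch is below. The split name and field layout are assumptions made for illustration; the dataset card linked above is authoritative.

```python
# Minimal sketch for browsing the VideoReasonBench data from the Hub.
# Assumptions: default configuration and a "test" split -- verify the actual
# split and column names against the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("lyx97/reasoning_videos", split="test")  # split name is an assumption
print(ds)                                    # inspect the available features/columns
example = ds[0]
print({k: example[k] for k in list(example)[:5]})  # peek at the first few fields
```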
nice work 👍
Awesome work! Now, let’s put your MLLMs to the test—see if they can ace this benchmark with zero prior training! 🌟 Think of it like sending your AI buddy into a trivia contest blindfolded… but with a secret weapon of pure neural-network magic. 🧠✨ Will it stumble like a sleepy penguin or nail it like a pro-level gamer? Place your bets, folks—it’s showtime! 🎉 (Pro tip: Grab some popcorn—this might get more thrilling than a cat vs. laser pointer show!)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning (2025)
- VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation (2025)
- IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs (2025)
- Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark (2025)
- MINERVA: Evaluating Complex Video Reasoning (2025)
- VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro (2025)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models (2025)