RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
Abstract
A benchmark called RBench-V evaluates multi-modal models' vision-indispensable reasoning through image manipulation and auxiliary line construction, demonstrating that current models struggle with multi-modal outputs.
The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv
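The benchmark's evaluation protocol, as described above, reduces to answer-matching accuracy over 803 questions spanning math, physics, counting, and games. Below is a minimal, hypothetical sketch of such a scorer; the record schema (`id`, `category`, `answer`), the lenient string normalization, and the toy data are assumptions for illustration, not the authors' official grading code, which is available at the project page linked above.

```python
# Hypothetical scorer for RBench-V-style data (assumed schema, not the official code).
from collections import defaultdict

def normalize(ans: str) -> str:
    """Lowercase and strip whitespace/trailing periods for a lenient string match."""
    return ans.strip().strip(".").lower()

def score(records, predictions):
    """Compute overall and per-category accuracy.

    records:     list of dicts with "id", "category", "answer" keys (assumed schema)
    predictions: dict mapping question id -> model's final answer string
    """
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        cat = rec["category"]            # e.g. math, physics, counting, game
        total[cat] += 1
        pred = predictions.get(rec["id"], "")
        if normalize(pred) == normalize(rec["answer"]):
            correct[cat] += 1
    per_cat = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_cat

if __name__ == "__main__":
    # Toy example with two fabricated records, for illustration only.
    records = [
        {"id": "q1", "category": "math", "answer": "12"},
        {"id": "q2", "category": "counting", "answer": "7"},
    ]
    predictions = {"q1": "12.", "q2": "8"}
    overall, per_cat = score(records, predictions)
    print(f"overall accuracy: {overall:.1%}")
    for cat, acc in per_cat.items():
        print(f"  {cat}: {acc:.1%}")
```

Reported scores such as o3's 25.8% and the 82.3% human baseline correspond to the overall accuracy in this kind of tally, aggregated over all 803 questions.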
Community
We propose a benchmark specifically designed to evaluate o3-style reasoning, i.e., visual reasoning that requires multi-modal outputs, such as drawing auxiliary lines in geometry problems. Despite their capabilities, leading models such as o3 and Gemini 2.5 Pro achieve only 25.8% and 20.2% accuracy, respectively, while human performance reaches 82.3%. This stark contrast shows that even the strongest models still lag far behind human-level visual reasoning.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency (2025)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models (2025)
- R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM&MLLM Complex Reasoning Evaluation (2025)
- IQBench: How"Smart'' Are Vision-Language Models? A Study with Human IQ Tests (2025)
- Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation (2025)
- PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models (2025)
- GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning (2025)