arxiv:2505.16770

RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Published on May 22 · Submitted by MenghaoGuo on May 26

AI-generated summary

A benchmark called RBench-V evaluates multi-modal models' vision-indispensable reasoning through image manipulation and auxiliary line construction, demonstrating that current models struggle with multi-modal outputs.

Abstract

The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning, while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, and Qwen2.5-VL. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv
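
For concreteness, the evaluation described above reduces to accuracy over the 803 curated questions. The sketch below is a minimal, hypothetical scoring harness, not the released RBench-V code: the file name rbenchv_questions.json, the field names image, prompt, and answer, and the model callable are assumptions for illustration only; see https://evalmodels.github.io/rbenchv for the official data and evaluation scripts.

```python
import json

def evaluate(model, questions_path="rbenchv_questions.json"):
    """Score a model on RBench-V-style questions with exact-match accuracy.

    Assumes a JSON file containing a list of records shaped like
    {"image": <path>, "prompt": <question text>, "answer": <ground truth>}.
    These field names are hypothetical, not the official RBench-V schema.
    """
    with open(questions_path) as f:
        questions = json.load(f)

    correct = 0
    for q in questions:
        # `model` is any callable mapping (image_path, prompt) -> answer string.
        prediction = model(q["image"], q["prompt"])
        if prediction.strip().lower() == q["answer"].strip().lower():
            correct += 1

    return correct / len(questions)
```

In practice, free-form answers from multi-modal models would likely need a more lenient matcher than the strict string comparison shown here (e.g., numeric tolerance or answer extraction before comparison).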

Community

Paper author · Paper submitter

We propose a benchmark specifically designed to evaluate o3-style reasoning: visual reasoning that requires multimodal outputs, such as drawing auxiliary lines in geometric problems. Despite their capabilities, leading models like o3 and Gemini 2.5 Pro achieve only 25.6% and 20.2% accuracy, respectively, while human performance reaches 82.3%. This stark contrast highlights that even the strongest models still lag far behind human-level visual reasoning.

