arxiv:2505.19099

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Published on May 25
· Submitted by judge on May 28
AI-generated summary

SeePhys, a multimodal benchmark, highlights challenges in LLMs' visual reasoning and physics-grounded problem solving, especially in interpreting diagrams and overcoming reliance on textual cues.

Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works, where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-Pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
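To make the evaluation setup concrete, here is a minimal scoring sketch, not the authors' official harness: it assumes each record carries `question`, `image`, `answer`, and a boolean `vision_essential` field (all assumed names), and uses naive exact-match scoring so that accuracy can be broken out for the vision-essential subset the abstract highlights.

```python
from collections import defaultdict

def evaluate(records, model_fn):
    """Score a model on SeePhys-style records.

    records  : iterable of dicts; assumed fields `question`, `image`,
               `answer`, and a boolean `vision_essential` flag.
    model_fn : callable taking (question, image) and returning a string answer.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        pred = model_fn(rec["question"], rec["image"])
        # Naive exact-match scoring; a real harness would likely need a more
        # forgiving matcher for symbolic physics answers (units, equivalent forms).
        hit = pred.strip() == str(rec["answer"]).strip()
        subset = "vision_essential" if rec.get("vision_essential") else "text_sufficient"
        for key in (subset, "overall"):
            totals[key] += 1
            correct[key] += hit
    return {k: correct[k] / totals[k] for k in totals}
```

Comparing the `vision_essential` and `text_sufficient` accuracies from such a split is one way to probe the paper's central question of whether models genuinely use the diagram or fall back on textual cues.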

Community

Can AI truly see physics? Test your model with the newly released SeePhys benchmark!
Covering 2,000 vision-text multimodal physics problems spanning middle school to doctoral qualifying exams, the SeePhys benchmark systematically evaluates LLMs/MLLMs on tasks that integrate complex scientific diagrams with theoretical derivations. Experiments reveal that even SOTA models like Gemini-2.5-Pro and o4-mini achieve accuracy below 55%, with over 30% error rates on simple middle-school-level problems, highlighting significant challenges in multimodal reasoning.

The benchmark is now open for evaluation at the ICML 2025 AI for MATH Workshop. Academic and industrial teams are invited to test their models!

🔗 Key Links:
📜Paper: http://arxiv.org/abs/2505.19099
⚛️Project Page: https://seephys.github.io/
🏆Challenge Submission: https://www.codabench.org/competitions/7925/
➡️Competition Guidelines: https://sites.google.com/view/ai4mathworkshopicml2025/challenge

Please give a thumbs up to this project if you found it helpful!
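For teams planning to test a model, a minimal loading sketch using the Hugging Face `datasets` library follows; the repository id `SeePhys/SeePhys` is an assumption, so confirm the actual dataset location on the project page first.

```python
from datasets import load_dataset

# Repo id is an assumption; check the project page (https://seephys.github.io/)
# for the actual dataset location before relying on it.
ds = load_dataset("SeePhys/SeePhys")
print(ds)  # inspect available splits and record fields
```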




Models citing this paper: 0 · Datasets citing this paper: 1 · Spaces citing this paper: 0 · Collections including this paper: 1