PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Abstract
A new benchmark, PhyX, evaluates models' physics-grounded reasoning in visual scenarios, revealing significant limitations in current models' physical understanding compared to human experts.
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
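The abstract describes a reproducible, toolkit-compatible evaluation protocol and reports accuracy per model. The snippet below is a minimal sketch of how rule-based multiple-choice scoring for a PhyX-style benchmark could look, assuming a hypothetical JSONL predictions file whose records carry `domain`, `answer` (ground-truth option letter), and `prediction` (raw model output); the file name and field names are illustrative assumptions, not the authors' implementation or the VLMEvalKit schema.

```python
import json
import re
from collections import defaultdict

# Sketch of rule-based multiple-choice scoring for a PhyX-style benchmark.
# The file name and record fields are assumptions made for illustration.

OPTION_PATTERN = re.compile(r"\b([A-F])\b")  # options assumed to be letters A-F


def extract_choice(prediction: str) -> str | None:
    """Pull the first standalone option letter out of a model's raw answer text."""
    match = OPTION_PATTERN.search(prediction.upper())
    return match.group(1) if match else None


def evaluate(path: str) -> None:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)   # one question per line (JSONL)
            domain = record["domain"]   # e.g. "mechanics", "optics"
            total[domain] += 1
            if extract_choice(record["prediction"]) == record["answer"]:
                correct[domain] += 1
    for domain in sorted(total):
        acc = 100.0 * correct[domain] / total[domain]
        print(f"{domain:>15}: {acc:5.1f}%  ({correct[domain]}/{total[domain]})")
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    print(f"{'overall':>15}: {overall:5.1f}%")


if __name__ == "__main__":
    evaluate("phyx_predictions.jsonl")  # hypothetical predictions file
```

The paper's actual one-click evaluation is built on existing toolkits such as VLMEvalKit; the sketch above only illustrates the kind of answer-matching accuracy computation that such protocols typically rely on.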
Community
This paper introduces PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge (2025)
- PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models (2025)
- IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests (2025)
- MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models (2025)
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models (2025)
- VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models (2025)
- Evaluating the Logical Reasoning Abilities of Large Reasoning Models (2025)