ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Abstract
A new benchmark, ViewSpatial-Bench, evaluates VLMs on multi-viewpoint spatial reasoning, revealing performance gaps that are mitigated with fine-tuning on 3D spatial datasets.
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically to evaluate multi-viewpoint spatial localization across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
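The abstract does not detail how the automated 3D annotation pipeline derives directional labels, but a minimal sketch helps illustrate why egocentric (camera) and allocentric (person-in-scene) labels can disagree for the same object. The sketch below assumes annotated 3D observer poses and object positions in a shared world frame; the function name `relative_direction`, the coarse four-way label set, the up-axis choice, and the example coordinates are all illustrative assumptions, not the paper's actual annotation scheme.

```python
import numpy as np

def relative_direction(observer_pos, observer_forward, target_pos,
                       up=np.array([0.0, 0.0, 1.0])):
    """Classify a target's horizontal direction (front/behind, left/right)
    relative to an observer's position and facing direction.

    All inputs are 3D points/vectors in the same world coordinate frame.
    The four-way labeling and dominance rule are illustrative assumptions.
    """
    forward = observer_forward / np.linalg.norm(observer_forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)

    offset = target_pos - observer_pos
    # Project the offset onto the observer's forward and right axes.
    f = float(np.dot(offset, forward))
    r = float(np.dot(offset, right))

    # Coarse directional label: whichever axis dominates wins.
    if abs(f) >= abs(r):
        return "front" if f >= 0 else "behind"
    return "right" if r >= 0 else "left"

# Same object, two frames of reference (hypothetical coordinates):
camera_pos, camera_forward = np.array([0.0, 0.0, 1.5]), np.array([0.0, 1.0, 0.0])
person_pos, person_forward = np.array([3.0, 1.0, 0.0]), np.array([-1.0, 0.0, 0.0])
obj_pos = np.array([1.5, 1.0, 0.5])

print(relative_direction(camera_pos, camera_forward, obj_pos))  # "right" (egocentric)
print(relative_direction(person_pos, person_forward, obj_pos))  # "front" (allocentric)
```

The two calls return different labels for the same object, which is exactly the egocentric-versus-allocentric gap the benchmark probes: a model answering from the camera's frame would be wrong when the question asks for the person's frame.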
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction (2025)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding (2025)
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models (2025)
- SITE: towards Spatial Intelligence Thorough Evaluation (2025)
- Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs (2025)
- RVTBench: A Benchmark for Visual Reasoning Tasks (2025)
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation (2025)