Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Abstract
Spatial-MLLM improves spatial reasoning in multimodal large language models using a dual-encoder architecture that pairs a pretrained 2D semantic encoder with a 3D structure encoder, achieving state-of-the-art performance on visual-based spatial tasks.
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs typically rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior of a feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder extracts semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, extracts 3D structure features. A connector then integrates both feature streams into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time that selects the most spatially informative frames of a video sequence, ensuring that the model focuses on frames critical for spatial reasoning even under a limited token budget. Beyond these architectural improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance on a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
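To make the dual-encoder idea in the abstract concrete, the minimal PyTorch sketch below shows how semantic features and 3D structure features could be fused into unified visual tokens. The module names, feature dimensions, and the simple MLP connector are illustrative assumptions rather than the authors' released implementation; linear stubs stand in for the pretrained CLIP-style visual encoder and the visual-geometry-model backbone.

```python
# Minimal sketch of a dual-encoder fusion, assuming a simple
# concatenate-then-MLP connector. All names and dimensions are
# hypothetical; they are not taken from the paper's code.
import torch
import torch.nn as nn


class DualEncoderFusion(nn.Module):
    """Fuse 2D semantic and 3D structure features into unified visual tokens."""

    def __init__(self, sem_dim=1024, spa_dim=768, token_dim=2048):
        super().__init__()
        # Placeholder encoders; in the real model these would be a pretrained
        # CLIP-based visual encoder and a visual geometry foundation backbone.
        self.semantic_encoder = nn.Linear(3 * 14 * 14, sem_dim)  # hypothetical patch embed
        self.spatial_encoder = nn.Linear(3 * 14 * 14, spa_dim)   # hypothetical patch embed
        # Connector: merges both feature streams into one token space.
        self.connector = nn.Sequential(
            nn.Linear(sem_dim + spa_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, 3*14*14) flattened image patches
        sem = self.semantic_encoder(patches)   # semantic features
        spa = self.spatial_encoder(patches)    # 3D structure features
        return self.connector(torch.cat([sem, spa], dim=-1))  # unified visual tokens


if __name__ == "__main__":
    model = DualEncoderFusion()
    dummy = torch.randn(1, 256, 3 * 14 * 14)  # one frame split into 256 patches
    tokens = model(dummy)
    print(tokens.shape)  # torch.Size([1, 256, 2048])
```

Concatenation followed by an MLP is only one plausible connector; cross-attention between the two token streams would be another natural design, and the paper itself should be consulted for the actual fusion mechanism.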
Community
Thank you for your work! I’m curious about the hardware setup you used, especially for inference. Could you share the type of GPUs (e.g., A100, RTX 6000, etc.) and memory requirements needed to reproduce your inference setup with 16-frame inputs? Just want to check the feasibility before trying it out locally. Appreciate any guidance!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts (2025)
- SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning (2025)
- Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving (2025)
- ResNetVLLM - Multi-modal Vision LLM for the Video Understanding Task (2025)
- NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving (2025)
- Zero-Shot 3D Visual Grounding from Vision-Language Models (2025)
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models (2025)