Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
Abstract
A synthetic dataset generated in NVIDIA Omniverse supports training Vision-Language Models for Visual Perspective Taking by providing supervised examples for spatial reasoning tasks.
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability of embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4x4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full six-degree-of-freedom (6-DOF) reasoning. The dataset is publicly available to support further research. This work serves as an initial step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
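As a minimal sketch of the instance format described above, the snippet below shows how one sample (RGB image, natural language description, and 4x4 homogeneous pose matrix) might be represented, and how the Z-axis distance used as the supervision target could be read from the translation column. The field names and values are illustrative assumptions, not the actual schema of the released dataset.

```python
import numpy as np

# Illustrative dataset instance (field names are assumptions, not the real schema).
instance = {
    "image": np.zeros((480, 640, 3), dtype=np.uint8),  # placeholder RGB image
    "description": "A red cube placed 1.5 meters in front of the camera.",
    "pose": np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.5],
        [0.0, 0.0, 0.0, 1.0],
    ]),  # 4x4 homogeneous transform: 3x3 rotation block + translation in the last column
}

def z_distance(pose: np.ndarray) -> float:
    """Return the Z-axis translation component of a 4x4 homogeneous pose matrix."""
    return float(pose[2, 3])

print(z_distance(instance["pose"]))  # -> 1.5
```

Extending from this single scalar target to full 6-DOF reasoning would additionally require supervising the rotation block of the same matrix.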
Community
We're excited to share our short paper "Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds" with the community!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation (2025)
- Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation (2025)
- Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models (2025)
- AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning (2025)
- From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D (2025)
- UAV-VLN: End-to-End Vision Language guided Navigation for UAVs (2025)
- OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning (2025)