arxiv:2505.14366

Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

Published on May 20 · Submitted by jwgcurrie on May 21

Abstract

AI-generated summary: A synthetic dataset generated in NVIDIA Omniverse supports supervised training of Vision-Language Models on spatial reasoning tasks for Visual Perspective Taking.

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 degrees of freedom (6-DOF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
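To make the pose representation concrete, here is a minimal Python sketch of how a 4×4 homogeneous transformation matrix can encode an object pose and how a Z-axis distance could be read from it. This is an illustration under stated assumptions, not the authors' code: the helper names (`make_pose`, `z_distance`, `transform_point`) are hypothetical, and the sketch assumes a column-vector convention with the translation stored in the last column. The dataset's actual matrix layout, axis convention, and units are not specified here and may differ (for example, the matrix may be stored transposed).

```python
import math


def make_pose(yaw_rad, tx, ty, tz):
    """Build a 4x4 homogeneous transform (hypothetical helper).

    Assumption: column-vector convention, rotation about the Z axis in the
    upper-left 3x3 block, translation (tx, ty, tz) in the last column.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c,   -s,  0.0, tx],
        [s,    c,  0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def z_distance(pose):
    """Return the Z component of the translation (row 2, last column)."""
    return pose[2][3]


def transform_point(pose, point):
    """Apply the pose to a 3D point by extending it to homogeneous form."""
    vec = (point[0], point[1], point[2], 1.0)
    out = [sum(pose[r][k] * vec[k] for k in range(4)) for r in range(4)]
    return (out[0], out[1], out[2])


# Illustrative values only; they are not taken from the dataset.
pose = make_pose(math.pi / 2, 0.1, -0.2, 1.5)
target_z = z_distance(pose)
moved = transform_point(pose, (1.0, 0.0, 0.0))
```

Under these assumptions, a model trained on the Z-axis task would be supervised to predict `target_z` from the image and description, while the full matrix carries the rotation and remaining translation components that a 6-DOF extension would need.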

Community


We're excited to share our short paper "Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds" with the community!

