Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
Abstract
A synthetic dataset generated in NVIDIA Omniverse supports training Vision-Language Models for Visual Perspective Taking by providing supervised examples for spatial reasoning tasks.
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability of embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4x4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full six-degree-of-freedom (6-DOF) reasoning. The dataset is publicly available to support further research. This work serves as an initial step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
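As a minimal sketch of the instance format described above, the snippet below shows how one sample (RGB image, natural language description, and 4x4 homogeneous pose matrix) might be represented, and how the Z-axis distance used as the supervision target could be read from the translation column. The field names and values are illustrative assumptions, not the actual schema of the released dataset.

```python
import numpy as np

# Illustrative dataset instance (field names are assumptions, not the real schema).
instance = {
    "image": np.zeros((480, 640, 3), dtype=np.uint8),  # placeholder RGB image
    "description": "A red cube placed 1.5 meters in front of the camera.",
    "pose": np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.5],
        [0.0, 0.0, 0.0, 1.0],
    ]),  # 4x4 homogeneous transform: 3x3 rotation block + translation in the last column
}

def z_distance(pose: np.ndarray) -> float:
    """Return the Z-axis translation component of a 4x4 homogeneous pose matrix."""
    return float(pose[2, 3])

print(z_distance(instance["pose"]))  # -> 1.5
```

Extending from this single scalar target to full 6-DOF reasoning would additionally require supervising the rotation block of the same matrix.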
Community
We're excited to share our short paper "Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds" with the community!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation (2025)
- Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation (2025)
- Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models (2025)
- AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning (2025)
- From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D (2025)
- UAV-VLN: End-to-End Vision Language guided Navigation for UAVs (2025)
- OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning (2025)