Abstract
A generalist robot should perform effectively across various environments. However, most existing approaches rely heavily on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To address these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of the pretraining compute and 1/10 of the downstream data. Performance continues to improve as heterogeneous data, including human videos, are incorporated into the training pipeline. These results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
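To make the latent action idea concrete, below is a minimal sketch of how a latent action model over DINO features could be structured: an inverse-dynamics head infers a discrete latent action that explains the transition between two consecutive frames' features, and a forward model is trained to reconstruct the next frame's features from that action. This is an illustrative assumption of the general recipe, not the paper's exact architecture; module names, dimensions, the codebook size, and the omission of language conditioning are all simplifications.

```python
# Illustrative sketch only: a VQ-style latent action model over DINO features.
# Dimensions, codebook size, and loss weights are assumptions, not UniVLA's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionModel(nn.Module):
    """Infers a discrete latent action from consecutive frame features (inverse
    dynamics) and trains it by predicting the next frame's features (forward dynamics)."""

    def __init__(self, feat_dim=768, codebook_size=16, code_dim=128):
        super().__init__()
        # Inverse dynamics: (feat_t, feat_t+1) -> continuous latent action
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(), nn.Linear(512, code_dim)
        )
        # Small discrete codebook of latent actions (vector quantization)
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Forward dynamics: (feat_t, quantized action) -> predicted feat_t+1
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + code_dim, 512), nn.GELU(), nn.Linear(512, feat_dim)
        )

    def quantize(self, z):
        # Nearest codebook entry, with a straight-through estimator for gradients.
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        idx = dists.argmin(dim=-1)                     # (B,)
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()
        return z_q, idx

    def forward(self, feat_t, feat_t1):
        # feat_t, feat_t1: DINO features of frames t and t+1, shape (B, feat_dim)
        z = self.inverse(torch.cat([feat_t, feat_t1], dim=-1))
        z_q, idx = self.quantize(z)
        pred_t1 = self.forward_model(torch.cat([feat_t, z_q], dim=-1))
        recon = F.mse_loss(pred_t1, feat_t1)
        codebook_loss = F.mse_loss(self.codebook(idx), z.detach())
        commit_loss = F.mse_loss(z, self.codebook(idx).detach())
        # idx serves as the pseudo action label for downstream policy pretraining.
        return recon + codebook_loss + 0.25 * commit_loss, idx
```

In this sketch, the discrete indices produced by `quantize` play the role of action annotations for videos that have none, which is what lets a policy be pretrained on cross-embodiment data before a lightweight decoder maps latent actions to each robot's control space.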
Community
🔥 Highlights
- Learn from any source, act anywhere.
- Extract highly transferable, task-centric latent actions from cross-embodiment videos.
- Handle both manipulation and navigation with compute-efficient training.
The following similar papers were recommended by the Semantic Scholar API:
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (2025)
- ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow (2025)
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks (2025)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models (2025)
- CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations (2025)
- Vision-Language-Action Models: Concepts, Progress, Applications and Challenges (2025)
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets (2025)