Object-Centric Representations Improve Policy Generalization in Robot Manipulation
Abstract
Object-centric representations (OCR) yield stronger policy generalization in robotic manipulation than global or dense visual features across varied visual conditions.
Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. While existing methods rely on global or dense features, such representations often entangle task-relevant and irrelevant scene information, limiting robustness under distribution shifts. In this work, we investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finite set of entities, introducing inductive biases that align more naturally with manipulation tasks. We benchmark a range of visual encoders (object-centric, global, and dense) across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and evaluate their generalization under diverse visual conditions, including changes in lighting, texture, and the presence of distractors. Our findings reveal that OCR-based policies outperform those built on dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that OCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
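To make the idea concrete, below is a minimal sketch (not the paper's code, which has not yet been released) of how an object-centric encoder can feed a manipulation policy. It uses a Slot Attention module in the style of Locatello et al. (2020), which pools dense image features into a fixed set of slot vectors; all module names, dimensions, and the 7-dimensional action head are illustrative assumptions.

```python
# Illustrative sketch only: a Slot Attention encoder (after Locatello et al., 2020)
# feeding a simple policy head. Hyperparameters and the action space are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots=6, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.dim, self.iters = num_slots, dim, iters
        self.scale = dim ** -0.5
        # Slots are initialized from a learned Gaussian prior.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, feats):  # feats: (B, N, dim) flattened dense features
        b = feats.shape[0]
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, self.dim, device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: input locations compete for slots,
            # which is what encourages a decomposition into distinct entities.
            attn = F.softmax(torch.einsum('bsd,bnd->bsn', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over inputs
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, self.dim),
                             slots.reshape(-1, self.dim)).view(b, self.num_slots, self.dim)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, num_slots, dim): a finite set of entity vectors

# A policy head can consume the slot set; flattening is the simplest option
# (it sacrifices permutation invariance, which a transformer head would keep).
encoder = SlotAttention()
policy = nn.Sequential(nn.Flatten(1), nn.Linear(6 * 64, 128), nn.ReLU(), nn.Linear(128, 7))
dense_feats = torch.randn(2, 196, 64)   # e.g. a 14x14 CNN/ViT feature map, flattened
actions = policy(encoder(dense_feats))  # (2, 7), e.g. end-effector deltas + gripper
```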
Community
Code will be made available soon.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness (2025)
- RoboGround: Robotic Manipulation with Grounded Vision-Language Priors (2025)
- ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow (2025)
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions (2025)
- RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics (2025)
- MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos (2025)
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks (2025)