InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Abstract
InstructPart, a new benchmark and task-oriented part segmentation dataset, is introduced to evaluate and improve the performance of Vision-Language Models in real-world contexts.
Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
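To make the task setup concrete, the sketch below shows how a task-oriented part segmentation prediction could be scored against a ground-truth binary mask with intersection-over-union (IoU), the standard metric for segmentation benchmarks. This is a minimal illustration under stated assumptions: the sample fields and the placeholder prediction are hypothetical and do not reflect the released InstructPart API or evaluation code.

```python
# Minimal sketch (assumptions): scoring a task-oriented part segmentation
# prediction against a ground-truth binary mask with IoU. The sample fields
# and placeholder prediction below are hypothetical, not the InstructPart API.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)

# Hypothetical sample: an image, a task-oriented instruction, and a part-level mask.
sample = {
    "image": np.zeros((480, 640, 3), dtype=np.uint8),    # RGB image
    "instruction": "Grab the mug by its handle.",         # task-oriented instruction
    "gt_mask": np.zeros((480, 640), dtype=bool),          # ground-truth part mask
}

# A VLM-based segmenter would map (image, instruction) -> binary part mask here;
# a zero mask stands in as a placeholder prediction for this sketch.
pred_mask = np.zeros((480, 640), dtype=bool)
print(f"IoU: {mask_iou(pred_mask, sample['gt_mask']):.3f}")
```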
Community
We introduce InstructPart, a real-world benchmark with part segmentation annotations and task-oriented instructions to evaluate and improve Vision-Language Models (VLMs) in understanding and executing part-level tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels (2025)
- Visual Intention Grounding for Egocentric Assistants (2025)
- RESAnything: Attribute Prompting for Arbitrary Referring Segmentation (2025)
- RoboGround: Robotic Manipulation with Grounded Vision-Language Priors (2025)
- Foundation Model-Driven Framework for Human-Object Interaction Prediction with Segmentation Mask Integration (2025)
- Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving (2025)
- GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation (2025)