VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
Abstract
VisTA, a reinforcement learning framework, enhances visual reasoning by autonomously selecting and combining tools from a diverse library without extensive human supervision.
We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely on either training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool-selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
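To make the GRPO-based training signal concrete, the sketch below shows how group-relative advantages can be computed from outcome rewards for a group of tool-selection rollouts. This is a minimal illustration, not the authors' implementation: the policy interface (`sample_tool_choices`, `logprob`), the `run_tools_and_answer` helper, and the binary correct/incorrect reward are illustrative assumptions.

```python
# Minimal GRPO-style sketch for tool selection (illustrative, not the paper's code).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of rollouts for the same query."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step(policy, optimizer, query, image, gold_answer,
              group_size=8, clip_eps=0.2):
    # 1) Sample a group of tool-selection rollouts for the same (query, image) pair.
    rollouts = [policy.sample_tool_choices(query, image) for _ in range(group_size)]

    # 2) Outcome reward: 1 if the downstream reasoner answers correctly, else 0.
    #    `run_tools_and_answer` is a hypothetical helper that runs the chosen tools
    #    and returns the final answer.
    rewards = torch.tensor(
        [float(run_tools_and_answer(query, image, r.tools) == gold_answer)
         for r in rollouts]
    )

    # 3) Group-relative advantages replace a learned value baseline.
    adv = grpo_advantages(rewards)

    # 4) PPO-style clipped surrogate on the tool-selection log-probabilities.
    old_logp = torch.stack([r.logprob.detach() for r in rollouts])
    new_logp = torch.stack([policy.logprob(query, image, r.tools) for r in rollouts])
    ratio = (new_logp - old_logp).exp()
    loss = -torch.min(ratio * adv,
                      ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the advantage of each rollout is measured relative to the other rollouts in its group, the agent needs only the final task outcome as feedback, with no explicit reasoning supervision or separate value network.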
Community
Project Page: https://oodbag.github.io/vista_web/
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning (2025)
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning (2025)
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (2025)
- VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use (2025)
- Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning (2025)
- Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model (2025)
- ToolRL: Reward is All Tool Learning Needs (2025)