VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
Abstract
VisTA, a reinforcement learning framework, enhances visual reasoning by autonomously selecting and combining tools from a diverse library without extensive human supervision.
We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning rely on either training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool-selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
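To make the GRPO-based training signal concrete, the sketch below shows how group-relative advantages can be computed from outcome rewards for a group of tool-selection rollouts. This is a minimal illustration, not the authors' implementation: the policy interface (`sample_tool_choices`, `logprob`), the `run_tools_and_answer` helper, and the binary correct/incorrect reward are illustrative assumptions.

```python
# Minimal GRPO-style sketch for tool selection (illustrative, not the paper's code).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of rollouts for the same query."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step(policy, optimizer, query, image, gold_answer,
              group_size=8, clip_eps=0.2):
    # 1) Sample a group of tool-selection rollouts for the same (query, image) pair.
    rollouts = [policy.sample_tool_choices(query, image) for _ in range(group_size)]

    # 2) Outcome reward: 1 if the downstream reasoner answers correctly, else 0.
    #    `run_tools_and_answer` is a hypothetical helper that runs the chosen tools
    #    and returns the final answer.
    rewards = torch.tensor(
        [float(run_tools_and_answer(query, image, r.tools) == gold_answer)
         for r in rollouts]
    )

    # 3) Group-relative advantages replace a learned value baseline.
    adv = grpo_advantages(rewards)

    # 4) PPO-style clipped surrogate on the tool-selection log-probabilities.
    old_logp = torch.stack([r.logprob.detach() for r in rollouts])
    new_logp = torch.stack([policy.logprob(query, image, r.tools) for r in rollouts])
    ratio = (new_logp - old_logp).exp()
    loss = -torch.min(ratio * adv,
                      ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the advantage of each rollout is measured relative to the other rollouts in its group, the agent needs only the final task outcome as feedback, with no explicit reasoning supervision or separate value network.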
Community
Project Page: https://oodbag.github.io/vista_web/
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning (2025)
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning (2025)
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (2025)
- VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use (2025)
- Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning (2025)
- Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model (2025)
- ToolRL: Reward is All Tool Learning Needs (2025)