ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Abstract
ZeroGUI is an online learning framework that uses Vision-Language Models for task generation and reward estimation, enhancing GUI Agents' performance with minimal human intervention.
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
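Below is a minimal, hypothetical sketch of the online loop the abstract describes: VLM-proposed tasks, agent rollouts, VLM-judged rewards, and a policy update. All names (env, agent, vlm, propose_tasks, estimate_success, update) are illustrative assumptions, not the actual ZeroGUI API; see the repository linked above for the real implementation.

```python
# Hypothetical sketch of ZeroGUI's online training loop.
# Function and class names are illustrative only.

def zerogui_online_training(env, agent, vlm, num_rounds=10, tasks_per_round=32):
    """One training stage: generate tasks with a VLM, roll out the agent,
    score each trajectory with a VLM judge, then update the policy."""
    for _ in range(num_rounds):
        # (i) VLM-based automatic task generation from the current environment state
        screenshot = env.reset()
        tasks = vlm.propose_tasks(screenshot, n=tasks_per_round)

        trajectories = []
        for task in tasks:
            obs = env.reset()
            traj = agent.rollout(env, obs, instruction=task)
            # (ii) VLM-based automatic reward estimation, no hand-crafted evaluators
            traj.reward = vlm.estimate_success(task, traj.screenshots)
            trajectories.append(traj)

        # (iii) reinforcement-learning update on the collected trajectories
        agent.update(trajectories)
    return agent
```

In the paper's two-stage scheme, a loop of this form would first be run on generated training tasks and then again at test time on the target tasks for adaptation; the sketch above only illustrates the shape of a single stage.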
Community
ZeroGUI is a fully automated online reinforcement learning framework that enables GUI agents to train and adapt in interactive environments at zero human cost.
The following related papers were recommended by the Semantic Scholar API:
- UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning (2025)
- ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay (2025)
- Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning (2025)
- GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents (2025)
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (2025)
- A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning (2025)
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (2025)