ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Abstract
ZeroGUI is an online learning framework that uses Vision-Language Models for task generation and reward estimation, enhancing GUI Agents' performance with minimal human intervention.
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
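Below is a minimal, hypothetical sketch of the online loop the abstract describes: VLM-proposed tasks, agent rollouts, VLM-judged rewards, and a policy update. All names (env, agent, vlm, propose_tasks, estimate_success, update) are illustrative assumptions, not the actual ZeroGUI API; see the repository linked above for the real implementation.

```python
# Hypothetical sketch of ZeroGUI's online training loop.
# Function and class names are illustrative only.

def zerogui_online_training(env, agent, vlm, num_rounds=10, tasks_per_round=32):
    """One training stage: generate tasks with a VLM, roll out the agent,
    score each trajectory with a VLM judge, then update the policy."""
    for _ in range(num_rounds):
        # (i) VLM-based automatic task generation from the current environment state
        screenshot = env.reset()
        tasks = vlm.propose_tasks(screenshot, n=tasks_per_round)

        trajectories = []
        for task in tasks:
            obs = env.reset()
            traj = agent.rollout(env, obs, instruction=task)
            # (ii) VLM-based automatic reward estimation, no hand-crafted evaluators
            traj.reward = vlm.estimate_success(task, traj.screenshots)
            trajectories.append(traj)

        # (iii) reinforcement-learning update on the collected trajectories
        agent.update(trajectories)
    return agent
```

In the paper's two-stage scheme, a loop of this form would first be run on generated training tasks and then again at test time on the target tasks for adaptation; the sketch above only illustrates the shape of a single stage.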
Community
ZeroGUI is a fully automated online reinforcement learning framework that enables GUI agents to train and adapt in interactive environments at zero human cost.
The following related papers were recommended by the Semantic Scholar API:
- UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning (2025)
- ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay (2025)
- Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning (2025)
- GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents (2025)
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (2025)
- A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning (2025)
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (2025)