license: apache-2.0
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - gui
library_name: transformers

UI-TARS-1.5 Model

Introduction

UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model. It can effectively perform diverse tasks within virtual worlds.

Building on the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. The model reasons through its thoughts before taking action, which significantly improves its performance and adaptability, particularly under inference-time scaling. Version 1.5 achieves state-of-the-art results on a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.

Code: https://github.com/bytedance/UI-TARS

Application: https://github.com/bytedance/UI-TARS-desktop

Performance

Online Benchmark Evaluation

| Benchmark type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|---|
| Computer Use | OSworld (100 steps) | 42.5 | 36.4 | 28 | 38.1 (200 step) |
| | Windows Agent Arena (50 steps) | 42.1 | - | - | 29.8 |
| Browser Use | WebVoyager | 84.8 | 87 | 84.1 | 87 |
| | Online-Mind2web | 75.8 | 71 | 62.9 | 71 |
| Phone Use | Android World | 64.2 | - | - | 59.5 |
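As a reading aid, the margins over the previous state of the art can be computed directly from the table above. The following is a minimal sketch, not part of the UI-TARS codebase: the names `ONLINE_SCORES` and `margins_over_previous_sota` are hypothetical, and the numbers are copied from the table as published. Note that the OSworld comparison uses different step budgets (100 steps for UI-TARS-1.5, 200 for the previous SOTA), so that margin is not a like-for-like figure.

```python
# Scores copied from the online benchmark table: (UI-TARS-1.5, Previous SOTA).
# OSworld: UI-TARS-1.5 is reported at 100 steps, the previous SOTA at 200 steps.
ONLINE_SCORES = {
    "OSworld": (42.5, 38.1),
    "Windows Agent Arena": (42.1, 29.8),
    "WebVoyager": (84.8, 87.0),
    "Online-Mind2web": (75.8, 71.0),
    "Android World": (64.2, 59.5),
}


def margins_over_previous_sota(scores):
    """Return {benchmark: UI-TARS-1.5 score minus previous SOTA}, rounded to 0.1."""
    return {name: round(ours - prev, 1) for name, (ours, prev) in scores.items()}


margins = margins_over_previous_sota(ONLINE_SCORES)
for name, delta in margins.items():
    # A positive value means UI-TARS-1.5 is ahead of the previous SOTA.
    print(f"{name}: {delta:+.1f}")
```

Under these numbers, UI-TARS-1.5 leads on four of the five online benchmarks and trails the previous SOTA on WebVoyager.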

Grounding Capability Evaluation

| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|
| SeeClick-V2 | 94.2 | 87.9 | 87.6 | 91.6 |
| ScreenSpotPro | 61.6 | 23.4 | 27.7 | 43.6 |

Poki Game

| Model | 2048 | cubinko | energy | free-the-key | Gem-11 | hex-frvr | Infinity-Loop | Maze:Path-of-Light | shapes | snake-solver | wood-blocks-3d | yarn-untangle | laser-maze-puzzle | tiles-master |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI CUA | 31.04 | 0.00 | 32.80 | 0.00 | 46.27 | 92.25 | 23.08 | 35.00 | 52.18 | 42.86 | 2.02 | 44.56 | 80.00 | 78.27 |
| Claude 3.7 | 43.05 | 0.00 | 41.60 | 0.00 | 0.00 | 30.76 | 2.31 | 82.00 | 6.26 | 42.86 | 0.00 | 13.77 | 28.00 | 52.18 |
| UI-TARS-1.5 | 100.00 | 0.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |

Minecraft

| Task Type | Task Name | VPT | DreamerV3 | Previous SOTA | TARS w/o Thought | TARS w/ Thought |
|---|---|---|---|---|---|---|
| Mine Blocks | (log) | 0.8 | 1.0 | 0.5 | 1.0 | 1.0 |
| | (obsidian) | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
| | (bed) | 0.0 | 0.0 | 0.1 | 0.4 | 0.6 |
| | 200 Tasks Avg. | 0.06 | 0.03 | 0.32 | 0.35 | 0.42 |
| Kill Mobs | (evoker) | 0.0 | 0.0 | 0.1 | 0.3 | 0.4 |
| | (zombie) | 0.4 | 0.1 | 0.6 | 0.7 | 0.9 |
| | (pillager) | 0.1 | 0.0 | 0.1 | 0.5 | 0.6 |
| | 100 Tasks Avg. | 0.04 | 0.03 | 0.18 | 0.25 | 0.31 |
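The two average rows above give a rough sense of how much reasoning before acting contributes in Minecraft. The sketch below computes the absolute and relative gain of "TARS w/ Thought" over "TARS w/o Thought". It is only an illustration: `MINECRAFT_AVG` and `thought_gain` are hypothetical names, and the values are the averages copied from the table.

```python
# Average success rates copied from the Minecraft table.
MINECRAFT_AVG = {
    "Mine Blocks (200 tasks)": {"without_thought": 0.35, "with_thought": 0.42},
    "Kill Mobs (100 tasks)": {"without_thought": 0.25, "with_thought": 0.31},
}


def thought_gain(averages):
    """Return {task type: (absolute gain, relative gain in percent)}."""
    gains = {}
    for task_type, rates in averages.items():
        diff = rates["with_thought"] - rates["without_thought"]
        relative = 100.0 * diff / rates["without_thought"]
        gains[task_type] = (round(diff, 2), round(relative, 1))
    return gains


gains = thought_gain(MINECRAFT_AVG)
for task_type, (absolute, relative) in gains.items():
    print(f"{task_type}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
```

These figures summarize only the published averages; per-task differences vary, as the individual rows show.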

Citation

If you find our paper and model useful in your research, please cite our work.

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}