LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 30, 2024 • 51
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 88
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Paper • 2501.11733 • Published Jan 20, 2025 • 29
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Paper • 2503.12533 • Published Mar 16, 2025 • 66
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks Paper • 2503.21696 • Published Mar 27, 2025 • 22
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning Paper • 2503.21620 • Published Mar 27, 2025 • 62
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions Paper • 2505.06111 • Published May 2025 • 24
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Paper • 2505.15517 • Published May 2025 • 4
Interactive Post-Training for Vision-Language-Action Models Paper • 2505.17016 • Published May 2025 • 6
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction Paper • 2505.10887 • Published May 2025 • 10
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 2025 • 86
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection Paper • 2505.20289 • Published May 2025 • 9
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Paper • 2505.23747 • Published May 2025 • 60