Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Paper • 2504.17192 • Published Apr 24, 2025 • 105
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models Paper • 2411.04996 • Published Nov 7, 2024 • 52
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents Paper • 2410.10594 • Published Oct 14, 2024 • 27
UI Agent Collection: a collection of algorithmic agents for user interfaces/interactions, program synthesis, and robotics • 359 items • Updated 2 days ago • 52
GUICourse: From General Vision Language Models to Versatile GUI Agents Paper • 2406.11317 • Published Jun 17, 2024 • 1
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18, 2024 • 17
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26, 2024 • 30
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024 • 24
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation Paper • 2405.14598 • Published May 23, 2024 • 14
RoHM: Robust Human Motion Reconstruction via Diffusion Paper • 2401.08570 • Published Jan 16, 2024 • 1
MultiBooth: Towards Generating All Your Concepts in an Image from Text Paper • 2404.14239 • Published Apr 22, 2024 • 9
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published May 16, 2024 • 131