TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action Paper • 2505.01583 • Published 7 days ago • 9
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers Paper • 2504.10483 • Published 25 days ago • 21
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think Paper • 2502.20172 • Published Feb 27 • 28
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Paper • 2501.17161 • Published Jan 28 • 121
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps Paper • 2501.09732 • Published Jan 16 • 72
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Paper • 2501.05452 • Published Jan 9 • 15
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper • 2412.14171 • Published Dec 18, 2024 • 24
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18, 2024 • 19
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • 2410.03051 • Published Oct 4, 2024 • 6
Chasing Consistency in Text-to-3D Generation from a Single Image Paper • 2309.03599 • Published Sep 7, 2023 • 1
RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark Paper • 2407.13930 • Published Jul 18, 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 61
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? Paper • 2406.07546 • Published Jun 11, 2024 • 9