Submitted by minghaowu · 53 · The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks · 10 authors · 2
Submitted by longlian · 43 · Describe Anything: Detailed Localized Image and Video Captioning · 11 authors · 3
Submitted by chenjoya · 19 · LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale · 6 authors · 2
Submitted by zhangysk · 18 · IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs · 20 authors · 2
Submitted by Neph0s · 17 · BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation · 6 authors · 2
Submitted by yueyang2000 · 15 · CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning · 6 authors · 2
Submitted by Kaiyue · 14 · Personalized Text-to-Image Generation with Auto-Regressive Models · 4 authors · 3
Submitted by thomasschmied · 13 · LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities · 5 authors · 2
Submitted by Zilence006 · 12 · Vidi: Large Multimodal Models for Video Understanding and Editing · 22 authors · 2
Submitted by sayakpaul · 8 · From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning · 9 authors · 2
Submitted by zhoutianyi · 8 · WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents · 7 authors · 4
Submitted by theFoxofSky · 6 · RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild · 8 authors · 2
Submitted by ziqipang · 4 · MR. Video: "MapReduce" is the Principle for Long Video Understanding · 2 authors · 2
Submitted by QiYao-Wang · 3 · IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property · 23 authors · 2
Submitted by j-min · 3 · CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting · 4 authors · 2
Submitted by yoyolicoris · 1 · DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions · 7 authors · 2