Submitted by jiuhai 51 BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset · 13 authors 3
Submitted by xiaomoguhzz 36 DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception · 6 authors 3
Submitted by nielsr 28 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures · 15 authors 3
Submitted by scikkk 24 MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning · 11 authors 1
Submitted by toshas 14 Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis · 8 authors 2
Submitted by HanjungKim 13 UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations · 6 authors 2
Submitted by akhaliq 7 CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image · 9 authors 3
Submitted by NadMag 6 LightLab: Controlling Light Sources in Images with Diffusion Models · 7 authors 3
Submitted by novateur 5 WavReward: Spoken Dialogue Models With Generalist Reward Evaluators · 14 authors 3
Submitted by pritamqu 4 VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models · 2 authors 2
Submitted by peihaowang 2 Steepest Descent Density Control for Compact 3D Gaussian Splatting · 11 authors 2
Submitted by kailassrt 2 DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition · 11 authors 2
Submitted by JadeCheng 1 Visually Interpretable Subtask Reasoning for Visual Question Answering · 3 authors 2
Submitted by kkr5155 1 Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA · 4 authors 2