SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers Paper • 2407.09413 • Published Jul 12, 2024 • 11
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct Paper • 2409.05840 • Published Sep 9, 2024 • 48
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published Sep 19, 2024 • 48
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions Paper • 2410.10816 • Published Oct 14, 2024 • 21
Harnessing Webpage UIs for Text-Rich Visual Understanding Paper • 2410.13824 • Published Oct 17, 2024 • 31
EMMA: End-to-End Multimodal Model for Autonomous Driving Paper • 2410.23262 • Published Oct 30, 2024 • 2
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 22
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation Paper • 2411.08380 • Published Nov 13, 2024 • 25
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024 • 114
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper • 2411.14794 • Published Nov 22, 2024 • 13
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format Paper • 2411.17991 • Published Nov 27, 2024 • 5
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper • 2411.18499 • Published Nov 27, 2024 • 18
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024 • 26
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 47
CompCap: Improving Multimodal Large Language Models with Composite Captions Paper • 2412.05243 • Published Dec 6, 2024 • 19
Maya: An Instruction Finetuned Multilingual Multimodal Model Paper • 2412.07112 • Published Dec 10, 2024 • 27
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation Paper • 2412.07147 • Published Dec 10, 2024 • 5
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption Paper • 2412.09283 • Published Dec 12, 2024 • 19
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding Paper • 2412.17295 • Published Dec 23, 2024 • 9
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper • 2412.18319 • Published Dec 24, 2024 • 37
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1 • 100
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics Paper • 2501.04686 • Published Jan 8 • 50
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published Jan 13 • 50
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Paper • 2501.08326 • Published Jan 14 • 32
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published Feb 12 • 12
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm Paper • 2502.12513 • Published 27 days ago • 15
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions Paper • 2503.00501 • Published 16 days ago • 11
Unified Reward Model for Multimodal Understanding and Generation Paper • 2503.05236 • Published 10 days ago • 105
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning Paper • 2503.07002 • Published 7 days ago • 36
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published 6 days ago • 91
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing Paper • 2503.10639 • Published 3 days ago • 39
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning Paper • 2503.10291 • Published 4 days ago • 30
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published 3 days ago • 16