Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Abstract
A two-stage approach combining supervised fine-tuning and reinforcement learning enhances multimodal reasoning in large language models, achieving state-of-the-art performance on benchmarks.
Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns, where models exhibit self-correction through reflection, are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
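To make the two-stage recipe concrete, here is a minimal, illustrative sketch assuming the Hugging Face TRL library. The dataset names, reward function, base model, and hyperparameters are placeholders rather than the paper's actual configuration, and the sketch uses TRL's text-only interface for brevity; the authors' full implementation is in the linked repository.

```python
# Illustrative two-stage pipeline: SFT cold start, then GRPO.
# All dataset names, the reward function, and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig, GRPOTrainer, GRPOConfig

MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder base model; the paper studies 3B and 7B scales

# Stage 1: supervised fine-tuning on structured chain-of-thought traces (cold start).
cot_data = load_dataset("your-org/cot-sft-data", split="train")  # placeholder dataset
sft_trainer = SFTTrainer(
    model=MODEL,
    train_dataset=cot_data,
    args=SFTConfig(output_dir="sft-cold-start", num_train_epochs=1),
)
sft_trainer.train()

# Stage 2: GRPO on verifiable reasoning problems, starting from the SFT checkpoint.
def accuracy_reward(completions, answer, **kwargs):
    # Placeholder verifier: reward 1.0 if the reference answer string appears in the completion.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

rl_data = load_dataset("your-org/rl-prompts", split="train")  # placeholder dataset with "prompt" and "answer" columns
grpo_trainer = GRPOTrainer(
    model="sft-cold-start",        # continue from the stage-1 checkpoint
    reward_funcs=accuracy_reward,
    train_dataset=rl_data,
    args=GRPOConfig(output_dir="sft-then-grpo", num_generations=8),
)
grpo_trainer.train()
```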
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (2025)
- SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning (2025)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- General-Reasoner: Advancing LLM Reasoning Across All Domains (2025)
- R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO (2025)
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (2025)
- How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study (2025)