rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
Abstract
A large-scale dataset called rStar-Coder enhances code reasoning in LLMs by providing verified code problems and solutions, leading to improved performance on various benchmarks.
Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with the verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across diverse code reasoning benchmarks demonstrate the superiority of the rStar-Coder dataset, achieving performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
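The mutual verification mechanism for output labeling can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual pipeline: it assumes that several independently sampled candidate solutions are run on a synthesized input, and an output is accepted as the label only when enough candidates agree. Function names and the agreement threshold are hypothetical.

```python
from collections import Counter

def mutual_verify(candidate_solutions, test_input, min_agree=2):
    """Label a synthesized test input by majority agreement among
    independently sampled candidate solutions (illustrative sketch)."""
    outputs = []
    for solve in candidate_solutions:
        try:
            outputs.append(solve(test_input))
        except Exception:
            continue  # crashing candidates cast no vote
    if not outputs:
        return None
    # use repr() so structured outputs (lists, tuples) can be tallied
    tally = Counter(repr(o) for o in outputs)
    best_repr, votes = tally.most_common(1)[0]
    if votes >= min_agree:
        # return the concrete output object behind the winning repr
        return next(o for o in outputs if repr(o) == best_repr)
    return None  # no consensus: discard this test case

# usage: three candidates, two agree on the correct answer
candidates = [lambda x: sum(x), lambda x: sum(x), lambda x: sum(x) + 1]
label = mutual_verify(candidates, [1, 2, 3])  # majority output: 6
```

The key design point is that no single model output is trusted: a test case enters the dataset only when multiple independent solutions converge on the same output, which filters out inputs whose correct answer is ambiguous or hard for the models to compute.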
Community
We introduce rStar-Coder, which builds a large-scale competitive code dataset with diverse and scalable test cases, enabling our 14B model to achieve code reasoning performance comparable to QWQ-32B.
Thanks! We're currently going through the internal review process. Once that's completed, we'll release the dataset as soon as possible to help advance code reasoning research in the community.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning (2025)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs (2025)
- LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs (2025)
- CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation (2025)
- Let's Verify Math Questions Step by Step (2025)
- ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model (2025)
- CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation (2025)
Models citing this paper 0
Datasets citing this paper 1
Spaces citing this paper 0