arXiv:2504.14945

Learning to Reason under Off-Policy Guidance

Published on Apr 21
· Submitted by Elliott on Apr 22
#1 Paper of the day

Abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.

Community

Paper author · Paper submitter

LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces policy shaping via regularized importance sampling to emphasize low-probability yet crucial actions.
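To make the shaping idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation). It mixes on-policy rollouts and off-policy demonstration traces in one GRPO-style group for advantage estimation, then applies a shaping transform f(p) = p / (p + γ) to the current-policy token probabilities of the off-policy traces so that low-probability yet crucial tokens keep a meaningful gradient. The shaping form, the value of γ, the function names, and the tensor shapes are illustrative assumptions, not details from the paper.

```python
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style advantage: normalize rewards within the mixed group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def policy_shaping(probs: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    # Regularized weight f(p) = p / (p + gamma): stays non-negligible even
    # when the current policy assigns low probability to a demonstrated token.
    return probs / (probs + gamma)

def mixed_policy_loss(logp_on, logp_off, rewards_on, rewards_off, gamma=0.1):
    """logp_*: (num_seqs, seq_len) token log-probs under the current policy.
    rewards_*: (num_seqs,) rule-based rewards for rollouts / demonstrations."""
    # Advantages come from the combined group of on- and off-policy traces.
    adv = group_advantages(torch.cat([rewards_on, rewards_off]))
    adv_on, adv_off = adv[: len(rewards_on)], adv[len(rewards_on):]

    # On-policy term: ordinary policy-gradient objective.
    loss_on = -(adv_on.unsqueeze(1) * logp_on).mean()

    # Off-policy term: shaped probabilities keep gradient flowing to
    # low-probability but crucial tokens copied from the demonstrations.
    shaped = policy_shaping(logp_off.exp(), gamma)
    loss_off = -(adv_off.unsqueeze(1) * shaped).mean()
    return loss_on + loss_off

# Toy usage: 4 on-policy rollouts and 2 off-policy traces, 16 tokens each.
logp_on = torch.randn(4, 16).clamp(max=0.0) - 1.0   # fake per-token log-probs
logp_off = torch.randn(2, 16).clamp(max=0.0) - 1.0
loss = mixed_policy_loss(logp_on, logp_off,
                         rewards_on=torch.tensor([1.0, 0.0, 0.0, 1.0]),
                         rewards_off=torch.tensor([1.0, 1.0]))
print(loss)
```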

Fantastic paper! Your work on LUFFY is very interesting.

Few papers include pass@k metrics to evaluate the exploration capabilities of RL-trained models, so it's great to see your promising results in this area!

Also, I was wondering if you have compared the performance of the LUFFY-trained model against the original base model using higher values of k (like pass@256 or even pass@1024)? It would be fascinating to see whether the improvements from off-policy RL training extend to these higher-k exploration scenarios, potentially showing even larger gains over the base model.
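For reference, pass@k in such comparisons is typically computed with the standard unbiased estimator, pass@k = 1 - C(n - c, k) / C(n, k), over n sampled solutions of which c are correct. A minimal sketch (the sample counts below are illustrative only):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1024 samples per problem, 12 correct, evaluating pass@256.
print(pass_at_k(n=1024, c=12, k=256))
```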

Thanks for the great work!

Paper author

Thanks for your insightful question!

We haven't tried such high values of k yet. We noticed that a recent paper (https://www.arxiv.org/pdf/2504.13837) claimed that on-policy RL limits exploration, whereas SFT genuinely introduces new knowledge. It will be interesting to see whether LUFFY can preserve exploration at such high values of k the way SFT does, and we plan to add these experiments.

Models citing this paper 2

Datasets citing this paper 1

Collections including this paper 4