RM-R1: Reward Modeling as Reasoning
RM-R1 is a training framework for Reasoning Reward Models (ReasRMs): given two candidate answers, the model first thinks out loud, generating an evaluation rubric or reasoning trace, and only then emits its preference.
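A minimal inference sketch of this judge-by-reasoning pattern, assuming a chat-style checkpoint loaded with `transformers`; the repository id, prompt wording, and `[[A]]`/`[[B]]` verdict tag below are illustrative placeholders rather than the exact released prompt template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "RM-R1-7B"  # placeholder: substitute a released checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "What is the capital of Australia?"
answer_a = "Sydney."
answer_b = "Canberra."

# Ask the model to critique both candidates first, then state a preference;
# RM-R1 produces its reasoning trace before the final verdict.
messages = [
    {
        "role": "user",
        "content": (
            f"Question: {question}\n\n"
            f"Answer A: {answer_a}\n\n"
            f"Answer B: {answer_b}\n\n"
            "Evaluate both answers step by step, then give a final verdict "
            "of the form [[A]] or [[B]]."
        ),
    }
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
critique = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
print(critique)  # reasoning trace followed by the [[A]]/[[B]] verdict
```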
Compared with prior scalar or vanilla generative reward models, RM-R1 delivers up to +13.8% absolute accuracy gains on public reward-model benchmarks while producing fully interpretable critiques.
Two-stage training: (1) distillation of high-quality reasoning chains, then (2) reinforcement learning with verifiable rewards (see the reward sketch below).
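For the second stage, a rough sketch of what a verifiable reward can look like, assuming the verdict is parsed from the `[[A]]`/`[[B]]` tag used in the inference example above; the parsing and reward shaping in the actual training code may differ:

```python
import re


def verifiable_reward(completion: str, gold_label: str) -> float:
    """Binary correctness reward for RL training.

    Returns +1.0 if the model's final parsed verdict matches the
    ground-truth preference (`gold_label` is "A" or "B"), -1.0 otherwise.
    The [[A]]/[[B]] tag format is an assumption carried over from the
    inference sketch above.
    """
    matches = re.findall(r"\[\[([AB])\]\]", completion)
    if not matches:
        return -1.0  # malformed output: no parseable verdict
    # Take the last tag, since the reasoning trace may mention candidates
    # before committing to a final verdict.
    return 1.0 if matches[-1] == gold_label else -1.0
```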
Backbones released: 7B / 14B / 32B Qwen-2.5-Instruct variants plus DeepSeek-distilled checkpoints.