SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

Overview
We introduce SRPO (two-Staged history-Resampling Policy Optimization), a novel RL framework designed to systematically address large-scale multi-domain reasoning challenges. SRPO surpasses the performance of DeepSeek-R1-Zero-Qwen-32B on both the AIME24 and LiveCodeBench benchmarks while using only about 1/10 of its training steps.
Building upon Group Relative Policy Optimization (GRPO), SRPO introduces two key methodological innovations:
- A two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency
- History Resampling (HR), a technique to address ineffective samples and enhance training efficiency
Our approach demonstrates that with proper training methodology, the same base model (Qwen2.5-32B) can achieve superior performance across diverse domains without requiring extensive training resources.
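For context, GRPO-style methods score each rollout against the other rollouts sampled for the same prompt rather than against a learned value function. The following is a minimal sketch of that group-relative advantage computation; the function name and the binary 0/1 reward scheme are illustrative and not taken from the SRPO codebase.

import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt,
    giving the group-relative advantages used by GRPO-style objectives."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 5 rollouts of one question, reward 1 if the answer is correct, else 0.
print(group_relative_advantages([1, 0, 0, 1, 0]))  # positive for correct rollouts, negative otherwise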
Main Results


Figure: SRPO achieves superior results with only 10% of DeepSeek's training steps. The values shown are pass@1 scores, averaged over 32 samples per question.
Model | AIME24 (Pass@1) | LiveCodeBench (Pass@1) |
---|---|---|
DeepSeek-R1-Zero-Qwen-32B | 47.0 | 40.2 |
SRPO (Ours) | 50.0 | 41.6 |
Training Approach
Two-Stage Training Paradigm
To address the intrinsic response-length conflict between math and code, SRPO employs a two-stage training approach:
Stage 1 (Eliciting Reasoning Abilities): Initial training focuses solely on challenging mathematical data to encourage extended Chain-of-Thought (CoT) capabilities, including reflective thinking and step-by-step decomposition.
Stage 2 (Skill Integration): Once the reasoning foundation is established, coding data is introduced to develop programming proficiency while maintaining the reasoning capabilities from Stage 1.
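Concretely, the staging amounts to a change in the prompt-sampling distribution between the two phases. A minimal sketch, assuming a fixed math/code mixing ratio in Stage 2; the function name, pools, and ratio are illustrative and not taken from the paper.

import random

def sample_training_prompts(stage, math_pool, code_pool, batch_size=32, code_ratio=0.5):
    """Illustrative two-stage schedule: Stage 1 draws only math prompts to elicit
    long CoT reasoning; Stage 2 mixes in coding prompts at a fixed (hypothetical) ratio.
    Pools are assumed to be lists of prompt strings."""
    if stage == 1:
        return random.sample(math_pool, batch_size)
    n_code = int(batch_size * code_ratio)
    return random.sample(code_pool, n_code) + random.sample(math_pool, batch_size - n_code)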
History Resampling (HR)
SRPO introduces History Resampling to address ineffective samples that provide minimal gradient signals:
- Filter Out "Too Easy": Samples for which every rollout produces the correct answer are excluded, since they provide no informative contrastive signal
- Retain "Informative": Samples that yield either mixed outcomes or exclusively incorrect outcomes are retained, since they still produce effective gradient signals
This approach significantly improves computational efficiency and enhances the growth of response length during training.
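In pseudocode, the resampling reduces to a per-question filter over rollout outcomes recorded in the previous epoch. A minimal sketch with an illustrative data layout (question id mapped to binary correctness flags), not taken from the SRPO implementation:

def history_resample(rollout_history):
    """Keep only questions whose recorded rollouts are still informative:
    drop those where every rollout was correct ('too easy'); retain mixed
    or all-incorrect outcomes, which still yield non-zero advantages."""
    kept = {}
    for qid, outcomes in rollout_history.items():
        if outcomes and all(outcomes):  # all rollouts correct -> no contrastive signal
            continue
        kept[qid] = outcomes            # mixed or all-incorrect -> keep for the next epoch
    return kept

# Example: q1 is filtered out; q2 (mixed) and q3 (all incorrect) are retained.
print(history_resample({"q1": [1, 1, 1], "q2": [1, 0, 1], "q3": [0, 0, 0]}))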
Emerging Thinking Behaviors
During RL training, SRPO models gradually develop self-reflection, correction, and backtracking capabilities analogous to human cognitive processes:

Figure: Occurrence of various reasoning patterns during training.
A notable observed behavior is the model's spontaneous use of code to verify its mathematical solutions, demonstrating cross-domain skill integration and an advanced problem-solving strategy.
Inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Kwaipilot/SRPO-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# For math problems
math_prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Let $\mathcal{S}$ be the set of real numbers that can be represented as repeating decimals of the form $0.\overline{abc}$ where $a, b, c$ are distinct digits. Find the sum of the elements of $\mathcal{S}.$
Assistant: <think>
"""
# For coding problems
code_prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User:
You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests.
Question:You are given a string S of length N consisting of digits from 1 through 9.
For each pair of integers (i,j) \ (1\leq i\leq j\leq N), define f(i, j) as the value obtained by interpreting the substring of S from the i-th through the j-th character as a decimal integer. Find \displaystyle \sum_{i=1}^N \sum_{j=i}^N f(i, j).
Input
The input is given from Standard Input in the following format:
N
S
Output
Print the answer.
Constraints
- 1 \leq N \leq 2 \times 10^5
- N is an integer.
- S is a string of length N consisting of digits from 1 through 9.
Sample Input 1
3
379
Sample Output 1
514
The answer is f(1,1) + f(1,2) + f(1,3) + f(2,2) + f(2,3) + f(3,3) = 3 + 37 + 379 + 7 + 79 + 9 = 514.
Sample Input 2
30
314159265358979323846264338327
Sample Output 2
369673254065355789035427227741
Read the inputs from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows. Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT.
```python
# YOUR CODE HERE
```
Assistant:
<think>
"""
# Generate response
inputs = tokenizer(math_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=10240,
    do_sample=True,  # required for temperature/top_p sampling to take effect
    temperature=0.7,
    top_p=0.9
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
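Because the template asks the model to wrap its final result in <answer> tags, a small post-processing step can isolate the answer from the generated text. A minimal sketch (the helper name is ours, not part of the model's API):

import re

def extract_answer(generated_text):
    """Return the content of the <answer> ... </answer> tags produced by the
    prompt template above; fall back to the raw text if the tags are missing."""
    match = re.search(r"<answer>(.*?)</answer>", generated_text, re.DOTALL)
    return match.group(1).strip() if match else generated_text.strip()

print(extract_answer(response))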
Using with vLLM
import torch
from vllm import SamplingParams, LLM
model_name = "Kwaipilot/SRPO-Qwen-32B"
llm = LLM(
    model=model_name,
    dtype=torch.bfloat16,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.95
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=10240
)
prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Three spheres with radii $11,$ $13,$ and $19$ are mutually externally tangent. A plane intersects the spheres in three congruent circles centered at $A,$ $B,$ and $C,$ respectively, and the centers of the spheres all lie on the same side of this plane. Suppose that $AB^2 = 560.$ Find $AC^2.$
Assistant: <think>
"""
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)
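llm.generate also accepts a list of prompts, which is convenient for evaluation-style runs such as averaging pass@1 over several samples of the same question:

# Batched generation: pass several prompts (or repeated samples) in one call.
prompts = [prompt] * 4
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)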
Citation
@misc{zhang2025srpocrossdomainimplementation,
      title={SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM},
      author={Xiaojiang Zhang and Jinghui Wang and Zifei Cheng and Wenhao Zhuang and Zheng Lin and Minglei Zhang and Shaojie Wang and Yinghan Cui and Chao Wang and Junyi Peng and Shimiao Jiang and Shiqi Kuang and Shouyu Yin and Chaohang Wen and Haotian Zhang and Bin Chen and Bing Yu},
      year={2025},
      eprint={2504.14286},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.14286},
}