Democratizing Reinforcement Learning for LLMs
</div>
</div>
<br>
<div align="center" style="line-height: 1;">
  <a href="https://github.com/agentica-project/deepscaler" target="_blank" style="margin: 2px;">
    <img alt="Code" src="https://img.shields.io/badge/DeepScaleR-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://www.notion.so/DeepScaleR-Scaling-R1-Models-with-Reinforcement-Learning-1891e65ddc7f80ad8cc6dbe0069a66fa?pvs=4" target="_blank" style="margin: 2px;">
    <img alt="Blog" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://x.com/Agentica_" target="_blank" style="margin: 2px;">
    <img alt="X" src="https://img.shields.io/badge/Agentica-white?style=for-the-badge&logo=X&logoColor=000&color=000&labelColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/agentica-org" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/Agentica-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>
</div>
</div>

## DeepScaleR Overview
DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, a 14.3-point absolute improvement over the base model (28.8%), and surpasses OpenAI's O1-Preview with just 1.5B parameters.

## Data
Our training dataset consists of approximately 40,000 unique problem-answer pairs compiled from:
- AIME problems (1984-2023)
- AMC problems (prior to 2023)
- Omni-MATH dataset
- Still dataset

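For a quick look at the pairs, they can be inspected with the Hugging Face `datasets` library. The dataset path and the `problem`/`answer` field names in this sketch are illustrative assumptions, not a guaranteed published schema.

```python
# Minimal sketch: loading problem-answer pairs with the `datasets` library.
# The dataset path and field names below are illustrative assumptions.
from datasets import load_dataset

pairs = load_dataset("agentica-org/DeepScaleR-Preview-Dataset", split="train")  # hypothetical path

for example in pairs.select(range(3)):
    print(example["problem"][:80], "->", example["answer"])
```
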
## Training Recipe
We employ DeepSeek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that extends PPO by:
- Normalizing the advantage function over all samples generated from the same prompt (see the sketch after this list).
- Applying KL divergence regularization on top of PPO's surrogate loss to prevent significant policy drift.

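As an illustration of the first point, the sketch below computes group-relative advantages by standardizing the rewards of all completions sampled from one prompt. It is a minimal restatement of the idea, not code from our training pipeline.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize the rewards of all samples drawn from the same prompt.

    `rewards` has shape (num_samples,), one scalar reward per completion.
    Each sample's advantage is its reward standardized against the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions for one prompt, three of which are correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
```
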
**Reward Function**: Our reward function is simple but effective:
- 1 for correct answers that pass the LaTeX/Sympy checks (a minimal checker is sketched after this list)
- 0 for incorrect or improperly formatted answers
- Note: no partial rewards (such as PRMs) or intermediate feedback are used.

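The following is a minimal sketch of such a binary check, using Sympy's LaTeX parser to test symbolic equivalence. The exact checker used in training may differ, and `parse_latex` additionally requires the `antlr4-python3-runtime` package.

```python
# Illustrative binary reward: 1 if the predicted answer is symbolically
# equivalent to the reference, 0 otherwise (including parse failures).
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def binary_reward(predicted_latex: str, reference_latex: str) -> float:
    try:
        pred = parse_latex(predicted_latex)
        ref = parse_latex(reference_latex)
        return 1.0 if simplify(pred - ref) == 0 else 0.0
    except Exception:
        # Improperly formatted answers earn no reward.
        return 0.0

print(binary_reward(r"\frac{1}{2}", r"0.5"))  # 1.0
```
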
**Iterative Context Lengthening**: A key challenge in scaling RL for reasoning is compute cost. Our approach trains the model with progressively longer contexts as it improves, reducing both monetary cost and end-to-end training time (the stage schedule is summarized in the sketch after this list):
- Initial 8K context (steps 0-1040):
  - 22.9% -> 33% Pass@1 on AIME 2024
  - Trained on 8 A100-80GB GPUs, batch size = (prompts) * (samples/prompt) = 128 * 8 = 1024
- Extended to 16K context (steps 1040-1520):
  - 33% -> 38% Pass@1 on AIME 2024
  - Trained on 32 A100-80GB GPUs, batch size = (prompts) * (samples/prompt) = 128 * 16 = 2048
- Further extended to 24K context (step 1520+):
  - 38% -> 43% Pass@1 on AIME 2024
  - Trained on 32 A100-80GB GPUs, batch size = (prompts) * (samples/prompt) = 128 * 16 = 2048
  - Significant improvements within <200 steps

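For reference, the schedule above can be restated as a small table of stages. The snippet below is only an illustrative summary (token counts assume 8K = 8192 tokens, and so on); it is not the actual verl configuration used for training.

```python
# Illustrative restatement of the context-lengthening schedule; not verl config.
STAGES = [
    {"steps": (0, 1040),    "max_context": 8_192,  "gpus": 8,  "prompts": 128, "samples_per_prompt": 8},
    {"steps": (1040, 1520), "max_context": 16_384, "gpus": 32, "prompts": 128, "samples_per_prompt": 16},
    {"steps": (1520, None), "max_context": 24_576, "gpus": 32, "prompts": 128, "samples_per_prompt": 16},
]

for stage in STAGES:
    batch_size = stage["prompts"] * stage["samples_per_prompt"]
    print(f"steps {stage['steps']}: {stage['max_context']}-token context, "
          f"{stage['gpus']} GPUs, batch size {batch_size}")
```
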
A more detailed description of the training recipe can be found in our [blog post](https://www.notion.so/DeepScaleR-Scaling-R1-Models-with-Reinforcement-Learning-1891e65ddc7f80ad8cc6dbe0069a66fa?pvs=4).

## Evaluation
We report Pass@1 accuracy averaged over 16 samples for each problem (a minimal scoring sketch follows the table).

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
|-------|-----------|----------|----------|--------------|---------------|------|
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | <strong>39.7</strong> | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| <strong>DeepScaleR-1.5B-Preview</strong> | <strong>43.1</strong> | <strong>87.8</strong> | <strong>73.6</strong> | 30.2 | <strong>50.0</strong> | <strong>57.0</strong> |
| O1-Preview | 40.0 | 81.4 | - | - | - | - |

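Concretely, Pass@1 is estimated per problem as the fraction of the 16 sampled completions that are correct, then averaged over the benchmark. The helper below is a minimal sketch; `is_correct` stands in for a hypothetical answer checker such as the Sympy check above.

```python
from typing import Callable, List

def mean_pass_at_1(samples_per_problem: List[List[str]],
                   references: List[str],
                   is_correct: Callable[[str, str], bool]) -> float:
    """Average Pass@1: fraction of correct samples per problem, averaged over problems."""
    per_problem = [
        sum(is_correct(sample, ref) for sample in samples) / len(samples)
        for samples, ref in zip(samples_per_problem, references)
    ]
    return sum(per_problem) / len(per_problem)
```
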
## Serving DeepScaleR
Our model can be served using popular high-performance inference systems:
- vLLM
- Hugging Face Text Generation Inference (TGI)
- SGLang
- TensorRT-LLM

All these systems support the OpenAI Chat Completions API format.

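As one example, after starting an OpenAI-compatible endpoint (e.g. `vllm serve agentica-org/DeepScaleR-1.5B-Preview`), the model can be queried with the standard `openai` client. The local URL, port, and sampling settings below are assumptions for illustration, not prescribed defaults.

```python
# Minimal sketch of querying an OpenAI-compatible server, e.g. one started with
# `vllm serve agentica-org/DeepScaleR-1.5B-Preview`. URL/port are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="agentica-org/DeepScaleR-1.5B-Preview",
    messages=[{"role": "user", "content": "What is the sum of the first 10 positive integers?"}],
    temperature=0.6,
    max_tokens=8192,
)
print(response.choices[0].message.content)
```
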
## License
This project is released under the MIT License, reflecting our commitment to open and accessible AI development. We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon. This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.

## Acknowledgement
- Our training experiments are powered by our heavily modified fork of [Verl](https://github.com/agentica-project/verl), an open-source RLHF library.
- Our model is trained on top of [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Our work is done as part of [Berkeley Sky Computing Lab](https://skycomputing.berkeley.edu/) and [Berkeley AI Research](https://bair.berkeley.edu/).

## Citation
```bibtex
@misc{deepscaler2025,
  title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},