🚀 Democratizing Reinforcement Learning for LLMs 🌟
</div>
</div>
<br>
<div align="center" style="line-height: 1;">
<a href="https://github.com/agentica-project/deepscaler" target="_blank" style="margin: 2px;">
<img alt="Code" src="https://img.shields.io/badge/DeepScaleR-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://www.notion.so/DeepScaleR-Scaling-R1-Models-with-Reinforcement-Learning-1891e65ddc7f80ad8cc6dbe0069a66fa?pvs=4" target="_blank" style="margin: 2px;">
<img alt="Blog" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://x.com/Agentica_" target="_blank" style="margin: 2px;">
<img alt="X" src="https://img.shields.io/badge/Agentica-white?style=for-the-badge&logo=X&logoColor=000&color=000&labelColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/agentica-org" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/Agentica-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
</div>
</div>

## DeepScaleR Overview

DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using distributed reinforcement learning (RL) to scale to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, a 14.3-point absolute improvement over the base model (28.8%), surpassing OpenAI's O1-Preview with just 1.5B parameters.

## Data

Our training dataset consists of approximately 40,000 unique problem-answer pairs compiled from:
- AIME problems (1984-2023)
- AMC problems (prior to 2023)
- Omni-MATH dataset
- Still dataset

## Training Recipe

We employ DeepSeek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that extends PPO by:
- Normalizing the advantage over all samples generated from the same prompt, which removes the need for a separate value network (see the sketch below).
- Applying KL-divergence regularization on top of PPO's surrogate loss to prevent significant policy drift.
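
To make the group normalization concrete, here is a minimal sketch; it is not our training code (our fork of Verl implements the full GRPO loss), just the per-prompt advantage computation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO advantage for one prompt: normalize each sample's reward by the
    mean and standard deviation of all rewards in the same group, so no
    learned value network is required as a baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 sampled responses to one prompt with binary outcome rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))
```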

**Reward Function**: Our reward function is simple but effective:
- 1 for correct answers that pass the LaTeX/SymPy checks
- 0 for incorrect or improperly formatted answers
- Note: no partial rewards (such as those from process reward models, PRMs) and no intermediate feedback. A sketch of this binary check follows the list.
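
As an illustration only, a binary outcome reward of this shape can be written with SymPy's LaTeX parser (which needs the optional antlr4 runtime); our actual checker in the deepscaler repo handles far more LaTeX normalization cases:

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex

def outcome_reward(pred_latex: str, gold_latex: str) -> float:
    """Return 1.0 if the predicted final answer is symbolically equal to the
    gold answer, else 0.0. Malformed LaTeX counts as incorrect; there is no
    partial credit and no intermediate feedback."""
    try:
        diff = parse_latex(pred_latex) - parse_latex(gold_latex)
        return 1.0 if simplify(diff) == 0 else 0.0
    except Exception:
        return 0.0

print(outcome_reward(r"\frac{1}{2}", r"0.5"))  # 1.0
print(outcome_reward(r"\sqrt{2}", r"2"))       # 0.0
```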

**Iterative Context Lengthening**: A key challenge in scaling RL for reasoning is compute cost. Our approach trains the model with progressively longer contexts as it improves, saving both money and end-to-end training time (a schematic of the schedule follows the list):
- Initial 8K context (steps 0-1040):
  - 22.9% -> 33% Pass@1 on AIME 2024
  - Trained on 8 A100-80GB GPUs; batch size = (prompts) x (samples/prompt) = 128 x 8 = 1024
- Extended to 16K context (steps 1040-1520):
  - 33% -> 38% Pass@1 on AIME 2024
  - Trained on 32 A100-80GB GPUs; batch size = 128 x 16 = 2048
- Further extended to 24K context (steps 1520+):
  - 38% -> 43% Pass@1 on AIME 2024
  - Trained on 32 A100-80GB GPUs; batch size = 128 x 16 = 2048
  - Significant improvements within fewer than 200 steps
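
Schematically, the schedule looks like the following; the tuple layout is purely illustrative (step boundaries and batch shapes are taken from the list above, not from the Verl config format):

```python
# Iterative context-lengthening schedule (illustrative layout only).
# (start_step, end_step, max_response_tokens, num_gpus, samples_per_prompt)
CONTEXT_SCHEDULE = [
    (0,    1040,  8 * 1024,  8,  8),  # 8K phase:  128 prompts * 8  = 1024 samples/step
    (1040, 1520, 16 * 1024, 32, 16),  # 16K phase: 128 prompts * 16 = 2048 samples/step
    (1520, None, 24 * 1024, 32, 16),  # 24K phase: train until convergence
]

def max_response_len(step: int) -> int:
    """Return the response-length cap in effect at a given training step."""
    for start, end, max_len, _, _ in CONTEXT_SCHEDULE:
        if step >= start and (end is None or step < end):
            return max_len
    raise ValueError(f"step {step} is not covered by the schedule")

print(max_response_len(500))   # 8192
print(max_response_len(1600))  # 24576
```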

A more detailed description of the training recipe can be found in our [blog post](https://www.notion.so/DeepScaleR-Scaling-R1-Models-with-Reinforcement-Learning-1891e65ddc7f80ad8cc6dbe0069a66fa?pvs=4).

## Evaluation

We report Pass@1 accuracy averaged over 16 samples per problem (a minimal sketch of this metric follows the table).

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
|-------|-----------|----------|----------|--------------|---------------|------|
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | <strong>39.7</strong> | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| <strong>DeepScaleR-1.5B-Preview</strong> | <strong>43.1</strong> | <strong>87.8</strong> | <strong>73.6</strong> | 30.2 | <strong>50.0</strong> | <strong>57.0</strong> |
| O1-Preview | 40.0 | 81.4 | - | - | - | - |
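
Concretely, the reported number is the mean correctness across the 16 sampled solutions for each problem, averaged over all problems in a benchmark; a minimal sketch:

```python
def pass_at_1(per_problem_correct: list[list[bool]]) -> float:
    """Average Pass@1: for each problem, the fraction of its sampled
    solutions that are correct, then the mean across problems."""
    per_problem = [sum(c) / len(c) for c in per_problem_correct]
    return sum(per_problem) / len(per_problem)

# Example: 2 problems, 16 samples each.
samples = [[True] * 12 + [False] * 4, [True] * 4 + [False] * 12]
print(pass_at_1(samples))  # (0.75 + 0.25) / 2 = 0.5
```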

## Serving DeepScaleR

Our model can be served with popular high-performance inference systems:
- vLLM
- Hugging Face Text Generation Inference (TGI)
- SGLang
- TensorRT-LLM

All of these systems support the OpenAI Chat Completions API format; an example follows.
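
For instance, with vLLM you can launch an OpenAI-compatible server and query it as below; the port, context length, and sampling parameters are illustrative choices, not fixed requirements:

```python
# First, launch the server (illustrative flags):
#   vllm serve agentica-org/DeepScaleR-1.5B-Preview --max-model-len 24576
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="agentica-org/DeepScaleR-1.5B-Preview",
    messages=[{"role": "user", "content": "Solve: what is 17^2 - 13^2?"}],
    temperature=0.6,
    max_tokens=8192,  # leave room for long chains of thought
)
print(response.choices[0].message.content)
```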

## License

This project is released under the MIT License, reflecting our commitment to open and accessible AI development.
We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon.
This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.

## Acknowledgement

- Our training experiments are powered by our heavily modified fork of [Verl](https://github.com/agentica-project/verl), an open-source RLHF library.
- Our model is trained on top of [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Our work is done as part of [Berkeley Sky Computing Lab](https://skycomputing.berkeley.edu/) and [Berkeley AI Research](https://bair.berkeley.edu/).

## Citation

```bibtex
@misc{deepscaler2025,
  title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},