|
--- |
|
license: apache-2.0 |
|
tags: |
|
- qwen |
|
- math |
|
- fine-tuned |
|
- open-r1 |
|
- supervised-finetuning |
|
- evaluation |
|
datasets: |
|
- open-r1/OpenR1-Math-220k |
|
- Idavidrein/gpqa |
|
- HuggingFaceH4/MATH-500 |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- Qwen/Qwen2.5-0.5B |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
language: |
|
- en |
|
|
|
model-index:
- name: Qwen2.5-0.5B-Math220k (Checkpoint-15000)
  results:
  - task:
      type: multiple-choice
    dataset:
      name: GPQA
      type: Idavidrein/gpqa
    metrics:
    - name: Accuracy (Clean Extraction)
      type: accuracy
      value: 0.386
    - name: Accuracy (All Extraction)
      type: accuracy
      value: 0.410
  - task:
      type: mathematical-reasoning
    dataset:
      name: MATH500
      type: HuggingFaceH4/MATH-500
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.219
|
--- |
|
|
|
# Qwen2.5-0.5B-Math220k (Checkpoint-15000) |
|
|
|
This model is a supervised fine-tuned variant of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), trained on the **`default` split of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)** (math220k below) for step-by-step mathematical reasoning and standardized answer formatting.
|
|
|
## Training |
|
|
|
- **Base model:** Qwen2.5-0.5B |
|
- **Dataset:** math220k `default` subset (83k train, 10k test, filtered for verified answers) |
|
- **Training steps:** 15,000 |
|
- **Checkpoint interval:** 500 steps |
|
- **Learning rate:** 2.5e-6 with a **cosine decay schedule**
|
- **Batch size:** 64 |
|
- **Prompting format:** guided step-by-step reasoning, with enforced final-answer formatting (`Answer:` or `\boxed{}`); see the configuration sketch below
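
The hyperparameters above map onto a standard SFT setup. Below is a minimal reproduction sketch using TRL's `SFTTrainer`; the actual training script is not published, so the output directory, batch splitting, prompt template, and answer filtering are assumptions.

```python
# Hypothetical reproduction sketch; the real script, prompt template, and
# answer filtering used for this checkpoint are not published.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train")

config = SFTConfig(
    output_dir="qwen2.5-0.5b-math220k",  # assumed name
    max_steps=15_000,                    # training steps
    save_steps=500,                      # checkpoint interval
    learning_rate=2.5e-6,
    lr_scheduler_type="cosine",          # cosine decay
    per_device_train_batch_size=8,       # one way to reach an effective batch of 64
    gradient_accumulation_steps=8,
)

trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B", args=config, train_dataset=train_dataset)
trainer.train()
```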
|
|
|
## Evaluation |
|
|
|
All evaluations were performed on **bootstrapped datasets (n = 1000)** so that every checkpoint is scored on a same-sized sample and results stay comparable.
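
Here "bootstrapped" is taken to mean sampling each evaluation set with replacement to a fixed size; a minimal sketch of that resampling step (the helper name and seed are illustrative):

```python
import random

def bootstrap_sample(rows, size=1000, seed=0):
    """Draw `size` rows with replacement from an evaluation set."""
    rng = random.Random(seed)
    return [rows[rng.randrange(len(rows))] for _ in range(size)]
```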
|
|
|
| Dataset | Accuracy (Clean) | Accuracy (All) | |
|
|----------------|------------------|----------------| |
|
| GPQA (merged) | 0.386 | 0.410 | |
|
| MATH500 | 0.219 | N/A | |
|
|
|
- **Clean extraction:** only answers in canonical form (`Answer: X`, `\boxed{X}`) |
|
- **All extraction:** also accepts fuzzy-matched final answers in phrases like "the correct answer is X" (see the extraction sketch below)
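
A minimal sketch of the two extraction modes; the exact patterns live in `eval_checkpoints_auto.py` and may differ:

```python
import re

def extract_clean(text: str):
    """Canonical forms only: \\boxed{X} or Answer: X."""
    m = re.search(r"\\boxed\{([^}]*)\}", text) or re.search(r"Answer:\s*(\S.*)", text)
    return m.group(1).strip() if m else None

def extract_all(text: str):
    """Falls back to fuzzy phrases such as 'the correct answer is X'."""
    answer = extract_clean(text)
    if answer is None:
        m = re.search(r"the (?:correct |final )?answer is\s*([^\n.]+)", text, re.IGNORECASE)
        answer = m.group(1).strip() if m else None
    return answer
```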
|
|
|
Evaluation was performed with `eval_checkpoints_auto.py` using local bootstrapped datasets. |
|
For detailed evaluation results and charts, see: [DexinRen/open-r1_DR_test/dexin_src/eval_output](https://github.com/DexinRen/open-r1_DR_test/tree/master/dexin_src/eval_output) |
|
|
|
## Limitations |
|
|
|
- Although math220k answers were filtered for correctness, its reasoning traces are unverified and can be noisy; the model may pick up flawed reasoning patterns.
|
- This checkpoint prioritizes **formatting discipline and correctness of final answers** over full reasoning transparency. |
|
- MATH500 generalization is slightly degraded relative to the base model, an expected side effect of narrow SFT.
|
|
|
## Files Included |
|
|
|
- `model.safetensors`: model weights |
|
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model config |
|
- All files are stored using **Git LFS** for proper large file support. |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
> Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000
|
|
|
## Recommended Usage |
|
|
|
For basic use:
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000") |
|
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True) |
|
``` |
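
A short generation example in the answer format described above; the exact prompt template used during training is not published, so this prompt is illustrative:

```python
import torch

prompt = (
    "Solve the following problem step by step, and give the final answer "
    "as \\boxed{}.\nProblem: What is 12 * 13?"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```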
|
|
|
For reproducible evaluation, use the [custom formatter and evaluation code](https://github.com/DexinRen/open-r1_DR_test): |
|
|
|
```python |
|
from transformers import AutoTokenizer

from dexin_src.utils.formatter import Formatter  # from the linked repo

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # `example` is a row from your dataset
|
``` |