---
license: apache-2.0
tags:
- qwen
- math
- fine-tuned
- open-r1
- supervised-finetuning
- evaluation
datasets:
- open-r1/OpenR1-Math-220k
- Idavidrein/gpqa
- HuggingFaceH4/MATH-500
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
library_name: transformers
language:
- en
model-index:
- name: Qwen2.5-0.5B-Math220k (Checkpoint-15000)
  results:
  - task:
      type: multiple-choice
    dataset:
      name: GPQA
      type: Idavidrein/gpqa
    metrics:
    - name: Accuracy (Clean Extraction)
      type: accuracy
      value: 0.386
    - name: Accuracy (All Extraction)
      type: accuracy
      value: 0.41
  - task:
      type: mathematical-reasoning
    dataset:
      name: MATH500
      type: HuggingFaceH4/MATH-500
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.219
---

# Qwen2.5-0.5B-Math220k (Checkpoint-15000)
This model is a supervised fine-tuned variant of Qwen2.5-0.5B, trained on the default split of math220k for step-by-step mathematical reasoning and standardized answer formatting.
## Training

- Base model: Qwen2.5-0.5B
- Dataset: math220k `default` subset (83k train, 10k test, filtered for verified answers)
- Training steps: 15,000
- Checkpoint interval: 500 steps
- Learning rate: 2.5e-6 with cosine decay scheduler
- Batch size: 64
- Prompting format: guided step-by-step reasoning, with enforced final answer formatting (`Answer:` or `\boxed{}`); a sketch of this format follows the list
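The exact prompt template is not published in this card, so the following is only a minimal sketch of the guided step-by-step format described above; the wording of `SYSTEM_PROMPT` and the helper name `build_prompt` are illustrative assumptions, not the repository's actual code.

```python
# Hypothetical sketch of the guided step-by-step prompt format.
# The exact wording used during fine-tuning is an assumption.
SYSTEM_PROMPT = (
    "Solve the problem step by step. End your response with the final "
    "answer as 'Answer: X' or \\boxed{X}."
)

def build_prompt(problem: str) -> str:
    """Wrap a raw math problem in the enforced reasoning/answer format."""
    return f"{SYSTEM_PROMPT}\n\nProblem: {problem}\n\nSolution:"
```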
## Evaluation

All evaluations were performed on bootstrapped datasets (size=1000) to ensure fair, stable comparisons.
| Dataset | Accuracy (Clean) | Accuracy (All) |
|---|---|---|
| GPQA (merged) | 0.386 | 0.410 |
| MATH500 | 0.219 | N/A |
- Clean extraction: only answers in canonical form (`Answer: X`, `\boxed{X}`)
- All extraction: includes fuzzy-matched final answers in phrases like "the correct answer is X" (see the sketch below)
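As an illustration of the two extraction modes, the logic might look roughly like the regex-based sketch below; the actual implementation lives in `eval_checkpoints_auto.py`, and these patterns and function names are assumptions.

```python
import re

def extract_clean(text: str) -> str | None:
    """Return the final answer only if it appears in canonical form
    ('Answer: X' or \\boxed{X}); otherwise return None."""
    m = re.search(r"Answer:\s*(.+?)\s*$", text, re.MULTILINE)
    if m:
        return m.group(1)
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1) if m else None

def extract_all(text: str) -> str | None:
    """Try clean extraction first, then fall back to fuzzy phrases
    such as 'the correct answer is X'."""
    answer = extract_clean(text)
    if answer is not None:
        return answer
    m = re.search(r"the correct answer is\s*([^\s.,]+)", text, re.IGNORECASE)
    return m.group(1) if m else None
```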
Evaluation was performed with `eval_checkpoints_auto.py` using local bootstrapped datasets.
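Here, "bootstrapped dataset (size=1000)" refers to resampling the test set with replacement. A minimal sketch of such a sampler is below; the function name and fixed seed are assumptions, not the script's actual code.

```python
import random

def bootstrap_sample(dataset: list, size: int = 1000, seed: int = 0) -> list:
    """Draw `size` examples with replacement. A fixed seed keeps the
    resampled set identical across checkpoints, so scores stay comparable."""
    rng = random.Random(seed)
    return rng.choices(dataset, k=size)
```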
For detailed evaluation results and charts, see `DexinRen/open-r1_DR_test/dexin_src/eval_output`.
## Limitations
- The math220k dataset contains noisy, unverified solutions. The model may pick up flawed reasoning patterns.
- This checkpoint prioritizes formatting discipline and correctness of final answers over full reasoning transparency.
- MATH500 generalization is slightly degraded relative to the base model, an expected side effect of narrow supervised fine-tuning.
## Files Included

- `model.safetensors`: model weights
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model config
- All files are stored using Git LFS for proper large-file support.
## Citation

If you use this model, please cite:

Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000
## Recommended Usage

### For basic use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True)
```
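A short generation example built on the snippet above; the prompt wording here is illustrative, not the exact training template.

```python
prompt = (
    "Solve the problem step by step and end with 'Answer: X'.\n\n"
    "Problem: What is 12 * 7?\n\nSolution:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```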
For reproducible evaluation, use the custom formatter and evaluation code:

```python
from transformers import AutoTokenizer

from dexin_src.utils.formatter import Formatter

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # `example` is a row from your dataset
```
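Assuming `format_prompt` returns a plain string (the Formatter API is not documented in this card, so treat this as a sketch), the formatted prompt can then be passed to the model loaded in the basic-use example:

```python
inputs = tokenizer(formatted_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```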