# Qwen2.5-0.5B-Math220k (Checkpoint-15000)
This model is a supervised fine-tuned variant of Qwen2.5-0.5B, trained on the `default` subset of math220k for step-by-step mathematical reasoning and standardized answer formatting.
## Training

- Base model: Qwen2.5-0.5B
- Dataset: math220k `default` subset (83k train, 10k test, filtered for verified answers)
- Training steps: 15,000
- Checkpoint interval: 500 steps
- Learning rate: 2.5e-6 with cosine decay scheduler
- Batch size: 64
- Prompting format: guided step-by-step reasoning, with enforced final answer formatting (`Answer:` or `\boxed{}`); a configuration sketch of these hyperparameters follows this list
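The training framework is not specified on this card. Below is a minimal sketch of the listed hyperparameters as a Hugging Face `TrainingArguments` object, assuming a standard `Trainer`-based SFT run; the `output_dir` and the per-device/accumulation split of the batch size are assumptions, not published details.

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the run from the hyperparameters above;
# the batch layout (8 per device x 8 accumulation = 64) is an assumption.
training_args = TrainingArguments(
    output_dir="qwen2.5-math220k-sft",  # assumed name
    max_steps=15_000,                   # training steps
    save_steps=500,                     # checkpoint interval
    learning_rate=2.5e-6,
    lr_scheduler_type="cosine",         # cosine decay
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch size 64
)
```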
## Evaluation
All evaluations were performed on bootstrapped datasets (size=1000) to ensure fair, stable comparisons.
| Dataset       | Accuracy (Clean) | Accuracy (All) |
|---------------|------------------|----------------|
| GPQA (merged) | 0.386            | 0.410          |
| MATH500       | 0.219            | N/A            |
- Clean extraction: only answers in canonical form (`Answer: X`, `\boxed{X}`)
- All extraction: includes fuzzy-matched final answers in phrases like "the correct answer is X" (both modes are sketched after this list)
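The exact extraction rules live in `eval_checkpoints_auto.py`; the sketch below is an illustrative reimplementation of the two modes, with assumed regexes rather than the script's actual logic.

```python
import re

def extract_answer(text: str, mode: str = "clean"):
    """Pull a final answer from a model completion.

    Regexes here are illustrative assumptions, not the exact
    rules used by eval_checkpoints_auto.py.
    """
    # Canonical forms, accepted in both "clean" and "all" modes.
    m = re.search(r"Answer:\s*(.+)", text)
    if m:
        return m.group(1).strip()
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m.group(1).strip()
    if mode == "all":
        # Fuzzy fallback, e.g. "the correct answer is X".
        m = re.search(r"the correct answer is\s*(.+)", text, re.IGNORECASE)
        if m:
            return m.group(1).strip()
    return None
```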
Evaluation was performed with `eval_checkpoints_auto.py` using local bootstrapped datasets.
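The bootstrapping itself is not shown on this card; one plausible reading of "bootstrapped datasets (size=1000)" is fixed-size resampling with replacement, sketched below with an assumed seed.

```python
import random

def bootstrap_sample(dataset, size=1000, seed=0):
    """Resample a dataset with replacement to a fixed size, so every
    checkpoint is scored on an identically sized evaluation set."""
    rng = random.Random(seed)  # fixed seed keeps comparisons stable
    return [dataset[rng.randrange(len(dataset))] for _ in range(size)]
```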
For detailed evaluation results and charts, see `DexinRen/open-r1_DR_test/dexin_src/eval_output`.
## Limitations

- The math220k dataset contains noisy, unverified solutions; even with filtering, the model may pick up flawed reasoning patterns.
- This checkpoint prioritizes formatting discipline and correctness of final answers over full reasoning transparency.
- MATH500 generalization is slightly degraded vs. the base model (expected for SFT).
## Files Included

- `model.safetensors`: model weights
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model configuration
- All files are stored with Git LFS for proper large-file support.
## Citation

If you use this model, please cite:

> Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000
## Recommended Usage

For basic use:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True)
```
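A short generation example follows; the prompt wording and decoding settings are illustrative assumptions, not the exact template used during training.

```python
# Prompt phrasing and decoding settings are assumptions.
prompt = "Solve step by step, then give the final answer as \\boxed{}.\n12 * 7 = ?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```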
For reproducible evaluation, use the custom formatter and evaluation code from the repository (`dexin_src` ships with DexinRen/open-r1_DR_test, not with this checkpoint):

```python
from transformers import AutoTokenizer
from dexin_src.utils.formatter import Formatter

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # example is a row from your dataset
```