---
license: apache-2.0
tags:
  - qwen
  - math
  - fine-tuned
  - open-r1
  - supervised-finetuning
  - evaluation
datasets:
  - open-r1/OpenR1-Math-220k
  - Idavidrein/gpqa
  - HuggingFaceH4/MATH-500
metrics:
  - accuracy
base_model:
  - Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
library_name: transformers
language:
  - en
model-index:
  - name: Qwen2.5-0.5B-Math220k (Checkpoint-15000)
    results:
      - task:
          type: multiple-choice
        dataset:
          name: GPQA
          type: Idavidrein/gpqa
        metrics:
          - name: Accuracy (Clean Extraction)
            type: accuracy
            value: 0.386
          - name: Accuracy (All Extraction)
            type: accuracy
            value: 0.41
      - task:
          type: mathematical-reasoning
        dataset:
          name: MATH500
          type: HuggingFaceH4/MATH-500
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.219
---

# Qwen2.5-0.5B-Math220k (Checkpoint-15000)

This model is a supervised fine-tuned (SFT) variant of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), trained on the default subset of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) ("math220k") for step-by-step mathematical reasoning and standardized final-answer formatting.

## Training

- Base model: Qwen/Qwen2.5-0.5B
- Dataset: math220k default subset (83k train, 10k test, filtered for verified answers)
- Training steps: 15,000
- Checkpoint interval: 500 steps
- Learning rate: 2.5e-6 with a cosine decay schedule
- Batch size: 64
- Prompting format: guided step-by-step reasoning, with an enforced final-answer format (`Answer:` or `\boxed{}`)
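For reference, the hyperparameters above map onto a standard SFT setup. Below is a minimal sketch assuming TRL's `SFTTrainer`; the card does not state which trainer was actually used, and the output path and batch-size split are assumptions:

```python
# Hedged sketch of the training configuration above, assuming TRL's
# SFTTrainer (the actual training script is not included in this repo).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train")

config = SFTConfig(
    output_dir="qwen2.5-0.5b-math220k",  # assumed output path
    max_steps=15_000,                    # training steps
    save_steps=500,                      # checkpoint interval
    learning_rate=2.5e-6,
    lr_scheduler_type="cosine",          # cosine decay schedule
    per_device_train_batch_size=8,       # assumed split of the
    gradient_accumulation_steps=8,       # effective batch size of 64
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```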

## Evaluation

All evaluations were performed on bootstrapped datasets (1,000 examples sampled with replacement) to ensure fair, stable comparisons across checkpoints.
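A minimal sketch of what such bootstrapping looks like (the actual logic lives in `eval_checkpoints_auto.py` and may differ):

```python
# Bootstrap resampling: draw a fixed-size evaluation set with replacement
# so every checkpoint is scored on the same resampled distribution.
import random

def bootstrap_sample(examples, size=1000, seed=0):
    rng = random.Random(seed)
    return [examples[rng.randrange(len(examples))] for _ in range(size)]
```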

| Dataset       | Accuracy (Clean) | Accuracy (All) |
|---------------|------------------|----------------|
| GPQA (merged) | 0.386            | 0.410          |
| MATH500       | 0.219            | N/A            |

- Clean extraction: only answers in canonical form (`Answer: X`, `\boxed{X}`)
- All extraction: also counts fuzzy-matched final answers in phrases such as "the correct answer is X"
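As an illustration of the two modes (these regexes are an approximation, not the exact patterns used by the evaluation script):

```python
import re

def extract_clean(text):
    """Canonical forms only: \\boxed{X} or 'Answer: X'."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m is None:
        m = re.search(r"Answer:\s*(.+)", text)
    return m.group(1).strip() if m else None

def extract_all(text):
    """Clean forms first, then fuzzy phrases like 'the correct answer is X'."""
    answer = extract_clean(text)
    if answer is not None:
        return answer
    m = re.search(r"the correct answer is\s+([^\s.,]+)", text, re.IGNORECASE)
    return m.group(1) if m else None
```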

Evaluation was run with `eval_checkpoints_auto.py` on the local bootstrapped datasets. For detailed evaluation results and charts, see `DexinRen/open-r1_DR_test/dexin_src/eval_output`.

## Limitations

- The math220k dataset contains noisy, unverified solution traces, so the model may pick up flawed reasoning patterns.
- This checkpoint prioritizes formatting discipline and final-answer correctness over full reasoning transparency.
- MATH500 generalization is slightly degraded relative to the base model (an expected effect of SFT).

## Files Included

- `model.safetensors`: model weights
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model configuration
- All files are stored with Git LFS for proper large-file support.

## Citation

If you use this model, please cite:

> Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000

## Recommended Usage

### Basic use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True)
```
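A short generation example (the prompt wording is illustrative; the model was trained to end its output with `Answer:` or `\boxed{}`):

```python
# Illustrative prompt, not a required format.
prompt = "Solve step by step: What is 12 * 13? End with 'Answer: <value>'."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```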

### Reproducible evaluation

For reproducible evaluation, use the custom formatter from the evaluation code:

```python
from transformers import AutoTokenizer

from dexin_src.utils.formatter import Formatter

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # `example` is a row from your dataset
```
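Assuming `format_prompt` returns a plain prompt string (its return type is not documented in this card), the result can be fed to the model as in the basic example:

```python
# Assumption: formatted_prompt is a plain string ready for tokenization.
inputs = tokenizer(formatted_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```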