---
license: apache-2.0
tags:
  - qwen
  - math
  - fine-tuned
  - open-r1
  - supervised-finetuning
  - evaluation
datasets:
  - open-r1/OpenR1-Math-220k
  - Idavidrein/gpqa
  - HuggingFaceH4/MATH-500
metrics:
  - accuracy
base_model:
  - Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
library_name: transformers
language:
  - en
model-index:
  - name: Qwen2.5-0.5B-Math220k (Checkpoint-15000)
    results:
      - task:
          type: multiple-choice
        dataset:
          name: GPQA
          type: Idavidrein/gpqa
        metrics:
          - name: Accuracy (Clean Extraction)
            type: accuracy
            value: 0.386
          - name: Accuracy (All Extraction)
            type: accuracy
            value: 0.41
      - task:
          type: mathematical-reasoning
        dataset:
          name: MATH500
          type: HuggingFaceH4/MATH-500
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.219
---

# Qwen2.5-0.5B-Math220k (Checkpoint-15000)

This model is a supervised fine-tuned (SFT) variant of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), trained on the default subset of [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) ("math220k") for step-by-step mathematical reasoning and standardized final-answer formatting.

## Training

- Base model: Qwen/Qwen2.5-0.5B
- Dataset: math220k default subset (83k train, 10k test, filtered for verified answers)
- Training steps: 15,000
- Checkpoint interval: 500 steps
- Learning rate: 2.5e-6 with a cosine decay schedule
- Batch size: 64
- Prompting format: guided step-by-step reasoning, with an enforced final-answer format (`Answer:` or `\boxed{}`)
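For reference, the hyperparameters above map onto a standard SFT setup. Below is a minimal sketch assuming TRL's `SFTTrainer`; the card does not state which trainer was actually used, and the output path and batch-size split are assumptions:

```python
# Hedged sketch of the training configuration above, assuming TRL's
# SFTTrainer (the actual training script is not included in this repo).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train")

config = SFTConfig(
    output_dir="qwen2.5-0.5b-math220k",  # assumed output path
    max_steps=15_000,                    # training steps
    save_steps=500,                      # checkpoint interval
    learning_rate=2.5e-6,
    lr_scheduler_type="cosine",          # cosine decay schedule
    per_device_train_batch_size=8,       # assumed split of the
    gradient_accumulation_steps=8,       # effective batch size of 64
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```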

## Evaluation

All evaluations were performed on bootstrapped datasets (1,000 examples sampled with replacement) to ensure fair, stable comparisons across checkpoints.
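A minimal sketch of what such bootstrapping looks like (the actual logic lives in `eval_checkpoints_auto.py` and may differ):

```python
# Bootstrap resampling: draw a fixed-size evaluation set with replacement
# so every checkpoint is scored on the same resampled distribution.
import random

def bootstrap_sample(examples, size=1000, seed=0):
    rng = random.Random(seed)
    return [examples[rng.randrange(len(examples))] for _ in range(size)]
```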

| Dataset       | Accuracy (Clean) | Accuracy (All) |
|---------------|------------------|----------------|
| GPQA (merged) | 0.386            | 0.410          |
| MATH500       | 0.219            | N/A            |

- Clean extraction: only answers in canonical form (`Answer: X`, `\boxed{X}`)
- All extraction: also counts fuzzy-matched final answers in phrases such as "the correct answer is X"
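As an illustration of the two modes (these regexes are an approximation, not the exact patterns used by the evaluation script):

```python
import re

def extract_clean(text):
    """Canonical forms only: \\boxed{X} or 'Answer: X'."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m is None:
        m = re.search(r"Answer:\s*(.+)", text)
    return m.group(1).strip() if m else None

def extract_all(text):
    """Clean forms first, then fuzzy phrases like 'the correct answer is X'."""
    answer = extract_clean(text)
    if answer is not None:
        return answer
    m = re.search(r"the correct answer is\s+([^\s.,]+)", text, re.IGNORECASE)
    return m.group(1) if m else None
```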

Evaluation was run with `eval_checkpoints_auto.py` on the local bootstrapped datasets. For detailed evaluation results and charts, see `DexinRen/open-r1_DR_test/dexin_src/eval_output`.

## Limitations

- The math220k dataset contains noisy, unverified solution traces, so the model may pick up flawed reasoning patterns.
- This checkpoint prioritizes formatting discipline and final-answer correctness over full reasoning transparency.
- MATH500 generalization is slightly degraded relative to the base model (an expected effect of SFT).

## Files Included

- `model.safetensors`: model weights
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model configuration
- All files are stored with Git LFS for proper large-file support.

## Citation

If you use this model, please cite:

> Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000

## Recommended Usage

### Basic use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True)
```
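A short generation example (the prompt wording is illustrative; the model was trained to end its output with `Answer:` or `\boxed{}`):

```python
# Illustrative prompt, not a required format.
prompt = "Solve step by step: What is 12 * 13? End with 'Answer: <value>'."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```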

### Reproducible evaluation

For reproducible evaluation, use the custom formatter from the evaluation code:

```python
from transformers import AutoTokenizer

from dexin_src.utils.formatter import Formatter

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # `example` is a row from your dataset
```
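Assuming `format_prompt` returns a plain prompt string (its return type is not documented in this card), the result can be fed to the model as in the basic example:

```python
# Assumption: formatted_prompt is a plain string ready for tokenization.
inputs = tokenizer(formatted_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```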