Qwen2.5-3B-GRPO - a sarthak247 Collection

updated Feb 24

Trained with Unsloth on GSM8K for only 250 steps (due to resource constraints) to add reasoning abilities to Qwen2.5-3B. The smaller 3B model was chosen for the same reason.
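The description does not say which reward functions were used during training. As a rough illustration only: GRPO scores several sampled completions per prompt with a reward function, and GSM8K reference solutions end with a line of the form `#### <answer>`. A minimal correctness reward under those assumptions might look like the sketch below. The function names (`extract_gsm8k_answer`, `extract_last_number`, `correctness_reward`) are hypothetical and are not taken from this collection or from any library.

```python
import re
from typing import List, Optional


def extract_gsm8k_answer(solution: str) -> Optional[str]:
    """Return the final answer that follows '####' in a GSM8K reference solution."""
    if "####" not in solution:
        return None
    # GSM8K answers may contain thousands separators, e.g. "1,234".
    return solution.split("####")[-1].strip().replace(",", "")


def extract_last_number(completion: str) -> Optional[str]:
    """Return the last number in a model completion (assumed to be its final answer)."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")


def correctness_reward(completions: List[str], references: List[str]) -> List[float]:
    """Hypothetical reward: 1.0 if the predicted number matches the reference, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, references):
        predicted = extract_last_number(completion)
        target = extract_gsm8k_answer(reference)
        rewards.append(1.0 if predicted is not None and predicted == target else 0.0)
    return rewards
```

The comparison is an exact string match, so `18` and `18.0` would count as different answers. A real training setup would likely normalize numbers and add format-based rewards as well.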