I trained a Language Model to schedule events with GRPO!
Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
I experimented with GRPO lately.
I am fascinated by models learning from prompts and rewards alone, with no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.
Choosing an original problem forced me to:
- Think about the problem setting
- Generate data
- Choose the right base model
- Design reward functions (and experience reward hacking)
- Run multiple rounds of training, hoping that my model would learn something.
A fun and rewarding experience.
I learned a lot that I want to share with you.
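To give a feel for what designing a reward function for this task can look like, here is a minimal sketch. It is not the code from the blog post or the repository: the event format (name, start, end as "HH:MM"), the name `schedule_reward`, the 2x weight for priority events, and the normalization by the total weighted duration of all events are all assumptions made for illustration. The early returns show one way to close off an obvious reward-hacking route, where a model invents events or overlaps them to inflate its score.

```python
from datetime import datetime

# Hypothetical event format (an assumption): (name, start "HH:MM", end "HH:MM").
Event = tuple[str, str, str]


def minutes(t: str) -> int:
    """Convert an "HH:MM" string to minutes since midnight."""
    parsed = datetime.strptime(t, "%H:%M")
    return parsed.hour * 60 + parsed.minute


def weighted_duration(evts: list[Event], priorities: set[str]) -> int:
    """Total minutes, counting priority events twice (illustrative weight)."""
    return sum(
        (minutes(end) - minutes(start)) * (2 if name in priorities else 1)
        for name, start, end in evts
    )


def schedule_reward(schedule: list[Event], events: list[Event],
                    priorities: set[str]) -> float:
    """Score a proposed schedule in [0, 1]; invalid schedules get 0."""
    # Reject empty schedules, invented or altered events, and duplicates.
    if not schedule or any(e not in events for e in schedule):
        return 0.0
    if len(set(schedule)) != len(schedule):
        return 0.0
    # Reject overlapping events: each must start after the previous one ends.
    ordered = sorted(schedule, key=lambda e: minutes(e[1]))
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if minutes(next_start) < minutes(prev_end):
            return 0.0
    # Normalize by an upper bound: the weighted duration of all events.
    total = weighted_duration(events, priorities)
    return weighted_duration(schedule, priorities) / total if total > 0 else 0.0
```

With three events where "Workshop" is a priority and overlaps "Standup", a schedule that keeps "Workshop" and "Lunch" scores well, while one that keeps both overlapping events scores 0. Normalizing by the total of all events is a simplification, since the best valid schedule usually scores below 1.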
Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
Code: https://github.com/anakin87/qwen-scheduler-grpo
Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837