Stefano Fiorucci PRO

anakin87

AI & ML interests

Contributing to Haystack LLM framework 🏗️. Language Models: orchestration, post-training, synthetic data...

Recent Activity

upvoted a collection about 18 hours ago
Qwen Scheduler GRPO
posted an update about 19 hours ago
š—œ š˜š—æš—®š—¶š—»š—²š—± š—® š—Ÿš—®š—»š—“š˜‚š—®š—“š—² š— š—¼š—±š—²š—¹ š˜š—¼ š˜€š—°š—µš—²š—±š˜‚š—¹š—² š—²š˜ƒš—²š—»š˜š˜€ š˜„š—¶š˜š—µ š—šš—„š—£š—¢! šŸ‘‘ šŸ—“ļø āœļø Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo I experimented with GRPO lately. I am fascinated by models learning from prompts and rewards - no example answers needed like in Supervised Fine-Tuning. After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game... I wanted a different challenge, like š˜š—²š—®š—°š—µš—¶š—»š—“ š—® š—ŗš—¼š—±š—²š—¹ š˜š—¼ š—°š—æš—²š—®š˜š—² š—® š˜€š—°š—µš—²š—±š˜‚š—¹š—² š—³š—æš—¼š—ŗ š—® š—¹š—¶š˜€š˜ š—¼š—³ š—²š˜ƒš—²š—»š˜š˜€ š—®š—»š—± š—½š—æš—¶š—¼š—æš—¶š˜š—¶š—²š˜€. Choosing an original problem forced me to: šŸ¤” Think about the problem setting 🧬 Generate data šŸ¤ Choose the right base model šŸ† Design reward functions (and experiencing reward hacking) šŸ”„ Run multiple rounds of training, hoping that my model would learn something. A fun and rewarding šŸ˜„ experience. I learned a lot of things, that I want to share with you. šŸ‘‡ āœļø Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo šŸ’» Code: https://github.com/anakin87/qwen-scheduler-grpo šŸ¤— Hugging Face collection (dataset and model): https://huggingface.co/collections/anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837
View all activity

Organizations

deepset · Blog-explorers · ZeroGPU Explorers · Hugging Face Discord Community

Posts 14

š—œ š˜š—æš—®š—¶š—»š—²š—± š—® š—Ÿš—®š—»š—“š˜‚š—®š—“š—² š— š—¼š—±š—²š—¹ š˜š—¼ š˜€š—°š—µš—²š—±š˜‚š—¹š—² š—²š˜ƒš—²š—»š˜š˜€ š˜„š—¶š˜š—µ š—šš—„š—£š—¢! šŸ‘‘ šŸ—“ļø

āœļø Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

I've been experimenting with GRPO lately.

I am fascinated by models learning from prompts and rewards - no example answers needed like in Supervised Fine-Tuning.
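To make that concrete, here is a minimal GRPO sketch using TRL's GRPOTrainer; the dataset, base model, and reward function below are placeholders for illustration, not my actual setup. The training data is just prompts, and reward functions score each sampled completion:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: any dataset with a "prompt" column works.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward function: prefer completions close to 200 characters.
# Note there are only prompts and rewards - no reference answers anywhere.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder base model
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```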

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...

I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.

Choosing an original problem forced me to:
🤔 Think about the problem setting
🧬 Generate data
🤏 Choose the right base model
🏆 Design reward functions (and experience reward hacking; see the sketch after this list)
🔄 Run multiple rounds of training, hoping that my model would learn something.
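As an illustration of the reward-design step, here is a sketch in the spirit of the task, not the actual reward functions from the repo: it assumes a made-up output format of one "HH:MM-HH:MM event" line per scheduled event, and rewards parseable, non-overlapping schedules.

```python
import re

# Made-up output format for illustration: one "HH:MM-HH:MM Event name"
# line per scheduled event (the real format is in the blog post/repo).
LINE_RE = re.compile(r"^(\d{2}):(\d{2})-(\d{2}):(\d{2}) (.+)$")

def to_minutes(hh: str, mm: str) -> int:
    return int(hh) * 60 + int(mm)

def schedule_reward(completions, **kwargs):
    """TRL-style reward function: takes the sampled completions and
    returns one float per completion."""
    rewards = []
    for completion in completions:
        intervals = []
        for line in completion.strip().splitlines():
            m = LINE_RE.match(line.strip())
            if m and to_minutes(m[3], m[4]) > to_minutes(m[1], m[2]):
                intervals.append(
                    (to_minutes(m[1], m[2]), to_minutes(m[3], m[4]))
                )
        intervals.sort()
        # Penalize each pair of consecutive intervals that overlap.
        overlaps = sum(
            1 for (_, end1), (start2, _) in zip(intervals, intervals[1:])
            if start2 < end1
        )
        rewards.append(float(len(intervals)) - 2.0 * overlaps)
    return rewards
```

Rewards like this are easy to hack: here, a model could inflate the parsed-line count by emitting many tiny events, which is exactly the kind of exploit you only catch by inspecting generations.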

A fun and rewarding 😄 experience.


I learned a lot that I want to share with you. 👇
✏️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
💻 Code: https://github.com/anakin87/qwen-scheduler-grpo
🤗 Hugging Face collection (dataset and model): https://huggingface.co/collections/anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837

Articles 3


I trained a Language Model to schedule events with GRPO!