---
library_name: transformers
license: mit
---

# Model Card for Meta-Llama-3.1-8B-Instruct Fine-Tuned with PAPRIKA

This is a saved checkpoint from fine-tuning a meta-llama/Meta-Llama-3.1-8B-Instruct model, first with supervised fine-tuning and then with RPO, using the data and methodology described in our paper, [**"Training a Generally Curious Agent"**](https://arxiv.org/abs/2502.17543). In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.

## Model Details

### Model Description

This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned with PAPRIKA. A minimal usage sketch is provided at the end of this card under "How to Get Started with the Model."

- **Finetuned from model:** meta-llama/Meta-Llama-3.1-8B-Instruct

### Model Sources

- **Repository:** [Official code release for "Training a Generally Curious Agent"](https://github.com/tajwarfahim/paprika)
- **Paper:** [Training a Generally Curious Agent](https://arxiv.org/abs/2502.17543)
- **Project Website:** [paprika-llm.github.io](https://paprika-llm.github.io)

## Training Details

### Training Data

The training dataset for supervised fine-tuning is available here: [SFT dataset](https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset).

The training dataset for preference fine-tuning is available here: [Preference learning dataset](https://huggingface.co/datasets/ftajwar/paprika_preference_dataset).

### Training Procedure

The [attached Wandb run](https://wandb.ai/llm_exploration/paprika_more_data?nw=nwusertajwar) shows the training loss per gradient step for both supervised fine-tuning and preference fine-tuning.

#### Training Hyperparameters

For supervised fine-tuning, we use the AdamW optimizer with learning rate 1e-6, batch size 32, and cosine annealing learning-rate decay with warmup ratio 0.04, training on a total of 17,181 trajectories.

For preference fine-tuning, we use the RPO objective and the AdamW optimizer with learning rate 2e-7, batch size 32, and cosine annealing learning-rate decay with warmup ratio 0.04, training on a total of 5,260 (preferred, dispreferred) trajectory pairs.

#### Hardware

This model was fine-tuned on 8 NVIDIA L40S GPUs.

## Citation

**BibTeX:**

```
@misc{tajwar2025traininggenerallycuriousagent,
      title={Training a Generally Curious Agent},
      author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
      year={2025},
      eprint={2502.17543},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.17543},
}
```

## Model Card Contact

[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)
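
## How to Get Started with the Model

The snippet below is a minimal sketch for loading this checkpoint with the `transformers` library and generating a response. It assumes the checkpoint is hosted on the Hugging Face Hub and loads with the standard causal-LM classes; `MODEL_ID`, the dtype, and the sampling settings are illustrative placeholders, not values prescribed by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the id of this model repository on the Hugging Face Hub.
MODEL_ID = "<this-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)

# The base model is instruction-tuned, so prompts go through the chat template.
messages = [
    {"role": "user", "content": "Let's play twenty questions. I'm thinking of an animal; ask your first question."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```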
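
For reference, the training datasets linked in the Training Data section can be inspected directly with the `datasets` library. This is an illustrative sketch only; the official training pipeline lives in the PAPRIKA repository linked above, and the split names and field layout are whatever those dataset repositories define.

```python
from datasets import load_dataset

# Supervised fine-tuning trajectories (17,181 trajectories per this card).
sft_data = load_dataset("ftajwar/paprika_SFT_dataset")

# (preferred, dispreferred) trajectory pairs used for RPO (5,260 pairs per this card).
pref_data = load_dataset("ftajwar/paprika_preference_dataset")

print(sft_data)
print(pref_data)
```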