---
library_name: transformers
license: mit
---

# Model Card for Meta-Llama-3.1-8B-Instruct Fine-Tuned with PAPRIKA

This is a saved checkpoint from fine-tuning a meta-llama/Meta-Llama-3.1-8B-Instruct model, first with supervised fine-tuning and then with RPO, using the data and methodology described in our paper, [**"Training a Generally Curious Agent"**](https://arxiv.org/abs/2502.17543). In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.

## Model Details

### Model Description

This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned with PAPRIKA. A minimal usage sketch is provided at the end of this card under "How to Get Started with the Model."

- **Finetuned from model:** meta-llama/Meta-Llama-3.1-8B-Instruct

### Model Sources

- **Repository:** [Official code release for "Training a Generally Curious Agent"](https://github.com/tajwarfahim/paprika)
- **Paper:** [Training a Generally Curious Agent](https://arxiv.org/abs/2502.17543)
- **Project Website:** [paprika-llm.github.io](https://paprika-llm.github.io)

## Training Details

### Training Data

The training dataset for supervised fine-tuning is available here: [SFT dataset](https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset).

The training dataset for preference fine-tuning is available here: [Preference learning dataset](https://huggingface.co/datasets/ftajwar/paprika_preference_dataset).

### Training Procedure

The [attached Wandb run](https://wandb.ai/llm_exploration/paprika_more_data?nw=nwusertajwar) shows the training loss per gradient step for both supervised fine-tuning and preference fine-tuning.

#### Training Hyperparameters

For supervised fine-tuning, we use the AdamW optimizer with learning rate 1e-6, batch size 32, and cosine annealing learning-rate decay with warmup ratio 0.04, training on a total of 17,181 trajectories.

For preference fine-tuning, we use the RPO objective and the AdamW optimizer with learning rate 2e-7, batch size 32, and cosine annealing learning-rate decay with warmup ratio 0.04, training on a total of 5,260 (preferred, dispreferred) trajectory pairs.

#### Hardware

This model was fine-tuned on 8 NVIDIA L40S GPUs.

## Citation

**BibTeX:**

```
@misc{tajwar2025traininggenerallycuriousagent,
      title={Training a Generally Curious Agent},
      author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
      year={2025},
      eprint={2502.17543},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.17543},
}
```

## Model Card Contact

[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)
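
## How to Get Started with the Model

The snippet below is a minimal sketch for loading this checkpoint with the `transformers` library and generating a response. It assumes the checkpoint is hosted on the Hugging Face Hub and loads with the standard causal-LM classes; `MODEL_ID`, the dtype, and the sampling settings are illustrative placeholders, not values prescribed by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the id of this model repository on the Hugging Face Hub.
MODEL_ID = "<this-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)

# The base model is instruction-tuned, so prompts go through the chat template.
messages = [
    {"role": "user", "content": "Let's play twenty questions. I'm thinking of an animal; ask your first question."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```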
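
For reference, the training datasets linked in the Training Data section can be inspected directly with the `datasets` library. This is an illustrative sketch only; the official training pipeline lives in the PAPRIKA repository linked above, and the split names and field layout are whatever those dataset repositories define.

```python
from datasets import load_dataset

# Supervised fine-tuning trajectories (17,181 trajectories per this card).
sft_data = load_dataset("ftajwar/paprika_SFT_dataset")

# (preferred, dispreferred) trajectory pairs used for RPO (5,260 pairs per this card).
pref_data = load_dataset("ftajwar/paprika_preference_dataset")

print(sft_data)
print(pref_data)
```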