Update README.md
README.md
CHANGED

pipeline_tag: text2text-generation
---

# Quick Links
- **GitHub Repository**: https://github.com/ahans30/goldfish-loss
- **arXiv**: https://arxiv.org/abs/2406.10209
# Goldfish Loss
We introduce goldfish loss, a new language modeling loss function that mitigates memorization of training data.
Specifically, goldfish loss pseudorandomly drops a $1/k$ fraction of the tokens seen in the forward pass from the loss computation (i.e., no loss is computed on these tokens), with $k$ as a hyperparameter.
We show that models trained with goldfish loss struggle to regurgitate training data verbatim, even after 100 epochs. Please read our paper, linked below, for more details.
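
To make the token-drop rule concrete, here is a minimal PyTorch sketch of a hash-based k-goldfish loss. It is illustrative only and not our exact training code (see the GitHub repository above for that); the rolling hash, the default `width = 13`, and the masking details are simplifications of the hashed drop strategy listed in the table below.

```python
import torch
import torch.nn.functional as F


def _window_hash(window: torch.Tensor) -> int:
    # Deterministic polynomial hash over a window of token ids
    # (an illustrative stand-in for the hash used in the real implementation).
    h = 0
    for tok in window.tolist():
        h = (h * 1000003 + int(tok)) % (2**61 - 1)
    return h


def goldfish_mask(input_ids: torch.Tensor, k: int, width: int = 13) -> torch.Tensor:
    """True = token contributes to the loss; False = dropped (roughly 1/k of tokens)."""
    mask = torch.ones_like(input_ids, dtype=torch.bool)
    batch, seq_len = input_ids.shape
    for b in range(batch):
        for t in range(width, seq_len):
            # Hash the preceding `width` tokens and drop the current token
            # whenever the hash lands in 1/k of the hash space.
            if _window_hash(input_ids[b, t - width:t]) % k == 0:
                mask[b, t] = False
    return mask


def goldfish_loss(logits: torch.Tensor, labels: torch.Tensor, k: int, width: int = 13) -> torch.Tensor:
    # Standard next-token shift, then exclude the dropped positions from the loss.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    keep = goldfish_mask(labels, k=k, width=width)[:, 1:].reshape(-1)
    per_token = F.cross_entropy(shift_logits, shift_labels, reduction="none")
    return per_token[keep].mean()
```

Because the mask depends only on the local window of preceding tokens, the same positions are dropped every time a passage reappears in training, so the model never receives a complete supervised copy of that passage.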
# Overview
The following checkpoints are from our paper *Goldfish Loss: Mitigating Memorization in Generative LLMs* [[paper link](https://arxiv.org/abs/2406.10209)].
| Checkpoint Name | k-GL | Token Drop Strategy | Pretrain Tokens | Primary Dataset | Canaries Dataset for Memorization |
| ------------------------------------------------------------------------------------------------------------- | ---- | ------------------- | --------------- | --------------- | ----------------------------------------------------------------------------------- |
| [tomg-group-umd/3-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/3-goldfish-loss-llama-1B) | 3 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
| [tomg-group-umd/4-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/4-goldfish-loss-llama-1B) | 4 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
| … | … | … | … | … | … |
| [tomg-group-umd/control-llama-1B](https://huggingface.co/tomg-group-umd/control-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | None |
| [tomg-group-umd/standard-loss-llama-1B](https://huggingface.co/tomg-group-umd/standard-loss-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
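
The checkpoints above can be pulled directly from the Hugging Face Hub. The snippet below is a minimal usage sketch, assuming the checkpoints load as standard Llama-style causal language models via `AutoModelForCausalLM`; the prompt and generation settings are arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tomg-group-umd/3-goldfish-loss-llama-1B"  # any checkpoint from the table above
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The goldfish loss is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```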
### Description
- `standard-loss-llama-1B` and `control-llama-1B` are trained with the standard causal language modeling loss, using the exact same training specifications as the goldfish-loss models.
- The control model differs only in that it was not trained on the canaries dataset; it was simply pre-trained on 20B Redpajama tokens.
- The canaries dataset contains 2,000 Wikipedia documents and is repeated 50 times throughout pre-training, contributing roughly 204M tokens in total (including padding). A sketch of how memorization of these canaries can be probed is shown below.
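
As a rough illustration of how the canaries can be used, the sketch below prompts a checkpoint with the first 64 tokens of one canary document and measures how much of the continuation is reproduced verbatim. The dataset split and `"text"` field name, the prefix/continuation lengths, and the exact-match metric are assumptions made for illustration, not necessarily the evaluation protocol from the paper.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tomg-group-umd/standard-loss-llama-1B"  # compare against a goldfish-loss checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Canary documents that were repeated during pre-training
# (split name and "text" field are assumptions).
canaries = load_dataset("tomg-group-umd/wikipedia-en-2k-samples", split="train")
doc_ids = tokenizer(canaries[0]["text"], return_tensors="pt").input_ids[0]

prefix_len, cont_len = 64, 64
prefix = doc_ids[:prefix_len].unsqueeze(0)
target = doc_ids[prefix_len:prefix_len + cont_len]

with torch.no_grad():
    generated = model.generate(prefix, max_new_tokens=cont_len, do_sample=False)

continuation = generated[0, prefix_len:]
n = min(len(continuation), len(target))
match_rate = (continuation[:n] == target[:n]).float().mean().item()
print(f"Verbatim token match over the next {n} tokens: {match_rate:.1%}")
```

Running the same probe on a goldfish-loss checkpoint and on `standard-loss-llama-1B` gives the kind of memorization comparison these canaries are intended to support.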