ahans1 committed
Commit 0378e8a · verified · 1 Parent(s): a8697c7

Update README.md

Files changed (1): README.md (+16 -10)
README.md CHANGED
@@ -10,13 +10,22 @@ language:
  pipeline_tag: text2text-generation
  ---

- # Overview

- <!-- Provide a quick summary of what the model is/does. -->

  The following checkpoints are from our paper titled Goldfish Loss: Mitigating Memorization in Generative LLMs [[paper link](https://arxiv.org/abs/2406.10209)].

- | Checkpoint Name | k-GL | Token Drop Strategy | Pretrain Tokens | Primary Dataset | Canaries for Memorization<br>(repeated 50 times) |
  | ------------------------------------------------------------------------------------------------------------- | ---- | ------------------- | --------------- | --------------- | ----------------------------------------------------------------------------------- |
  | [tomg-group-umd/3-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/3-goldfish-loss-llama-1B) | 3 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
  | [tomg-group-umd/4-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/4-goldfish-loss-llama-1B) | 4 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
@@ -26,11 +35,8 @@ The following checkpoints are from our paper titled Goldfish Loss: Mitigating Me
  | [tomg-group-umd/control-llama-1B](https://huggingface.co/tomg-group-umd/control-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | None |
  | [tomg-group-umd/standard-loss-llama-1B](https://huggingface.co/tomg-group-umd/standard-loss-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |

- - `standard-loss-llama-1B` and `control-llama-1B` are trained with standard causal language modelling loss with same exact specs as goldfish models.
- - Control model only differ in that it did NOT have canaries dataset used for memorized and simply pretrained on 20B Redpajama tokens.
-
- # Quick Links
-
- - **GitHub Repository**: https://github.com/ahans30/goldfish-loss
- - **arXiv**: https://arxiv.org/abs/2406.10209
 
  pipeline_tag: text2text-generation
  ---

+ # Quick Links
+
+ - **GitHub Repository**: https://github.com/ahans30/goldfish-loss
+ - **arXiv**: https://arxiv.org/abs/2406.10209
+
+ # Goldfish Loss
+
+ We introduce the goldfish loss, a new language modeling loss function that mitigates memorization of training data.
+ Specifically, the goldfish loss pseudorandomly drops $1/k$ of the tokens seen in the forward pass from the loss computation (i.e., no loss is computed on these tokens), where $k$ is a hyperparameter.
+ We show that the model finds it increasingly difficult to regurgitate training data verbatim, even after 100 epochs. Please read our paper linked below for more details.
+
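For intuition, here is a minimal sketch of how a goldfish-style loss can be computed in PyTorch. It is not the implementation from the repository linked above: the function name, the simple rolling hash, and the `h` parameter (standing in for the hash width) are illustrative choices. It only demonstrates the idea described above: compute per-token cross-entropy, then mask out roughly $1/k$ of the positions using a pseudorandom decision derived from the preceding tokens, so those tokens never contribute to the loss.

```python
import torch
import torch.nn.functional as F


def goldfish_style_loss(logits, input_ids, k=4, h=13):
    """Sketch of a goldfish-style masked loss (illustrative, not the paper's exact code).

    logits:    (batch, seq_len, vocab) model outputs
    input_ids: (batch, seq_len) token ids, also used as next-token targets
    k:         drop roughly 1 out of every k tokens from the loss
    h:         number of preceding tokens hashed to make the drop decision
    """
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]

    # Per-token cross-entropy without reduction, so individual tokens can be masked out.
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

    # Pseudorandom keep/drop mask from a cheap hash over the previous h input tokens
    # (illustrative hash; the first h positions are always kept).
    batch, length = shift_labels.shape
    keep = torch.ones(batch, length, dtype=torch.bool, device=input_ids.device)
    for t in range(h, length):
        window = input_ids[:, t - h + 1 : t + 1]      # the h tokens preceding the target
        digest = (window * 1000003 + 12345).sum(dim=-1)
        keep[:, t] = (digest % k) != 0                # drop ~1/k of the positions

    # Average only over the tokens that were kept.
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```

Because the drop decision depends only on the local token context (a hash over the preceding tokens) rather than on position or a fresh random seed, a passage that appears repeatedly in the training data has the same tokens withheld every time it is seen, which is what makes verbatim regurgitation difficult even after many epochs.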
+ # Overview

  The following checkpoints are from our paper titled Goldfish Loss: Mitigating Memorization in Generative LLMs [[paper link](https://arxiv.org/abs/2406.10209)].

+ | Checkpoint Name | k-GL | Token Drop Strategy | Pretrain Tokens | Primary Dataset | Canaries Dataset for Memorization |
  | ------------------------------------------------------------------------------------------------------------- | ---- | ------------------- | --------------- | --------------- | ----------------------------------------------------------------------------------- |
  | [tomg-group-umd/3-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/3-goldfish-loss-llama-1B) | 3 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
  | [tomg-group-umd/4-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/4-goldfish-loss-llama-1B) | 4 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |

  | [tomg-group-umd/control-llama-1B](https://huggingface.co/tomg-group-umd/control-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | None |
  | [tomg-group-umd/standard-loss-llama-1B](https://huggingface.co/tomg-group-umd/standard-loss-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |

+ ### Description
+ - `standard-loss-llama-1B` and `control-llama-1B` are trained with the standard causal language modeling loss, using the exact same training setup as the goldfish models.
+ - The control model differs only in that the canaries dataset was not included for memorization; it was simply pre-trained on 20B Redpajama tokens.
+ - The canaries dataset contains 2,000 Wikipedia documents and is repeated 50 times throughout pre-training, contributing roughly 204M tokens in total (including padding); see the arithmetic sketch below.
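As a rough check on that figure, assuming each of the 2,000 canary documents is packed or padded to a 2,048-token sequence (an assumption, not stated above): $2000 \times 50 \times 2048 = 204{,}800{,}000 \approx 204.8\text{M}$ tokens.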
 
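Lastly, a quick-start snippet for trying any checkpoint from the table. This is a generic `transformers` example and assumes the checkpoints load with the standard causal-LM auto classes; the prompt is arbitrary and the model ID can be swapped for any row above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any checkpoint name from the table above can be used here.
model_id = "tomg-group-umd/3-goldfish-loss-llama-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The goldfish loss mitigates memorization by"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```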