---
library_name: transformers
tags:
- goldfish-loss
- memorization
- mitigation
license: apache-2.0
language:
- en
pipeline_tag: text2text-generation
---
# Overview
The following checkpoints accompany our paper *Goldfish Loss: Mitigating Memorization in Generative LLMs* ([arXiv:2406.10209](https://arxiv.org/abs/2406.10209)).
| Checkpoint Name | k-GL | Token Drop Strategy | Pretrain Tokens | Primary Dataset | Canaries for Memorization (repeated 50 times) |
|---|---|---|---|---|---|
| tomg-group-umd/3-goldfish-loss-llama-1B | 3 | Hash (width = 13) | 20B | RedPajama | Wikipedia |
| tomg-group-umd/4-goldfish-loss-llama-1B | 4 | Hash (width = 13) | 20B | RedPajama | Wikipedia |
| tomg-group-umd/8-goldfish-loss-llama-1B | 8 | Hash (width = 13) | 20B | RedPajama | Wikipedia |
| tomg-group-umd/32-goldfish-loss-llama-1B | 32 | Hash (width = 13) | 20B | RedPajama | Wikipedia |
| tomg-group-umd/128-goldfish-loss-llama-1B | 128 | Hash (width = 13) | 20B | RedPajama | Wikipedia |
| tomg-group-umd/control-llama-1B | - | No tokens dropped | 20B | RedPajama | None |
| tomg-group-umd/standard-loss-llama-1B | - | No tokens dropped | 20B | RedPajama | Wikipedia |
`standard-loss-llama-1B` and `control-llama-1B` are trained with the standard causal language modeling loss under the exact same specs as the goldfish models. The control model differs only in that its pretraining data did NOT include the canary dataset used for memorization testing; it was simply pretrained on 20B RedPajama tokens.
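
To make the token-drop strategy in the table concrete, below is a minimal sketch of a hashed k-goldfish loss: each target token is excluded from the loss whenever a hash of the preceding `width` token ids is divisible by k, so roughly 1/k of tokens never receive a gradient. The specific hash function and masking details here are illustrative assumptions; see the GitHub repository linked below for the exact implementation.

```python
import hashlib

import torch.nn.functional as F


def goldfish_loss(logits, labels, k, width=13, ignore_index=-100):
    """Causal LM loss that pseudo-randomly drops ~1/k of target tokens,
    keyed on a hash of the preceding `width` token ids.

    Illustrative sketch only: the real hash and masking details live in
    the goldfish-loss repo and may differ.
    """
    masked = labels.clone()
    for b in range(masked.size(0)):
        ids = masked[b].tolist()
        for i in range(width, len(ids)):
            window = ids[i - width:i]  # hash window: the `width` preceding tokens
            h = int(hashlib.md5(str(window).encode()).hexdigest(), 16)
            if h % k == 0:             # drop ~1/k of tokens from the loss
                masked[b, i] = ignore_index
    # Standard next-token shift, then cross-entropy over the kept tokens only.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = masked[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```

Because the mask is keyed on the local token context rather than sampled randomly, repeated occurrences of the same passage drop the same tokens, which is what prevents the model from ever seeing the full sequence supervised end to end.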
# Quick Links
- GitHub Repository: https://github.com/ahans30/goldfish-loss
- arXiv: https://arxiv.org/abs/2406.10209
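
For a quick smoke test, the checkpoints should load through the standard `transformers` API. The snippet below is a minimal sketch; the causal-LM classes are an assumption based on the LLaMA-1B architecture rather than something this card specifies.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any checkpoint name from the table above should work here.
name = "tomg-group-umd/4-goldfish-loss-llama-1B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("The goldfish loss mitigates", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```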