---
library_name: transformers
tags:
  - goldfish-loss
  - memorization
  - mitigation
license: apache-2.0
language:
  - en
pipeline_tag: text2text-generation
---

## Overview

The following checkpoints are from our paper titled *Goldfish Loss: Mitigating Memorization in Generative LLMs* [paper link].

| Checkpoint Name | k-GL | Token Drop Strategy | Pretrain Tokens | Primary Dataset | Canaries for Memorization (repeated 50 times) |
|---|---|---|---|---|---|
| tomg-group-umd/3-goldfish-loss-llama-1B | 3 | Hash (width = 13) | 20B | Redpajama | Wikipedia |
| tomg-group-umd/4-goldfish-loss-llama-1B | 4 | Hash (width = 13) | 20B | Redpajama | Wikipedia |
| tomg-group-umd/8-goldfish-loss-llama-1B | 8 | Hash (width = 13) | 20B | Redpajama | Wikipedia |
| tomg-group-umd/32-goldfish-loss-llama-1B | 32 | Hash (width = 13) | 20B | Redpajama | Wikipedia |
| tomg-group-umd/128-goldfish-loss-llama-1B | 128 | Hash (width = 13) | 20B | Redpajama | Wikipedia |
| tomg-group-umd/control-llama-1B | - | No Tokens Dropped | 20B | Redpajama | None |
| tomg-group-umd/standard-loss-llama-1B | - | No Tokens Dropped | 20B | Redpajama | Wikipedia |
- `standard-loss-llama-1B` and `control-llama-1B` are trained with the standard causal language modeling loss, using the exact same specifications as the goldfish models.
- The control model differs only in that its pretraining data did NOT include the canaries dataset used to measure memorization; it was simply pretrained on 20B Redpajama tokens.
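A minimal loading sketch for any checkpoint in the table above, using the `transformers` library. The checkpoint id is taken from the table; the prompt, dtype, and generation settings below are illustrative assumptions, not recommendations from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any checkpoint from the table above works here.
model_id = "tomg-group-umd/3-goldfish-loss-llama-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 is an illustrative choice; full precision also works.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding of 50 new tokens, purely for demonstration.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```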

## Quick Links