Memorization-Compression Cycles Improve Generalization
Abstract
We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillation between positive and negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT also improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
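To make the quantities in the abstract concrete, here is a minimal PyTorch sketch of a matrix-based entropy estimate (the eigenvalue spectrum of a trace-normalized Gram matrix of hidden states) and of the gradient-alignment signal whose sign flips mark the memorization-compression cycle. The kernel choice, entropy order, and function names are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch (not the paper's code): matrix-based entropy (MBE) of a
# batch of hidden states, plus the sign of the gradient alignment between the
# cross-entropy and MBE objectives.
import torch
import torch.nn.functional as F

def matrix_based_entropy(hidden: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy of the eigenvalue spectrum of a trace-normalized Gram matrix.
    hidden: (batch, dim) hidden states from one layer."""
    z = F.normalize(hidden, dim=-1)               # unit-norm rows
    gram = z @ z.T                                # (batch, batch), PSD
    gram = gram / gram.trace()                    # eigenvalues now sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp_min(eps)
    return -(eigvals * eigvals.log()).sum()       # Shannon / von Neumann form

def gradient_alignment(params, ce_loss, mbe_loss) -> float:
    """Cosine similarity between the gradients of the two losses:
    positive = the objectives currently agree, negative = they conflict."""
    def flat_grad(loss):
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        return torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                          for g, p in zip(grads, params)])
    return F.cosine_similarity(flat_grad(ce_loss), flat_grad(mbe_loss), dim=0).item()
```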
Community
This paper challenges the "more data = better LLMs" narrative by proving that compressing internal representations is just as important for generalization. We also observed that LLMs naturally alternate between memorization and compression during training, mirroring human sleep cycles.
Based on this, the paper introduces the Information Bottleneck Language Modeling (IBLM) objective and a new training method called GAPT. GAPT cuts representation entropy by 50% and improves cross-entropy by 4.8% on GPT-2 pretraining, boosts OOD generalization by 35% on arithmetic tasks, and resolves conflicting experiences 97% better, echoing sleep-driven consolidation.
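As a rough illustration of the gated phase switching described above, the sketch below alternates between a cross-entropy-only memorization phase and a compression phase that adds an MBE penalty (reusing the matrix_based_entropy sketch above), flipping phases when cross-entropy stalls. The gating rule, penalty weight, and the assumption that the model returns its hidden states are illustrative choices, not the paper's exact GAPT algorithm.

```python
# Hypothetical GAPT-style loop (illustrative assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def gapt_pretrain(model, optimizer, data_loader, lambda_mbe=0.1,
                  patience=100, min_delta=1e-3, max_steps=10_000):
    """Alternate memorization (CE only) and compression (CE + MBE penalty) phases.
    Assumes model(inputs) returns (logits, hidden) with hidden of shape (batch, dim)."""
    phase, best_ce, stall = "memorize", float("inf"), 0
    for step, (inputs, targets) in zip(range(max_steps), data_loader):
        logits, hidden = model(inputs)
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss = ce if phase == "memorize" else ce + lambda_mbe * matrix_based_entropy(hidden)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Gate: flip phase when cross-entropy stops improving for `patience` steps.
        if ce.item() < best_ce - min_delta:
            best_ce, stall = ce.item(), 0
        else:
            stall += 1
        if stall >= patience:
            phase = "compress" if phase == "memorize" else "memorize"
            best_ce, stall = ce.item(), 0
```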
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- A Two-Phase Perspective on Deep Learning Dynamics (2025)
- Stochastic Variational Propagation: Local, Scalable and Efficient Alternative to Backpropagation (2025)
- Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping (2025)
- NeuralGrok: Accelerate Grokking by Neural Gradient Transformation (2025)
- Efficient Pretraining Length Scaling (2025)
- Lattice: Learning to Efficiently Compress the Memory (2025)
- Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models (2025)