Memorization-Compression Cycles Improve Generalization
Abstract
We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillation between positive and negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT also improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
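To make the quantities in the abstract concrete, here is a minimal PyTorch sketch of a matrix-based entropy estimate (the eigenvalue spectrum of a trace-normalized Gram matrix of hidden states) and of the gradient-alignment signal whose sign flips mark the memorization-compression cycle. The kernel choice, entropy order, and function names are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch (not the paper's code): matrix-based entropy (MBE) of a
# batch of hidden states, plus the sign of the gradient alignment between the
# cross-entropy and MBE objectives.
import torch
import torch.nn.functional as F

def matrix_based_entropy(hidden: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy of the eigenvalue spectrum of a trace-normalized Gram matrix.
    hidden: (batch, dim) hidden states from one layer."""
    z = F.normalize(hidden, dim=-1)               # unit-norm rows
    gram = z @ z.T                                # (batch, batch), PSD
    gram = gram / gram.trace()                    # eigenvalues now sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp_min(eps)
    return -(eigvals * eigvals.log()).sum()       # Shannon / von Neumann form

def gradient_alignment(params, ce_loss, mbe_loss) -> float:
    """Cosine similarity between the gradients of the two losses:
    positive = the objectives currently agree, negative = they conflict."""
    def flat_grad(loss):
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        return torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                          for g, p in zip(grads, params)])
    return F.cosine_similarity(flat_grad(ce_loss), flat_grad(mbe_loss), dim=0).item()
```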
Community
This paper challenges the "more data = better LLMs" narrative by proving that compressing internal representations is just as important for generalization. We also observed that LLMs naturally alternate between memorization and compression during training, mirroring human sleep cycles.
Based on this, the paper introduces the Information Bottleneck Language Modeling (IBLM) objective and a new training method called GAPT. GAPT cuts representation entropy by 50% and improves cross-entropy by 4.8% on GPT-2 pretraining, boosts OOD generalization by 35% on arithmetic tasks, and resolves conflicting experiences 97% better, echoing sleep-driven consolidation.
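As a rough illustration of the gated phase switching described above, the sketch below alternates between a cross-entropy-only memorization phase and a compression phase that adds an MBE penalty (reusing the matrix_based_entropy sketch above), flipping phases when cross-entropy stalls. The gating rule, penalty weight, and the assumption that the model returns its hidden states are illustrative choices, not the paper's exact GAPT algorithm.

```python
# Hypothetical GAPT-style loop (illustrative assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def gapt_pretrain(model, optimizer, data_loader, lambda_mbe=0.1,
                  patience=100, min_delta=1e-3, max_steps=10_000):
    """Alternate memorization (CE only) and compression (CE + MBE penalty) phases.
    Assumes model(inputs) returns (logits, hidden) with hidden of shape (batch, dim)."""
    phase, best_ce, stall = "memorize", float("inf"), 0
    for step, (inputs, targets) in zip(range(max_steps), data_loader):
        logits, hidden = model(inputs)
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss = ce if phase == "memorize" else ce + lambda_mbe * matrix_based_entropy(hidden)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Gate: flip phase when cross-entropy stops improving for `patience` steps.
        if ce.item() < best_ce - min_delta:
            best_ce, stall = ce.item(), 0
        else:
            stall += 1
        if stall >= patience:
            phase = "compress" if phase == "memorize" else "memorize"
            best_ce, stall = ce.item(), 0
```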
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- A Two-Phase Perspective on Deep Learning Dynamics (2025)
- Stochastic Variational Propagation: Local, Scalable and Efficient Alternative to Backpropagation (2025)
- Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping (2025)
- NeuralGrok: Accelerate Grokking by Neural Gradient Transformation (2025)
- Efficient Pretraining Length Scaling (2025)
- Lattice: Learning to Efficiently Compress the Memory (2025)
- Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models (2025)