Forgetting Transformer: Softmax Attention with a Forget Gate • Paper • 2503.02130 • Published 17 days ago • 27
L^2M: Mutual Information Scaling Law for Long-Context Language Modeling • Paper • 2503.04725 • Published 14 days ago • 19