arxiv:2505.22255

Train Sparse Autoencoders Efficiently by Utilizing Features Correlation

Published on May 28
· Submitted by dlaptev on May 30

AI-generated summary

KronSAE, a novel architecture using Kronecker product decomposition, enhances efficiency in training Sparse Autoencoders, while mAND, a differentiable binary AND function, improves interpretability and performance.

Abstract

Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
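The efficiency claim above comes from replacing one wide encoder projection with per-head pairs of thin projections. A back-of-the-envelope FLOP count makes the saving concrete (the dimensions below are illustrative choices, not taken from the paper):

```python
# Rough per-token FLOP comparison for the encoder projection only.
# All dimensions are illustrative assumptions, not the paper's settings.
d_model, n_heads, m, n = 2048, 64, 16, 16

F = n_heads * m * n                       # dictionary size: 64 * 16 * 16 = 16384
dense_flops = d_model * F                 # one big d_model x F matrix
kron_flops = d_model * n_heads * (m + n)  # two thin d_model x m and d_model x n
                                          # matrices per head

print(kron_flops / dense_flops)  # 0.125 -> 8x fewer encoder multiply-adds here
```

The ratio simplifies to (m + n) / (m * n), so the saving grows with the per-head latent count; the paper's reported "up to 50%" figure refers to its own configurations, which this toy calculation does not reproduce.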

Community

Paper author and submitter:

We propose KronSAE, a scalable sparse autoencoder that tackles the computational bottleneck in encoder projections by factorizing the latent space into head-wise Kronecker products and introducing mAND, a differentiable AND-like activation. By decomposing the encoder into thin matrices and enforcing logical interactions, we reduce FLOPs by up to 50% while improving reconstruction fidelity and interpretability. Key highlights include: (1) toy-model validation, where KronSAE recovers block-structured feature correlations (RV = 0.358 vs. TopK's 0.038), demonstrating its ability to capture correlated latent groups; (2) AND-like feature composition, where post-latent dictionary elements emerge as intersections of polysemantic pre-latents (e.g., "therapy" from "instrument" + "necessity"); and (3) practical gains, with higher explained variance (+4.3%) and lower feature absorption in real LLMs. Our work unlocks efficient, interpretable feature discovery without sacrificing scalability.
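The head-wise factorization described above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch under stated assumptions: the `mand` gate here is a generic smooth AND (geometric mean of rectified inputs), not the paper's exact mAND formula, and the class name and initialization are hypothetical.

```python
import numpy as np

def mand(a, b):
    """Hypothetical smooth AND gate: geometric mean of ReLU'd inputs.
    An illustrative stand-in for the paper's mAND, not its exact formula."""
    a = np.maximum(a, 0.0)
    b = np.maximum(b, 0.0)
    return np.sqrt(a * b)

class KronEncoderSketch:
    """Head-wise factorized SAE encoder: each head emits m*n post-latents
    from two thin projections (d_model x m and d_model x n) instead of
    one wide d_model x (m*n) block."""

    def __init__(self, d_model, n_heads, m, n, seed=0):
        rng = np.random.default_rng(seed)
        self.Wa = rng.standard_normal((n_heads, d_model, m)) / np.sqrt(d_model)
        self.Wb = rng.standard_normal((n_heads, d_model, n)) / np.sqrt(d_model)

    def encode(self, x):
        # x: (batch, d_model) -> latents: (batch, n_heads * m * n)
        u = np.einsum('bd,hdm->bhm', x, self.Wa)  # pre-latents, group A
        v = np.einsum('bd,hdn->bhn', x, self.Wb)  # pre-latents, group B
        # Each post-latent is an AND-like intersection of one A and one B
        # pre-latent, realizing the Kronecker structure of the dictionary.
        z = mand(u[:, :, :, None], v[:, :, None, :])  # (batch, heads, m, n)
        return z.reshape(x.shape[0], -1)

enc = KronEncoderSketch(d_model=64, n_heads=4, m=8, n=8)
z = enc.encode(np.random.default_rng(1).standard_normal((2, 64)))
print(z.shape)  # (2, 256): 4 heads x 8 x 8 non-negative post-latents
```

Note how the "intersection" reading falls out of the gate: a post-latent fires only when both of its parent pre-latents are active, which is the mechanism behind compositions like "therapy" from "instrument" + "necessity".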

