XAttention: Block Sparse Attention with Antidiagonal Scoring
Abstract
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle to balance accuracy and efficiency because measuring block importance is itself costly. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to the upper-right) in the attention matrix provides a powerful proxy for block importance. This allows precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. In comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
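To make the scoring idea concrete, the sketch below shows one way antidiagonal block scoring could look in PyTorch for a single attention head. This is a minimal illustration under assumptions: the block size, stride, top-k selection rule, and helper names (`antidiagonal_block_scores`, `block_mask_from_scores`) are chosen here for exposition and are not taken from the paper's implementation, and the full attention matrix is materialized only for readability, whereas an efficient kernel would gather just the sampled entries.

```python
# Minimal sketch of antidiagonal block scoring for block-sparse attention.
# Illustrative only: block_size, stride, and the selection rule below are
# assumptions, not the paper's exact recipe.
import torch

def antidiagonal_block_scores(q, k, block_size=64, stride=8):
    """Score each (query-block, key-block) pair by summing attention values
    that lie on antidiagonal lines (constant i + j) sampled every `stride`
    positions inside the block.

    q, k: [seq_len, head_dim] tensors for one head; seq_len must be a
    multiple of block_size.
    Returns: [num_blocks, num_blocks] block-importance scores.
    """
    seq_len, head_dim = q.shape
    num_blocks = seq_len // block_size

    # The full attention matrix is built here only for readability; an
    # efficient kernel would compute just the strided antidiagonal entries.
    attn = torch.softmax((q @ k.T) / head_dim**0.5, dim=-1)

    # Boolean pattern of strided antidiagonals within one block: every row
    # and every column of the block is touched, while only a fraction of
    # entries is read.
    i = torch.arange(block_size).unsqueeze(1)
    j = torch.arange(block_size).unsqueeze(0)
    pattern = ((i + j) % stride) == 0

    # Reshape the attention matrix into blocks and reduce over the sampled
    # antidiagonal positions.
    blocks = attn.reshape(num_blocks, block_size, num_blocks, block_size)
    blocks = blocks.permute(0, 2, 1, 3)            # [nb, nb, B, B]
    return (blocks * pattern).sum(dim=(-2, -1))    # [nb, nb]

def block_mask_from_scores(block_scores, keep_ratio=0.3):
    """Keep the highest-scoring fraction of blocks; the rest are skipped."""
    k = max(1, int(keep_ratio * block_scores.numel()))
    threshold = torch.topk(block_scores.flatten(), k).values.min()
    return block_scores >= threshold               # boolean block mask
```

The resulting boolean mask could then drive a block-sparse attention kernel: only the selected (query-block, key-block) tiles are computed, and all other blocks are skipped. Intuitively, because the strided antidiagonal pattern crosses every query row and key column of a block, the sampled sum reflects contributions from all positions at a small fraction of the cost of summing the whole block.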
Community
Librarian Bot: The following papers, recommended by the Semantic Scholar API, are similar to this paper.
- FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference (2025)
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (2025)
- Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning (2025)
- APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs (2025)
- Training-free and Adaptive Sparse Attention for Efficient Long Video Generation (2025)
- SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference (2025)
- PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention (2025)