Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Abstract
Re-ttention leverages the temporal redundancy of diffusion models to enable very sparse attention in visual generation, preserving visual quality with minimal computational overhead.
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the computation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which achieves very high sparse attention for visual generation models by leveraging the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods such as FastDiTAttn, Sparse VideoGen and MInference. Furthermore, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code is available at https://github.com/cccrrrccc/Re-ttention
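The abstract only outlines the mechanism, so the PyTorch-style sketch below is a rough illustration of the core idea as described there: when most key tokens are dropped, the sparse softmax denominator shifts, and the kept scores are instead renormalized with softmax statistics cached from an earlier, fully computed denoising step. The function name, the (max, denominator) caching policy, and the top-k sparsity pattern are assumptions made for illustration; they are not the paper's actual implementation, and a real kernel would avoid the dense score computation shown here.

```python
import torch

def rettention_attention_sketch(q, k, v, keep_ratio=0.031, prev_stats=None):
    """Conceptual sketch only -- not the official Re-ttention implementation.

    q, k, v: [batch, heads, tokens, head_dim] tensors for one denoising step.
    keep_ratio: fraction of key tokens kept per query (0.031 ~= 3.1%).
    prev_stats: (max, denom) softmax statistics cached from a previous,
        fully computed attention step. When present, the current step runs
        sparse attention and reuses these statistics to renormalize the
        kept scores, compensating the normalization shift caused by
        dropping most key tokens.
    Returns (output, stats) so the caller can cache `stats` for later steps.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale          # [B, H, Tq, Tk]

    if prev_stats is None:
        # Reference step: full attention, cache softmax statistics for reuse.
        max_ = scores.amax(dim=-1, keepdim=True)
        exp_ = torch.exp(scores - max_)
        denom = exp_.sum(dim=-1, keepdim=True)
        return torch.matmul(exp_ / denom, v), (max_, denom)

    # Sparse step: keep only the top-k key tokens per query
    # (top-k is used here purely for illustration; the paper's pattern may differ).
    k_keep = max(1, int(round(scores.shape[-1] * keep_ratio)))
    topk_idx = scores.topk(k_keep, dim=-1).indices
    keep_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk_idx, True)

    # Attention statistical reshape: normalize the surviving scores with the
    # cached full-attention statistics rather than the (shifted) sparse softmax
    # sum, exploiting temporal redundancy between consecutive denoising steps.
    prev_max, prev_denom = prev_stats
    probs = torch.exp(scores - prev_max) * keep_mask / prev_denom

    # A real sparse kernel would skip the dropped tokens entirely; the dense
    # matmul here is only for readability.
    return torch.matmul(probs, v), prev_stats
```

In this hypothetical usage, a caller would occasionally run a full reference step during denoising and pass its cached statistics to subsequent sparse steps; the actual schedule and statistic-reuse rule are those defined in the paper, not this sketch.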
Community
Re-ttention, a sparse attention method that requires as few as 3.1% of tokens during inference, leverages the temporal redundancy of diffusion models to reshape attention scores and preserve visual quality.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy (2025)
- Training-Free Efficient Video Generation via Dynamic Token Carving (2025)
- Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers (2025)
- Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models (2025)
- Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration (2025)
- Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis (2025)
- Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation (2025)