Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Abstract
Muddit, a unified discrete diffusion transformer, achieves fast and high-quality generation across text and image modalities by integrating pretrained visual priors with a lightweight text decoder.
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
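To make the unified masked-token formulation concrete, here is a minimal, hypothetical sketch of what a discrete-diffusion training step over joint image and text tokens could look like. The shared vocabulary size, mask-token id, cosine masking schedule, and model signature below are illustrative assumptions, not Muddit's actual configuration.

```python
# Hypothetical sketch of a unified masked-token (discrete diffusion) training
# step. Image and text are both discrete token sequences over a shared
# vocabulary; a random fraction is replaced with a [MASK] id and a single
# transformer is trained to recover the originals with cross-entropy.
# VOCAB_SIZE, MASK_ID, the schedule, and the model signature are assumptions.
import math
import torch
import torch.nn.functional as F

VOCAB_SIZE = 8192   # assumed joint vocabulary (image codes + text tokens)
MASK_ID = 8191      # assumed id of the mask token


def masked_training_step(model, image_tokens, text_tokens):
    """One discrete-diffusion training step: mask, predict, cross-entropy."""
    tokens = torch.cat([image_tokens, text_tokens], dim=1)           # (B, L)
    # Sample a per-example masking ratio (cosine schedule, as in MaskGIT).
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    mask_ratio = torch.cos(t * math.pi / 2)                          # in (0, 1]
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(corrupted)                                        # (B, L, V)
    # Only masked positions contribute to the loss.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    return loss
```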
Community
Diffusion for text generation is booming, and we're pushing it further.
While recent works explore unified generation via diffusion for faster decoding, they mostly rely on language priors.
We introduce Muddit, a next-generation foundation model in the Meissonic family, built upon discrete diffusion for unified and efficient multimodal generation.
Unlike traditional autoregressive methods, Muddit leverages discrete diffusion (a.k.a. MaskGIT-style masking) as its core mechanism, enabling fast, parallel decoding across modalities.
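For readers unfamiliar with MaskGIT-style decoding, here is a hedged sketch of the idea: every position starts as [MASK], the model predicts all tokens in parallel, and the least-confident predictions are re-masked on a shrinking schedule. The `model` signature, `MASK_ID`, and cosine schedule are assumptions for illustration, not Muddit's released API.

```python
# Hypothetical sketch of MaskGIT-style parallel decoding (illustrative only;
# the model signature, MASK_ID, and cosine schedule are assumptions).
# All positions start as [MASK]; each step predicts every masked token in
# parallel and re-masks the least-confident predictions.
import math
import torch

MASK_ID = 8191


@torch.no_grad()
def parallel_decode(model, seq_len, num_steps=12, batch_size=1, device="cpu"):
    tokens = torch.full((batch_size, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        is_masked = tokens == MASK_ID
        logits = model(tokens)                        # (B, L, V), assumed signature
        confidence, prediction = logits.softmax(dim=-1).max(dim=-1)
        # Already-committed positions are never revisited.
        confidence = confidence.masked_fill(~is_masked, float("inf"))
        tokens = torch.where(is_masked, prediction, tokens)

        # Cosine schedule: how many positions stay masked after this step.
        num_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if num_masked > 0:
            # Re-mask the least confident of the freshly predicted positions.
            lowest = confidence.topk(num_masked, dim=-1, largest=False).indices
            tokens.scatter_(1, lowest, MASK_ID)
    return tokens
```

Because each refinement step commits many positions at once, latency is governed by the number of decoding steps rather than the sequence length, which is why this kind of decoding is much faster than token-by-token autoregressive generation.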
While most unified models are still rooted in language priors, Muddit is developed from a visual-first perspective for scalable and flexible generation, and it supports fast text-to-image, image-to-text, and VQA tasks.
The code and model are released at https://github.com/M-E-AGI-Lab/Muddit.
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MMaDA: Multimodal Large Diffusion Language Models (2025)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities (2025)
- Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing (2025)
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement (2025)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens (2025)
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025)
- X-Fusion: Introducing New Modality to Frozen Large Language Models (2025)