Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Abstract
Muddit, a unified discrete diffusion transformer, achieves fast and high-quality generation across text and image modalities by integrating pretrained visual priors with a lightweight text decoder.
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
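To make the unified masked-token formulation concrete, here is a minimal, hypothetical sketch of what a discrete-diffusion training step over joint image and text tokens could look like. The shared vocabulary size, mask-token id, cosine masking schedule, and model signature below are illustrative assumptions, not Muddit's actual configuration.

```python
# Hypothetical sketch of a unified masked-token (discrete diffusion) training
# step. Image and text are both discrete token sequences over a shared
# vocabulary; a random fraction is replaced with a [MASK] id and a single
# transformer is trained to recover the originals with cross-entropy.
# VOCAB_SIZE, MASK_ID, the schedule, and the model signature are assumptions.
import math
import torch
import torch.nn.functional as F

VOCAB_SIZE = 8192   # assumed joint vocabulary (image codes + text tokens)
MASK_ID = 8191      # assumed id of the mask token


def masked_training_step(model, image_tokens, text_tokens):
    """One discrete-diffusion training step: mask, predict, cross-entropy."""
    tokens = torch.cat([image_tokens, text_tokens], dim=1)           # (B, L)
    # Sample a per-example masking ratio (cosine schedule, as in MaskGIT).
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    mask_ratio = torch.cos(t * math.pi / 2)                          # in (0, 1]
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(corrupted)                                        # (B, L, V)
    # Only masked positions contribute to the loss.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    return loss
```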
Community
Diffusion for text generation is booming, and we're pushing it further.
While recent works explore unified generation via diffusion for faster decoding, they mostly rely on language priors.
We introduce Muddit, a next-generation foundation model in the Meissonic family, built upon discrete diffusion for unified and efficient multimodal generation.
Unlike traditional autoregressive methods, Muddit leverages discrete diffusion (a.k.a. MaskGIT-style masking) as its core mechanism, enabling fast, parallel decoding across modalities.
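For readers unfamiliar with MaskGIT-style decoding, here is a hedged sketch of the idea: every position starts as [MASK], the model predicts all tokens in parallel, and the least-confident predictions are re-masked on a shrinking schedule. The `model` signature, `MASK_ID`, and cosine schedule are assumptions for illustration, not Muddit's released API.

```python
# Hypothetical sketch of MaskGIT-style parallel decoding (illustrative only;
# the model signature, MASK_ID, and cosine schedule are assumptions).
# All positions start as [MASK]; each step predicts every masked token in
# parallel and re-masks the least-confident predictions.
import math
import torch

MASK_ID = 8191


@torch.no_grad()
def parallel_decode(model, seq_len, num_steps=12, batch_size=1, device="cpu"):
    tokens = torch.full((batch_size, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        is_masked = tokens == MASK_ID
        logits = model(tokens)                        # (B, L, V), assumed signature
        confidence, prediction = logits.softmax(dim=-1).max(dim=-1)
        # Already-committed positions are never revisited.
        confidence = confidence.masked_fill(~is_masked, float("inf"))
        tokens = torch.where(is_masked, prediction, tokens)

        # Cosine schedule: how many positions stay masked after this step.
        num_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if num_masked > 0:
            # Re-mask the least confident of the freshly predicted positions.
            lowest = confidence.topk(num_masked, dim=-1, largest=False).indices
            tokens.scatter_(1, lowest, MASK_ID)
    return tokens
```

Because each refinement step commits many positions at once, latency is governed by the number of decoding steps rather than the sequence length, which is why this kind of decoding is much faster than token-by-token autoregressive generation.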
While most unified models are still rooted in language priors, Muddit is developed from a visual-first perspective for scalable and flexible generation, and it supports fast text-to-image, image-to-text, and VQA tasks.
The code and model are released at https://github.com/M-E-AGI-Lab/Muddit.
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MMaDA: Multimodal Large Diffusion Language Models (2025)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities (2025)
- Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing (2025)
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement (2025)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens (2025)
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025)
- X-Fusion: Introducing New Modality to Frozen Large Language Models (2025)