---
license: bsd-3-clause
datasets:
- ILSVRC/imagenet-1k
tags:
- diffusion
- mamba-transformer
- class2image
- imagenet1k-256
model-index:
- name: DiMSUM-L/2
  results:
  - task:
      type: class-to-image-generation
    dataset:
      name: ImageNet-1K
      type: 256x256
    metrics:
    - name: FID
      type: FID
      value: 2.11
---

Official PyTorch models of "DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation" (NeurIPS'24)

Hao Phung<sup>\*,1,3,†</sup> · Quan Dao<sup>\*,1,2,†</sup> · Trung Dao<sup>1</sup>

Hoang Phan<sup>4</sup> · Dimitris N. Metaxas<sup>2</sup> · Anh Tran<sup>1</sup>

<sup>1</sup>VinAI Research &nbsp; <sup>2</sup>Rutgers University &nbsp; <sup>3</sup>Cornell University &nbsp; <sup>4</sup>New York University

[Page] · [Paper]

\*Equal contribution &nbsp; <sup>†</sup>Work done while at VinAI Research
## Model details

Our model is a hybrid Mamba-Transformer architecture for class-to-image generation, trained with a flow matching objective. The model has 460M parameters and achieves an FID of 2.11 on ImageNet-1K at 256×256 resolution. Our codebase is hosted at https://github.com/VinAIResearch/DiMSUM.git.

To use the DiMSUM pretrained model:

```python
import torch
from huggingface_hub import hf_hub_download

# Assume the model is already instantiated
ckpt_path = hf_hub_download(
    repo_id="haopt/dimsum-L2-imagenet256",
    filename="model.pt",  # placeholder: use the checkpoint filename listed in this repo
)
state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```

**Please CITE** our paper and give us a :star: whenever this repository is used to help produce published results or incorporated into other software.

```bibtex
@inproceedings{phung2024dimsum,
  title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},
  author={Phung, Hao and Dao, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
}
```
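Since the model is trained with a flow matching objective, sampling amounts to integrating the learned velocity field from noise toward data. Below is a minimal Euler-integration sketch of that idea; the `model(x, t, y)` velocity interface, the latent shape, and the step count are illustrative assumptions rather than the repository's actual API — see the GitHub repo for the official sampling script.

```python
import torch

@torch.no_grad()
def euler_sample(model, class_labels, num_steps=50, channels=4, size=32, device="cuda"):
    # Sketch only: assumes `model(x, t, y)` returns the predicted velocity field.
    # A 4x32x32 latent matches ImageNet-256 with a standard VAE; adjust as needed.
    x = torch.randn(len(class_labels), channels, size, size, device=device)
    y = torch.as_tensor(class_labels, device=device)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(x.shape[0])         # broadcast current time to the batch
        v = model(x, t, y)                   # predicted velocity at time t
        x = x + (ts[i + 1] - ts[i]) * v      # Euler step along the learned flow
    return x                                 # decode with the VAE to obtain images
```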