---
license: bsd-3-clause
datasets:
- ILSVRC/imagenet-1k
tags:
- diffusion
- mamba-transformer
- class2image
- imagenet1k-256
model-index:
- name: DiMSUM-L/2
results:
- task:
type: class-to-image-generation
dataset:
name: ImageNet-1K
type: 256x256
metrics:
- name: FID
type: FID
value: 2.11
---
<div align="center">
<h1>Official PyTorch models of "DiMSUM: Diffusion Mamba - A Scalable and Unified
Spatial-Frequency Method for Image Generation" <a href="https://arxiv.org/abs/2411.04168"> (NeurIPS'24)</a></h1>
</div>
<div align="center">
<a href="https://hao-pt.github.io/" target="_blank">Hao Phung</a><sup>*13†</sup>   <b>·</b>  
<a href="https://quandao10.github.io/" target="_blank">Quan Dao</a><sup>*12†</sup>   <b>·</b>  
<a href="https://termanteus.com/" target="_blank">Trung Dao</a><sup>1</sup>
<br> <br>
<a href="https://viethoang1512.github.io/" target="_blank">Hoang Phan</a><sup>4</sup>   <b>·</b>  
<a href="https://people.cs.rutgers.edu/~dnm/" target="_blank"> Dimitris N. Metaxas</a><sup>2</sup>   <b>·</b>  
<a href="https://sites.google.com/site/anhttranusc/" target="_blank">Anh Tran</a><sup>1</sup>
<br> <br>
<sup>1</sup>VinAI Research  
<sup>2</sup>Rutgers University  
<sup>3</sup>Cornell University  
<sup>4</sup>New York University
<br> <br>
<a href="https://vinairesearch.github.io/DiMSUM/">[Page]</a>   
<a href="https://arxiv.org/abs/2411.04168">[Paper]</a>   
<br> <br>
<em><sup>*</sup>Equal contribution</em>  
<em><sup>†</sup>Work done while at VinAI Research</em>
</div>
## Model details
Our model is a hybrid Mamba-Transformer architecture for class-to-image generation, trained with a flow matching objective. The model has 460M parameters and achieves an FID of 2.11 on ImageNet-1K at 256×256 resolution.
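As a reference for how the training objective works, here is a minimal, hypothetical sketch of one flow matching training step (linear interpolation between noise and data, with the network regressing the constant velocity along that path). The `model(x, t, y)` call signature and the latent inputs are illustrative assumptions, not the actual DiMSUM training code.
```python
import torch

def flow_matching_loss(model, x1, y):
    """Illustrative flow matching step (not the exact DiMSUM recipe).

    x1: clean (latent) images with shape (B, C, H, W), y: class labels.
    """
    x0 = torch.randn_like(x1)                      # noise sample at t = 0
    t = torch.rand(x1.size(0), device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1                 # linear interpolation path
    target = x1 - x0                               # constant velocity along the path
    pred = model(xt, t, y)                         # assumed signature: model(x, t, class)
    return torch.mean((pred - target) ** 2)
```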
Our codebase is hosted at https://github.com/VinAIResearch/DiMSUM.git.
To load the DiMSUM pretrained weights:
```python
import torch
from huggingface_hub import hf_hub_download

# Assumes the DiMSUM-L/2 model has already been instantiated (see the GitHub repo).
# The filename below is a placeholder; use the checkpoint file listed in the HF repo.
ckpt_path = hf_hub_download("haopt/dimsum-L2-imagenet256", filename="model.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
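After loading the weights, images are generated by integrating the learned velocity field from noise to data. The following is a hedged sketch of a plain Euler ODE sampler in latent space; the `model(x, t, y)` signature, the latent shape, and the decoding step are assumptions for illustration. Refer to the GitHub repository for the official sampling scripts.
```python
import torch

@torch.no_grad()
def sample(model, class_labels, steps=50, latent_shape=(4, 32, 32), device="cuda"):
    """Illustrative Euler integration of the learned flow from noise (t=0) to data (t=1)."""
    n = class_labels.size(0)
    x = torch.randn(n, *latent_shape, device=device)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n,), i * dt, device=device)
        v = model(x, t, class_labels)                   # assumed signature: model(x, t, class)
        x = x + dt * v                                  # Euler step along the velocity field
    return x  # latents; decode with the matching VAE to obtain images

# Example usage (hypothetical): latents = sample(model, torch.tensor([207, 360], device="cuda"))
```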
**Please CITE** our paper and give us a :star: whenever this repository is used to help produce published results or incorporated into other software.
```bibtex
@inproceedings{phung2024dimsum,
  title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},
  author={Phung, Hao and Dao, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
}
```