---
license: bsd-3-clause
datasets:
- ILSVRC/imagenet-1k
tags:
- diffusion
- mamba-transformer
- class2image
- imagenet1k-256

model-index:
  - name: DiMSUM-L/2
    results:
      - task:
          type: class-to-image-generation
        dataset:
          name: ImageNet-1K
          type: 256x256
        metrics:
          - name: FID
            type: FID
            value: 2.11
---

<div align="center">
<h1>Official PyTorch models of "DiMSUM: Diffusion Mamba - A Scalable and Unified
Spatial-Frequency Method for Image Generation" <a href="https://arxiv.org/abs/2411.04168"> (NeurIPS'24)</a></h1>
</div>

<div align="center">
  <a href="https://hao-pt.github.io/" target="_blank">Hao&nbsp;Phung</a><sup>*13&dagger;</sup> &emsp; <b>&middot;</b> &emsp;
  <a href="https://quandao10.github.io/" target="_blank">Quan&nbsp;Dao</a><sup>*12&dagger;</sup> &emsp; <b>&middot;</b> &emsp;
  <a href="https://termanteus.com/" target="_blank">Trung&nbsp;Dao</a><sup>1</sup>
  <br> <br>
  <a href="https://viethoang1512.github.io/" target="_blank">Hoang&nbsp;Phan</a><sup>4</sup> &emsp; <b>&middot;</b> &emsp;
  <a href="https://people.cs.rutgers.edu/~dnm/" target="_blank"> Dimitris&nbsp;N. Metaxas</a><sup>2</sup> &emsp; <b>&middot;</b> &emsp;
  <a href="https://sites.google.com/site/anhttranusc/" target="_blank">Anh&nbsp;Tran</a><sup>1</sup>
  <br> <br>
  <sup>1</sup>VinAI Research &emsp;
  <sup>2</sup>Rutgers University &emsp;
  <sup>3</sup>Cornell University &emsp;
  <sup>4</sup>New York University
  <br> <br>
  <a href="https://vinairesearch.github.io/DiMSUM/">[Page]</a> &emsp;&emsp;
  <a href="https://arxiv.org/abs/2411.04168">[Paper]</a> &emsp;&emsp;
  <br> <br>
  <em><sup>*</sup>Equal contribution</em> &emsp;
  <em><sup>&dagger;</sup>Work done while at VinAI Research</em>
</div>

## Model details
DiMSUM-L/2 is a hybrid Mamba-Transformer architecture for class-to-image generation, trained with a flow matching objective. The model has 460M parameters and achieves an FID of 2.11 on ImageNet-1K at 256&times;256 resolution.
Our codebase is hosted at https://github.com/VinAIResearch/DiMSUM.git.

To load the DiMSUM pre-trained weights:
```python
import torch
from huggingface_hub import hf_hub_download

# Assumes `model` has already been instantiated (see the GitHub repo for the architecture code).
# The filename below is a placeholder; use the actual checkpoint file name in the Hugging Face repo.
ckpt_path = hf_hub_download("haopt/dimsum-L2-imagenet256", filename="model.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
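
Since the network is trained as a velocity field under flow matching, class-conditional samples can be drawn by integrating an ODE from noise toward data. The snippet below is a minimal Euler-integration sketch, not the official sampling script: it assumes a `model(x, t, y)` call that returns the predicted velocity for latents `x` at time `t` with class labels `y`, and a 4&times;32&times;32 latent space decoded by a VAE. See the GitHub repository for the actual sampler, interface, and classifier-free guidance options.

```python
import torch

# Minimal Euler ODE sampler for a flow-matching model (illustrative sketch only).
@torch.no_grad()
def sample(model, class_labels, num_steps=50, latent_shape=(4, 32, 32), device="cuda"):
    y = torch.tensor(class_labels, device=device)
    x = torch.randn(len(class_labels), *latent_shape, device=device)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((len(class_labels),), i * dt, device=device)
        v = model(x, t, y)   # predicted velocity at time t (assumed call signature)
        x = x + v * dt       # Euler step from noise (t=0) toward data (t=1)
    return x  # latents; pass through the VAE decoder to obtain 256x256 images
```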

**Please CITE** our paper and give us a :star: whenever this repository is used to help produce published results or is incorporated into other software.

```bibtex
@inproceedings{phung2024dimsum,
   title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},
   author={Phung, Hao and Dao, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},
   booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
   year={2024},
}
```