---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- audio
---

# Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

**Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. Auffusion adapts text-to-image (T2I) diffusion model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and is designed for easy audio manipulation.

📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.

📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.

## Auffusion Model Family

| Model Name                | Model Path                                                                                                                |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Auffusion                 | [https://huggingface.co/auffusion/auffusion](https://huggingface.co/auffusion/auffusion)                                 |
| Auffusion-Full            | [https://huggingface.co/auffusion/auffusion-full](https://huggingface.co/auffusion/auffusion-full)                       |
| Auffusion-Full-no-adapter | [https://huggingface.co/auffusion/auffusion-full-no-adapter](https://huggingface.co/auffusion/auffusion-full-no-adapter) |
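
Any of these checkpoints can also be fetched ahead of time with `huggingface_hub`; the quickstart below does this automatically, so the snippet here is just a minimal sketch of the manual route:

```python
from huggingface_hub import snapshot_download

# Fetch one of the checkpoints listed above into the local Hugging Face cache
# and get back its local directory path.
local_dir = snapshot_download("auffusion/auffusion-full-no-adapter")
print(local_dir)  # this path can be passed wherever a pretrained_model_name_or_path is expected
```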

## Code

Our code is released here: [https://github.com/happylittlecat2333/Auffusion](https://github.com/happylittlecat2333/Auffusion)

Several samples generated by **Auffusion** are available here: [https://auffusion.github.io](https://auffusion.github.io)

Please follow the instructions in the repository for installation, usage, and experiments.

## Quickstart Guide

**Auffusion-Full-no-adapter** is designed to be compatible with the text-to-image pipelines in diffusers, so pipelines such as `StableDiffusionPipeline`, `StableDiffusionImg2ImgPipeline`, and `StableDiffusionInpaintPipeline` can be adapted to audio. We show only the default text-to-audio example here (an image-to-image sketch follows the quickstart); other audio manipulation examples can be found in [https://github.com/happylittlecat2333/Auffusion/notebooks](https://github.com/happylittlecat2333/Auffusion/notebooks).

First, clone the repository and install the requirements:

```bash
git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
```

Then, download the **Auffusion-Full-no-adapter** model and generate audio from a text prompt:

```python
import IPython, torch, os
import soundfile as sf
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download
from converter import Generator, denormalize_spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16

prompt = "A kitten mewing for attention"
seed = 42

# Download the checkpoint from the Hugging Face Hub if it is not available locally.
pretrained_model_name_or_path = "auffusion/auffusion-full-no-adapter"
if not os.path.isdir(pretrained_model_name_or_path):
    pretrained_model_name_or_path = snapshot_download(pretrained_model_name_or_path)

# Vocoder that turns the generated mel spectrogram back into a waveform.
vocoder = Generator.from_pretrained(pretrained_model_name_or_path, subfolder="vocoder")
vocoder = vocoder.to(device=device, dtype=dtype)

pipe = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
    ).images[0]
# Important: set output_type="pt" to get a torch tensor, and use height=256 with width=1024
# so the output matches the expected spectrogram shape.

denorm_spec = denormalize_spectrogram(output_spec)
denorm_spec_audio = vocoder.inference(denorm_spec)

sf.write(f"{prompt}.wav", denorm_spec_audio, samplerate=16000)
IPython.display.Audio(data=denorm_spec_audio, rate=16000)
```

The Auffusion model will be downloaded from the Hugging Face Hub automatically and saved in the local cache, so subsequent runs will load it directly from the cache.
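
As mentioned above, the same checkpoint can also be plugged into other diffusers text-to-image pipelines. The snippet below is a minimal, illustrative sketch (not a tested recipe from the repository) of an image-to-image pass that re-renders the spectrogram generated in the quickstart with a new prompt; the new prompt and the `strength` value are assumptions chosen for illustration, and the repository notebooks show the intended audio manipulation workflows.

```python
from diffusers import StableDiffusionImg2ImgPipeline

# Reuse the weights already downloaded in the quickstart for an img2img-style pass.
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=dtype)
img2img_pipe = img2img_pipe.to(device)

new_prompt = "A kitten mewing, with rain in the background"  # illustrative prompt

with torch.autocast("cuda"):
    edited_spec = img2img_pipe(
        prompt=new_prompt,
        image=output_spec.unsqueeze(0),  # spectrogram from the quickstart, shape (1, 3, 256, 1024)
        strength=0.5,                    # assumed value: how strongly to deviate from the original spectrogram
        num_inference_steps=100,
        generator=generator,
        output_type="pt",
    ).images[0]

# Convert the edited spectrogram back to a waveform with the same vocoder as above.
edited_audio = vocoder.inference(denormalize_spectrogram(edited_spec))
sf.write("edited.wav", edited_audio, samplerate=16000)
```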

## Citation

Please consider citing the following article if you found our work useful:

```bibtex
@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation},
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}
```