---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- audio
---

# Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

**Auffusion** is a latent diffusion model (LDM) for text-to-audio (TTA) generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. Auffusion adapts text-to-image (T2I) diffusion model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. We release our model, inference code, and pre-trained checkpoints for the research community.

📣 We are releasing **Auffusion-Full-no-adapter**, which was pre-trained on all datasets described in the paper and is designed for easy audio manipulation.

📣 We are releasing **Auffusion-Full**, which was pre-trained on all datasets described in the paper.

📣 We are releasing **Auffusion**, which was pre-trained on **AudioCaps**.

## Auffusion Model Family

| Model Name                | Model Path                                                                                                                |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Auffusion                 | [https://huggingface.co/auffusion/auffusion](https://huggingface.co/auffusion/auffusion)                                 |
| Auffusion-Full            | [https://huggingface.co/auffusion/auffusion-full](https://huggingface.co/auffusion/auffusion-full)                       |
| Auffusion-Full-no-adapter | [https://huggingface.co/auffusion/auffusion-full-no-adapter](https://huggingface.co/auffusion/auffusion-full-no-adapter) |
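
Any of these checkpoints can also be fetched ahead of time with `huggingface_hub`; the quickstart below does this automatically, so the snippet here is just a minimal sketch of the manual route:

```python
from huggingface_hub import snapshot_download

# Fetch one of the checkpoints listed above into the local Hugging Face cache
# and get back its local directory path.
local_dir = snapshot_download("auffusion/auffusion-full-no-adapter")
print(local_dir)  # this path can be passed wherever a pretrained_model_name_or_path is expected
```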

## Code

Our code is released here: [https://github.com/happylittlecat2333/Auffusion](https://github.com/happylittlecat2333/Auffusion)

Several samples generated by **Auffusion** are available here: [https://auffusion.github.io](https://auffusion.github.io)

Please follow the instructions in the repository for installation, usage, and experiments.

## Quickstart Guide

**Auffusion-Full-no-adapter** is designed to be compatible with the text-to-image pipelines in diffusers, so pipelines such as `StableDiffusionPipeline`, `StableDiffusionImg2ImgPipeline`, and `StableDiffusionInpaintPipeline` can be adapted to audio. We show only the default text-to-audio example here (an image-to-image sketch follows the quickstart); other audio manipulation examples can be found in [https://github.com/happylittlecat2333/Auffusion/notebooks](https://github.com/happylittlecat2333/Auffusion/notebooks).

First, clone the repository and install the requirements:

```bash
git clone https://github.com/happylittlecat2333/Auffusion/
cd Auffusion
pip install -r requirements.txt
```

Then, download the **Auffusion-Full-no-adapter** model and generate audio from a text prompt:

```python
import IPython, torch, os
import soundfile as sf
from diffusers import StableDiffusionPipeline
from huggingface_hub import snapshot_download
from converter import Generator, denormalize_spectrogram

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16

prompt = "A kitten mewing for attention"
seed = 42

# Download the checkpoint from the Hugging Face Hub if it is not available locally.
pretrained_model_name_or_path = "auffusion/auffusion-full-no-adapter"
if not os.path.isdir(pretrained_model_name_or_path):
    pretrained_model_name_or_path = snapshot_download(pretrained_model_name_or_path)

# Vocoder that turns the generated mel spectrogram back into a waveform.
vocoder = Generator.from_pretrained(pretrained_model_name_or_path, subfolder="vocoder")
vocoder = vocoder.to(device=device, dtype=dtype)

pipe = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=dtype)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(seed)

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, num_inference_steps=100, generator=generator, height=256, width=1024, output_type="pt"
    ).images[0]
# Important: set output_type="pt" to get a torch tensor, and use height=256 with width=1024
# so the output matches the expected spectrogram shape.

denorm_spec = denormalize_spectrogram(output_spec)
denorm_spec_audio = vocoder.inference(denorm_spec)

sf.write(f"{prompt}.wav", denorm_spec_audio, samplerate=16000)
IPython.display.Audio(data=denorm_spec_audio, rate=16000)
```

The Auffusion model will be downloaded from the Hugging Face Hub automatically and saved in the local cache, so subsequent runs will load it directly from the cache.
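
As mentioned above, the same checkpoint can also be plugged into other diffusers text-to-image pipelines. The snippet below is a minimal, illustrative sketch (not a tested recipe from the repository) of an image-to-image pass that re-renders the spectrogram generated in the quickstart with a new prompt; the new prompt and the `strength` value are assumptions chosen for illustration, and the repository notebooks show the intended audio manipulation workflows.

```python
from diffusers import StableDiffusionImg2ImgPipeline

# Reuse the weights already downloaded in the quickstart for an img2img-style pass.
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=dtype)
img2img_pipe = img2img_pipe.to(device)

new_prompt = "A kitten mewing, with rain in the background"  # illustrative prompt

with torch.autocast("cuda"):
    edited_spec = img2img_pipe(
        prompt=new_prompt,
        image=output_spec.unsqueeze(0),  # spectrogram from the quickstart, shape (1, 3, 256, 1024)
        strength=0.5,                    # assumed value: how strongly to deviate from the original spectrogram
        num_inference_steps=100,
        generator=generator,
        output_type="pt",
    ).images[0]

# Convert the edited spectrogram back to a waveform with the same vocoder as above.
edited_audio = vocoder.inference(denormalize_spectrogram(edited_spec))
sf.write("edited.wav", edited_audio, samplerate=16000)
```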

## Citation

Please consider citing the following article if you found our work useful:

```bibtex
@article{xue2024auffusion,
  title={Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation},
  author={Jinlong Xue and Yayue Deng and Yingming Gao and Ya Li},
  journal={arXiv preprint arXiv:2401.01044},
  year={2024}
}
```