|
--- |
|
tags: |
|
- text-to-speech |
|
license: apache-2.0 |
|
--- |
|
Muyan-TTS is a trainable TTS model designed for podcast applications and built within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling high-quality zero-shot TTS synthesis. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.
|
|
|
## Install |
|
### Clone & Install |
|
```sh |
|
git clone https://github.com/MYZY-AI/Muyan-TTS.git |
|
cd Muyan-TTS |
|
|
|
conda create -n muyan-tts python=3.10 -y |
|
conda activate muyan-tts |
|
make build |
|
``` |
|
|
|
You need to install ```FFmpeg```. If you're using Ubuntu, you can install it with the following commands:
|
```sh |
|
sudo apt update |
|
sudo apt install ffmpeg |
|
``` |
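
To confirm that the binary is visible to your Python environment afterwards, a quick optional check:

```py
import shutil

# Prints the resolved path of the ffmpeg binary, or None if it is not on PATH.
print(shutil.which("ffmpeg"))
```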
|
|
|
|
|
|
|
Additionally, you need to download the weights of [chinese-hubert-base](https://huggingface.co/TencentGameMate/chinese-hubert-base). |
|
|
|
Place all the downloaded models in the ```pretrained_models``` directory. Your directory structure should look similar to the following: |
|
``` |
|
pretrained_models |
|
├── chinese-hubert-base

├── Muyan-TTS

└── Muyan-TTS-SFT
|
``` |
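
If you prefer to fetch the weights programmatically, a minimal sketch using ```huggingface_hub``` is shown below (assuming a recent version of the package is installed; the Muyan-TTS and Muyan-TTS-SFT weights are placed the same way from their respective repositories):

```py
from huggingface_hub import snapshot_download

# Download chinese-hubert-base into the expected directory layout.
snapshot_download(
    repo_id="TencentGameMate/chinese-hubert-base",
    local_dir="pretrained_models/chinese-hubert-base",
)
# Place the Muyan-TTS and Muyan-TTS-SFT weights alongside it in the same way.
```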
|
|
|
## Quickstart |
|
```sh |
|
python tts.py |
|
``` |
|
This runs inference with the default example prompt and synthesizes speech. The core code is as follows:
|
```py |
|
async def main(model_type, model_path): |
|
tts = Inference(model_type, model_path, enable_vllm_acc=False) |
|
wavs = await tts.generate( |
|
ref_wav_path="assets/Claire.wav", |
|
prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.", |
|
text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together." |
|
) |
|
output_path = "logs/tts.wav" |
|
with open(output_path, "wb") as f: |
|
f.write(next(wavs)) |
|
print(f"Speech generated in {output_path}") |
|
``` |
|
You need to specify the prompt speech, including the ```ref_wav_path``` and its ```prompt_text```, and the ```text``` to be synthesized. The synthesized speech is saved by default to ```logs/tts.wav```. |
|
|
|
Additionally, you need to specify ```model_type``` as either ```base``` or ```sft```, with the default being ```base```. |
|
|
|
When you specify the ```model_type``` to be ```base```, you can change the prompt speech to any speaker's voice for zero-shot TTS synthesis.
|
|
|
When you specify the ```model_type``` to be ```sft```, you need to keep the prompt speech unchanged because the ```sft``` model is trained on Claire's voice. |
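
For example, a zero-shot call with the ```base``` model might look like the sketch below, reusing the ```Inference``` class from ```tts.py```; the model path, wav path, and transcript are placeholders for your own setup:

```py
import asyncio

async def zero_shot():
    # Assumes the base weights live under pretrained_models/Muyan-TTS as shown above.
    tts = Inference("base", "pretrained_models/Muyan-TTS", enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/my_speaker.wav",  # placeholder: a short clip of the target speaker
        prompt_text="Exact transcript of my_speaker.wav.",  # placeholder transcript
        text="Any sentence to synthesize in that speaker's voice."
    )
    with open("logs/zero_shot.wav", "wb") as f:
        f.write(next(wavs))

asyncio.run(zero_shot())
```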
|
|
|
## API Usage |
|
```sh |
|
python api.py |
|
``` |
|
Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port ```8020```. Additionally, LLM logs will be saved in ```logs/llm.log```. |
|
|
|
You can send a request to the API using the example below: |
|
```py |
|
import time |
|
import requests |
|
TTS_PORT = 8020
|
payload = { |
|
"ref_wav_path": "assets/Claire.wav", |
|
"prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.", |
|
"text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together." |
|
} |
|
start = time.time() |
|
|
|
url = f"http://localhost:{TTS_PORT}/get_tts" |
|
response = requests.post(url, json=payload) |
|
audio_file_path = "logs/tts.wav" |
|
with open(audio_file_path, "wb") as f: |
|
f.write(response.content) |
|
|
|
print(time.time() - start) |
|
``` |
|
|
|
By default, the synthesized speech will be saved at ```logs/tts.wav```. |
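
If you want the request to fail loudly on server errors instead of silently writing an empty file, a small variation of the same call (using the ```payload``` and ```url``` defined above):

```py
response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()  # surface HTTP errors before writing the audio
with open("logs/tts.wav", "wb") as f:
    f.write(response.content)
```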
|
|
|
Similarly, you need to specify ```model_type``` as either ```base``` or ```sft```, with the default being ```base```. |
|
|
|
## Training |
|
|
|
We use ```LibriSpeech``` as an example. You can use your own dataset instead, but you need to organize the data into the format shown in ```data_process/examples```. |
|
|
|
If you haven't downloaded ```LibriSpeech``` yet, you can download the dev-clean set using: |
|
```sh |
|
wget --no-check-certificate https://www.openslr.org/resources/12/dev-clean.tar.gz |
|
``` |
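
To uncompress the archive without leaving Python, the equivalent of ```tar -xzf dev-clean.tar.gz``` is sketched below; it typically creates a ```LibriSpeech/dev-clean``` directory in the working directory:

```py
import tarfile

# Extract the downloaded LibriSpeech subset into the current directory.
with tarfile.open("dev-clean.tar.gz", "r:gz") as tar:
    tar.extractall(".")
```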
|
After uncompressing the data, specify the ```librispeech_dir``` in ```prepare_sft_dataset.py``` to match the download location. Then run: |
|
```sh |
|
./train.sh |
|
``` |
|
This will automatically process the data and generate ```data/tts_sft_data.json```. |
|
|
|
Note that we use speaker ID "3752" from the dev-clean set of LibriSpeech (which can be specified in ```data_process/text_format_conversion.py```) as an example because its data size is relatively large. If you organize your own dataset for training, please prepare at least a dozen minutes of speech from the target speaker.
|
|
|
If an error occurs during the process, resolve the error, delete the existing contents of the ```data``` folder, and then rerun ```train.sh```.
|
|
|
After generating ```data/tts_sft_data.json```, ```train.sh``` will automatically copy it to ```llama-factory/data``` and add the following entry to ```dataset_info.json```:
|
```json |
|
"tts_sft_data": { |
|
"file_name": "tts_sft_data.json" |
|
} |
|
``` |
|
Finally, it will automatically execute the ```llamafactory-cli train``` command to start training. You can adjust training settings using ```training/sft.yaml```. |
|
|
|
By default, the trained weights will be saved to ```pretrained_models/Muyan-TTS-new-SFT```. |
|
|
|
After training, you need to copy the ```sovits.pth``` from the base/sft model into your trained model directory before running inference:
|
```sh |
|
cp pretrained_models/Muyan-TTS/sovits.pth pretrained_models/Muyan-TTS-new-SFT |
|
``` |
|
|
|
You can directly deploy your trained model using the API tool above. During inference, specify the ```model_type``` as ```sft``` and replace the ```ref_wav_path``` and ```prompt_text``` with a sample of the speaker's voice you trained on.
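
For example, reusing the API request from above; the wav path and transcript below are placeholders for a clip of the speaker you fine-tuned on:

```py
import requests

payload = {
    "ref_wav_path": "assets/my_trained_speaker.wav",  # placeholder: clip of the fine-tuned speaker
    "prompt_text": "Exact transcript of that clip.",  # placeholder transcript
    "text": "Any sentence to synthesize in the fine-tuned voice."
}

response = requests.post("http://localhost:8020/get_tts", json=payload)
with open("logs/tts.wav", "wb") as f:
    f.write(response.content)
```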