```python
async def main(model_type, model_path):
    # ... (model loading and speech synthesis elided here)
    with open(output_path, "wb") as f:
        f.write(next(wavs))
    print(f"Speech generated in {output_path}")
```

You need to specify the prompt speech, including the ```ref_wav_path``` and its ```prompt_text```, as well as the ```text``` to be synthesized. The synthesized speech is saved to ```logs/tts.wav``` by default.

Additionally, you need to specify ```model_type``` as either ```base``` or ```sft```, with the default being ```base```.

When you set ```model_type``` to ```base```, you can change the prompt speech to an arbitrary speaker's voice for zero-shot TTS synthesis.

When you set ```model_type``` to ```sft```, you need to keep the prompt speech unchanged, because the ```sft``` model is trained on Claire's voice.
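
Putting the pieces together, a minimal sketch of invoking the ```main``` coroutine above might look like the following; the argument values are placeholders (the model path shown is the pretrained checkpoint referenced later in this README), and ```ref_wav_path```, ```prompt_text```, and ```text``` are set inside ```main``` in the example above:

```python
import asyncio

# Minimal sketch: run the example's entry point with the default base model.
# Only model_type and model_path are passed here; adjust both to your setup.
asyncio.run(main(
    model_type="base",  # or "sft" for a fine-tuned voice
    model_path="pretrained_models/Muyan-TTS",
))
```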

## API Usage

By default, the synthesized speech will be saved at ```logs/tts.wav```.

Similarly, you need to specify ```model_type``` as either ```base``` or ```sft```, with the default being ```base```.
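
As a rough illustration, a client request could look like the sketch below. This is hypothetical: the port, endpoint path, and JSON field names are assumptions for illustration (they mirror the inference arguments described above), so consult the API example in this repository for the real ones.

```python
import requests

# Hypothetical request sketch; field names mirror ref_wav_path, prompt_text,
# text, and model_type from the inference example above.
payload = {
    "ref_wav_path": "path/to/prompt.wav",        # placeholder prompt speech
    "prompt_text": "Transcript of the prompt.",  # placeholder transcript
    "text": "Hello, this is a test.",            # text to synthesize
    "model_type": "base",
}
resp = requests.post("http://localhost:8020/get_tts", json=payload)  # assumed URL/port
with open("logs/tts.wav", "wb") as f:            # assumes the API returns audio bytes
    f.write(resp.content)
```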

## Training

We use ```LibriSpeech``` as an example. You can use your own dataset instead.

If you haven't downloaded ```LibriSpeech``` yet, you can download the dev-clean set using:
```sh
wget --no-check-certificate https://www.openslr.org/resources/12/dev-clean.tar.gz
```
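
Then uncompress the archive, for example with the command below; this unpacks into a ```LibriSpeech/dev-clean``` directory under the current path:

```sh
tar -xzf dev-clean.tar.gz
```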

After uncompressing the data, specify the ```librispeech_dir``` in ```prepare_sft_dataset.py``` to match the download location. Then run:
```sh
./train.sh
```
This will automatically process the data and generate ```data/tts_sft_data.json```.

Note that we use the speaker ID "3752" from the dev-clean subset of LibriSpeech (this can be changed in ```data_process/text_format_conversion.py```) as an example because its data size is relatively large. If you organize your own dataset for training, please prepare at least a dozen minutes of speech from the target speaker.

If an error occurs during the process, resolve it, delete the existing contents of the ```data``` folder, and then rerun ```train.sh```.
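
For example, assuming the generated artifacts live directly under the repository's ```data``` folder:

```sh
rm -rf data/*   # clear previously generated artifacts
./train.sh      # rerun the pipeline from scratch
```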

After generating ```data/tts_sft_data.json```, ```train.sh``` will automatically copy it to ```llama-factory/data``` and add the following field to ```dataset_info.json```:
```json
"tts_sft_data": {
    "file_name": "tts_sft_data.json"
}
```
Finally, it will automatically execute the ```llamafactory-cli train``` command to start training. You can adjust training settings using ```training/sft.yaml```.
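
Concretely, the final step ```train.sh``` performs amounts to something like the following command, using the config file just mentioned:

```sh
llamafactory-cli train training/sft.yaml
```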

By default, the trained weights will be saved to ```pretrained_models/Muyan-TTS-new-SFT```.

After training, you need to copy the ```sovits.pth``` of the base/sft model into your trained model path before inference:
```sh
cp pretrained_models/Muyan-TTS/sovits.pth pretrained_models/Muyan-TTS-new-SFT
```

You can directly deploy your trained model using the API tool above. During inference, you need to specify ```model_type``` as ```sft``` and replace the ```ref_wav_path``` and ```prompt_text``` with a sample of the voice you trained on.