---
license: apache-2.0
pipeline_tag: text-to-speech
---
# Step-Audio-TTS-3B
Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a range of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.
This repository provides the model weights for Step-Audio-TTS-3B, a large language model (LLM) trained with the dual-codebook approach for text-to-speech synthesis. It also includes a vocoder trained with the dual-codebook approach and a specialized vocoder optimized for humming generation. Together, these resources support high-quality speech synthesis and humming generation.
## Performance comparison of content consistency (CER/WER) among Step-Audio, GLM-4-Voice, and MinMo.
<table>
<thead>
<tr>
<th rowspan="2">Model</th>
<th style="text-align:center" colspan="1">test-zh</th>
<th style="text-align:center" colspan="1">test-en</th>
</tr>
<tr>
<th style="text-align:center">CER (%) ↓</th>
<th style="text-align:center">WER (%) ↓</th>
</tr>
</thead>
<tbody>
<tr>
<td>GLM-4-Voice</td>
<td style="text-align:center">2.19</td>
<td style="text-align:center">2.91</td>
</tr>
<tr>
<td>MinMo</td>
<td style="text-align:center">2.48</td>
<td style="text-align:center">2.90</td>
</tr>
<tr>
<td><strong>Step-Audio</strong></td>
<td style="text-align:center"><strong>1.53</strong></td>
<td style="text-align:center"><strong>2.71</strong></td>
</tr>
</tbody>
</table>
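
The CER and WER columns above are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the transcribed output into the reference text, divided by the reference length. The following is a minimal sketch of that calculation, not the benchmark's actual scoring script. The function names, the whitespace handling, and the lowercasing are illustrative assumptions, and the ASR transcription step that produces the hypothesis text is not shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word error rate over lowercased, whitespace-split tokens (an assumption)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())
```

With these definitions, `wer("the cat sat", "the cat sit")` counts one substitution over three reference words. Real evaluations typically also normalize punctuation and numerals before scoring, which this sketch omits.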
## Results of TTS Models on SEED Test Sets.
*Step-Audio-TTS-3B-Single denotes the dual-codebook backbone paired with a single-codebook vocoder.*
<table>
<thead>
<tr>
<th rowspan="2">Model</th>
<th style="text-align:center" colspan="2">test-zh</th>
<th style="text-align:center" colspan="2">test-en</th>
</tr>
<tr>
<th style="text-align:center">CER (%) ↓</th>
<th style="text-align:center">SS ↑</th>
<th style="text-align:center">WER (%) ↓</th>
<th style="text-align:center">SS ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td>FireRedTTS</td>
<td style="text-align:center">1.51</td>
<td style="text-align:center">0.630</td>
<td style="text-align:center">3.82</td>
<td style="text-align:center">0.460</td>
</tr>
<tr>
<td>MaskGCT</td>
<td style="text-align:center">2.27</td>
<td style="text-align:center">0.774</td>
<td style="text-align:center">2.62</td>
<td style="text-align:center">0.774</td>
</tr>
<tr>
<td>CosyVoice</td>
<td style="text-align:center">3.63</td>
<td style="text-align:center">0.775</td>
<td style="text-align:center">4.29</td>
<td style="text-align:center">0.699</td>
</tr>
<tr>
<td>CosyVoice 2</td>
<td style="text-align:center">1.45</td>
<td style="text-align:center">0.806</td>
<td style="text-align:center">2.57</td>
<td style="text-align:center">0.736</td>
</tr>
<tr>
<td>CosyVoice 2-S</td>
<td style="text-align:center">1.45</td>
<td style="text-align:center">0.812</td>
<td style="text-align:center">2.38</td>
<td style="text-align:center">0.743</td>
</tr>
<tr>
<td><strong>Step-Audio-TTS-3B-Single</strong></td>
<td style="text-align:center">1.37</td>
<td style="text-align:center">0.802</td>
<td style="text-align:center">2.52</td>
<td style="text-align:center">0.704</td>
</tr>
<tr>
<td><strong>Step-Audio-TTS-3B</strong></td>
<td style="text-align:center"><strong>1.31</strong></td>
<td style="text-align:center">0.733</td>
<td style="text-align:center"><strong>2.31</strong></td>
<td style="text-align:center">0.660</td>
</tr>
<tr>
<td><strong>Step-Audio-TTS</strong></td>
<td style="text-align:center"><strong>1.17</strong></td>
<td style="text-align:center">0.73</td>
<td style="text-align:center"><strong>2.0</strong></td>
<td style="text-align:center">0.660</td>
</tr>
</tbody>
</table>
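
The SS columns are assumed here to denote speaker similarity, which is commonly reported as the cosine similarity between a speaker embedding of the prompt audio and one of the synthesized audio. The sketch below shows only that final similarity step. The embedding model is not specified in this card, so the vectors are treated as given inputs, and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A score near 1 means the two embeddings point in almost the same direction, which is why higher SS is better in the table above.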
## Performance comparison of Dual-codebook Resynthesis with CosyVoice.
<table>
<thead>
<tr>
<th style="text-align:center" rowspan="2">Token</th>
<th style="text-align:center" colspan="2">test-zh</th>
<th style="text-align:center" colspan="2">test-en</th>
</tr>
<tr>
<th style="text-align:center">CER (%) ↓</th>
<th style="text-align:center">SS ↑</th>
<th style="text-align:center">WER (%) ↓</th>
<th style="text-align:center">SS ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Groundtruth</td>
<td style="text-align:center">0.972</td>
<td style="text-align:center">-</td>
<td style="text-align:center">2.156</td>
<td style="text-align:center">-</td>
</tr>
<tr>
<td style="text-align:center">CosyVoice</td>
<td style="text-align:center">2.857</td>
<td style="text-align:center"><strong>0.849</strong></td>
<td style="text-align:center">4.519</td>
<td style="text-align:center"><strong>0.807</strong></td>
</tr>
<tr>
<td style="text-align:center">Step-Audio-TTS-3B</td>
<td style="text-align:center"><strong>2.192</strong></td>
<td style="text-align:center">0.784</td>
<td style="text-align:center"><strong>3.585</strong></td>
<td style="text-align:center">0.742</td>
</tr>
</tbody>
</table>
# More information
For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).