Safetensors
English
omni_speech2s_llama
Commit b7daaf6 (verified) by SandO114 · 1 parent: 6b57937

Create README.md

Files changed (1)
  1. README.md +546 -0
README.md ADDED
@@ -0,0 +1,546 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - VocalNet/VoiceAssitant-430K-vocalnet
5
+ - VocalNet/UltraChat-vocalnet
6
+ language:
7
+ - en
8
+ base_model:
9
+ - meta-llama/Llama-3.2-1B-Instruct
10
+ ---
11
+
12
+ ## 🎧 VocalNet-1B Model Card
13
+
14
+ **VocalNet-1B** is a low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to improve generation speed and quality. On OpenAudioBench it outperforms the other tiny models compared below (Mini-Omni and SLAM-Omni) and is competitive with several 7B to 9B models on AlpacaEval and LLaMA Questions. 🚀
15
+
16
+ ### 📂 Paper, Code and Model Access
17
+ - **Arxiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060) 📖
18
+ - **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet) 🌐
19
+ - **HuggingFace**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B) 🤗
20
+ - **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B) 🔮
21
+
22
+ ### 🔧 Repository Download and Environment Setup
23
+
24
+ To get started with **VocalNet-1B**, clone the repository and set up the environment as follows. 🛠️
25
+
26
+ 1. **Clone the Repository**:
27
+ ```bash
28
+ git clone https://github.com/SJTU-OmniAgent/VocalNet.git
29
+ cd VocalNet
30
+ ```
31
+
32
+ 2. **Create and Activate Environment**:
33
+ ```bash
34
+ conda create -n vocalnet python==3.10
35
+ conda activate vocalnet
36
+ ```
37
+
38
+ 3. **Install Dependencies**:
39
+ ```bash
40
+ pip install --upgrade pip
41
+ conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
42
+ pip install -e .
43
+ ```
44
+
45
+ 4. **Optional: Install Training Packages**:
46
+ If you plan to train the model, install additional packages:
47
+ ```bash
48
+ pip install -e ".[train]"
49
+ pip install flash-attn --no-build-isolation
50
+ ```
51
+
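After installation, it can help to confirm that the pinned PyTorch packages are the ones actually present in the environment. The following is a minimal sketch, not part of the VocalNet repository: the helper names `installed_versions` and `find_mismatches` are hypothetical, and it assumes the packages are discoverable under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata

# Versions pinned by the install commands above.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def installed_versions(packages):
    """Return {package: version or None} for the given distribution names."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, expected=EXPECTED):
    """List (package, expected, installed) for every missing or mismatched pin."""
    problems = []
    for name, want in expected.items():
        have = installed.get(name)
        # Builds often carry a local suffix such as "+cu121"; compare the base version only.
        base = have.split("+")[0] if have else None
        if base != want:
            problems.append((name, want, have))
    return problems


if __name__ == "__main__":
    for name, want, have in find_mismatches(installed_versions(EXPECTED)):
        print(f"{name}: expected {want}, found {have}")
```

An empty result means all three pins match. A reported mismatch is only a hint to re-check the environment and does not necessarily mean the model will fail to run.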
52
+ ### 📥 Download Instructions
53
+
54
+ **Via Hugging Face CLI**:
55
+ ```bash
56
+ pip install -U huggingface_hub
57
+ huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
58
+ ```
59
+ **Via Snapshot Download**:
60
+ ```bash
61
+ pip install -U huggingface_hub
62
+ ```
63
+ ```python
64
+ from huggingface_hub import snapshot_download
65
+ snapshot_download(
66
+     repo_id="VocalNet/VocalNet-1B",
67
+     local_dir="./checkpoints/",
68
+     resume_download=True
69
+ )
70
+ ```
71
+ **Via Git**:
72
+ ```bash
73
+ git lfs install
74
+ git clone https://huggingface.co/VocalNet/VocalNet-1B
75
+ ```
76
+
77
+ ### 🛠️ Dependencies
78
+ - **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) 🎤
79
+ - **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms. 🔊
80
+
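The model and its two dependencies can be fetched in one pass with `snapshot_download`. The sketch below is illustrative only: the repo IDs come from the links above, but the local directories are assumptions (in particular the `whisper-large-v3` subfolder and the CosyVoice location), so adjust them to whatever paths your copy of the inference script expects. `download_all` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

# Repo IDs come from the links above; the local directories are assumptions.
DOWNLOAD_PLAN = {
    "VocalNet/VocalNet-1B": "./checkpoints/VocalNet-1B",
    "openai/whisper-large-v3": "./models/speech_encoder/whisper-large-v3",
    "FunAudioLLM/CosyVoice2-0.5B": "./pretrained_models/CosyVoice2-0.5B",
}


def download_all(plan, downloader=None):
    """Fetch each repo into its target directory and return the resolved paths."""
    if downloader is None:
        # Imported lazily so the plan can be inspected without huggingface_hub installed.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    paths = {}
    for repo_id, local_dir in plan.items():
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        paths[repo_id] = downloader(repo_id=repo_id, local_dir=local_dir)
    return paths
```

Calling `download_all(DOWNLOAD_PLAN)` performs the real downloads. Passing a custom `downloader` makes it possible to dry-run the plan without network access.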
81
+ ### 🔄 Local Inference
82
+
83
+ To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally. 📡
84
+
85
+ 1. **Model Preparation**:
86
+ - Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B). 📦
87
+ - Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory. 🎤
88
+
89
+ 2. **CosyVoice Preparation**:
90
+ - VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B). 🔊
91
+
92
+ 3. **Path Modification**:
93
+ - Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
94
+ ```python
95
+ COSYVOICE_MODEL="" # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
96
+ VOCALNET_MODEL="" # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
97
+ ```
98
+
99
+ 4. **Run Inference**:
100
+ - For **speech-to-text (S2T)** inference:
101
+ ```bash
102
+ python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
103
+ ```
104
+ - For **speech-to-speech (S2S)** inference:
105
+ ```bash
106
+ python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
107
+ ```
108
+
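To run the same commands over a folder of recordings, the command line can be assembled programmatically. This is a rough sketch under stated assumptions: `build_command` and `run_folder` are hypothetical helpers, the flags are the ones shown in the commands above, and each call starts a fresh Python process, so the model is reloaded for every file. Whether successive S2S runs overwrite each other's output in `save_dir` is not stated here, so check before batching.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "omni_speech/infer/vocalnet.py"  # path from the commands above


def build_command(query_audio, s2s=False, save_dir="./"):
    """Return the argv list for one inference call; flags mirror the commands above."""
    cmd = [sys.executable, SCRIPT, "--query_audio", str(query_audio)]
    if s2s:
        # Speech-to-speech mode also needs a directory for the generated audio.
        cmd += ["--s2s", "--save_dir", str(save_dir)]
    return cmd


def run_folder(audio_dir, s2s=False, save_dir="./"):
    """Run inference once per .wav file in audio_dir (hypothetical helper)."""
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        subprocess.run(build_command(wav, s2s=s2s, save_dir=save_dir), check=True)
```

For example, `run_folder("./omni_speech/infer", s2s=True, save_dir="./outputs")` would process every `.wav` file in that folder from the repository root.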
109
+ ### 📊 Performance Evaluation
110
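+ VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. **Bold** marks the best result in each subgroup and <u>underline</u>, where shown, marks the second-best. In the Modality column, s→t denotes speech-to-text and s→s denotes speech-to-speech.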
111
+
112
+ #### Overall Performance
113
+ <div align="center">
114
+ <table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
115
+ <thead>
116
+ <tr style="background-color: #f2f2f2;">
117
+ <th style="padding: 10px; border: 1px solid #ddd;">Model</th>
118
+ <th style="padding: 10px; border: 1px solid #ddd;">LLM Size</th>
119
+ <th style="padding: 10px; border: 1px solid #ddd;">Modality</th>
120
+ <th style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</th>
121
+ <th style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</th>
122
+ <th style="padding: 10px; border: 1px solid #ddd;">TriviaQA</th>
123
+ <th style="padding: 10px; border: 1px solid #ddd;">Web Questions</th>
124
+ </tr>
125
+ </thead>
126
+ <tbody>
127
+ <tr>
128
+ <td colspan="7" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
129
+ </tr>
130
+ <tr>
131
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
132
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">0.5B</td>
133
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
134
+ <td style="padding: 10px; border: 1px solid #ddd;">1.84</td>
135
+ <td style="padding: 10px; border: 1px solid #ddd;">2.7</td>
136
+ <td style="padding: 10px; border: 1px solid #ddd;">0.12</td>
137
+ <td style="padding: 10px; border: 1px solid #ddd;">0.22</td>
138
+ </tr>
139
+ <tr>
140
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
141
+ <td style="padding: 10px; border: 1px solid #ddd;">1.80</td>
142
+ <td style="padding: 10px; border: 1px solid #ddd;">2.7</td>
143
+ <td style="padding: 10px; border: 1px solid #ddd;">0.08</td>
144
+ <td style="padding: 10px; border: 1px solid #ddd;">0.20</td>
145
+ </tr>
146
+ <tr>
147
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
148
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">0.5B</td>
149
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
150
+ <td style="padding: 10px; border: 1px solid #ddd;">3.50</td>
151
+ <td style="padding: 10px; border: 1px solid #ddd;">29.4</td>
152
+ <td style="padding: 10px; border: 1px solid #ddd;">0.39</td>
153
+ <td style="padding: 10px; border: 1px solid #ddd;">0.84</td>
154
+ </tr>
155
+ <tr>
156
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
157
+ <td style="padding: 10px; border: 1px solid #ddd;">3.01</td>
158
+ <td style="padding: 10px; border: 1px solid #ddd;">26.7</td>
159
+ <td style="padding: 10px; border: 1px solid #ddd;">0.34</td>
160
+ <td style="padding: 10px; border: 1px solid #ddd;">0.69</td>
161
+ </tr>
162
+ <tr>
163
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
164
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">1B</td>
165
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
166
+ <td style="padding: 10px; border: 1px solid #ddd;">5.38</td>
167
+ <td style="padding: 10px; border: 1px solid #ddd;">70.3</td>
168
+ <td style="padding: 10px; border: 1px solid #ddd;">3.38</td>
169
+ <td style="padding: 10px; border: 1px solid #ddd;">4.93</td>
170
+ </tr>
171
+ <tr>
172
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
173
+ <td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
174
+ <td style="padding: 10px; border: 1px solid #ddd;">61.0</td>
175
+ <td style="padding: 10px; border: 1px solid #ddd;">2.78</td>
176
+ <td style="padding: 10px; border: 1px solid #ddd;">4.47</td>
177
+ </tr>
178
+ <tr>
179
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
180
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">1B</td>
181
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
182
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>5.79</b></td>
183
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>71.7</b></td>
184
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>3.60</b></td>
185
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>5.16</b></td>
186
+ </tr>
187
+ <tr>
188
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
189
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>5.03</b></td>
190
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>63.7</b></td>
191
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>3.06</b></td>
192
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.68</b></td>
193
+ </tr>
194
+ </tbody>
195
+ <tbody>
196
+ <tr>
197
+ <td colspan="7" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
198
+ </tr>
199
+ <tr>
200
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
201
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
202
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
203
+ <td style="padding: 10px; border: 1px solid #ddd;">5.31</td>
204
+ <td style="padding: 10px; border: 1px solid #ddd;">69.7</td>
205
+ <td style="padding: 10px; border: 1px solid #ddd;">4.44</td>
206
+ <td style="padding: 10px; border: 1px solid #ddd;">5.44</td>
207
+ </tr>
208
+ <tr>
209
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
210
+ <td style="padding: 10px; border: 1px solid #ddd;">3.89</td>
211
+ <td style="padding: 10px; border: 1px solid #ddd;">55.1</td>
212
+ <td style="padding: 10px; border: 1px solid #ddd;">2.44</td>
213
+ <td style="padding: 10px; border: 1px solid #ddd;">4.00</td>
214
+ </tr>
215
+ <tr>
216
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
217
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">7B</td>
218
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
219
+ <td style="padding: 10px; border: 1px solid #ddd;">4.51</td>
220
+ <td style="padding: 10px; border: 1px solid #ddd;">77.7</td>
221
+ <td style="padding: 10px; border: 1px solid #ddd;">5.32</td>
222
+ <td style="padding: 10px; border: 1px solid #ddd;">6.41</td>
223
+ </tr>
224
+ <tr>
225
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
226
+ <td style="padding: 10px; border: 1px solid #ddd;">2.99</td>
227
+ <td style="padding: 10px; border: 1px solid #ddd;">60.2</td>
228
+ <td style="padding: 10px; border: 1px solid #ddd;">3.53</td>
229
+ <td style="padding: 10px; border: 1px solid #ddd;">4.78</td>
230
+ </tr>
231
+ <tr>
232
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
233
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">9B</td>
234
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
235
+ <td style="padding: 10px; border: 1px solid #ddd;">5.86</td>
236
+ <td style="padding: 10px; border: 1px solid #ddd;">77.4</td>
237
+ <td style="padding: 10px; border: 1px solid #ddd;">4.95</td>
238
+ <td style="padding: 10px; border: 1px solid #ddd;">5.56</td>
239
+ </tr>
240
+ <tr>
241
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
242
+ <td style="padding: 10px; border: 1px solid #ddd;">5.27</td>
243
+ <td style="padding: 10px; border: 1px solid #ddd;">64.3</td>
244
+ <td style="padding: 10px; border: 1px solid #ddd;">4.63</td>
245
+ <td style="padding: 10px; border: 1px solid #ddd;">5.40</td>
246
+ </tr>
247
+ <tr>
248
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
249
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">7B</td>
250
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
251
+ <td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
252
+ <td style="padding: 10px; border: 1px solid #ddd;">77.6</td>
253
+ <td style="padding: 10px; border: 1px solid #ddd;">5.72</td>
254
+ <td style="padding: 10px; border: 1px solid #ddd;">6.12</td>
255
+ </tr>
256
+ <tr>
257
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
258
+ <td style="padding: 10px; border: 1px solid #ddd;">4.10</td>
259
+ <td style="padding: 10px; border: 1px solid #ddd;">61.2</td>
260
+ <td style="padding: 10px; border: 1px solid #ddd;">4.13</td>
261
+ <td style="padding: 10px; border: 1px solid #ddd;">5.18</td>
262
+ </tr>
263
+ <tr>
264
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
265
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
266
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
267
+ <td style="padding: 10px; border: 1px solid #ddd;">6.13</td>
268
+ <td style="padding: 10px; border: 1px solid #ddd;">77.2</td>
269
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>6.43</b></td>
270
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>7.16</b></td>
271
+ </tr>
272
+ <tr>
273
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
274
+ <td style="padding: 10px; border: 1px solid #ddd;">4.95</td>
275
+ <td style="padding: 10px; border: 1px solid #ddd;">65.8</td>
276
+ <td style="padding: 10px; border: 1px solid #ddd;">4.99</td>
277
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>6.22</u></td>
278
+ </tr>
279
+ <tr>
280
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Minmo*</td>
281
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
282
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
283
+ <td style="padding: 10px; border: 1px solid #ddd;">-</td>
284
+ <td style="padding: 10px; border: 1px solid #ddd;">78.9</td>
285
+ <td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
286
+ <td style="padding: 10px; border: 1px solid #ddd;">5.50</td>
287
+ </tr>
288
+ <tr>
289
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
290
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>6.48</b></td>
291
+ <td style="padding: 10px; border: 1px solid #ddd;">64.1</td>
292
+ <td style="padding: 10px; border: 1px solid #ddd;">3.75</td>
293
+ <td style="padding: 10px; border: 1px solid #ddd;">3.99</td>
294
+ </tr>
295
+ <tr>
296
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
297
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
298
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
299
+ <td style="padding: 10px; border: 1px solid #ddd;">6.01</td>
300
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>79.0</u></td>
301
+ <td style="padding: 10px; border: 1px solid #ddd;">5.89</td>
302
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>6.88</u></td>
303
+ </tr>
304
+ <tr>
305
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
306
+ <td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
307
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>76.3</b></td>
308
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>5.59</u></td>
309
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>6.70</b></td>
310
+ </tr>
311
+ <tr>
312
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
313
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
314
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
315
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>7.05</u></td>
316
+ <td style="padding: 10px; border: 1px solid #ddd;">77.1</td>
317
+ <td style="padding: 10px; border: 1px solid #ddd;">6.15</td>
318
+ <td style="padding: 10px; border: 1px solid #ddd;">6.34</td>
319
+ </tr>
320
+ <tr>
321
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
322
+ <td style="padding: 10px; border: 1px solid #ddd;">6.30</td>
323
+ <td style="padding: 10px; border: 1px solid #ddd;">71.4</td>
324
+ <td style="padding: 10px; border: 1px solid #ddd;">5.24</td>
325
+ <td style="padding: 10px; border: 1px solid #ddd;">5.81</td>
326
+ </tr>
327
+ <tr>
328
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
329
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
330
+ <td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
331
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>7.12</b></td>
332
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>79.5</b></td>
333
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>6.24</u></td>
334
+ <td style="padding: 10px; border: 1px solid #ddd;">6.48</td>
335
+ </tr>
336
+ <tr>
337
+ <td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
338
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>6.37</u></td>
339
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>73.1</u></td>
340
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>5.67</b></td>
341
+ <td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
342
+ </tr>
343
+ </tbody>
344
+ </table>
345
+ </div>
346
+
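One way to read the table is to ask how much of the s→t score is lost when the answer is delivered as speech. The sketch below computes that relative drop for VocalNet-1B from the numbers in the table; the metric itself is an illustrative choice made here, not one reported by the authors.

```python
# Scores copied from the VocalNet-1B rows of the table above.
S2T = {"AlpacaEval": 5.79, "LLaMA Questions": 71.7, "TriviaQA": 3.60, "Web Questions": 5.16}
S2S = {"AlpacaEval": 5.03, "LLaMA Questions": 63.7, "TriviaQA": 3.06, "Web Questions": 4.68}


def relative_drop(text_score, speech_score):
    """Fraction of the s→t score lost when the answer is delivered as speech."""
    return (text_score - speech_score) / text_score


drops = {name: relative_drop(S2T[name], S2S[name]) for name in S2T}

if __name__ == "__main__":
    for name, drop in drops.items():
        print(f"{name}: {drop:.1%}")
```

For these rows the drop falls between roughly 9% and 15% on every benchmark. The same function can be applied to any other model's pair of rows for comparison.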
347
+ #### Response Alignment and Acoustic Quality
348
+ <div align="center">
349
+ <table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
350
+ <tbody>
351
+ <tr style="background-color: #f2f2f2;">
352
+ <td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Model</td>
353
+ <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</td>
354
+ <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</td>
355
+ <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">TriviaQA</td>
356
+ <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Web Questions</td>
357
+ <td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Avg</td>
358
+ </tr>
359
+ <tr>
360
+ <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
361
+ <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
362
+ <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
363
+ <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
364
+ <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
365
+ <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
366
+ <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
367
+ <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
368
+ <td style="padding: 10px; border: 1px solid #ddd;">WER</td>
369
+ <td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
370
+ </tr>
371
+ <tr>
372
+ <td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
373
+ </tr>
374
+ <tr>
375
+ <td style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
376
+ <td style="padding: 10px; border: 1px solid #ddd;">20.78</td>
377
+ <td style="padding: 10px; border: 1px solid #ddd;">4.429</td>
378
+ <td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
379
+ <td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
380
+ <td style="padding: 10px; border: 1px solid #ddd;">7.43</td>
381
+ <td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
382
+ <td style="padding: 10px; border: 1px solid #ddd;">8.51</td>
383
+ <td style="padding: 10px; border: 1px solid #ddd;">4.433</td>
384
+ <td style="padding: 10px; border: 1px solid #ddd;">8.66</td>
385
+ <td style="padding: 10px; border: 1px solid #ddd;">4.430</td>
386
+ </tr>
387
+ <tr>
388
+ <td style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
389
+ <td style="padding: 10px; border: 1px solid #ddd;">5.52</td>
390
+ <td style="padding: 10px; border: 1px solid #ddd;">4.439</td>
391
+ <td style="padding: 10px; border: 1px solid #ddd;">5.55</td>
392
+ <td style="padding: 10px; border: 1px solid #ddd;">4.467</td>
393
+ <td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
394
+ <td style="padding: 10px; border: 1px solid #ddd;">4.470</td>
395
+ <td style="padding: 10px; border: 1px solid #ddd;">6.50</td>
396
+ <td style="padding: 10px; border: 1px solid #ddd;">4.461</td>
397
+ <td style="padding: 10px; border: 1px solid #ddd;">6.17</td>
398
+ <td style="padding: 10px; border: 1px solid #ddd;">4.464</td>
399
+ </tr>
400
+ <tr>
401
+ <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
402
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
403
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
404
+ <td style="padding: 10px; border: 1px solid #ddd;">3.65</td>
405
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.498</b></td>
406
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>5.97</b></td>
407
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
408
+ <td style="padding: 10px; border: 1px solid #ddd;">6.40</td>
409
+ <td style="padding: 10px; border: 1px solid #ddd;">4.489</td>
410
+ <td style="padding: 10px; border: 1px solid #ddd;">5.66</td>
411
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
412
+ </tr>
413
+ <tr>
414
+ <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
415
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
416
+ <td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
417
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>3.27</b></td>
418
+ <td style="padding: 10px; border: 1px solid #ddd;">4.497</td>
419
+ <td style="padding: 10px; border: 1px solid #ddd;">6.73</td>
420
+ <td style="padding: 10px; border: 1px solid #ddd;">4.486</td>
421
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.88</b></td>
422
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
423
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>5.31</b></td>
424
+ <td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
425
+ </tr>
426
+ <tr>
427
+ <td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
428
+ </tr>
429
+ <tr>
430
+ <td style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
431
+ <td style="padding: 10px; border: 1px solid #ddd;">6.00</td>
432
+ <td style="padding: 10px; border: 1px solid #ddd;">3.942</td>
433
+ <td style="padding: 10px; border: 1px solid #ddd;">10.00</td>
434
+ <td style="padding: 10px; border: 1px solid #ddd;">4.003</td>
435
+ <td style="padding: 10px; border: 1px solid #ddd;">20.93</td>
436
+ <td style="padding: 10px; border: 1px solid #ddd;">3.965</td>
437
+ <td style="padding: 10px; border: 1px solid #ddd;">14.60</td>
438
+ <td style="padding: 10px; border: 1px solid #ddd;">3.935</td>
439
+ <td style="padding: 10px; border: 1px solid #ddd;">15.90</td>
440
+ <td style="padding: 10px; border: 1px solid #ddd;">3.956</td>
441
+ </tr>
442
+ <tr>
443
+ <td style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
444
+ <td style="padding: 10px; border: 1px solid #ddd;">14.33</td>
445
+ <td style="padding: 10px; border: 1px solid #ddd;">4.377</td>
446
+ <td style="padding: 10px; border: 1px solid #ddd;">14.20</td>
447
+ <td style="padding: 10px; border: 1px solid #ddd;">4.417</td>
448
+ <td style="padding: 10px; border: 1px solid #ddd;">20.39</td>
449
+ <td style="padding: 10px; border: 1px solid #ddd;">4.404</td>
450
+ <td style="padding: 10px; border: 1px solid #ddd;">18.25</td>
451
+ <td style="padding: 10px; border: 1px solid #ddd;">4.398</td>
452
+ <td style="padding: 10px; border: 1px solid #ddd;">18.31</td>
453
+ <td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
454
+ </tr>
455
+ <tr>
456
+ <td style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
457
+ <td style="padding: 10px; border: 1px solid #ddd;">18.71</td>
458
+ <td style="padding: 10px; border: 1px solid #ddd;">4.025</td>
459
+ <td style="padding: 10px; border: 1px solid #ddd;">14.45</td>
460
+ <td style="padding: 10px; border: 1px solid #ddd;">4.152</td>
461
+ <td style="padding: 10px; border: 1px solid #ddd;">8.33</td>
462
+ <td style="padding: 10px; border: 1px solid #ddd;">4.306</td>
463
+ <td style="padding: 10px; border: 1px solid #ddd;">6.08</td>
464
+ <td style="padding: 10px; border: 1px solid #ddd;">4.214</td>
465
+ <td style="padding: 10px; border: 1px solid #ddd;">8.99</td>
466
+ <td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
467
+ </tr>
468
+ <tr>
469
+ <td style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
470
+ <td style="padding: 10px; border: 1px solid #ddd;">20.84</td>
471
+ <td style="padding: 10px; border: 1px solid #ddd;">4.082</td>
472
+ <td style="padding: 10px; border: 1px solid #ddd;">22.82</td>
473
+ <td style="padding: 10px; border: 1px solid #ddd;">4.332</td>
474
+ <td style="padding: 10px; border: 1px solid #ddd;">22.36</td>
475
+ <td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
476
+ <td style="padding: 10px; border: 1px solid #ddd;">23.29</td>
477
+ <td style="padding: 10px; border: 1px solid #ddd;">4.350</td>
478
+ <td style="padding: 10px; border: 1px solid #ddd;">22.67</td>
479
+ <td style="padding: 10px; border: 1px solid #ddd;">4.347</td>
480
+ </tr>
481
+ <tr>
482
+ <td style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
483
+ <td style="padding: 10px; border: 1px solid #ddd;">15.35</td>
484
+ <td style="padding: 10px; border: 1px solid #ddd;">4.102</td>
485
+ <td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
486
+ <td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
487
+ <td style="padding: 10px; border: 1px solid #ddd;">8.08</td>
488
+ <td style="padding: 10px; border: 1px solid #ddd;">4.128</td>
489
+ <td style="padding: 10px; border: 1px solid #ddd;">8.94</td>
490
+ <td style="padding: 10px; border: 1px solid #ddd;">4.125</td>
491
+ <td style="padding: 10px; border: 1px solid #ddd;">8.72</td>
492
+ <td style="padding: 10px; border: 1px solid #ddd;">4.137</td>
493
+ </tr>
494
+ <tr>
495
+ <td style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
496
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>2.41</b></td>
497
+ <td style="padding: 10px; border: 1px solid #ddd;">4.299</td>
498
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>0.93</b></td>
499
+ <td style="padding: 10px; border: 1px solid #ddd;">4.315</td>
500
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>1.13</b></td>
501
+ <td style="padding: 10px; border: 1px solid #ddd;">4.339</td>
502
+ <td style="padding: 10px; border: 1px solid #ddd;">4.68</td>
503
+ <td style="padding: 10px; border: 1px solid #ddd;">4.363</td>
504
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>2.63</b></td>
505
+ <td style="padding: 10px; border: 1px solid #ddd;">4.342</td>
506
+ </tr>
507
+ <tr>
508
+ <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
509
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>2.65</u></td>
510
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.490</b></td>
511
+ <td style="padding: 10px; border: 1px solid #ddd;">3.00</td>
512
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.503</b></td>
513
+ <td style="padding: 10px; border: 1px solid #ddd;">5.02</td>
514
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
515
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.21</u></td>
516
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.485</u></td>
517
+ <td style="padding: 10px; border: 1px solid #ddd;">4.26</td>
518
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
519
+ </tr>
520
+ <tr>
521
+ <td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
522
+ <td style="padding: 10px; border: 1px solid #ddd;">4.71</td>
523
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
524
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>2.68</u></td>
525
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.500</u></td>
526
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.04</u></td>
527
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.482</u></td>
528
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>3.11</b></td>
529
+ <td style="padding: 10px; border: 1px solid #ddd;"><b>4.492</b></td>
530
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>3.56</u></td>
531
+ <td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
532
+ </tr>
533
+ </tbody>
534
+ </table>
535
+ </div>
536
+
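In the table above, lower WER is better and higher UTMOS is better. This README does not spell out how WER is computed. A common setup, assumed here, transcribes the spoken response with an ASR model and compares the transcript with the model's text response. The sketch below shows only the standard word-level edit-distance calculation; `word_error_rate` is a hypothetical helper, and it returns a fraction, whereas the table values appear to be percentages.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words consumed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Real evaluations usually normalize punctuation and numerals before comparison, which this sketch omits apart from lowercasing.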
537
+ ### ✍️ Citation
538
+ If you find our work useful, please cite:
539
+ ```bib
540
+ @article{wang2025vocalnet,
541
+ title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
542
+ author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
543
+ journal={arXiv preprint arXiv:2504.04060},
544
+ year={2025}
545
+ }
546
+ ```