Create README.md
README.md
ADDED
@@ -0,0 +1,546 @@
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
datasets:
|
4 |
+
- VocalNet/VoiceAssitant-430K-vocalnet
|
5 |
+
- VocalNet/UltraChat-vocalnet
|
6 |
+
language:
|
7 |
+
- en
|
8 |
+
base_model:
|
9 |
+
- meta-llama/Llama-3.2-1B-Instruct
|
10 |
+
---
|
11 |
+
|
12 |
+
## 🎧 VocalNet-1B Model Card
|
13 |
+
|
14 |
+
**VocalNet-1B** is a low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to improve both generation speed and quality. On OpenAudioBench it outperforms the other tiny models evaluated below, and in speech-to-speech mode it is competitive with several larger models. 🚀
|
15 |
+
|
16 |
+
### 📂 Paper, Code and Model Access
|
17 |
+
- **arXiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060) 📖
|
18 |
+
- **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet) 🌐
|
19 |
+
- **HuggingFace**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B) 🤗
|
20 |
+
- **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B) 🔮
|
21 |
+
|
22 |
+
### 🔧 Repository Download and Environment Setup
|
23 |
+
|
24 |
+
To get started with **VocalNet-1B**, clone the repository and set up the environment as follows. 🛠️
|
25 |
+
|
26 |
+
1. **Clone the Repository**:
|
27 |
+
```bash
|
28 |
+
git clone https://github.com/SJTU-OmniAgent/VocalNet.git
|
29 |
+
cd VocalNet
|
30 |
+
```
|
31 |
+
|
32 |
+
2. **Create and Activate Environment**:
|
33 |
+
```bash
|
34 |
+
conda create -n vocalnet python=3.10
|
35 |
+
conda activate vocalnet
|
36 |
+
```
|
37 |
+
|
38 |
+
3. **Install Dependencies**:
|
39 |
+
```bash
|
40 |
+
pip install --upgrade pip
|
41 |
+
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
|
42 |
+
pip install -e .
|
43 |
+
```
|
44 |
+
|
45 |
+
4. **Optional: Install Training Packages**:
|
46 |
+
If you plan to train the model, install additional packages:
|
47 |
+
```bash
|
48 |
+
pip install -e ".[train]"
|
49 |
+
pip install flash-attn --no-build-isolation
|
50 |
+
```
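After installation, it can help to confirm that the pinned PyTorch packages actually ended up in the environment. The following is a minimal sketch, not part of the VocalNet repository: the helper name `find_mismatches` is hypothetical, and the expected versions are simply those from the `conda install` command above.

```python
from importlib import metadata

# Versions pinned by the conda install command in this README.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def find_mismatches(expected=EXPECTED, get_version=metadata.version):
    """Return {package: installed_version_or_None} for each package that is
    missing or whose version differs from the expected pin."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            mismatches[name] = None  # package is not installed at all
            continue
        # Ignore local build tags such as "+cu121" when comparing.
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `find_mismatches()` inside the activated `vocalnet` environment should return an empty dictionary when all three packages match the pins.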
|
51 |
+
|
52 |
+
### 📥 Download Instructions
|
53 |
+
|
54 |
+
**Via Hugging Face CLI**:
|
55 |
+
```bash
|
56 |
+
pip install -U huggingface_hub
|
57 |
+
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
|
58 |
+
```
|
59 |
+
**Via Snapshot Download**:
|
60 |
+
```bash
|
61 |
+
pip install -U huggingface_hub
|
62 |
+
```
|
63 |
+
```python
|
64 |
+
from huggingface_hub import snapshot_download
|
65 |
+
snapshot_download(
|
66 |
+
repo_id="VocalNet/VocalNet-1B",
|
67 |
+
local_dir="./checkpoints/",
|
68 |
+
resume_download=True
|
69 |
+
)
|
70 |
+
```
|
71 |
+
**Via Git**:
|
72 |
+
```bash
|
73 |
+
git lfs install
|
74 |
+
git clone https://huggingface.co/VocalNet/VocalNet-1B
|
75 |
+
```
|
76 |
+
|
77 |
+
### 🛠️ Dependencies
|
78 |
+
- **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) 🎤
|
79 |
+
- **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms. 🔊
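The model and the two dependencies listed above could be fetched in one pass with `snapshot_download`. This is a sketch under stated assumptions: the repo ids come from this README, but the local directory layout and the helper name `download_all` are hypothetical and should be adjusted to the paths you configure for inference.

```python
# repo_id -> local directory. The repo ids are taken from this README;
# the directory names are assumptions, not paths required by VocalNet.
DOWNLOAD_PLAN = {
    "VocalNet/VocalNet-1B": "./checkpoints/VocalNet-1B",
    "openai/whisper-large-v3": "./models/speech_encoder/whisper-large-v3",
    "FunAudioLLM/CosyVoice2-0.5B": "./pretrained_models/CosyVoice2-0.5B",
}


def download_all(plan=DOWNLOAD_PLAN, downloader=None):
    """Download every repo in `plan` and return the list of local directories."""
    if downloader is None:
        # Imported lazily so the plan can be inspected without the package.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    local_dirs = []
    for repo_id, local_dir in plan.items():
        downloader(repo_id=repo_id, local_dir=local_dir)
        local_dirs.append(local_dir)
    return local_dirs
```

The `downloader` argument exists only so the function can be exercised without network access; in normal use, `download_all()` with no arguments is enough.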
|
80 |
+
|
81 |
+
### 🔄 Local Inference
|
82 |
+
|
83 |
+
To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally. 📡
|
84 |
+
|
85 |
+
1. **Model Preparation**:
|
86 |
+
- Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B). 📦
|
87 |
+
- Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory. 🎤
|
88 |
+
|
89 |
+
2. **CosyVoice Preparation**:
|
90 |
+
- VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B). 🔊
|
91 |
+
|
92 |
+
3. **Path Modification**:
|
93 |
+
- Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
|
94 |
+
```python
|
95 |
+
COSYVOICE_MODEL="" # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
|
96 |
+
VOCALNET_MODEL="" # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
|
97 |
+
```
|
98 |
+
|
99 |
+
4. **Run Inference**:
|
100 |
+
- For **speech-to-text (S2T)** inference:
|
101 |
+
```bash
|
102 |
+
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
|
103 |
+
```
|
104 |
+
- For **speech-to-speech (S2S)** inference:
|
105 |
+
```bash
|
106 |
+
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
|
107 |
+
```
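To call the two inference modes above from Python, for example when looping over several audio files, a thin wrapper around the script may be convenient. This is only a sketch: `build_command` and `run_inference` are hypothetical helpers, and the flags are limited to those shown in the commands above (`--query_audio`, `--s2s`, `--save_dir`).

```python
import subprocess
import sys

# Script path as given in this README, relative to the repository root.
SCRIPT = "omni_speech/infer/vocalnet.py"


def build_command(query_audio, s2s=False, save_dir=None):
    """Build the argument list for the inference script."""
    cmd = [sys.executable, SCRIPT, "--query_audio", query_audio]
    if s2s:
        cmd.append("--s2s")  # speech-to-speech instead of speech-to-text
    if save_dir is not None:
        cmd += ["--save_dir", save_dir]
    return cmd


def run_inference(query_audio, s2s=False, save_dir=None):
    """Run the script from the repository root; raise if it exits non-zero."""
    return subprocess.run(build_command(query_audio, s2s, save_dir), check=True)
```

For instance, `run_inference("./omni_speech/infer/llama_questions_42.wav", s2s=True, save_dir="./")` mirrors the S2S command above.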
|
108 |
+
|
109 |
+
### 📊 Performance Evaluation
|
110 |
+
VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. **Bold** marks the best result in each subgroup, and underlined values mark the second best.
|
111 |
+
|
112 |
+
#### Overall Performance
|
113 |
+
<div align="center">
|
114 |
+
<table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
|
115 |
+
<thead>
|
116 |
+
<tr style="background-color: #f2f2f2;">
|
117 |
+
<th style="padding: 10px; border: 1px solid #ddd;">Model</th>
|
118 |
+
<th style="padding: 10px; border: 1px solid #ddd;">LLM Size</th>
|
119 |
+
<th style="padding: 10px; border: 1px solid #ddd;">Modality</th>
|
120 |
+
<th style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</th>
|
121 |
+
<th style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</th>
|
122 |
+
<th style="padding: 10px; border: 1px solid #ddd;">TriviaQA</th>
|
123 |
+
<th style="padding: 10px; border: 1px solid #ddd;">Web Questions</th>
|
124 |
+
</tr>
|
125 |
+
</thead>
|
126 |
+
<tbody>
|
127 |
+
<tr>
|
128 |
+
<td colspan="7" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
|
129 |
+
</tr>
|
130 |
+
<tr>
|
131 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
|
132 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">0.5B</td>
|
133 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
134 |
+
<td style="padding: 10px; border: 1px solid #ddd;">1.84</td>
|
135 |
+
<td style="padding: 10px; border: 1px solid #ddd;">2.7</td>
|
136 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.12</td>
|
137 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.22</td>
|
138 |
+
</tr>
|
139 |
+
<tr>
|
140 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
141 |
+
<td style="padding: 10px; border: 1px solid #ddd;">1.80</td>
|
142 |
+
<td style="padding: 10px; border: 1px solid #ddd;">2.7</td>
|
143 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.08</td>
|
144 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.20</td>
|
145 |
+
</tr>
|
146 |
+
<tr>
|
147 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
|
148 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">0.5B</td>
|
149 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
150 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.50</td>
|
151 |
+
<td style="padding: 10px; border: 1px solid #ddd;">29.4</td>
|
152 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.39</td>
|
153 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.84</td>
|
154 |
+
</tr>
|
155 |
+
<tr>
|
156 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
157 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.01</td>
|
158 |
+
<td style="padding: 10px; border: 1px solid #ddd;">26.7</td>
|
159 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.34</td>
|
160 |
+
<td style="padding: 10px; border: 1px solid #ddd;">0.69</td>
|
161 |
+
</tr>
|
162 |
+
<tr>
|
163 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
|
164 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">1B</td>
|
165 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
166 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.38</td>
|
167 |
+
<td style="padding: 10px; border: 1px solid #ddd;">70.3</td>
|
168 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.38</td>
|
169 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.93</td>
|
170 |
+
</tr>
|
171 |
+
<tr>
|
172 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
173 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
|
174 |
+
<td style="padding: 10px; border: 1px solid #ddd;">61.0</td>
|
175 |
+
<td style="padding: 10px; border: 1px solid #ddd;">2.78</td>
|
176 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.47</td>
|
177 |
+
</tr>
|
178 |
+
<tr>
|
179 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
|
180 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">1B</td>
|
181 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
182 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.79</b></td>
|
183 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>71.7</b></td>
|
184 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.60</b></td>
|
185 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.16</b></td>
|
186 |
+
</tr>
|
187 |
+
<tr>
|
188 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
189 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.03</b></td>
|
190 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>63.7</b></td>
|
191 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.06</b></td>
|
192 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.68</b></td>
|
193 |
+
</tr>
|
194 |
+
</tbody>
|
195 |
+
<tbody>
|
196 |
+
<tr>
|
197 |
+
<td colspan="7" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
|
198 |
+
</tr>
|
199 |
+
<tr>
|
200 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
|
201 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
|
202 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
203 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.31</td>
|
204 |
+
<td style="padding: 10px; border: 1px solid #ddd;">69.7</td>
|
205 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.44</td>
|
206 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.44</td>
|
207 |
+
</tr>
|
208 |
+
<tr>
|
209 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
210 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.89</td>
|
211 |
+
<td style="padding: 10px; border: 1px solid #ddd;">55.1</td>
|
212 |
+
<td style="padding: 10px; border: 1px solid #ddd;">2.44</td>
|
213 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.00</td>
|
214 |
+
</tr>
|
215 |
+
<tr>
|
216 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
|
217 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">7B</td>
|
218 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
219 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.51</td>
|
220 |
+
<td style="padding: 10px; border: 1px solid #ddd;">77.7</td>
|
221 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.32</td>
|
222 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.41</td>
|
223 |
+
</tr>
|
224 |
+
<tr>
|
225 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
226 |
+
<td style="padding: 10px; border: 1px solid #ddd;">2.99</td>
|
227 |
+
<td style="padding: 10px; border: 1px solid #ddd;">60.2</td>
|
228 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.53</td>
|
229 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.78</td>
|
230 |
+
</tr>
|
231 |
+
<tr>
|
232 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
|
233 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">9B</td>
|
234 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
235 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.86</td>
|
236 |
+
<td style="padding: 10px; border: 1px solid #ddd;">77.4</td>
|
237 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.95</td>
|
238 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.56</td>
|
239 |
+
</tr>
|
240 |
+
<tr>
|
241 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
242 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.27</td>
|
243 |
+
<td style="padding: 10px; border: 1px solid #ddd;">64.3</td>
|
244 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.63</td>
|
245 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.40</td>
|
246 |
+
</tr>
|
247 |
+
<tr>
|
248 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
|
249 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">7B</td>
|
250 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
251 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
|
252 |
+
<td style="padding: 10px; border: 1px solid #ddd;">77.6</td>
|
253 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.72</td>
|
254 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.12</td>
|
255 |
+
</tr>
|
256 |
+
<tr>
|
257 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
258 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.10</td>
|
259 |
+
<td style="padding: 10px; border: 1px solid #ddd;">61.2</td>
|
260 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.13</td>
|
261 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.18</td>
|
262 |
+
</tr>
|
263 |
+
<tr>
|
264 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
|
265 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
|
266 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
267 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.13</td>
|
268 |
+
<td style="padding: 10px; border: 1px solid #ddd;">77.2</td>
|
269 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>6.43</b></td>
|
270 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>7.16</b></td>
|
271 |
+
</tr>
|
272 |
+
<tr>
|
273 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
274 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.95</td>
|
275 |
+
<td style="padding: 10px; border: 1px solid #ddd;">65.8</td>
|
276 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.99</td>
|
277 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.22</u></td>
|
278 |
+
</tr>
|
279 |
+
<tr>
|
280 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">MinMo*</td>
|
281 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
|
282 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
283 |
+
<td style="padding: 10px; border: 1px solid #ddd;">-</td>
|
284 |
+
<td style="padding: 10px; border: 1px solid #ddd;">78.9</td>
|
285 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
|
286 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.50</td>
|
287 |
+
</tr>
|
288 |
+
<tr>
|
289 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
290 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>6.48</b></td>
|
291 |
+
<td style="padding: 10px; border: 1px solid #ddd;">64.1</td>
|
292 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.75</td>
|
293 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.99</td>
|
294 |
+
</tr>
|
295 |
+
<tr>
|
296 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
|
297 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
|
298 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
299 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.01</td>
|
300 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>79.0</u></td>
|
301 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.89</td>
|
302 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.88</u></td>
|
303 |
+
</tr>
|
304 |
+
<tr>
|
305 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
306 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
|
307 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>76.3</b></td>
|
308 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>5.59</u></td>
|
309 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>6.70</b></td>
|
310 |
+
</tr>
|
311 |
+
<tr>
|
312 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
|
313 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
|
314 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
315 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>7.05</u></td>
|
316 |
+
<td style="padding: 10px; border: 1px solid #ddd;">77.1</td>
|
317 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.15</td>
|
318 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.34</td>
|
319 |
+
</tr>
|
320 |
+
<tr>
|
321 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
322 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.30</td>
|
323 |
+
<td style="padding: 10px; border: 1px solid #ddd;">71.4</td>
|
324 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.24</td>
|
325 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.81</td>
|
326 |
+
</tr>
|
327 |
+
<tr>
|
328 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
|
329 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
|
330 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
|
331 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>7.12</b></td>
|
332 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>79.5</b></td>
|
333 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.24</u></td>
|
334 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.48</td>
|
335 |
+
</tr>
|
336 |
+
<tr>
|
337 |
+
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
|
338 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.37</u></td>
|
339 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>73.1</u></td>
|
340 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.67</b></td>
|
341 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
|
342 |
+
</tr>
|
343 |
+
</tbody>
|
344 |
+
</table>
|
345 |
+
</div>
|
346 |
+
|
347 |
+
#### Response Alignment and Acoustic Quality
|
348 |
+
<div align="center">
|
349 |
+
<table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
|
350 |
+
<tbody>
|
351 |
+
<tr style="background-color: #f2f2f2;">
|
352 |
+
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Model</td>
|
353 |
+
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</td>
|
354 |
+
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</td>
|
355 |
+
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">TriviaQA</td>
|
356 |
+
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Web Questions</td>
|
357 |
+
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Avg</td>
|
358 |
+
</tr>
|
359 |
+
<tr>
|
360 |
+
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
|
361 |
+
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
|
362 |
+
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
|
363 |
+
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
|
364 |
+
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
|
365 |
+
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
|
366 |
+
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
|
367 |
+
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
|
368 |
+
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
|
369 |
+
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
|
370 |
+
</tr>
|
371 |
+
<tr>
|
372 |
+
<td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
|
373 |
+
</tr>
|
374 |
+
<tr>
|
375 |
+
<td style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
|
376 |
+
<td style="padding: 10px; border: 1px solid #ddd;">20.78</td>
|
377 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.429</td>
|
378 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
|
379 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
|
380 |
+
<td style="padding: 10px; border: 1px solid #ddd;">7.43</td>
|
381 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
|
382 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.51</td>
|
383 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.433</td>
|
384 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.66</td>
|
385 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.430</td>
|
386 |
+
</tr>
|
387 |
+
<tr>
|
388 |
+
<td style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
|
389 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.52</td>
|
390 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.439</td>
|
391 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.55</td>
|
392 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.467</td>
|
393 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
|
394 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.470</td>
|
395 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.50</td>
|
396 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.461</td>
|
397 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.17</td>
|
398 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.464</td>
|
399 |
+
</tr>
|
400 |
+
<tr>
|
401 |
+
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
|
402 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
|
403 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
|
404 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.65</td>
|
405 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.498</b></td>
|
406 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.97</b></td>
|
407 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
|
408 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.40</td>
|
409 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.489</td>
|
410 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.66</td>
|
411 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
|
412 |
+
</tr>
|
413 |
+
<tr>
|
414 |
+
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
|
415 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
|
416 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
|
417 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.27</b></td>
|
418 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.497</td>
|
419 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.73</td>
|
420 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.486</td>
|
421 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.88</b></td>
|
422 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
|
423 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.31</b></td>
|
424 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
|
425 |
+
</tr>
|
426 |
+
<tr>
|
427 |
+
<td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
|
428 |
+
</tr>
|
429 |
+
<tr>
|
430 |
+
<td style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
|
431 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.00</td>
|
432 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.942</td>
|
433 |
+
<td style="padding: 10px; border: 1px solid #ddd;">10.00</td>
|
434 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.003</td>
|
435 |
+
<td style="padding: 10px; border: 1px solid #ddd;">20.93</td>
|
436 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.965</td>
|
437 |
+
<td style="padding: 10px; border: 1px solid #ddd;">14.60</td>
|
438 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.935</td>
|
439 |
+
<td style="padding: 10px; border: 1px solid #ddd;">15.90</td>
|
440 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.956</td>
|
441 |
+
</tr>
|
442 |
+
<tr>
|
443 |
+
<td style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
|
444 |
+
<td style="padding: 10px; border: 1px solid #ddd;">14.33</td>
|
445 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.377</td>
|
446 |
+
<td style="padding: 10px; border: 1px solid #ddd;">14.20</td>
|
447 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.417</td>
|
448 |
+
<td style="padding: 10px; border: 1px solid #ddd;">20.39</td>
|
449 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.404</td>
|
450 |
+
<td style="padding: 10px; border: 1px solid #ddd;">18.25</td>
|
451 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.398</td>
|
452 |
+
<td style="padding: 10px; border: 1px solid #ddd;">18.31</td>
|
453 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
|
454 |
+
</tr>
|
455 |
+
<tr>
|
456 |
+
<td style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
|
457 |
+
<td style="padding: 10px; border: 1px solid #ddd;">18.71</td>
|
458 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.025</td>
|
459 |
+
<td style="padding: 10px; border: 1px solid #ddd;">14.45</td>
|
460 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.152</td>
|
461 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.33</td>
|
462 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.306</td>
|
463 |
+
<td style="padding: 10px; border: 1px solid #ddd;">6.08</td>
|
464 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.214</td>
|
465 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.99</td>
|
466 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
|
467 |
+
</tr>
|
468 |
+
<tr>
|
469 |
+
<td style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
|
470 |
+
<td style="padding: 10px; border: 1px solid #ddd;">20.84</td>
|
471 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.082</td>
|
472 |
+
<td style="padding: 10px; border: 1px solid #ddd;">22.82</td>
|
473 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.332</td>
|
474 |
+
<td style="padding: 10px; border: 1px solid #ddd;">22.36</td>
|
475 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
|
476 |
+
<td style="padding: 10px; border: 1px solid #ddd;">23.29</td>
|
477 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.350</td>
|
478 |
+
<td style="padding: 10px; border: 1px solid #ddd;">22.67</td>
|
479 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.347</td>
|
480 |
+
</tr>
|
481 |
+
<tr>
|
482 |
+
<td style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
|
483 |
+
<td style="padding: 10px; border: 1px solid #ddd;">15.35</td>
|
484 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.102</td>
|
485 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
|
486 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
|
487 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.08</td>
|
488 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.128</td>
|
489 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.94</td>
|
490 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.125</td>
|
491 |
+
<td style="padding: 10px; border: 1px solid #ddd;">8.72</td>
|
492 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.137</td>
|
493 |
+
</tr>
|
494 |
+
<tr>
|
495 |
+
<td style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
|
496 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>2.41</b></td>
|
497 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.299</td>
|
498 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>0.93</b></td>
|
499 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.315</td>
|
500 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>1.13</b></td>
|
501 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.339</td>
|
502 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.68</td>
|
503 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.363</td>
|
504 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>2.63</b></td>
|
505 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.342</td>
|
506 |
+
</tr>
|
507 |
+
<tr>
|
508 |
+
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
|
509 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>2.65</u></td>
|
510 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.490</b></td>
|
511 |
+
<td style="padding: 10px; border: 1px solid #ddd;">3.00</td>
|
512 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.503</b></td>
|
513 |
+
<td style="padding: 10px; border: 1px solid #ddd;">5.02</td>
|
514 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
|
515 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.21</u></td>
|
516 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.485</u></td>
|
517 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.26</td>
|
518 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
|
519 |
+
</tr>
|
520 |
+
<tr>
|
521 |
+
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
|
522 |
+
<td style="padding: 10px; border: 1px solid #ddd;">4.71</td>
|
523 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
|
524 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>2.68</u></td>
|
525 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.500</u></td>
|
526 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.04</u></td>
|
527 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.482</u></td>
|
528 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.11</b></td>
|
529 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.492</b></td>
|
530 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>3.56</u></td>
|
531 |
+
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
|
532 |
+
</tr>
|
533 |
+
</tbody>
|
534 |
+
</table>
|
535 |
+
</div>
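For readers unfamiliar with the WER columns above, the sketch below shows how a word error rate is commonly computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. It assumes that WER here compares a transcript of the generated speech against the model's text response; the exact ASR model and text normalization behind the table are not stated in this README, so this illustrates the metric and does not reproduce the evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction; multiply by 100 to express it as a percentage.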
|
536 |
+
|
537 |
+
### ✍️ Citation
|
538 |
+
If you find our work useful, please cite:
|
539 |
+
```bib
|
540 |
+
@article{wang2025vocalnet,
|
541 |
+
title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
|
542 |
+
author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
|
543 |
+
journal={arXiv preprint arXiv:2504.04060},
|
544 |
+
year={2025}
|
545 |
+
}
|
546 |
+
```
|