---
license: apache-2.0
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

## 🎧 VocalNet-1B Model Card

**VocalNet-1B** is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. 🚀

### 📂 Paper, Code and Model Access

- **arXiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060) 📖
- **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet) 🌐
- **HuggingFace**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B) 🤗
- **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B) 🔮

### 🔧 Repository Download and Environment Setup

To get started with **VocalNet-1B**, clone the repository and set up the environment as follows. 🛠️

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/SJTU-OmniAgent/VocalNet.git
   cd VocalNet
   ```

2. **Create and Activate Environment**:
   ```bash
   conda create -n vocalnet python==3.10
   conda activate vocalnet
   ```

3. **Install Dependencies**:
   ```bash
   pip install --upgrade pip
   conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
   pip install -e .
   ```

4. **Optional: Install Training Packages**:
   If you plan to train the model, install additional packages:
   ```bash
   pip install -e ".[train]"
   pip install flash-attn --no-build-isolation
   ```
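
After installation, it can help to confirm that the pinned packages are the ones actually installed. The following is a minimal sketch using only the standard library. `check_versions` is a hypothetical helper, not part of the VocalNet repository, and the pins are copied from the `conda install` command in step 3. Some builds append a local suffix to the version string (for example `+cu121`), so the comparison only looks at the part before `+`; packages that ship without Python metadata are reported as missing.

```python
from importlib import metadata

# Pins copied from the conda install command in step 3.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return 'ok', 'missing', or 'mismatch (<found>)' for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop a local build suffix such as '+cu121' before comparing.
        base = found.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch ({found})"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated `vocalnet` environment; any line that does not end in `ok` points at a package to reinstall.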

### 📥 Download Instructions

**Via Hugging Face CLI**:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```

**Via Snapshot Download**:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True
)
```

**Via Git**:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```
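
When cloning with Git, weights that were not fetched through Git LFS are left behind as small text pointer files, which only fail later at load time. The sketch below scans a directory for such pointers. `find_lfs_pointers` is a hypothetical helper, and the path `./VocalNet-1B` assumes the `git clone` command above was run from the current directory.

```python
from pathlib import Path

# First line of every Git LFS pointer file (Git LFS spec v1).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root: str) -> list[Path]:
    """Return files under root that are still Git LFS pointers."""
    pointers = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in path.parts:
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_HEADER)) == LFS_HEADER:
                pointers.append(path)
    return pointers


if __name__ == "__main__":
    unfetched = find_lfs_pointers("./VocalNet-1B")
    if unfetched:
        print("Run `git lfs pull`; these files are still pointers:")
        for path in unfetched:
            print(f"  {path}")
    else:
        print("No LFS pointer files found.")
```

If any files are listed, running `git lfs pull` inside the cloned directory should replace them with the real weights.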

### 🛠️ Dependencies

- **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) 🎤
- **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms. 🔊

### 🔄 Local Inference

To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally. 📡

1. **Model Preparation**:
   - Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B). 📦
   - Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory. 🎤

2. **CosyVoice Preparation**:
   - VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B). 🔊

3. **Path Modification**:
   - Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
     ```python
     COSYVOICE_MODEL=""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
     VOCALNET_MODEL=""  # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
     ```

4. **Run Inference**:
   - For **speech-to-text (S2T)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
     ```
   - For **speech-to-speech (S2S)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
     ```
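
The two commands in step 4 differ only in the `--s2s` and `--save_dir` flags, so they are easy to wrap when scripting batch runs. The following sketch builds the argument list and, when run as a script, checks that the inputs exist before launching. `build_command` and `missing_paths` are hypothetical helpers; only the script path and flags come from the commands above, and the encoder directory is the one named in step 1. It also assumes that `--save_dir` only matters for S2S runs.

```python
import subprocess
from pathlib import Path

SCRIPT = "omni_speech/infer/vocalnet.py"


def build_command(query_audio, s2s=False, save_dir=None):
    """Build the argv list for S2T (default) or S2S inference."""
    cmd = ["python3", SCRIPT, "--query_audio", str(query_audio)]
    if s2s:
        cmd.append("--s2s")
        # Assumption: --save_dir is only meaningful together with --s2s.
        if save_dir is not None:
            cmd += ["--save_dir", str(save_dir)]
    return cmd


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    audio = "./omni_speech/infer/llama_questions_42.wav"
    missing = missing_paths([SCRIPT, audio, "./models/speech_encoder/"])
    if missing:
        print("Missing before inference:", ", ".join(missing))
    else:
        subprocess.run(build_command(audio, s2s=True, save_dir="./"), check=True)
```

Run it from the repository root so that the relative paths resolve the same way as in the commands above.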

### 📊 Performance Evaluation

VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. **Bold** indicates the best result in each subgroup, and <u>underline</u> marks the second best where shown.

#### Overall Performance
<div align="center">
<table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
<thead>
<tr style="background-color: #f2f2f2;">
<th style="padding: 10px; border: 1px solid #ddd;">Model</th>
<th style="padding: 10px; border: 1px solid #ddd;">LLM Size</th>
<th style="padding: 10px; border: 1px solid #ddd;">Modality</th>
<th style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</th>
<th style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</th>
<th style="padding: 10px; border: 1px solid #ddd;">TriviaQA</th>
<th style="padding: 10px; border: 1px solid #ddd;">Web Questions</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">0.5B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">1.84</td>
<td style="padding: 10px; border: 1px solid #ddd;">2.7</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.12</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.22</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">1.80</td>
<td style="padding: 10px; border: 1px solid #ddd;">2.7</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.08</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.20</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">0.5B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.50</td>
<td style="padding: 10px; border: 1px solid #ddd;">29.4</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.39</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.84</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.01</td>
<td style="padding: 10px; border: 1px solid #ddd;">26.7</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.34</td>
<td style="padding: 10px; border: 1px solid #ddd;">0.69</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">1B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.38</td>
<td style="padding: 10px; border: 1px solid #ddd;">70.3</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.38</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.93</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
<td style="padding: 10px; border: 1px solid #ddd;">61.0</td>
<td style="padding: 10px; border: 1px solid #ddd;">2.78</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.47</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">1B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.79</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>71.7</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.60</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.16</b></td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.03</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>63.7</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.06</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.68</b></td>
</tr>
</tbody>
<tbody>
<tr>
<td colspan="7" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.31</td>
<td style="padding: 10px; border: 1px solid #ddd;">69.7</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.44</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.44</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.89</td>
<td style="padding: 10px; border: 1px solid #ddd;">55.1</td>
<td style="padding: 10px; border: 1px solid #ddd;">2.44</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.00</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">7B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.51</td>
<td style="padding: 10px; border: 1px solid #ddd;">77.7</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.32</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.41</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">2.99</td>
<td style="padding: 10px; border: 1px solid #ddd;">60.2</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.53</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.78</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">9B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.86</td>
<td style="padding: 10px; border: 1px solid #ddd;">77.4</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.95</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.56</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.27</td>
<td style="padding: 10px; border: 1px solid #ddd;">64.3</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.63</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.40</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">7B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
<td style="padding: 10px; border: 1px solid #ddd;">77.6</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.72</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.12</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.10</td>
<td style="padding: 10px; border: 1px solid #ddd;">61.2</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.13</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.18</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.13</td>
<td style="padding: 10px; border: 1px solid #ddd;">77.2</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>6.43</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>7.16</b></td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.95</td>
<td style="padding: 10px; border: 1px solid #ddd;">65.8</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.99</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.22</u></td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Minmo*</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">-</td>
<td style="padding: 10px; border: 1px solid #ddd;">78.9</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.83</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.50</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>6.48</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">64.1</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.75</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.99</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.01</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>79.0</u></td>
<td style="padding: 10px; border: 1px solid #ddd;">5.89</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.88</u></td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>76.3</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>5.59</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>6.70</b></td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>7.05</u></td>
<td style="padding: 10px; border: 1px solid #ddd;">77.1</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.15</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.34</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.30</td>
<td style="padding: 10px; border: 1px solid #ddd;">71.4</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.24</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.81</td>
</tr>
<tr>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">s→t</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>7.12</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>79.5</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.24</u></td>
<td style="padding: 10px; border: 1px solid #ddd;">6.48</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">s→s</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>6.37</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>73.1</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.67</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
</tr>
</tbody>
</table>
</div>
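
One way to read the table is to compare a model's s→t and s→s rows, since the difference gives a rough sense of how much quality is lost when the answer is spoken rather than written. The snippet below does this arithmetic for the VocalNet-1B row. The numbers are copied from the table, while `modality_gap` is a hypothetical helper name used only for illustration.

```python
BENCHMARKS = ("AlpacaEval", "LLaMA Questions", "TriviaQA", "Web Questions")

# VocalNet-1B scores copied from the table above.
S2T = (5.79, 71.7, 3.60, 5.16)
S2S = (5.03, 63.7, 3.06, 4.68)


def modality_gap(s2t, s2s):
    """Absolute score drop from s→t to s→s, rounded to two decimals."""
    return [round(t - s, 2) for t, s in zip(s2t, s2s)]


if __name__ == "__main__":
    for name, gap in zip(BENCHMARKS, modality_gap(S2T, S2S)):
        print(f"{name}: -{gap}")
```

The benchmarks use different score scales, so the gaps are only comparable within a column, not across columns.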

#### Response Alignment and Acoustic Quality
<div align="center">
<table style="margin: 0 auto; text-align: center; border-collapse: collapse; font-size: 14px;">
<tbody>
<tr style="background-color: #f2f2f2;">
<td rowspan="2" style="padding: 10px; border: 1px solid #ddd;">Model</td>
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">AlpacaEval</td>
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">LLaMA Questions</td>
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">TriviaQA</td>
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Web Questions</td>
<td colspan="2" style="padding: 10px; border: 1px solid #ddd;">Avg</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
<td style="padding: 10px; border: 1px solid #ddd;">WER</td>
<td style="padding: 10px; border: 1px solid #ddd;">UTMOS</td>
</tr>
<tr>
<td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Tiny Models</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">Mini-Omni</td>
<td style="padding: 10px; border: 1px solid #ddd;">20.78</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.429</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.20</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
<td style="padding: 10px; border: 1px solid #ddd;">7.43</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.428</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.51</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.433</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.66</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.430</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">SLAM-Omni</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.52</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.439</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.55</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.467</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.16</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.470</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.50</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.461</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.17</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.464</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B (VA)</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">3.65</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.498</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.97</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">6.40</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.489</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.66</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.495</b></td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-1B</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.43</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.27</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.497</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.73</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.486</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.88</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>5.31</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.491</td>
</tr>
<tr>
<td colspan="11" style="padding: 10px; border: 1px solid #ddd; font-weight: bold; background-color: #e6f3ff;">Base Models</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">LLaMA-Omni</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.00</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.942</td>
<td style="padding: 10px; border: 1px solid #ddd;">10.00</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.003</td>
<td style="padding: 10px; border: 1px solid #ddd;">20.93</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.965</td>
<td style="padding: 10px; border: 1px solid #ddd;">14.60</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.935</td>
<td style="padding: 10px; border: 1px solid #ddd;">15.90</td>
<td style="padding: 10px; border: 1px solid #ddd;">3.956</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">Freeze-Omni</td>
<td style="padding: 10px; border: 1px solid #ddd;">14.33</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.377</td>
<td style="padding: 10px; border: 1px solid #ddd;">14.20</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.417</td>
<td style="padding: 10px; border: 1px solid #ddd;">20.39</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.404</td>
<td style="padding: 10px; border: 1px solid #ddd;">18.25</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.398</td>
<td style="padding: 10px; border: 1px solid #ddd;">18.31</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">GLM-4-Voice</td>
<td style="padding: 10px; border: 1px solid #ddd;">18.71</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.025</td>
<td style="padding: 10px; border: 1px solid #ddd;">14.45</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.152</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.33</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.306</td>
<td style="padding: 10px; border: 1px solid #ddd;">6.08</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.214</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.99</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">Baichuan-Omni-1.5</td>
<td style="padding: 10px; border: 1px solid #ddd;">20.84</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.082</td>
<td style="padding: 10px; border: 1px solid #ddd;">22.82</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.332</td>
<td style="padding: 10px; border: 1px solid #ddd;">22.36</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.401</td>
<td style="padding: 10px; border: 1px solid #ddd;">23.29</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.350</td>
<td style="padding: 10px; border: 1px solid #ddd;">22.67</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.347</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">MiniCPM-o</td>
<td style="padding: 10px; border: 1px solid #ddd;">15.35</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.102</td>
<td style="padding: 10px; border: 1px solid #ddd;">5.73</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.228</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.08</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.128</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.94</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.125</td>
<td style="padding: 10px; border: 1px solid #ddd;">8.72</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.137</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">Qwen2.5-Omni</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>2.41</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.299</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>0.93</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.315</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>1.13</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.339</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.68</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.363</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>2.63</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.342</td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B (VA)</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>2.65</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.490</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">3.00</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.503</b></td>
<td style="padding: 10px; border: 1px solid #ddd;">5.02</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.499</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.21</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.485</u></td>
<td style="padding: 10px; border: 1px solid #ddd;">4.26</td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.493</b></td>
</tr>
<tr>
<td style="padding: 10px; border: 1px solid #ddd;">VocalNet-8B</td>
<td style="padding: 10px; border: 1px solid #ddd;">4.71</td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>2.68</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.500</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.04</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.482</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>3.11</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><b>4.492</b></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>3.56</u></td>
<td style="padding: 10px; border: 1px solid #ddd;"><u>4.489</u></td>
</tr>
</tbody>
</table>
</div>
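
For readers unfamiliar with the alignment metric, the following sketch computes word error rate (WER) as word-level edit distance divided by reference length. It assumes the standard definition, with the model's text response as the reference and a transcript of its speech response as the hypothesis. The ASR system and text normalization behind the table are not specified here, so the lowercasing and whitespace splitting below are placeholder choices, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    text_response = "the capital of france is paris"
    speech_transcript = "the capital of france is perris"
    wer = word_error_rate(text_response, speech_transcript)
    print(f"WER: {100 * wer:.2f}%")
```

Lower WER means the spoken answer follows the text answer more closely; UTMOS, by contrast, is a predicted naturalness score where higher is better.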

### ✍️ Citation

If you find our work useful, please cite:
```bib
@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}
```