---
license: apache-2.0
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
## 🎧 VocalNet-1B Model Card
**VocalNet-1B** is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. 🚀
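For intuition, here is a purely illustrative sketch of why MTP lowers decoding latency. It is not VocalNet's actual API: `predict_next` and `predict_next_k` are hypothetical stand-ins for a single-token head and a k-token MTP head.
```python
# Illustrative only: a vanilla autoregressive decoder needs one forward pass
# per generated token, while an MTP head emits k tokens per pass, so roughly
# k times fewer passes are needed for the same output length.

def autoregressive_decode(model, prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):           # one forward pass per token
        ids.append(model.predict_next(ids))   # hypothetical single-token head
    return ids

def mtp_decode(model, prompt_ids, max_new_tokens, k=5):
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        ids.extend(model.predict_next_k(ids, k))  # hypothetical k-token head
    return ids[: len(prompt_ids) + max_new_tokens]
```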
### 📂 Paper, Code and Model Access
- **arXiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060) 📖
- **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet) 🌐
- **HuggingFace**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B) 🤗
- **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B) 🔮
### 🔧 Repository Download and Environment Setup
To get started with **VocalNet-1B**, clone the repository and set up the environment as follows. 🛠️
1. **Clone the Repository**:
```bash
git clone https://github.com/SJTU-OmniAgent/VocalNet.git
cd VocalNet
```
2. **Create and Activate Environment**:
```bash
conda create -n vocalnet python=3.10
conda activate vocalnet
```
3. **Install Dependencies**:
```bash
pip install --upgrade pip
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
```
4. **Optional: Install Training Packages**:
If you plan to train the model, install additional packages:
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
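After installing, an optional sanity check (a minimal sketch, nothing VocalNet-specific) confirms the pinned stack from step 3 imports cleanly and that CUDA is visible:
```python
# Verify the pinned PyTorch stack and CUDA visibility before downloading
# checkpoints.
import torch
import torchaudio

print(torch.__version__)          # expected: 2.1.2
print(torchaudio.__version__)     # expected: 2.1.2
print(torch.cuda.is_available())  # True on a working CUDA 12.1 setup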
### 📥 Download Instructions
**Via the Hugging Face CLI**:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```
**Via Snapshot Download**:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True,  # deprecated no-op on recent huggingface_hub releases; resuming is the default
)
```
**Via Git**:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```
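Whichever route you take, you can confirm the weights are in place before moving on (a minimal check; adjust the directory if you cloned via git, which places files under `./VocalNet-1B` rather than `./checkpoints`):
```python
# List the downloaded checkpoint files and their sizes.
from pathlib import Path

for path in sorted(Path("./checkpoints").iterdir()):
    print(path.name, f"{path.stat().st_size / 1e6:.1f} MB")
```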
### 🛠️ Dependencies
- **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) 🎤
- **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms. 🔊
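Together these pieces form the speech-to-speech path: Whisper encodes the input audio, the VocalNet LLM generates the reply text along with discrete speech tokens, and CosyVoice2 renders those tokens to a waveform. Below is a minimal composition sketch with hypothetical callables, for orientation only; the real wiring lives in `omni_speech/infer/vocalnet.py`.
```python
# Conceptual composition of the three stages; each argument is a callable
# stand-in, not a real VocalNet class.
def s2s_pipeline(waveform, encoder, llm, vocoder):
    features = encoder(waveform)         # Whisper-large-v3: audio -> features
    text, speech_tokens = llm(features)  # VocalNet: features -> text + speech tokens (via MTP)
    audio = vocoder(speech_tokens)       # CosyVoice2-0.5B: tokens -> waveform
    return text, audio
```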
### 🔄 Local Inference
To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally. 📡
1. **Model Preparation**:
- Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B). 📦
- Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory. 🎤
2. **CosyVoice Preparation**:
- VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B). 🔊
3. **Path Modification**:
- Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
```python
COSYVOICE_MODEL="" # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
VOCALNET_MODEL="" # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
```
4. **Run Inference**:
- For **speech-to-text (S2T)** inference:
```bash
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
```
- For **speech-to-speech (S2S)** inference:
```bash
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
```
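The `--s2s` run writes the synthesized reply as audio under `--save_dir`. A quick way to inspect it (the filename below is a placeholder; check the directory for the file the script actually writes):
```python
# Load and inspect the generated waveform; replace the placeholder path with
# the file written to --save_dir by the S2S run.
import torchaudio

wav, sr = torchaudio.load("./generated_answer.wav")  # placeholder path
print(f"channels x samples = {tuple(wav.shape)}, sample rate = {sr} Hz")
```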
### 📊 Performance Evaluation
VocalNet models were evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), which covers AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. In the tables below, s→t denotes speech-to-text evaluation and s→s speech-to-speech; **bold** marks the best result within each subgroup.
#### Overall Performance
| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | 5.59 | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | **5.67** | 6.16 |
#### Response Alignment and Acoustic Quality
WER is word error rate in % (lower is better, ↓); UTMOS is an automatic MOS estimate of acoustic quality (higher is better, ↑).

| Model | AlpacaEval WER↓ | AlpacaEval UTMOS↑ | LLaMA Questions WER↓ | LLaMA Questions UTMOS↑ | TriviaQA WER↓ | TriviaQA UTMOS↑ | Web Questions WER↓ | Web Questions UTMOS↑ | Avg WER↓ | Avg UTMOS↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Tiny Models** | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| **Base Models** | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | 2.65 | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | 4.21 | 4.485 | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | 4.489 | 2.68 | 4.500 | 4.04 | 4.482 | **3.11** | **4.492** | 3.56 | 4.489 |
### ✍️ Citation
If you find our work useful, please cite:
```bibtex
@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}
```