---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
---
# VocalNet-1B Model Card
VocalNet-1B is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon Llama-3.2-1B-Instruct, it employs multi-token prediction (MTP) to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs.
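The snippet below is only a toy illustration (not VocalNet's implementation) of why MTP reduces latency: a decoder that emits k speech tokens per forward pass needs roughly N/k passes to produce N tokens, versus N passes for standard next-token prediction. The `mock_forward` function and the group size `k = 5` are made up for this example.

```python
# Toy sketch (not VocalNet's actual code): a standard decoder emits 1 token per
# forward pass; an MTP decoder with k prediction heads emits k tokens per pass,
# so a reply that needs N speech tokens takes ~N/k passes instead of N.

def mock_forward(context: list, num_heads: int) -> list:
    """Stand-in for one model forward pass: returns `num_heads` new dummy token ids."""
    return [len(context) + i for i in range(num_heads)]

def generate(num_tokens: int, num_heads: int):
    tokens, passes = [], 0
    while len(tokens) < num_tokens:
        tokens.extend(mock_forward(tokens, num_heads))
        passes += 1
    return tokens[:num_tokens], passes

if __name__ == "__main__":
    _, baseline = generate(500, num_heads=1)  # next-token prediction
    _, mtp = generate(500, num_heads=5)       # multi-token prediction, k = 5 (illustrative)
    print(f"forward passes: baseline={baseline}, MTP(k=5)={mtp}")  # 500 vs. 100
```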
## Paper, Code and Model Access
- arXiv: [VocalNet Report](https://arxiv.org/abs/2504.04060)
- GitHub: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet)
- HuggingFace: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B)
- ModelScope: VocalNet/VocalNet-1B
## Repository Download and Environment Setup
To get started with VocalNet-1B, clone the repository and set up the environment as follows.
Clone the Repository:
```bash
git clone https://github.com/SJTU-OmniAgent/VocalNet.git
cd VocalNet
```
Create and Activate Environment:
```bash
conda create -n vocalnet python==3.10
conda activate vocalnet
```
Install Dependencies:
```bash
pip install --upgrade pip
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
```
Optional: if you plan to train the model, install the additional training packages:
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
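Optionally, you can sanity-check that PyTorch and CUDA are visible inside the new environment (a generic check, not part of the official setup):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```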
## Download Instructions
Via Hugging Face CLI:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```
Via Snapshot Download:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True
)
```
Via Git:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```
## Dependencies
- Speech Encoder: Whisper-large-v3
- Vocoder: CosyVoice2-0.5B, for converting speech tokens to audio waveforms.
## Local Inference
To perform inference with VocalNet-1B, follow these steps to set up and run the model locally.
Model Preparation:
- Download VocalNet-1B from HuggingFace or ModelScope.
- Download the Whisper-large-v3 speech encoder from HuggingFace and place it in the `./models/speech_encoder/` directory (a download sketch follows after this list).
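For reference, one way to fetch the encoder with the Hugging Face CLI; `openai/whisper-large-v3` is the public Whisper release, while the exact sub-directory name is an assumption and should match whatever path the repository expects:

```bash
huggingface-cli download openai/whisper-large-v3 --local-dir ./models/speech_encoder/whisper-large-v3
```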
CosyVoice Preparation:
- VocalNet-1B uses CosyVoice2-0.5B to convert generated speech tokens into audio waveforms. Download it from HuggingFace (see the sketch below).
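A hedged sketch of downloading the vocoder with the Hugging Face CLI; `<cosyvoice-repo-id>` is a placeholder for whichever CosyVoice2-0.5B repository the VocalNet project points to, and the local path only mirrors the example under Path Modification:

```bash
# <cosyvoice-repo-id> is a placeholder; substitute the CosyVoice2-0.5B repository used by VocalNet.
huggingface-cli download <cosyvoice-repo-id> --local-dir ./pretrained_models/CosyVoice2-0.5B-VocalNet
```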
Path Modification:
- Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:

  ```python
  COSYVOICE_MODEL = ""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
  VOCALNET_MODEL = ""   # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
  ```
Run Inference:
- For speech-to-text (S2T) inference:

  ```bash
  python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
  ```

- For speech-to-speech (S2S) inference:

  ```bash
  python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
  ```
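After an S2S run, a small sketch (not part of the repository) can confirm the generated audio; the exact output filename under `--save_dir` is not assumed, so it simply scans for `.wav` files:

```python
# Summarize any .wav files the S2S run wrote to the save directory (./ in the example above).
import glob

import soundfile as sf  # pip install soundfile

for path in sorted(glob.glob("./*.wav")):
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s at {sr} Hz")
```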
## Performance Evaluation
VocalNet-1B was evaluated on OpenAudioBench, covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the optimal result in each subgroup.
### Overall Performance
| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | 6.43 | 7.16 |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | 6.48 | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | 76.3 | 5.59 | 6.70 |
Results for the larger VocalNet-8B (VA) and VocalNet-8B models are reported in the VocalNet paper.
## Citation
If you find our work useful, please cite:
```bibtex
@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}
```