---
license: apache-2.0
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

## 🎧 VocalNet-1B Model Card

**VocalNet-1B** is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. 🚀

### 📂 Paper, Code and Model Access

- **arXiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060) 📖
- **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet) 🌐
- **HuggingFace**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B) 🤗
- **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B) 🔮

### 🔧 Repository Download and Environment Setup

To get started with **VocalNet-1B**, clone the repository and set up the environment as follows. 🛠️

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/SJTU-OmniAgent/VocalNet.git
   cd VocalNet
   ```
2. **Create and Activate Environment**:
   ```bash
   conda create -n vocalnet python==3.10
   conda activate vocalnet
   ```
3. **Install Dependencies**:
   ```bash
   pip install --upgrade pip
   conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
   pip install -e .
   ```
4. **Optional: Install Training Packages**: If you plan to train the model, install the additional packages:
   ```bash
   pip install -e ".[train]"
   pip install flash-attn --no-build-isolation
   ```

### 📥 Download Instructions

**Via Hugging Face CLI**:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```

**Via Snapshot Download**:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True
)
```

**Via Git**:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```

### 🛠️ Dependencies

- **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) 🎤
- **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms. 🔊
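If you prefer to script these downloads, the minimal sketch below uses `snapshot_download` (the same helper as in the download instructions above) to fetch both dependencies. The target directories are assumptions based on the paths referenced in the inference steps below; adjust them to match your own layout.

```python
from huggingface_hub import snapshot_download

# Speech encoder: Whisper-large-v3. The inference steps below expect it under
# ./models/speech_encoder/ (the exact subdirectory layout here is an assumption).
snapshot_download(
    repo_id="openai/whisper-large-v3",
    local_dir="./models/speech_encoder/whisper-large-v3",
)

# Vocoder: CosyVoice2-0.5B. Point COSYVOICE_MODEL in omni_speech/infer/vocalnet.py
# to this directory (the chosen location is an assumption).
snapshot_download(
    repo_id="FunAudioLLM/CosyVoice2-0.5B",
    local_dir="./models/CosyVoice2-0.5B",
)
```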
### 🔄 Local Inference

To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally. 📡

1. **Model Preparation**:
   - Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B). 📦
   - Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory. 🎤
2. **CosyVoice Preparation**:
   - VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B). 🔊
3. **Path Modification**:
   - Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
     ```python
     COSYVOICE_MODEL = ""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
     VOCALNET_MODEL = ""   # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
     ```
4. **Run Inference**:
   - For **speech-to-text (S2T)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
     ```
   - For **speech-to-speech (S2S)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
     ```
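For evaluation-style runs over many queries, the commands above can be wrapped in a small driver script. The sketch below is a hypothetical batch helper, not part of the repository: the input and output directories are placeholders, and it simply invokes the documented CLI once per `.wav` file (the paths in `omni_speech/infer/vocalnet.py` must already be set, as in step 3).

```python
import subprocess
from pathlib import Path

QUERY_DIR = Path("./my_queries")   # assumption: your own folder of query audio files
SAVE_DIR = Path("./s2s_outputs")   # assumption: where generated speech should be saved
SAVE_DIR.mkdir(parents=True, exist_ok=True)

# Run speech-to-speech inference for every .wav query via the documented CLI flags.
for wav in sorted(QUERY_DIR.glob("*.wav")):
    subprocess.run(
        [
            "python3", "omni_speech/infer/vocalnet.py",
            "--query_audio", str(wav),
            "--s2s",
            "--save_dir", str(SAVE_DIR),
        ],
        check=True,  # stop early if a single query fails
    )
```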
### 📊 Performance Evaluation

VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. **Bold** indicates the optimal result in each subgroup.

#### Overall Performance

| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| *Tiny Models* | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| *Base Models* | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | 5.59 | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | **5.67** | 6.16 |
#### Response Alignment and Acoustic Quality
Lower WER and higher UTMOS indicate better response alignment and acoustic quality, respectively.

| Model | AlpacaEval WER | AlpacaEval UTMOS | LLaMA Questions WER | LLaMA Questions UTMOS | TriviaQA WER | TriviaQA UTMOS | Web Questions WER | Web Questions UTMOS | Avg WER | Avg UTMOS |
|---|---|---|---|---|---|---|---|---|---|---|
| *Tiny Models* | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| *Base Models* | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | 2.65 | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | 4.21 | 4.485 | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | 4.489 | 2.68 | 4.500 | 4.04 | 4.482 | **3.11** | **4.492** | 3.56 | 4.489 |
### ✍️ Citation

If you find our work useful, please cite:

```bib
@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}
```