---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
---
# VocalNet-1B Model Card
VocalNet-1B is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon Llama-3.2-1B-Instruct, it employs multi-token prediction (MTP) to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs.
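The snippet below is only a toy illustration (not VocalNet's implementation) of why MTP reduces latency: a decoder that emits k speech tokens per forward pass needs roughly N/k passes to produce N tokens, versus N passes for standard next-token prediction. The `mock_forward` function and the group size `k = 5` are made up for this example.

```python
# Toy sketch (not VocalNet's actual code): a standard decoder emits 1 token per
# forward pass; an MTP decoder with k prediction heads emits k tokens per pass,
# so a reply that needs N speech tokens takes ~N/k passes instead of N.

def mock_forward(context: list, num_heads: int) -> list:
    """Stand-in for one model forward pass: returns `num_heads` new dummy token ids."""
    return [len(context) + i for i in range(num_heads)]

def generate(num_tokens: int, num_heads: int):
    tokens, passes = [], 0
    while len(tokens) < num_tokens:
        tokens.extend(mock_forward(tokens, num_heads))
        passes += 1
    return tokens[:num_tokens], passes

if __name__ == "__main__":
    _, baseline = generate(500, num_heads=1)  # next-token prediction
    _, mtp = generate(500, num_heads=5)       # multi-token prediction, k = 5 (illustrative)
    print(f"forward passes: baseline={baseline}, MTP(k=5)={mtp}")  # 500 vs. 100
```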
## Paper, Code and Model Access
- arXiv: [VocalNet Report](https://arxiv.org/abs/2504.04060)
- GitHub: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet)
- HuggingFace: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B)
- ModelScope: VocalNet/VocalNet-1B
## Repository Download and Environment Setup
To get started with VocalNet-1B, clone the repository and set up the environment as follows.
Clone the Repository:
```bash
git clone https://github.com/SJTU-OmniAgent/VocalNet.git
cd VocalNet
```
Create and Activate Environment:
```bash
conda create -n vocalnet python==3.10
conda activate vocalnet
```
Install Dependencies:
```bash
pip install --upgrade pip
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
```
Optional: if you plan to train the model, install the additional training packages:
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
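Optionally, you can sanity-check that PyTorch and CUDA are visible inside the new environment (a generic check, not part of the official setup):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```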
## Download Instructions
Via Hugging Face CLI:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```
Via Snapshot Download:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True
)
```
Via Git:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```
## Dependencies
- Speech Encoder: Whisper-large-v3
- Vocoder: CosyVoice2-0.5B, for converting speech tokens to audio waveforms.
## Local Inference
To perform inference with VocalNet-1B, follow these steps to set up and run the model locally.
Model Preparation:
- Download VocalNet-1B from HuggingFace or ModelScope.
- Download the Whisper-large-v3 speech encoder from HuggingFace and place it in the `./models/speech_encoder/` directory (a download sketch follows after this list).
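For reference, one way to fetch the encoder with the Hugging Face CLI; `openai/whisper-large-v3` is the public Whisper release, while the exact sub-directory name is an assumption and should match whatever path the repository expects:

```bash
huggingface-cli download openai/whisper-large-v3 --local-dir ./models/speech_encoder/whisper-large-v3
```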
CosyVoice Preparation:
- VocalNet-1B uses CosyVoice2-0.5B to convert generated speech tokens into audio waveforms. Download it from HuggingFace (see the sketch below).
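A hedged sketch of downloading the vocoder with the Hugging Face CLI; `<cosyvoice-repo-id>` is a placeholder for whichever CosyVoice2-0.5B repository the VocalNet project points to, and the local path only mirrors the example under Path Modification:

```bash
# <cosyvoice-repo-id> is a placeholder; substitute the CosyVoice2-0.5B repository used by VocalNet.
huggingface-cli download <cosyvoice-repo-id> --local-dir ./pretrained_models/CosyVoice2-0.5B-VocalNet
```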
Path Modification:
- Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:

  ```python
  COSYVOICE_MODEL = ""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
  VOCALNET_MODEL = ""   # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
  ```
Run Inference:
- For speech-to-text (S2T) inference:

  ```bash
  python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
  ```

- For speech-to-speech (S2S) inference:

  ```bash
  python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
  ```
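After an S2S run, a small sketch (not part of the repository) can confirm the generated audio; the exact output filename under `--save_dir` is not assumed, so it simply scans for `.wav` files:

```python
# Summarize any .wav files the S2S run wrote to the save directory (./ in the example above).
import glob

import soundfile as sf  # pip install soundfile

for path in sorted(glob.glob("./*.wav")):
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s at {sr} Hz")
```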
## Performance Evaluation
VocalNet-1B was evaluated on OpenAudioBench, covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the optimal result in each subgroup.
### Overall Performance
| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | 6.43 | 7.16 |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | 6.48 | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | 76.3 | 5.59 | 6.70 |
Results for the larger VocalNet-8B (VA) and VocalNet-8B models are reported in the VocalNet paper.
## Citation
If you find our work useful, please cite:
```bibtex
@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}
```