---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
---

## šŸŽ§ VocalNet-1B Model Card

**VocalNet-1B** is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. šŸš€

### šŸ“‚ Paper, Code and Model Access

- **arXiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060) šŸ“–
- **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet) 🌐
- **HuggingFace**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B) šŸ¤—
- **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B) šŸ”®

### šŸ”§ Repository Download and Environment Setup

To get started with **VocalNet-1B**, clone the repository and set up the environment as follows. šŸ› ļø

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/SJTU-OmniAgent/VocalNet.git
   cd VocalNet
   ```
2. **Create and Activate Environment**:
   ```bash
   conda create -n vocalnet python==3.10
   conda activate vocalnet
   ```
3. **Install Dependencies**:
   ```bash
   pip install --upgrade pip
   conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
   pip install -e .
   ```
4. **Optional: Install Training Packages**: If you plan to train the model, install the additional packages:
   ```bash
   pip install -e ".[train]"
   pip install flash-attn --no-build-isolation
   ```

### šŸ“„ Download Instructions

**Via Hugging Face CLI**:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```

**Via Snapshot Download**:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True
)
```

**Via Git**:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```
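
**Via ModelScope**: the model is also mirrored on ModelScope (linked above). The snippet below is only a minimal sketch of the equivalent download using the `modelscope` package; the `cache_dir` value is an assumption chosen to mirror the `./checkpoints/` layout used in the examples above.
```python
# Minimal sketch (assumption): download VocalNet-1B from the ModelScope mirror.
# Requires `pip install modelscope`; the cache directory is an arbitrary choice.
from modelscope import snapshot_download

local_path = snapshot_download(
    "VocalNet/VocalNet-1B",      # repo id on ModelScope
    cache_dir="./checkpoints/",  # assumed to mirror the Hugging Face examples above
)
print(f"Model files downloaded to: {local_path}")
```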
### šŸ› ļø Dependencies

- **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) šŸŽ¤
- **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms. šŸ”Š

### šŸ”„ Local Inference

To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally. šŸ“”

1. **Model Preparation**:
   - Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B). šŸ“¦
   - Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory. šŸŽ¤
2. **CosyVoice Preparation**:
   - VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B). šŸ”Š
3. **Path Modification**:
   - Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
     ```python
     COSYVOICE_MODEL=""   # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
     VOCALNET_MODEL=""    # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
     ```
4. **Run Inference** (a batch-processing sketch over this CLI follows the list):
   - For **speech-to-text (S2T)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
     ```
   - For **speech-to-speech (S2S)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
     ```
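
The commands above process one audio file per invocation. For a folder of recordings, a simple loop over the same CLI is a reasonable starting point; the sketch below assumes hypothetical `./my_audio_queries/` and `./outputs/` directories and reloads the model on every call, so for large batches you would likely adapt `omni_speech/infer/vocalnet.py` to load the model once instead.
```python
# Minimal sketch (assumption): batch S2S inference by looping over the CLI shown above.
# `./my_audio_queries/` is a hypothetical input folder; generated audio goes to ./outputs/.
import pathlib
import subprocess

audio_dir = pathlib.Path("./my_audio_queries")
save_dir = pathlib.Path("./outputs")
save_dir.mkdir(exist_ok=True)

for wav in sorted(audio_dir.glob("*.wav")):
    # Drop --s2s and --save_dir for text-only (S2T) responses.
    subprocess.run(
        [
            "python3", "omni_speech/infer/vocalnet.py",
            "--query_audio", str(wav),
            "--s2s", "--save_dir", str(save_dir),
        ],
        check=True,
    )
```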

### šŸ“Š Performance Evaluation

VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. **Bold** indicates the optimal result in each subgroup.

#### Overall Performance

| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | 6.43 | 7.16 |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | 6.48 | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 7B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | 76.3 | 5.59 | 6.70 |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | - |
| | | s→s | - | - | - | - |
| VocalNet-8B | 8B | s→t | 7.12 | 79.5 | 6.24 | - |
| | | s→s | - | - | - | - |
### āœļø Citation If you find our work useful, please cite: ```bib @article{wang2025vocalnet, title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation}, author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu}, journal={arXiv preprint arXiv:2504.04060}, year={2025} } ```