---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
datasets:
- VocalNet/VoiceAssitant-430K-vocalnet
- VocalNet/UltraChat-vocalnet
language:
- en
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
---

## VocalNet-1B Model Card

**VocalNet-1B** is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon [LLaMA-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), it employs **multi-token prediction (MTP)** to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs.

### Paper, Code and Model Access

- **arXiv**: [VocalNet Report](https://arxiv.org/abs/2504.04060)
- **GitHub**: [VocalNet Repository](https://github.com/SJTU-OmniAgent/VocalNet)
- **Hugging Face**: [VocalNet/VocalNet-1B](https://huggingface.co/VocalNet/VocalNet-1B)
- **ModelScope**: [VocalNet/VocalNet-1B](https://www.modelscope.cn/models/VocalNet/VocalNet-1B)
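The intuition behind MTP can be sketched in a few lines: instead of a single LM head emitting one token per forward pass, several lightweight heads read the same hidden state and each predicts one of the next *k* speech tokens, so one pass yields *k* tokens. The sketch below is a toy illustration with made-up sizes and a plain linear head per position; it is not VocalNet's actual implementation.

```python
import torch
import torch.nn as nn

class ToyMTPHead(nn.Module):
    """Toy multi-token prediction head: k linear heads over one hidden state."""

    def __init__(self, hidden_size: int, vocab_size: int, k: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(k)]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) last-position hidden state.
        # Returns (batch, k): k speech tokens decoded in a single step,
        # cutting the number of forward passes by roughly a factor of k.
        logits = torch.stack([head(hidden) for head in self.heads], dim=1)
        return logits.argmax(dim=-1)

# Illustrative sizes only -- not VocalNet's real configuration.
mtp = ToyMTPHead(hidden_size=256, vocab_size=1024, k=5)
print(mtp(torch.randn(2, 256)).shape)  # torch.Size([2, 5])
```

Greedy argmax is used here only for brevity; any per-head sampling strategy fits the same structure.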
### Repository Download and Environment Setup

To get started with **VocalNet-1B**, clone the repository and set up the environment as follows.

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/SJTU-OmniAgent/VocalNet.git
   cd VocalNet
   ```
2. **Create and Activate Environment**:
   ```bash
   conda create -n vocalnet python=3.10
   conda activate vocalnet
   ```
3. **Install Dependencies**:
   ```bash
   pip install --upgrade pip
   conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
   pip install -e .
   ```
4. **Optional: Install Training Packages**: If you plan to train the model, install the additional packages:
   ```bash
   pip install -e ".[train]"
   pip install flash-attn --no-build-isolation
   ```

### Download Instructions

**Via Hugging Face CLI**:
```bash
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/
```

**Via Snapshot Download**:
```bash
pip install -U huggingface_hub
```
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True
)
```

**Via Git**:
```bash
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
```

### Dependencies

- **Speech Encoder**: [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
- **Vocoder**: [CosyVoice2-0.5B](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) for converting speech tokens to audio waveforms

### Local Inference

To perform inference with **VocalNet-1B**, follow these steps to set up and run the model locally.

1. **Model Preparation**:
   - Download **VocalNet-1B** from [HuggingFace](https://huggingface.co/VocalNet/VocalNet-1B) or [ModelScope](https://www.modelscope.cn/models/VocalNet/VocalNet-1B).
   - Download the **Whisper-large-v3** speech encoder from [HuggingFace](https://huggingface.co/openai/whisper-large-v3) and place it in the `./models/speech_encoder/` directory.
2. **CosyVoice Preparation**:
   - VocalNet-1B uses **CosyVoice2-0.5B** to convert generated speech tokens into audio waveforms. Download it from [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B).
3. **Path Modification**:
   - Update the paths in `omni_speech/infer/vocalnet.py` to point to the downloaded models:
     ```python
     COSYVOICE_MODEL=""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
     VOCALNET_MODEL=""   # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
     ```
4. **Run Inference**:
   - For **speech-to-text (S2T)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
     ```
   - For **speech-to-speech (S2S)** inference:
     ```bash
     python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
     ```
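If you want to run many queries in one go, a thin wrapper around the S2S command above is enough. The sketch below is a convenience script under stated assumptions, not part of the repository: it presumes you launch it from the repo root with the `vocalnet` environment active, and that a hypothetical `./queries/` folder holds your input WAVs.

```python
import subprocess
from pathlib import Path

QUERY_DIR = Path("./queries")     # hypothetical folder of input WAV queries
SAVE_DIR = Path("./s2s_outputs")  # passed to --save_dir for generated speech
SAVE_DIR.mkdir(parents=True, exist_ok=True)

for wav in sorted(QUERY_DIR.glob("*.wav")):
    # Reuses the documented CLI flags; one process per query keeps it simple.
    subprocess.run(
        [
            "python3", "omni_speech/infer/vocalnet.py",
            "--query_audio", str(wav),
            "--s2s",
            "--save_dir", str(SAVE_DIR),
        ],
        check=True,  # raise if any single query fails
    )
```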
### Performance Evaluation

VocalNet-1B was evaluated on [OpenAudioBench](https://huggingface.co/datasets/baichuan-inc/OpenAudioBench), covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. **Bold** indicates the optimal result in each subgroup.

#### Overall Performance

| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 7B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | **5.59** | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | - |
| | | s→s | - | - | - | - |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | - |
| | | s→s | - | - | - | - |
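Benchmark scores for the s→s modality are typically computed on the textual content of the spoken answers, so a quick way to sanity-check your own S2S outputs is to transcribe them back to text. The sketch below uses the open-source `openai-whisper` package (an extra dependency, not bundled with VocalNet); the audio path is a placeholder for whatever file your `--s2s` run saved.

```python
import whisper  # pip install -U openai-whisper

# Placeholder path: point this at a waveform produced by an --s2s run.
AUDIO_PATH = "./s2s_outputs/answer.wav"

model = whisper.load_model("large-v3")  # smaller checkpoints also work
result = model.transcribe(AUDIO_PATH)
print(result["text"])  # eyeball the transcript against the expected answer
```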