---
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
datasets:
  - VocalNet/VoiceAssitant-430K-vocalnet
  - VocalNet/UltraChat-vocalnet
language:
  - en
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
---

๐ŸŽง VocalNet-1B Model Card

VocalNet-1B is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon LLaMA-3.2-1B-Instruct, it employs multi-token prediction (MTP) to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. ๐Ÿš€
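
For intuition about how MTP speeds up decoding: instead of a single LM head emitting one token per forward pass, several lightweight heads each predict a different future offset, so one backbone pass yields several speech tokens at once. The PyTorch sketch below is a generic illustration of the idea only; MTPHead, n_future, and the vocabulary size are our own illustrative names and numbers, not VocalNet's actual modules.

import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Generic multi-token-prediction heads: one linear head per future offset.

    Illustrative sketch only; VocalNet's real MTP implementation differs.
    """
    def __init__(self, hidden: int, vocab: int, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(n_future))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) hidden state at the last position of the backbone.
        # Returns (batch, n_future, vocab): logits for tokens t+1 ... t+n_future.
        return torch.stack([head(h) for head in self.heads], dim=1)

# Toy usage with a random tensor standing in for the LLM backbone output.
h = torch.randn(2, 2048)
logits = MTPHead(hidden=2048, vocab=6561)(h)  # vocab size chosen arbitrarily for the demo
print(logits.argmax(dim=-1).shape)  # torch.Size([2, 4]): four tokens per backbone pass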

๐Ÿ“‚ Paper, Code and Model Access

๐Ÿ”ง Repository Download and Environment Setup

To get started with VocalNet-1B, clone the repository and set up the environment as follows. ๐Ÿ› ๏ธ

  1. Clone the Repository:

    git clone https://github.com/SJTU-OmniAgent/VocalNet.git
    cd VocalNet
    
  2. Create and Activate Environment:

    conda create -n vocalnet python=3.10
    conda activate vocalnet
    
  3. Install Dependencies:

    pip install --upgrade pip
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
    pip install -e .
    
  4. Optional: If you plan to train the model, install the additional training packages (a quick environment check follows these steps):

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    
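Before moving on, it is worth confirming that the pinned PyTorch build actually sees your GPU, since a mismatch between pytorch-cuda=12.1 and the local NVIDIA driver is a common source of setup failures. A quick sanity check:

import torch

# Expect "2.1.2" and True on a machine with a working CUDA 12.1 setup.
print(torch.__version__)
print(torch.cuda.is_available())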

๐Ÿ“ฅ Download Instructions

Via the Hugging Face CLI:

pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-1B --local-dir ./checkpoints/

Via Snapshot Download:

pip install -U huggingface_hub

Then, in Python:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="VocalNet/VocalNet-1B",
    local_dir="./checkpoints/",
    resume_download=True,
)

Via Git:

git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-1B
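
The same CLI can also fetch the auxiliary models used under Local Inference below. For example, the Whisper-large-v3 speech encoder can be downloaded like this (the target directory just mirrors the ./models/speech_encoder/ layout mentioned below; adjust it to your setup):

huggingface-cli download openai/whisper-large-v3 --local-dir ./models/speech_encoder/whisper-large-v3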

๐Ÿ› ๏ธ Dependencies

๐Ÿ”„ Local Inference

To perform inference with VocalNet-1B, follow these steps to set up and run the model locally. ๐Ÿ“ก

  1. Model Preparation:

    • Download VocalNet-1B from HuggingFace or ModelScope. ๐Ÿ“ฆ
    • Download the Whisper-large-v3 speech encoder from HuggingFace and place it in the ./models/speech_encoder/ directory. ๐ŸŽค
  2. CosyVoice Preparation:

    • VocalNet-1B uses CosyVoice2-0.5B to convert generated speech tokens into audio waveforms. Download it from HuggingFace. ๐Ÿ”Š
  3. Path Modification:

    • Update the paths in omni_speech/infer/vocalnet.py to point to the downloaded models:
      COSYVOICE_MODEL=""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
      VOCALNET_MODEL=""  # Path to VocalNet-1B, e.g., ./checkpoints/VocalNet-1B
      
  4. Run Inference:

    • For speech-to-text (S2T) inference:
      python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
      
    • For speech-to-speech (S2S) inference (a programmatic wrapper sketch follows this list):
      python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
      
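To drive inference from your own code rather than the shell, a thin wrapper around the CLI above is the simplest route. The sketch below shells out to omni_speech/infer/vocalnet.py with the documented flags; the output file naming inside save_dir is not documented here, so the wrapper simply reports whatever new .wav files appear. The helper name vocalnet_s2s is ours, not part of the repo.

import subprocess
from pathlib import Path

def vocalnet_s2s(query_audio: str, save_dir: str = "./") -> list[Path]:
    """Run VocalNet S2S inference via the repo CLI; return newly created .wav files.

    Assumes it runs from the VocalNet repo root, inside the `vocalnet` env,
    with the model paths in omni_speech/infer/vocalnet.py already filled in.
    """
    out = Path(save_dir)
    before = set(out.glob("*.wav"))  # snapshot so new outputs can be spotted
    subprocess.run(
        [
            "python3", "omni_speech/infer/vocalnet.py",
            "--query_audio", query_audio,
            "--s2s",
            "--save_dir", str(out),
        ],
        check=True,  # raise if the inference script fails
    )
    return sorted(set(out.glob("*.wav")) - before)

if __name__ == "__main__":
    for wav in vocalnet_s2s("./omni_speech/infer/llama_questions_42.wav"):
        print("generated:", wav)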

๐Ÿ“Š Performance Evaluation

VocalNet-1B was evaluated on OpenAudioBench, covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the optimal result in each subgroup.

Overall Performance

| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 7B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | **5.59** | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | - |
| | | s→s | - | - | 4.21 | 4.26 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | - |
| | | s→s | - | - | 3.11 | 3.56 |

โœ๏ธ Citation

If you find our work useful, please cite:

@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}