|
--- |
|
library_name: transformers |
|
tags: |
|
- speech |
|
- tokenization |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- zh |
|
base_model: |
|
- Emova-ollm/emova_speech_tokenizer |
|
--- |
|
|
|
# EMOVA Speech Tokenizer HF |
|
|
|
<div align="center"> |
|
|
|
<img src="./examples/images/emova_icon2.png" width="300em"></img> |
|
|
|
🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 📄 [Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/) | 💻 [Github](https://github.com/emova-ollm/EMOVA_speech_tokenizer) | 💻 [EMOVA-Github](https://github.com/emova-ollm/EMOVA)
|
|
|
</div> |
|
|
|
## Model Summary |
|
|
|
This repo contains the official speech tokenizer used to train the [EMOVA](https://emova-ollm.github.io/) series of models. With a semantic-acoustic disentangled design, it not only facilitates seamless omni-modal alignment across the vision, language, and speech modalities, but also enables flexible speech style controls, including speakers, emotions, and pitches. We summarize its key advantages as follows:
|
|
|
- **Discrete speech tokenizer**: it contains a SPIRAL-based **speech-to-unit (S2U)** tokenizer that captures both the phonetic and tonal information of input speech, which is then discretized by a **finite scalar quantizer (FSQ)** into discrete speech units, and a VITS-based **unit-to-speech (U2S)** de-tokenizer that reconstructs speech signals from the speech units.
|
|
|
- **Semantic-acoustic disentanglement**: to seamlessly align speech units with the highly semantic embedding space of LLMs, we opt to decouple the **semantic contents** and **acoustic styles** of input speech, and only the former is utilized to generate the speech tokens.
|
|
|
- **Bilingual tokenization**: the EMOVA speech tokenizer supports both **Chinese** and **English** speech tokenization with a shared speech codebook.
|
|
|
- **Flexible speech style control**: thanks to the semantic-acoustic disentanglement, the EMOVA speech tokenizer supports **24 speech style controls** (i.e., 2 speakers, 3 pitches, and 4 emotions). Check the [Usage](#usage) section for more details, and see the sketch after the figure below.
|
|
|
<div align="center"> |
|
<img src="./examples/images/model_architecture.PNG" width="100%"></img>
|
</div> |
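
To make the 24 style controls concrete, here is a minimal sketch that enumerates them as condition strings in the `gender-*_emotion-*_speed-*_pitch-*` format used by the decoder in the [Usage](#usage) section. Fixing `speed` to `normal` is an illustrative choice here, since speed is an additional knob on top of the 24 speaker/pitch/emotion combinations.

```python
from itertools import product

# The 24 style controls = 2 speakers x 4 emotions x 3 pitches.
# Speed is fixed to 'normal' purely for illustration; the decoder
# also accepts 'fast' and 'slow' (see the Usage section below).
genders = ['female', 'male']
emotions = ['angry', 'happy', 'neutral', 'sad']
pitches = ['normal', 'high', 'low']

conditions = [
    f'gender-{g}_emotion-{e}_speed-normal_pitch-{p}'
    for g, e, p in product(genders, emotions, pitches)
]
print(len(conditions))  # 24
print(conditions[0])    # gender-female_emotion-angry_speed-normal_pitch-normal
```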
|
|
|
## Installation |
|
|
|
Clone this repo and create the EMOVA virtual environment with conda. Our code has been validated on **NVIDIA A800/H20 GPU & Ascend 910B3 NPU** servers. Other devices may work as well.
|
|
|
1. Initialize the conda environment: |
|
|
|
```bash |
|
git clone https://github.com/emova-ollm/EMOVA_speech_tokenizer.git |
|
conda create -n emova python=3.10 -y |
|
conda activate emova |
|
``` |
|
|
|
2. Install the required packages (note that the instructions differ between GPUs and NPUs):
|
|
|
```bash |
|
# upgrade pip and setuptools if necessary |
|
pip install -U pip setuptools |
|
|
|
cd emova_speech_tokenizer |
|
pip install -e . # for NVIDIA GPUs (e.g., A800 and H20) |
|
pip install -e .[npu]  # OR for Ascend NPUs (e.g., 910B3)
|
``` |
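
After installation, a quick sanity check can confirm that PyTorch sees your accelerator. The snippet below is a minimal sketch assuming the CUDA build of PyTorch; on Ascend NPUs, the `torch_npu` imports shown in the [Usage](#usage) section would be used instead.

```python
# Minimal environment sanity check (CUDA build of PyTorch assumed).
# On Ascend NPUs, use the torch_npu imports from the Usage section instead.
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a working GPU setup
```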
|
|
|
## Usage |
|
|
|
> [!NOTE] |
|
> Before running the examples below, remember to finish the [Installation](#installation) first!
|
|
|
The EMOVA speech tokenizer can be easily deployed using the 🤗 HuggingFace transformers API!
|
|
|
```python |
|
import random |
|
from transformers import AutoModel |
|
import torch |
|
|
|
### Uncomment if you want to use Ascend NPUs |
|
# import torch_npu |
|
# from torch_npu.contrib import transfer_to_npu |
|
|
|
# load pretrained model |
|
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda() |
|
|
|
# S2U: encode the input wav into discrete speech units
|
wav_file = "./examples/s2u/example.wav" |
|
speech_unit = model.encode(wav_file) |
|
print(speech_unit) |
|
|
|
# U2S: decode the speech units back into a wav file under a sampled style condition
|
emotion = random.choice(['angry', 'happy', 'neutral', 'sad']) |
|
speed = random.choice(['normal', 'fast', 'slow']) |
|
pitch = random.choice(['normal', 'high', 'low']) |
|
gender = random.choice(['female', 'male']) |
|
condition = f'gender-{gender}_emotion-{emotion}_speed-{speed}_pitch-{pitch}' |
|
|
|
output_wav_file = f'./examples/u2s/{condition}_output.wav' |
|
model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file) |
|
``` |
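
Because the speech units carry only semantic content, the same units can be decoded into any of the supported styles. The sketch below is a usage illustration built on the `encode`/`decode` API above (the output paths are hypothetical); it re-synthesizes one utterance under all four emotions:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()

# encode once: the units capture the semantic content only
speech_unit = model.encode("./examples/s2u/example.wav")

# decode the SAME units under all four emotions; speaker, speed, and pitch stay fixed
for emotion in ['angry', 'happy', 'neutral', 'sad']:
    condition = f'gender-female_emotion-{emotion}_speed-normal_pitch-normal'
    model.decode(speech_unit, condition=condition, output_wav_file=f'./examples/u2s/sweep_{emotion}.wav')
```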
|
|
|
## Citation |
|
If you find our model/code/paper helpful, please consider citing our papers and giving us a star!
|
|
|
```bibtex |
|
@article{chen2024emova, |
|
title={{EMOVA}: Empowering Language Models to See, Hear and Speak with Vivid Emotions},
|
author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others}, |
|
journal={arXiv preprint arXiv:2409.18042}, |
|
year={2024} |
|
} |
|
``` |