---
library_name: transformers
tags:
- speech
- tokenization
license: apache-2.0
language:
- en
- zh
base_model:
- Emova-ollm/emova_speech_tokenizer
---
# EMOVA Speech Tokenizer HF
<div align="center">
<img src="./examples/images/emova_icon2.png" width="300em"></img>
🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 📄 [Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/) | 💻 [Github](https://github.com/emova-ollm/EMOVA_speech_tokenizer) | 💻 [EMOVA-Github](https://github.com/emova-ollm/EMOVA)
</div>
## Model Summary
This repo contains the official speech tokenizer used to train the [EMOVA](https://emova-ollm.github.io/) series of models. With a semantic-acoustic disentangled design, it not only facilitates seamless omni-modal alignment among the vision, language, and speech modalities, but also enables flexible speech style control, including speakers, emotions, and pitches. Its key advantages are:
- **Discrete speech tokenizer**: it contains a SPIRAL-based **speech-to-unit (S2U)** tokenizer that captures both the phonetic and tonal information of input speech, which is then discretized by a **finite scalar quantizer (FSQ)** into discrete speech units, and a VITS-based **unit-to-speech (U2S)** de-tokenizer that reconstructs speech signals from the speech units.
- **Semantic-acoustic disentanglement**: to seamlessly align speech units with the highly semantic embedding space of LLMs, we opt to decouple the **semantic contents** and **acoustic styles** of input speech, and only the former are used to generate the speech tokens.
- **Bilingual tokenization**: the EMOVA speech tokenizer supports both **Chinese** and **English** speech tokenization with the same speech codebook.
- **Flexible speech style control**: thanks to the semantic-acoustic disentanglement, the EMOVA speech tokenizer supports **24 speech style controls** (i.e., 2 speakers, 3 pitches, and 4 emotions), enumerated in the sketch below the figure. Check the [Usage](#usage) section for more details.
<div align="center">
<img src="./examples/images/model_architecture.PNG" width=100%></img>
</div>
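For reference, the 24 speech style controls correspond to every speaker/pitch/emotion combination. Below is a minimal sketch that enumerates them, reusing the condition-string format from the [Usage](#usage) example (fixing `speed-normal` is an assumption made here purely for illustration):
```python
from itertools import product

# 2 speakers x 3 pitches x 4 emotions = 24 style conditions
genders = ['female', 'male']
pitches = ['normal', 'high', 'low']
emotions = ['angry', 'happy', 'neutral', 'sad']

# Condition-string layout follows the Usage example; speed is held at 'normal' here.
conditions = [
    f'gender-{g}_emotion-{e}_speed-normal_pitch-{p}'
    for g, p, e in product(genders, pitches, emotions)
]
print(len(conditions))   # 24
print(conditions[0])     # gender-female_emotion-angry_speed-normal_pitch-normal
```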
## Installation
Clone this repo and create the EMOVA virtual environment with conda. Our code has been validated on **NVIDIA A800/H20 GPU & Ascend 910B3 NPU** servers; other devices may work as well.
1. Initialize the conda environment:
```bash
git clone https://github.com/emova-ollm/EMOVA_speech_tokenizer.git
conda create -n emova python=3.10 -y
conda activate emova
```
2. Install the required packages (note that the instructions differ between GPUs and NPUs):
```bash
# upgrade pip and setuptools if necessary
pip install -U pip setuptools
cd emova_speech_tokenizer
pip install -e . # for NVIDIA GPUs (e.g., A800 and H20)
pip install -e .[npu] # OR for Ascend NPUs (e.g., 910B3)
```
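Optionally, you can verify the environment before moving on. The following is a minimal sanity check (assuming the `emova` environment is active; the NPU branch only applies if the `[npu]` extra was installed):
```python
# Quick sanity check of the freshly created environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # NVIDIA GPUs (e.g., A800, H20)

try:
    import torch_npu  # Ascend NPUs (e.g., 910B3); installed via the [npu] extra
    print("NPU available:", torch_npu.npu.is_available())
except ImportError:
    print("torch_npu not installed; skipping the NPU check.")
```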
## Usage
> [!NOTE]
> Before running the example below, make sure to complete the [Installation](#installation) steps first!

The EMOVA speech tokenizer can be easily deployed using the 🤗 HuggingFace transformers API!
```python
import random
from transformers import AutoModel
import torch
### Uncomment if you want to use Ascend NPUs
# import torch_npu
# from torch_npu.contrib import transfer_to_npu
# load pretrained model
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
# S2U
wav_file = "./examples/s2u/example.wav"
speech_unit = model.encode(wav_file)
print(speech_unit)
# U2S
emotion = random.choice(['angry', 'happy', 'neutral', 'sad'])
speed = random.choice(['normal', 'fast', 'slow'])
pitch = random.choice(['normal', 'high', 'low'])
gender = random.choice(['female', 'male'])
condition = f'gender-{gender}_emotion-{emotion}_speed-{speed}_pitch-{pitch}'
output_wav_file = f'./examples/u2s/{condition}_output.wav'
model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)
```
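Thanks to the semantic-acoustic disentanglement, the same unit sequence can be re-synthesized in multiple styles. Below is a minimal sketch building on the example above (the style list and output paths are chosen arbitrarily for illustration):
```python
# Decode the same speech units under several different style conditions.
# Only the condition string changes; the semantic content stays identical.
styles = [
    'gender-female_emotion-happy_speed-normal_pitch-normal',
    'gender-male_emotion-sad_speed-slow_pitch-low',
    'gender-female_emotion-angry_speed-fast_pitch-high',
]
for style in styles:
    model.decode(speech_unit, condition=style, output_wav_file=f'./examples/u2s/{style}_output.wav')
```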
## Citation
If you find our model/code/paper helpful, please consider citing our paper and giving the repo a star!
```bibtex
@article{chen2024emova,
title={Emova: Empowering language models to see, hear and speak with vivid emotions},
author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
journal={arXiv preprint arXiv:2409.18042},
year={2024}
}
``` |