Update README.md
README.md CHANGED
@@ -11,44 +11,70 @@ base_model:
- Emova-ollm/emova_speech_tokenizer
---

<div align="center">

<img src="./examples/images/emova_icon2.png" width="300em"></img>

-# EMOVA Speech Tokenizer HF

🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 💻 [EMOVA-Main-Repo](https://github.com/emova-ollm/EMOVA) | 📄 [EMOVA-Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/)

</div>

## Model Summary

-This repo contains the
-git clone https://huggingface.co/Emova-ollm/emova_speech_tokenizer
-cd emova_speech_tokenizer
-# for
-pip install -e .

## Usage

from transformers import AutoModel

# load pretrained model
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
@@ -70,6 +96,7 @@ model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)

- Emova-ollm/emova_speech_tokenizer
---

# EMOVA Speech Tokenizer HF

<div align="center">

<img src="./examples/images/emova_icon2.png" width="300em"></img>

🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 💻 [EMOVA-Main-Repo](https://github.com/emova-ollm/EMOVA) | 📄 [EMOVA-Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/)

</div>

## Model Summary

This repo contains the official speech tokenizer used to train the [EMOVA](https://emova-ollm.github.io/) series of models. With a semantic-acoustic disentangled design, it not only facilitates seamless omni-modal alignment among the vision, language, and speech modalities, but also enables flexible speech style controls, including speakers, emotions, and pitches. We summarize its key advantages as follows:

- **Discrete speech tokenizer**: it contains a SPIRAL-based **speech-to-unit (S2U)** tokenizer to capture both the phonetic and tonal information of input speech, which is then discretized by a **finite scalar quantizer (FSQ)** into discrete speech units, and a VITS-based **unit-to-speech (U2S)** de-tokenizer to reconstruct speech signals from the speech units (see the toy FSQ sketch after the architecture figure below).

- **Semantic-acoustic disentanglement**: to seamlessly align speech units with the highly semantic embedding space of LLMs, we decouple the **semantic contents** and **acoustic styles** of input speech, and only the former are used to generate the speech tokens.

- **Bilingual tokenization**: EMOVA speech tokenizer supports both **Chinese** and **English** speech tokenization with the same speech codebook.

- **Flexible speech style control**: thanks to the semantic-acoustic disentanglement, EMOVA speech tokenizer supports **24 speech style controls** (i.e., 2 speakers, 3 pitches, and 4 emotions). Check the [Usage](#usage) section for more details.

<div align="center">

<img src="./examples/images/model_architecture.PNG" width=100%></img>
</div>
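
For intuition about the FSQ step named above, here is a minimal, self-contained sketch of finite scalar quantization: each latent dimension is bounded and rounded to a small fixed set of levels, and the per-dimension levels jointly index a discrete unit. The dimensionality and level counts below are illustrative placeholders, not the EMOVA tokenizer's actual configuration.

```python
# Toy finite scalar quantization (FSQ) sketch (illustrative only).
# The EMOVA tokenizer's real latent size, levels, and codebook live in the model repo.
import torch

def fsq(z, levels=(7, 7, 7, 5, 5)):
    L = torch.tensor(levels, dtype=z.dtype)
    half = (L - 1) / 2                            # odd level counts keep this an integer
    z_bounded = torch.tanh(z) * half              # bound each dimension to [-half, half]
    z_q = torch.round(z_bounded)                  # snap to the nearest allowed level
    z_q = z_bounded + (z_q - z_bounded).detach()  # straight-through estimator for training
    idx = (z_q + half).to(torch.int64)            # per-dimension level indices in [0, L-1]
    radix = torch.cumprod(torch.tensor((1,) + levels[:-1]), dim=0)
    unit_id = (idx * radix).sum(dim=-1)           # combine indices into one discrete unit id
    return z_q, unit_id

z = torch.randn(2, 5)   # two toy latent vectors with 5 dimensions
z_q, units = fsq(z)
print(units)            # discrete unit ids in [0, 7*7*7*5*5)
```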

## Installation

Clone this repo and create the EMOVA virtual environment with conda. Our code has been validated on **NVIDIA A800/H20 GPU & Ascend 910B3 NPU** servers; other devices may work as well.

1. Initialize the conda environment:

```bash
git clone https://github.com/emova-ollm/EMOVA_speech_tokenizer.git
conda create -n emova python=3.9 -y
conda activate emova
```

2. Install the required packages (note that the instructions differ between GPUs and NPUs):

```bash
# upgrade pip and setuptools if necessary
pip install -U pip setuptools

cd emova_speech_tokenizer
pip install -e .       # for NVIDIA GPUs (e.g., A800 and H20)
pip install -e .[npu]  # OR for Ascend NPUs (e.g., 910B3)
```
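
A quick, optional check to confirm the environment sees your accelerator after installation; the NPU lines assume the `torch_npu` plugin referenced in the Usage section below.

```python
# Optional post-install sanity check.
import torch
print(torch.__version__)
print(torch.cuda.is_available())     # expect True on NVIDIA GPUs (e.g., A800/H20)

# On Ascend NPUs, import the torch_npu plugin first:
# import torch_npu
# print(torch.npu.is_available())    # expect True on Ascend NPUs (e.g., 910B3)
```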

## Usage

> [!NOTE]
> Remember to finish the [Installation](#installation) first!

EMOVA speech tokenizer can be easily deployed using the 🤗 HuggingFace transformers API!

```python
import random
from transformers import AutoModel
import torch

## add if you want to use Ascend NPUs
# import torch_npu
# from torch_npu.contrib import transfer_to_npu

# load pretrained model
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()

# ... (lines 81-95 of the usage example are unchanged and omitted in this diff view)
```
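
The lines elided from the example above are unchanged in this commit; the second hunk header shows their surrounding context is a call to `model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)`, i.e., speech is re-synthesized from the units under a chosen style condition. As a rough illustration of the 24 style controls listed in the Model Summary, the snippet below simply enumerates the speaker/emotion/pitch combinations; the label strings are placeholders, not necessarily the exact condition values the model expects.

```python
# Placeholder enumeration of the 24 (speaker, emotion, pitch) style combinations.
# The label strings are illustrative; consult the full usage example for the exact
# condition values accepted by model.decode().
from itertools import product

speakers = ["female", "male"]                    # 2 speakers (placeholder names)
emotions = ["neutral", "happy", "sad", "angry"]  # 4 emotions (placeholder names)
pitches  = ["normal", "low", "high"]             # 3 pitches (placeholder names)

conditions = [f"{s}-{e}-{p}" for s, e, p in product(speakers, emotions, pitches)]
print(len(conditions))  # 2 * 4 * 3 = 24

# e.g., re-synthesize the same speech units in every style:
# for i, condition in enumerate(conditions):
#     model.decode(speech_unit, condition=condition, output_wav_file=f"out_{i}.wav")
```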

## Citation

If you find our model/code/paper helpful, please consider citing our papers and giving us a star!

```bibtex
@article{chen2024emova,