KaiChen1998 committed on
Commit
e84cb10
·
verified ·
1 Parent(s): 33e645f

Update README.md

Files changed (1)
  1. README.md +45 -18
README.md CHANGED
@@ -11,44 +11,70 @@ base_model:
  - Emova-ollm/emova_speech_tokenizer
  ---

  <div align="center">

  <img src="./examples/images/emova_icon2.png" width="300em"></img>

- # EMOVA Speech Tokenizer HF
-
  🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 💻 [EMOVA-Main-Repo](https://github.com/emova-ollm/EMOVA) | 📄 [EMOVA-Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/)

  </div>

  ## Model Summary

- This repo contains the discrete speech tokenizer used to train the [EMOVA](https://emova-ollm.github.io/) series of models. With a semantic-acoustic disentangled design, it not only facilitates seamless omni-modal alignment among vision, language and audio modalities, but also empowers flexible speech style controls including emotions and pitches. It contains a **speech-to-unit (S2U)** tokenizer to convert speech signals to discrete speech units, and a **unit-to-speech (U2S)** de-tokenizer to reconstruct speech signals from the speech units.

- This repo wraps the original [EMOVA speech tokenizer](https://huggingface.co/Emova-ollm/emova_speech_tokenizer) with HuggingFace [PreTrainedModel](https://huggingface.co/docs/transformers/v4.47.1/main_classes/model#transformers.PreTrainedModel) for simpler usage.

- ## Install

- ```bash
- git clone https://huggingface.co/Emova-ollm/emova_speech_tokenizer
- cd emova_speech_tokenizer
-
- # for GPU
- pip install -e .
-
- # for NPU
- # check https://github.com/Ascend/pytorch?tab=readme-ov-file#installation for detailed installation of torch npu
- pip install -e .[npu]
- ```

  ## Usage

- ```diff
- import torch
- +import torch_npu # add it if you want to use NPU
- +from torch_npu.contrib import transfer_to_npu # add it if you want to use NPU

  from transformers import AutoModel

  # load pretrained model
  model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
@@ -70,6 +96,7 @@ model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)
  ```

  ## Citation

  ```bibtex
  @article{chen2024emova,
 
@@ -11,44 +11,70 @@ base_model:
  - Emova-ollm/emova_speech_tokenizer
  ---

+ # EMOVA Speech Tokenizer HF
+
  <div align="center">

  <img src="./examples/images/emova_icon2.png" width="300em"></img>

  🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 💻 [EMOVA-Main-Repo](https://github.com/emova-ollm/EMOVA) | 📄 [EMOVA-Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/)

  </div>

  ## Model Summary

+ This repo contains the official speech tokenizer used to train the [EMOVA](https://emova-ollm.github.io/) series of models. With a semantic-acoustic disentangled design, it not only facilitates seamless omni-modal alignment across the vision, language, and speech modalities, but also empowers flexible speech style controls, including speakers, emotions, and pitches. We summarize its key advantages as follows:

+ - **Discrete speech tokenizer**: it pairs a SPIRAL-based **speech-to-unit (S2U)** tokenizer, which captures both the phonetic and tonal information of input speech and discretizes it into speech units with a **finite scalar quantizer (FSQ)**, with a VITS-based **unit-to-speech (U2S)** de-tokenizer that reconstructs speech signals from the speech units (see the FSQ sketch after this list).

+ - **Semantic-acoustic disentanglement**: to seamlessly align speech units with the highly semantic embedding space of LLMs, we decouple the **semantic contents** and **acoustic styles** of input speech, and only the former are used to generate the speech tokens.

+ - **Bilingual tokenization**: the EMOVA speech tokenizer supports both **Chinese** and **English** speech tokenization with the same speech codebook.

+ - **Flexible speech style control**: thanks to the semantic-acoustic disentanglement, the EMOVA speech tokenizer supports **24 speech style controls** (i.e., 2 speakers, 3 pitches, and 4 emotions). Check [Usage](#usage) for more details.
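
To make the FSQ step above concrete, here is a minimal sketch of finite scalar quantization: each latent dimension is bounded and then snapped to a small, fixed set of values, so the codebook is implicit rather than learned. The tensor sizes and the five-level choice are illustrative assumptions, not EMOVA's actual configuration, and a real training setup would add a straight-through gradient estimator.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Snap each latent dimension to one of `levels` uniform values in [-1, 1]."""
    z = torch.tanh(z)                    # bound every dimension to (-1, 1)
    half = (levels - 1) / 2
    return torch.round(z * half) / half  # 5 levels: -1.0, -0.5, 0.0, 0.5, 1.0

z = torch.randn(2, 4)    # (batch, latent dim) -- toy sizes
codes = fsq_quantize(z)  # each entry now takes one of 5 discrete values
```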
 

+ <div align="center">
+ <img src="./examples/images/model_architecture.PNG" width=100%></img>
+ </div>
+
+ ## Installation
+
+ Clone this repo and create the EMOVA virtual environment with conda. Our code has been validated on **NVIDIA A800/H20 GPU and Ascend 910B3 NPU** servers; other devices may work as well.
+
+ 1. Initialize the conda environment:
+
+ ```bash
+ git clone https://github.com/emova-ollm/EMOVA_speech_tokenizer.git
+ conda create -n emova python=3.9 -y
+ conda activate emova
+ ```
+
+ 2. Install the required packages (note that the instructions differ between GPUs and NPUs):
+
+ ```bash
+ # upgrade pip and setuptools if necessary
+ pip install -U pip setuptools
+
+ cd emova_speech_tokenizer
+ pip install -e .      # for NVIDIA GPUs (e.g., A800 and H20)
+ pip install -e .[npu] # OR for Ascend NPUs (e.g., 910B3)
+ ```
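
As a quick smoke test (our suggestion, not part of the official instructions), you can confirm that PyTorch sees your accelerator before moving on:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # expect True on a working NVIDIA GPU setup

# On Ascend NPUs, the analogous check (assuming torch_npu is installed):
# import torch_npu
# print(torch.npu.is_available())
```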

  ## Usage

+ > [!NOTE]
+ > Remember to finish the [Installation](#installation) steps first!
+
+ The EMOVA speech tokenizer can be easily deployed via the 🤗 HuggingFace Transformers API!

+ ```python
+ import random
  from transformers import AutoModel
+ import torch
+
+ # uncomment the two imports below to run on Ascend NPUs
+ # import torch_npu
+ # from torch_npu.contrib import transfer_to_npu

  # load pretrained model
  model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
 
@@ -70,6 +96,7 @@ model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)
  ```
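
The decoding entry point shown in the hunk header above is `model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)`. The sketch below continues from the loaded `model` and shows one plausible end-to-end flow; the `encode` call, the style label values, and the exact `condition` string format are illustrative assumptions rather than the documented API, so check the full README for the precise interface.

```python
# Hedged sketch only: `encode` and the condition format below are assumptions.
wav_file = "examples/speech/input.wav"  # hypothetical input path
speech_unit = model.encode(wav_file)    # S2U: speech -> discrete units (assumed API)

# 2 speakers x 3 pitches x 4 emotions = the 24 advertised styles (label names assumed)
gender = random.choice(["female", "male"])
pitch = random.choice(["normal", "high", "low"])
emotion = random.choice(["neutral", "happy", "sad", "angry"])
condition = f"gender-{gender}_emotion-{emotion}_pitch-{pitch}"

output_wav_file = "output.wav"
model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)  # U2S: units -> waveform
```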

  ## Citation
+ If you find our model/code/paper helpful, please consider citing our papers and giving us a star!

  ```bibtex
  @article{chen2024emova,