Update README.md
README.md CHANGED
@@ -11,44 +11,70 @@ base_model:
- Emova-ollm/emova_speech_tokenizer
---

<div align="center">

<img src="./examples/images/emova_icon2.png" width="300em"></img>

-# EMOVA Speech Tokenizer HF

🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 💻 [EMOVA-Main-Repo](https://github.com/emova-ollm/EMOVA) | 📄 [EMOVA-Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/)

</div>

## Model Summary

-This repo contains the
-git clone https://huggingface.co/Emova-ollm/emova_speech_tokenizer
-cd emova_speech_tokenizer
-# for
-pip install -e .

## Usage

from transformers import AutoModel

# load pretrained model
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
@@ -70,6 +96,7 @@ model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)

- Emova-ollm/emova_speech_tokenizer
---

# EMOVA Speech Tokenizer HF

<div align="center">

<img src="./examples/images/emova_icon2.png" width="300em"></img>

🤗 [HuggingFace](https://huggingface.co/Emova-ollm/emova_speech_tokenizer_hf) | 💻 [EMOVA-Main-Repo](https://github.com/emova-ollm/EMOVA) | 📄 [EMOVA-Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/)

</div>

## Model Summary

This repo contains the official speech tokenizer used to train the [EMOVA](https://emova-ollm.github.io/) series of models. With a semantic-acoustic disentangled design, it not only facilitates seamless omni-modal alignment among the vision, language, and speech modalities, but also enables flexible speech style controls, including speakers, emotions, and pitches. We summarize its key advantages as follows:

- **Discrete speech tokenizer**: it contains a SPIRAL-based **speech-to-unit (S2U)** tokenizer to capture both the phonetic and tonal information of input speech, which is then discretized by a **finite scalar quantizer (FSQ)** into discrete speech units, and a VITS-based **unit-to-speech (U2S)** de-tokenizer to reconstruct speech signals from the speech units (see the toy FSQ sketch after the architecture figure below).

- **Semantic-acoustic disentanglement**: to seamlessly align speech units with the highly semantic embedding space of LLMs, we decouple the **semantic contents** and **acoustic styles** of input speech, and only the former are used to generate the speech tokens.

- **Bilingual tokenization**: EMOVA speech tokenizer supports both **Chinese** and **English** speech tokenization with the same speech codebook.

- **Flexible speech style control**: thanks to the semantic-acoustic disentanglement, EMOVA speech tokenizer supports **24 speech style controls** (i.e., 2 speakers, 3 pitches, and 4 emotions). Check the [Usage](#usage) section for more details.

<div align="center">

<img src="./examples/images/model_architecture.PNG" width=100%></img>
</div>
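
For intuition about the FSQ step named above, here is a minimal, self-contained sketch of finite scalar quantization: each latent dimension is bounded and rounded to a small fixed set of levels, and the per-dimension levels jointly index a discrete unit. The dimensionality and level counts below are illustrative placeholders, not the EMOVA tokenizer's actual configuration.

```python
# Toy finite scalar quantization (FSQ) sketch (illustrative only).
# The EMOVA tokenizer's real latent size, levels, and codebook live in the model repo.
import torch

def fsq(z, levels=(7, 7, 7, 5, 5)):
    L = torch.tensor(levels, dtype=z.dtype)
    half = (L - 1) / 2                            # odd level counts keep this an integer
    z_bounded = torch.tanh(z) * half              # bound each dimension to [-half, half]
    z_q = torch.round(z_bounded)                  # snap to the nearest allowed level
    z_q = z_bounded + (z_q - z_bounded).detach()  # straight-through estimator for training
    idx = (z_q + half).to(torch.int64)            # per-dimension level indices in [0, L-1]
    radix = torch.cumprod(torch.tensor((1,) + levels[:-1]), dim=0)
    unit_id = (idx * radix).sum(dim=-1)           # combine indices into one discrete unit id
    return z_q, unit_id

z = torch.randn(2, 5)   # two toy latent vectors with 5 dimensions
z_q, units = fsq(z)
print(units)            # discrete unit ids in [0, 7*7*7*5*5)
```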

## Installation

Clone this repo and create the EMOVA virtual environment with conda. Our code has been validated on **NVIDIA A800/H20 GPU & Ascend 910B3 NPU** servers; other devices may work as well.

1. Initialize the conda environment:

```bash
git clone https://github.com/emova-ollm/EMOVA_speech_tokenizer.git
conda create -n emova python=3.9 -y
conda activate emova
```

2. Install the required packages (note that the instructions differ between GPUs and NPUs):

```bash
# upgrade pip and setuptools if necessary
pip install -U pip setuptools

cd emova_speech_tokenizer
pip install -e .       # for NVIDIA GPUs (e.g., A800 and H20)
pip install -e .[npu]  # OR for Ascend NPUs (e.g., 910B3)
```
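
A quick, optional check to confirm the environment sees your accelerator after installation; the NPU lines assume the `torch_npu` plugin referenced in the Usage section below.

```python
# Optional post-install sanity check.
import torch
print(torch.__version__)
print(torch.cuda.is_available())     # expect True on NVIDIA GPUs (e.g., A800/H20)

# On Ascend NPUs, import the torch_npu plugin first:
# import torch_npu
# print(torch.npu.is_available())    # expect True on Ascend NPUs (e.g., 910B3)
```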

## Usage

> [!NOTE]
> Remember to finish the [Installation](#installation) first!

EMOVA speech tokenizer can be easily deployed using the 🤗 HuggingFace transformers API!

```python
import random
from transformers import AutoModel
import torch

## add if you want to use Ascend NPUs
# import torch_npu
# from torch_npu.contrib import transfer_to_npu

# load pretrained model
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()

# ... (lines 81-95 of the usage example are unchanged and omitted in this diff view)
```
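
The lines elided from the example above are unchanged in this commit; the second hunk header shows their surrounding context is a call to `model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)`, i.e., speech is re-synthesized from the units under a chosen style condition. As a rough illustration of the 24 style controls listed in the Model Summary, the snippet below simply enumerates the speaker/emotion/pitch combinations; the label strings are placeholders, not necessarily the exact condition values the model expects.

```python
# Placeholder enumeration of the 24 (speaker, emotion, pitch) style combinations.
# The label strings are illustrative; consult the full usage example for the exact
# condition values accepted by model.decode().
from itertools import product

speakers = ["female", "male"]                    # 2 speakers (placeholder names)
emotions = ["neutral", "happy", "sad", "angry"]  # 4 emotions (placeholder names)
pitches  = ["normal", "low", "high"]             # 3 pitches (placeholder names)

conditions = [f"{s}-{e}-{p}" for s, e, p in product(speakers, emotions, pitches)]
print(len(conditions))  # 2 * 4 * 3 = 24

# e.g., re-synthesize the same speech units in every style:
# for i, condition in enumerate(conditions):
#     model.decode(speech_unit, condition=condition, output_wav_file=f"out_{i}.wav")
```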

## Citation

If you find our model/code/paper helpful, please consider citing our papers and giving us a star!

```bibtex
@article{chen2024emova,