---
license: apache-2.0
datasets:
- Emova-ollm/emova-alignment-7m
- Emova-ollm/emova-sft-4m
- Emova-ollm/emova-sft-speech-231k
language:
- en
- zh
base_model:
- Emova-ollm/qwen2vit600m
- Emova-ollm/Qwen2.5-72B-Instruct_add_speech_token_4096_nostrip
new_version: Emova-ollm/emova-qwen-2-5-72b-hf
library_name: transformers
tags:
- Omni-modal-LLM
- Multi-modal-LLM
- Emotional-spoken-dialogue
model-index:
- name: emova-qwen-2-5-72b
  results:
  - task:
      type: multimodal
    dataset:
      name: AI2D
      type: ai2d
    metrics:
    - type: accuracy
      value: 85.8
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: ChartQA
      type: chartqa
    metrics:
    - type: accuracy
      value: 88.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: DocVQA
      type: docvqa
    metrics:
    - type: accuracy
      value: 95.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: InfoVQA
      type: infovqa
    metrics:
    - type: accuracy
      value: 83.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MathVerse
      type: mathverse
    metrics:
    - type: accuracy
      value: 50.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MathVista
      type: mathvista
    metrics:
    - type: accuracy
      value: 69.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MMBench
      type: mmbench
    metrics:
    - type: accuracy
      value: 86.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MME
      type: mme
    metrics:
    - type: score
      value: 2402
      name: score
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MMVet
      type: mmvet
    metrics:
    - type: accuracy
      value: 64.8
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: OCRBench
      type: ocrbench
    metrics:
    - type: accuracy
      value: 843
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: RealWorldQA
      type: realworldqa
    metrics:
    - type: accuracy
      value: 71.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Seed-Bench-Image
      type: seed-bench-image
    metrics:
    - type: accuracy
      value: 76.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Science-QA
      type: science-qa
    metrics:
    - type: accuracy
      value: 98.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: TextVQA
      type: textvqa
    metrics:
    - type: accuracy
      value: 81.4
      name: accuracy
      verified: true
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.9
---

# EMOVA-Qwen-2.5-72B

<div align="center">

<img src="https://emova-ollm.github.io/static/images/icons/emova_icon2.png" width="300em"></img>

πŸ€— [EMOVA-Models](https://huggingface.co/collections/Emova-ollm/emova-models-67779d377bb8261e6057a320) | πŸ€— [EMOVA-Datasets](https://huggingface.co/collections/Emova-ollm/emova-datasets-67779be7d02447a2d0891bf6) | πŸ€— [EMOVA-Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) <br/>
πŸ“„ [Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/) | πŸ’» [Github](https://github.com/emova-ollm/EMOVA) | πŸ’» [EMOVA-Speech-Tokenizer-Github](https://github.com/emova-ollm/EMOVA_speech_tokenizer)

</div>

## Model Summary

**EMOVA** (**EM**otionally **O**mni-present **V**oice **A**ssistant) is a novel end-to-end omni-modal LLM that can see, hear, and speak without relying on external models. Given omni-modal (i.e., textual, visual, and speech) inputs, EMOVA generates both textual and speech responses with vivid emotional control by utilizing a speech decoder together with a style encoder. EMOVA possesses general omni-modal understanding and generation capabilities, excelling at advanced vision-language understanding, emotional spoken dialogue, and spoken dialogue with structured data understanding. We summarize its key advantages as:

- **State-of-the-art omni-modality performance**: EMOVA achieves state-of-the-art results on both **vision-language** and **speech** benchmarks simultaneously. Our best-performing model, **EMOVA-72B**, even surpasses commercial models including GPT-4o and Gemini Pro 1.5.
- **Emotional spoken dialogue**: A **semantic-acoustic disentangled** speech tokenizer and a lightweight **style control** module are adopted for seamless omni-modal alignment and diverse speech style controllability. EMOVA supports **bilingual (Chinese and English)** spoken dialogue with **24 speech style** controls (i.e., 2 speakers, 3 pitches, and 4 emotions; see the sketch after this list).
- **Diverse configurations**: We open-source 3 configurations, **EMOVA-3B/7B/72B**, to support omni-modal usage under different computational budgets. Check our [Model Zoo](https://huggingface.co/collections/Emova-ollm/emova-models-67779d377bb8261e6057a320) and find the model that best fits your computational budget!
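
The 24 speech styles above are simply the Cartesian product of the speaker, pitch, and emotion controls. The sketch below only enumerates that combination space; the concrete label strings are illustrative placeholders, not the identifiers used by the EMOVA codebase.

```python
from itertools import product

# Illustrative only: the 24 = 2 x 3 x 4 speech styles described above.
# Label strings are placeholders; the EMOVA codebase may use different names.
speakers = ["female", "male"]
pitches = ["normal", "low", "high"]
emotions = ["neutral", "happy", "sad", "angry"]

styles = [
    {"speaker": s, "pitch": p, "emotion": e}
    for s, p, e in product(speakers, pitches, emotions)
]
assert len(styles) == 24  # 2 speakers x 3 pitches x 4 emotions
print(styles[0])  # e.g. {'speaker': 'female', 'pitch': 'normal', 'emotion': 'neutral'}
```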

<div align="center">
  <img src="https://emova-ollm.github.io/static/images/model_architecture.png" width=100%></img>
</div>


## Performance


| Benchmarks         | EMOVA-3B | EMOVA-7B | EMOVA-72B | GPT-4o | VITA 8x7B | VITA 1.5 | Baichuan-Omni |
|:------------------:|:-------: |:--------:|:---------:|:------:|:---------:|:--------:|:-------------:|
| **MME**            | 2175     | 2317     | 2402      | 2310   | 2097      | 2311     | 2187           |
| **MMBench**        | 79.2     | 83.0     | 86.4      | 83.4   | 71.8      | 76.6     | 76.2           |
| **SEED-Image**     | 74.9     | 75.5     | 76.6      | 77.1   | 72.6      | 74.2     | 74.1           |
| **MM-Vet**         | 57.3     | 59.4     | 64.8      | -      | 41.6      | 51.1     | 65.4           |
| **RealWorldQA**    | 62.6     | 67.5     | 71.0      | 75.4   | 59.0      | 66.8     | 62.6           |
| **TextVQA**        | 77.2     | 78.0     | 81.4      | -      | 71.8      | 74.9     | 74.3           |
| **ChartQA**        | 81.5     | 84.9     | 88.7      | 85.7   | 76.6      | 79.6     | 79.6           |
| **DocVQA**         | 93.5     | 94.2     | 95.9      | 92.8   | -         | -        | -              |
| **InfoVQA**        | 71.2     | 75.1     | 83.2      | -      | -         | -        | -              |
| **OCRBench**       | 803      | 814      | 843       | 736    | 678       | 752      | 700            |
| **ScienceQA-Img**  | 92.7     | 96.4     | 98.2      | -      | -         | -        | -              |
| **AI2D**           | 78.6     | 81.7     | 85.8      | 84.6   | 73.1      | 79.3     | -              |
| **MathVista**      | 62.6     | 65.5     | 69.9      | 63.8   | 44.9      | 66.2     | 51.9           |
| **Mathverse**      | 31.4     | 40.9     | 50.0      | -      | -         | -        | -              |
| **LibriSpeech (WER↓)** | 5.4  | 4.1      | 2.9       | -      | 3.4       | 8.1      | -              |


## Usage

This repo contains the **EMOVA-Qwen-2.5-72B** checkpoint organized in the **original format** of our [EMOVA codebase](https://github.com/emova-ollm/EMOVA), and thus it should be used together with that codebase. Its paired config file is provided [here](https://github.com/emova-ollm/EMOVA/blob/main/configs/example/emova/qwen2_5_qwen2vit_nativeAnyres_72b/2.finetune.py). Check [here](https://github.com/emova-ollm/EMOVA#gradio-web-demo) to launch a web demo using this checkpoint.
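
If you prefer the standard `transformers` interface instead of the EMOVA codebase, the newer Hugging Face-format release [Emova-ollm/emova-qwen-2-5-72b-hf](https://huggingface.co/Emova-ollm/emova-qwen-2-5-72b-hf) can be loaded with `trust_remote_code`. The sketch below is a minimal example that assumes that release follows the usual `AutoModel`/`AutoProcessor` pattern; the exact processor inputs (image/speech handling, chat template) are defined by that repository's remote code, so consult its model card for the authoritative interface.

```python
# Minimal sketch (assumptions): loads the HF-format release, not this
# original-format checkpoint. The processor/generation interface is defined
# by the remote code shipped with that repo and may differ from this sketch.
import torch
from transformers import AutoModel, AutoProcessor

model_id = "Emova-ollm/emova-qwen-2-5-72b-hf"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 72B weights; multiple GPUs are required
    device_map="auto",
    trust_remote_code=True,
)

# A plain text prompt; image and speech inputs go through the processor as
# well, following the interface documented in the HF-format model card.
inputs = processor(text="Describe what EMOVA can do.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```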


## Citation

```bibtex
@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}
```