	---
	license: apache-2.0
	datasets:
	- Emova-ollm/emova-alignment-7m
	- Emova-ollm/emova-sft-4m
	- Emova-ollm/emova-sft-speech-231k
	language:
	- en
	- zh
	base_model:
	- Emova-ollm/qwen2vit600m
	- Emova-ollm/Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip
	new_version: Emova-ollm/emova-qwen-2-5-7b-hf
	library_name: transformers
	tags:
	- Omni-modal-LLM
	- Multi-modal-LLM
	- Emotional-spoken-dialogue
	model-index:
	- name: emova-qwen-2-5-7b
	  results:
	  - task:
	      type: multimodal
	    dataset:
	      name: AI2D
	      type: ai2d
	    metrics:
	    - type: accuracy
	      value: 81.7
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: ChartQA
	      type: chartqa
	    metrics:
	    - type: accuracy
	      value: 84.9
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: DocVQA
	      type: docvqa
	    metrics:
	    - type: accuracy
	      value: 94.2
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: InfoVQA
	      type: infovqa
	    metrics:
	    - type: accuracy
	      value: 75.1
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: MathVerse
	      type: mathverse
	    metrics:
	    - type: accuracy
	      value: 40.9
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: MathVista
	      type: mathvista
	    metrics:
	    - type: accuracy
	      value: 65.5
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: MMBench
	      type: mmbench
	    metrics:
	    - type: accuracy
	      value: 83
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: MME
	      type: mme
	    metrics:
	    - type: score
	      value: 2317
	      name: score
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: MMVet
	      type: mmvet
	    metrics:
	    - type: accuracy
	      value: 59.4
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: OCRBench
	      type: ocrbench
	    metrics:
	    - type: accuracy
	      value: 814
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: RealWorldQA
	      type: realworldqa
	    metrics:
	    - type: accuracy
	      value: 67.5
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: Seed-Bench-Image
	      type: seed-bench-image
	    metrics:
	    - type: accuracy
	      value: 75.5
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: Science-QA
	      type: science-qa
	    metrics:
	    - type: accuracy
	      value: 96.4
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: TextVQA
	      type: textvqa
	    metrics:
	    - type: accuracy
	      value: 78
	      name: accuracy
	      verified: true
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: LibriSpeech (clean)
	      type: librispeech_asr
	      config: clean
	      split: test
	      args:
	        language: en
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 4.1
	---

	# EMOVA-Qwen-2.5-7B

	<div align="center">

	<img src="https://emova-ollm.github.io/static/images/icons/emova_icon2.png" width="300em"></img>

	🤗 [EMOVA-Models](https://huggingface.co/collections/Emova-ollm/emova-models-67779d377bb8261e6057a320) | 🤗 [EMOVA-Datasets](https://huggingface.co/collections/Emova-ollm/emova-datasets-67779be7d02447a2d0891bf6) | 🤗 [EMOVA-Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) <br/>
	📄 [Paper](https://arxiv.org/abs/2409.18042) | 🌐 [Project-Page](https://emova-ollm.github.io/) | 💻 [Github](https://github.com/emova-ollm/EMOVA) | 💻 [EMOVA-Speech-Tokenizer-Github](https://github.com/emova-ollm/EMOVA_speech_tokenizer)

	</div>

	## Model Summary

	EMOVA (EMotionally Omni-present Voice Assistant) is a novel end-to-end omni-modal LLM that can see, hear and speak without relying on external models. Given omni-modal (i.e., textual, visual and speech) inputs, EMOVA generates both textual and speech responses with vivid emotional control by combining its speech decoder with a style encoder. It has general omni-modal understanding and generation capabilities and is particularly strong in advanced vision-language understanding, emotional spoken dialogue, and spoken dialogue with structural data understanding. Its key advantages are:

	- State-of-the-art omni-modality performance: EMOVA achieves results comparable to the state of the art on vision-language and speech benchmarks simultaneously. Our best-performing model, EMOVA-72B, even surpasses commercial models including GPT-4o and Gemini Pro 1.5.
	- Emotional spoken dialogue: A semantic-acoustic disentangled speech tokenizer and a lightweight style control module are adopted for seamless omni-modal alignment and diverse speech style controllability. EMOVA supports bilingual (Chinese and English) spoken dialogue with 24 speech style controls (i.e., 2 speakers, 3 pitches and 4 emotions).
	- Diverse configurations: We open-source 3 configurations, EMOVA-3B/7B/72B, to support omni-modal usage under different computational budgets. Check our [Model Zoo](https://huggingface.co/collections/Emova-ollm/emova-models-67779d377bb8261e6057a320) to find the best fit for your hardware!
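
The figure of 24 speech styles quoted above is simply the product of the three control axes (2 speakers × 3 pitches × 4 emotions). A minimal sketch of that enumeration follows; the label strings are placeholders chosen for illustration and are not taken from the EMOVA codebase, which may name or encode these controls differently.

```python
from itertools import product

# Placeholder labels (assumptions for illustration only); the README states
# only the counts: 2 speakers, 3 pitches and 4 emotions.
SPEAKERS = ("speaker_a", "speaker_b")
PITCHES = ("low", "normal", "high")
EMOTIONS = ("emotion_1", "emotion_2", "emotion_3", "emotion_4")


def speech_styles():
    """Return every (speaker, pitch, emotion) combination as a dict."""
    return [
        {"speaker": s, "pitch": p, "emotion": e}
        for s, p, e in product(SPEAKERS, PITCHES, EMOTIONS)
    ]


styles = speech_styles()
print(len(styles))  # 2 * 3 * 4 = 24
```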

	<div align="center">
	<img src="https://emova-ollm.github.io/static/images/model_architecture.png" width=100%></img>
	</div>


	## Performance


	| Benchmarks | EMOVA-3B | EMOVA-7B | EMOVA-72B | GPT-4o | VITA 8x7B | VITA 1.5 | Baichuan-Omni |
	|:------------------:|:--------:|:--------:|:---------:|:------:|:---------:|:--------:|:-------------:|
	| MME | 2175 | 2317 | 2402 | 2310 | 2097 | 2311 | 2187 |
	| MMBench | 79.2 | 83.0 | 86.4 | 83.4 | 71.8 | 76.6 | 76.2 |
	| SEED-Image | 74.9 | 75.5 | 76.6 | 77.1 | 72.6 | 74.2 | 74.1 |
	| MM-Vet | 57.3 | 59.4 | 64.8 | - | 41.6 | 51.1 | 65.4 |
	| RealWorldQA | 62.6 | 67.5 | 71.0 | 75.4 | 59.0 | 66.8 | 62.6 |
	| TextVQA | 77.2 | 78.0 | 81.4 | - | 71.8 | 74.9 | 74.3 |
	| ChartQA | 81.5 | 84.9 | 88.7 | 85.7 | 76.6 | 79.6 | 79.6 |
	| DocVQA | 93.5 | 94.2 | 95.9 | 92.8 | - | - | - |
	| InfoVQA | 71.2 | 75.1 | 83.2 | - | - | - | - |
	| OCRBench | 803 | 814 | 843 | 736 | 678 | 752 | 700 |
	| ScienceQA-Img | 92.7 | 96.4 | 98.2 | - | - | - | - |
	| AI2D | 78.6 | 81.7 | 85.8 | 84.6 | 73.1 | 79.3 | - |
	| MathVista | 62.6 | 65.5 | 69.9 | 63.8 | 44.9 | 66.2 | 51.9 |
	| MathVerse | 31.4 | 40.9 | 50.0 | - | - | - | - |
	| LibriSpeech (WER↓) | 5.4 | 4.1 | 2.9 | - | 3.4 | 8.1 | - |
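
The LibriSpeech row reports word error rate (WER), where lower is better. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name is hypothetical, and the text normalization used in the EMOVA evaluation is not described in this README, so this sketch only splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}")  # 33.3
```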


	## Usage

	This repo contains the EMOVA-Qwen-2.5-7B checkpoint in the original format of our [EMOVA codebase](https://github.com/emova-ollm/EMOVA), so it should be used together with the EMOVA codebase. Its paired config file is provided [here](https://github.com/emova-ollm/EMOVA/blob/main/configs/example/emova/qwen2_5_qwen2vit_nativeAnyres_7b/2.finetune.py). See [here](https://github.com/emova-ollm/EMOVA#gradio-web-demo) to launch a web demo using this checkpoint.
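
Before using the EMOVA codebase, the checkpoint files need to be available locally. A minimal sketch of fetching them with `huggingface_hub.snapshot_download` is shown here. The repo id is an assumption inferred from this model card's name and organization, and the local directory is a hypothetical path; point the EMOVA config at whatever location its own instructions expect.

```python
from pathlib import Path

# Assumed repo id (inferred from the model card name); verify it on the Hub.
REPO_ID = "Emova-ollm/emova-qwen-2-5-7b"


def download_checkpoint(local_dir: str = "checkpoints/emova-qwen-2-5-7b") -> Path:
    """Download all files of the checkpoint repo and return the local folder."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


# Usage (requires network access and `pip install huggingface_hub`):
#   ckpt_dir = download_checkpoint()
#   print(ckpt_dir)
```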


	## Citation

	```bibtex
	@article{chen2024emova,
	title={Emova: Empowering language models to see, hear and speak with vivid emotions},
	author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
	journal={arXiv preprint arXiv:2409.18042},
	year={2024}
	}
	```