arxiv:2409.18042

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Published on Sep 26, 2024

· Submitted by

akhaliq on Sep 27, 2024

#2 Paper of the day

Upvote

Authors:

Kai Chen ,

Yunhao Gou ,

Runhui Huang ,

Zhili Liu ,

Daxin Tan ,

Chunwei Wang ,

Yihan Zeng ,

Kuo Yang ,

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.

View arXiv page View PDF Project page GitHub repository Add to collection

Community

akhaliq

Paper submitter Sep 27, 2024

https://emova-ollm.github.io/

MichaelKarpe

Oct 15, 2024

I have a connection not secure warning when trying to access the URL from several devices, would you be able to fix this? Do you have any update on when a model demo would be available on HuggingFace?

Lyte

Sep 27, 2024

will the model weights be released?

KaiChen1998

Paper author Sep 27, 2024

Will soon release the checkpoint after we get back from ECCV (the main authors are all catching flights for Milano today 😂). We are busy preparing an HF demo for temporal usage. Stay tuned!

librarian-bot

Sep 28, 2024

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Tongbo

Dec 15, 2024

When it will be open-sourced?

KaiChen1998

Paper author 2 days ago

Thanks for your great patience! EMOVA has been fully open-sourced, including the code, datasets, checkpoints. You may find the following links useful.

If EMOVA is useful for your work, feel free to give us a star or cite our paper as a reference :)!

KaiChen1998

Paper author 2 days ago

•

edited 2 days ago

📢 Our EMOVA paper has been accepted by CVPR 2025, and we are glad to release all the EMOVA resources, including the code (training & inference), datasets (training & evaluation), and checkpoints (EMOVA-3B/7B/72B)! Thanks all for your great patience! You are welcome to try and star if useful for your work and research!

Project page: https://emova-ollm.github.io/
Github: https://github.com/emova-ollm/EMOVA
Demo: https://huggingface.co/spaces/Emova-ollm/EMOVA-demo

@akhaliq @MichaelKarpe @Lyte @Tongbo

KaiChen1998

Paper author 2 days ago

•

edited 2 days ago

✨ EMOVA Highlights

✅ State-of-the-art omni-modality: EMOVA achieves state-of-the-art comparable results on both vision-language and speech benchmarks simultaneously.
✅ Fully open-source: we release all materials, including pre-trained EMOVA checkpoints, training/inference/ evaluation code and training/evaluation datasets!
✅ Device adaptation: our codebase supports training/inference on both NVIDIA GPUs (e.g., A800 & H20) and Ascend NPUs (e.g., 910B3)!
✅ Modular design: we integrate multiple implementations of vision encoder, vision projector, and language model.