---
base_model:
- microsoft/Phi-3.5-vision-instruct
datasets:
- TIGER-Lab/MMEB-train
language:
- en
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
- Retrieval
- Multimodal
- Embedding
---
# Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
[Tiancheng Gu*](https://github.com/GaryGuTC), [Kaicheng Yang*](https://kaicheng-yang0828.github.io), Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, [Weidong Cai](https://weidong-tom-cai.github.io/), [Jiankang Deng](https://jiankangdeng.github.io)
[🏡 Project Page](https://garygutc.github.io/UniME) | [📄 Paper](https://arxiv.org/abs/2504.17432) | [💻 Github](https://github.com/deepglint/UniME)
<p align="center">
<img src="figures/fig1.png">
</p>
## 💡 Highlights
To enhance the MLLM's embedding capability, we propose textual discriminative knowledge distillation. The training process decouples the LLM component from the MLLM and processes text with the prompt "Summarize the above sentences in one word.", then aligns the student (MLLM) and teacher (NV-Embed V2) embeddings via KL divergence over batch-wise similarity distributions. **Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen**.
<p align="center">
<img src="figures/fig2.png">
</p>
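The distillation objective can be summarized as a KL-divergence loss between the student's and teacher's batch-wise similarity distributions. The sketch below is a minimal illustration only; the function name, temperature value, and KL direction are assumptions rather than the released training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=0.02):
    """Illustrative sketch: KL divergence between batch-wise similarity
    distributions of student (MLLM) and teacher (NV-Embed V2) text embeddings.
    The temperature value is an assumed hyperparameter, not the released config."""
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)
    # Batch-wise similarity matrices, shape (B, B)
    s_sim = student_emb @ student_emb.T / temperature
    t_sim = teacher_emb @ teacher_emb.T / temperature
    # Match each row's student similarity distribution to the teacher's
    return F.kl_div(F.log_softmax(s_sim, dim=-1),
                    F.softmax(t_sim, dim=-1),
                    reduction="batchmean")
```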
After that, we propose hard negative enhanced instruction tuning enhances multimodal systems by improving visual sensitivity, strengthening cross-modal alignment, and boosting instruction-following capabilities. At its core are two key innovations: a false negative filtering mechanism using a similarity threshold to eliminate misleading samples, and an automatic hard negative sampling strategy that selects top-k similar but non-matching examples to increase training difficulty.
<p align="center">
<img src="figures/fig3.png">
</p>
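The mining step can be sketched as below. This is one plausible reading of the description above, not the released implementation; the threshold `beta`, the value of `k`, and the function name are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(query_emb, cand_emb, pos_idx, k=8, beta=0.1):
    """Illustrative sketch of hard negative mining: drop candidates whose
    similarity exceeds the positive's similarity plus a threshold (likely
    false negatives), then keep the top-k most similar remaining candidates."""
    query_emb = F.normalize(query_emb, dim=-1)
    cand_emb = F.normalize(cand_emb, dim=-1)
    sim = query_emb @ cand_emb.T                             # (B, N) cosine similarities
    pos_sim = sim.gather(1, pos_idx.unsqueeze(1))            # (B, 1) similarity to each positive
    masked = sim.clone()
    masked.scatter_(1, pos_idx.unsqueeze(1), float("-inf"))  # never sample the positive itself
    masked[masked > pos_sim + beta] = float("-inf")          # filter likely false negatives
    return masked.topk(k, dim=1).indices                     # per-query indices of the k hardest negatives

# Toy usage: 4 queries, 16 candidates; candidate i is the positive for query i.
# print(sample_hard_negatives(torch.randn(4, 128), torch.randn(16, 128), torch.arange(4)))
```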
## 🧭 Quick Start
```bash
git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt
```
```python
import torch
from PIL import Image
from torch.nn import functional as F
from transformers import AutoProcessor, AutoModelForCausalLM
base_model_path = "DeepGlint-AI/UniME-Phi3.5-V-4.2B"

# Embedding prompts: the model summarizes the image / sentence into a single word,
# and the last token's final hidden state is used as the embedding.
img_prompt = '<|user|>\n<|image_1|>\nSummary above image in one word: <|end|>\n<|assistant|>\n'
text_prompt = '<|user|>\n<sent>\nSummary above sentence in one word: <|end|>\n<|assistant|>\n'

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_texts = text_prompt.replace('<sent>', text)
input_image_prompt = img_prompt
input_image = [Image.open(image_path)]

transform = AutoProcessor.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model_path,
                                             device_map="cuda",
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16,
                                             _attn_implementation='flash_attention_2')
transform.tokenizer.padding_side = "left"
transform.tokenizer.padding = True

inputs_text = transform(text=input_texts,
                        images=None,
                        return_tensors="pt",
                        padding=True)
for key in inputs_text:
    inputs_text[key] = inputs_text[key].to("cuda")
inputs_image = transform(text=input_image_prompt,
                         images=input_image,
                         return_tensors="pt",
                         padding=True).to("cuda")

with torch.no_grad():
    # Take the final hidden state of the last token as the embedding.
    emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_text = F.normalize(emb_text, dim=-1)
    emb_image = F.normalize(emb_image, dim=-1)
    score = emb_image @ emb_text.T
print("Score:", score)
```
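The same embeddings support retrieval over multiple candidates. The sketch below ranks a few candidate captions against the image by cosine similarity; it reuses `transform`, `model`, `text_prompt`, `F`, and `emb_image` from the snippet above, and the candidate sentences are made up for illustration.

```python
candidates = [
    "A man is crossing the street with a red car parked nearby.",
    "A dog is sleeping on a sofa.",
    "A plate of food on a wooden table.",
]
cand_embs = []
for sent in candidates:
    batch = transform(text=text_prompt.replace('<sent>', sent),
                      images=None, return_tensors="pt", padding=True)
    for key in batch:
        batch[key] = batch[key].to("cuda")
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    cand_embs.append(F.normalize(hidden, dim=-1))
cand_embs = torch.cat(cand_embs, dim=0)        # (N, D) candidate text embeddings
scores = (emb_image @ cand_embs.T).squeeze(0)  # image-to-text similarities
print("Best match:", candidates[scores.argmax().item()])
```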
## 🔢 Results
### Diverse Retrieval
<p align="center">
<img src="figures/res1.png">
</p>
### MMEB
<p align="center">
<img src="figures/res2.png">
</p>
## 📖 Citation
If you find this repository useful, please use the following BibTeX entry for citation.
[📄 Paper](https://arxiv.org/abs/2504.17432)
```latex
@misc{gu2025breakingmodalitybarrieruniversal,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs},
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432},
}
```