Kaichengalex committed · Commit 470d72e · verified · 1 Parent(s): ba3d95c

Update README.md

Files changed (1): README.md (+104 −1)
README.md CHANGED
@@ -12,4 +12,107 @@ tags:
- Multimodal
- Embedding
pipeline_tag: image-text-to-text
---

# Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
<a href="https://github.com/GaryGuTC">Tiancheng Gu*</a>,
<a href="https://kaicheng-yang0828.github.io">Kaicheng Yang*</a>,
Ziyong Feng,
Xingjun Wang,
Yanzhao Zhang,
Dingkun Long,
Yingda Chen,
<a href="https://weidong-tom-cai.github.io/">Weidong Cai</a>,
<a href="https://jiankangdeng.github.io">Jiankang Deng</a>

[🏡 Project Page](https://garygutc.github.io/UniME) | [📄 Paper]() | [💻 Github](https://github.com/deepglint/UniME)

<p align="center">
<img src="figures/fig1.png" width="85%" height="85">
</p>

## 🎺 News
- [2025/04/24]: ✨We release the evaluation and demo code.
- [2025/04/24]: ✨The UniME paper has been submitted to arXiv.
- [2025/04/22]: ✨We release the UniME model weights on [🤗 Hugging Face](https://huggingface.co/collections/DeepGlint-AI/unime-6805fa16ab0071a96bef29d2).

## 💡 Highlights
To enhance the MLLM's embedding capability, we propose textual discriminative knowledge distillation. During training, we decouple the LLM component of the MLLM and process text with the prompt "Summarize the above sentences in one word.", then align the student (MLLM) and teacher (NV-Embed V2) embeddings via KL divergence on batch-wise similarity distributions. **Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen**.

<p align="center">
<img src="figures/fig2.png" width="85%">
</p>
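
For intuition, the objective can be written compactly in PyTorch. The sketch below is illustrative only (the function name, temperature, and reduction are our assumptions, not the released training code): each model's in-batch similarity matrix is turned into a distribution, and the student's distribution is pulled toward the teacher's.

```python
import torch
import torch.nn.functional as F

def text_distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """KL divergence between batch-wise similarity distributions.

    student_emb: (B, D_s) text embeddings from the MLLM's LLM component.
    teacher_emb: (B, D_t) text embeddings from the teacher (e.g. NV-Embed V2).
    The two embedding spaces may have different dimensions; only the
    in-batch similarity structure is matched.
    """
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)

    # Batch-wise similarity of every sample against every other sample.
    student_logits = student_emb @ student_emb.T / temperature
    teacher_logits = teacher_emb @ teacher_emb.T / temperature

    # Align the student's similarity distribution with the teacher's.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```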

After that, we propose hard negative enhanced instruction tuning, which improves visual sensitivity, strengthens cross-modal alignment, and boosts instruction-following capabilities. At its core are two key innovations: a false negative filtering mechanism that uses a similarity threshold to eliminate misleading samples, and an automatic hard negative sampling strategy that selects the top-k similar but non-matching examples to increase training difficulty.
<p align="center">
<img src="figures/fig3.png" width="85%">
</p>
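
A rough sketch of this selection step is shown below; the function name, threshold, and k value are placeholders for illustration rather than the settings used in the paper.

```python
import torch

def select_hard_negatives(query_emb, cand_emb, positive_idx, k=8, false_neg_threshold=0.9):
    """Pick the top-k similar but non-matching candidates for each query.

    query_emb:    (B, D) L2-normalized query embeddings (e.g. image side).
    cand_emb:     (N, D) L2-normalized candidate embeddings (e.g. text side).
    positive_idx: (B,) index of the ground-truth candidate for each query.
    """
    sim = query_emb @ cand_emb.T                              # (B, N) cosine similarities
    # Never treat the labelled positive as a negative.
    sim.scatter_(1, positive_idx.unsqueeze(1), float("-inf"))
    # False negative filtering: candidates above the threshold are likely
    # unlabeled matches rather than true negatives, so drop them as well.
    sim = sim.masked_fill(sim > false_neg_threshold, float("-inf"))
    # Automatic hard negative sampling: keep the k most similar remaining candidates.
    return sim.topk(k, dim=1).indices                         # (B, k) hard negative indices
```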

## 🧭 Quick Start
```bash
git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt
```

```python
import torch
from PIL import Image
from torch.nn import functional as F
from transformers import AutoProcessor, AutoModelForCausalLM

base_model_path = "DeepGlint-AI/UniME-Phi3.5-V-4.2B"

# Embedding prompts: the model is asked to summarize the image / sentence in one word.
img_prompt = '<|user|>\n<|image_1|>\nSummary above image in one word: <|end|>\n<|assistant|>\n'
text_prompt = '<|user|>\n<sent>\nSummary above sentence in one word: <|end|>\n<|assistant|>\n'

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_texts = text_prompt.replace('<sent>', text)
input_image_prompt = img_prompt
input_image = [Image.open(image_path)]

# Load the processor and model (flash attention requires a compatible GPU).
transform = AutoProcessor.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model_path,
                                             device_map="cuda",
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16,
                                             _attn_implementation='flash_attention_2')
transform.tokenizer.padding_side = "left"
transform.tokenizer.padding = True

# Preprocess the text-only prompt and the image prompt.
inputs_text = transform(text=input_texts,
                        images=None,
                        return_tensors="pt",
                        padding=True)
for key in inputs_text:
    inputs_text[key] = inputs_text[key].to("cuda")
inputs_image = transform(text=input_image_prompt,
                         images=input_image,
                         return_tensors="pt",
                         padding=True).to("cuda")

# Use the last token's final hidden state as the embedding, then L2-normalize.
with torch.no_grad():
    emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_text = F.normalize(emb_text, dim=-1)
    emb_image = F.normalize(emb_image, dim=-1)
    Score = emb_image @ emb_text.T
print("Score: ", Score)
```
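
Because both embeddings are L2-normalized, the printed score is the cosine similarity between the image and the text. The same pattern extends to retrieval by encoding several candidates in one padded batch and ranking them. The snippet below is a small illustrative extension that reuses the `model`, `transform`, `text_prompt`, and `emb_image` objects defined above; the candidate captions are made-up examples.

```python
candidates = [
    "A man is crossing the street with a red car parked nearby.",
    "A dog is sleeping on a sofa.",
    "Two children are playing football in a park.",
]

# Encode all candidate captions in one left-padded batch.
inputs = transform(text=[text_prompt.replace('<sent>', c) for c in candidates],
                   images=None, return_tensors="pt", padding=True)
inputs = {key: value.to("cuda") for key, value in inputs.items()}

with torch.no_grad():
    emb_texts = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_texts = F.normalize(emb_texts, dim=-1)
    scores = (emb_image @ emb_texts.T).squeeze(0)   # one cosine similarity per caption

# Print the candidates from most to least similar to the image.
for caption, score in sorted(zip(candidates, scores.tolist()), key=lambda item: -item[1]):
    print(f"{score:.3f}  {caption}")
```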

## 🔢 Results
### Diverse Retrieval
<p align="center">
<img src="figures/res1.png" width="85%">
</p>

### MMEB
<p align="center">
<img src="figures/res2.png" width="85%">
</p>

## 📖 Citation
If you find this repository useful, please use the following BibTeX entry for citation.
```latex
Coming soon
```