JUNJIE99 committed on
Commit e8000f1 · verified · 1 Parent(s): 102e85e

Upload folder using huggingface_hub

Files changed (1): README.md (+135 −3)
README.md CHANGED
---
license: mit
language:
- en
base_model:
- llava-hf/llava-v1.6-mistral-7b-hf
tags:
- multimodal-retrieval
- embedding-model
---
<h1 align="center">MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</h1>

<p align="center">
    <a href="https://arxiv.org/abs/2412.14475">
        <img alt="Build" src="https://img.shields.io/badge/cs.CV-arXiv%3A2412.14475-B31B1B.svg">
    </a>
    <a href="https://github.com/VectorSpaceLab/MegaPairs">
        <img alt="Build" src="https://img.shields.io/badge/Github-Code-blue">
    </a>
    <a href="https://huggingface.co/datasets/JUNJIE99/MegaPairs">
        <img alt="Build" src="https://img.shields.io/badge/🤗 Datasets-MegaPairs-yellow">
    </a>
</p>

<p align="center">
    <a href="https://huggingface.co/JUNJIE99/MMRet-base">
        <img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_base-yellow">
    </a>
    <a href="https://huggingface.co/JUNJIE99/MMRet-large">
        <img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_large-yellow">
    </a>
    <a href="https://huggingface.co/JUNJIE99/MMRet-MLLM-S1">
        <img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_MLLM_S1-yellow">
    </a>
    <a href="https://huggingface.co/JUNJIE99/MMRet-MLLM-S2">
        <img alt="Build" src="https://img.shields.io/badge/🤗 Model-MMRet_MLLM_S2-yellow">
    </a>
</p>

## News
```2025-3-4``` 🚀🚀 We have released the MMRet-MLLM models on Hugging Face: [MMRet-MLLM-S1](https://huggingface.co/JUNJIE99/MMRet-MLLM-S1) and [MMRet-MLLM-S2](https://huggingface.co/JUNJIE99/MMRet-MLLM-S2). **MMRet-MLLM-S1** is trained exclusively on our MegaPairs dataset, achieving outstanding performance in composed image retrieval, with an 8.1% improvement on the CIRCO benchmark (mAP@5) over the previous state-of-the-art. **MMRet-MLLM-S2** builds on MMRet-MLLM-S1 with an additional epoch of fine-tuning on the MMEB benchmark training set, delivering enhanced performance across a broader range of multimodal embedding tasks.

```2024-12-27``` 🚀🚀 The MMRet-CLIP models are released on Hugging Face: [MMRet-base](https://huggingface.co/JUNJIE99/MMRet-base) and [MMRet-large](https://huggingface.co/JUNJIE99/MMRet-large).

```2024-12-19``` 🎉🎉 Released our paper: [MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval](https://arxiv.org/pdf/2412.14475).
## Release Plan
- [x] Paper
- [x] MMRet-base and MMRet-large models
- [x] MMRet-MLLM model
- [ ] MegaPairs Dataset
- [ ] Evaluation code
- [ ] Fine-tuning code

## Introduction
In this project, we introduce **MegaPairs**, a novel data synthesis method that leverages open-domain images to create *heterogeneous KNN triplets* for universal multimodal retrieval. Our MegaPairs dataset contains over 26 million triplets, and we have trained a series of multimodal retrieval models, **MMRets**, including MMRet-CLIP (base and large) and MMRet-MLLM.

MMRets achieve state-of-the-art performance on four popular zero-shot composed image retrieval benchmarks and on the massive multimodal embedding benchmark (MMEB). Extensive experiments demonstrate the ***efficiency, scalability, and generalization*** of MegaPairs. Please refer to our [paper](https://arxiv.org/abs/2412.14475) for more details.
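For intuition only, here is a toy sketch of the heterogeneous-KNN pairing idea, not the paper's actual pipeline (see the paper for the full construction): images that are nearest neighbors under two different similarity models are paired, and each pair would then be annotated with a textual instruction to form a (query image, instruction, target image) triplet. All arrays, sizes, and names below are illustrative placeholders.

```python
# Toy sketch of heterogeneous KNN pairing (illustrative, not the MegaPairs pipeline).
import numpy as np

def knn_pairs(emb: np.ndarray, k: int) -> set:
    """(query, neighbor) index pairs for each image's k nearest neighbors
    under cosine similarity."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    nbrs = np.argsort(-sims, axis=1)[:, :k]  # top-k neighbor indices per row
    return {(q, int(n)) for q in range(len(emb)) for n in nbrs[q]}

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 512))  # stand-in for one vision encoder's embeddings
emb_b = rng.normal(size=(1000, 768))  # stand-in for a second, different encoder

# The union over different similarity models yields more diverse
# ("heterogeneous") image pairs than any single model alone; each pair would
# then be captioned with an open-ended instruction to complete the triplet.
pairs = knn_pairs(emb_a, k=5) | knn_pairs(emb_b, k=5)
print(f"{len(pairs)} candidate pairs")
```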
## Model Usage

### 1. MMRet-CLIP Models
You can easily use the MMRet-CLIP models with ```transformers```:
```python
import torch
from transformers import AutoModel

MODEL_NAME = "JUNJIE99/MMRet-base" # or "JUNJIE99/MMRet-large"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True) # You must set trust_remote_code=True
model.set_processor(MODEL_NAME)
model.eval()

with torch.no_grad():
    # Composed-image-retrieval query: a reference image plus a modification text
    query = model.encode(
        images="./assets/cir_query.png",
        text="Make the background dark, as if the camera has taken the photo at night",
    )

    # Candidate images to rank against the query
    candidates = model.encode(
        images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"]
    )

    scores = query @ candidates.T
print(scores)
```
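The `scores` above are plain dot products between the returned embeddings. As a small, hypothetical extension of the same API (assuming `encode` returns torch tensors with one row per input, as the example suggests), you could rank a larger candidate pool with `torch.topk`:

```python
# Hypothetical continuation of the snippet above: rank a candidate pool.
# `model` and `query` come from the previous example; `candidate_paths`
# is an illustrative list of image files.
import torch

candidate_paths = [
    "./assets/cir_candi_1.png",
    "./assets/cir_candi_2.png",
    # ... any number of additional candidate images
]

with torch.no_grad():
    candidates = model.encode(images=candidate_paths)

scores = (query @ candidates.T).flatten()  # one similarity score per candidate
topk = torch.topk(scores, k=min(2, len(candidate_paths)))
for score, idx in zip(topk.values.tolist(), topk.indices.tolist()):
    print(f"{candidate_paths[idx]}: {score:.4f}")
```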

### 2. MMRet-MLLM Models
```Will be released soon.```

## Model Performance
### Zero-Shot Composed Image Retrieval

MMRet sets a new performance benchmark in zero-shot composed image retrieval. On the CIRCO benchmark (mAP@5), our MMRet-base model, with only 149 million parameters, surpasses all previous models, including those with 50 times more parameters. Additionally, MMRet-MLLM achieves an 8.1% improvement over the previous state-of-the-art model.

<img src="./assets/res-zs-cir.png" width="800">
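For reference on the metric: CIRCO scores retrieval with mAP@5, the same number cited in the News entry above. Below is a generic sketch of mAP@k for queries that may have several relevant targets; it is illustrative and not CIRCO's official evaluation script.

```python
# Generic mAP@k (illustrative; not CIRCO's official evaluation code).
def average_precision_at_k(ranked_ids, relevant_ids, k=5):
    """AP@k for one query: sum of precision@i at each hit rank i,
    normalized by min(k, number of relevant items)."""
    hits, score = 0, 0.0
    for i, cand in enumerate(ranked_ids[:k], start=1):
        if cand in relevant_ids:
            hits += 1
            score += hits / i  # precision at this rank
    denom = min(k, len(relevant_ids))
    return score / denom if denom else 0.0

def map_at_k(all_rankings, all_relevants, k=5):
    """Mean of AP@k over all queries."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(all_rankings, all_relevants)]
    return sum(aps) / len(aps)

# One query whose two relevant targets are retrieved at ranks 1 and 3:
print(average_precision_at_k(["a", "x", "b", "y", "z"], {"a", "b"}, k=5))  # 0.8333...
```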
### Zero-Shot Performance on MMEB

MMRet-MLLM achieves state-of-the-art zero-shot performance on the Massive Multimodal Embedding Benchmark (MMEB), despite being trained only on the ImageText-to-Image paradigm. This demonstrates the excellent generalization capability of MegaPairs for multimodal embedding.

<img src="./assets/res-zs-mmeb.png" width="800">
### Fine-Tuning Performance on MMEB

After fine-tuning on downstream tasks, MMRet-MLLM maintains its leading performance. Notably, it surpasses the previous state-of-the-art by 7.1% on the MMEB out-of-distribution (OOD) set. These results demonstrate the robust generalization capability of MMRet-MLLM and highlight the potential of MegaPairs as foundational training data for universal multimodal embedding.

<img src="./assets/res-ft-mmeb.png" width="800">
### Performance Scaling
MegaPairs showcases **scalability**: MMRet-base improves steadily as the amount of training data increases. It also demonstrates **efficiency**: with just 0.5M training samples, MMRet-base significantly outperforms MagicLens, which uses the same CLIP-base backbone but was trained on 36.7M samples.

<img src="./assets/res-scaling.png" width="800">

## License
The annotations for MegaPairs and the MMRet models are released under the [MIT License](LICENSE). The images in MegaPairs originate from the [Recap-DataComp](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) dataset, which is released under the CC BY 4.0 license.

## Citation
If you find this repository useful, please consider giving it a star ⭐ and a citation:

```bibtex
@article{zhou2024megapairs,
  title={MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval},
  author={Zhou, Junjie and Liu, Zheng and Liu, Ze and Xiao, Shitao and Wang, Yueze and Zhao, Bo and Zhang, Chen Jason and Lian, Defu and Xiong, Yongping},
  journal={arXiv preprint arXiv:2412.14475},
  year={2024}
}
```