Add pipeline tag and library name
#1 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,16 +1,19 @@
 ---
-
+base_model:
+- Qwen/Qwen2-VL-7B-Instruct
 datasets:
 - Michael4933/MGrounding-630k
 - lmms-lab/M4-Instruct-Data
 - lmms-lab/LLaVA-OneVision-Data
 language:
 - en
+license: apache-2.0
 metrics:
 - accuracy
-
-
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
 Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

 <p align="center">
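The `pipeline_tag: image-text-to-text` and `library_name: transformers` fields added above are what this PR is named for: they let the Hub route the model to the image-text-to-text task and render a Transformers loading snippet on the model page. As a minimal loading sketch, assuming a `transformers` release recent enough to ship the `image-text-to-text` pipeline (roughly 4.47+) and that the checkpoint retains its Qwen2-VL configuration (a fuller multi-image call is sketched after the Inference hunk below):

```python
# Minimal sketch, not the project's official snippet: load Migician via the
# task named by the new pipeline_tag. Assumes a recent transformers release
# that ships the image-text-to-text pipeline, plus accelerate for device_map.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",          # matches pipeline_tag in the YAML above
    model="Michael4933/Migician",  # loadable because library_name is transformers
    torch_dtype="auto",
    device_map="auto",
)
```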
@@ -30,10 +33,7 @@ This repository hosts the usage details of our training dataset <strong>MGroundi

 ## 📰 News
 * **[2025.01.13]** 🌷🌷🌷 We have further released our massive multi-image grounding training dataset [MGrounding_630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) and our multi-image grounding benchmark [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench) on Huggingface🤗. Feel free to download and apply them for your own use.
-* **[2025.01.12]** 🎉🎉🎉 The model weights are now available on HuggingFace! 🤗 Download and have a try at [Huggingface Model](https://huggingface.co/Michael4933/Migician)
-* **[2025.01.10]** 🎉🎉🎉 We have released our paper on [Arxiv](https://arxiv.org/abs/2501.05767) at the start of the new year!
-
-## 📝 Abstract
+* **[2025.01.12]** 🎉🎉🎉 The model weights are now available on HuggingFace! 🤗 Download and have a try at [Huggingface Model](https://huggingface.co/Michael4933/Migician)!\n* **[2025.01.10]** 🎉🎉🎉 We have released our paper on [Arxiv](https://arxiv.org/abs/2501.05767) at the start of the new year!\n\n## 📝 Abstract

 The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce 🎩<strong>Migician</strong>, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.

@@ -171,7 +171,7 @@ An example structure for training data:
 <span id='Inference'/>

 #### Inference
-As mentioned in the paper, 🎩Migician is finetuned on [Qwen2-
+As mentioned in the paper, 🎩Migician is finetuned on [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) through a progressive two-stage training process with massive amount of data on 8*A100-80G. You can feel the 🪄magic of multi-image grounding through the following code.

 <p align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/654f3e104c8874c64d43aafa/3MgtMW_LOQwODDtoRAbY3.png" width=100%>
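The code block this paragraph points to is truncated out of the diff. As a rough sketch of what multi-image inference with a Qwen2-VL-based checkpoint typically looks like in `transformers` (the image paths and the grounding prompt below are placeholders, `qwen_vl_utils` is the helper package published alongside Qwen2-VL, and Migician's exact prompt and bounding-box output conventions come from the full project README rather than this excerpt):

```python
# Hedged sketch of multi-image grounding inference with a Qwen2-VL-style
# checkpoint; the file paths and prompt below are hypothetical placeholders.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Michael4933/Migician",
    torch_dtype="auto",
    device_map="auto",  # requires accelerate
)
processor = AutoProcessor.from_pretrained("Michael4933/Migician")

# Two images plus a free-form grounding instruction in the Qwen2-VL chat format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./examples/view_1.png"},  # placeholder path
            {"type": "image", "image": "./examples/view_2.png"},  # placeholder path
            {"type": "text", "text": "Find the object in the second image that matches the one highlighted in the first image and give its bounding box."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

How the returned box coordinates are formatted is defined by the project's own documentation, not by this sketch.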
@@ -296,7 +296,7 @@ Migician/
 ```bibtex
 @misc{li2025migicianrevealingmagicfreeform,
       title={Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models},
-      author={You Li and Heyu Huang and Chi
+      author={You Li and Heyu Huang and Chen Chi and Kaiyu Huang and Chao Huang and Zonghao Guo and Zhiyuan Liu and Jinan Xu and Yuhua Li and Ruixuan Li and Maosong Sun},
       year={2025},
       url={https://arxiv.org/abs/2501.05767},
 }