Add pipeline tag and library name

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +9 -9
README.md CHANGED
````diff
@@ -1,16 +1,19 @@
  ---
- license: apache-2.0
+ base_model:
+ - Qwen/Qwen2-VL-7B-Instruct
  datasets:
  - Michael4933/MGrounding-630k
  - lmms-lab/M4-Instruct-Data
  - lmms-lab/LLaVA-OneVision-Data
  language:
  - en
+ license: apache-2.0
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2-VL-7B-Instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
  Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

  <p align="center">
@@ -30,10 +33,7 @@ This repository hosts the usage details of our training dataset <strong>MGroundi

  ## 📰 News
  * **[2025.01.13]** 🌷🌷🌷 We have further released our massive multi-image grounding training dataset [MGrounding_630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) and our multi-image grounding benchmark [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench) on Huggingface🤗. Feel free to download and apply them for your own use.
- * **[2025.01.12]** 🌟🌟🌟 The model weights are now available on HuggingFace! 🤗 Download and have a try at [Huggingface Model](https://huggingface.co/Michael4933/Migician)!
- * **[2025.01.10]** 🌞🌞🌞 We have released our paper on [Arxiv](https://arxiv.org/abs/2501.05767) at the start of the new year!
-
- ## 📝 Abstract
+ * **[2025.01.12]** 🌟🌟🌟 The model weights are now available on HuggingFace! 🤗 Download and have a try at [Huggingface Model](https://huggingface.co/Michael4933/Migician)!\n* **[2025.01.10]** 🌞🌞🌞 We have released our paper on [Arxiv](https://arxiv.org/abs/2501.05767) at the start of the new year!\n\n## 📝 Abstract

  The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce 🎩<strong>Migician</strong>, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.

@@ -171,7 +171,7 @@ An example structure for training data:
  <span id='Inference'/>

  #### Inference
- As mentioned in the paper, 🎩Migician is finetuned on [Qwen2-vl-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) through a progressive two-stage training process with massive amount of data on 8*A100-80G. You can feel the 🪄magic of multi-image grounding through the following code.
+ As mentioned in the paper, 🎩Migician is finetuned on [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) through a progressive two-stage training process with massive amount of data on 8*A100-80G. You can feel the 🪄magic of multi-image grounding through the following code.

  <p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/654f3e104c8874c64d43aafa/3MgtMW_LOQwODDtoRAbY3.png" width=100%>
@@ -296,7 +296,7 @@ Migician/
  ```bibtex
  @misc{li2025migicianrevealingmagicfreeform,
  title={Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models},
- author={You Li and Heyu Huang and Chi Chen and Kaiyu Huang and Chao Huang and Zonghao Guo and Zhiyuan Liu and Jinan Xu and Yuhua Li and Ruixuan Li and Maosong Sun},
+ author={You Li and Heyu Huang and Chen Chi and Kaiyu Huang and Chao Huang and Zonghao Guo and Zhiyuan Liu and Jinan Xu and Yuhua Li and Ruixuan Li and Maosong Sun},
  year={2025},
  url={https://arxiv.org/abs/2501.05767},
  }
````
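
For context on the metadata change in the first hunk: `pipeline_tag: image-text-to-text` is what lists the model under that task on the Hub, and `library_name: transformers` is what tells the Hub which library loads it. Below is a minimal sketch of the usage this enables, assuming a recent `transformers` release that ships the image-text-to-text pipeline and that the repository's processor config resolves through `AutoProcessor`; the image URL and prompt are placeholders, not taken from the model card:

```python
from transformers import pipeline

# Resolve the checkpoint through the task declared by pipeline_tag.
pipe = pipeline(
    "image-text-to-text",
    model="Michael4933/Migician",
    device_map="auto",
    torch_dtype="auto",
)

# Chat-style input: one user turn mixing an image and a text instruction.
# The URL and prompt below are placeholders for illustration only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(outputs[0]["generated_text"])
```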
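
The inference code that the changed `#### Inference` line refers to sits outside the diff hunks, so it is not shown here. For orientation only, here is a minimal multi-image sketch following the standard Qwen2-VL usage pattern in `transformers` (consistent with the Qwen2-VL-7B base model); the image paths and the grounding prompt are placeholders rather than the authors' example, and the `qwen_vl_utils` helper package must be installed separately:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper used in Qwen2-VL examples

# Load the fine-tuned checkpoint with the Qwen2-VL architecture.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Michael4933/Migician", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Michael4933/Migician")

# Two images plus a free-form grounding instruction (placeholder paths and prompt).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figs/image_1.png"},
            {"type": "image", "image": "./figs/image_2.png"},
            {"type": "text", "text": "Find the object in image 2 that matches the one marked in image 1 and give its bounding box."},
        ],
    }
]

# Build the prompt, collect the vision inputs, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```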