Add pipeline tag and library name

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +9 -9
README.md CHANGED
````diff
@@ -1,16 +1,19 @@
  ---
- license: apache-2.0
+ base_model:
+ - Qwen/Qwen2-VL-7B-Instruct
  datasets:
  - Michael4933/MGrounding-630k
  - lmms-lab/M4-Instruct-Data
  - lmms-lab/LLaVA-OneVision-Data
  language:
  - en
+ license: apache-2.0
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2-VL-7B-Instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
  Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

  <p align="center">
@@ -30,10 +33,7 @@ This repository hosts the usage details of our training dataset <strong>MGroundi

  ## 📰 News
  * **[2025.01.13]** 🌷🌷🌷 We have further released our massive multi-image grounding training dataset [MGrounding_630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) and our multi-image grounding benchmark [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench) on Huggingface🤗. Feel free to download and apply them for your own use.
- * **[2025.01.12]** 🌟🌟🌟 The model weights are now available on HuggingFace! 🤗 Download and have a try at [Huggingface Model](https://huggingface.co/Michael4933/Migician)!
- * **[2025.01.10]** 🌞🌞🌞 We have released our paper on [Arxiv](https://arxiv.org/abs/2501.05767) at the start of the new year!
-
- ## 📝 Abstract
+ * **[2025.01.12]** 🌟🌟🌟 The model weights are now available on HuggingFace! 🤗 Download and have a try at [Huggingface Model](https://huggingface.co/Michael4933/Migician)!\n* **[2025.01.10]** 🌞🌞🌞 We have released our paper on [Arxiv](https://arxiv.org/abs/2501.05767) at the start of the new year!\n\n## 📝 Abstract

  The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce 🎩<strong>Migician</strong>, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.

@@ -171,7 +171,7 @@ An example structure for training data:
  <span id='Inference'/>

  #### Inference
- As mentioned in the paper, 🎩Migician is finetuned on [Qwen2-vl-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) through a progressive two-stage training process with massive amount of data on 8*A100-80G. You can feel the 🪄magic of multi-image grounding through the following code.
+ As mentioned in the paper, 🎩Migician is finetuned on [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) through a progressive two-stage training process with massive amount of data on 8*A100-80G. You can feel the 🪄magic of multi-image grounding through the following code.

  <p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/654f3e104c8874c64d43aafa/3MgtMW_LOQwODDtoRAbY3.png" width=100%>
@@ -296,7 +296,7 @@ Migician/
  ```bibtex
  @misc{li2025migicianrevealingmagicfreeform,
  title={Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models},
- author={You Li and Heyu Huang and Chi Chen and Kaiyu Huang and Chao Huang and Zonghao Guo and Zhiyuan Liu and Jinan Xu and Yuhua Li and Ruixuan Li and Maosong Sun},
+ author={You Li and Heyu Huang and Chen Chi and Kaiyu Huang and Chao Huang and Zonghao Guo and Zhiyuan Liu and Jinan Xu and Yuhua Li and Ruixuan Li and Maosong Sun},
  year={2025},
  url={https://arxiv.org/abs/2501.05767},
  }
````
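
For context on the metadata change in the first hunk: `pipeline_tag: image-text-to-text` is what lists the model under that task on the Hub, and `library_name: transformers` is what tells the Hub which library loads it. Below is a minimal sketch of the usage this enables, assuming a recent `transformers` release that ships the image-text-to-text pipeline and that the repository's processor config resolves through `AutoProcessor`; the image URL and prompt are placeholders, not taken from the model card:

```python
from transformers import pipeline

# Resolve the checkpoint through the task declared by pipeline_tag.
pipe = pipeline(
    "image-text-to-text",
    model="Michael4933/Migician",
    device_map="auto",
    torch_dtype="auto",
)

# Chat-style input: one user turn mixing an image and a text instruction.
# The URL and prompt below are placeholders for illustration only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(outputs[0]["generated_text"])
```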
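
The inference code that the changed `#### Inference` line refers to sits outside the diff hunks, so it is not shown here. For orientation only, here is a minimal multi-image sketch following the standard Qwen2-VL usage pattern in `transformers` (consistent with the Qwen2-VL-7B base model); the image paths and the grounding prompt are placeholders rather than the authors' example, and the `qwen_vl_utils` helper package must be installed separately:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper used in Qwen2-VL examples

# Load the fine-tuned checkpoint with the Qwen2-VL architecture.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Michael4933/Migician", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Michael4933/Migician")

# Two images plus a free-form grounding instruction (placeholder paths and prompt).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figs/image_1.png"},
            {"type": "image", "image": "./figs/image_2.png"},
            {"type": "text", "text": "Find the object in image 2 that matches the one marked in image 1 and give its bounding box."},
        ],
    }
]

# Build the prompt, collect the vision inputs, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```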