lbourdois committed on
Commit d833a10 · verified · 1 Parent(s): 3035be0

Improve language tag

Hi! As the model is multilingual, this PR adds languages other than English to the `language` tag to improve how the model is referenced. Note that 29 languages are announced in the README, but only 13 are explicitly listed; I was therefore only able to add those 13 languages.
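For reference, the same metadata change could also be made programmatically. The sketch below is only an illustration and is not how this PR was produced: it uses the `huggingface_hub` ModelCard API, takes the repo id from the model link in the README, and assumes a write token plus the `create_pr` workflow.

```
# Illustrative sketch only: update the model card's `language` metadata
# with huggingface_hub. Requires a token with write access to the repo;
# create_pr=True opens a pull request instead of committing directly.
from huggingface_hub import ModelCard

# The 13 languages explicitly listed in the README.
languages = [
    "zho", "eng", "fra", "spa", "por", "deu", "ita",
    "rus", "jpn", "kor", "vie", "tha", "ara",
]

card = ModelCard.load("MAmmoTH-VL/MAmmoTH-VL-8B")
card.data.language = languages  # replaces the previous single `en` entry
card.push_to_hub(
    "MAmmoTH-VL/MAmmoTH-VL-8B",
    commit_message="Improve language tag",
    create_pr=True,
)
```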

Files changed (1)
  1. README.md +67 -55
README.md CHANGED
@@ -1,56 +1,68 @@
- ---
- license: apache-2.0
- datasets:
- - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
- tags:
- - vision
- - multimodal
- - reasoning
- - math
- - STEM
- - VQA
- - Video
- ---
- # MAmmoTH-VL-8B
-
- [🏠 Homepage](https://mammoth-vl.github.io/) | [🤖 MAmmoTH-VL-8B](https://huggingface.co/MAmmoTH-VL/MAmmoTH-VL-8B) | [💻 Code](https://github.com/MAmmoTH-VL/MAmmoTH-VL) | [📄 Arxiv](https://arxiv.org/abs/2412.05237) | [📕 PDF](https://arxiv.org/pdf/2412.05237) | [🖥️ Demo](https://huggingface.co/spaces/paralym/MAmmoTH-VL-8B)
-
- # Abstract
- Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales.
- To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
-
-
- # Performance
-
- We highlight different groups of models with different colors: <span style="background-color: #f2f2f2">closed-source models</span>, <span style="background-color: #cce0ff">open weights</span> but closed training details, and <span style="background-color: #e0f7e0">fully open-source</span> models. Results are from official sources or running with lmms-eval package if unavailable.
-
- ## Multi-Discipline Knowledge and Mathematical Reasoning
-
- ![image/png](https://i.ibb.co/DzMVYPr/result1.png)
-
- ## Chart & Doc Understanding and Multimodal Interactions & Preferences
-
- ![image/png](https://i.ibb.co/FxYjPLz/result2.png)
-
- ## Multi-Image and Video
-
- ![image/png](https://i.ibb.co/TkZqQvs/result3.png)
-
-
- ## Citing the Model
-
- ```
- @article{guo2024mammothvlelicitingmultimodalreasoning,
- title={MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale},
- author={Jarvis Guo and Tuney Zheng and Yuelin Bai and Bo Li and Yubo Wang and King Zhu and Yizhi Li and Graham Neubig and Wenhu Chen and Xiang Yue},
- year={2024},
- eprint={2412.05237},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2412.05237},
- }
  ```

+ ---
+ license: apache-2.0
+ datasets:
+ - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - vision
+ - multimodal
+ - reasoning
+ - math
+ - STEM
+ - VQA
+ - Video
+ ---
+ # MAmmoTH-VL-8B
+
+ [🏠 Homepage](https://mammoth-vl.github.io/) | [🤖 MAmmoTH-VL-8B](https://huggingface.co/MAmmoTH-VL/MAmmoTH-VL-8B) | [💻 Code](https://github.com/MAmmoTH-VL/MAmmoTH-VL) | [📄 Arxiv](https://arxiv.org/abs/2412.05237) | [📕 PDF](https://arxiv.org/pdf/2412.05237) | [🖥️ Demo](https://huggingface.co/spaces/paralym/MAmmoTH-VL-8B)
+
+ # Abstract
+ Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales.
+ To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
+
+
+ # Performance
+
+ We highlight different groups of models with different colors: <span style="background-color: #f2f2f2">closed-source models</span>, <span style="background-color: #cce0ff">open weights</span> but closed training details, and <span style="background-color: #e0f7e0">fully open-source</span> models. Results are from official sources or running with lmms-eval package if unavailable.
+
+ ## Multi-Discipline Knowledge and Mathematical Reasoning
+
+ ![image/png](https://i.ibb.co/DzMVYPr/result1.png)
+
+ ## Chart & Doc Understanding and Multimodal Interactions & Preferences
+
+ ![image/png](https://i.ibb.co/FxYjPLz/result2.png)
+
+ ## Multi-Image and Video
+
+ ![image/png](https://i.ibb.co/TkZqQvs/result3.png)
+
+
+ ## Citing the Model
+
+ ```
+ @article{guo2024mammothvlelicitingmultimodalreasoning,
+ title={MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale},
+ author={Jarvis Guo and Tuney Zheng and Yuelin Bai and Bo Li and Yubo Wang and King Zhu and Yizhi Li and Graham Neubig and Wenhu Chen and Xiang Yue},
+ year={2024},
+ eprint={2412.05237},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2412.05237},
+ }
  ```
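
As a quick sanity check on the new front matter, the sketch below parses the YAML block between the two `---` markers and inspects the language list. It is only an illustration and assumes PyYAML is installed and that a local copy of the updated README.md is available.

```
# Sketch: parse the updated README front matter and check the language list.
# Assumes PyYAML is installed and README.md is the file shown in this diff.
import yaml

with open("README.md", encoding="utf-8") as f:
    text = f.read()

# The YAML metadata sits between the first two `---` markers.
_, front_matter, _ = text.split("---", 2)
meta = yaml.safe_load(front_matter)

print(meta["language"])            # ['zho', 'eng', ..., 'ara']
assert len(meta["language"]) == 13
assert meta["license"] == "apache-2.0"
```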