lbourdois committed on
Commit d833a10 · verified · 1 Parent(s): 3035be0

Improve language tag

Hi! As the model is multilingual, this PR adds languages other than English to the `language` tag to improve how the model is referenced. Note that 29 languages are announced in the README, but only 13 are explicitly listed; I was therefore only able to add those 13 languages.
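For reference, the same metadata change could also be made programmatically. The sketch below is only an illustration and is not how this PR was produced: it uses the `huggingface_hub` ModelCard API, takes the repo id from the model link in the README, and assumes a write token plus the `create_pr` workflow.

```
# Illustrative sketch only: update the model card's `language` metadata
# with huggingface_hub. Requires a token with write access to the repo;
# create_pr=True opens a pull request instead of committing directly.
from huggingface_hub import ModelCard

# The 13 languages explicitly listed in the README.
languages = [
    "zho", "eng", "fra", "spa", "por", "deu", "ita",
    "rus", "jpn", "kor", "vie", "tha", "ara",
]

card = ModelCard.load("MAmmoTH-VL/MAmmoTH-VL-8B")
card.data.language = languages  # replaces the previous single `en` entry
card.push_to_hub(
    "MAmmoTH-VL/MAmmoTH-VL-8B",
    commit_message="Improve language tag",
    create_pr=True,
)
```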

Files changed (1)
  1. README.md +67 -55
README.md CHANGED
@@ -1,56 +1,68 @@
- ---
- license: apache-2.0
- datasets:
- - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
- tags:
- - vision
- - multimodal
- - reasoning
- - math
- - STEM
- - VQA
- - Video
- ---
- # MAmmoTH-VL-8B
-
- [🏠 Homepage](https://mammoth-vl.github.io/) | [🤖 MAmmoTH-VL-8B](https://huggingface.co/MAmmoTH-VL/MAmmoTH-VL-8B) | [💻 Code](https://github.com/MAmmoTH-VL/MAmmoTH-VL) | [📄 Arxiv](https://arxiv.org/abs/2412.05237) | [📕 PDF](https://arxiv.org/pdf/2412.05237) | [🖥️ Demo](https://huggingface.co/spaces/paralym/MAmmoTH-VL-8B)
-
- # Abstract
- Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales.
- To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
-
-
- # Performance
-
- We highlight different groups of models with different colors: <span style="background-color: #f2f2f2">closed-source models</span>, <span style="background-color: #cce0ff">open weights</span> but closed training details, and <span style="background-color: #e0f7e0">fully open-source</span> models. Results are from official sources or running with lmms-eval package if unavailable.
-
- ## Multi-Discipline Knowledge and Mathematical Reasoning
-
- ![image/png](https://i.ibb.co/DzMVYPr/result1.png)
-
- ## Chart & Doc Understanding and Multimodal Interactions & Preferences
-
- ![image/png](https://i.ibb.co/FxYjPLz/result2.png)
-
- ## Multi-Image and Video
-
- ![image/png](https://i.ibb.co/TkZqQvs/result3.png)
-
-
- ## Citing the Model
-
- ```
- @article{guo2024mammothvlelicitingmultimodalreasoning,
- title={MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale},
- author={Jarvis Guo and Tuney Zheng and Yuelin Bai and Bo Li and Yubo Wang and King Zhu and Yizhi Li and Graham Neubig and Wenhu Chen and Xiang Yue},
- year={2024},
- eprint={2412.05237},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2412.05237},
- }
  ```

+ ---
+ license: apache-2.0
+ datasets:
+ - MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - vision
+ - multimodal
+ - reasoning
+ - math
+ - STEM
+ - VQA
+ - Video
+ ---
+ # MAmmoTH-VL-8B
+
+ [🏠 Homepage](https://mammoth-vl.github.io/) | [🤖 MAmmoTH-VL-8B](https://huggingface.co/MAmmoTH-VL/MAmmoTH-VL-8B) | [💻 Code](https://github.com/MAmmoTH-VL/MAmmoTH-VL) | [📄 Arxiv](https://arxiv.org/abs/2412.05237) | [📕 PDF](https://arxiv.org/pdf/2412.05237) | [🖥️ Demo](https://huggingface.co/spaces/paralym/MAmmoTH-VL-8B)
+
+ # Abstract
+ Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales.
+ To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
+
+
+ # Performance
+
+ We highlight different groups of models with different colors: <span style="background-color: #f2f2f2">closed-source models</span>, <span style="background-color: #cce0ff">open weights</span> but closed training details, and <span style="background-color: #e0f7e0">fully open-source</span> models. Results are from official sources or running with lmms-eval package if unavailable.
+
+ ## Multi-Discipline Knowledge and Mathematical Reasoning
+
+ ![image/png](https://i.ibb.co/DzMVYPr/result1.png)
+
+ ## Chart & Doc Understanding and Multimodal Interactions & Preferences
+
+ ![image/png](https://i.ibb.co/FxYjPLz/result2.png)
+
+ ## Multi-Image and Video
+
+ ![image/png](https://i.ibb.co/TkZqQvs/result3.png)
+
+
+ ## Citing the Model
+
+ ```
+ @article{guo2024mammothvlelicitingmultimodalreasoning,
+ title={MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale},
+ author={Jarvis Guo and Tuney Zheng and Yuelin Bai and Bo Li and Yubo Wang and King Zhu and Yizhi Li and Graham Neubig and Wenhu Chen and Xiang Yue},
+ year={2024},
+ eprint={2412.05237},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2412.05237},
+ }
  ```
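
As a quick sanity check on the new front matter, the sketch below parses the YAML block between the two `---` markers and inspects the language list. It is only an illustration and assumes PyYAML is installed and that a local copy of the updated README.md is available.

```
# Sketch: parse the updated README front matter and check the language list.
# Assumes PyYAML is installed and README.md is the file shown in this diff.
import yaml

with open("README.md", encoding="utf-8") as f:
    text = f.read()

# The YAML metadata sits between the first two `---` markers.
_, front_matter, _ = text.split("---", 2)
meta = yaml.safe_load(front_matter)

print(meta["language"])            # ['zho', 'eng', ..., 'ara']
assert len(meta["language"]) == 13
assert meta["license"] == "apache-2.0"
```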