lbourdois committed (verified) · Commit 6c29e34 · 1 Parent(s): 54329c5

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.
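For reference, the change amounts to a `language` list in the README's YAML front matter, one ISO 639-3 code per language; the complete 13-entry list appears in the diff below. A minimal sketch:

```yaml
language:
- zho   # Chinese
- eng   # English
- fra   # French
# ...the remaining 10 codes (spa, por, deu, ita, rus, jpn, kor, vie, tha, ara) follow the same pattern
```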

Files changed (1)
  1. README.md +82 -68
README.md CHANGED
@@ -1,69 +1,83 @@
- ---
- license: apache-2.0
- datasets:
- - liuhaotian/LLaVA-Pretrain
- - lmms-lab/LLaVA-NeXT-Data
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
- ---
-
- [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
- ## Model
- We used [**MLCD**](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) as the Vision Encoder in [LLaVA-Next](https://huggingface.co/lmms-lab/llava-next-qwen-32b).
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)
-
-
- ## Data
- Our model was trained on publicly available data from the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) datasets.
-
- ## How to eval
- ```shell
- pip install lmms-eval==0.2.0
-
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
- python -m accelerate.commands.launch \
- --main_process_port=12581 \
- --num_processes=8 \
- -m lmms_eval \
- --model llava \
- --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
- --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
- --batch_size 1 \
- --log_samples \
- --log_samples_suffix mlcd_llava_qwen2_7b \
- --output_path ./log
- ```
-
-
- ## Performance and Limitations
-
- In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate the performance of the MLCD model in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results show that the modified model performs exceptionally well across multiple benchmarks, validating the effectiveness of the MLCD model within MLLMs.
-
- | Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
- |:----------------|:-------------|:-------------|
- | LLM | Qwen2.5-7B | Qwen2.5-7B |
- | AI2D | **76.98** | 73.15 |
- | ScienceQA_img | **78.09** | 76.35 |
- | GQA | **64.17** | 63.31 |
- | InfoVQA_val | **43.48** | 38.88 |
- | MMBench_cn_dev | **74.83** | 72.51 |
- | MMBench_en_dev | **76.37** | 74.57 |
- | MME(cognition) | **432** | 384 |
- | MME(perception) | **1598** | 1512 |
- | SeedBench | **68.20** | 66.80 |
- | SeedBench_img | **73.75** | 72.72 |
- | MMStar | **50.98** | 48.98 |
- | MMMU | **44.30** | 44.20 |
- | OCRBench | **531.00** | 525.00 |
- | ChartQA | **67.84** | 66.52 |
- | DocVQA_val | **76.46** | 75.21 |
- | POPE | 88.69 | **88.83** |
- | TextVQA_val | 61.69 | **62.47** |
-
- ### C. Limitations
- Models with larger datasets will perform better on more tasks. We are currently training such models and will soon make them available.
-
-
- ## Acknowledgments
-
+ ---
+ license: apache-2.0
+ datasets:
+ - liuhaotian/LLaVA-Pretrain
+ - lmms-lab/LLaVA-NeXT-Data
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---
+
+ [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
+ ## Model
+ We used [**MLCD**](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) as the Vision Encoder in [LLaVA-Next](https://huggingface.co/lmms-lab/llava-next-qwen-32b).
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)
+
+
+ ## Data
+ Our model was trained on publicly available data from the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) datasets.
+
+ ## How to eval
+ ```shell
+ pip install lmms-eval==0.2.0
+
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+ python -m accelerate.commands.launch \
+ --main_process_port=12581 \
+ --num_processes=8 \
+ -m lmms_eval \
+ --model llava \
+ --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
+ --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
+ --batch_size 1 \
+ --log_samples \
+ --log_samples_suffix mlcd_llava_qwen2_7b \
+ --output_path ./log
+ ```
+
+
+ ## Performance and Limitations
+
+ In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate the performance of the MLCD model in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results show that the modified model performs exceptionally well across multiple benchmarks, validating the effectiveness of the MLCD model within MLLMs.
+
+ | Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
+ |:----------------|:-------------|:-------------|
+ | LLM | Qwen2.5-7B | Qwen2.5-7B |
+ | AI2D | **76.98** | 73.15 |
+ | ScienceQA_img | **78.09** | 76.35 |
+ | GQA | **64.17** | 63.31 |
+ | InfoVQA_val | **43.48** | 38.88 |
+ | MMBench_cn_dev | **74.83** | 72.51 |
+ | MMBench_en_dev | **76.37** | 74.57 |
+ | MME(cognition) | **432** | 384 |
+ | MME(perception) | **1598** | 1512 |
+ | SeedBench | **68.20** | 66.80 |
+ | SeedBench_img | **73.75** | 72.72 |
+ | MMStar | **50.98** | 48.98 |
+ | MMMU | **44.30** | 44.20 |
+ | OCRBench | **531.00** | 525.00 |
+ | ChartQA | **67.84** | 66.52 |
+ | DocVQA_val | **76.46** | 75.21 |
+ | POPE | 88.69 | **88.83** |
+ | TextVQA_val | 61.69 | **62.47** |
+
+ ### C. Limitations
+ Models with larger datasets will perform better on more tasks. We are currently training such models and will soon make them available.
+
+
+ ## Acknowledgments
+
  We would like to express our gratitude to [Yumeng Wang](https://huggingface.co/devymex) for his significant contributions to the experimental validation in MLLMs.