Improve language tag
Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve discoverability. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.
README.md
CHANGED
@@ -1,69 +1,83 @@
---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-NeXT-Data
base_model:
- Qwen/Qwen2.5-7B-Instruct
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Model

We used [**MLCD**](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) as the Vision Encoder in [LLaVA-Next](https://huggingface.co/lmms-lab/llava-next-qwen-32b).



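The card itself does not include an inference snippet, so the following is a minimal, hypothetical sketch of how the checkpoint might be used **if** it loads through the standard `transformers` LLaVA-NeXT classes. The class choice (`LlavaNextProcessor` / `LlavaNextForConditionalGeneration`), the qwen_1_5-style prompt string, and the `example.jpg` input are assumptions, not something the card specifies; if the weights are only published in the original LLaVA repository format, run the model through the LLaVA-NeXT codebase instead (as the lmms-eval command below does via `--model llava`).

```python
# Hypothetical inference sketch (not from the original card). It assumes the
# DeepGlint-AI/llava-mlcd-qwen2.5-7b checkpoint is compatible with the
# standard transformers LLaVA-NeXT classes.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "DeepGlint-AI/llava-mlcd-qwen2.5-7b"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder input image

# Qwen-style chat prompt; the exact format follows the conv_template
# (qwen_1_5) used in the evaluation command below, and is an assumption here.
prompt = (
    "<|im_start|>user\n<image>\nDescribe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```
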
## Data

Our model was trained on publicly available data from the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) datasets.

## How to eval

```shell
pip install lmms-eval==0.2.0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m accelerate.commands.launch \
    --main_process_port=12581 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
    --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd_llava_qwen2_7b \
    --output_path ./log
```

## Performance and Limitations

In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate the performance of the MLCD model in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results show that the modified model performs exceptionally well across multiple benchmarks, validating the effectiveness of the MLCD model within MLLMs.

| Vision Tower     | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------|:----------------------|:----------------------|
| LLM              | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D             | **76.98**             | 73.15                 |
| ScienceQA_img    | **78.09**             | 76.35                 |
| GQA              | **64.17**             | 63.31                 |
| InfoVQA_val      | **43.48**             | 38.88                 |
| MMBench_cn_dev   | **74.83**             | 72.51                 |
| MMBench_en_dev   | **76.37**             | 74.57                 |
| MME (cognition)  | **432**               | 384                   |
| MME (perception) | **1598**              | 1512                  |
| SeedBench        | **68.20**             | 66.80                 |
| SeedBench_img    | **73.75**             | 72.72                 |
| MMStar           | **50.98**             | 48.98                 |
| MMMU             | **44.30**             | 44.20                 |
| OCRBench         | **531.00**            | 525.00                |
| ChartQA          | **67.84**             | 66.52                 |
| DocVQA_val       | **76.46**             | 75.21                 |
| POPE             | 88.69                 | **88.83**             |
| TextVQA_val      | 61.69                 | **62.47**             |

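As a quick sanity check on the claim above (not part of the original card), the short script below recomputes the per-benchmark deltas between the two vision towers, using the numbers copied verbatim from the table.

```python
# Self-contained helper (not from the original card): recomputes the
# per-benchmark difference between the MLCD and CLIP vision towers,
# with scores copied verbatim from the table above.
RESULTS = {
    # benchmark: (MLCD ViT_L_14_336px, CLIP ViT_L_14_336px)
    "AI2D": (76.98, 73.15),
    "ScienceQA_img": (78.09, 76.35),
    "GQA": (64.17, 63.31),
    "InfoVQA_val": (43.48, 38.88),
    "MMBench_cn_dev": (74.83, 72.51),
    "MMBench_en_dev": (76.37, 74.57),
    "MME_cognition": (432, 384),
    "MME_perception": (1598, 1512),
    "SeedBench": (68.20, 66.80),
    "SeedBench_img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "OCRBench": (531.00, 525.00),
    "ChartQA": (67.84, 66.52),
    "DocVQA_val": (76.46, 75.21),
    "POPE": (88.69, 88.83),
    "TextVQA_val": (61.69, 62.47),
}

wins = 0
for name, (mlcd, clip) in RESULTS.items():
    delta = mlcd - clip
    wins += delta > 0
    print(f"{name:16s} MLCD {mlcd:8.2f}  CLIP {clip:8.2f}  delta {delta:+.2f}")
print(f"MLCD is ahead on {wins} of {len(RESULTS)} benchmarks.")
```
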
### Limitations

Models trained on larger datasets perform better on a wider range of tasks. We are currently training such models and will make them available soon.


## Acknowledgments

We would like to express our gratitude to [Yumeng Wang](https://huggingface.co/devymex) for his significant contributions to the experimental validation in MLLMs.