---
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
## Embodied Ability Evaluation: Performance on RoboVQA and OpenEQA
Red marks the best result in each row.
| Benchmark | Metric | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4V | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA | BLEU1 | <span style="color:red">73.16</span> | 38.12 | - | 54.9 |
| | BLEU2 | <span style="color:red">66.39</span> | 33.56 | - | 44.2 |
| | BLEU3 | <span style="color:red">60.61</span> | 31.76 | - | 39.5 |
| | BLEU4 | <span style="color:red">56.56</span> | 30.97 | - | 36.3 |
| OpenEQA | Object State Recognition | <span style="color:red">71.83</span> | - | 63.2 | - |
| | Object Recognition | <span style="color:red">49.46</span> | - | 43.4 | - |
| | Functional Reasoning | 54.38 | - | <span style="color:red">57.4</span> | - |
| | Spatial Understanding | <span style="color:red">48.64</span> | - | 33.6 | - |
| | Attribute Recognition | <span style="color:red">67.08</span> | - | 57.2 | - |
| | World Knowledge | <span style="color:red">53.87</span> | - | 50.7 | - |
| | Object Localization | <span style="color:red">43.06</span> | - | 42.0 | - |
## General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4V, and GPT-4o
| Dataset | Split | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4V | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMbench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMbench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |
## Usage
### A. Installation
```bash
# Clone the repository and build the Docker image
git clone https://github.com/deepglint/unicom
cd unicom/mlcd_vl
docker build -t train_mlcd_llava .

# Start a container with GPU access, mounting your data and the repository
docker run --gpus all \
    -v /vlm:/vlm \
    -v /mnt:/mnt \
    -v $(pwd):/workspace \
    --rm \
    -w /workspace \
    --shm-size=64g -it train_mlcd_llava bash

# Inside the container: install flash-attn
pip install flash-attn==2.3.3 --no-build-isolation
```
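After the container is up, a quick sanity check confirms that the GPUs are visible and the flash-attn build imports cleanly (a minimal sketch, assuming the versions installed above):
```python
# Sanity check inside the container: PyTorch should see the GPUs and the
# flash-attn wheel installed above should import without errors.
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("flash-attn version:", flash_attn.__version__)
```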
### B. Inference
```bash
CUDA_VISIBLE_DEVICES=0 python infer_mlcd_emboided.py --model_dir DeepGlint-AI/MLCD-Embodied-7B
# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ../_static/images/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
```
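For programmatic (non-interactive) use, the sketch below assumes the `mlcd_vl` codebase follows the standard LLaVA-NeXT API (`load_pretrained_model`, `process_images`, `tokenizer_image_token`) and the `qwen_1_5` conversation template used in the evaluation command in section D; verify the exact entry points in the repo's `llava/` package before relying on it.
```python
# Hypothetical sketch (not taken from the repo): single-image inference, assuming
# the standard LLaVA-NeXT loading and generation API.
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

device = "cuda:0"
# "llava_qwen" is the model-name hint LLaVA-NeXT uses for Qwen-based checkpoints
# (an assumption here); the conv template matches conv_template=qwen_1_5 in section D.
tokenizer, model, image_processor, _ = load_pretrained_model(
    "DeepGlint-AI/MLCD-Embodied-7B", None, "llava_qwen", device_map=device)
model.eval()

image = Image.open("../_static/images/logo.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device=device) for t in image_tensor]

conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0],
                    DEFAULT_IMAGE_TOKEN + "\nWhat kind of animal is it in this picture?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```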
### C. Evaluation for Embodied Ability
#### Step 1
Download the raw data by following the instructions for [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa) (val split only).
#### Step 2
Convert the raw data into the format required for model evaluation:
```bash
# convert OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py
# convert RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
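To confirm the conversion worked, the generated parquet files can be inspected directly. This is a minimal sketch with a placeholder path; the column names are whatever the conversion scripts above produce, so treat them as an assumption:
```python
# Minimal sketch: inspect a converted benchmark parquet.
# The path is a placeholder; point it at the output of make_robovqa_bmk.py.
import pandas as pd

df = pd.read_parquet("/path/to/your/benchmarks/RoboVQA/robovqa.parquet")
print(df.shape)             # number of evaluation samples and columns
print(df.columns.tolist())  # question/answer/image fields (check the actual names)
print(df.head(3))
```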
#### Step 3
Make sure your top-level directory structure looks like this:
```
|--/path/to/your/benchmarks
| |--OpenEQA
| | |--openeqa_scannet.parquet
| | |--openeqa_hm3d.parquet
| |--RoboVQA
| |--robovqa.parquet
|--/path/to/your/images
|--openeqa_val
| |--scannet-v0
| | |--002-scannet-scene0709_00
| | |--xxx-scannet-scenexxxx_xx
| |--hm3d-v0
| |--000-hm3d-BFRyYbPCCPE
| |--xxx-hm3d-xxxxxxxxxxx
|--robovqa_val
|--robovqa_221911
|--robovqa_xxxxxx
```
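Before launching the evaluation, it can help to confirm that the expected files and folders are in place; a minimal sketch using the placeholder paths from the layout above:
```python
# Minimal sketch: check that the benchmark parquet files and image folders exist.
# Replace the placeholder paths with your own before running.
from pathlib import Path

bmk_root = Path("/path/to/your/benchmarks")
image_folder = Path("/path/to/your/images")

expected = [
    bmk_root / "OpenEQA" / "openeqa_scannet.parquet",
    bmk_root / "OpenEQA" / "openeqa_hm3d.parquet",
    bmk_root / "RoboVQA" / "robovqa.parquet",
    image_folder / "openeqa_val" / "scannet-v0",
    image_folder / "openeqa_val" / "hm3d-v0",
    image_folder / "robovqa_val",
]
for path in expected:
    print("OK      " if path.exists() else "MISSING ", path)
```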
#### Step 4
Run the evaluation script:
```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```
### D. Evaluation for General Ability
Install the evaluation tool and execute the evaluation script:
```bash
pip install lmms-eval==0.2.0
PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
--main_process_port=12444 \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
--tasks mme \
--batch_size 1 \
--log_samples \
--log_samples_suffix mlcd \
--output_path ./eval_log/
```
We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), and [Yin Xie](https://huggingface.co/Yin-Xie) for their significant contributions to the experimental validation in MLLMs.