---
license: mit
---

## MLCD-ViT-bigG Model Card

MLCD-ViT-bigG is a state-of-the-art vision transformer from DeepGlint AI, enhanced with 2D Rotary Position Embedding (RoPE2D). It is designed for complex visual-language tasks and achieves strong results on document understanding and visual question answering benchmarks, as shown in the table below.

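For intuition, RoPE2D replaces the 1D token index of standard rotary embeddings with each patch's 2D (row, column) grid position: half of the channel pairs are rotated by an angle derived from the row, the other half from the column, so relative spatial offsets are encoded directly in the query/key dot products. The sketch below is a minimal illustration of that idea, not the model's actual implementation (which lives in the `vit_rope2d_hf` module used in the Usage section); the function name, shapes, and grid setup are assumptions for demonstration.

```python
import torch

def rope_2d(x: torch.Tensor, coords: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of `x` by angles derived from 2D patch positions.

    x:      (num_patches, dim) query/key features, with dim divisible by 4.
    coords: (num_patches, 2) integer (row, col) position of each patch.
    """
    dim = x.shape[-1]
    # Half of the channel pairs encode the row position, the other half the column position.
    freqs = 1.0 / (theta ** (torch.arange(0, dim // 2, 2).float() / (dim // 2)))  # (dim/4,)
    row_angles = coords[:, :1].float() * freqs                                    # (num_patches, dim/4)
    col_angles = coords[:, 1:].float() * freqs                                    # (num_patches, dim/4)
    angles = torch.cat([row_angles, col_angles], dim=-1)                          # (num_patches, dim/2)
    cos, sin = angles.cos(), angles.sin()

    # Standard rotary update applied to interleaved channel pairs (x1, x2).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)

# Example: rotate 1024 patch tokens arranged on a 32x32 grid (448px image, patch size 14).
coords = torch.stack(torch.meshgrid(torch.arange(32), torch.arange(32), indexing="ij"), dim=-1).reshape(-1, 2)
q = torch.randn(1024, 64)
q_rot = rope_2d(q, coords)
```
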
We use the official [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) codebase and the official [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) training dataset to evaluate the foundation vision models; the language model is Qwen2.5-7B.

| Vision Tower                                                                                   | RoPE2D | ChartQA   | DocVQA    | InfoVQA   | OCRBench   | MMMU      |
| :--------------------------------------------------------------------------------------------- | :----: | :-------- | :-------- | :-------- | :--------- | :-------- |
| CLIP (ViT-L-14-336px)                                                                          | ×      | 66.52     | 75.21     | 38.88     | 525.00     | 44.20     |
| SigLIP (ViT-SO400M-384px)                                                                      | ×      | 69.28     | 76.71     | 41.38     | 554.00     | 46.78     |
| DFN5B (ViT-H-14-378px)                                                                         | ×      | 64.36     | 70.87     | 38.59     | 473.00     | **48.00** |
| **[MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)**    | ×      | 67.84     | 76.46     | 43.48     | 531.00     | 44.30     |
| **[MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)**  | √      | 71.07     | 79.63     | 44.38     | 572.00     | 46.78     |
| **[MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)**  | √      | **73.80** | **83.34** | **46.59** | **582.00** | 46.00     |

## Installation

```shell
pip install torch transformers
git clone https://github.com/deepglint/unicom
cd unicom/mlcd
```

## Usage

```python
from vit_rope2d_hf import MLCDVisionModel
from transformers import CLIPImageProcessor
from PIL import Image
import requests
import torch

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Get visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")
```
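
`last_hidden_state` contains one feature vector per token. If a single embedding per image is needed (for example, for retrieval or as a frozen feature), one simple option is to pool over the token dimension. The snippet below is an illustrative sketch, not something prescribed by the model card; whether a leading class token should be dropped first depends on the model's output layout.

```python
# Mean-pool over the token dimension to get one embedding per image.
# (Illustrative choice; check the model's token layout before dropping any tokens.)
image_embedding = features.mean(dim=1)  # shape: (batch_size, hidden_dim)
print(image_embedding.shape)
```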