alanzhuly committed on
Commit cb6484b · verified · 1 Parent(s): 881a253

Update README.md

Files changed (1)
  1. README.md +17 -17
README.md CHANGED
@@ -6,26 +6,26 @@ tags:
 - GGUF
 - Image-Text-to-Text
 ---
-# Omnivision
+# OmniVLM
 
 ## 🔥 Latest Update
-- [Nov 27, 2024] **Model Improvements:** OmniVision v3 model's **GGUF file has been updated** in this Hugging Face Repo! ✨
+- [Nov 27, 2024] **Model Improvements:** OmniVLM v3 model's **GGUF file has been updated** in this Hugging Face Repo! ✨
 👉 Test these exciting changes in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo)
 
 
-- [Nov 22, 2024] **Model Improvements:** OmniVision v2 model's **GGUF file has been updated** in this Hugging Face Repo! ✨ Key Improvements Include:
+- [Nov 22, 2024] **Model Improvements:** OmniVLM v2 model's **GGUF file has been updated** in this Hugging Face Repo! ✨ Key Improvements Include:
   - Enhanced Art Descriptions
   - Better Complex Image Understanding
   - Improved Anime Recognition
   - More Accurate Color and Detail Detection
   - Expanded World Knowledge
 
-We are continuously improving Omnivision-968M based on your valuable feedback! **More exciting updates coming soon - Stay tuned!** ⭐
+We are continuously improving OmniVLM-968M based on your valuable feedback! **More exciting updates coming soon - Stay tuned!** ⭐
 
 
 ## Introduction
 
-Omnivision is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:
+OmniVLM is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:
 
 - **9x Token Reduction**: Reduces image tokens from **729** to **81**, cutting latency and computational cost aggressively. Note that the computation of vision encoder and the projection part keep the same, but the computation of language model backbone is reduced due to 9X shorter image token span.
 - **Trustworthy Result**: Reduces hallucinations using **DPO** training from trustworthy data.
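The 9x figure in the token-reduction bullet above is plain patch-grid arithmetic: SigLIP at 384×384 with 14×14 patches yields a 27×27 grid of 729 image tokens, and compressing each 3×3 neighborhood leaves a 9×9 grid of 81 tokens. A minimal sketch of that arithmetic; the 3×3 spatial pooling is an assumption, since the README only states the before/after counts:

```python
# Patch-grid arithmetic behind the 729 -> 81 image-token reduction.
image_size = 384                        # SigLIP-400M input resolution
patch_size = 14                         # SigLIP patch size
grid = image_size // patch_size         # 27 patches per side
encoder_tokens = grid * grid            # 27 * 27 = 729 tokens from the vision encoder

pool = 3                                # assumed 3x3 neighborhood compression
lm_tokens = (grid // pool) ** 2         # 9 * 9 = 81 tokens seen by the language model

print(encoder_tokens, lm_tokens, encoder_tokens // lm_tokens)  # 729 81 9
```

Only those 81 tokens reach the Qwen2.5 backbone, which is why the savings show up in the language-model computation while the vision-encoder and projection cost stay the same.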
@@ -38,7 +38,7 @@ Omnivision is a compact, sub-billion (968M) multimodal model for processing both
 **Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
 
 ## Intended Use Cases
-Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.
+OmniVLM is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.
 
 **Example Demo:**
 Generating captions for a 1046×1568 image on M4 Pro Macbook takes **< 2s processing time** and requires only 988 MB RAM and 948 MB Storage.
@@ -49,13 +49,13 @@ Generating captions for a 1046×1568 image on M4 Pro Macbook takes **< 2s proces
 
 ## Benchmarks
 
-Below we demonstrate a figure to show how Omnivision performs against nanollava. In all the tasks, Omnivision outperforms the previous world's smallest vision-language model.
+Below we demonstrate a figure to show how OmniVLM performs against nanollava. In all the tasks, OmniVLM outperforms the previous world's smallest vision-language model.
 
 <img src="benchmark.png" alt="Benchmark Radar Chart" style="width:500px;"/>
 
-We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, POPE to evaluate the performance of Omnivision.
+We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, POPE to evaluate the performance of OmniVLM.
 
-| Benchmark | Nexa AI Omnivision v2 | Nexa AI Omnivision v1 | nanoLLAVA |
+| Benchmark | Nexa AI OmniVLM v2 | Nexa AI OmniVLM v1 | nanoLLAVA |
 |-------------------|------------------------|------------------------|-----------|
 | ScienceQA (Eval) | 71.0 | 62.2 | 59.0 |
 | ScienceQA (Test) | 71.0 | 64.5 | 59.0 |
@@ -67,7 +67,7 @@ We have conducted a series of experiments on benchmark datasets, including MM-VE
 
 
 ## How to Use On Device
-In the following, we demonstrate how to run Omnivision locally on your device.
+In the following, we demonstrate how to run OmniVLM locally on your device.
 
 **Step 1: Install Nexa-SDK (local on-device inference framework)**
 
@@ -78,11 +78,11 @@ In the following, we demonstrate how to run Omnivision locally on your device.
 **Step 2: Then run the following code in your terminal**
 
 ```bash
-nexa run omnivision
+nexa run OmniVLM
 ```
 
 ## Model Architecture ##
-Omnivision's architecture consists of three key components:
+OmniVLM's architecture consists of three key components:
 
 - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
 - Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
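Taken together, the components above follow the usual LLaVA-style wiring: SigLIP patch embeddings are compressed to 81 tokens, projected by an MLP into the Qwen2.5 embedding space, and prepended to the text tokens before decoding. The sketch below shows that data flow only; the hidden sizes, the average-pooling step, and the attribute names are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class VlmSketch(nn.Module):
    """LLaVA-style wiring: vision encoder -> token compression -> MLP projector -> LM."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, lm_dim=896):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. SigLIP-400M, returning (B, 729, vision_dim)
        self.language_model = language_model      # e.g. Qwen2.5-0.5B-Instruct backbone
        self.pool = nn.AvgPool2d(kernel_size=3)   # 27x27 patch grid -> 9x9 (729 -> 81 tokens), assumed
        self.projector = nn.Sequential(           # MLP mapping image features into the LM token space
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, pixel_values, text_embeds):
        feats = self.vision_encoder(pixel_values)                 # (B, 729, vision_dim)
        b, n, d = feats.shape
        side = int(n ** 0.5)                                      # 27
        feats = feats.transpose(1, 2).reshape(b, d, side, side)   # (B, d, 27, 27)
        feats = self.pool(feats).flatten(2).transpose(1, 2)       # (B, 81, d)
        image_tokens = self.projector(feats)                      # (B, 81, lm_dim)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)    # image tokens prepended to text
        return self.language_model(inputs_embeds=inputs)
```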
@@ -92,7 +92,7 @@ The vision encoder first transforms input images into embeddings, which are then
 
 ## Training
 
-We developed Omnivision through a three-stage training pipeline:
+We developed OmniVLM through a three-stage training pipeline:
 
 **Pretraining:**
 The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
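In practice the pretraining stage described above means training the projector alone while the vision encoder and language model stay frozen. A sketch of that setup, assuming the model exposes a `projector` sub-module as in the architecture sketch earlier; the optimizer and learning rate are placeholders:

```python
import torch

def configure_pretraining(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Stage-1 setup: freeze everything, then unfreeze only the projection layer."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.projector.parameters():    # assumed attribute name
        param.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-3)  # placeholder hyperparameters
```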
@@ -103,12 +103,12 @@ We enhance the model's contextual understanding using image-based question-answe
 **Direct Preference Optimization (DPO):**
 The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections while maintaining high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targeted at essential model output improvements without altering the model's core response characteristics
 
-## What's next for Omnivision?
-Omnivision is in early development and we are working to address current limitations:
+## What's next for OmniVLM?
+OmniVLM is in early development and we are working to address current limitations:
 - Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
 - Improve document and text understanding
 
-In the long term, we aim to develop Omnivision as a fully optimized, production-ready solution for edge AI multimodal applications.
+In the long term, we aim to develop OmniVLM as a fully optimized, production-ready solution for edge AI multimodal applications.
 
 ### Follow us
-[Blogs](https://nexa.ai/blogs/omni-vision) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai)
+[Blogs](https://nexa.ai/blogs/OmniVLM) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai)
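The DPO stage described above boils down to a simple data-construction loop: for each image and prompt, the base model's answer becomes the rejected response and the teacher's minimally edited correction becomes the chosen response. A sketch of that pairing step; `generate_base_response` and `teacher_minimal_edit` are stand-ins for the two models, not part of any released API:

```python
def build_dpo_pairs(samples, generate_base_response, teacher_minimal_edit):
    """Turn (image, prompt) samples into chosen/rejected preference pairs for DPO."""
    pairs = []
    for image, prompt in samples:
        rejected = generate_base_response(image, prompt)        # raw base-model answer
        chosen = teacher_minimal_edit(image, prompt, rejected)  # minimally corrected answer
        pairs.append({
            "image": image,
            "prompt": prompt,
            "chosen": chosen,      # preferred, accuracy-corrected response
            "rejected": rejected,  # original base-model response
        })
    return pairs
```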
 