Nealeon committed
Commit 50260a0 · verified · 1 Parent(s): bfd4bc4

Update README.md

Files changed (1): README.md (+6, -5)
README.md CHANGED
@@ -19,7 +19,8 @@ library_name: transformers
 </a> &nbsp;|&nbsp;
 <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/">💬 Chat Web</a>
 </div>
-## Introduction
+
+## 1. Introduction
 
 We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
 
@@ -35,7 +36,7 @@ Building on this foundation, we introduce an advanced long-thinking variant: **K
 
 More information can be found in our technical report: [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491).
 
-## Architecture
+## 2. Architecture
 
 The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.
 
@@ -43,7 +44,7 @@ The model adopts an MoE language model, a native-resolution visual encoder (Moon
 <img width="90%" src="figures/arch.png">
 </div>
 
-## Model Variants
+## 3. Model Variants
 
 🤗 For general multimodal perception and understanding, OCR, long video and long document, video perception, and agent uses, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g. math), please consider using `Kimi-VL-A3B-Thinking`.
 
@@ -63,7 +64,7 @@ The model adopts an MoE language model, a native-resolution visual encoder (Moon
 
 
 
-## Performance
+## 4. Performance
 
 With effective long-thinking abilities, Kimi-VL-A3B-Thinking can match the performance of 30B/70B frontier open-source VLMs on the MathVision benchmark:
 
@@ -127,7 +128,7 @@ print(response)
 
 We have submitted pull request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM. You are welcome to deploy Kimi-VL from the branch corresponding to that PR until it is merged.
 
-## 8. Citation
+## 5. Citation
 
 ```
 @misc{kimiteam2025kimivltechnicalreport,
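
For reference beyond this diff: the Model Variants paragraph above recommends `Kimi-VL-A3B-Instruct` for general perception, OCR, long-document/video, and agent use, and `Kimi-VL-A3B-Thinking` for heavier text and multimodal reasoning. The sketch below shows roughly how either checkpoint is loaded with Hugging Face Transformers. It is a minimal illustration, not the quick start from this README, and it assumes the repository's custom code path (`trust_remote_code=True`) resolves through `AutoModelForCausalLM`/`AutoProcessor` and accepts the common multimodal chat-template message schema; the model card is authoritative if it differs.

```python
# Minimal sketch (assumption, not taken from this commit): load a Kimi-VL
# checkpoint with Hugging Face Transformers and run one image+text turn.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Either variant from the "Model Variants" section; swap in
# "moonshotai/Kimi-VL-A3B-Thinking" for reasoning-heavy prompts.
model_id = "moonshotai/Kimi-VL-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: bf16 weights fit the target GPU
    device_map="auto",
    trust_remote_code=True,       # the repo ships its own modeling code
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Message schema follows the common HF multimodal chat-template convention.
image = Image.open("figures/arch.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the architecture in this diagram."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding so only the new reply is printed.
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```

Switching `model_id` to the Thinking variant and allowing a larger `max_new_tokens` budget is typically all that changes for reasoning-heavy prompts.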
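
The last hunk also mentions vLLM pull request [#16387](https://github.com/vllm-project/vllm/pull/16387). Until that PR is merged, Kimi-VL support lives only on its branch, so vLLM would need to be installed from that branch (for example by fetching `pull/16387/head` from GitHub and installing from source). Assuming such a build is available, a hypothetical offline-inference sketch using vLLM's standard Python API could look like the following; none of it comes from this commit.

```python
# Hypothetical sketch: offline inference through vLLM's Python API, assuming
# vLLM was installed from the PR #16387 branch so the Kimi-VL architecture is
# registered. Standard vLLM entrypoints only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-VL-A3B-Instruct",
    trust_remote_code=True,   # the checkpoint ships custom code/config
    max_model_len=32768,      # assumption: cap the context to fit one GPU
)

sampling = SamplingParams(temperature=0.2, max_tokens=512)

# Text-only prompt for brevity; image inputs would go through vLLM's
# multi-modal input path once the PR's Kimi-VL support handles them.
outputs = llm.generate(
    ["Summarize in one paragraph what the Kimi-VL-A3B models are designed for."],
    sampling,
)
print(outputs[0].outputs[0].text)
```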