  <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/">💬 Chat Web</a>
</div>

## 1. Introduction

We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).

More information can be found in our technical report: [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491).

## 2. Architecture

The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.

<div align="center">
  <img width="90%" src="figures/arch.png">
</div>
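
For a concrete picture of the data flow, below is a minimal structural sketch in PyTorch. The module names other than MoonViT, the dimensions, and the module bodies are illustrative placeholders, not the released implementation.

```python
import torch
from torch import nn

# Schematic sketch only: dimensions and module internals are placeholders,
# not the released Kimi-VL code.
class KimiVLSketch(nn.Module):
    def __init__(self, vision_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.moonvit = nn.Identity()        # native-resolution vision encoder (MoonViT)
        self.projector = nn.Sequential(     # MLP projector: vision features -> LM embedding space
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.moe_decoder = nn.Identity()    # MoE language model (~2.8B activated parameters)

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.projector(self.moonvit(image_patches))
        # Projected vision tokens are concatenated with text embeddings
        # and decoded jointly by the MoE language model.
        return self.moe_decoder(torch.cat([vision_tokens, text_embeds], dim=1))
```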

## 3. Model Variants

🤗 For general multimodal perception and understanding, OCR, long video and long document, video perception, and agent uses, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g. math), please consider using `Kimi-VL-A3B-Thinking`.
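
As a starting point, here is a minimal inference sketch with 🤗 Transformers. The image path and prompt are placeholders, the dtype/device settings are illustrative, and it assumes the checkpoint's processor follows the standard chat-template interface (hence `trust_remote_code=True` for the custom model code).

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"  # or "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("figures/demo.png")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "figures/demo.png"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the model's reply.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(response)
```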

## 4. Performance

With effective long-thinking abilities, Kimi-VL-A3B-Thinking can match the performance of 30B/70B frontier open-source VLMs on the MathVision benchmark:

We have submitted pull request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM. Until it is merged, you are welcome to deploy Kimi-VL from the branch behind that PR.
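
Until then, a minimal offline-inference sketch against a vLLM build that includes that branch could look like the following; the prompt and sampling settings are placeholders, and Kimi-VL support is assumed to come from the PR branch rather than a released vLLM version.

```python
from vllm import LLM, SamplingParams

# Assumes vLLM built from the branch behind PR #16387; Kimi-VL support is
# not yet in a released vLLM version at the time of writing.
llm = LLM(model="moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.2, max_tokens=512)

# Text-only prompt for brevity; image inputs go through vLLM's multimodal inputs API.
outputs = llm.generate(["Summarize what Kimi-VL is in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```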

## 5. Citation

```
@misc{kimiteam2025kimivltechnicalreport,