  <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/">💬 Chat Web</a>
</div>

## 1. Introduction

We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).

More information can be found in our technical report: [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491).

## 2. Architecture

The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.

<div align="center">
  <img width="90%" src="figures/arch.png">
</div>
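
For a concrete picture of the data flow, below is a minimal structural sketch in PyTorch. The module names other than MoonViT, the dimensions, and the module bodies are illustrative placeholders, not the released implementation.

```python
import torch
from torch import nn

# Schematic sketch only: dimensions and module internals are placeholders,
# not the released Kimi-VL code.
class KimiVLSketch(nn.Module):
    def __init__(self, vision_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.moonvit = nn.Identity()        # native-resolution vision encoder (MoonViT)
        self.projector = nn.Sequential(     # MLP projector: vision features -> LM embedding space
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.moe_decoder = nn.Identity()    # MoE language model (~2.8B activated parameters)

    def forward(self, image_patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.projector(self.moonvit(image_patches))
        # Projected vision tokens are concatenated with text embeddings
        # and decoded jointly by the MoE language model.
        return self.moe_decoder(torch.cat([vision_tokens, text_embeds], dim=1))
```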

## 3. Model Variants

🤗 For general multimodal perception and understanding, OCR, long video and long document, video perception, and agent uses, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g. math), please consider using `Kimi-VL-A3B-Thinking`.
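
As a starting point, here is a minimal inference sketch with 🤗 Transformers. The image path and prompt are placeholders, the dtype/device settings are illustrative, and it assumes the checkpoint's processor follows the standard chat-template interface (hence `trust_remote_code=True` for the custom model code).

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"  # or "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("figures/demo.png")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "figures/demo.png"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the model's reply.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(response)
```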

## 4. Performance

With effective long-thinking abilities, Kimi-VL-A3B-Thinking can match the performance of 30B/70B frontier open-source VLMs on the MathVision benchmark:

We have submitted pull request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM. Until it is merged, you are welcome to deploy Kimi-VL from the branch behind that PR.
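
Until then, a minimal offline-inference sketch against a vLLM build that includes that branch could look like the following; the prompt and sampling settings are placeholders, and Kimi-VL support is assumed to come from the PR branch rather than a released vLLM version.

```python
from vllm import LLM, SamplingParams

# Assumes vLLM built from the branch behind PR #16387; Kimi-VL support is
# not yet in a released vLLM version at the time of writing.
llm = LLM(model="moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.2, max_tokens=512)

# Text-only prompt for brevity; image inputs go through vLLM's multimodal inputs API.
outputs = llm.generate(["Summarize what Kimi-VL is in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```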

## 5. Citation

```
@misc{kimiteam2025kimivltechnicalreport,