Update README.md
README.md
CHANGED
@@ -1,73 +1,47 @@
<p align="center">
- <img src="assets/kimia_logo.png" width="400"/>
<p>

<p align="center">
- Kimi-Audio-7B
</p>

-
- We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio.
-
- ## 🔥🔥🔥 News!!
- * April 25, 2025: We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
- * April 25, 2025: We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). You can easily reproduce **our results and baselines** with this toolkit!
- * April 25, 2025: We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).
-
- ## Table of Contents
-
- - [Introduction](#introduction)
- - [Architecture Overview](#architecture-overview)
- - [License](#license)
- - [Acknowledgements](#acknowledgements)
- - [Citation](#citation)
- - [Contact Us](#contact-us)
-
## Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

- * **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC)
- * **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Technical Report](assets/kimia_report.pdf)).
- * **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
- * **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.
-

-
- ## Architecture Overview
- <p align="center">
- <img src="assets/kimia_framework.png" width="70%"/>
- <p>
-
- Kimi-Audio consists of three main components:
-
- 1. **Audio Tokenizer:** Converts input audio into:
-    * Discrete semantic tokens (12.5Hz) using vector quantization.
-    * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
- 2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
- 3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
-
-
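To make the three-stage data flow above concrete, here is a minimal, illustrative sketch. Every name, vocabulary size, and feature dimension below is a hypothetical stand-in rather than the repository's actual API; the only figures taken from the text are the 12.5 Hz token rate and the Whisper-style encoder input (assumed to be 16 kHz audio), so the stubs mirror shapes and rates only.

```python
# Illustrative sketch only — hypothetical names, not the official Kimi-Audio API.
import numpy as np

FRAME_RATE_HZ = 12.5      # both token streams described above run at 12.5 Hz
SAMPLE_RATE = 16000       # assumed input sample rate (Whisper-style encoder)
SEMANTIC_VOCAB = 16384    # hypothetical codebook size for the semantic tokens
WHISPER_DIM = 1280        # hypothetical dimension of the continuous features


def tokenize(audio: np.ndarray):
    """1) Audio Tokenizer: waveform -> discrete semantic tokens + continuous features (12.5 Hz)."""
    num_frames = int(len(audio) / SAMPLE_RATE * FRAME_RATE_HZ)
    semantic_tokens = np.random.randint(0, SEMANTIC_VOCAB, size=num_frames)  # vector-quantized IDs
    acoustic_feats = np.random.randn(num_frames, WHISPER_DIM)                # downsampled encoder output
    return semantic_tokens, acoustic_feats


def audio_llm_step(semantic_tokens, acoustic_feats):
    """2) Audio LLM: shared layers over the hybrid input, then parallel text and audio heads."""
    next_text_token = np.random.randint(0, 32000)            # text head (hypothetical vocab size)
    next_audio_token = np.random.randint(0, SEMANTIC_VOCAB)  # audio head (semantic token)
    return next_text_token, next_audio_token


def detokenize_chunk(audio_tokens):
    """3) Audio Detokenizer: semantic tokens -> waveform chunk (flow matching + vocoder, streamed)."""
    samples_per_token = int(SAMPLE_RATE / FRAME_RATE_HZ)  # 16000 / 12.5 = 1280 samples per token
    return np.zeros(len(audio_tokens) * samples_per_token, dtype=np.float32)


if __name__ == "__main__":
    one_second = np.zeros(SAMPLE_RATE, dtype=np.float32)
    sem, feats = tokenize(one_second)       # ~12 frames for 1 s of audio
    txt, aud = audio_llm_step(sem, feats)
    wav = detokenize_chunk([aud] * 25)      # 25 tokens ≈ 2 s of generated audio
    print(sem.shape, feats.shape, wav.shape)
```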
- ## License
-
- The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
-
-
-
- ## Acknowledgements
-
- We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:
-
- * [Whisper](https://github.com/openai/whisper)
- * [Transformers](https://github.com/huggingface/transformers)
- * [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- * [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)
-
- Thank you to all the open-source projects for their contributions to this project!


## Citation
@@ -85,6 +59,6 @@ If you find Kimi-Audio useful in your research or applications, please cite our
}
```

- ##

-
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ tags:
+ - audio
+ - audio-language-model
+ - speech-recognition
+ - audio-understanding
+ - text-to-speech
+ - audio-generation
+ - chat
+ - kimi-audio
+ ---
+
+ # Kimi-Audio
+
<p align="center">
+ <img src="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_logo.png" width="400"/>
<p>

<p align="center">
+ Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a> Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a> | 📑 <a href="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf">Paper</a>
</p>

## Introduction

+ We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository hosts the model checkpoints for Kimi-Audio-7B.
+
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

+ * **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC) and end-to-end speech conversation.
+ * **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see our [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf)).
+ * **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.

+ For more details, please refer to our [GitHub Repository](https://github.com/MoonshotAI/Kimi-Audio) and [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf).

+ ## Note

+ Kimi-Audio-7B is a base model without fine-tuning, so it cannot be used directly.
+ The base model is quite flexible; you can fine-tune it on any downstream task.
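As a minimal sketch of how one might pull the base checkpoint for fine-tuning, the snippet below downloads the weights from the Hugging Face Hub. It assumes the standard `huggingface_hub` client and the repo id shown on this card; the `transformers` loading step at the end is an assumption rather than documented usage, so prefer the loading utilities in the official GitHub repository.

```python
# Sketch: download the Kimi-Audio-7B base weights for downstream fine-tuning.
# Assumptions: huggingface_hub is installed and the repo id matches this card;
# the transformers step below is unverified — prefer the official repo's loaders.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="moonshotai/Kimi-Audio-7B")
print("checkpoint downloaded to:", local_dir)

# Assumed path via transformers remote code (verify against the official repo).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Audio-7B",
    trust_remote_code=True,  # assumption: the Hub repo ships custom modeling code
)
```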
## Citation
}
```

+ ## License

+ The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).