[Kimi-Audio-7B 🤗](https://huggingface.co/moonshotai/Kimi-Audio-7B) | [Kimi-Audio-7B-Instruct 🤗](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) | [📑 Paper](assets/kimia_report.pdf)
We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio.

## 🔥🔥🔥 News!!

* April 25, 2025: 👋 We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
* April 25, 2025: 👋 We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). **Our results and baselines** can be easily reproduced with this toolkit!
* April 25, 2025: 👋 We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).

## Table of Contents

- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [Contact Us](#contact-us)

## Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

* **Universal Capabilities:** Handles diverse tasks such as speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation.
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see the [Technical Report](assets/kimia_report.pdf)).
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic features + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.

**This is the pre-trained model of Kimi-Audio. If you want to use Kimi-Audio in practice, please refer to [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct); a hedged usage sketch appears after the architecture overview below.**

## Architecture Overview

Kimi-Audio consists of three main components:

1. **Audio Tokenizer:** Converts input audio into:
   * discrete semantic tokens (12.5 Hz) obtained via vector quantization, and
   * continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).

   At 12.5 Hz, for example, a 10-second clip yields 125 semantic tokens.
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM such as Qwen2.5-7B) with shared layers that process multimodal inputs, followed by parallel heads that autoregressively generate text tokens and discrete audio semantic tokens. A toy sketch of this parallel-head idea follows the list below.
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
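To make the parallel-head design concrete, here is a minimal, self-contained PyTorch sketch of a shared trunk feeding two output projections, one over a text vocabulary and one over an audio semantic-token vocabulary. All layer counts, dimensions, and vocabulary sizes below are illustrative assumptions, not the released model's configuration.

```python
import torch
import torch.nn as nn

class DualHeadLM(nn.Module):
    """Toy sketch of the parallel-head idea: a shared transformer trunk
    produces hidden states, and two independent linear heads predict the
    next text token and the next audio semantic token at each position.
    All sizes are illustrative, not Kimi-Audio's actual configuration."""

    def __init__(self, d_model=1024, text_vocab=32000, audio_vocab=16384):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Stand-in for the shared LLM layers over fused audio/text inputs.
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(d_model, text_vocab)    # logits over text tokens
        self.audio_head = nn.Linear(d_model, audio_vocab)  # logits over 12.5 Hz semantic tokens

    def forward(self, hidden_in):
        # hidden_in: (batch, seq, d_model) fused multimodal representations
        h = self.trunk(hidden_in)
        return self.text_head(h), self.audio_head(h)

model = DualHeadLM()
h = torch.randn(1, 125, 1024)  # e.g. 10 s of audio at 12.5 Hz -> 125 positions
text_logits, audio_logits = model(h)
print(text_logits.shape, audio_logits.shape)  # (1, 125, 32000) (1, 125, 16384)
```

At inference time, the two heads let the model emit a text stream and an audio semantic-token stream in parallel; the audio stream is then rendered to a waveform by the detokenizer.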
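For running the instruct model in practice, the companion GitHub repository ships an inference package. The sketch below assumes its `kimia_infer` API; the class name `KimiAudio`, the `generate` call, its arguments, and the example audio path are our best-effort reading of that repository and may differ from the released interface, so treat this as a hedged illustration rather than the definitive API.

```python
# Hedged inference sketch: assumes the `kimia_infer` package from the
# companion GitHub repository. Names and arguments may differ from the
# actual release; consult that repo's README for the authoritative API.
from kimia_infer.api.kimia import KimiAudio

# Load the instruct checkpoint; the detokenizer is only needed when you
# want waveform output rather than text-only answers.
model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=False)

# Kimi-Audio conversations interleave text and audio turns.
messages = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": "examples/asr_example.wav"},  # hypothetical path
]

# output_type="text" asks for a text-only response (ASR here); an audio
# output type would additionally return a waveform from the detokenizer.
_, text_output = model.generate(messages, output_type="text")
print(text_output)
```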
## License

The model is based on and modified from [Qwen2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgements

We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:

* [Whisper](https://github.com/openai/whisper)
* [Transformers](https://github.com/huggingface/transformers)
* [BigVGAN](https://github.com/NVIDIA/BigVGAN)
* [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)

Thank you to all the open-source projects for their contributions to this project!

## Citation

If you find Kimi-Audio useful in your research or applications, please cite our technical report:

```bibtex
@misc{kimi_audio_2025,
  title={Kimi-Audio Technical Report},
  author={Kimi Team},
  year={2025},
  eprint={arXiv:placeholder},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Contact Us

For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.