Kimi-Audio-7B 🤗 | Kimi-Audio-7B-Instruct 🤗 | 📑 Paper
We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio.
🔥🔥🔥 News!!
- April 25, 2025: 🔥 We release the inference code and model weights of Kimi-Audio-7B and Kimi-Audio-7B-Instruct.
- April 25, 2025: 🔥 We release the audio evaluation toolkit ALMEvalKit. Our results and the baselines can be easily reproduced with this toolkit!
- April 25, 2025: 🔥 We release the technical report of Kimi-Audio.
Table of Contents
- Introduction
- Architecture Overview
- License
- Acknowledgements
- Citation
- Contact Us
Introduction
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:
- Universal Capabilities: Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation.
- State-of-the-Art Performance: Achieves SOTA results on numerous audio benchmarks (see Technical Report).
- Large-Scale Pre-training: Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
- Novel Architecture: Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
- Efficient Inference: Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
- Open-Source: We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.
This is the pre-trained model of Kimi-Audio. If you want to use Kimi-Audio in practice, please refer to Kimi-Audio-7B-Instruct.
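If you do use the Instruct model, inference typically looks like the minimal sketch below. The `kimia_infer` package, the `KimiAudio` class, the message schema, and the sampling keys are assumptions drawn from the companion GitHub repository's quickstart and may differ across versions, so check that repository for the exact API.

```python
# Minimal ASR-style sketch. Assumes the `kimia_infer` package from the companion
# Kimi-Audio GitHub repository is installed; class name, message schema, and
# sampling keys follow its quickstart and may differ across versions.
from kimia_infer.api.kimia import KimiAudio

# Use the Instruct checkpoint for chat/ASR; this base model is meant for fine-tuning.
model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
}

# Mixed text + audio prompt asking the model to transcribe a local wav file.
messages = [
    {"role": "user", "message_type": "text",
     "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio",
     "content": "examples/asr_example.wav"},  # hypothetical path
]

_, text_out = model.generate(messages, **sampling_params, output_type="text")
print(text_out)
```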
Architecture Overview
Kimi-Audio consists of three main components:
- Audio Tokenizer: Converts input audio into:
- Discrete semantic tokens (12.5Hz) using vector quantization.
- Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
- Audio LLM: A transformer-based model (initialized from a pre-trained text LLM such as Qwen2.5-7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
- Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
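To make the 12.5 Hz rates above concrete, the back-of-the-envelope sketch below works out how many semantic tokens a clip produces and how many tokens a streaming chunk spans. This is plain arithmetic; the 1-second chunk length is an illustrative assumption, not a value from the technical report.

```python
import math

# Back-of-the-envelope for the 12.5 Hz hybrid representation described above.
# The 1 s detokenizer chunk length is an illustrative assumption, not a reported value.
TOKEN_RATE_HZ = 12.5  # rate of both discrete semantic tokens and downsampled Whisper frames

def semantic_tokens(seconds: float) -> int:
    """Discrete semantic tokens the tokenizer emits for a clip of this length."""
    return math.ceil(seconds * TOKEN_RATE_HZ)

def tokens_per_chunk(chunk_seconds: float = 1.0) -> int:
    """Semantic tokens the streaming detokenizer consumes per chunk; with a small
    look-ahead, playback can start after roughly one chunk instead of the full clip."""
    return math.ceil(chunk_seconds * TOKEN_RATE_HZ)

print(semantic_tokens(10.0))   # 10 s clip -> 125 semantic tokens (plus 125 acoustic frames)
print(semantic_tokens(60.0))   # 60 s clip -> 750 semantic tokens
print(tokens_per_chunk(1.0))   # 13 tokens per 1 s streaming chunk
```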
License
The model is based on and modified from Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache License 2.0. Other parts of the code are licensed under the MIT License.
Acknowledgements
We would like to thank the open-source projects and individuals whose work contributed to the development of Kimi-Audio, including Whisper, Qwen2.5-7B, and BigVGAN. Thank you to the entire open-source community!
Citation
If you find Kimi-Audio useful in your research or applications, please cite our technical report:
@misc{kimi_audio_2025,
      title={Kimi-Audio Technical Report},
      author={Kimi Team},
      year={2025},
      eprint={arXiv:placeholder},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Contact Us
For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.