|
<p align="center"> |
|
<img src="assets/kimia_logo.png" width="400"/> |
|
</p>
|
|
|
<p align="center"> |
|
Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a> | Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a> | 📄 <a href="assets/kimia_report.pdf">Paper</a>
|
</p> |
|
|
|
|
|
We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio. |
|
|
|
## 🔥🔥🔥 News!!
|
* April 25, 2025: We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
|
* April 25, 2025: We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). You can easily reproduce **our results and the baselines** with this toolkit!
|
* April 25, 2025: We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).
|
|
|
## Table of Contents |
|
|
|
- [Introduction](#introduction) |
|
- [Architecture Overview](#architecture-overview) |
|
- [License](#license) |
|
- [Acknowledgements](#acknowledgements) |
|
- [Citation](#citation) |
|
- [Contact Us](#contact-us) |
|
|
|
## Introduction |
|
|
|
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include: |
|
|
|
* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation. |
|
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Technical Report](assets/kimia_report.pdf)). |
|
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding. |
|
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic features + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
|
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation (see the scheduling sketch after this list).
|
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development. |
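
Below is a minimal, self-contained sketch of what chunk-wise detokenization with a look-ahead window can look like. It only illustrates the latency idea; the chunk size, look-ahead length, sample rate, and the stand-in synthesizer are placeholders, not Kimi-Audio's actual flow-matching detokenizer.

```python
# Illustrative sketch only: chunk-wise streaming with a small look-ahead window.
# Chunk size, look-ahead length, sample rate, and fake_synthesize() are placeholders.
from typing import Iterator, List

TOKEN_RATE_HZ = 12.5                      # semantic-token rate mentioned in this README
SAMPLE_RATE = 24_000                      # placeholder output sample rate
SPT = int(SAMPLE_RATE / TOKEN_RATE_HZ)    # samples per semantic token


def fake_synthesize(tokens: List[int]) -> List[float]:
    """Stand-in for the flow-matching model + BigVGAN vocoder."""
    return [0.0] * (len(tokens) * SPT)


def stream_detokenize(tokens: List[int], chunk: int = 25, look_ahead: int = 5) -> Iterator[List[float]]:
    """Emit audio chunk by chunk, conditioning each chunk on a few future tokens."""
    for start in range(0, len(tokens), chunk):
        context = tokens[start : min(start + chunk + look_ahead, len(tokens))]
        audio = fake_synthesize(context)
        # Keep only this chunk's samples; the look-ahead region is re-rendered
        # (and kept) when the next chunk is processed.
        yield audio[: chunk * SPT]


if __name__ == "__main__":
    for i, piece in enumerate(stream_detokenize(list(range(60)))):
        print(f"chunk {i}: {len(piece)} samples")
```

Because each chunk waits only for a handful of future tokens instead of the whole utterance, playback can start while the rest of the response is still being generated.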
|
|
|
**Kimi-Audio-7B is the pre-trained base model. If you want to use Kimi-Audio in practice, please refer to [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).**
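
For orientation, an inference call with the instruct model typically looks roughly like the sketch below. The import path, `KimiAudio` class, message schema, and `generate()` arguments are illustrative assumptions rather than the documented API; follow the usage instructions shipped with [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) for the real interface.

```python
# Hypothetical quickstart sketch. The import path, class name, message schema,
# and generate() signature are assumptions for illustration, not the documented API.
from kimia_infer.api.kimia import KimiAudio  # assumed entry point

model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct")

messages = [
    {"role": "user", "message_type": "text", "content": "Please transcribe this audio:"},
    {"role": "user", "message_type": "audio", "content": "examples/asr_example.wav"},
]

# An ASR-style request only needs text back; a speech-conversation request
# would also ask for generated audio tokens to be detokenized into a waveform.
_, text = model.generate(messages, output_type="text")
print(text)
```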
|
|
|
|
|
## Architecture Overview |
|
|
|
<p align="center"> |
|
<img src="assets/kimia_framework.png" width="70%"/> |
|
</p>
|
|
|
Kimi-Audio consists of three main components: |
|
|
|
1. **Audio Tokenizer:** Converts input audio into: |
|
    * Discrete semantic tokens (12.5 Hz) using vector quantization.

    * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
|
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM such as Qwen2.5-7B) whose shared layers process the multimodal inputs, followed by parallel heads that autoregressively generate text tokens and discrete audio semantic tokens (see the toy sketch after this list).
|
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency. |
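
To make steps 1 and 2 concrete, here is a toy sketch of how the two 12.5 Hz input streams and the parallel heads fit together. All sizes, the additive fusion, and the module names are assumptions made for illustration, and the bidirectional encoder stack is just a stand-in for the causal LLM backbone; none of this mirrors the released implementation.

```python
# Toy sketch of the hybrid audio input and the parallel text/audio heads.
# Sizes, additive fusion, and module names are illustrative assumptions only.
import torch
import torch.nn as nn

D_MODEL, TEXT_VOCAB, AUDIO_VOCAB = 512, 32_000, 16_384   # placeholder sizes


class ToyKimiAudioLM(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.semantic_emb = nn.Embedding(AUDIO_VOCAB, D_MODEL)   # discrete 12.5 Hz tokens
        self.acoustic_proj = nn.Linear(1280, D_MODEL)             # Whisper-like features -> model dim
        self.shared = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2
        )
        self.text_head = nn.Linear(D_MODEL, TEXT_VOCAB)     # parallel head for text tokens
        self.audio_head = nn.Linear(D_MODEL, AUDIO_VOCAB)   # parallel head for semantic audio tokens

    def forward(self, semantic_tokens: torch.Tensor, acoustic_feats: torch.Tensor):
        # Both streams run at 12.5 Hz, so they align frame by frame; fusing them by
        # simple addition is an assumption made purely for this sketch.
        x = self.semantic_emb(semantic_tokens) + self.acoustic_proj(acoustic_feats)
        h = self.shared(x)
        return self.text_head(h), self.audio_head(h)


tokens = torch.randint(0, AUDIO_VOCAB, (1, 25))    # ~2 s of audio at 12.5 Hz
feats = torch.randn(1, 25, 1280)                   # aligned continuous features
text_logits, audio_logits = ToyKimiAudioLM()(tokens, feats)
print(text_logits.shape, audio_logits.shape)       # (1, 25, 32000), (1, 25, 16384)
```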
|
|
|
|
|
## License |
|
|
|
The model is based on and modified from [Qwen2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio: |
|
|
|
* [Whisper](https://github.com/openai/whisper) |
|
* [Transformers](https://github.com/huggingface/transformers) |
|
* [BigVGAN](https://github.com/NVIDIA/BigVGAN) |
|
* [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) |
|
|
|
Thank you to all the open-source projects for their contributions to this project! |
|
|
|
|
|
## Citation |
|
|
|
If you find Kimi-Audio useful in your research or applications, please cite our technical report: |
|
|
|
```bibtex
@misc{kimi_audio_2025,
  title={Kimi-Audio Technical Report},
  author={Kimi Team},
  year={2025},
  eprint={arXiv:placeholder},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
|
|
|
## Contact Us |
|
|
|
For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub. |
|
|