<p align="center">
<img src="assets/kimia_logo.png" width="400"/>
</p>
<p align="center">
Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a> | Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a> | 📑 <a href="assets/kimia_report.pdf">Paper</a>
</p>
We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio.
## 🔥🔥🔥 News!!
* April 25, 2025: 🔥 We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
* April 25, 2025: 🔥 We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). You can easily reproduce **our results and the baselines** with this toolkit!
* April 25, 2025: 🔥 We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).
## Table of Contents
- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
- [Contact Us](#contact-us)
## Introduction
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:
* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation.
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Technical Report](assets/kimia_report.pdf)).
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.
**This is the pre-trained model of Kimi-Audio. If you want to use Kimi-Audio in practice, please refer to [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).**
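For a sense of how the Instruct model is typically driven, here is a minimal inference sketch. The `kimia_infer.api.kimia.KimiAudio` import path, the message dictionary format, the file path, and the `generate` keyword arguments are assumptions based on the companion GitHub repository and may not match the released API exactly.

```python
# Minimal sketch, assuming the inference package from the companion repository.
# The module path, class name, message format, and generate() signature below
# are illustrative assumptions, not a definitive API reference.
from kimia_infer.api.kimia import KimiAudio  # assumed import path

model = KimiAudio(
    model_path="moonshotai/Kimi-Audio-7B-Instruct",  # use the Instruct checkpoint in practice
    load_detokenizer=True,                           # only needed if you want audio output
)

messages = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": "examples/asr_example.wav"},  # hypothetical file
]

# Request a text-only answer (ASR-style); audio output would use the detokenizer.
wav_output, text_output = model.generate(messages, output_type="text")
print(text_output)
```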
## Architecture Overview
<p align="center">
<img src="assets/kimia_framework.png" width="70%"/>
</p>
Kimi-Audio consists of three main components:
1. **Audio Tokenizer:** Converts input audio into:
* Discrete semantic tokens (12.5Hz) using vector quantization.
* Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
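As a concrete reading of the numbers above, the sketch below shows how the 12.5 Hz semantic token rate translates into token counts for a given audio duration, and how chunk-wise detokenization with look-ahead can be framed. The 12.5 Hz rate comes from the description above; the chunk size and look-ahead lengths are illustrative placeholders, not the model's actual configuration.

```python
# Illustrative arithmetic only: 12.5 Hz is the semantic token rate stated above;
# chunk_size and look_ahead below are made-up placeholders for demonstration.
SEMANTIC_TOKEN_RATE_HZ = 12.5  # discrete semantic tokens per second of audio

def num_semantic_tokens(duration_sec: float) -> int:
    """Number of 12.5 Hz semantic tokens covering `duration_sec` seconds of audio."""
    return int(round(duration_sec * SEMANTIC_TOKEN_RATE_HZ))

def streaming_chunks(tokens, chunk_size=25, look_ahead=5):
    """Yield (chunk, look_ahead_context) pairs for chunk-wise detokenization.

    Each chunk of semantic tokens is decoded together with a few future tokens
    (the look-ahead) so chunk boundaries can be smoothed without waiting for
    the full sequence, which is what enables low-latency streaming.
    """
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        future = tokens[start + chunk_size:start + chunk_size + look_ahead]
        yield chunk, future

# Example: a 10-second utterance maps to 125 semantic tokens, streamed here as
# 2-second chunks (25 tokens) with 0.4 s of look-ahead (5 tokens).
tokens = list(range(num_semantic_tokens(10.0)))
for chunk, future in streaming_chunks(tokens):
    print(len(chunk), len(future))
```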
## License
The model is based on and modified from [Qwen2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
## Acknowledgements
We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:
* [Whisper](https://github.com/openai/whisper)
* [Transformers](https://github.com/huggingface/transformers)
* [BigVGAN](https://github.com/NVIDIA/BigVGAN)
* [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)
Thank you to all the open-source projects for their contributions to this project!
## Citation
If you find Kimi-Audio useful in your research or applications, please cite our technical report:
```bibtex
@misc{kimi_audio_2025,
title={Kimi-Audio Technical Report},
author={Kimi Team},
year={2025},
eprint={arXiv:placeholder},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contact Us
For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.