Update README.md
README.md
CHANGED
@@ -1,73 +1,47 @@
<p align="center">
- <img src="assets/kimia_logo.png" width="400"/>
<p>

<p align="center">
- Kimi-Audio-7B
</p>

-
- We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio.
-
- ## 🔥🔥🔥 News!!
- * April 25, 2025: We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
- * April 25, 2025: We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). You can easily reproduce **our results and baselines** with this toolkit!
- * April 25, 2025: We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).
-
- ## Table of Contents
-
- - [Introduction](#introduction)
- - [Architecture Overview](#architecture-overview)
- - [License](#license)
- - [Acknowledgements](#acknowledgements)
- - [Citation](#citation)
- - [Contact Us](#contact-us)
-
## Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

- * **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC)
- * **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Technical Report](assets/kimia_report.pdf)).
- * **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
- * **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.
-

-
- ## Architecture Overview
- <p align="center">
- <img src="assets/kimia_framework.png" width="70%"/>
- <p>
-
- Kimi-Audio consists of three main components:
-
- 1. **Audio Tokenizer:** Converts input audio into:
-    * Discrete semantic tokens (12.5Hz) using vector quantization.
-    * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
- 2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
- 3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
-
-
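To make the three-stage data flow above concrete, here is a minimal, illustrative sketch. Every name, vocabulary size, and feature dimension below is a hypothetical stand-in rather than the repository's actual API; the only figures taken from the text are the 12.5 Hz token rate and the Whisper-style encoder input (assumed to be 16 kHz audio), so the stubs mirror shapes and rates only.

```python
# Illustrative sketch only — hypothetical names, not the official Kimi-Audio API.
import numpy as np

FRAME_RATE_HZ = 12.5      # both token streams described above run at 12.5 Hz
SAMPLE_RATE = 16000       # assumed input sample rate (Whisper-style encoder)
SEMANTIC_VOCAB = 16384    # hypothetical codebook size for the semantic tokens
WHISPER_DIM = 1280        # hypothetical dimension of the continuous features


def tokenize(audio: np.ndarray):
    """1) Audio Tokenizer: waveform -> discrete semantic tokens + continuous features (12.5 Hz)."""
    num_frames = int(len(audio) / SAMPLE_RATE * FRAME_RATE_HZ)
    semantic_tokens = np.random.randint(0, SEMANTIC_VOCAB, size=num_frames)  # vector-quantized IDs
    acoustic_feats = np.random.randn(num_frames, WHISPER_DIM)                # downsampled encoder output
    return semantic_tokens, acoustic_feats


def audio_llm_step(semantic_tokens, acoustic_feats):
    """2) Audio LLM: shared layers over the hybrid input, then parallel text and audio heads."""
    next_text_token = np.random.randint(0, 32000)            # text head (hypothetical vocab size)
    next_audio_token = np.random.randint(0, SEMANTIC_VOCAB)  # audio head (semantic token)
    return next_text_token, next_audio_token


def detokenize_chunk(audio_tokens):
    """3) Audio Detokenizer: semantic tokens -> waveform chunk (flow matching + vocoder, streamed)."""
    samples_per_token = int(SAMPLE_RATE / FRAME_RATE_HZ)  # 16000 / 12.5 = 1280 samples per token
    return np.zeros(len(audio_tokens) * samples_per_token, dtype=np.float32)


if __name__ == "__main__":
    one_second = np.zeros(SAMPLE_RATE, dtype=np.float32)
    sem, feats = tokenize(one_second)       # ~12 frames for 1 s of audio
    txt, aud = audio_llm_step(sem, feats)
    wav = detokenize_chunk([aud] * 25)      # 25 tokens ≈ 2 s of generated audio
    print(sem.shape, feats.shape, wav.shape)
```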
- ## License
-
- The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
-
-
-
- ## Acknowledgements
-
- We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:
-
- * [Whisper](https://github.com/openai/whisper)
- * [Transformers](https://github.com/huggingface/transformers)
- * [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- * [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)
-
- Thank you to all the open-source projects for their contributions to this project!


## Citation
@@ -85,6 +59,6 @@ If you find Kimi-Audio useful in your research or applications, please cite our
}
```

- ##

-
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ tags:
+ - audio
+ - audio-language-model
+ - speech-recognition
+ - audio-understanding
+ - text-to-speech
+ - audio-generation
+ - chat
+ - kimi-audio
+ ---
+
+ # Kimi-Audio
+
<p align="center">
+ <img src="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_logo.png" width="400"/>
<p>

<p align="center">
+ Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a> Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a> | 📑 <a href="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf">Paper</a>
</p>

## Introduction

+ We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository hosts the model checkpoints for Kimi-Audio-7B.
+
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

+ * **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC) and end-to-end speech conversation.
+ * **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see our [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf)).
+ * **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data.
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.

+ For more details, please refer to our [GitHub Repository](https://github.com/MoonshotAI/Kimi-Audio) and [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf).

+ ## Note

+ Kimi-Audio-7B is a base model without fine-tuning, so it cannot be used directly.
+ The base model is quite flexible; you can fine-tune it on any downstream task.
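As a minimal sketch of how one might pull the base checkpoint for fine-tuning, the snippet below downloads the weights from the Hugging Face Hub. It assumes the standard `huggingface_hub` client and the repo id shown on this card; the `transformers` loading step at the end is an assumption rather than documented usage, so prefer the loading utilities in the official GitHub repository.

```python
# Sketch: download the Kimi-Audio-7B base weights for downstream fine-tuning.
# Assumptions: huggingface_hub is installed and the repo id matches this card;
# the transformers step below is unverified — prefer the official repo's loaders.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="moonshotai/Kimi-Audio-7B")
print("checkpoint downloaded to:", local_dir)

# Assumed path via transformers remote code (verify against the official repo).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Audio-7B",
    trust_remote_code=True,  # assumption: the Hub repo ships custom modeling code
)
```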
## Citation
}
```

+ ## License

+ The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).