bigeagle committed
Commit d7330f4 · verified · 1 Parent(s): 491aabd

Update README.md

Files changed (1)
  1. README.md +31 -57
README.md CHANGED
@@ -1,73 +1,47 @@
  <p align="center">
- <img src="assets/kimia_logo.png" width="400"/>
  <p>

  <p align="center">
- Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a>&nbsp; | Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a>&nbsp; | 📑 <a href="assets/kimia_report.pdf">Paper</a> &nbsp;&nbsp;
  </p>

-
- We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio.
-
- ## 🔥🔥🔥 News!!
- * April 25, 2025: 👋 We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
- * April 25, 2025: 👋 We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). You can easily reproduce **our results and baselines** with this toolkit!
- * April 25, 2025: 👋 We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).
-
- ## Table of Contents
-
- - [Introduction](#introduction)
- - [Architecture Overview](#architecture-overview)
- - [License](#license)
- - [Acknowledgements](#acknowledgements)
- - [Citation](#citation)
- - [Contact Us](#contact-us)
-
  ## Introduction

  Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

- * **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation.
- * **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Technical Report](assets/kimia_report.pdf)).
- * **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
  * **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
  * **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
- * **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development.

- **This is the pre-trained model of Kimi-Audio. If you want to use Kimi-Audio in practice, please refer to [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).**

- ## Architecture Overview
-
- <p align="center">
- <img src="assets/kimia_framework.png" width="70%"/>
- <p>
-
- Kimi-Audio consists of three main components (a rough sketch of how they fit together follows this list):
-
- 1. **Audio Tokenizer:** Converts input audio into:
-    * Discrete semantic tokens (12.5 Hz) using vector quantization.
-    * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
- 2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
- 3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
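
A minimal, illustrative sketch of how these three stages fit together is given below. The class names, codebook size, and feature dimension are placeholders invented for this example, not the actual Kimi-Audio API; the real implementation lives in the [GitHub repository](https://github.com/MoonshotAI/Kimi-Audio).

```python
# Illustrative stand-ins for the three Kimi-Audio stages described above.
# Everything here is a dummy placeholder, not the real implementation.
import numpy as np

SAMPLE_RATE = 16_000   # assumed input sample rate for this sketch
TOKEN_RATE_HZ = 12.5   # discrete semantic tokens per second (from the report)


class AudioTokenizer:
    """Stage 1: waveform -> discrete semantic tokens + continuous acoustic features."""
    def encode(self, waveform: np.ndarray):
        n_frames = int(len(waveform) / SAMPLE_RATE * TOKEN_RATE_HZ)   # 10 s of audio -> 125 frames
        semantic_tokens = np.random.randint(0, 8192, size=n_frames)   # VQ ids (dummy codebook size)
        acoustic_feats = np.random.randn(n_frames, 1280)              # Whisper-style features (dummy dim)
        return semantic_tokens, acoustic_feats


class AudioLLM:
    """Stage 2: shared transformer trunk with parallel text and audio heads (dummy)."""
    def generate(self, semantic_tokens, acoustic_feats, prompt):
        text_reply = f"[text answer to {prompt!r}]"
        audio_reply_tokens = np.random.randint(0, 8192, size=len(semantic_tokens))
        return text_reply, audio_reply_tokens


class AudioDetokenizer:
    """Stage 3: semantic tokens -> waveform (flow matching + BigVGAN in the real model)."""
    def decode(self, semantic_tokens):
        n_samples = int(len(semantic_tokens) / TOKEN_RATE_HZ * SAMPLE_RATE)
        return np.zeros(n_samples, dtype=np.float32)   # silence stands in for generated speech


# Wire the stages together, mirroring the component list above.
waveform = np.zeros(SAMPLE_RATE * 10, dtype=np.float32)             # 10 s of (silent) input audio
tokens, feats = AudioTokenizer().encode(waveform)
text, reply_tokens = AudioLLM().generate(tokens, feats, prompt="Transcribe this clip.")
speech = AudioDetokenizer().decode(reply_tokens)
print(text, tokens.shape, speech.shape)                              # (125,) tokens -> (160000,) samples
```

One consequence of the 12.5 Hz token rate is that sequences stay short: ten seconds of audio becomes only 125 semantic tokens for the LLM to process.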

- ## License
-
- The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
-
-
- ## Acknowledgements
-
- We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:
-
- * [Whisper](https://github.com/openai/whisper)
- * [Transformers](https://github.com/huggingface/transformers)
- * [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- * [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)
-
- Thank you to all these open-source projects for their contributions to this project!

  ## Citation
@@ -85,6 +59,6 @@ If you find Kimi-Audio useful in your research or applications, please cite our
  }
  ```

- ## Contact Us

- For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.
 
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ tags:
+ - audio
+ - audio-language-model
+ - speech-recognition
+ - audio-understanding
+ - text-to-speech
+ - audio-generation
+ - chat
+ - kimi-audio
+ ---
+
+ # Kimi-Audio
+
  <p align="center">
+ <img src="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_logo.png" width="400"/>
  <p>

  <p align="center">
+ Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a>&nbsp; | Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a>&nbsp; | 📑 <a href="https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf">Paper</a>
  </p>

  ## Introduction

+ We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository hosts the model checkpoints for Kimi-Audio-7B.
+
  Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

+ * **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
+ * **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see our [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf)).
+ * **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data.
  * **Novel Architecture:** Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
  * **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation; a toy streaming sketch follows this list.
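
As a toy illustration of the chunk-wise streaming idea in the last bullet, the sketch below splits a semantic-token stream into fixed-size chunks and decodes each one together with a few look-ahead tokens of future context. The chunk size, look-ahead length, and `decode_chunk` stub are invented for this example; the real detokenizer uses a flow-matching model and BigVGAN as described in the technical report.

```python
# Toy chunk-wise streaming detokenizer: emit audio incrementally instead of
# waiting for the full token sequence. All numbers and the decode stub are made up.
SAMPLES_PER_TOKEN = 1280   # 16 kHz output / 12.5 tokens per second (assumed rates)


def decode_chunk(tokens):
    """Stand-in for the flow-matching + vocoder step; emits silence."""
    return [0.0] * (len(tokens) * SAMPLES_PER_TOKEN)


def stream_detokenize(token_stream, chunk_size=25, lookahead=5):
    """Yield waveform chunk by chunk; each chunk also sees `lookahead` future tokens
    for smoother boundaries, but only emits samples for its own tokens."""
    for start in range(0, len(token_stream), chunk_size):
        chunk = token_stream[start:start + chunk_size]
        future = token_stream[start + chunk_size:start + chunk_size + lookahead]
        samples = decode_chunk(chunk + future)
        yield samples[:len(chunk) * SAMPLES_PER_TOKEN]   # drop the look-ahead samples


tokens = list(range(125))                                # ~10 s of audio at 12.5 tokens/s
emitted = sum(len(chunk) for chunk in stream_detokenize(tokens))
print(emitted)                                           # 160000 samples, produced a chunk at a time
```

The look-ahead trades a small amount of extra latency for smoother chunk boundaries, matching the low-latency goal described above.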

+ For more details, please refer to our [GitHub Repository](https://github.com/MoonshotAI/Kimi-Audio) and [Technical Report](https://raw.githubusercontent.com/MoonshotAI/Kimi-Audio/master/assets/kimia_report.pdf).

+ ## Note

+ Kimi-Audio-7B is a base model that has not been fine-tuned, so it cannot be used directly.
+ The base model is quite flexible: you can fine-tune it on any downstream task you need. For a ready-to-use conversational model, please refer to [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
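
If you do plan to fine-tune from this base checkpoint, a hypothetical starting point is sketched below. Whether the repository loads through plain `transformers` with `trust_remote_code=True` is an assumption made for this example; follow the loading and fine-tuning instructions in the [GitHub Repository](https://github.com/MoonshotAI/Kimi-Audio) for the supported path.

```python
# Hypothetical sketch: download the Kimi-Audio-7B base checkpoint and load it for
# fine-tuning. The AutoModelForCausalLM entry point is an assumption, not the
# documented API; see the official GitHub repository for the supported workflow.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

local_dir = snapshot_download(repo_id="moonshotai/Kimi-Audio-7B")                 # fetch the weights
model = AutoModelForCausalLM.from_pretrained(local_dir, trust_remote_code=True)   # assumed entry point
model.train()                                                                     # ready for a fine-tuning loop
print(sum(p.numel() for p in model.parameters()))                                 # rough parameter count
```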
  ## Citation
 
  }
  ```

+ ## License

+ The model is based on and modified from [Qwen 2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).