|
<p align="center"> |
|
<img src="assets/kimia_logo.png" width="400"/> |
|
</p>
|
|
|
<p align="center"> |
|
Kimi-Audio-7B <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B">🤗</a> | Kimi-Audio-7B-Instruct <a href="https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct">🤗</a> | 📄 <a href="assets/kimia_report.pdf">Paper</a>
|
</p> |
|
|
|
|
|
We present Kimi-Audio, an open-source audio foundation model excelling in **audio understanding, generation, and conversation**. This repository contains the official implementation, pre-trained models, and evaluation toolkit for Kimi-Audio. |
|
|
|
## 🔥🔥🔥 News!!
|
* April 25, 2025: We release the inference code and model weights of [Kimi-Audio-7B](https://huggingface.co/moonshotai/Kimi-Audio-7B) and [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).
|
* April 25, 2025: We release the audio evaluation toolkit [ALMEvalKit](https://github.com/moonshotai/KimiA-Audio-EvaluationToolkit). You can easily reproduce **our results and the baselines** with this toolkit!
|
* April 25, 2025: We release the technical report of [Kimi-Audio](assets/kimia_report.pdf).
|
|
|
## Table of Contents |
|
|
|
- [Introduction](#introduction) |
|
- [Architecture Overview](#architecture-overview) |
|
- [License](#license) |
|
- [Acknowledgements](#acknowledgements) |
|
- [Citation](#citation) |
|
- [Contact Us](#contact-us) |
|
|
|
## Introduction |
|
|
|
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include: |
|
|
|
* **Universal Capabilities:** Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), text-to-speech (TTS), voice conversion (VC), and end-to-end speech conversation. |
|
* **State-of-the-Art Performance:** Achieves SOTA results on numerous audio benchmarks (see [Technical Report](assets/kimia_report.pdf)). |
|
* **Large-Scale Pre-training:** Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding. |
|
* **Novel Architecture:** Employs a hybrid audio input (continuous acoustic features + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
|
* **Efficient Inference:** Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation (see the scheduling sketch after this list).
|
* **Open-Source:** We release the code, model checkpoints, and a comprehensive evaluation toolkit to foster community research and development. |
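
Below is a minimal, self-contained sketch of what chunk-wise detokenization with a look-ahead window can look like. It only illustrates the latency idea; the chunk size, look-ahead length, sample rate, and the stand-in synthesizer are placeholders, not Kimi-Audio's actual flow-matching detokenizer.

```python
# Illustrative sketch only: chunk-wise streaming with a small look-ahead window.
# Chunk size, look-ahead length, sample rate, and fake_synthesize() are placeholders.
from typing import Iterator, List

TOKEN_RATE_HZ = 12.5                      # semantic-token rate mentioned in this README
SAMPLE_RATE = 24_000                      # placeholder output sample rate
SPT = int(SAMPLE_RATE / TOKEN_RATE_HZ)    # samples per semantic token


def fake_synthesize(tokens: List[int]) -> List[float]:
    """Stand-in for the flow-matching model + BigVGAN vocoder."""
    return [0.0] * (len(tokens) * SPT)


def stream_detokenize(tokens: List[int], chunk: int = 25, look_ahead: int = 5) -> Iterator[List[float]]:
    """Emit audio chunk by chunk, conditioning each chunk on a few future tokens."""
    for start in range(0, len(tokens), chunk):
        context = tokens[start : min(start + chunk + look_ahead, len(tokens))]
        audio = fake_synthesize(context)
        # Keep only this chunk's samples; the look-ahead region is re-rendered
        # (and kept) when the next chunk is processed.
        yield audio[: chunk * SPT]


if __name__ == "__main__":
    for i, piece in enumerate(stream_detokenize(list(range(60)))):
        print(f"chunk {i}: {len(piece)} samples")
```

Because each chunk waits only for a handful of future tokens instead of the whole utterance, playback can start while the rest of the response is still being generated.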
|
|
|
**Kimi-Audio-7B is the pre-trained base model. If you want to use Kimi-Audio in practice, please refer to [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct).**
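
For orientation, an inference call with the instruct model typically looks roughly like the sketch below. The import path, `KimiAudio` class, message schema, and `generate()` arguments are illustrative assumptions rather than the documented API; follow the usage instructions shipped with [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) for the real interface.

```python
# Hypothetical quickstart sketch. The import path, class name, message schema,
# and generate() signature are assumptions for illustration, not the documented API.
from kimia_infer.api.kimia import KimiAudio  # assumed entry point

model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct")

messages = [
    {"role": "user", "message_type": "text", "content": "Please transcribe this audio:"},
    {"role": "user", "message_type": "audio", "content": "examples/asr_example.wav"},
]

# An ASR-style request only needs text back; a speech-conversation request
# would also ask for generated audio tokens to be detokenized into a waveform.
_, text = model.generate(messages, output_type="text")
print(text)
```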
|
|
|
|
|
## Architecture Overview |
|
|
|
<p align="center"> |
|
<img src="assets/kimia_framework.png" width="70%"/> |
|
</p>
|
|
|
Kimi-Audio consists of three main components: |
|
|
|
1. **Audio Tokenizer:** Converts input audio into: |
|
    * Discrete semantic tokens (12.5 Hz) using vector quantization.

    * Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5 Hz).
|
2. **Audio LLM:** A transformer-based model (initialized from a pre-trained text LLM such as Qwen2.5-7B) whose shared layers process the multimodal inputs, followed by parallel heads that autoregressively generate text tokens and discrete audio semantic tokens (see the toy sketch after this list).
|
3. **Audio Detokenizer:** Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency. |
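
To make steps 1 and 2 concrete, here is a toy sketch of how the two 12.5 Hz input streams and the parallel heads fit together. All sizes, the additive fusion, and the module names are assumptions made for illustration, and the bidirectional encoder stack is just a stand-in for the causal LLM backbone; none of this mirrors the released implementation.

```python
# Toy sketch of the hybrid audio input and the parallel text/audio heads.
# Sizes, additive fusion, and module names are illustrative assumptions only.
import torch
import torch.nn as nn

D_MODEL, TEXT_VOCAB, AUDIO_VOCAB = 512, 32_000, 16_384   # placeholder sizes


class ToyKimiAudioLM(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.semantic_emb = nn.Embedding(AUDIO_VOCAB, D_MODEL)   # discrete 12.5 Hz tokens
        self.acoustic_proj = nn.Linear(1280, D_MODEL)             # Whisper-like features -> model dim
        self.shared = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2
        )
        self.text_head = nn.Linear(D_MODEL, TEXT_VOCAB)     # parallel head for text tokens
        self.audio_head = nn.Linear(D_MODEL, AUDIO_VOCAB)   # parallel head for semantic audio tokens

    def forward(self, semantic_tokens: torch.Tensor, acoustic_feats: torch.Tensor):
        # Both streams run at 12.5 Hz, so they align frame by frame; fusing them by
        # simple addition is an assumption made purely for this sketch.
        x = self.semantic_emb(semantic_tokens) + self.acoustic_proj(acoustic_feats)
        h = self.shared(x)
        return self.text_head(h), self.audio_head(h)


tokens = torch.randint(0, AUDIO_VOCAB, (1, 25))    # ~2 s of audio at 12.5 Hz
feats = torch.randn(1, 25, 1280)                   # aligned continuous features
text_logits, audio_logits = ToyKimiAudioLM()(tokens, feats)
print(text_logits.shape, audio_logits.shape)       # (1, 25, 32000), (1, 25, 16384)
```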
|
|
|
|
|
## License |
|
|
|
The model is based on and modified from [Qwen2.5-7B](https://github.com/QwenLM/Qwen2.5). Code derived from Qwen2.5-7B is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Other parts of the code are licensed under the [MIT License](https://opensource.org/licenses/MIT).
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio: |
|
|
|
* [Whisper](https://github.com/openai/whisper) |
|
* [Transformers](https://github.com/huggingface/transformers) |
|
* [BigVGAN](https://github.com/NVIDIA/BigVGAN) |
|
* [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice) |
|
|
|
Thank you to all the open-source projects for their contributions to this project! |
|
|
|
|
|
## Citation |
|
|
|
If you find Kimi-Audio useful in your research or applications, please cite our technical report: |
|
|
|
```bibtex
@misc{kimi_audio_2025,
  title={Kimi-Audio Technical Report},
  author={Kimi Team},
  year={2025},
  eprint={arXiv:placeholder},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
|
|
|
## Contact Us |
|
|
|
For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub. |
|
|