Issue with long audio (~1 min) output, or prompt instruction following
import os

import soundfile as sf

# model, sampling_params, and output_dir are set up as in your demo code.
messages = [
    {
        "role": "user",
        "message_type": "audio",
        "content": "500ms-silence.mp3",  # an audio file input is required in your demo code
    },
    {
        "role": "user",
        "message_type": "text",
        "content": "Text to speech\n\nDelivery: Exaggerated and theatrical, with dramatic pauses, sudden outbursts, and gleeful cackling.\n\nVoice: High-energy, eccentric, and slightly unhinged, with a manic enthusiasm that rises and falls unpredictably.\n\nTone: Excited, chaotic, and grandiose, as if reveling in the brilliance of a mad experiment.\n\nPronunciation: Sharp and expressive, with elongated vowels, sudden inflections, and an emphasis on big words to sound more diabolical.\n\nText:\nAh-ha-ha! The stars tremble before my genius! The rift is open, the energy surging—unstable? Perhaps. Dangerous? Most certainly! Captain Rylen's hands twitch over the controls. Fools! They hesitate, but I—I alone see the future! \"Engage the thrusters!\" I bellow, eyes wild with possibility. The ship lurches, metal groaning—oh, what delicious chaos! Light bends, time twists, and then—BOOM! Silence. Darkness. And then… oh-ho! A new universe! Bigger! Stranger! And mine for the taking! Ah-ha-ha-ha!",
    },
]

wav, text = model.generate(messages, **sampling_params, output_type="both")

# Save the 24 kHz waveform and print the accompanying text output.
sf.write(
    os.path.join(output_dir, "output.wav"),
    wav.detach().cpu().view(-1).numpy(),
    24000,
)
print(">>> output text: ", text)
The prompt, copied from openai.fm, works correctly with qwen-omni (although style control is ineffective with it) and also with a glm-voice model fine-tuned for long audio outputs.
However, your model fails to generate speech from the beginning of the text. Instead, it produces seemingly random sentences and disregards the specified style controls. This suggests the model may be incapable of handling long audio (30 s to 1 min) output, or may struggle to follow prompt instructions.
Kimi:
Qwen-omni:
OpenAI TTS:
Model trained in the same arch as glm-voice:
Thanks for your experiments! Really appreciate it!
These bad cases hit a limitation of the SFT stage: we do not support missing audio in the user input (there was no such data during the whole SFT stage), so using silence as the user audio is somewhat of a hack. As a result, the bad cases are not unexpected.
As we have released our base model, we think it is able to handle the scenarios in your experiments after SFT. Feel free to give it a try.
PS: As stated in our paper, in the SFT stage of Kimi-Audio we do not use TTS data or train on the TTS task, so additional SFT is needed for Kimi-Audio to become a TTS model.
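As a purely hypothetical sketch of what such additional SFT data could look like, reusing the message schema from the snippet above (the assistant-side audio target field is an assumption, not the released training format):

# Hypothetical TTS-style SFT sample, reusing the demo's message schema.
# The assistant audio target field is an assumption, not Kimi-Audio's actual format.
tts_sft_example = [
    {
        "role": "user",
        "message_type": "text",
        "content": "Text to speech\n\nDelivery: calm and even.\n\nText:\nHello there.",
    },
    {
        "role": "assistant",
        "message_type": "audio",
        "content": "hello_there.wav",  # target speech matching the instruction above
    },
]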