Issue with long audio (~1 min) output, or prompt instruction following

#2 opened by JosephusCheung
import os

import soundfile as sf

# `model`, `sampling_params`, and `output_dir` are initialized as in your demo code.
messages = [
    {
        "role": "user",
        "message_type": "audio",
        "content": "500ms-silence.mp3", # an audio file input is required in your demo code
    },
    {
        "role": "user",
        "message_type": "text",
        "content": "Text to speech\n\nDelivery: Exaggerated and theatrical, with dramatic pauses, sudden outbursts, and gleeful cackling.\n\nVoice: High-energy, eccentric, and slightly unhinged, with a manic enthusiasm that rises and falls unpredictably.\n\nTone: Excited, chaotic, and grandiose, as if reveling in the brilliance of a mad experiment.\n\nPronunciation: Sharp and expressive, with elongated vowels, sudden inflections, and an emphasis on big words to sound more diabolical.\n\nText:\nAh-ha-ha! The stars tremble before my genius! The rift is open, the energy surging—unstable? Perhaps. Dangerous? Most certainly! Captain Rylen's hands twitch over the controls. Fools! They hesitate, but I—I alone see the future! \"Engage the thrusters!\" I bellow, eyes wild with possibility. The ship lurches, metal groaning—oh, what delicious chaos! Light bends, time twists, and then—BOOM! Silence. Darkness. And then… oh-ho! A new universe! Bigger! Stranger! And mine for the taking! Ah-ha-ha-ha!",
    }
]

# Generate both the waveform and the text, then save the 24 kHz audio to disk.
wav, text = model.generate(messages, **sampling_params, output_type="both")
sf.write(
    os.path.join(output_dir, "output.wav"),
    wav.detach().cpu().view(-1).numpy(),
    24000,
)
print(">>> output text: ", text)

The prompt, copied from openai.fm, works correctly with qwen-omni (although style control is ineffective with it) and also with a glm-voice model fine-tuned for long audio outputs.

However, your model fails to generate speech starting from the beginning of the text. Instead, it produces seemingly random sentences and disregards the specified style controls. This suggests the model is either incapable of handling long audio (30 s to 1 min) output or struggles to follow prompt instructions.

Audio samples attached for comparison: Kimi, Qwen-omni, OpenAI TTS, and a model trained on the same architecture as glm-voice.
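If the problem is output length rather than instruction following, a rough way to probe it would be chunked generation: split the passage into sentences, run the same generate() call per sentence, and concatenate the waveforms. The sketch below only reuses the message format from above; the sentence splitting and the trimmed prompt are my own guesses for a diagnostic, not your intended usage.

import os
import re

import numpy as np
import soundfile as sf

# Diagnostic sketch: synthesize the long passage sentence by sentence and
# concatenate the waveforms, to separate "long output" failures from
# "instruction following" failures. `model`, `sampling_params`, and
# `output_dir` are the same objects as in the snippet above.
passage = (
    "Ah-ha-ha! The stars tremble before my genius! The rift is open, "
    "the energy surging. Dangerous? Most certainly!"
)
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

chunks = []
for sentence in sentences:
    chunk_messages = [
        {"role": "user", "message_type": "audio", "content": "500ms-silence.mp3"},
        {"role": "user", "message_type": "text", "content": "Text to speech\n\nText:\n" + sentence},
    ]
    wav, _ = model.generate(chunk_messages, **sampling_params, output_type="both")
    chunks.append(wav.detach().cpu().view(-1).numpy())

sf.write(os.path.join(output_dir, "output_chunked.wav"), np.concatenate(chunks), 24000)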

Here is another bad case, in Chinese. Many similar models that accept speech input (via Whisper) but perform poorly show the same problem, so it does not seem to be purely an LLM hallucination issue.

(Screenshots of the Chinese bad case attached.)

The generated speech also seems to contain wrong words and mispronunciations. I am not sure whether this comes from the inference code or from the model itself; I lean toward attributing it to detail issues in the audio head, since generation problems also appear when rare words show up in the upstream ASR task.

Moonshot AI org

Thanks for your experiments! Really appreciate it!

These bad cases touch on a limitation of the SFT stage: we do not support user turns without audio (there was no such data in the entire SFT stage), so using a silent clip as the user audio input is something of a hack. As a result, the bad cases are not unexpected.
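For example, a user turn that actually contains speech, which is closer to what the SFT data looked like, would be structured like this (the file name and instruction below are just placeholders):

# Sketch of a request whose user turn contains real speech, matching the SFT
# data distribution more closely than a silent placeholder clip.
messages = [
    {
        "role": "user",
        "message_type": "audio",
        "content": "user_question.wav",  # a real recorded utterance, not silence
    },
    {
        "role": "user",
        "message_type": "text",
        "content": "Please answer the question in the audio, speaking slowly and calmly.",
    },
]
wav, text = model.generate(messages, **sampling_params, output_type="both")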

Since we have released our base model, we believe it can handle the scenarios in your experiments after additional SFT. Feel free to give it a try.

PS: As stated in our paper, the SFT stage of Kimi-Audio does not use TTS data or train on the TTS task, so additional SFT is needed before Kimi-Audio can serve as a TTS model.
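For illustration only, a TTS-style SFT example could pair a text-only instruction with a target waveform; the exact fine-tuning data format is up to you, so treat the fields below as placeholders rather than our training schema:

# Hypothetical TTS fine-tuning example: a text-only user turn paired with the
# audio the model should learn to produce. Field names mirror the inference
# message format above; the actual training schema may differ.
tts_sft_example = {
    "messages": [
        {
            "role": "user",
            "message_type": "text",
            "content": "Text to speech\n\nTone: calm and warm.\n\nText:\nWelcome aboard.",
        },
    ],
    "target_audio": "welcome_aboard_24khz.wav",  # reference recording of the target speech
}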
