moonshotai/Kimi-Audio-7B-Instruct · examples for AQA, AAC, SER, SEC/ASC

MrDragonFox

4 days ago

•

edited 4 days ago

Congrats on the release!

We've gotten some examples for VC and TTS.

Could we get more examples for the other capabilities in particular, along with the prompts used during training for those?

I'm after:

audio question answering (AQA),
audio captioning (AAC),
speech emotion recognition (SER),
sound event/scene classification (SEC/ASC)

YifeiXin

2 days ago

•

edited 1 day ago

Hi guys, thanks for your attention. You can refer to our benchmark evaluation files (https://github.com/MoonshotAI/Kimi-Audio-Evalkit/blob/master/data/download_benchmark.py), which contain evaluation prompts for different tasks. As for the training task prompts, we have designed many, and here are some examples.

For the speech emotion task, the training prompts are:
1) Identify the predominant emotion in this speech.\nOptions:\n(A) neutral\n(B) joy\n(C) sadness\n(D) anger\n(E) surprise\n(F) fear\n(G) disgust\n.Answer with the option's letter from the given choices directly and only give the best option.
2) Based on the speech, what is the main emotion?\nOptions:\n(A) neutral\n(B) joy\n(C) sadness\n(D) anger\n(E) surprise\n(F) fear\n(G) disgust\n.Answer with the option's letter from the given choices directly and only give the best option.

For the acoustic scene classification task, the evaluation prompts follow this general format:
1) Identify the acoustic scene in the audio.\nOptions:\n(A) beach\n(B) bus\n(C) cafe or restaurant\n(D) car\n(E) city center\n(F) forest path\n(G) grocery store\n(H) home\n(I) library\n(J) metro station\n(K) office\n(L) park\n(M) residential area\n(N) train\n(O) tram\n.Answer with the option's letter from the given choices directly and only give the best option.
2) Classify the location heard in the sound.\nOptions:\n(A) beach\n(B) bus\n(C) cafe or restaurant\n(D) car\n(E) city center\n(F) forest path\n(G) grocery store\n(H) home\n(I) library\n(J) metro station\n(K) office\n(L) park\n(M) residential area\n(N) train\n(O) tram\n.Answer with the option's letter from the given choices directly and only give the best option.
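
The multiple-choice prompts above share one layout, so they can be assembled from a question and a list of labels. The following is only a minimal sketch, not code from the Kimi-Audio repository: the helper name `build_choice_prompt` is hypothetical, and it assumes that each `\n` in the prompts above stands for a real newline character.

```python
import string

# Closing instruction copied from the prompts quoted above.
SUFFIX = ("Answer with the option's letter from the given choices "
          "directly and only give the best option.")


def build_choice_prompt(question, options):
    """Assemble a lettered multiple-choice prompt (hypothetical helper).

    Assumes the "\\n" sequences in the quoted prompts denote newlines.
    """
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("at most 26 options can be lettered A-Z")
    lines = [question, "Options:"]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    # The quoted prompts put a newline and a period before the final instruction.
    return "\n".join(lines) + "\n." + SUFFIX


EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]
prompt = build_choice_prompt("Identify the predominant emotion in this speech.", EMOTIONS)
print(prompt)
```

Passing the fifteen scene labels instead of `EMOTIONS` yields the acoustic scene prompt in the same layout.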

For the acoustic event detection task, the evaluation prompts follow this general format:
1) Identify the sound event in the audio.
2) What sound event occurs in this audio?

For the audio captioning task, the evaluation prompts follow this general format:
1) Please describe the sound events in the audio.
2) Please generate the audio caption.

As for the Audio Question Answering (AQA) task, there is no fixed evaluation prompt. You can ask any question (for example, questions in the style of datasets like MMAU, ClothoAQA, Comp-R, AVQA, MusicAVQA, etc.).

It’s best to follow our prompt format when asking, but the specific wording can vary freely.
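
Because the emotion and scene prompts ask for the option's letter only, the reply has to be mapped back to a label on the caller's side. Here is a rough sketch of that step; `parse_choice` is a hypothetical helper, and the assumption that replies look like `B` or `(B)` comes from the prompt wording, not from verified model output.

```python
import re


def parse_choice(reply, options):
    """Map a letter reply such as "B" or "(B)" to its label (hypothetical helper).

    Returns None when the reply does not start with a standalone option letter
    or when the letter is outside the range of the given options.
    """
    # An optional "(", one capital letter, an optional ")", and no letter right after.
    match = re.match(r"\(?([A-Z])\)?(?![A-Za-z])", reply.strip())
    if match is None:
        return None
    index = ord(match.group(1)) - ord("A")
    return options[index] if index < len(options) else None


EMOTION_LABELS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]
print(parse_choice("(B)", EMOTION_LABELS))
```

Returning `None` instead of guessing keeps malformed replies visible, so they can be counted or retried separately.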

jocoyo

1 day ago

@YifeiXin
If I want, for example, to perform acoustic event detection on an audio clip in which several events occur, and I would like to receive a result in the style of:
(start_1, end_1, event_1), (start_2, end_2, event_2), etc.
Is this possible?