Disappointed with the results - Gibberish, Pauses, Inconsistent Voice, and Unstable Pitch

#14
by millionwords - opened

It's not stable: the voice is inconsistent, it adds pauses and gibberish in between, and the pitch is unstable too. Very disappointed. :( This is nowhere close to the demo shown on the blog.

Not the author, but FWIW this is the base model, not the fine-tuned variant they said the demo uses. Like a pre-trained base LLM, it's likely not very usable until you tune it for a more specific purpose.

Are there any instructions to fine-tune it?

no - and they don't plan to give us any either

I don't understand the intention behind this release.

research .. it's a demo - not for end customers or dev-centric - watch their interview ..

The demo on their website is way better, and of course I understand that it is fine-tuned. Not sure which interview you are referring to, but I didn't see them talk about CSM-1B in particular being a demo. I would say that even for a demo the consistency is pretty bad. This seems more like a proof of technology.

And let's not even talk about how excruciatingly slow it is, making it unrealistic as a "conversational" speech model. This is a gimped-down tease of a model they know can't be used for anything, or it would take away from the product they plan to sell.

Am I wrong?

oh, the community has it at 3.3 RTF already . that's not the issue - and with some optimisations it's certainly even faster

@MrDragonFox where did you get this from? Can you point to the source for getting it to 3.3 RTF?

i have about 3.3 locally on Ampere .. with Ada it'd be even faster
tensor magic :)

(screenshot attached: Screenshot_20250317_092828.png)
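For anyone wondering what that number means: RTF is the real-time factor, i.e. seconds of audio produced per second of wall-clock time, so 3.3 means generation runs 3.3x faster than real time. A minimal sketch of how you might measure it (the `generate` call and `sample_rate` attribute here are placeholders for whatever inference entry point you actually use):

```python
import time
import torch

def measure_rtf(generator, text: str) -> float:
    """Real-time factor: seconds of audio produced per wall-clock second."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # don't let queued GPU work skew the timer
    start = time.perf_counter()
    audio = generator.generate(text=text)  # placeholder API; returns a 1-D waveform tensor
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait until generation actually finishes on the GPU
    elapsed = time.perf_counter() - start
    return (audio.shape[-1] / generator.sample_rate) / elapsed  # sample_rate is assumed
```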

@MrDragonFox how? i think you had to have done something extra to get it workable… care to share, or give people hints on how you got it there?

quite a bit of custom inference code, yes - but accomplishable for people who know torch / CUDA
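Not his code, obviously, but for anyone wanting hints: the stock PyTorch inference optimisations people usually mean here look something like the sketch below. This assumes a standard `torch.nn.Module`; the bigger wins come from custom decode loops and CUDA-level work on top.

```python
import torch

def optimize_for_inference(model: torch.nn.Module) -> torch.nn.Module:
    """Generic PyTorch inference speedups - illustrative, not the actual CSM code."""
    model.eval()
    # Half precision roughly halves memory traffic on Ampere/Ada GPUs
    model = model.to(device="cuda", dtype=torch.bfloat16)
    # Allow TF32 matmuls for any remaining fp32 paths
    torch.backends.cuda.matmul.allow_tf32 = True
    # Fuse kernels and cut Python overhead; "reduce-overhead" enables CUDA graphs
    return torch.compile(model, mode="reduce-overhead")
```

Running generation under `torch.inference_mode()` and caching the autoregressive KV state are the other usual suspects.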

if it relies on audio context + text and has a Llama 3 under the hood, it really is of no use to anyone.

So many messages asking me to implement this with NeuroSync too. Such a shame.

the Llama 3 backbone has just text input for the semantic audio embeddings -> it does not generate text out - it acts very much like a regular TTS
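That also matches how the reference repo is driven: you call it like a plain TTS and optionally pass prior audio segments as context. Roughly, going from memory of the sesame/csm README (double-check the exact signature in the repo):

```python
import torchaudio
from generator import load_csm_1b  # generator.py from the sesame/csm repo

generator = load_csm_1b(device="cuda")
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,            # speaker id; no text comes back, only audio
    context=[],           # optional list of prior (text, speaker, audio) segments
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```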

oh, the community has it at 3.3 RTF already . that's not the issue - and with some optimisations it's certainly even faster

Where is this community you're referring to? I haven't seen anyone post anything anywhere claiming such performance other than you. Would love to hear about your optimizations. Can you please share your code with the community?

my code won't be shared - plenty of others have shared optimisations .. there is a Discord link in the issues of the sesame/csm GitHub repo

https://discord.gg/pWQjW4GC
