Disappointed with the results - Gibberish, Pauses, Inconsistent Voice, and Unstable Pitch

#14
by millionwords - opened

It's not stable: the voice is inconsistent, it adds pauses and gibberish in between, and the pitch is unstable too. Very disappointed. :( This is nowhere close to the demo shown on the blog.

Not the author, but FWIW this is the base model, not the fine-tuned variant they said the demo uses. Like a pre-trained base LLM, it's likely not very usable until you tune it for a more specific purpose.

Are there any instructions to fine-tune it?

no - and they don't plan to give us any either

I don't understand the intention behind this release.

research .. it's a demo - not for end customers or dev-centric - watch their interview ..

The demo on their website is way better, and of course I understand that it is fine-tuned. Not sure which interview you are referring to, but I didn't see them talk about CSM-1B in particular being a demo. I would say that even for a demo the consistency is pretty bad. This seems more like a proof of technology.

And let's not even talk about how excruciatingly slow it is, making it unrealistic as a "conversational" speech model. This is a gimped-down tease of a model they know can't be used for anything, or it would take away from the product they plan to sell.

Am I wrong?

oh, the community has it at 3.3 RTF already . that's not the issue - and with some optimisations it's certainly even faster

@MrDragonFox where did you get this from? Can you point to the source for getting it to 3.3 RTF?

i have about 3.3 locally on Ampere .. with Ada it'd be even faster
tensor magic :)

(screenshot attached: Screenshot_20250317_092828.png)
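For anyone wondering what that number means: RTF is the real-time factor, i.e. seconds of audio produced per second of wall-clock time, so 3.3 means generation runs 3.3x faster than real time. A minimal sketch of how you might measure it (the `generate` call and `sample_rate` attribute here are placeholders for whatever inference entry point you actually use):

```python
import time
import torch

def measure_rtf(generator, text: str) -> float:
    """Real-time factor: seconds of audio produced per wall-clock second."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # don't let queued GPU work skew the timer
    start = time.perf_counter()
    audio = generator.generate(text=text)  # placeholder API; returns a 1-D waveform tensor
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait until generation actually finishes on the GPU
    elapsed = time.perf_counter() - start
    return (audio.shape[-1] / generator.sample_rate) / elapsed  # sample_rate is assumed
```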

@MrDragonFox how? i think you had to have done something extra to get it workable… care to share, or give people hints on how you got it there?

quite a bit of custom inference code, yes - but accomplishable for people who know torch / CUDA
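Not his code, obviously, but for anyone wanting hints: the stock PyTorch inference optimisations people usually mean here look something like the sketch below. This assumes a standard `torch.nn.Module`; the bigger wins come from custom decode loops and CUDA-level work on top.

```python
import torch

def optimize_for_inference(model: torch.nn.Module) -> torch.nn.Module:
    """Generic PyTorch inference speedups - illustrative, not the actual CSM code."""
    model.eval()
    # Half precision roughly halves memory traffic on Ampere/Ada GPUs
    model = model.to(device="cuda", dtype=torch.bfloat16)
    # Allow TF32 matmuls for any remaining fp32 paths
    torch.backends.cuda.matmul.allow_tf32 = True
    # Fuse kernels and cut Python overhead; "reduce-overhead" enables CUDA graphs
    return torch.compile(model, mode="reduce-overhead")
```

Running generation under `torch.inference_mode()` and caching the autoregressive KV state are the other usual suspects.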

if it relies on audio context + text and has a Llama 3 under the hood, it really is of no use to anyone.

So many messages asking me to implement this with NeuroSync too. Such a shame.

the Llama 3 backbone has just text input for the semantic audio embeddings -> it does not generate text out - it acts very much like a regular TTS
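That also matches how the reference repo is driven: you call it like a plain TTS and optionally pass prior audio segments as context. Roughly, going from memory of the sesame/csm README (double-check the exact signature in the repo):

```python
import torchaudio
from generator import load_csm_1b  # generator.py from the sesame/csm repo

generator = load_csm_1b(device="cuda")
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,            # speaker id; no text comes back, only audio
    context=[],           # optional list of prior (text, speaker, audio) segments
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```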

oh, the community has it at 3.3 RTF already . that's not the issue - and with some optimisations it's certainly even faster

Where is this community you're referring to? I haven't seen anyone post anything anywhere claiming such performance other than you. Would love to hear about your optimizations. Can you please share your code with the community?

my code won't be shared - plenty of others have shared optimisations .. there is a Discord link in the issues of the sesame/csm GitHub repo

https://discord.gg/pWQjW4GC
