Disappointed with the results - Gibberish, Pauses, Inconsistent Voice, and Unstable Pitch
It's not stable: it produces an inconsistent voice, adds pauses, and inserts gibberish in between. The pitch is also not consistent. Very disappointed. :( This is nowhere close to the demo shown on the blog.
Not the author, but FWIW this is the base model, not the fine-tuned variant they said the demo uses. Like pre-trained base models for LLMs, it's likely not very usable until you tune it for a more specific purpose.
Are there any instructions to fine-tune it?
no - and they don't plan to give us any either
I don't understand the intention behind this release.
research .. it's a demo - not for the end customer or dev-centric - watch their interview ..
The demo on their website is way better, and of course I understand that it is fine-tuned. Not sure which interview you are referring to, but I didn't see them talk about CSM-1B in particular being a demo. I would say that, even for a demo, the consistency is pretty bad. This seems more like a proof of technology.
And let's not even talk about how excruciatingly slow it is, which makes it unrealistic as a "conversational" speech model. This is a gimped-down tease of a model they know can't be used for anything, or it would take away from the product they will be selling.
Am I wrong?
oh, the community has it at 3.3 RTF already. that's not the issue - and with some optimisations it's certainly even faster
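(for context: RTF here is used in the "higher is faster" sense - seconds of audio generated per second of wall-clock time, so 3.3 means 3.3 s of speech per second of compute. a rough sketch of how you'd measure it yourself, assuming a hypothetical generate() that returns a waveform tensor at 24 kHz:)

```python
import time
import torch

SAMPLE_RATE = 24_000  # assumption: the model emits 24 kHz audio

def measure_rtf(generate, text: str) -> float:
    """RTF in the 'higher is faster' sense:
    seconds of audio produced per second of wall-clock time."""
    torch.cuda.synchronize()   # flush pending GPU work first
    start = time.perf_counter()
    audio = generate(text)     # hypothetical fn returning a 1-D waveform tensor
    torch.cuda.synchronize()   # wait until generation has actually finished
    elapsed = time.perf_counter() - start
    return (audio.shape[-1] / SAMPLE_RATE) / elapsed  # e.g. 3.3 -> 3.3 s of speech per s
```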
@MrDragonFox where did you get this from? Can you point to a source for getting it to 3.3 RTF?
@MrDragonFox how? I think you had to have done something extra to get it workable… care to share, or at least give people hints on how you got it there?
quite a bit of custom inference code, yes - but accomplishable for people who know torch / CUDA
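(not their code, but to give an idea of the kind of stock PyTorch speed-ups people usually mean by this - reduced precision, inference mode, torch.compile - here's a hedged sketch; `model` and its generate() method are hypothetical stand-ins:)

```python
import torch

# Generic torch speed-up checklist, NOT Sesame's code; assumes a hypothetical
# `model` object whose generate(text) runs the autoregressive decode.

model = model.to("cuda", dtype=torch.bfloat16)   # lower precision = less memory traffic
model.eval()

# CUDA-graph-backed compilation: per-step Python and kernel-launch overhead
# tends to dominate small autoregressive models, and this removes most of it.
model = torch.compile(model, mode="reduce-overhead")

@torch.inference_mode()                          # skip autograd bookkeeping entirely
def fast_generate(text: str) -> torch.Tensor:
    return model.generate(text)                  # hypothetical API
```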
if it relies on audio context + text and has a Llama 3 under the hood, it's really of no use to anyone.
So many messages asking me to implement this with NeuroSync too. Such a shame.
the llama3 backbone just takes text input for the semantic audio embeddings -> it does not generate text out - it acts very much like a regular TTS
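(to make that flow concrete, a hedged toy sketch of a pipeline with that shape - every name and size below is a made-up stand-in, not Sesame's actual code or API: text goes in, the backbone emits audio codec tokens rather than text, and a codec decoder turns those into a waveform:)

```python
import torch

# Toy stand-ins only, to illustrate the TTS-shaped data flow described above.
CODEBOOK, SR, CODES_PER_SEC = 2_051, 24_000, 12  # assumed sizes/rates

def tokenize(text: str) -> torch.Tensor:
    return torch.tensor(list(text.encode()), dtype=torch.long)  # toy text tokenizer

def backbone_decode(text_tokens: torch.Tensor, steps: int = 50) -> torch.Tensor:
    # Stand-in for the Llama-style backbone: consumes text tokens (and, in the
    # real model, prior audio context) and autoregressively emits AUDIO codec
    # tokens - at no point does it produce output text.
    return torch.randint(0, CODEBOOK, (steps,))

def codec_decode(audio_codes: torch.Tensor) -> torch.Tensor:
    # Stand-in for a neural codec decoder: codec tokens -> waveform samples.
    return torch.randn(audio_codes.numel() * (SR // CODES_PER_SEC))

waveform = codec_decode(backbone_decode(tokenize("hello world")))
```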
> oh, the community has it at 3.3 RTF already. that's not the issue - and with some optimisations it's certainly even faster
Where is this community you're referring to? I haven't seen anyone post anything anywhere claiming such performance other than you. Would love to hear about your optimizations. Can you please share your code with the community?
my code won't be shared - plenty of others have shared optimisations .. there is a Discord link in the issues of the GitHub repo of sesame/csm