sesame/csm-1b · RealTime

Kar0nte

2 days ago

Hi guys, great job! That's cool! Any suggestions for an opensource RealTime mode? Thanks

Someshfengde

2 days ago

I'm trying to see if it can work with openai realtime console

https://github.com/openai/openai-realtime-console

for testing out on realtime what are you trying onto ? @Kar0nte

Kar0nte

2 days ago

@Someshfengde I was thinking about frameworks like LiveKit and FastRTC for real-time streaming. Do you think CSM-1B is fast enough for a WebRTC pipeline, or would we need additional optimization?

quadratrix

1 day ago

The demo gets around the limitations of the model by starting to process the input while the user is still talking, then it seems to stitch the responses together. You can basically force it to use the entire generation time by asking it a series of "repeat after me" in the same sentence, followed by something that triggers it's guardrails (expletives, etc)

To replicate the demo, you'd basically need a fast text-to-speech model, then fire it off to the LLM for a response and bring that response to the CSM for audio generation.

Kar0nte

1 day ago

Hi @quadratrix that's why I was thinking about frameworks like livekit, where you can use STT like Whisper, LLM, VOD and TTS, and I wanted to understand if it would be possible to use it. But from what I'm finding out, the 1B model is very immature and needs a lot of optimization. What do you think? To host it locally you would still need a lot of power for acceptable latency. And it seems to only handle English, and not very well either. I think it will take a long time before we can use it well.