---
license: apache-2.0
language:
- en
tags:
- text-to-speech
base_model:
- sesame/csm-1b
pipeline_tag: text-to-speech
---

## CSM 1B

**2025/03/13** - We are releasing the 1B CSM variant. Code is available on GitHub: [SesameAILabs/csm](https://github.com/SesameAILabs/csm).

---

CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.

A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

## Usage

Set up the repo:

```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Generate a sentence:

```python
from huggingface_hub import hf_hub_download
from generator import load_csm_1b
import torchaudio

model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
generator = load_csm_1b(model_path, "cuda")
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
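
The snippet above hard-codes `"cuda"`. If your code may also run on machines without an NVIDIA GPU, a small helper can pick a device string to pass to `load_csm_1b`. This is an illustrative sketch, not part of the repo, and whether CSM itself runs acceptably on CPU or Apple-silicon MPS is an assumption you should verify:

```python
import torch

def pick_device() -> str:
    """Return the best available torch device string (hypothetical helper)."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"

device = pick_device()
```

You would then call `load_csm_1b(model_path, device)` instead of hard-coding the device.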

CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance.

```python
from generator import Segment  # Segment is defined alongside load_csm_1b

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load the prompt audio and resample it to the model's sample rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
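
Every `Segment` in `context` adds tokens, so very long conversation histories increase memory use and latency. One simple policy is to keep only the most recent utterances that fit within a time budget. The helper below is a sketch, not part of the repo; the duration bookkeeping (one duration per segment, in seconds) is an assumption about how you track your own audio:

```python
def trim_context(segments, durations_s, max_context_s=30.0):
    """Keep the most recent segments whose total audio fits the budget.

    segments    -- list of context entries, oldest first
    durations_s -- matching list of audio durations in seconds
    """
    kept, total = [], 0.0
    for seg, dur in zip(reversed(segments), reversed(durations_s)):
        if total + dur > max_context_s:
            break
        kept.append(seg)
        total += dur
    return list(reversed(kept))  # restore chronological order
```

You would pass `trim_context(segments, durations)` as `context=` instead of the full history.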

## FAQ

**Does this model come with any voices?**

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

**Can I converse with the model?**

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
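
A minimal sketch of that split, assuming you already have a `generator` from `load_csm_1b` and any text LLM wrapped as a `reply_fn(text) -> str`. The `speak_reply` function and `reply_fn` are hypothetical glue, not repo APIs; only `generator.generate` comes from the snippets above:

```python
def speak_reply(generator, reply_fn, user_text, history, speaker=1):
    """Generate a spoken reply: the LLM writes the text, CSM voices it."""
    reply_text = reply_fn(user_text)  # text from your LLM of choice
    audio = generator.generate(
        text=reply_text,
        speaker=speaker,
        context=history,              # list of Segment objects
        max_audio_length_ms=15_000,
    )
    return reply_text, audio
```

You would append the reply as a new `Segment` to `history` to keep the voice consistent across turns.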

**Does it support other languages?**

The model has some capacity for non-English languages due to data contamination in the training data, but output quality is likely to be poor.

## Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:

- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.

**Authors**

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.