thepushkarp committed on
Commit
6023dd6
·
verified ·
1 Parent(s): 04be612

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +158 -0
  2. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,158 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ tags:
6
+ - text-to-speech
7
+ base_model:
8
+ - sesame/csm-1b
9
+ pipeline_tag: text-to-speech
10
+ ---
11
+
12
+ # CSM FP16 Safetensors
13
+
14
+ **2025/03/15** - This is the half-precision (FP16) Safetensors version of the 1B CSM variant, which was [originally released in full precision by Sesame](https://huggingface.co/sesame/csm_1b) on 2025/03/13.
15
+
16
+ ---
17
+
18
+ CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.
19
+
20
+ A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).
21
+
22
+ A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.
23
+
24
+ ## Conversion Statistics
25
+
26
+ Some statistics for the conversion from full-precision to half-precision:
27
+ - Original size: 5931.49 MB
28
+ - Converted size: 2961.73 MB
29
+ - Size reduction: 50.067621%
30
+ - Max absolute difference: 0.000897
31
+ - Max relative difference: 0.229582
32
+ - Avg absolute difference: 0.000016
33
+
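The script that produced these numbers is not part of this README. As a rough illustration, assuming the differences are measured element-wise between each original weight and its FP16 round-trip, the metrics can be sketched with the standard library alone (the `struct` format `e` is IEEE 754 half precision). The helper names and sample values below are hypothetical and are not taken from the checkpoint.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def conversion_stats(values):
    """Element-wise error metrics between values and their FP16 round-trip."""
    abs_diffs = [abs(v - to_fp16(v)) for v in values]
    # Relative error is undefined for exact zeros, so those are skipped.
    rel_diffs = [d / abs(v) for d, v in zip(abs_diffs, values) if v != 0.0]
    return {
        "max_abs": max(abs_diffs),
        "max_rel": max(rel_diffs, default=0.0),
        "avg_abs": sum(abs_diffs) / len(abs_diffs),
    }


# Hypothetical sample weights, not taken from the checkpoint.
sample = [0.5, -1.2345678, 3.1415926, 1e-7]
stats = conversion_stats(sample)

# Size reduction recomputed from the two sizes reported above.
size_reduction_pct = (1 - 2961.73 / 5931.49) * 100

print(stats)
print(f"size reduction: {size_reduction_pct:.2f}%")
```

Very small values fall into FP16's subnormal range, where rounding steps are coarse relative to the value itself. That may be why the maximum relative difference is large while the absolute differences stay tiny.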
34
+ ## Requirements
35
+
36
+ * A CUDA-compatible GPU
37
+ * The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
38
+ * Similarly, Python 3.10 is recommended, but newer versions may be fine
39
+ * For some audio operations, `ffmpeg` may be required
40
+
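The requirements above can be checked before installing anything. The following is a minimal sketch, assuming PyTorch may or may not be installed yet; `check_environment` is a hypothetical helper, not part of the CSM repository.

```python
import shutil
import sys


def check_environment():
    """Collect basic facts about the local setup without raising."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_ok": sys.version_info >= (3, 10),
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
        "cuda_available": None,  # None means PyTorch is not installed
        "cuda_version": None,
    }
    try:
        import torch
    except ImportError:
        return report
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```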
41
+ ### Setup
42
+
43
+ ```bash
44
+ git clone git@github.com:SesameAILabs/csm.git
45
+ cd csm
46
+ python3.10 -m venv .venv
47
+ source .venv/bin/activate
48
+ pip install -r requirements.txt
49
+ pip install safetensors
50
+ ```
51
+
52
+ ### Windows Setup
53
+
54
+ The `triton` package cannot be installed on Windows. Use `pip install triton-windows` instead.
55
+
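If the setup is scripted, the platform switch can be expressed in a few lines. This is only a sketch: `triton_package` is a hypothetical helper, and it assumes the two package names mentioned above are the only ones involved.

```python
import platform


def triton_package(system=None):
    """Return the pip package name for Triton on the given (or current) OS."""
    system = system or platform.system()
    return "triton-windows" if system == "Windows" else "triton"


print(f"pip install {triton_package()}")
```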
56
+ ## Usage
57
+
58
+ Generate a sentence:
59
+
60
+ ```python
61
+ from huggingface_hub import hf_hub_download
62
+ from generator import Generator, Segment
63
+ from models import Model, ModelArgs
64
+ from safetensors.torch import load_file
65
+ import torchaudio
66
+ import torch
67
+
68
+ device = "cpu"
69
+ model_path = hf_hub_download(repo_id="thepushkarp/csm-1b-safetensors-fp16", filename="model.safetensors")
70
+
71
+ model_args = ModelArgs(
72
+     backbone_flavor="llama-1B",
73
+     decoder_flavor="llama-100M",
74
+     text_vocab_size=128256,
75
+     audio_vocab_size=2051,
76
+     audio_num_codebooks=32,
77
+ )
78
+ model = Model(model_args).to(device=device, dtype=torch.bfloat16)
79
+ loaded = load_file(model_path)
80
+ model.load_state_dict(loaded)
81
+
82
+ generator = Generator(model)
83
+ audio = generator.generate(
84
+     text="Hello from Sesame.",
85
+     speaker=0,
86
+     context=[],
87
+     max_audio_length_ms=10_000,
88
+ )
89
+
90
+ torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
91
+ ```
92
+
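The snippet above sets `device = "cpu"`, while the requirements list a CUDA-compatible GPU. One way to reconcile the two is to pick the device at runtime. This is a sketch with a hypothetical helper, `pick_device`, and it falls back to the CPU when PyTorch or CUDA is unavailable.

```python
def pick_device() -> str:
    """Prefer CUDA when PyTorch reports it, otherwise fall back to the CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"using device: {device}")
```

The resulting string can be passed to `Model(model_args).to(device=device, dtype=torch.bfloat16)` in place of the hard-coded value.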
93
+ CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance.
94
+
95
+ ```python
96
+ speakers = [0, 1, 0, 0]
97
+ transcripts = [
98
+     "Hey how are you doing.",
99
+     "Pretty good, pretty good.",
100
+     "I'm great.",
101
+     "So happy to be speaking to you.",
102
+ ]
103
+ audio_paths = [
104
+     "utterance_0.wav",
105
+     "utterance_1.wav",
106
+     "utterance_2.wav",
107
+     "utterance_3.wav",
108
+ ]
109
+
110
+ def load_audio(audio_path):
111
+     audio_tensor, sample_rate = torchaudio.load(audio_path)
112
+     audio_tensor = torchaudio.functional.resample(
113
+         audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
114
+     )
115
+     return audio_tensor
116
+
117
+ segments = [
118
+     Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
119
+     for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
120
+ ]
121
+ audio = generator.generate(
122
+     text="Me too, this is some cool stuff huh?",
123
+     speaker=1,
124
+     context=segments,
125
+     max_audio_length_ms=10_000,
126
+ )
127
+
128
+ torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
129
+ ```
130
+
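The context example relies on `zip`, which silently stops at the shortest input. If one of the three lists is missing an entry, the last utterances are dropped without warning. A small guard can catch that before any audio is loaded. This is a sketch; `pair_context` is a hypothetical helper and not part of the CSM repository.

```python
def pair_context(transcripts, speakers, audio_paths):
    """Return (transcript, speaker, audio_path) triples, failing on length mismatch."""
    lengths = {len(transcripts), len(speakers), len(audio_paths)}
    if len(lengths) != 1:
        raise ValueError(
            "transcripts, speakers and audio_paths must have the same length, "
            f"got {len(transcripts)}, {len(speakers)} and {len(audio_paths)}"
        )
    return list(zip(transcripts, speakers, audio_paths))


triples = pair_context(
    ["Hey how are you doing.", "Pretty good, pretty good."],
    [0, 1],
    ["utterance_0.wav", "utterance_1.wav"],
)
print(triples)
```

Each triple can then be turned into a `Segment` exactly as in the list comprehension above.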
131
+ ## FAQ
132
+
133
+ **Does this model come with any voices?**
134
+
135
+ The model open-sourced here is a base generation model. It can produce a variety of voices, but it has not been fine-tuned on any specific voice.
136
+
137
+ **Can I converse with the model?**
138
+
139
+ CSM is trained to be an audio generation model and not a general purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
140
+
141
+ **Does it support other languages?**
142
+
143
+ The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.
144
+
145
+ ## Misuse and abuse ⚠️
146
+
147
+ This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:
148
+
149
+ - **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
150
+ - **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
151
+ - **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.
152
+
153
+ By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.
154
+
155
+ ---
156
+
157
+ ## Authors
158
+ Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e56dbf183a568ee1bdf5d59448c2947ef314beafe5ce551122c817ca179aedd
3
+ size 3105603608
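The three lines added for `model.safetensors` are a Git LFS pointer rather than the weights themselves. As a sketch, the pointer can be parsed into its fields and its size converted to units of 2^20 bytes; `parse_lfs_pointer` is a hypothetical helper written for this illustration.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9e56dbf183a568ee1bdf5d59448c2947ef314beafe5ce551122c817ca179aedd
size 3105603608
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20
print(f"{pointer['hash_algo']} digest, {size_mib:.2f} MiB")
```

The result agrees with the converted size reported in the README if "MB" there means 2^20 bytes.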