	---
	license: apache-2.0
	language:
	- en
	tags:
	- text-to-speech
	base_model:
	- sesame/csm-1b
	pipeline_tag: text-to-speech
	---

	# CSM FP16 Safetensors

	2025/03/15 - This is the half-precision (FP16) Safetensors version of the 1B CSM variant which was [originally released in full-precision by Sesame](https://huggingface.co/sesame/csm_1b) on 2025/03/13.

	---

	CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.

	A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

	A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

	## Conversion Statistics

	Some statistics for the conversion from full-precision to half-precision:
	- Original size: 6.22 GB
	- Converted size: 3.11 GB
	- Size reduction: 49.93%
	- Max absolute difference: 0.000897
	- Max relative difference: 0.229582
	- Avg absolute difference: 0.000016
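The script that produced these figures is not shown here. Presumably the differences are element-wise comparisons between each original weight and its FP16 round-trip. A minimal standard-library sketch of how such metrics can be computed follows; `to_fp16`, `conversion_stats`, and the sample values are illustrative assumptions, not part of the actual conversion code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def conversion_stats(weights, eps=1e-12):
    """Compare each weight with its FP16 round-trip (hypothetical helper)."""
    abs_diffs = []
    rel_diffs = []
    for w in weights:
        diff = abs(w - to_fp16(w))
        abs_diffs.append(diff)
        # eps guards against division by zero for exactly-zero weights.
        rel_diffs.append(diff / max(abs(w), eps))
    return {
        "max_abs": max(abs_diffs),
        "max_rel": max(rel_diffs),
        "avg_abs": sum(abs_diffs) / len(abs_diffs),
    }


# Made-up sample values standing in for real model weights.
sample = [0.5, 0.1, -1.2345678, 3.0e-5]
stats = conversion_stats(sample)
print(stats)
```

Values whose magnitude falls below the smallest normal FP16 number (about 6.1e-5) lose relative precision quickly, which is one plausible reason a maximum relative difference can be much larger than the maximum absolute difference.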

	## Requirements

	* A CUDA-compatible GPU is recommended; the usage example below runs on the CPU instead
	* The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
	* Similarly, Python 3.10 is recommended, but newer versions may be fine
	* For some audio operations, `ffmpeg` may be required
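The requirements above can be checked from Python before running anything heavier. This is a minimal sketch, not part of the CSM repository: `check_environment` is a hypothetical helper, and the CUDA check is skipped when `torch` is not installed yet.

```python
import shutil
import sys


def check_environment():
    """Report which of the listed requirements appear to be satisfied."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
        "cuda_available": None,  # stays None if torch cannot be imported
    }
    try:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    return report


report = check_environment()
for name, value in report.items():
    print(f"{name}: {value}")
```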

	### Setup

	```bash
	git clone git@github.com:SesameAILabs/csm.git
	cd csm
	python3.10 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	pip install safetensors
	```

	### Windows Setup

	The `triton` package cannot be installed on Windows. Use `pip install triton-windows` instead.
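If the setup is scripted, the package name can be chosen per platform. A small sketch, where `triton_requirement` is a hypothetical helper name:

```python
import sys


def triton_requirement(platform: str = sys.platform) -> str:
    """Return the pip package name to install for the given platform."""
    # sys.platform is "win32" on Windows, including 64-bit builds.
    return "triton-windows" if platform == "win32" else "triton"


print(f"pip install {triton_requirement()}")
```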

	## Usage

	Generate a sentence:

	```python
	from huggingface_hub import hf_hub_download
	from generator import Generator
	from models import Model, ModelArgs
	from safetensors.torch import load_file
	import torchaudio
	import torch

	# Run on the CPU; the checkpoint stores its weights in FP16.
	device = "cpu"
	model_path = hf_hub_download(repo_id="thepushkarp/csm-1b-safetensors-fp16", filename="model.safetensors")

	# Architecture settings for the 1B checkpoint.
	model_args = ModelArgs(
	    backbone_flavor="llama-1B",
	    decoder_flavor="llama-100M",
	    text_vocab_size=128256,
	    audio_vocab_size=2051,
	    audio_num_codebooks=32,
	)
	model = Model(model_args).to(device=device, dtype=torch.float16)
	loaded = load_file(model_path)
	model.load_state_dict(loaded)

	# Generate up to 10 seconds of speech with no conversational context.
	generator = Generator(model)
	audio = generator.generate(
	    text="Hello from Sesame.",
	    speaker=0,
	    context=[],
	    max_audio_length_ms=10_000,
	)

	torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
	```
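To get a feel for what `max_audio_length_ms=10_000` implies, the arithmetic below converts a duration into frames, samples, and audio codes. It is a sketch under stated assumptions: the 24 kHz sample rate and 80 ms frame duration (12.5 frames per second) are taken from Mimi's published specification rather than from this model card, and `generation_budget` is a hypothetical helper. The codebook count of 32 matches `audio_num_codebooks` above.

```python
def generation_budget(max_audio_length_ms, sample_rate_hz=24_000,
                      frame_ms=80, num_codebooks=32):
    """Estimate the output size for a given maximum audio length."""
    frames = max_audio_length_ms // frame_ms             # whole audio frames
    samples = frames * frame_ms * sample_rate_hz // 1000  # waveform samples
    codes = frames * num_codebooks                        # one code per codebook per frame
    return {"frames": frames, "samples": samples, "codes": codes}


budget = generation_budget(10_000)
print(budget)
```

Under these assumptions, 10 seconds corresponds to 125 frames, 240,000 waveform samples, and 4,000 audio codes.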

	CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance.

	```python
	from generator import Segment

	speakers = [0, 1, 0, 0]
	transcripts = [
	    "Hey how are you doing.",
	    "Pretty good, pretty good.",
	    "I'm great.",
	    "So happy to be speaking to you.",
	]
	audio_paths = [
	    "utterance_0.wav",
	    "utterance_1.wav",
	    "utterance_2.wav",
	    "utterance_3.wav",
	]

	def load_audio(audio_path):
	    # Load the file and resample it to the generator's sample rate.
	    # squeeze(0) drops the channel dimension, so mono input is assumed.
	    audio_tensor, sample_rate = torchaudio.load(audio_path)
	    audio_tensor = torchaudio.functional.resample(
	        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
	    )
	    return audio_tensor

	# One Segment per prior utterance: its text, speaker ID, and audio.
	segments = [
	    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
	    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
	]
	audio = generator.generate(
	    text="Me too, this is some cool stuff huh?",
	    speaker=1,
	    context=segments,
	    max_audio_length_ms=10_000,
	)

	torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
	```

	## FAQ

	Does this model come with any voices?

	The model open sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

	Can I converse with the model?

	CSM is trained to be an audio generation model and not a general purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
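A conversational setup along these lines might pair a text-generating LLM with CSM for speech. The sketch below only illustrates the control flow: `converse_turn`, `generate_reply`, and `synthesize` are hypothetical names, and the lambdas are stand-ins for a real LLM call and a wrapper around `generator.generate`.

```python
from typing import Callable, List, Tuple

Turn = Tuple[int, str]  # (speaker ID, text)


def converse_turn(
    history: List[Turn],
    user_text: str,
    generate_reply: Callable[[List[Turn]], str],
    synthesize: Callable[[str, int, List[Turn]], object],
    user_speaker: int = 1,
    model_speaker: int = 0,
):
    """Run one turn: the LLM writes the reply, the speech model voices it."""
    history = history + [(user_speaker, user_text)]
    reply_text = generate_reply(history)                     # separate LLM (assumed)
    audio = synthesize(reply_text, model_speaker, history)   # speech model (assumed)
    return history + [(model_speaker, reply_text)], audio


# Stub callables so the sketch runs without any model.
history, audio = converse_turn(
    [],
    "Hi there",
    generate_reply=lambda turns: "Hello!",
    synthesize=lambda text, speaker, turns: f"<audio:{text}>",
)
print(history, audio)
```

In practice the history would also need to carry the audio of each turn, since CSM's context is built from `Segment` objects that hold both text and audio.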

	Does it support other languages?

	The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

	## Misuse and abuse ⚠️

	This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

	- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
	- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
	- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

	By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.

	---

	## Authors
	Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.