
	---
	datasets:
	- homebrewltd/Ichigo-tokenized-v0.1
	language:
	- en
	- vi
	license: apache-2.0
	tags:
	- sound language model
	- audio-text-to-text
	- torchtune
	- whisperspeech
	---

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/BjNGSPCF5z-tp9aAGsZN9.png)

	## Speechless

	Speechless is a compact, open-source text-to-semantics model (1B parameters) that generates semantic representations of audio directly as discrete tokens, bypassing the need for a text-to-speech (TTS) model. Traditional pipelines generate audio and then process it (TTS → ASR); Speechless removes that step by converting text straight into semantic speech tokens, which simplifies training, saves resources, and scales more easily, especially for low-resource languages.

	Trained on ~400 hours of English and ~1000 hours of Vietnamese data, Speechless is a core component of the Ichigo v0.5 family.

	For more details, check out our official [blog post]().

	### Model Summary

	Developed by: Homebrew Research.

	Model Architecture: Llama

	Model type: Text to Semantics

	Language(s): English and Vietnamese

	License: Apache 2.0

	### Resources

	Blog: [Blog post]()

	## Intended Use

	Intended Use Cases: This model is primarily designed for research purposes. This version focuses on generating semantic representations of audio directly as discrete tokens, eliminating the need for a text-to-speech (TTS) model.

	Out-of-scope: The use of Speechless in any manner that violates applicable laws or regulations is strictly prohibited.

	## How to Get Started

	Use the example code below to load the model and generate semantic tokens from text.

	```python
	import torch
	from transformers import pipeline

	model_id = "homebrewltd/Speechless-llama3.2-v0.1"

	# Load the model as a standard text-generation pipeline.
	pipe = pipeline(
	    "text-generation",
	    model=model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# The prompt in this example starts with a reserved special token, followed by the input text.
	output = pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
	print(output)

	# Example output:
	# [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
	```
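The generated text mixes the prompt with `<|sound_XXXX|>` and `<|duration_XX|>` tokens. Below is a minimal sketch of pulling those tokens out of the string with a regular expression. The helper name `extract_semantic_tokens` and the reading of the numeric fields as integer IDs are assumptions for illustration, not part of the released code, and the sample string is a shortened version of the example output.

```python
import re

# Matches tokens such as <|sound_1968|> or <|duration_02|>.
TOKEN_PATTERN = re.compile(r"<\|(sound|duration)_(\d+)\|>")


def extract_semantic_tokens(generated_text):
    """Return (kind, value) pairs in the order they appear in the text.

    Hypothetical helper: the model card does not define a parsing API.
    """
    return [(kind, int(value)) for kind, value in TOKEN_PATTERN.findall(generated_text)]


# Shortened sample in the same format as the pipeline output.
sample = (
    "<|reserved_special_token_69|>I'm Speechless.assistant\n\n"
    "<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|>"
)
tokens = extract_semantic_tokens(sample)
print(tokens)
```

The reserved prompt token is ignored because the pattern only accepts the `sound` and `duration` prefixes.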


	## Training Specs

	| Parameter | Value |
	|----------------------------|-------------------------|
	| Epochs | 2 |
	| Global Batch Size | 144 |
	| Learning Rate | 3e-4 |
	| Learning Rate Scheduler | Cosine |
	| Optimizer | AdamW |
	| Warmup Ratio | 0.05 |
	| Weight Decay | 0.01 |
	| Max Sequence Length | 512 |
	| Clip Grad Norm | 1.0 |
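
To make the scheduler settings concrete, here is a minimal sketch of a cosine learning-rate schedule with warmup, using the learning rate and warmup ratio from the table. It assumes linear warmup over the first 5% of steps followed by cosine decay to zero. The exact scheduler implementation used in training is not stated here, and `total_steps = 1000` is an arbitrary illustrative value.

```python
import math

PEAK_LR = 3e-4       # "Learning Rate" from the table
WARMUP_RATIO = 0.05  # "Warmup Ratio" from the table


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given step (assumed: linear warmup, cosine decay to 0)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count depends on dataset size
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[-1])
```

With these values the rate climbs for the first 50 steps, peaks at 3e-4, and then decays smoothly for the rest of training.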

	## Evaluation

	1. Vietnamese

	| Model Name | Test dataset | Test samples | WER |
	|------------|--------------|--------------|-----|
	| Speechless v0.1 | viet_bud500 | 7500 | 3.99 |

	2. English

	| Model Name | Test dataset | Test samples | WER |
	|------------|--------------|--------------|-----|
	| Speechless v0.1 | librispeech_asr | 2620 | 3.27 |
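
For reference, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation for illustration only. It is an assumption that the scores above are percentages computed this way, and the evaluation pipeline that produced them is not described here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            )
        previous = current
    return previous[-1] / len(ref)


# Toy example: one word of the six-word reference is missing.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer * 100:.2f}%")
```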

	## Citation Information

	BibTeX:

	```
	@article{Speechless2024,
	  title={Speechless},
	  author={Homebrew Research},
	  year={2024},
	  month={December},
	  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
	}
	```

	## Acknowledgement

	- [WhisperSpeech](https://github.com/collabora/WhisperSpeech)

	- [Llama3.2](https://huggingface.co/meta-llama/Meta-Llama-3.2-1B-Base)