---
datasets:
- homebrewltd/Ichigo-tokenized-v0.1
language:
- en
- vi
license: apache-2.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/BjNGSPCF5z-tp9aAGsZN9.png)
## Speechless
Speechless is a compact, open-source text-to-semantics model (1B parameters) that generates semantic representations of audio directly as discrete tokens, removing the need for a text-to-speech (TTS) model. Unlike traditional pipelines that generate and then re-process audio (TTS → ASR), Speechless converts text straight into semantic speech tokens, which simplifies training, saves resources, and improves scalability, especially for low-resource languages.
Trained on ~400 hours of English and ~1,000 hours of Vietnamese data, Speechless is a core component of the Ichigo v0.5 family.
For more details, check out our official [blog post]().
### Model Summary
**Developed by:** Homebrew Research.
**Model Architecture:** Llama
**Model type:** Text to Semantics
**Language(s):** English and Vietnamese
**License:** Apache 2.0
### Resources
**Blog:** [Blog post]()
## Intended Use
**Intended Use Cases** This model is primarily designed for research purposes. This version focuses on generating direct semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model.
**Out-of-scope** Using Speechless in any manner that violates applicable laws or regulations is strictly prohibited.
## How to Get Started
You can use the example code below to load the model.
```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"

# Load the model as a standard text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prefix the input with <|reserved_special_token_69|> to request semantic speech tokens.
pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
>>> [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
```
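The model's reply appends discrete `<|sound_XXXX|>` (and `<|duration_XX|>`) tokens after the prompt. As a minimal, illustrative sketch (the regex and helper name below are ours, not part of the model's API), you can pull the numeric sound-token indices out of the generated string before handing them to a WhisperSpeech-style decoder:
```python
import re

def extract_sound_ids(generated_text: str) -> list[int]:
    """Collect the numeric indices of <|sound_XXXX|> tokens, in order of appearance."""
    return [int(m) for m in re.findall(r"<\|sound_(\d+)\|>", generated_text)]

# Reuses the `pipe` object created above.
output = pipe("<|reserved_special_token_69|>Hello from Homebrew Research")
sound_ids = extract_sound_ids(output[0]["generated_text"])
print(sound_ids)  # e.g. [1968, 464, 642, ...]
```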
## Training Specs
| **Parameter** | **Value** |
|----------------------------|-------------------------|
| **Epochs** | 2 |
| **Global Batch Size** | 144 |
| **Learning Rate** | 3e-4 |
| **LR Scheduler**           | Cosine                  |
| **Optimizer** | AdamW |
| **Warmup Ratio** | 0.05 |
| **Weight Decay** | 0.01 |
| **Max Sequence Length** | 512 |
| **Clip Grad Norm** | 1.0 |
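For reference, here is a minimal sketch of how these hyperparameters map onto a standard PyTorch/Transformers setup. It is illustrative only: the step count is a placeholder, the checkpoint id is reused from this card, and the actual run was done with torchtune rather than this code.
```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

# Placeholder: the real run fine-tuned a base Llama 3.2 1B checkpoint.
model = AutoModelForCausalLM.from_pretrained("homebrewltd/Speechless-llama3.2-v0.1")

total_steps = 10_000                    # placeholder; depends on dataset size, epochs, global batch size
warmup_steps = int(0.05 * total_steps)  # warmup ratio 0.05

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

# Inside the training loop, clip gradients to a max norm of 1.0 before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```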
## Evaluation
1. Vietnamese

| Model Name | Test dataset | Test samples | WER (%) |
|------------|--------------|--------------|---------|
| **Speechless v0.1** | viet_bud500 | 7500 | **3.99** |

2. English

| Model Name | Test dataset | Test samples | WER (%) |
|------------|--------------|--------------|---------|
| **Speechless v0.1** | librispeech_asr | 2620 | **3.27** |
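WER compares reference transcripts against hypothesis transcripts. As a minimal sketch of computing the metric itself with the Hugging Face `evaluate` library (the strings are placeholders, not our evaluation data):
```python
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["i am speechless a model developed by homebrew research"]  # hypothetical system output
references  = ["i am speechless a model developed by homebrew research"]  # ground-truth transcript

# Multiply by 100 to report WER as a percentage, as in the tables above.
print(100 * wer_metric.compute(predictions=predictions, references=references))
```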
## Citation Information
**BibTeX:**
```
@article{speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}
```
## Acknowledgement
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Llama3.2](https://huggingface.co/meta-llama/Meta-Llama-3.2-1B-Base)**