---
license: mit
language:
- rn
---
# Kirundi Tokenizer
This is a SentencePiece tokenizer trained for the Kirundi language. It can be used to tokenize Kirundi text for downstream NLP tasks.
## Model Details
- **Model type**: SentencePiece
- **Vocabulary size**: 32,000
- **Training corpus**: A clean corpus of Kirundi text.
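
Once the model file is loaded (see "How to Use" below), the vocabulary size can be verified directly; the expected value of 32,000 comes from the details above.

```python
import sentencepiece as spm

# Load the tokenizer and check the vocabulary size
sp = spm.SentencePieceProcessor(model_file='kirundi.model')
print(sp.get_piece_size())  # expected: 32000
```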
## Training Data
The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. The data was preprocessed to remove unwanted characters and cleaned before training.
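
The exact training configuration is not published here, but a comparable tokenizer can be trained with SentencePiece along the following lines. The input file name and the `character_coverage` setting are assumptions; only the 32,000 vocabulary size is taken from the model details above.

```python
import sentencepiece as spm

# Hypothetical training run; 'kirundi_corpus.txt' stands in for the
# cleaned Kirundi corpus described above (one sentence per line, UTF-8).
spm.SentencePieceTrainer.train(
    input='kirundi_corpus.txt',
    model_prefix='kirundi',    # writes kirundi.model and kirundi.vocab
    vocab_size=32000,          # matches the vocabulary size listed above
    character_coverage=1.0,    # assumed; keep all characters seen in the corpus
)
```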
## How to Use
```python
import sentencepiece as spm
# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file='kirundi.model')
# Tokenize text
text = "Ndakunda igihugu canje."
tokens = sp.encode(text, out_type=str)
print(tokens)
# Detokenize text
decoded_text = sp.decode(tokens)
print(decoded_text)
```
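
For models that expect integer inputs, the same processor can also return token IDs rather than string pieces; a minimal sketch:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Encode to integer token IDs instead of string pieces
ids = sp.encode("Ndakunda igihugu canje.", out_type=int)
print(ids)

# Decoding the IDs recovers the original text
print(sp.decode(ids))
```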