---
license: mit
language:
- rn
---

# Kirundi Tokenizer

This is a SentencePiece tokenizer trained for the Kirundi language. It can be used to tokenize Kirundi text for NLP tasks.

## Model Details

- **Model type**: SentencePiece
- **Vocabulary size**: 32,000
- **Training corpus**: A clean corpus of Kirundi text

## Training Data

The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. The data was preprocessed to remove unwanted characters and cleaned before tokenizer training.
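A comparable tokenizer can be reproduced with the SentencePiece Python trainer. The snippet below is a minimal sketch rather than the exact recipe used for this model: the corpus path `kirundi_corpus.txt`, the `unigram` model type, and the character coverage are assumptions, while the 32,000-token vocabulary and the `kirundi` model prefix follow from this card.

```python
import sentencepiece as spm

# Minimal training sketch (not the exact recipe used for this model).
# 'kirundi_corpus.txt' is a placeholder: cleaned Kirundi text, one sentence per line.
spm.SentencePieceTrainer.train(
    input='kirundi_corpus.txt',   # placeholder path to the cleaned corpus
    model_prefix='kirundi',       # writes kirundi.model and kirundi.vocab
    vocab_size=32000,             # matches the vocabulary size listed above
    model_type='unigram',         # assumption; SentencePiece's default algorithm
    character_coverage=1.0,       # assumption; keep all characters seen in the corpus
)
```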
## How to Use

```python
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize text
text = "Ndakunda igihugu canje."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize text
decoded_text = sp.decode(tokens)
print(decoded_text)
```
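For downstream models you will usually want integer token IDs rather than string pieces. This short follow-up assumes the `sp` processor and `text` string from the example above.

```python
# Encode to integer token IDs and decode them back to text.
ids = sp.encode(text, out_type=int)
print(ids)
print(sp.decode(ids))

# Vocabulary size of the loaded model (expected to be 32,000).
print(sp.get_piece_size())
```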