---
license: mit
language:
- rn
---

# Kirundi Tokenizer

This is a SentencePiece-based tokenizer model trained for the Kirundi language. It can be used to tokenize Kirundi text for NLP tasks.

## Model Details

- **Model type**: SentencePiece
- **Vocabulary size**: 32,000
- **Training corpus**: A cleaned corpus of Kirundi text.

## Training Data

The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. The data was preprocessed to remove unwanted characters and cleaned for tokenization.

## How to Use

```python
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize text into subword pieces
text = "Ndakunda igihugu canje."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize back to the original string
decoded_text = sp.decode(tokens)
print(decoded_text)
```