---
license: mit
language:
- rn
---

# Kirundi Tokenizer

This is a SentencePiece-based tokenizer model trained for the Kirundi language. It can be used to tokenize Kirundi text for NLP tasks.

## Model Details

- **Model type**: SentencePiece
- **Vocabulary size**: 32,000
- **Training corpus**: A cleaned corpus of Kirundi text.

## Training Data

The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. The data was preprocessed to remove unwanted characters and cleaned for tokenization.

## How to Use

```python
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize text into subword pieces
text = "Ndakunda igihugu canje."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize back to the original string
decoded_text = sp.decode(tokens)
print(decoded_text)
```