eligapris/kirundi-tokenizer

eligapris committed on Dec 6, 2024

Commit 52fd74d (verified) · 1 Parent(s): b35c9ab

Upload README.md with huggingface_hub

Files changed (1)

README.md ADDED

+# Kirundi Tokenizer
+This is a SentencePiece tokenizer model trained for the Kirundi language. Use it to tokenize Kirundi text for NLP tasks.
+## Model Details
+- **Model type**: SentencePiece
+- **Vocabulary size**: 32,000
+- **Training corpus**: A clean corpus of Kirundi text.
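
For readers who want to train a comparable tokenizer, the following is a minimal sketch and not the author's actual training script. Only the SentencePiece model family and the 32,000 vocabulary size come from this README; the corpus path `kirundi_corpus.txt`, the `unigram` model type, the `character_coverage` value, and the helper names are assumptions made for illustration.

```python
def build_trainer_args(corpus_path="kirundi_corpus.txt",
                       model_prefix="kirundi",
                       vocab_size=32000):
    """Collect keyword arguments for SentencePieceTrainer.train.

    corpus_path is a hypothetical one-sentence-per-line text file.
    model_type and character_coverage are assumed, not stated in the README.
    """
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,   # output files: <prefix>.model, <prefix>.vocab
        "vocab_size": vocab_size,       # 32,000 as listed under Model Details
        "model_type": "unigram",        # assumption: SentencePiece's default type
        "character_coverage": 1.0,      # assumption: keep every character seen
    }


def train_tokenizer(args):
    """Train a SentencePiece model and return the path of the .model file."""
    # Imported here so the argument helper works without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**args)
    return args["model_prefix"] + ".model"
```

Calling `train_tokenizer(build_trainer_args())` on a sufficiently large corpus would write `kirundi.model` and `kirundi.vocab` to the working directory. A small corpus may not support a 32,000-piece vocabulary, in which case training fails and `vocab_size` has to be lowered.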
+## Training Data
+The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. Before training, the data was preprocessed to remove unwanted characters.
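
The README does not say which characters were treated as unwanted, so the sketch below shows one plausible cleaning pass and not the preprocessing actually used. The steps (NFKC normalization, dropping control and format characters, collapsing whitespace, discarding empty lines) and the function names `clean_line` and `clean_corpus` are all assumptions.

```python
import re
import unicodedata


def clean_line(line):
    """Normalize one line of text (assumed cleaning rules, see lead-in)."""
    # Fold compatibility forms (e.g. full-width letters) into canonical ones.
    text = unicodedata.normalize("NFKC", line)
    kept = []
    for ch in text:
        if ch.isspace():
            # Treat tabs, newlines and other whitespace as a plain space.
            kept.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop control, format and unassigned characters (e.g. U+200B).
            continue
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", "".join(kept)).strip()


def clean_corpus(lines):
    """Clean every line and discard those that end up empty."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line]
```

For example, `clean_line("  Ndakunda\u200b  igihugu\tcanje.\x00 ")` returns `"Ndakunda igihugu canje."`. The cleaned lines could then be written one per line to a text file and used as training input for SentencePiece.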
+## How to Use
+```python
+import sentencepiece as spm
+# Load the tokenizer
+sp = spm.SentencePieceProcessor(model_file='kirundi.model')
+# Tokenize text
+text = "Ndakunda igihugu canje."
+tokens = sp.encode(text, out_type=str)
+print(tokens)
+# Detokenize text
+decoded_text = sp.decode(tokens)
+print(decoded_text)