Commit 52fd74d (verified) by eligapris · 1 parent: b35c9ab

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +31 -0
README.md ADDED
@@ -0,0 +1,31 @@
+ # Kirundi Tokenizer
+
+ This is a SentencePiece tokenizer model trained for the Kirundi language. It can be used to tokenize Kirundi text for NLP tasks.
+
+ ## Model Details
+
+ - **Model type**: SentencePiece
+ - **Vocabulary size**: 32,000
+ - **Training corpus**: A cleaned corpus of Kirundi text.
+
+ ## Training Data
+
+ The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. Before training, the data was cleaned and preprocessed to remove unwanted characters.
+
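The README does not say which preprocessing steps were applied, so the following is only a minimal sketch of what "removing unwanted characters" could look like before training a tokenizer. The function names `clean_line` and `clean_corpus` and the specific steps (NFKC normalization, dropping control and format characters, collapsing whitespace, de-duplicating lines) are assumptions for illustration, not the author's actual pipeline.

```python
import re
import unicodedata


def clean_line(line: str) -> str:
    """Normalize one line of raw text (illustrative; steps are assumed)."""
    # Unify compatibility variants, e.g. a no-break space becomes a plain space.
    line = unicodedata.normalize("NFKC", line)
    # Drop control (Cc) and format (Cf) characters such as zero-width spaces,
    # but keep whitespace so it can be collapsed in the next step.
    line = "".join(
        ch for ch in line
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse runs of whitespace into single spaces and trim both ends.
    return re.sub(r"\s+", " ", line).strip()


def clean_corpus(lines):
    """Yield cleaned, non-empty, de-duplicated lines in their original order."""
    seen = set()
    for raw in lines:
        cleaned = clean_line(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            yield cleaned
```

Under these assumptions, `list(clean_corpus(open("corpus.txt", encoding="utf-8")))` would produce one cleaned sentence per entry, where `corpus.txt` is a hypothetical file name. Writing those lines back to a text file gives the one-sentence-per-line input that SentencePiece training expects.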
+ ## How to Use
+
+ ```python
+ import sentencepiece as spm
+
+ # Load the tokenizer from the kirundi.model file
+ sp = spm.SentencePieceProcessor(model_file='kirundi.model')
+
+ # Tokenize text into subword pieces (strings rather than integer IDs)
+ text = "Ndakunda igihugu canje."
+ tokens = sp.encode(text, out_type=str)
+ print(tokens)
+
+ # Detokenize: join the pieces back into a single string
+ decoded_text = sp.decode(tokens)
+ print(decoded_text)
+ ```