---
license: mit
language:
- rn
---

# Kirundi Tokenizer

This is a SentencePiece tokenizer trained for the Kirundi language. It can be used to tokenize Kirundi text for NLP tasks.

## Model Details

- **Model type**: SentencePiece
- **Vocabulary size**: 32,000
- **Training corpus**: A clean corpus of Kirundi text

## Training Data

The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. The data was preprocessed to remove unwanted characters and cleaned before tokenizer training.
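A comparable tokenizer can be reproduced with the SentencePiece Python trainer. The snippet below is a minimal sketch rather than the exact recipe used for this model: the corpus path `kirundi_corpus.txt`, the `unigram` model type, and the character coverage are assumptions, while the 32,000-token vocabulary and the `kirundi` model prefix follow from this card.

```python
import sentencepiece as spm

# Minimal training sketch (not the exact recipe used for this model).
# 'kirundi_corpus.txt' is a placeholder: cleaned Kirundi text, one sentence per line.
spm.SentencePieceTrainer.train(
    input='kirundi_corpus.txt',   # placeholder path to the cleaned corpus
    model_prefix='kirundi',       # writes kirundi.model and kirundi.vocab
    vocab_size=32000,             # matches the vocabulary size listed above
    model_type='unigram',         # assumption; SentencePiece's default algorithm
    character_coverage=1.0,       # assumption; keep all characters seen in the corpus
)
```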
## How to Use

```python
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize text
text = "Ndakunda igihugu canje."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize text
decoded_text = sp.decode(tokens)
print(decoded_text)
```
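For downstream models you will usually want integer token IDs rather than string pieces. This short follow-up assumes the `sp` processor and `text` string from the example above.

```python
# Encode to integer token IDs and decode them back to text.
ids = sp.encode(text, out_type=int)
print(ids)
print(sp.decode(ids))

# Vocabulary size of the loaded model (expected to be 32,000).
print(sp.get_piece_size())
```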