AIGym Custom Tokenizer (CL200K)
Overview
The AIGym CL200K Tokenizer is a custom tokenizer designed for pretraining large language models. It is based on the Meta-Llama-3-8B tokenizer and was trained on the AIGym Pretraining Corpus.
Features
- Built on Meta-Llama-3-8B
- Supports a vocabulary size of 200K tokens
- Optimized for educational, programming, multilingual, and mathematical texts
- Includes a custom PAD token
Usage
Loading the Tokenizer
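To load the tokenizer and encode a string:
from transformers import AutoTokenizer
# Fetch the tokenizer files from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("AIGym/cl200k")
text = "Hello, world!"
# encode() returns a list of integer token IDs
tokens = tokenizer.encode(text)
print(tokens)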
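Padding with the PAD Token
The custom PAD token listed under Features matters mostly when sequences of different lengths are batched together. The sketch below shows the idea in plain Python, with no dependency on the tokenizer itself. The helper name `pad_batch`, the token IDs, and the value of `PAD_ID` are hypothetical and chosen only for illustration; they are not taken from the CL200K vocabulary.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-ID sequences to a common length.

    Returns the padded sequences and a matching attention mask
    (1 for real tokens, 0 for padding positions).
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Hypothetical token IDs and PAD ID, for illustration only.
batch = [[101, 7, 42], [101, 9]]
PAD_ID = 0

input_ids, attention_mask = pad_batch(batch, PAD_ID)
print(input_ids)       # [[101, 7, 42], [101, 9, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

In practice, assuming the PAD token is registered as the tokenizer's pad token, the real ID would be read from `tokenizer.pad_token_id`, and calling `tokenizer(texts, padding=True)` in transformers handles this step for you.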