AIGym Custom Tokenizer (CL200K)

Overview

The AIGym CL200K Tokenizer is a custom tokenizer designed for pretraining large language models. It is based on the Meta-Llama-3-8B tokenizer and trained on the AIGym Pretraining Corpus.

Features

  • Built on Meta-Llama-3-8B
  • Supports a vocabulary size of 200K tokens
  • Optimized for educational, programming, multilingual, and mathematical texts
  • Includes a custom PAD token

Usage

Loading the Tokenizer

To use the tokenizer, load it with the Hugging Face transformers library:

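from transformers import AutoTokenizer

# Download (or load from the local cache) the tokenizer files from the Hub
tokenizer = AutoTokenizer.from_pretrained("AIGym/cl200k")

# Encode a string into a list of token IDs
text = "Hello, world!"
tokens = tokenizer.encode(text)
print(tokens)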
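
Beyond plain encoding, the features listed above (the vocabulary size and the custom PAD token) can be checked directly on the loaded tokenizer. The following is a minimal sketch, not part of the original model card: the helper names `describe`, `round_trip`, `encode_batch`, and `demo` are hypothetical, and the only library calls assumed are standard `transformers` tokenizer methods (`len(tokenizer)`, `pad_token`, `pad_token_id`, `encode`, `decode`, and calling the tokenizer with `padding=True`). Actual token IDs, the PAD token string, and the padding side depend on the published tokenizer files and are not stated here.

```python
def describe(tokenizer):
    """Summarize vocabulary size and PAD token details of a loaded tokenizer."""
    return {
        "vocab_size": len(tokenizer),  # counts added special tokens too
        "pad_token": tokenizer.pad_token,
        "pad_token_id": tokenizer.pad_token_id,
    }


def round_trip(tokenizer, text):
    """Encode text to IDs without special tokens, then decode the IDs back."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids, tokenizer.decode(ids)


def encode_batch(tokenizer, texts):
    """Encode several texts at once, padding shorter ones with the PAD token."""
    batch = tokenizer(texts, padding=True)
    return batch["input_ids"], batch["attention_mask"]


def demo(model_id="AIGym/cl200k"):
    """Run the helpers against the Hub tokenizer (needs transformers and network access)."""
    from transformers import AutoTokenizer  # imported here so the helpers stay dependency-free

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(describe(tokenizer))
    print(round_trip(tokenizer, "Hello, world!"))
    input_ids, attention_mask = encode_batch(tokenizer, ["Hello", "Hello, world!"])
    print(input_ids)
    print(attention_mask)  # 0 marks padded positions
```

Calling `demo()` prints the tokenizer summary, a round-trip result, and a padded batch. Whether the decoded string matches the input exactly depends on the tokenizer's normalization, so treat the round trip as a check rather than a guarantee.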