llm-jp-modernbert-base

This model is based on the ModernBERT-base architecture (187M parameters) with the llm-jp-tokenizer. It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.

For details on the training methods, evaluation, and analysis results, please refer to llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length.

Usage

Please install the transformers library.

pip install "transformers>=4.48.0"

If your GPU supports FlashAttention 2, installing flash-attn is recommended.

pip install flash-attn --no-build-isolation
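
With flash-attn installed, FlashAttention 2 can also be requested explicitly when loading the model. This is a minimal sketch: the attn_implementation argument is the standard transformers mechanism and requires a half-precision dtype such as bfloat16.

import torch
from transformers import AutoModelForMaskedLM

# Request FlashAttention 2 explicitly; this requires the flash-attn package
# and a half-precision dtype such as bfloat16.
model = AutoModelForMaskedLM.from_pretrained(
    "llm-jp/llm-jp-modernbert-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)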

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask token:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)  # position of <MASK|LLM-jp>
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)            # most likely token id at that position
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  東京
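
The same prediction can also be obtained through the fill-mask pipeline; a minimal sketch, assuming the tokenizer's registered mask token <MASK|LLM-jp>:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="llm-jp/llm-jp-modernbert-base")

# Each result is a dict containing the filled sequence, the token string, and its score.
for result in fill_mask("日本の首都は<MASK|LLM-jp>です。", top_k=3):
    print(result["token_str"], result["score"])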

Training

This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.

The training code is available at https://github.com/llm-jp/llm-jp-modernbert.

Hyperparameter        stage 1         stage 2
max_seq_len           1024            8192
max_steps             500,000         200,000
Total batch size      3328            384
Peak LR               5e-4            5e-5
Warmup steps          24,000
LR schedule           Linear decay
Adam beta 1           0.9
Adam beta 2           0.98
Adam eps              1e-6
MLM prob              0.30
Gradient clipping     1.0
Weight decay          1e-5
line_by_line          True

A blank cell in the stage 2 column indicates the same value as in stage 1.
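
For orientation, the masking setting above corresponds to transformers' standard MLM data collator configured as below. This is an illustrative sketch only; the actual training code is in the repository linked above.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")

# Masked language modeling with the 30% masking probability listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)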

Evaluation

JSTS, JNLI, and JCoLA from JGLUE were used for evaluation. The evaluation code is available at https://github.com/llm-jp/llm-jp-modernbert.

Model                              JSTS (Pearson)   JNLI (accuracy)   JCoLA (accuracy)   Avg
tohoku-nlp/bert-base-japanese-v3   0.920            0.912             0.880              0.904
sbintuitions/modernbert-ja-130m    0.916            0.927             0.868              0.904
sbintuitions/modernbert-ja-310m    0.932            0.933             0.883              0.916
llm-jp/llm-jp-modernbert-base      0.918            0.913             0.844              0.892
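
JGLUE scores of this kind are obtained by fine-tuning the encoder with a task-specific head. As a rough, hypothetical illustration of such a setup (three labels as in JNLI; other tasks differ, and the actual evaluation code is in the repository linked above):

from transformers import AutoModelForSequenceClassification

# Hypothetical fine-tuning setup for a 3-way task such as JNLI
# (entailment / contradiction / neutral).
model = AutoModelForSequenceClassification.from_pretrained(
    "llm-jp/llm-jp-modernbert-base",
    num_labels=3,
)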

LICENSE

Apache License, Version 2.0

Citation

@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
      title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length}, 
      author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
      year={2025},
      eprint={2504.15544},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15544}, 
}