---
library_name: transformers
license: apache-2.0
language:
- ja
---

# llm-jp-modernbert-base

This model is based on the [modernBERT-base](https://arxiv.org/abs/2412.13663) architecture with the [llm-jp-tokenizer](https://github.com/llm-jp/llm-jp-tokenizer).
It was trained on the Japanese subset (3.4TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.

For detailed information on the training methods, evaluation, and analysis results, please refer to [llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length](https://arxiv.org/abs/2504.15544).

## Usage

Please install the transformers library.

```bash
pip install "transformers>=4.48.0"
```

If your GPU supports FlashAttention 2, installing flash-attn is recommended:

```bash
pip install flash-attn --no-build-isolation
```

Using AutoModelForMaskedLM:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Insert the tokenizer's mask token into the input text.
text = f"日本の首都は{tokenizer.mask_token}です。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: 東京
```

A hedged long-context usage sketch is given in the appendix at the end of this card.

## Training

This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.
Training code can be found at https://github.com/llm-jp/llm-jp-modernbert

| Hyperparameter    |      stage 1 |  stage 2 |
|:------------------|-------------:|---------:|
| max_seq_len       |         1024 |     8192 |
| max_steps         |      500,000 |  200,000 |
| Total batch size  |         3328 |      384 |
| Peak LR           |         5e-4 |     5e-5 |
| Warmup steps      |       24,000 |          |
| LR schedule       | Linear decay |          |
| Adam beta 1       |          0.9 |          |
| Adam beta 2       |         0.98 |          |
| Adam eps          |         1e-6 |          |
| MLM prob          |         0.30 |          |
| Gradient clipping |          1.0 |          |
| Weight decay      |         1e-5 |          |
| line_by_line      |         True |          |

Blank cells in the stage 2 column indicate the same value as in stage 1.
An illustrative sketch of the stage 1 learning-rate schedule is also given in the appendix.

## Evaluation

JSTS, JNLI, and JCoLA from [JGLUE](https://aclanthology.org/2022.lrec-1.317/) were used.
Evaluation code can be found at https://github.com/llm-jp/llm-jp-modernbert

| Model                             | JSTS (Pearson) | JNLI (accuracy) | JCoLA (accuracy) | Avg       |
|-----------------------------------|----------------|-----------------|------------------|-----------|
| tohoku-nlp/bert-base-japanese-v3  | 0.920          | 0.912           | 0.880            | 0.904     |
| sbintuitions/modernbert-ja-130m   | 0.916          | 0.927           | 0.868            | 0.904     |
| sbintuitions/modernbert-ja-310m   | **0.932**      | **0.933**       | **0.883**        | **0.916** |
| **llm-jp/llm-jp-modernbert-base** | 0.918          | 0.913           | 0.844            | 0.892     |

## LICENSE

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

```bibtex
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
      title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length},
      author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
      year={2025},
      eprint={2504.15544},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15544},
}
```
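## Appendix: illustrative sketches

The sketch below shows one way to exploit the 8192-token context window for feature extraction. It is a minimal sketch, not the authors' prescribed embedding recipe: loading the encoder via `AutoModel`, truncating to 8192 tokens, and mean-pooling the final hidden states are all illustrative assumptions, and the input text is a placeholder.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Placeholder long document; any Japanese text works here.
long_text = "日本語の長い文書をここに入れます。" * 400

# Truncate to the model's 8192-token context window.
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states into a single document vector
# (one common pooling choice; assumed here for illustration).
hidden = outputs.last_hidden_state             # (1, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```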
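The stage 1 learning-rate schedule can be reconstructed from the hyperparameter table above: linear warmup to the peak LR of 5e-4 over 24,000 steps, then linear decay. The sketch below expresses this with PyTorch's `LambdaLR`; decaying to zero exactly at `max_steps` is an assumption, and the training repository linked above contains the actual implementation.

```python
import torch

# Values from the stage 1 column of the training table.
peak_lr, warmup_steps, max_steps = 5e-4, 24_000, 500_000

params = [torch.nn.Parameter(torch.zeros(1))]  # dummy parameter for illustration
optimizer = torch.optim.AdamW(
    params, lr=peak_lr, betas=(0.9, 0.98), eps=1e-6, weight_decay=1e-5
)

def lr_lambda(step: int) -> float:
    # Linear warmup to 1.0 at warmup_steps, then linear decay
    # to 0.0 at max_steps (decay endpoint assumed).
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (max_steps - step) / (max_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In a training loop, call scheduler.step() once after each optimizer.step().
```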