---
library_name: transformers
license: apache-2.0
language:
- ja
---
# llm-jp-modernbert-base
This model is based on the [ModernBERT-base](https://arxiv.org/abs/2412.13663) architecture and uses the [llm-jp-tokenizer](https://github.com/llm-jp/llm-jp-tokenizer).
It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.
For details on the training methods, evaluation, and analysis, please refer to [llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length](https://arxiv.org/abs/2504.15544).
## Usage
Please install the transformers library.
```bash
pip install "transformers>=4.48.0"
```
If your GPU supports FlashAttention 2, installing flash-attn is also recommended.
```bash
pip install flash-attn --no-build-isolation
```
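With flash-attn installed, the model can be loaded with FlashAttention 2 enabled through the standard `attn_implementation` argument of `from_pretrained`. This is a minimal sketch; bfloat16 is assumed here because FlashAttention requires half precision.
```python
# Optional: load the model with FlashAttention 2 (requires flash-attn and a compatible GPU).
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "llm-jp/llm-jp-modernbert-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```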
Using AutoModelForMaskedLM:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# To get predictions for the masked token:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: 東京
```
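The checkpoint can also be used as a plain encoder to obtain sentence embeddings. The sketch below loads it with `AutoModel` and applies mean pooling over non-padding tokens; the pooling choice is an assumption for illustration, not something prescribed by this model card.
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["日本の首都は東京です。", "富士山は日本で一番高い山です。"]
# The model accepts sequences of up to 8192 tokens.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean pooling over non-padding tokens (one common choice, assumed here for illustration).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, hidden_size])
```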
## Training
This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.
The training code is available at [llm-jp/llm-jp-modernbert](https://github.com/llm-jp/llm-jp-modernbert).
| Hyperparameter     | Stage 1         | Stage 2         |
|:------------------ |----------------:|----------------:|
| max_seq_len        | 1024            | 8192            |
| max_steps          | 500,000         | 200,000         |
| Total batch size   | 3328            | 384             |
| Peak LR            | 5e-4            | 5e-5            |
| Warmup steps       | 24,000          |                 |
| LR schedule        | Linear decay    |                 |
| Adam beta 1        | 0.9             |                 |
| Adam beta 2        | 0.98            |                 |
| Adam eps           | 1e-6            |                 |
| MLM prob           | 0.30            |                 |
| Gradient clipping  | 1.0             |                 |
| Weight decay       | 1e-5            |                 |
| line_by_line       | True            |                 |
Blank cells in the stage 2 column indicate the same value as in stage 1.
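The exact pre-training pipeline lives in the repository linked above. Purely as an illustration of how the MLM settings in the table map onto the transformers API, the sketch below builds a masking collator with the listed 30% probability; the collator usage is an assumption, not the actual training code.
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")

# 30% masking probability, matching the "MLM prob" row above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

# Each text is tokenized separately (cf. line_by_line=True) and truncated to the stage-1 length.
examples = [
    tokenizer(t, truncation=True, max_length=1024)
    for t in ["日本の首都は東京です。", "富士山は日本で一番高い山です。"]
]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```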
## Evaluation
JSTS, JNLI, and JCoLA from [JGLUE](https://aclanthology.org/2022.lrec-1.317/) were used for evaluation.
The evaluation code is available at [llm-jp/llm-jp-modernbert](https://github.com/llm-jp/llm-jp-modernbert).
| Model | JSTS (pearson) | JNLI (accuracy) | JCoLA (accuracy) | Avg |
|-------------------------------------------------------|--------|--------|---------|--------------|
| tohoku-nlp/bert-base-japanese-v3 | 0.920 | 0.912 | 0.880 | 0.904 |
| sbintuitions/modernbert-ja-130m | 0.916 | 0.927 | 0.868 | 0.904 |
| sbintuitions/modernbert-ja-310m | **0.932** | **0.933** | **0.883** | **0.916** |
| **llm-jp/llm-jp-modernbert-base** | 0.918 | 0.913 | 0.844 | 0.892 |
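The actual evaluation setup is in the repository linked above. As a minimal sketch only, the snippet below shows how the checkpoint can be loaded as a sequence classifier for a 3-way NLI task such as JNLI; the classification head is randomly initialized and must be fine-tuned before the logits are meaningful, and the example sentences are hypothetical.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# JNLI is a 3-way classification task (entailment / contradiction / neutral).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

premise = "男性がギターを弾いている。"    # "A man is playing a guitar."
hypothesis = "人が楽器を演奏している。"   # "A person is playing an instrument."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
logits = model(**inputs).logits  # head is untrained; fine-tune on JNLI before use
```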
## License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Citation
```bibtex
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length},
author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
year={2025},
eprint={2504.15544},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.15544},
}
```