---
library_name: transformers
license: apache-2.0
language:
- ja
---
# llm-jp-modernbert-base
This model is based on the [ModernBERT-base](https://arxiv.org/abs/2412.13663) architecture and uses the [llm-jp-tokenizer](https://github.com/llm-jp/llm-jp-tokenizer).
It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.
For details on the training methods, evaluation, and analysis, please refer to [llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length](https://arxiv.org/abs/2504.15544).
## Usage
Please install the transformers library.
```bash
pip install "transformers>=4.48.0"
```
If your GPU supports FlashAttention 2, installing flash-attn is also recommended.
```bash
pip install flash-attn --no-build-isolation
```
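With flash-attn installed, the model can be loaded with FlashAttention 2 enabled through the standard `attn_implementation` argument of `from_pretrained`. This is a minimal sketch; bfloat16 is assumed here because FlashAttention requires half precision.
```python
# Optional: load the model with FlashAttention 2 (requires flash-attn and a compatible GPU).
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "llm-jp/llm-jp-modernbert-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```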
Using AutoModelForMaskedLM:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# To get predictions for the masked token:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: 東京
```
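The checkpoint can also be used as a plain encoder to obtain sentence embeddings. The sketch below loads it with `AutoModel` and applies mean pooling over non-padding tokens; the pooling choice is an assumption for illustration, not something prescribed by this model card.
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["日本の首都は東京です。", "富士山は日本で一番高い山です。"]
# The model accepts sequences of up to 8192 tokens.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean pooling over non-padding tokens (one common choice, assumed here for illustration).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, hidden_size])
```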
## Training
This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.
The training code is available at [llm-jp/llm-jp-modernbert](https://github.com/llm-jp/llm-jp-modernbert).
| Hyperparameter     | Stage 1         | Stage 2         |
|:------------------ |----------------:|----------------:|
| max_seq_len        | 1024            | 8192            |
| max_steps          | 500,000         | 200,000         |
| Total batch size   | 3328            | 384             |
| Peak LR            | 5e-4            | 5e-5            |
| Warmup steps       | 24,000          |                 |
| LR schedule        | Linear decay    |                 |
| Adam beta 1        | 0.9             |                 |
| Adam beta 2        | 0.98            |                 |
| Adam eps           | 1e-6            |                 |
| MLM prob           | 0.30            |                 |
| Gradient clipping  | 1.0             |                 |
| Weight decay       | 1e-5            |                 |
| line_by_line       | True            |                 |
Blank cells in the stage 2 column indicate the same value as in stage 1.
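The exact pre-training pipeline lives in the repository linked above. Purely as an illustration of how the MLM settings in the table map onto the transformers API, the sketch below builds a masking collator with the listed 30% probability; the collator usage is an assumption, not the actual training code.
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")

# 30% masking probability, matching the "MLM prob" row above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

# Each text is tokenized separately (cf. line_by_line=True) and truncated to the stage-1 length.
examples = [
    tokenizer(t, truncation=True, max_length=1024)
    for t in ["日本の首都は東京です。", "富士山は日本で一番高い山です。"]
]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```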
## Evaluation
JSTS, JNLI, and JCoLA from [JGLUE](https://aclanthology.org/2022.lrec-1.317/) were used for evaluation.
The evaluation code is available at [llm-jp/llm-jp-modernbert](https://github.com/llm-jp/llm-jp-modernbert).
| Model | JSTS (pearson) | JNLI (accuracy) | JCoLA (accuracy) | Avg |
|-------------------------------------------------------|--------|--------|---------|--------------|
| tohoku-nlp/bert-base-japanese-v3 | 0.920 | 0.912 | 0.880 | 0.904 |
| sbintuitions/modernbert-ja-130m | 0.916 | 0.927 | 0.868 | 0.904 |
| sbintuitions/modernbert-ja-310m | **0.932** | **0.933** | **0.883** | **0.916** |
| **llm-jp/llm-jp-modernbert-base** | 0.918 | 0.913 | 0.844 | 0.892 |
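The actual evaluation setup is in the repository linked above. As a minimal sketch only, the snippet below shows how the checkpoint can be loaded as a sequence classifier for a 3-way NLI task such as JNLI; the classification head is randomly initialized and must be fine-tuned before the logits are meaningful, and the example sentences are hypothetical.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# JNLI is a 3-way classification task (entailment / contradiction / neutral).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

premise = "男性がギターを弾いている。"    # "A man is playing a guitar."
hypothesis = "人が楽器を演奏している。"   # "A person is playing an instrument."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
logits = model(**inputs).logits  # head is untrained; fine-tune on JNLI before use
```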
## License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Citation
```bibtex
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length},
author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
year={2025},
eprint={2504.15544},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.15544},
}
```