|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- ja |
|
--- |
|
|
|
# llm-jp-modernbert-base |
|
|
|
This model is based on the [ModernBERT-base](https://arxiv.org/abs/2412.13663) architecture and uses the [llm-jp-tokenizer](https://github.com/llm-jp/llm-jp-tokenizer).
|
It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192 tokens.
|
|
|
For details on the training methods, evaluation, and analysis, please refer to the paper [llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length](https://arxiv.org/abs/2504.15544).
|
|
|
## Usage |
|
|
|
Install the transformers library (version 4.48.0 or later is required for ModernBERT support):
|
```bash |
|
pip install "transformers>=4.48.0" |
|
``` |
|
|
|
If your GPU supports FlashAttention 2, installing flash-attn is recommended:
|
```bash
|
pip install flash-attn --no-build-isolation |
|
``` |
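
With flash-attn installed, you can optionally request FlashAttention 2 via the `attn_implementation` argument when loading the model. A minimal sketch (bfloat16 here is an illustrative choice, not a requirement of this model):

```python
import torch
from transformers import AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base"

# Ask transformers to use FlashAttention 2; this needs a compatible GPU
# and the flash-attn package installed above.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```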
|
|
|
Using AutoModelForMaskedLM: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
model_id = "llm-jp/llm-jp-modernbert-base" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForMaskedLM.from_pretrained(model_id) |
|
|
|
text = "日本の首都は<MASK|LLM-jp>です。" |
|
inputs = tokenizer(text, return_tensors="pt") |
|
outputs = model(**inputs) |
|
|
|
# To get predictions for the mask: |
|
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id) |
|
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
|
predicted_token = tokenizer.decode(predicted_token_id) |
|
print("Predicted token:", predicted_token) |
|
# Predicted token: 東京 |
|
``` |
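
The same mask prediction can also be run more compactly through the standard `fill-mask` pipeline; a short, equivalent sketch:

```python
from transformers import pipeline

# The fill-mask pipeline wraps the model and tokenizer shown above.
fill_mask = pipeline("fill-mask", model="llm-jp/llm-jp-modernbert-base")

# Print the top 3 candidate tokens for the masked position.
for prediction in fill_mask("日本の首都は<MASK|LLM-jp>です。", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```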
|
|
|
|
|
## Training |
|
|
|
This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2. |
|
|
|
Training code can be found at https://github.com/llm-jp/llm-jp-modernbert |
|
|
|
| Model | stage 1 | stage 2 | |
|
|:------------------ |----------------:|----------------:| |
|
| max_seq_len | 1024 | 8192 | |
|
| max_steps | 500,000 | 200,000 | |
|
| Total batch size | 3328 | 384 | |
|
| Peak LR | 5e-4 | 5e-5 | |
|
| Warmup steps | 24,000 | |
|
| LR schedule | Linear decay | | |
|
| Adam beta 1 | 0.9 | | |
|
| Adam beta 2 | 0.98 | | |
|
| Adam eps | 1e-6 | | |
|
| MLM prob | 0.30 | | |
|
| Gradient clipping | 1.0 | | |
|
| Weight decay | 1e-5 | |
|
| line_by_line | True | | |
|
|
|
Blank entries in the stage 2 column indicate the same value as in stage 1.
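
For reference, the 0.30 MLM probability corresponds to the standard masked-language-modeling collator in transformers. A minimal sketch of that masking setup (illustrative only; the actual training pipeline is in the repository linked above):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")

# Mask 30% of tokens for the MLM objective, matching the table above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

# Collate a single tokenized example; `labels` marks the positions to predict.
batch = collator([tokenizer("日本の首都は東京です。")])
print(batch["input_ids"].shape, batch["labels"].shape)
```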
|
|
|
## Evaluation |
|
|
|
JSTS, JNLI, and JCoLA from [JGLUE](https://aclanthology.org/2022.lrec-1.317/) were used for evaluation.
|
Evaluation code can be found at https://github.com/llm-jp/llm-jp-modernbert |
|
|
|
| Model | JSTS (pearson) | JNLI (accuracy) | JCoLA (accuracy) | Avg | |
|
|-------------------------------------------------------|--------|--------|---------|--------------| |
|
| tohoku-nlp/bert-base-japanese-v3 | 0.920 | 0.912 | 0.880 | 0.904 | |
|
| sbintuitions/modernbert-ja-130m | 0.916 | 0.927 | 0.868 | 0.904 | |
|
| sbintuitions/modernbert-ja-310m | **0.932** | **0.933** | **0.883** | **0.916** | |
|
| **llm-jp/llm-jp-modernbert-base** | 0.918 | 0.913 | 0.844 | 0.892 | |
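
The JGLUE tasks above are sentence(-pair) classification or regression tasks, so evaluation fine-tunes the checkpoint with a task-specific head. A minimal sketch of attaching a classification head (the sentence pair and the 3-label setup are illustrative, not the exact evaluation configuration):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Randomly initialized 3-label head, as in an NLI-style task such as JNLI.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Encode a (premise, hypothesis) sentence pair and run a forward pass.
inputs = tokenizer("男性が公園を走っている。", "男性が屋外で運動している。", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3])
```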
|
|
|
## License
|
|
|
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
## Citation |
|
|
|
```bibtex
|
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained, |
|
title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length}, |
|
author={Issa Sugiura and Kouta Nakayama and Yusuke Oda}, |
|
year={2025}, |
|
eprint={2504.15544}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2504.15544}, |
|
} |
|
``` |