---
library_name: transformers
license: apache-2.0
language:
- ja
---
# llm-jp-modernbert-base
This model is based on the [ModernBERT-base](https://arxiv.org/abs/2412.13663) architecture and uses the [llm-jp-tokenizer](https://github.com/llm-jp/llm-jp-tokenizer).
It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192 tokens.
For details on the training methods, evaluation, and analysis, see [llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length](https://arxiv.org/abs/2504.15544).
## Usage
Please install the transformers library.
```bash
pip install "transformers>=4.48.0"
```
If your GPU supports FlashAttention 2, installing flash-attn is recommended:
```bash
pip install flash-attn --no-build-isolation
```
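With flash-attn installed, the backend can be requested at load time. A minimal sketch, assuming the generic transformers `attn_implementation` argument rather than anything specific to this model:
```python
import torch
from transformers import AutoModelForMaskedLM

# Requires a CUDA GPU and the flash-attn package installed above;
# the FlashAttention 2 backend needs fp16 or bf16 weights.
model = AutoModelForMaskedLM.from_pretrained(
    "llm-jp/llm-jp-modernbert-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```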
Using AutoModelForMaskedLM:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# <MASK|LLM-jp> is the mask token used by the llm-jp-tokenizer.
text = "日本の首都は<MASK|LLM-jp>です。"  # "The capital of Japan is <MASK|LLM-jp>."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the mask position and take the highest-scoring token there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: 東京
```
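The same prediction can also be obtained with the standard fill-mask pipeline; a minimal sketch assuming the generic transformers pipeline API:
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="llm-jp/llm-jp-modernbert-base")
# Returns the top candidates with scores for the mask position.
for candidate in fill_mask("日本の首都は<MASK|LLM-jp>です。"):
    print(candidate["token_str"], candidate["score"])
```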
## Training
The model was trained in two stages: stage 1 with a max_seq_len of 1024, then stage 2 with a max_seq_len of 8192.
The training code is available at https://github.com/llm-jp/llm-jp-modernbert.
| Hyperparameter    | Stage 1      | Stage 2 |
|:------------------|-------------:|--------:|
| max_seq_len       | 1024         | 8192    |
| max_steps         | 500,000      | 200,000 |
| Total batch size  | 3328         | 384     |
| Peak LR           | 5e-4         | 5e-5    |
| Warmup steps      | 24,000       |         |
| LR schedule       | Linear decay |         |
| Adam beta 1       | 0.9          |         |
| Adam beta 2       | 0.98         |         |
| Adam eps          | 1e-6         |         |
| MLM prob          | 0.30         |         |
| Gradient clipping | 1.0          |         |
| Weight decay      | 1e-5         |         |
| line_by_line      | True         |         |

Blank cells in stage 2 indicate the same value as in stage 1.
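For reference, a minimal sketch of an optimizer and schedule matching the stage 1 hyperparameters above, assuming plain PyTorch and the transformers scheduler helper (an illustration, not the actual training code, which is in the repository linked above):
```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # hypothetical stand-in for the real model

# Stage 1 settings from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,            # peak LR
    betas=(0.9, 0.98),  # Adam beta 1 / beta 2
    eps=1e-6,           # Adam eps
    weight_decay=1e-5,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=24_000,
    num_training_steps=500_000,  # max_steps
)

# Each training step would also clip gradients before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```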
## Evaluation
The model was evaluated on JSTS, JNLI, and JCoLA from [JGLUE](https://aclanthology.org/2022.lrec-1.317/).
The evaluation code is available at https://github.com/llm-jp/llm-jp-modernbert.
| Model | JSTS (Pearson) | JNLI (accuracy) | JCoLA (accuracy) | Avg |
|-------------------------------------------------------|--------|--------|---------|--------------|
| tohoku-nlp/bert-base-japanese-v3 | 0.920 | 0.912 | 0.880 | 0.904 |
| sbintuitions/modernbert-ja-130m | 0.916 | 0.927 | 0.868 | 0.904 |
| sbintuitions/modernbert-ja-310m | **0.932** | **0.933** | **0.883** | **0.916** |
| **llm-jp/llm-jp-modernbert-base** | 0.918 | 0.913 | 0.844 | 0.892 |
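A hedged sketch of how a fine-tuning run for one of these tasks can be set up with the generic transformers sequence-classification head (the 3-way label count matches JNLI's labels; the rest is illustrative and not the evaluation code linked above):
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# JNLI is 3-way classification (entailment / contradiction / neutral).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Sentence pairs are encoded together; fine-tuning then proceeds with any
# standard loop or transformers.Trainer.
inputs = tokenizer("今日は晴れです。", "天気が良い。", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 3)
```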
## License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Citation
```bibtex
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length},
author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
year={2025},
eprint={2504.15544},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.15544},
}
```