Ruri: Japanese General Text Embeddings
First install the Sentence Transformers library, along with the Japanese tokenization dependencies (fugashi with the unidic-lite dictionary provides the MeCab-based word segmentation that the underlying Japanese BERT tokenizer requires):

```bash
pip install -U sentence-transformers fugashi sentencepiece unidic-lite
```
Then you can load this model and run inference.
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-large")

# Don't forget to add the prefix "クエリ: " for query-side
# or "文章: " for passage-side texts.
sentences = [
    # "Query: What kind of color is ruri (lapis lazuli) color?"
    "クエリ: 瑠璃色はどんな色？",
    # "Passage: Ruri color is a deep blue tinged with purple, named after the
    # semi-precious stone lapis lazuli; JIS color names define it as dp-pB."
    "文章: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
    # "Query: What is the general term for large birds with sharp beaks and
    # talons, such as eagles and hawks?"
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    # "Passage: Eagles, hawks, vultures, falcons, condors, and owls are
    # representative; around Linnaeus's time (17th-18th c.) these raptors were
    # grouped into a single order of four genera: vultur, falco, strix, lanius."
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは猛禽らしい鳥を単一の目（もく）にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めていた。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# [4, 1024]

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9429, 0.6565, 0.6997],
#  [0.9429, 1.0000, 0.6579, 0.6768],
#  [0.6565, 0.6579, 1.0000, 0.8933],
#  [0.6997, 0.6768, 0.8933, 1.0000]]
```
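The diagonal is each sentence's similarity with itself; the off-diagonal entries give query-passage similarities, which is what you would use for retrieval. Below is a minimal retrieval sketch along the same lines; the corpus and query strings are illustrative, not part of the original example:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cl-nagoya/ruri-large")

# Illustrative corpus: passages take the "文章: " prefix, queries "クエリ: ".
passages = [
    "文章: 瑠璃色は、紫みを帯びた濃い青である。",  # "Ruri color is a deep purplish blue."
    "文章: ワシやタカは大型の猛禽類である。",      # "Eagles and hawks are large raptors."
]
query = "クエリ: 瑠璃色はどんな色？"               # "What kind of color is ruri color?"

p_emb = model.encode(passages, convert_to_tensor=True)  # [2, 1024]
q_emb = model.encode([query], convert_to_tensor=True)   # [1, 1024]

# Rank passages by cosine similarity to the query and take the best hit.
scores = torch.nn.functional.cosine_similarity(q_emb, p_emb, dim=1)  # [2]
best = int(scores.argmax())
print(passages[best], float(scores[best]))
```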
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB), the Japanese Massive Text Embedding Benchmark.
| Model | #Param. | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| cl-nagoya/sup-simcse-ja-base | 111M | 68.56 | 49.64 | 82.05 | 73.47 | 91.83 | 51.79 | 62.57 |
| cl-nagoya/sup-simcse-ja-large | 337M | 66.51 | 37.62 | 83.18 | 73.73 | 91.48 | 50.56 | 62.51 |
| cl-nagoya/unsup-simcse-ja-base | 111M | 65.07 | 40.23 | 78.72 | 73.07 | 91.16 | 44.77 | 62.44 |
| cl-nagoya/unsup-simcse-ja-large | 337M | 66.27 | 40.53 | 80.56 | 74.66 | 90.95 | 48.41 | 62.49 |
| pkshatech/GLuCoSE-base-ja | 133M | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| sentence-transformers/LaBSE | 472M | 64.70 | 40.12 | 76.56 | 72.66 | 91.63 | 44.88 | 62.33 |
| intfloat/multilingual-e5-small | 118M | 69.52 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| intfloat/multilingual-e5-base | 278M | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| intfloat/multilingual-e5-large | 560M | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| OpenAI/text-embedding-ada-002 | - | 69.48 | 64.38 | 79.02 | 69.75 | 93.04 | 48.30 | 62.40 |
| OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| Ruri-Small | 68M | 71.53 | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 |
| Ruri-Base | 111M | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| Ruri-Large (this model) | 337M | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
Full model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
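The Pooling module above is configured for mean pooling (`pooling_mode_mean_tokens: True`) over the 1024-dimensional token embeddings, with padding positions masked out. As a rough illustration (a minimal sketch of the operation, not the library's internal code):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings into one sentence vector, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()    # [batch, seq_len, 1]
    summed = (token_embeddings * mask).sum(dim=1)  # [batch, 1024]
    counts = mask.sum(dim=1).clamp(min=1e-9)       # token counts; avoid div by zero
    return summed / counts                         # [batch, 1024]
```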
Citation:

```bibtex
@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}},
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737},
}
```
This model is published under the Apache License, Version 2.0.
Base model: tohoku-nlp/bert-base-japanese-v3