
Ruri: Japanese General Text Embeddings

Usage

First install the Sentence Transformers library, along with the Japanese tokenization dependencies (fugashi, sentencepiece, unidic-lite):

pip install -U sentence-transformers fugashi sentencepiece unidic-lite

Then you can load this model and run inference:

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-large")

# Don't forget to add the prefix "クエリ: " for query-side texts or "文章: " for passage-side texts.
sentences = [
    # Query/passage pair 1: "What kind of color is 瑠璃色 (lapis lazuli blue)?"
    "クエリ: 瑠璃色はどんな色？",
    "文章: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
    # Query/passage pair 2: "What is the collective term for large birds with sharp beaks and talons, such as eagles and hawks?"
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([4, 1024])

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9429, 0.6565, 0.6997],
#  [0.9429, 1.0000, 0.6579, 0.6768],
#  [0.6565, 0.6579, 1.0000, 0.8933],
#  [0.6997, 0.6768, 0.8933, 1.0000]]
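
For retrieval, the same prefix convention applies: encode queries with "クエリ: " and candidate passages with "文章: ", then rank passages by similarity. Below is a minimal sketch with an invented two-passage corpus; Sentence Transformers 3.x provides model.similarity, which defaults to cosine similarity.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cl-nagoya/ruri-large")

# Toy corpus, invented for illustration; replace with your own passages.
corpus = [
    "瑠璃色（るりいろ）は、紫みを帯びた濃い青。",
    "ワシやタカなどの猛禽類は、鋭いくちばしと爪を持つ。",
]

query_emb = model.encode(["クエリ: 瑠璃色はどんな色？"], convert_to_tensor=True)
passage_embs = model.encode(["文章: " + p for p in corpus], convert_to_tensor=True)

# Rank passages by cosine similarity to the query (higher = more relevant).
scores = model.similarity(query_emb, passage_embs)[0]
best = int(scores.argmax())
print(best, float(scores[best]))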

Benchmarks

JMTEB

Evaluated with JMTEB (the Japanese Massive Text Embedding Benchmark).

| Model | #Param. | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| cl-nagoya/sup-simcse-ja-base | 111M | 68.56 | 49.64 | 82.05 | 73.47 | 91.83 | 51.79 | 62.57 |
| cl-nagoya/sup-simcse-ja-large | 337M | 66.51 | 37.62 | 83.18 | 73.73 | 91.48 | 50.56 | 62.51 |
| cl-nagoya/unsup-simcse-ja-base | 111M | 65.07 | 40.23 | 78.72 | 73.07 | 91.16 | 44.77 | 62.44 |
| cl-nagoya/unsup-simcse-ja-large | 337M | 66.27 | 40.53 | 80.56 | 74.66 | 90.95 | 48.41 | 62.49 |
| pkshatech/GLuCoSE-base-ja | 133M | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| sentence-transformers/LaBSE | 472M | 64.70 | 40.12 | 76.56 | 72.66 | 91.63 | 44.88 | 62.33 |
| intfloat/multilingual-e5-small | 118M | 69.52 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| intfloat/multilingual-e5-base | 278M | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| intfloat/multilingual-e5-large | 560M | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| OpenAI/text-embedding-ada-002 | - | 69.48 | 64.38 | 79.02 | 69.75 | 93.04 | 48.30 | 62.40 |
| OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| Ruri-Small | 68M | 71.53 | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 |
| Ruri-Base | 111M | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| Ruri-Large (this model) | 337M | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |

Model Details

Model Description

  • Model Type: Sentence Transformer (BERT-based)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
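
The Pooling block above takes the attention-masked mean of the BERT token embeddings. For illustration, here is a rough sketch of equivalent manual pooling with the transformers library, assuming the checkpoint loads directly via AutoModel (in practice, just use SentenceTransformer as shown in Usage):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-large")
bert = AutoModel.from_pretrained("cl-nagoya/ruri-large")

batch = tokenizer(
    ["クエリ: 瑠璃色はどんな色？"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embs = bert(**batch).last_hidden_state  # (batch, seq_len, 1024)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embedding = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 1024])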

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.0
  • Transformers: 4.41.2
  • PyTorch: 2.3.1+cu118
  • Accelerate: 0.30.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1
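
To reproduce this environment, the library versions above can be pinned at install time. A sketch: the cu118 index below matches the 2.3.1+cu118 build listed, so substitute the PyTorch index for your own platform.

pip install "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu118
pip install "sentence-transformers==3.0.0" "transformers==4.41.2" "accelerate==0.30.1" "datasets==2.19.1" "tokenizers==0.19.1"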

Citation

@misc{ruri,
  title={{Ruri: Japanese General Text Embeddings}},
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737},
}

License

This model is published under the Apache License, Version 2.0.
