SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the statictable-triplets-all dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Number of Parameters: 118M (F32)
  • Training Dataset: statictable-triplets-all

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
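
Concretely, the Transformer module produces per-token embeddings and the Pooling module averages them into a single 384-dimensional vector. As an illustration only, here is a minimal sketch of the same computation using the transformers library directly; the input sentence is an arbitrary example:

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "yahyaabd/paraphrase-multilingual-miniLM-L12-v2-mnrl-beir-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # loads the underlying BertModel

# Tokenize with the same 128-token limit as the Transformer module above
batch = tokenizer(
    ["Ringkasan Neraca Arus Dana, Triwulan Kedua, 2005"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # [batch, seq_len, 384]

# Mean pooling: average the token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embedding.shape)  # torch.Size([1, 384])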

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/paraphrase-multilingual-miniLM-L12-v2-mnrl-beir-2")
# Run inference
sentences = [
    # "Brief cash flow report for Q2 2005, in billions"
    'Laporan singkat arus kas Q2 2005, dalam miliar',
    # "Summary of the Flow of Funds Accounts, Second Quarter 2005 (Billion Rupiah)"
    'Ringkasan Neraca Arus Dana, Triwulan Kedua, 2005, (Miliar Rupiah)',
    # "Average Monthly Net Wage/Salary of Workers/Employees by Highest Education and Main Working Hours, 2020"
    'Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut Pendidikan Tertinggi dan Jumlah Jam Kerja Utama, 2020',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
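
Since the card lists semantic search among the use cases, here is a minimal sketch continuing from the snippet above, using sentence_transformers.util.semantic_search over a toy two-document corpus built from the example sentences:

from sentence_transformers import util

# Toy corpus reusing the example sentences; any list of documents works
corpus = [
    'Ringkasan Neraca Arus Dana, Triwulan Kedua, 2005, (Miliar Rupiah)',
    'Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut Pendidikan Tertinggi dan Jumlah Jam Kerja Utama, 2020',
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode('Laporan singkat arus kas Q2 2005, dalam miliar', convert_to_tensor=True)

# One ranked hit list per query; each hit carries 'corpus_id' and 'score'
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit['score'], 3), corpus[hit['corpus_id']])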

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.899
cosine_accuracy@5    0.9837
cosine_accuracy@10   1.0
cosine_precision@1   0.899
cosine_precision@5   0.2189
cosine_precision@10  0.1261
cosine_recall@1      0.703
cosine_recall@5      0.789
cosine_recall@10     0.8116
cosine_ndcg@1        0.899
cosine_ndcg@5        0.8179
cosine_ndcg@10       0.8156
cosine_mrr@1         0.899
cosine_mrr@5         0.9348
cosine_mrr@10        0.9369
cosine_map@1         0.899
cosine_map@5         0.7721
cosine_map@10        0.7636
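
These are the standard outputs of the library's InformationRetrievalEvaluator (the training logs below report the same evaluator under the name bps-statictable-ir). A minimal sketch of how such metrics are produced, with toy IDs and documents; it illustrates the mechanism and does not reproduce the numbers above:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Toy query, corpus, and relevance judgments with made-up IDs
queries = {'q1': 'Laporan singkat arus kas Q2 2005, dalam miliar'}
corpus = {
    'd1': 'Ringkasan Neraca Arus Dana, Triwulan Kedua, 2005, (Miliar Rupiah)',
    'd2': 'Rata-rata Harga Valuta Asing Terpilih menurut Provinsi, 2014',
}
relevant_docs = {'q1': {'d1'}}

model = SentenceTransformer("yahyaabd/paraphrase-multilingual-miniLM-L12-v2-mnrl-beir-2")
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="bps-statictable-ir")
results = evaluator(model)
print(results)  # keys like 'bps-statictable-ir_cosine_ndcg@10'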

Training Details

Training Dataset

statictable-triplets-all

  • Dataset: statictable-triplets-all at 24979b4
  • Size: 967,831 training samples
  • Columns: query, pos, and neg
  • Approximate statistics based on the first 1000 samples:

              query               pos                 neg
    type      string              string              string
    details   min: 4 tokens       min: 4 tokens       min: 4 tokens
              mean: 18.55 tokens  mean: 25.6 tokens   mean: 25.7 tokens
              max: 37 tokens      max: 58 tokens      max: 58 tokens
  • Samples:
    • query: Indeks harga petani (diterima & dibayar) dan NTP per provinsi, 2012
      pos: Indeks Harga yang Diterima Petani (It), Indeks Harga yang Dibayar Petani (Ib), dan Nilai Tukar Petani (NTP) Menurut Provinsi, 2008-2016
      neg: Persentase Rumah Tangga Menurut Provinsi dan Kebiasaan Memanfaatkan Air Bekas untuk Keperluan Lain, 2013, 2014, 2017, 2021
    • query: Data rumah tangga perikanan budidaya Indonesia, detail per provinsi dan jenis budidaya, di tahun 2008
      pos: Jumlah Rumah Tangga Perikanan Budidaya Menurut Provinsi dan Jenis Budidaya, 2000-2016
      neg: Ringkasan Neraca Arus Dana, 2005, (Miliar Rupiah)
    • query: Lapangan pekerjaan vs pendidikan pekerja (15 tahun ke atas), 1986 hingga 1996
      pos: Penduduk Berumur 15 Tahun Ke Atas yang Bekerja Selama Seminggu yang Lalu Menurut Lapangan Pekerjaan Utama dan Pendidikan Tertinggi yang Ditamatkan, 1986-1996
      neg: Tabel Input-Output Indonesia Transaksi Domestik Atas Dasar Harga Produsen (17 Lapangan Usaha), 2016 (Juta Rupiah)
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
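
For reference, a minimal sketch of constructing this loss with the parameters listed above; the base model id is the one this model was finetuned from:

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# For each (query, pos, neg) triplet, the query is scored against its own
# positive plus all other in-batch positives/negatives; cosine similarities
# are multiplied by scale=20.0 before the cross-entropy objective.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)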
    

Evaluation Dataset

statictable-triplets-all

  • Dataset: statictable-triplets-all at 24979b4
  • Size: 967,831 evaluation samples
  • Columns: query, pos, and neg
  • Approximate statistics based on the first 1000 samples:

              query               pos                 neg
    type      string              string              string
    details   min: 5 tokens       min: 4 tokens       min: 5 tokens
              mean: 18.38 tokens  mean: 25.28 tokens  mean: 25.65 tokens
              max: 37 tokens      max: 58 tokens      max: 58 tokens
  • Samples:
    • query: Bagaimana hubungan IHK dan rata-rata upah buruh industri (bukan supervisor) bulanan tahun 2010, acuan 1996?
      pos: IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor (Supervisor), 1996-2014 (1996=100)
      neg: Rata-rata Harga Valuta Asing Terpilih menurut Provinsi, 2014
    • query: Berapa rata-rata gaji bulanan pekerja Indonesia berdasarkan ijazah terakhir dan sektor pekerjaannya (2017)?
      pos: Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut Pendidikan Tertinggi yang Ditamatkan dan Lapangan Pekerjaan Utama di 9 Sektor (rupiah), 2017
      neg: Rata-Rata Pengeluaran per Kapita Sebulan Menurut Kelompok Barang (rupiah), 2013-2021
    • query: Data luas lahan (hektar) yang dipakai untuk jenis budidaya perikanan di tiap provinsi tahun 2009
      pos: Luas Area Usaha Budidaya Perikanan Menurut Provinsi dan Jenis Budidaya (ha), 2005-2016
      neg: Ringkasan Neraca Arus Dana, Triwulan I, 2008, (Miliar Rupiah)
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • weight_decay: 0.01
  • num_train_epochs: 2
  • lr_scheduler_type: reduce_lr_on_plateau
  • lr_scheduler_kwargs: {'factor': 0.5, 'patience': 2}
  • warmup_steps: 10000
  • save_on_each_node: True
  • fp16: True
  • dataloader_num_workers: 2
  • load_best_model_at_end: True
  • eval_on_start: True
  • batch_sampler: no_duplicates
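
A minimal sketch wiring these non-default values into the sentence-transformers 3.x trainer; the dataset repo id, split choice, and output_dir are assumptions, and omitted arguments keep their defaults:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

# Assumed repo id; the card only states "statictable-triplets-all at 24979b4"
dataset = load_dataset("yahyaabd/statictable-triplets-all", revision="24979b4")

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
loss = MultipleNegativesRankingLoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=2,
    lr_scheduler_type="reduce_lr_on_plateau",
    lr_scheduler_kwargs={"factor": 0.5, "patience": 2},
    warmup_steps=10000,
    fp16=True,
    dataloader_num_workers=2,
    load_best_model_at_end=True,
    eval_on_start=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["train"],  # assumed: the card reports eval on the same dataset
    loss=loss,
)
trainer.train()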

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: reduce_lr_on_plateau
  • lr_scheduler_kwargs: {'factor': 0.5, 'patience': 2}
  • warmup_ratio: 0.0
  • warmup_steps: 10000
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: True
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 2
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss bps-statictable-ir_cosine_ndcg@10
0 0 - 1.0819 0.4643
0.0334 200 0.1373 - -
0.0668 400 0.0354 - -
0.0835 500 - 0.0132 0.8252
0.1002 600 0.028 - -
0.1336 800 0.018 - -
0.1670 1000 0.0145 0.0096 0.8286
0.2004 1200 0.0089 - -
0.2338 1400 0.0103 - -
0.2505 1500 - 0.0067 0.8312
0.2672 1600 0.0098 - -
0.3006 1800 0.0086 - -
0.3339 2000 0.0086 0.0044 0.8246
0.3673 2200 0.0088 - -
0.4007 2400 0.0075 - -
0.4174 2500 - 0.0051 0.8295
0.4341 2600 0.0066 - -
0.4675 2800 0.0054 - -
0.5009 3000 0.0051 0.0059 0.8294
0.5343 3200 0.0052 - -
0.5677 3400 0.0037 - -
0.5844 3500 - 0.0041 0.8126
0.6011 3600 0.0078 - -
0.6345 3800 0.005 - -
0.6679 4000 0.0045 0.0050 0.8308
0.7013 4200 0.0047 - -
0.7347 4400 0.0066 - -
0.7514 4500 - 0.0033 0.8233
0.7681 4600 0.0043 - -
0.8015 4800 0.003 - -
0.8349 5000 0.0029 0.0036 0.8224
0.8683 5200 0.0014 - -
0.9017 5400 0.0058 - -
0.9184 5500 - 0.0020 0.8169
0.9350 5600 0.0045 - -
0.9684 5800 0.0036 - -
1.0018 6000 0.0053 0.0018 0.8152
1.0352 6200 0.0035 - -
1.0686 6400 0.0017 - -
1.0853 6500 - 0.0024 0.8231
1.1020 6600 0.0037 - -
1.1354 6800 0.0044 - -
1.1688 7000 0.0011 0.0113 0.8153
1.2022 7200 0.0042 - -
1.2356 7400 0.0028 - -
1.2523 7500 - 0.0046 0.8253
1.2690 7600 0.0005 - -
1.3024 7800 0.001 - -
1.3358 8000 0.0011 0.0017 0.8216
1.3692 8200 0.0007 - -
1.4026 8400 0.0014 - -
1.4193 8500 - 0.0014 0.8253
1.4360 8600 0.0003 - -
1.4694 8800 0.0005 - -
1.5028 9000 0.002 0.0012 0.8250
1.5361 9200 0.0013 - -
1.5695 9400 0.0009 - -
1.5862 9500 - 0.0003 0.8162
1.6029 9600 0.0021 - -
1.6363 9800 0.0013 - -
1.6697 10000 0.0005 0.0003 0.8234
1.7031 10200 0.0004 - -
1.7365 10400 0.0004 - -
1.7532 10500 - 0.0001 0.8225
1.7699 10600 0.0011 - -
1.8033 10800 0.0004 - -
1.8367 11000 0.0009 0.0008 0.8259
1.8701 11200 0.0024 - -
1.9035 11400 0.0002 - -
1.9202 11500 - 0.0008 0.8156
1.9369 11600 0.0007 - -
1.9703 11800 0.0007 - -
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}