metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:967831
  - loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
  - source_sentence: >-
      Penghasilan rata-rata pelaku usaha mandiri: Analisis berdasarkan lokasi
      dan jenjang pendidikan, 2023
    sentences:
      - >-
        Rata-rata Pendapatan bersih Berusaha Sendiri menurut Provinsi dan
        Pendidikan yang Ditamatkan, 2023
      - >-
        Rata-Rata Pengeluaran per Kapita Sebulan Menurut Kelompok Barang
        (rupiah), 2013-2021
      - Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)
  - source_sentence: Bagaimana traffic penerbangan internasional di Indonesia pada 2008?
    sentences:
      - Tingkat Inflasi Harga Konsumen Nasional Bulanan (M-to-M) 1 (2022=100)
      - Balita (0-59 Bulan) Menurut Status Gizi, Tahun 1998-2005
      - Lalu Lintas Penerbangan Luar Negeri Indonesia Tahun 2003-2022
  - source_sentence: >-
      Data indeks daya penyebaran dan derajat kepekaan sektor ekonomi, ambil
      contoh tahun 2005
    sentences:
      - >-
        Indeks Daya Penyebaran dan Indeks Derajat Kepekaan Menurut Sektor
        Ekonomi, 1995, 2000, 2005, dan 2010
      - Ekspor Kopi Menurut Negara Tujuan Utama, 2000-2023
      - >-
        Anggaran Kesehatan dari Direktorat Penyusunan APBN - Direktorat Jenderal
        Anggaran, Kementerian Keuangan
  - source_sentence: >-
      Data aktivitas penduduk 15 tahun ke atas berdasarkan kelompok umur, satu
      minggu ke belakang (periode 2002)
    sentences:
      - Ekspor Lada Putih menurut Negara Tujuan Utama, 2012-2023
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi
        Sulawesi Selatan, 2018-2023
      - >-
        Penduduk Berumur 15 Tahun Ke Atas Menurut Golongan Umur dan Jenis
        Kegiatan Selama Seminggu yang Lalu, 1997 - 2007
  - source_sentence: Laporan singkat arus kas Q2 2005, dalam miliar
    sentences:
      - Ringkasan Neraca Arus Dana, Triwulan Kedua, 2005, (Miliar Rupiah)
      - Indikator Pendidikan, 1994-2023
      - >-
        Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut
        Pendidikan Tertinggi dan Jumlah Jam Kerja Utama, 2020
datasets:
  - yahyaabd/statictable-triplets-all
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@1
  - cosine_ndcg@5
  - cosine_ndcg@10
  - cosine_mrr@1
  - cosine_mrr@5
  - cosine_mrr@10
  - cosine_map@1
  - cosine_map@5
  - cosine_map@10
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: bps statictable ir
          type: bps-statictable-ir
        metrics:
          - type: cosine_accuracy@1
            value: 0.8990228013029316
            name: Cosine Accuracy@1
          - type: cosine_accuracy@5
            value: 0.9837133550488599
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.8990228013029316
            name: Cosine Precision@1
          - type: cosine_precision@5
            value: 0.21889250814332245
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.12605863192182412
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.7029638149674847
            name: Cosine Recall@1
          - type: cosine_recall@5
            value: 0.789022126091837
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.8116078533769628
            name: Cosine Recall@10
          - type: cosine_ndcg@1
            value: 0.8990228013029316
            name: Cosine Ndcg@1
          - type: cosine_ndcg@5
            value: 0.8178579787978988
            name: Cosine Ndcg@5
          - type: cosine_ndcg@10
            value: 0.8156444177517035
            name: Cosine Ndcg@10
          - type: cosine_mrr@1
            value: 0.8990228013029316
            name: Cosine Mrr@1
          - type: cosine_mrr@5
            value: 0.9347991313789358
            name: Cosine Mrr@5
          - type: cosine_mrr@10
            value: 0.9368827878599865
            name: Cosine Mrr@10
          - type: cosine_map@1
            value: 0.8990228013029316
            name: Cosine Map@1
          - type: cosine_map@5
            value: 0.772128121606949
            name: Cosine Map@5
          - type: cosine_map@10
            value: 0.7635855701310564
            name: Cosine Map@10

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the statictable-triplets-all dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

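Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Training Dataset: statictable-triplets-all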

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
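
The Pooling module above enables only pooling_mode_mean_tokens, so a sentence embedding is the average of the token embeddings, with padding positions left out. The following is a minimal pure-Python sketch of that idea, using toy 3-dimensional vectors instead of the model's 384 dimensions. The function name mean_pool is illustrative and not part of the library.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        raise ValueError("attention mask excludes every token")
    return [total / kept for total in totals]


# Two real tokens followed by one padding token (toy 3-dimensional vectors).
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0, 4.0]
```

The padding vector does not influence the result because its mask entry is 0.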

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/paraphrase-multilingual-miniLM-L12-v2-mnrl-beir-2")
# Run inference
sentences = [
    'Laporan singkat arus kas Q2 2005, dalam miliar',
    'Ringkasan Neraca Arus Dana, Triwulan Kedua, 2005, (Miliar Rupiah)',
    'Rata-rata Upah/Gaji Bersih sebulan Buruh/Karyawan Pegawai Menurut Pendidikan Tertinggi dan Jumlah Jam Kerja Utama, 2020',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
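
For retrieval, a common pattern is to compare one query embedding against the embeddings of candidate table titles and sort by cosine similarity. The sketch below shows that ranking step in pure Python with toy 2-dimensional vectors standing in for the output of model.encode. The helper names cosine and rank_titles are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_titles(query_vec: Sequence[float],
                title_vecs: Sequence[Sequence[float]],
                titles: Sequence[str]) -> List[Tuple[str, float]]:
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, title_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional stand-ins for the 384-dimensional embeddings.
titles = ["Ringkasan Neraca Arus Dana", "Indikator Pendidikan"]
title_vecs = [[0.9, 0.1], [0.1, 0.9]]
query_vec = [1.0, 0.0]

ranking = rank_titles(query_vec, title_vecs, titles)
print(ranking[0][0])  # Ringkasan Neraca Arus Dana
```

With the real model, query_vec and title_vecs would come from model.encode, and model.similarity returns the same kind of pairwise scores as a matrix in one call.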

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.899
cosine_accuracy@5 0.9837
cosine_accuracy@10 1.0
cosine_precision@1 0.899
cosine_precision@5 0.2189
cosine_precision@10 0.1261
cosine_recall@1 0.703
cosine_recall@5 0.789
cosine_recall@10 0.8116
cosine_ndcg@1 0.899
cosine_ndcg@5 0.8179
cosine_ndcg@10 0.8156
cosine_mrr@1 0.899
cosine_mrr@5 0.9348
cosine_mrr@10 0.9369
cosine_map@1 0.899
cosine_map@5 0.7721
cosine_map@10 0.7636
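
The gap between accuracy@10 (1.0) and precision@10 (0.1261) is expected when each query has only one or a few relevant tables: accuracy asks whether any relevant table appears in the top k, while precision divides the number of relevant hits by k. The sketch below computes these per-query quantities under the commonly used definitions, which are assumed here to match the evaluator that produced the table. The function name and the document ids are made up for illustration.

```python
from typing import Dict, Sequence, Set


def ir_metrics_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> Dict[str, float]:
    """Accuracy, precision, recall and reciprocal rank for one query, cut off at k."""
    top_k = list(ranked_ids[:k])
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    # Reciprocal rank of the first relevant document inside the top k (0 if none).
    reciprocal_rank = 0.0
    for position, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / position
            break
    return {
        "accuracy": 1.0 if hits else 0.0,
        "precision": len(hits) / k,
        "recall": len(hits) / len(relevant_ids),
        "rr": reciprocal_rank,
    }


# Hypothetical query with two relevant tables; only one is retrieved, at rank 2.
ranked = ["t7", "t3", "t9", "t1", "t5"]
relevant = {"t3", "t8"}
print(ir_metrics_at_k(ranked, relevant, k=5))
# {'accuracy': 1.0, 'precision': 0.2, 'recall': 0.5, 'rr': 0.5}
```

Averaging each value over all evaluation queries gives dataset-level figures of the kind reported above (the mean of rr is MRR@k).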

Training Details

Training Dataset

statictable-triplets-all

  • Dataset: statictable-triplets-all at 24979b4
  • Size: 967,831 training samples
  • Columns: query, pos, and neg
  • Approximate statistics based on the first 1000 samples (all three columns are of type string):
    • query: min 4 tokens, mean 18.55 tokens, max 37 tokens
    • pos: min 4 tokens, mean 25.6 tokens, max 58 tokens
    • neg: min 4 tokens, mean 25.7 tokens, max 58 tokens
  • Samples:
    • query: Indeks harga petani (diterima & dibayar) dan NTP per provinsi, 2012
      pos: Indeks Harga yang Diterima Petani (It), Indeks Harga yang Dibayar Petani (Ib), dan Nilai Tukar Petani (NTP) Menurut Provinsi, 2008-2016
      neg: Persentase Rumah Tangga Menurut Provinsi dan KebiasaanMemanfaatkan Air Bekas untuk Keperluan Lain, 2013, 2014, 2017, 2021
    • query: Data rumah tangga perikanan budidaya Indonesia, detail per provinsi dan jenis budidaya, di tahun 2008
      pos: Jumlah Rumah Tangga Perikanan Budidaya Menurut Provinsi dan Jenis Budidaya, 2000-2016
      neg: Ringkasan Neraca Arus Dana, 2005, (Miliar Rupiah)
    • query: Lapangan pekerjaan vs pendidikan pekerja (15 tahun ke atas), 1986 hingga 1996
      pos: Penduduk Berumur 15 Tahun Ke Atas yang Bekerja Selama Seminggu yang Lalu Menurut Lapangan Pekerjaan Utama dan Pendidikan Tertinggi yang Ditamatkan, 1986 -1996
      neg: Tabel Input-Output Indonesia Transaksi Domestik Atas Dasar Harga Produsen (17 Lapangan Usaha), 2016 (Juta Rupiah)
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
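
MultipleNegativesRankingLoss treats each query's own positive as the correct answer and every other candidate in the batch as a negative. The cosine similarities are multiplied by scale (20.0 here) and passed through a cross-entropy loss. The following is a simplified pure-Python sketch of that objective on toy 2-dimensional vectors. It is not the library implementation, and the names cos_sim and mnr_loss are illustrative.

```python
import math
from typing import Sequence


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(queries: Sequence[Sequence[float]],
             candidates: Sequence[Sequence[float]],
             scale: float = 20.0) -> float:
    """Mean cross-entropy where candidates[i] is the positive for queries[i].

    Every other candidate (other queries' positives and any extra negatives
    appended after them) acts as a negative for queries[i].
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cos_sim(query, cand) for cand in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(value - peak) for value in logits))
        total += log_norm - logits[i]
    return total / len(queries)


# Toy batch: two queries, their two positives, plus one explicit negative.
queries = [[1.0, 0.0], [0.0, 1.0]]
well_separated = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]

print(mnr_loss(queries, well_separated) < mnr_loss(queries, swapped))  # True
```

The loss is close to zero when each query is most similar to its own positive and grows large when another candidate scores higher.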
    

Evaluation Dataset

statictable-triplets-all

  • Dataset: statictable-triplets-all at 24979b4
  • Size: 967,831 evaluation samples
  • Columns: query, pos, and neg
  • Approximate statistics based on the first 1000 samples (all three columns are of type string):
    • query: min 5 tokens, mean 18.38 tokens, max 37 tokens
    • pos: min 4 tokens, mean 25.28 tokens, max 58 tokens
    • neg: min 5 tokens, mean 25.65 tokens, max 58 tokens
  • Samples:
    • query: Bagaimana hubungan IHK dan rata-rata upah buruh industri (bukan supervisor) bulanan tahun 2010, acuan 1996?
      pos: IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor (Supervisor), 1996-2014 (1996=100)
      neg: Rata-rata Harga Valuta Asing Terpilih menurut Provinsi, 2014
    • query: Berapa rata-rata gaji bulanan pekerja Indonesia berdasarkan ijazah terakhir dan sektor pekerjaannya (2017)?
      pos: Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut Pendidikan Tertinggi yang Ditamatkan dan Lapangan Pekerjaan Utama di 9 Sektor (rupiah), 2017
      neg: Rata-Rata Pengeluaran per Kapita Sebulan Menurut Kelompok Barang (rupiah), 2013-2021
    • query: Data luas lahan (hektar) yang dipakai untuk jenis budidaya perikanan di tiap provinsi tahun 2009
      pos: Luas Area Usaha Budidaya Perikanan Menurut Provinsi dan Jenis Budidaya (ha), 2005-2016
      neg: Ringkasan Neraca Arus Dana, Triwulan I, 2008, (Miliar Rupiah)
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • weight_decay: 0.01
  • num_train_epochs: 2
  • lr_scheduler_type: reduce_lr_on_plateau
  • lr_scheduler_kwargs: {'factor': 0.5, 'patience': 2}
  • warmup_steps: 10000
  • save_on_each_node: True
  • fp16: True
  • dataloader_num_workers: 2
  • load_best_model_at_end: True
  • eval_on_start: True
  • batch_sampler: no_duplicates
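
The non-default values listed above correspond to keyword arguments of SentenceTransformerTrainingArguments. The sketch below collects them in a dictionary and builds the arguments object on demand. It is an assumption about how such a configuration could be written, not the author's training script: output_dir is a placeholder supplied by the caller, and settings the card does not list (such as evaluation or logging intervals) are left at their defaults.

```python
# Non-default values copied from the list above.
NON_DEFAULT_ARGS = {
    "eval_strategy": "steps",
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.01,
    "num_train_epochs": 2,
    "lr_scheduler_type": "reduce_lr_on_plateau",
    "lr_scheduler_kwargs": {"factor": 0.5, "patience": 2},
    "warmup_steps": 10000,
    "save_on_each_node": True,
    "fp16": True,  # needs a GPU that supports half precision
    "dataloader_num_workers": 2,
    "load_best_model_at_end": True,
    "eval_on_start": True,
}


def build_training_args(output_dir: str):
    """Create training arguments; the library import is deferred until this is called."""
    from sentence_transformers import SentenceTransformerTrainingArguments
    from sentence_transformers.training_args import BatchSamplers

    return SentenceTransformerTrainingArguments(
        output_dir=output_dir,  # placeholder path chosen by the caller
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        **NON_DEFAULT_ARGS,
    )
```

The no_duplicates batch sampler matters for MultipleNegativesRankingLoss because a duplicated text inside a batch would otherwise be treated as a negative for its own query.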

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: reduce_lr_on_plateau
  • lr_scheduler_kwargs: {'factor': 0.5, 'patience': 2}
  • warmup_ratio: 0.0
  • warmup_steps: 10000
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: True
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 2
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
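
With lr_scheduler_type set to reduce_lr_on_plateau and lr_scheduler_kwargs of factor 0.5 and patience 2, the learning rate starts at 5e-05 and is halved once the monitored metric has failed to improve for more than two consecutive evaluations. The sketch below is a simplified pure-Python version of that rule. It assumes the monitored metric is the validation loss (the card does not state it) and omits the threshold, cooldown and minimum learning rate options of PyTorch's ReduceLROnPlateau. The class name PlateauSchedule is hypothetical.

```python
class PlateauSchedule:
    """Simplified reduce-on-plateau rule for a metric where lower is better."""

    def __init__(self, lr: float, factor: float = 0.5, patience: int = 2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, metric: float) -> float:
        """Record one evaluation result and return the (possibly reduced) learning rate."""
        if metric < self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        if self.bad_evals > self.patience:
            self.lr *= self.factor
            self.bad_evals = 0
        return self.lr


schedule = PlateauSchedule(lr=5e-05, factor=0.5, patience=2)
# Illustrative validation losses: two improvements, then three evaluations without one.
for loss in [0.010, 0.008, 0.009, 0.009, 0.009]:
    current_lr = schedule.step(loss)
print(current_lr)  # 2.5e-05
```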

Training Logs

Epoch Step Training Loss Validation Loss bps-statictable-ir_cosine_ndcg@10
0 0 - 1.0819 0.4643
0.0334 200 0.1373 - -
0.0668 400 0.0354 - -
0.0835 500 - 0.0132 0.8252
0.1002 600 0.028 - -
0.1336 800 0.018 - -
0.1670 1000 0.0145 0.0096 0.8286
0.2004 1200 0.0089 - -
0.2338 1400 0.0103 - -
0.2505 1500 - 0.0067 0.8312
0.2672 1600 0.0098 - -
0.3006 1800 0.0086 - -
0.3339 2000 0.0086 0.0044 0.8246
0.3673 2200 0.0088 - -
0.4007 2400 0.0075 - -
0.4174 2500 - 0.0051 0.8295
0.4341 2600 0.0066 - -
0.4675 2800 0.0054 - -
0.5009 3000 0.0051 0.0059 0.8294
0.5343 3200 0.0052 - -
0.5677 3400 0.0037 - -
0.5844 3500 - 0.0041 0.8126
0.6011 3600 0.0078 - -
0.6345 3800 0.005 - -
0.6679 4000 0.0045 0.0050 0.8308
0.7013 4200 0.0047 - -
0.7347 4400 0.0066 - -
0.7514 4500 - 0.0033 0.8233
0.7681 4600 0.0043 - -
0.8015 4800 0.003 - -
0.8349 5000 0.0029 0.0036 0.8224
0.8683 5200 0.0014 - -
0.9017 5400 0.0058 - -
0.9184 5500 - 0.0020 0.8169
0.9350 5600 0.0045 - -
0.9684 5800 0.0036 - -
1.0018 6000 0.0053 0.0018 0.8152
1.0352 6200 0.0035 - -
1.0686 6400 0.0017 - -
1.0853 6500 - 0.0024 0.8231
1.1020 6600 0.0037 - -
1.1354 6800 0.0044 - -
1.1688 7000 0.0011 0.0113 0.8153
1.2022 7200 0.0042 - -
1.2356 7400 0.0028 - -
1.2523 7500 - 0.0046 0.8253
1.2690 7600 0.0005 - -
1.3024 7800 0.001 - -
1.3358 8000 0.0011 0.0017 0.8216
1.3692 8200 0.0007 - -
1.4026 8400 0.0014 - -
1.4193 8500 - 0.0014 0.8253
1.4360 8600 0.0003 - -
1.4694 8800 0.0005 - -
1.5028 9000 0.002 0.0012 0.8250
1.5361 9200 0.0013 - -
1.5695 9400 0.0009 - -
1.5862 9500 - 0.0003 0.8162
1.6029 9600 0.0021 - -
1.6363 9800 0.0013 - -
1.6697 10000 0.0005 0.0003 0.8234
1.7031 10200 0.0004 - -
1.7365 10400 0.0004 - -
1.7532 10500 - 0.0001 0.8225
1.7699 10600 0.0011 - -
1.8033 10800 0.0004 - -
1.8367 11000 0.0009 0.0008 0.8259
1.8701 11200 0.0024 - -
1.9035 11400 0.0002 - -
1.9202 11500 - 0.0008 0.8156
1.9369 11600 0.0007 - -
1.9703 11800 0.0007 - -
  • The bold row denotes the saved checkpoint.
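
In the log table, a dash marks a value that was not recorded at that step: training loss is logged every 200 steps, while validation loss and NDCG@10 appear every 500 steps. The sketch below parses rows in this whitespace-separated layout into records so they can be filtered or plotted. The field names and helper functions are illustrative choices.

```python
from typing import List, Optional, TypedDict


class LogRow(TypedDict):
    epoch: float
    step: int
    train_loss: Optional[float]
    val_loss: Optional[float]
    ndcg_at_10: Optional[float]


def parse_value(cell: str) -> Optional[float]:
    """Convert a table cell to float, treating '-' as 'not recorded'."""
    return None if cell == "-" else float(cell)


def parse_log(lines: List[str]) -> List[LogRow]:
    """Parse 'epoch step train_loss val_loss ndcg@10' rows into dictionaries."""
    rows: List[LogRow] = []
    for line in lines:
        epoch, step, train, val, ndcg = line.split()
        rows.append({
            "epoch": float(epoch),
            "step": int(step),
            "train_loss": parse_value(train),
            "val_loss": parse_value(val),
            "ndcg_at_10": parse_value(ndcg),
        })
    return rows


# Three rows copied from the table above.
sample = [
    "0 0 - 1.0819 0.4643",
    "0.0334 200 0.1373 - -",
    "0.0835 500 - 0.0132 0.8252",
]
rows = parse_log(sample)
evaluated = [row for row in rows if row["val_loss"] is not None]
print([row["step"] for row in evaluated])  # [0, 500]
```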

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}