albertus-sussex's picture
Add new SentenceTransformer model
35c8fa6 verified
metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:10079
  - loss:AttributeTripletLoss
base_model: Alibaba-NLP/gte-base-en-v1.5
widget:
  - source_sentence: 'The Beginnings: Word & Spirit in Conversion'
    sentences:
      - '9780312942328'
      - isbn_13
      - title
      - 'I Dare You: Embrace Life with Passion'
  - source_sentence: Continuum International
    sentences:
      - '1999'
      - publisher
      - publication_date
      - 'Publisher: Charlesbridge Pub Inc'
  - source_sentence: 'Pub. Date: December 2009'
    sentences:
      - 20/09/2007
      - '9781616790905'
      - publication_date
      - isbn_13
  - source_sentence: '9781600785283'
    sentences:
      - isbn_13
      - 'Pub. Date: February 2010'
      - '9781400068579'
      - publication_date
  - source_sentence: 'Publisher: Penguin Group (USA)'
    sentences:
      - publisher
      - E. B. White
      - author
      - Harvest House Publishers
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - silhouette_cosine
  - silhouette_euclidean
model-index:
  - name: SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy
            value: 0.9758928418159485
            name: Cosine Accuracy
          - type: cosine_accuracy
            value: 0.972690761089325
            name: Cosine Accuracy
      - task:
          type: silhouette
          name: Silhouette
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: silhouette_cosine
            value: 0.726933479309082
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.5876889228820801
            name: Silhouette Euclidean
          - type: silhouette_cosine
            value: 0.7395883202552795
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.5981691479682922
            name: Silhouette Euclidean

SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-base-en-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("albertus-sussex/veriscrape-sbert-book-reference_4_to_verify_6-fold-5")
# Run inference
sentences = [
    'Publisher: Penguin Group (USA)',
    'Harvest House Publishers',
    'E. B. White',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Triplet

Metric Value
cosine_accuracy 0.9759

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator
Metric Value
silhouette_cosine 0.7269
silhouette_euclidean 0.5877

Triplet

Metric Value
cosine_accuracy 0.9727

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator
Metric Value
silhouette_cosine 0.7396
silhouette_euclidean 0.5982

Training Details

Training Dataset

Unnamed Dataset

  • Size: 10,079 training samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 1000 samples:
    anchor positive negative pos_attr_name neg_attr_name
    type string string string string string
    details
    • min: 3 tokens
    • mean: 8.23 tokens
    • max: 41 tokens
    • min: 3 tokens
    • mean: 8.32 tokens
    • max: 26 tokens
    • min: 3 tokens
    • mean: 8.68 tokens
    • max: 25 tokens
    • min: 3 tokens
    • mean: 3.75 tokens
    • max: 5 tokens
    • min: 3 tokens
    • mean: 3.88 tokens
    • max: 5 tokens
  • Samples:
    anchor positive negative pos_attr_name neg_attr_name
    978-0453008754 9781416579021 Peter Pan (Barnes & Noble Classics Series) isbn_13 title
    2008 Pub. Date: May 2009 978-0880800174 publication_date isbn_13
    Summersdale Publishers Publisher: Random House Publishing Group 9781607060765 publisher isbn_13
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    

Evaluation Dataset

Unnamed Dataset

  • Size: 1,120 evaluation samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 1000 samples:
    anchor positive negative pos_attr_name neg_attr_name
    type string string string string string
    details
    • min: 3 tokens
    • mean: 8.47 tokens
    • max: 41 tokens
    • min: 3 tokens
    • mean: 8.38 tokens
    • max: 30 tokens
    • min: 3 tokens
    • mean: 8.76 tokens
    • max: 30 tokens
    • min: 3 tokens
    • mean: 3.79 tokens
    • max: 5 tokens
    • min: 3 tokens
    • mean: 3.82 tokens
    • max: 5 tokens
  • Samples:
    anchor positive negative pos_attr_name neg_attr_name
    Novel to Film: An Introduction to the Theory of Adaptation Roughing It (Signet Classics) Houghton Mifflin; 1 edition (November 2002) title publication_date
    The Making of America: The Substance and Meaning of the Constitution I Am...: Biblical Women Tell Their Own Stories Benjamin Franklin title author
    978-0399523304 9781592578368 Dean Koontz isbn_13 author
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 5
  • warmup_ratio: 0.1

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss cosine_accuracy silhouette_cosine
-1 -1 - - 0.4098 0.1189
1.0 79 1.3683 0.3569 0.9714 0.6619
2.0 158 0.1651 0.3353 0.9759 0.7334
3.0 237 0.1183 0.2818 0.9786 0.7583
4.0 316 0.0969 0.2999 0.9777 0.7498
5.0 395 0.0764 0.3014 0.9759 0.7269
-1 -1 - - 0.9727 0.7396

Framework Versions

  • Python: 3.10.16
  • Sentence Transformers: 3.4.1
  • Transformers: 4.45.2
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

AttributeTripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}