---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:7600
  - loss:AttributeTripletLoss
base_model: Alibaba-NLP/gte-base-en-v1.5
widget:
  - source_sentence: Harperaudio
    sentences:
      - 'Pub. Date: May 2007'
      - ': Atheneum Books'
      - publication_date
      - publisher
  - source_sentence: Josh McDowell
    sentences:
      - author
      - Peter B. Rosenberger
      - isbn_13
      - '9781616877163'
  - source_sentence: ': 9780316067348'
    sentences:
      - 'Publisher: Barnes & Noble'
      - publisher
      - '9780785134350'
      - isbn_13
  - source_sentence: '9780425235676'
    sentences:
      - '9780312939434'
      - isbn_13
      - author
      - Jim Butcher
  - source_sentence: ': Vintage Books USA'
    sentences:
      - 'Publisher: Knopf Doubleday Publishing Group'
      - publication_date
      - 'Pub. Date: October 26, 2010'
      - publisher
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - silhouette_cosine
  - silhouette_euclidean
model-index:
  - name: SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy
            value: 1
            name: Cosine Accuracy
          - type: cosine_accuracy
            value: 1
            name: Cosine Accuracy
      - task:
          type: silhouette
          name: Silhouette
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: silhouette_cosine
            value: 0.9663600325584412
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.843291163444519
            name: Silhouette Euclidean
          - type: silhouette_cosine
            value: 0.9647400975227356
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.8390628099441528
            name: Silhouette Euclidean
---

SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-base-en-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
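
These properties can be verified programmatically after loading the model. A minimal sketch, assuming the repository id from the Usage section below; the trust_remote_code flag is an assumption, since the gte base model ships custom modeling code:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "albertus-sussex/veriscrape-sbert-book-reference_3_to_verify_7-fold-10",
    trust_remote_code=True,  # may be required: the base model uses custom code (NewModel)
)
print(model.get_max_seq_length())                # 8192
print(model.get_sentence_embedding_dimension())  # 768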

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
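
The same two-module stack can be assembled by hand, which makes the CLS-pooling choice explicit. A sketch only: this rebuilds an untrained equivalent of the architecture from the base checkpoint, not the fine-tuned weights:

from sentence_transformers import SentenceTransformer, models

# Module 0: the GTE backbone (custom NewModel code, hence trust_remote_code)
transformer = models.Transformer(
    "Alibaba-NLP/gte-base-en-v1.5",
    max_seq_length=8192,
    model_args={"trust_remote_code": True},
)
# Module 1: CLS-token pooling, matching pooling_mode_cls_token=True above
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 768
    pooling_mode="cls",
)
model = SentenceTransformer(modules=[transformer, pooling])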

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("albertus-sussex/veriscrape-sbert-book-reference_3_to_verify_7-fold-10")
# Run inference
sentences = [
    ': Vintage Books USA',
    'Publisher: Knopf Doubleday Publishing Group',
    'Pub. Date: October 26, 2010',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
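
Given the training objective (pulling attribute values toward their attribute name), one natural downstream use is zero-shot attribute tagging: embed a scraped value and the candidate attribute names, then pick the nearest name. A sketch; the attribute list here is illustrative, taken from the widget examples above:

attributes = ["title", "author", "publisher", "publication_date", "isbn_13"]
value = "Pub. Date: October 26, 2010"

attr_emb = model.encode(attributes)
val_emb = model.encode([value])
scores = model.similarity(val_emb, attr_emb)  # shape [1, 5], cosine by default
print(attributes[int(scores.argmax())])       # expected: publication_date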

Evaluation

Metrics

Triplet

Metric            Value
cosine_accuracy   1.0

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator

Metric                 Value
silhouette_cosine      0.9664
silhouette_euclidean   0.8433

Triplet

Metric            Value
cosine_accuracy   1.0

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator

Metric                 Value
silhouette_cosine      0.9647
silhouette_euclidean   0.8391
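
SilhouetteEvaluator comes from the private veriscrape package, but both kinds of numbers above can be approximated with public tooling: TripletEvaluator from sentence-transformers for triplet cosine accuracy, and scikit-learn's silhouette_score over the embeddings, labeled by attribute name, for the clustering metrics. A sketch with placeholder data drawn from the widget examples; this is an approximation, not the card's exact evaluation code:

from sentence_transformers.evaluation import TripletEvaluator
from sklearn.metrics import silhouette_score

# Triplet accuracy: fraction of triplets where the anchor embeds
# closer to the positive than to the negative
evaluator = TripletEvaluator(
    anchors=["Josh McDowell"],
    positives=["Peter B. Rosenberger"],
    negatives=["9781616877163"],
)
print(evaluator(model))

# Silhouette over attribute clusters: labels are the attribute names
texts = ["Jim Butcher", "Kate DiCamillo", "9780312939434", "9780425235676"]
labels = ["author", "author", "isbn_13", "isbn_13"]
emb = model.encode(texts)
print(silhouette_score(emb, labels, metric="cosine"))
print(silhouette_score(emb, labels, metric="euclidean"))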

Training Details

Training Dataset

Unnamed Dataset

  • Size: 7,600 training samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 3, mean 7.21, max 37 tokens
    • positive (string): min 3, mean 7.11, max 17 tokens
    • negative (string): min 3, mean 7.72, max 22 tokens
    • pos_attr_name (string): min 3, mean 3.81, max 5 tokens
    • neg_attr_name (string): min 3, mean 3.78, max 5 tokens
  • Samples:
    • anchor: New Moon | positive: The Best Kind of Different | negative: July 21, 2005 | pos_attr_name: title | neg_attr_name: publication_date
    • anchor: : January 2009 | positive: Pub. Date: September 2002 | negative: : 9780316049009 | pos_attr_name: publication_date | neg_attr_name: isbn_13
    • anchor: Martha Nesbit | positive: Kate DiCamillo | negative: Pub. Date: February 1999 | pos_attr_name: author | neg_attr_name: publication_date
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
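
AttributeTripletLoss itself ships with the private veriscrape package. Its parameters mirror the stock TripletLoss in sentence-transformers, so a public approximation with the same settings would look like the sketch below; note the stock loss only sees (anchor, positive, negative) and ignores the two *_attr_name columns:

from sentence_transformers import losses
from sentence_transformers.losses import TripletDistanceMetric

# Approximation only: same distance metric and margin as the custom loss,
# but without the attribute-name supervision.
loss = losses.TripletLoss(
    model=model,  # the loaded SentenceTransformer
    distance_metric=TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)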
    

Evaluation Dataset

Unnamed Dataset

  • Size: 845 evaluation samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 845 samples:
    • anchor (string): min 3, mean 7.15, max 37 tokens
    • positive (string): min 3, mean 7.06, max 33 tokens
    • negative (string): min 3, mean 7.83, max 17 tokens
    • pos_attr_name (string): min 3, mean 3.71, max 5 tokens
    • neg_attr_name (string): min 3, mean 3.85, max 5 tokens
  • Samples:
    • anchor: Nancy Schoenberger | positive: Catherine Anderson | negative: : Berkley | pos_attr_name: author | neg_attr_name: publisher
    • anchor: Publisher: Bethany House Publishers | positive: Knopf Doubleday Publishing Group | negative: DK Publishing | pos_attr_name: publisher | neg_attr_name: author
    • anchor: Lise Friedman | positive: Lisa Gardner | negative: : Benbella Books | pos_attr_name: author | neg_attr_name: publisher
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 5
  • warmup_ratio: 0.1
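
Wired together, these non-default values correspond to a trainer setup along the following lines. A sketch under stated assumptions: the dataset is a two-row toy stand-in for the triplet data described above, and the public TripletLoss substitutes for the custom AttributeTripletLoss:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.losses import TripletDistanceMetric

model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

# Toy stand-in for the 7,600-sample triplet dataset described above
train_dataset = Dataset.from_dict({
    "anchor":   ["New Moon", "Martha Nesbit"],
    "positive": ["The Best Kind of Different", "Kate DiCamillo"],
    "negative": ["July 21, 2005", "Pub. Date: February 1999"],
})

# Public stand-in for the custom AttributeTripletLoss
loss = losses.TripletLoss(model, distance_metric=TripletDistanceMetric.EUCLIDEAN, triplet_margin=5)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    eval_strategy="epoch",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=5,
    warmup_ratio=0.1,
)

SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder for the 845-sample eval split
    loss=loss,
).train()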

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step   Training Loss   Validation Loss   cosine_accuracy   silhouette_cosine
-1      -1     -               -                 0.5077            0.1629
1.0     60     0.87            0.0209            0.9988            0.9412
2.0     120    0.0078          0.0               1.0               0.9648
3.0     180    0.0001          0.0               1.0               0.9663
4.0     240    0.0002          0.0               1.0               0.9661
5.0     300    0.0             0.0               1.0               0.9664
-1      -1     -               -                 1.0               0.9647

(The epoch -1 rows are the evaluations run before training started and after training finished.)

Framework Versions

  • Python: 3.10.16
  • Sentence Transformers: 3.4.1
  • Transformers: 4.45.2
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3
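
To approximate this environment, the listed versions can be pinned directly. An assumption: a Python 3.10 environment, and a torch 2.5.1 wheel matching the +cu124 build above:

pip install sentence-transformers==3.4.1 transformers==4.45.2 torch==2.5.1 accelerate==1.5.2 datasets==3.1.0 tokenizers==0.20.3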

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

AttributeTripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}