albertus-sussex's picture
Add new SentenceTransformer model
bbe4d20 verified
metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:10040
  - loss:AttributeTripletLoss
base_model: Alibaba-NLP/gte-base-en-v1.5
widget:
  - source_sentence: Ted Hughes
    sentences:
      - John Simpson
      - isbn_13
      - '9782070391653'
      - author
  - source_sentence: 01 December 2008
    sentences:
      - publication_date
      - 01/04/2010
      - '9780758232007'
      - isbn_13
  - source_sentence: Scribner (December 1, 1995)
    sentences:
      - Solida Inc; Pap/MP3 edition (March 2005)
      - 978-0446300155
      - publisher
      - isbn_13
  - source_sentence: AMACOM (April 7, 2010)
    sentences:
      - publisher
      - 11/10/2010
      - Waverley Books Ltd
      - publication_date
  - source_sentence: Zest for Life
    sentences:
      - title
      - Jimmy Buffett
      - author
      - Selected Poems - Faber poetry
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - silhouette_cosine
  - silhouette_euclidean
model-index:
  - name: SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy
            value: 0.9740143418312073
            name: Cosine Accuracy
          - type: cosine_accuracy
            value: 0.9677419066429138
            name: Cosine Accuracy
      - task:
          type: silhouette
          name: Silhouette
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: silhouette_cosine
            value: 0.7366474270820618
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.6139115691184998
            name: Silhouette Euclidean
          - type: silhouette_cosine
            value: 0.7273561954498291
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.6114970445632935
            name: Silhouette Euclidean

SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-base-en-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("albertus-sussex/veriscrape-sbert-book-reference_4_to_verify_6-fold-8")
# Run inference
sentences = [
    'Zest for Life',
    'Selected Poems - Faber poetry',
    'Jimmy Buffett',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Triplet

Metric Value
cosine_accuracy 0.974

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator
Metric Value
silhouette_cosine 0.7366
silhouette_euclidean 0.6139

Triplet

Metric Value
cosine_accuracy 0.9677

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator
Metric Value
silhouette_cosine 0.7274
silhouette_euclidean 0.6115

Training Details

Training Dataset

Unnamed Dataset

  • Size: 10,040 training samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 1000 samples:
    anchor positive negative pos_attr_name neg_attr_name
    type string string string string string
    details
    • min: 3 tokens
    • mean: 7.93 tokens
    • max: 27 tokens
    • min: 3 tokens
    • mean: 7.84 tokens
    • max: 29 tokens
    • min: 3 tokens
    • mean: 8.2 tokens
    • max: 29 tokens
    • min: 3 tokens
    • mean: 3.77 tokens
    • max: 5 tokens
    • min: 3 tokens
    • mean: 3.79 tokens
    • max: 5 tokens
  • Samples:
    anchor positive negative pos_attr_name neg_attr_name
    9780671558680 9781439177785 Random House; 1 edition (September 9, 2003) isbn_13 publication_date
    Griffin Publishing Carlton Books Ltd 9780007307456 publisher isbn_13
    The Cannibal Queen: An Aerial Odyssey Across America A Member of the Family: The Ultimate Guide to Living with a Happy, Healthy Dog HarperCollins Entertainment title publisher
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    

Evaluation Dataset

Unnamed Dataset

  • Size: 1,116 evaluation samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 1000 samples:
    anchor positive negative pos_attr_name neg_attr_name
    type string string string string string
    details
    • min: 3 tokens
    • mean: 7.98 tokens
    • max: 29 tokens
    • min: 3 tokens
    • mean: 8.0 tokens
    • max: 29 tokens
    • min: 3 tokens
    • mean: 8.4 tokens
    • max: 24 tokens
    • min: 3 tokens
    • mean: 3.76 tokens
    • max: 5 tokens
    • min: 3 tokens
    • mean: 3.83 tokens
    • max: 5 tokens
  • Samples:
    anchor positive negative pos_attr_name neg_attr_name
    Simon & Schuster Audio; Unabridged edition (April 28, 2009) 01/04/2003 St. Martin's Press; 1 edition (April 14, 2009) publication_date publisher
    9781845335267 978-0684196381 The Kitchen House: A Novel isbn_13 title
    Bantam Press (2008) 2001 Bantam publication_date publisher
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 5
  • warmup_ratio: 0.1

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss cosine_accuracy silhouette_cosine
-1 -1 - - 0.3961 0.1214
1.0 79 1.4026 0.3251 0.9722 0.6770
2.0 158 0.1547 0.3334 0.9713 0.6860
3.0 237 0.1063 0.3605 0.9704 0.6971
4.0 316 0.0851 0.3044 0.9731 0.7332
5.0 395 0.0707 0.3056 0.9740 0.7366
-1 -1 - - 0.9677 0.7274

Framework Versions

  • Python: 3.10.16
  • Sentence Transformers: 3.4.1
  • Transformers: 4.45.2
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

AttributeTripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}