---
language:
  - code
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:94500
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      Primary CD8+ T cells from a subject identified as CL-MCRL, exposed to the
      GPR epitope with a dpi (days post-infection) of 87.5.
    sentences:
      - Cancer cell line (CCL23) derived from a carcinoma patient.
      - >-
        Primary CD34+ human cells in three-phase in vitro culture, isolated on
        day 13, with GG1dd zf vector transduction.
      - 23-year-old primary nonETP leukemic blasts from bone marrow.
  - source_sentence: >-
      Hematopoietic cells with PI-AnnexinV-GFP+CD33+ phenotype from a xenograft
      strain NRG-3GS.
    sentences:
      - >-
        H9 embryonic stem cells treated with recombinant Wnt3a for 8 hours in
        culture.
      - >-
        iCell Hepatocytes that have been treated with 075_OLBO_10 in a study
        involving BO class and dose 10.
      - >-
        48 hour treatment of colorectal carcinoma cell line HCT116 (colorectal
        cancer) with control treatment.
  - source_sentence: >-
      Memory B cells derived from a female thoracic lymph node, obtained from a
      donor in their seventh decade.
    sentences:
      - >-
        Neuron cell type from the Pulvinar of thalamus, derived from a
        42-year-old human individual.
      - >-
        Germinal center B cell derived from the tonsil tissue of a 3-year-old
        male with recurrent tonsillitis.
      - >-
        B cell sample from a 55-year old female Asian individual with managed
        systemic lupus erythematosus (SLE). The cell was derived from peripheral
        blood mononuclear cells (PBMCs).
  - source_sentence: >-
      Pericyte cells, part of the smooth muscle lineage, extracted from the
      transition zone of a 74-year-old human prostate.
    sentences:
      - >-
        A CD8-positive, alpha-beta memory T cell, CD45RO-positive, specifically
        identified as Tem/Effector cytotoxic T cells, as determined by
        CellTypist prediction. The cell was obtained from the lung tissue of a
        female individual in her eighth decade.
      - >-
        CD4-positive, alpha-beta T cell sample taken from a 53-year old female
        Asian individual with managed systemic lupus erythematosus (SLE).
      - >-
        Natural killer cell from a 32-year old female of European descent with
        managed systemic lupus erythematosus (SLE).
  - source_sentence: >-
      Sample is a basal cell of prostate epithelium, taken from the transition
      zone of the prostate gland in a 72-year old male. It belongs to the
      Epithelia lineage and Population BE.
    sentences:
      - >-
        Neuron cell type from a 42-year old male cerebral cortex tissue,
        specifically from the rostral gyrus dorsal division of MFC A32,
        classified as Deep-layer corticothalamic and 6b.
      - >-
        Dendritic cell from the transition zone of prostate of a 29-year-old
        male, specifically from the EREG+ population.
      - >-
        Neuron from the mediodorsal nucleus of thalamus, which is part of the
        medial nuclear complex of thalamus (MNC) in the thalamic complex, taken
        from a 42-year-old male human donor with European ethnicity. The neuron
        belongs to the Thalamic excitatory supercluster.
datasets:
  - jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation
  - jo-mengr/geo_70k_multiplets_natural_language_annotation
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
model-index:
  - name: SentenceTransformer
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy
            value: 0.9402857422828674
            name: Cosine Accuracy
          - type: cosine_accuracy
            value: 0.9371428489685059
            name: Cosine Accuracy
---

SentenceTransformer

This is a sentence-transformers model trained on the cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation and geo_70k_multiplets_natural_language_annotation datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Full Model Architecture

SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (text_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=768, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (omics_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=64, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
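
The printout shows a two-branch multimodal encoder: a BERT text encoder followed by a text adapter (768 → 512 → 2048), and an omics adapter that projects 64-dimensional inputs into the same 2048-dimensional space. For illustration only, a minimal PyTorch sketch of the adapter structure as printed above (the class name AdapterSketch and the wrapper code are ours, not the model's source):

import torch
from torch import nn

class AdapterSketch(nn.Module):
    """Illustrative reconstruction of the AdapterModule printed above."""
    def __init__(self, in_features: int, hidden: int = 512, out_features: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_features),
            nn.BatchNorm1d(out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Dimensions as printed above: the text adapter maps 768-d BERT outputs and the
# omics adapter maps 64-d inputs into the same 2048-d space.
text_adapter = AdapterSketch(in_features=768)
omics_adapter = AdapterSketch(in_features=64)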

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")
# Run inference
sentences = [
    'Sample is a basal cell of prostate epithelium, taken from the transition zone of the prostate gland in a 72-year old male. It belongs to the Epithelia lineage and Population BE.',
    'Neuron cell type from a 42-year old male cerebral cortex tissue, specifically from the rostral gyrus dorsal division of MFC A32, classified as Deep-layer corticothalamic and 6b.',
    'Neuron from the mediodorsal nucleus of thalamus, which is part of the medial nuclear complex of thalamus (MNC) in the thalamic complex, taken from a 42-year-old male human donor with European ethnicity. The neuron belongs to the Thalamic excitatory supercluster.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
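
Since semantic search is among the intended uses, here is a minimal sketch of ranking annotations against a free-text query with the library's util.semantic_search helper (the query and corpus strings are made-up examples):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")

corpus = [
    "Memory B cells derived from a female thoracic lymph node.",
    "Neuron cell type from the pulvinar of thalamus of a 42-year-old human.",
    "Natural killer cell from a 32-year-old female of European descent.",
]
query = "B cell sample from lymph node tissue"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns the top-k corpus entries ranked by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])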

Evaluation

Metrics

Triplet

Two triplet evaluations are reported, one per evaluation dataset (see Evaluation Datasets below):

Metric          Value
cosine_accuracy 0.9403
cosine_accuracy 0.9371
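
cosine_accuracy is the fraction of triplets in which the anchor embedding is closer (by cosine similarity) to the positive than to the negative. A minimal sketch with the library's TripletEvaluator, using made-up triplets rather than the actual evaluation splits:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")

evaluator = TripletEvaluator(
    anchors=["Memory B cells derived from a female thoracic lymph node."],
    positives=["B cell sample from peripheral blood of a female donor."],
    negatives=["Neuron from the mediodorsal nucleus of thalamus."],
)
results = evaluator(model)
print(results)  # e.g. {'cosine_accuracy': 1.0}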

Training Details

Training Datasets

cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

geo_70k_multiplets_natural_language_annotation

Evaluation Datasets

cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

geo_70k_multiplets_natural_language_annotation
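
Both datasets are hosted on the Hugging Face Hub and can be loaded directly; a minimal sketch (split and column names are not documented on this card and should be checked on the dataset pages):

from datasets import load_dataset

cellxgene = load_dataset("jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation")
geo = load_dataset("jo-mengr/geo_70k_multiplets_natural_language_annotation")
print(cellxgene)  # inspect available splits and columns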

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 8
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 1
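
For reference, these non-default values map onto the Sentence Transformers v3 training API roughly as follows; a sketch assuming the framework versions listed below, not the exact training script:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")
loss = MultipleNegativesRankingLoss(model)  # loss named in the card metadata

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # hypothetical path
    eval_strategy="steps",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    num_train_epochs=8,
    warmup_ratio=0.1,
    fp16=True,
    dataloader_num_workers=1,
)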

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 8
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 1
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

("cellxgene loss" and "geo_70k loss" below abbreviate the evaluation losses on cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation and geo_70k_multiplets_natural_language_annotation; "-" marks values not logged at that step.)

Epoch  Step  Training Loss  cellxgene loss  geo_70k loss  cosine_accuracy
0.1351 100 - 19.5545 19.6050 0.5656
0.2703 200 17.2819 19.4888 17.2415 0.7261
0.4054 300 - 17.2527 14.3099 0.7684
0.5405 400 13.4122 13.1462 13.4371 0.7976
0.6757 500 - 12.6305 9.3601 0.8474
0.8108 600 8.3246 11.1233 7.6021 0.8787
0.9459 700 - 8.5871 7.6461 0.8980
1.0811 800 6.1203 7.0774 7.1605 0.9046
1.2162 900 - 6.0461 6.7694 0.9076
1.3514 1000 5.1622 6.1759 6.0741 0.9166
1.4865 1100 - 6.6497 5.3305 0.9269
1.6216 1200 4.7346 7.6330 4.9083 0.9324
1.7568 1300 - 6.5700 4.8609 0.9349
1.8919 1400 4.4577 6.9249 4.6155 0.9401
2.0270 1500 - 5.4120 5.0721 0.9367
2.1622 1600 4.2281 6.3842 4.6481 0.9407
2.2973 1700 - 5.6970 4.9588 0.9370
2.4324 1800 4.2392 6.3306 4.6888 0.9407
2.5676 1900 - 5.3909 5.0415 0.9364
2.7027 2000 4.2237 6.0779 4.7476 0.9394
2.8378 2100 - 5.3631 5.0280 0.9379
2.9730 2200 4.2215 5.5800 4.9418 0.9373
3.1081 2300 - 6.3898 4.6718 0.9400
3.2432 2400 4.1984 4.7118 5.4301 0.9313
3.3784 2500 - 7.2266 4.5063 0.9419
3.5135 2600 4.2538 8.1464 4.4121 0.9426
3.6486 2700 - 6.5866 4.6253 0.9409
3.7838 2800 4.2186 5.8797 4.8671 0.9380
3.9189 2900 - 5.5591 4.9559 0.9377
4.0541 3000 4.2064 6.3420 4.7167 0.9413
4.1892 3100 - 5.9561 4.8190 0.9387
4.3243 3200 4.2248 6.3844 4.6827 0.9410
4.4595 3300 - 7.1522 4.5193 0.9421
4.5946 3400 4.2263 4.8333 5.3410 0.9331
4.7297 3500 - 4.5820 5.5334 0.9306
4.8649 3600 4.2472 6.8254 4.5512 0.9413
5.0 3700 - 6.4904 4.6993 0.9399
5.1351 3800 4.1936 4.8578 5.3678 0.9344
5.2703 3900 - 6.4530 4.6426 0.9413
5.4054 4000 4.2345 6.6050 4.6684 0.9409
5.5405 4100 - 4.8690 5.3172 0.9334
5.6757 4200 4.2406 6.2903 4.7100 0.9404
5.8108 4300 - 6.6273 4.6269 0.9419
5.9459 4400 4.2227 5.4572 5.0365 0.9370
6.0811 4500 - 5.0242 5.2568 0.9341
6.2162 4600 4.1997 4.7279 5.5242 0.9316
6.3514 4700 - 5.1846 5.2246 0.9339
6.4865 4800 4.2361 5.8601 4.8249 0.9381
6.6216 4900 - 6.9398 4.5848 0.9423
6.7568 5000 4.2273 6.2977 4.6921 0.9406
6.8919 5100 - 6.9737 4.5439 0.9421
7.0270 5200 4.2052 5.3900 5.0873 0.9370
7.1622 5300 - 6.3929 4.6474 0.9406
7.2973 5400 4.2416 5.6994 4.9590 0.9371
7.4324 5500 - 6.3184 4.6922 0.9407
7.5676 5600 4.2311 5.3932 5.0403 0.9363
7.7027 5700 - 6.0781 4.7480 0.9394
7.8378 5800 4.229 5.3664 5.0291 0.9380
7.9730 5900 - 5.5803 4.9391 0.9371

Framework Versions

  • Python: 3.10.10
  • Sentence Transformers: 3.5.0.dev0
  • Transformers: 4.43.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 0.33.0
  • Datasets: 2.14.4
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}