---
language:
  - code
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:94500
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      Primary CD8+ T cells from a subject identified as CL-MCRL, exposed to the
      GPR epitope with a dpi (days post-infection) of 87.5.
    sentences:
      - Cancer cell line (CCL23) derived from a carcinoma patient.
      - >-
        Primary CD34+ human cells in three-phase in vitro culture, isolated on
        day 13, with GG1dd zf vector transduction.
      - 23-year-old primary nonETP leukemic blasts from bone marrow.
  - source_sentence: >-
      Hematopoietic cells with PI-AnnexinV-GFP+CD33+ phenotype from a xenograft
      strain NRG-3GS.
    sentences:
      - >-
        H9 embryonic stem cells treated with recombinant Wnt3a for 8 hours in
        culture.
      - >-
        iCell Hepatocytes that have been treated with 075_OLBO_10 in a study
        involving BO class and dose 10.
      - >-
        48 hour treatment of colorectal carcinoma cell line HCT116 (colorectal
        cancer) with control treatment.
  - source_sentence: >-
      Memory B cells derived from a female thoracic lymph node, obtained from a
      donor in their seventh decade.
    sentences:
      - >-
        Neuron cell type from the Pulvinar of thalamus, derived from a
        42-year-old human individual.
      - >-
        Germinal center B cell derived from the tonsil tissue of a 3-year-old
        male with recurrent tonsillitis.
      - >-
        B cell sample from a 55-year old female Asian individual with managed
        systemic lupus erythematosus (SLE). The cell was derived from peripheral
        blood mononuclear cells (PBMCs).
  - source_sentence: >-
      Pericyte cells, part of the smooth muscle lineage, extracted from the
      transition zone of a 74-year-old human prostate.
    sentences:
      - >-
        A CD8-positive, alpha-beta memory T cell, CD45RO-positive, specifically
        identified as Tem/Effector cytotoxic T cells, as determined by
        CellTypist prediction. The cell was obtained from the lung tissue of a
        female individual in her eighth decade.
      - >-
        CD4-positive, alpha-beta T cell sample taken from a 53-year old female
        Asian individual with managed systemic lupus erythematosus (SLE).
      - >-
        Natural killer cell from a 32-year old female of European descent with
        managed systemic lupus erythematosus (SLE).
  - source_sentence: >-
      Sample is a basal cell of prostate epithelium, taken from the transition
      zone of the prostate gland in a 72-year old male. It belongs to the
      Epithelia lineage and Population BE.
    sentences:
      - >-
        Neuron cell type from a 42-year old male cerebral cortex tissue,
        specifically from the rostral gyrus dorsal division of MFC A32,
        classified as Deep-layer corticothalamic and 6b.
      - >-
        Dendritic cell from the transition zone of prostate of a 29-year-old
        male, specifically from the EREG+ population.
      - >-
        Neuron from the mediodorsal nucleus of thalamus, which is part of the
        medial nuclear complex of thalamus (MNC) in the thalamic complex, taken
        from a 42-year-old male human donor with European ethnicity. The neuron
        belongs to the Thalamic excitatory supercluster.
datasets:
  - jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation
  - jo-mengr/geo_70k_multiplets_natural_language_annotation
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
model-index:
  - name: SentenceTransformer
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy
            value: 0.9402857422828674
            name: Cosine Accuracy
          - type: cosine_accuracy
            value: 0.9371428489685059
            name: Cosine Accuracy
---

SentenceTransformer

This is a sentence-transformers model trained on the cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation and geo_70k_multiplets_natural_language_annotation datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Full Model Architecture

SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (text_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=768, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (omics_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=64, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
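
The printout shows a two-branch multimodal encoder: a BERT text encoder followed by a text adapter (768 → 512 → 2048), and an omics adapter that projects 64-dimensional inputs into the same 2048-dimensional space. For illustration only, a minimal PyTorch sketch of the adapter structure as printed above (the class name AdapterSketch and the wrapper code are ours, not the model's source):

import torch
from torch import nn

class AdapterSketch(nn.Module):
    """Illustrative reconstruction of the AdapterModule printed above."""
    def __init__(self, in_features: int, hidden: int = 512, out_features: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_features),
            nn.BatchNorm1d(out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Dimensions as printed above: the text adapter maps 768-d BERT outputs and the
# omics adapter maps 64-d inputs into the same 2048-d space.
text_adapter = AdapterSketch(in_features=768)
omics_adapter = AdapterSketch(in_features=64)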

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")
# Run inference
sentences = [
    'Sample is a basal cell of prostate epithelium, taken from the transition zone of the prostate gland in a 72-year old male. It belongs to the Epithelia lineage and Population BE.',
    'Neuron cell type from a 42-year old male cerebral cortex tissue, specifically from the rostral gyrus dorsal division of MFC A32, classified as Deep-layer corticothalamic and 6b.',
    'Neuron from the mediodorsal nucleus of thalamus, which is part of the medial nuclear complex of thalamus (MNC) in the thalamic complex, taken from a 42-year-old male human donor with European ethnicity. The neuron belongs to the Thalamic excitatory supercluster.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
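
Since semantic search is among the intended uses, here is a minimal sketch of ranking annotations against a free-text query with the library's util.semantic_search helper (the query and corpus strings are made-up examples):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")

corpus = [
    "Memory B cells derived from a female thoracic lymph node.",
    "Neuron cell type from the pulvinar of thalamus of a 42-year-old human.",
    "Natural killer cell from a 32-year-old female of European descent.",
]
query = "B cell sample from lymph node tissue"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns the top-k corpus entries ranked by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])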

Evaluation

Metrics

Triplet

Two triplet evaluations are reported, one per evaluation dataset (see Evaluation Datasets below):

Metric          Value
cosine_accuracy 0.9403
cosine_accuracy 0.9371
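
cosine_accuracy is the fraction of triplets in which the anchor embedding is closer (by cosine similarity) to the positive than to the negative. A minimal sketch with the library's TripletEvaluator, using made-up triplets rather than the actual evaluation splits:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")

evaluator = TripletEvaluator(
    anchors=["Memory B cells derived from a female thoracic lymph node."],
    positives=["B cell sample from peripheral blood of a female donor."],
    negatives=["Neuron from the mediodorsal nucleus of thalamus."],
)
results = evaluator(model)
print(results)  # e.g. {'cosine_accuracy': 1.0}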

Training Details

Training Datasets

cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

geo_70k_multiplets_natural_language_annotation

Evaluation Datasets

cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

geo_70k_multiplets_natural_language_annotation
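
Both datasets are hosted on the Hugging Face Hub and can be loaded directly; a minimal sketch (split and column names are not documented on this card and should be checked on the dataset pages):

from datasets import load_dataset

cellxgene = load_dataset("jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation")
geo = load_dataset("jo-mengr/geo_70k_multiplets_natural_language_annotation")
print(cellxgene)  # inspect available splits and columns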

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 8
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 1
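
For reference, these non-default values map onto the Sentence Transformers v3 training API roughly as follows; a sketch assuming the framework versions listed below, not the exact training script:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")
loss = MultipleNegativesRankingLoss(model)  # loss named in the card metadata

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # hypothetical path
    eval_strategy="steps",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    num_train_epochs=8,
    warmup_ratio=0.1,
    fp16=True,
    dataloader_num_workers=1,
)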

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 8
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 1
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

("cellxgene loss" and "geo_70k loss" below abbreviate the evaluation losses on cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation and geo_70k_multiplets_natural_language_annotation; "-" marks values not logged at that step.)

Epoch  Step  Training Loss  cellxgene loss  geo_70k loss  cosine_accuracy
0.1351 100 - 19.5545 19.6050 0.5656
0.2703 200 17.2819 19.4888 17.2415 0.7261
0.4054 300 - 17.2527 14.3099 0.7684
0.5405 400 13.4122 13.1462 13.4371 0.7976
0.6757 500 - 12.6305 9.3601 0.8474
0.8108 600 8.3246 11.1233 7.6021 0.8787
0.9459 700 - 8.5871 7.6461 0.8980
1.0811 800 6.1203 7.0774 7.1605 0.9046
1.2162 900 - 6.0461 6.7694 0.9076
1.3514 1000 5.1622 6.1759 6.0741 0.9166
1.4865 1100 - 6.6497 5.3305 0.9269
1.6216 1200 4.7346 7.6330 4.9083 0.9324
1.7568 1300 - 6.5700 4.8609 0.9349
1.8919 1400 4.4577 6.9249 4.6155 0.9401
2.0270 1500 - 5.4120 5.0721 0.9367
2.1622 1600 4.2281 6.3842 4.6481 0.9407
2.2973 1700 - 5.6970 4.9588 0.9370
2.4324 1800 4.2392 6.3306 4.6888 0.9407
2.5676 1900 - 5.3909 5.0415 0.9364
2.7027 2000 4.2237 6.0779 4.7476 0.9394
2.8378 2100 - 5.3631 5.0280 0.9379
2.9730 2200 4.2215 5.5800 4.9418 0.9373
3.1081 2300 - 6.3898 4.6718 0.9400
3.2432 2400 4.1984 4.7118 5.4301 0.9313
3.3784 2500 - 7.2266 4.5063 0.9419
3.5135 2600 4.2538 8.1464 4.4121 0.9426
3.6486 2700 - 6.5866 4.6253 0.9409
3.7838 2800 4.2186 5.8797 4.8671 0.9380
3.9189 2900 - 5.5591 4.9559 0.9377
4.0541 3000 4.2064 6.3420 4.7167 0.9413
4.1892 3100 - 5.9561 4.8190 0.9387
4.3243 3200 4.2248 6.3844 4.6827 0.9410
4.4595 3300 - 7.1522 4.5193 0.9421
4.5946 3400 4.2263 4.8333 5.3410 0.9331
4.7297 3500 - 4.5820 5.5334 0.9306
4.8649 3600 4.2472 6.8254 4.5512 0.9413
5.0 3700 - 6.4904 4.6993 0.9399
5.1351 3800 4.1936 4.8578 5.3678 0.9344
5.2703 3900 - 6.4530 4.6426 0.9413
5.4054 4000 4.2345 6.6050 4.6684 0.9409
5.5405 4100 - 4.8690 5.3172 0.9334
5.6757 4200 4.2406 6.2903 4.7100 0.9404
5.8108 4300 - 6.6273 4.6269 0.9419
5.9459 4400 4.2227 5.4572 5.0365 0.9370
6.0811 4500 - 5.0242 5.2568 0.9341
6.2162 4600 4.1997 4.7279 5.5242 0.9316
6.3514 4700 - 5.1846 5.2246 0.9339
6.4865 4800 4.2361 5.8601 4.8249 0.9381
6.6216 4900 - 6.9398 4.5848 0.9423
6.7568 5000 4.2273 6.2977 4.6921 0.9406
6.8919 5100 - 6.9737 4.5439 0.9421
7.0270 5200 4.2052 5.3900 5.0873 0.9370
7.1622 5300 - 6.3929 4.6474 0.9406
7.2973 5400 4.2416 5.6994 4.9590 0.9371
7.4324 5500 - 6.3184 4.6922 0.9407
7.5676 5600 4.2311 5.3932 5.0403 0.9363
7.7027 5700 - 6.0781 4.7480 0.9394
7.8378 5800 4.229 5.3664 5.0291 0.9380
7.9730 5900 - 5.5803 4.9391 0.9371

Framework Versions

  • Python: 3.10.10
  • Sentence Transformers: 3.5.0.dev0
  • Transformers: 4.43.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 0.33.0
  • Datasets: 2.14.4
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}