SentenceTransformer

This is a sentence-transformers model trained on the cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation and geo_70k_multiplets_natural_language_annotation datasets. It maps sentences and paragraphs to a dense vector space (1024-dimensional in the usage example below) and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Full Model Architecture

SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (text_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=768, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (omics_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
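
For orientation, here is a minimal PyTorch sketch of what the AdapterModule blocks above could look like, reconstructed purely from the printed architecture. The class and constructor names, and the plain feed-forward pass, are assumptions rather than the actual mmcontext implementation.

import torch
import torch.nn as nn

class AdapterModule(nn.Module):
    # Projection head matching the printed layout:
    # Linear -> ReLU -> Linear -> BatchNorm1d.
    def __init__(self, in_features: int, hidden_dim: int = 512, out_features: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_features),
            nn.BatchNorm1d(out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# text_adapter projects the 768-dim BERT output; omics_adapter projects 512-dim omics features.
text_adapter = AdapterModule(in_features=768)
omics_adapter = AdapterModule(in_features=512)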

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-geneformer-2024-text-unfrozen")
# Run inference
sentences = [
    'Endothelial cell of lymphatic vessel derived from fresh fimbria tissue sample of a 65-year old female.',
    'Neuron cell type from a 29-year-old human, specifically from the thalamic complex, specifically the thalamus (THM) - posterior nuclear complex of thalamus (PoN) - medial geniculate nuclei (MG).',
    'Plasma cells derived from lung parenchyma tissue of a female individual in her eighth decade, with a 24-hour delay between sample collection and processing.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Triplet (cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation)

Metric            Value
cosine_accuracy   0.9543

Triplet (geo_70k_multiplets_natural_language_annotation)

Metric            Value
cosine_accuracy   0.949
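
Here, cosine_accuracy is the standard triplet metric from Sentence Transformers: the fraction of (anchor, positive, negative) triplets for which the anchor embedding is closer, by cosine similarity, to the positive than to the negative. A hedged sketch of how such a score is computed with TripletEvaluator (the strings are placeholders, not the actual evaluation data):

from sentence_transformers.evaluation import TripletEvaluator

# Placeholder triplets; the real evaluation uses the multiplet datasets described below.
evaluator = TripletEvaluator(
    anchors=["Endothelial cell of lymphatic vessel ..."],
    positives=["Lymphatic endothelial cell, fimbria tissue ..."],
    negatives=["Plasma cell, lung parenchyma ..."],
)
print(evaluator(model))  # e.g. {'cosine_accuracy': ...}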

Training Details

Training Datasets

cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

geo_70k_multiplets_natural_language_annotation

Evaluation Datasets

cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

geo_70k_multiplets_natural_language_annotation
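
The datasets are referenced by name only. Assuming they are hosted on the Hugging Face Hub under the same namespace as the model (an assumption, not stated in this card), they could be loaded like this:

from datasets import load_dataset

# Hub paths are assumptions inferred from the dataset names and the model's namespace.
cellxgene = load_dataset("jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation")
geo = load_dataset("jo-mengr/geo_70k_multiplets_natural_language_annotation")
print(cellxgene)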

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 16
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 1
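
As a sketch, these non-default values map directly onto SentenceTransformerTrainingArguments from Sentence Transformers v3+ (output_dir is a placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/mmcontext",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    num_train_epochs=16,
    warmup_ratio=0.1,
    fp16=True,
    dataloader_num_workers=1,
)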

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 16
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 1
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Column key: "cellxgene_35k loss" abbreviates cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation loss, and "geo_70k loss" abbreviates geo_70k_multiplets_natural_language_annotation loss. A "-" in the Training Loss column means no training loss was logged at that step.

Epoch   Step   Training Loss   cellxgene_35k loss   geo_70k loss   cosine_accuracy
0.1351 100 - 16.5681 15.3425 0.5510
0.2703 200 15.2121 16.3962 14.5975 0.6669
0.4054 300 - 15.1565 13.5315 0.7754
0.5405 400 13.4551 12.2976 11.6012 0.8340
0.6757 500 - 10.1066 8.5850 0.8704
0.8108 600 8.9059 7.8946 6.7269 0.8931
0.9459 700 - 6.1265 5.8313 0.9036
1.0811 800 5.8557 5.3230 5.3629 0.9107
1.2162 900 - 4.7961 5.0623 0.9209
1.3514 1000 4.8756 4.6028 4.7280 0.9279
1.4865 1100 - 4.6467 4.4183 0.9373
1.6216 1200 4.3719 4.7835 4.1918 0.9440
1.7568 1300 - 4.4550 4.0311 0.9476
1.8919 1400 4.0077 4.5942 3.8520 0.9497
2.0270 1500 - 4.0982 3.8556 0.9517
2.1622 1600 3.7523 4.3389 3.7847 0.9554
2.2973 1700 - 4.1296 3.8354 0.9521
2.4324 1800 3.7573 4.3382 3.7801 0.9553
2.5676 1900 - 4.1184 3.8465 0.9521
2.7027 2000 3.7301 4.2711 3.7977 0.9540
2.8378 2100 - 4.0863 3.8529 0.9516
2.9730 2200 3.7111 4.1145 3.8415 0.9517
3.1081 2300 - 4.2684 3.8076 0.9536
3.2432 2400 3.7155 3.8739 3.9858 0.9476
3.3784 2500 - 4.5718 3.7554 0.9556
3.5135 2600 3.7532 4.7481 3.7515 0.9573
3.6486 2700 - 4.3598 3.7741 0.9544
3.7838 2800 3.7255 4.2423 3.8044 0.9544
3.9189 2900 - 4.1150 3.8462 0.9517
4.0541 3000 3.7 4.2966 3.7923 0.9553
4.1892 3100 - 4.1954 3.8200 0.9524
4.3243 3200 3.7556 4.3824 3.7742 0.9556
4.4595 3300 - 4.5560 3.7541 0.9560
4.5946 3400 3.7283 3.9065 3.9552 0.9487
4.7297 3500 - 3.8415 4.0087 0.9481
4.8649 3600 3.741 4.4399 3.7655 0.9557
5.0 3700 - 4.5457 3.7542 0.9561
5.1351 3800 3.6978 3.9224 3.9533 0.9487
5.2703 3900 - 4.3493 3.7846 0.9554
5.4054 4000 3.7399 4.3480 3.7832 0.9549
5.5405 4100 - 3.9356 3.9337 0.9500
5.6757 4200 3.7406 4.3089 3.7905 0.9546
5.8108 4300 - 4.4414 3.7711 0.9550
5.9459 4400 3.7161 4.0804 3.8547 0.9521
6.0811 4500 - 3.9827 3.9103 0.9509
6.2162 4600 3.7038 3.8720 3.9825 0.9486
6.3514 4700 - 3.9803 3.9070 0.9503
6.4865 4800 3.7522 4.2410 3.8043 0.9551
6.6216 4900 - 4.5504 3.7628 0.9557
6.7568 5000 3.7252 4.3341 3.7837 0.9550
6.8919 5100 - 4.5281 3.7531 0.9560
7.0270 5200 3.6791 4.0975 3.8550 0.9517
7.1622 5300 - 4.3336 3.7814 0.9553
7.2973 5400 3.7546 4.1190 3.8355 0.9523
7.4324 5500 - 4.3390 3.7763 0.9554
7.5676 5600 3.725 4.1069 3.8476 0.9516
7.7027 5700 - 4.2602 3.7962 0.9546
7.8378 5800 3.7309 4.0831 3.8483 0.9517
7.9730 5900 - 4.1081 3.8386 0.9519
8.1081 6000 3.7056 4.2598 3.8045 0.9534
8.2432 6100 - 3.8669 3.9848 0.9479
8.3784 6200 3.7322 4.5549 3.7529 0.9559
8.5135 6300 - 4.7403 3.7472 0.9576
8.6486 6400 3.7317 4.3473 3.7718 0.9547
8.7838 6500 - 4.2320 3.7998 0.9546
8.9189 6600 3.7208 4.1063 3.8423 0.9519
9.0541 6700 - 4.2851 3.7893 0.9547
9.1892 6800 3.6945 4.1825 3.8167 0.9526
9.3243 6900 - 4.3738 3.7702 0.9560
9.4595 7000 3.7437 4.5468 3.7502 0.9560
9.5946 7100 - 3.8960 3.9519 0.9489
9.7297 7200 3.7285 3.8328 4.0028 0.9474
9.8649 7300 - 4.4250 3.7606 0.9557
10.0 7400 3.6724 4.5225 3.7482 0.9563
10.1351 7500 - 3.9094 3.9493 0.9486
10.2703 7600 3.7461 4.3360 3.7803 0.9550
10.4054 7700 - 4.3358 3.7772 0.9553
10.5405 7800 3.7407 3.9274 3.9251 0.9499
10.6757 7900 - 4.2977 3.7844 0.9543
10.8108 8000 3.728 4.4351 3.7666 0.9551
10.9459 8100 - 4.0689 3.8480 0.9521
11.0811 8200 3.6982 3.9707 3.9039 0.9509
11.2162 8300 - 3.8588 3.9769 0.9481
11.3514 8400 3.7318 3.9676 3.9023 0.9503
11.4865 8500 - 4.2258 3.7993 0.9549
11.6216 8600 3.7316 4.5318 3.7566 0.9559
11.7568 8700 - 4.3155 3.7782 0.9544
11.8919 8800 3.7158 4.5147 3.7473 0.9559
12.0270 8900 - 4.0836 3.8483 0.9517
12.1622 9000 3.6941 4.3180 3.7766 0.9546
12.2973 9100 - 4.1086 3.8267 0.9530
12.4324 9200 3.7351 4.3192 3.7696 0.9550
12.5676 9300 - 4.0972 3.8375 0.9516
12.7027 9400 3.7224 4.2462 3.7891 0.9543
12.8378 9500 - 4.0651 3.8419 0.9514
12.9730 9600 3.7019 4.0886 3.8325 0.9514
13.1081 9700 - 4.2453 3.7956 0.9533
13.2432 9800 3.6979 3.8549 3.9746 0.9480
13.3784 9900 - 4.5402 3.7440 0.9556
13.5135 10000 3.7436 4.7189 3.7372 0.9571
13.6486 10100 - 4.3368 3.7617 0.9546
13.7838 10200 3.7129 4.2180 3.7909 0.9540
13.9189 10300 - 4.0913 3.8344 0.9509
14.0541 10400 3.6821 4.2673 3.7803 0.9543
14.1892 10500 - 4.1662 3.8081 0.9524
14.3243 10600 3.7336 4.3547 3.7615 0.9554
14.4595 10700 - 4.5219 3.7425 0.9560
14.5946 10800 3.7057 3.8819 3.9436 0.9484
14.7297 10900 - 3.8188 3.9952 0.9479
14.8649 11000 3.7205 4.4094 3.7525 0.9547
15.0 11100 - 4.5114 3.7421 0.9556
15.1351 11200 3.6753 3.8929 3.9439 0.9483
15.2703 11300 - 4.3207 3.7717 0.9543
15.4054 11400 3.7216 4.3187 3.7698 0.9551
15.5405 11500 - 3.9106 3.9202 0.9490

Framework Versions

  • Python: 3.10.10
  • Sentence Transformers: 3.5.0.dev0
  • Transformers: 4.43.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 0.33.0
  • Datasets: 2.14.4
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}