---
language:
- code
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:94500
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: Primary CD8+ T cells from a subject identified as CL-MCRL, exposed to the GPR epitope with a dpi (days post-infection) of 87.5.
  sentences:
  - Cancer cell line (CCL23) derived from a carcinoma patient.
  - Primary CD34+ human cells in three-phase in vitro culture, isolated on day 13, with GG1dd zf vector transduction.
  - 23-year-old primary nonETP leukemic blasts from bone marrow.
- source_sentence: Hematopoietic cells with PI-AnnexinV-GFP+CD33+ phenotype from a xenograft strain NRG-3GS.
  sentences:
  - H9 embryonic stem cells treated with recombinant Wnt3a for 8 hours in culture.
  - iCell Hepatocytes that have been treated with 075\_OLBO\_10 in a study involving BO class and dose 10.
  - 48 hour treatment of colorectal carcinoma cell line HCT116 (colorectal cancer) with control treatment.
- source_sentence: Memory B cells derived from a female thoracic lymph node, obtained from a donor in their seventh decade.
  sentences:
  - Neuron cell type from the Pulvinar of thalamus, derived from a 42-year-old human individual.
  - Germinal center B cell derived from the tonsil tissue of a 3-year-old male with recurrent tonsillitis.
  - B cell sample from a 55-year old female Asian individual with managed systemic lupus erythematosus (SLE). The cell was derived from peripheral blood mononuclear cells (PBMCs).
- source_sentence: Pericyte cells, part of the smooth muscle lineage, extracted from the transition zone of a 74-year-old human prostate.
  sentences:
  - A CD8-positive, alpha-beta memory T cell, CD45RO-positive, specifically identified as Tem/Effector cytotoxic T cells, as determined by CellTypist prediction. The cell was obtained from the lung tissue of a female individual in her eighth decade.
  - CD4-positive, alpha-beta T cell sample taken from a 53-year old female Asian individual with managed systemic lupus erythematosus (SLE).
  - Natural killer cell from a 32-year old female of European descent with managed systemic lupus erythematosus (SLE).
- source_sentence: Sample is a basal cell of prostate epithelium, taken from the transition zone of the prostate gland in a 72-year old male. It belongs to the Epithelia lineage and Population BE.
  sentences:
  - Neuron cell type from a 42-year old male cerebral cortex tissue, specifically from the rostral gyrus dorsal division of MFC A32, classified as Deep-layer corticothalamic and 6b.
  - Dendritic cell from the transition zone of prostate of a 29-year-old male, specifically from the EREG+ population.
  - Neuron from the mediodorsal nucleus of thalamus, which is part of the medial nuclear complex of thalamus (MNC) in the thalamic complex, taken from a 42-year-old male human donor with European ethnicity. The neuron belongs to the Thalamic excitatory supercluster.
datasets:
- jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation
- jo-mengr/geo_70k_multiplets_natural_language_annotation
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
model-index:
- name: SentenceTransformer
  results:
  - task:
      type: triplet
      name: Triplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: cosine_accuracy
      value: 0.9402857422828674
      name: Cosine Accuracy
    - type: cosine_accuracy
      value: 0.9371428489685059
      name: Cosine Accuracy
---

# SentenceTransformer

This is a [sentence-transformers](https://www.SBERT.net) model trained on the [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation) and [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation) datasets.
It maps sentences & paragraphs to a dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** not reported by the model configuration
- **Output Dimensionality:** not reported by the model configuration
- **Similarity Function:** Cosine Similarity
- **Training Datasets:**
    - [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation)
    - [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation)
- **Language:** code

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (text_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=768, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (omics_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=64, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")
# Run inference
sentences = [
    'Sample is a basal cell of prostate epithelium, taken from the transition zone of the prostate gland in a 72-year old male. It belongs to the Epithelia lineage and Population BE.',
    'Neuron cell type from a 42-year old male cerebral cortex tissue, specifically from the rostral gyrus dorsal division of MFC A32, classified as Deep-layer corticothalamic and 6b.',
    'Neuron from the mediodorsal nucleus of thalamus, which is part of the medial nuclear complex of thalamus (MNC) in the thalamic complex, taken from a 42-year-old male human donor with European ethnicity. The neuron belongs to the Thalamic excitatory supercluster.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Metrics

#### Triplet

* Evaluated with [TripletEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| **cosine_accuracy** | **0.9403** |

#### Triplet

* Evaluated with [TripletEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| **cosine_accuracy** | **0.9371** |

## Training Details

### Training Datasets

#### cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

* Dataset: [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation) at [a6241c4](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation/tree/a6241c46b7e108ff9106fd7a1838117096e2c3c6)
* Size: 31,500 training samples
* Columns: anndata_ref, positive, negative_1, and negative_2
* Approximate statistics based on the first 1000 samples:

|         | anndata_ref | positive | negative_1 | negative_2 |
|:--------|:------------|:---------|:-----------|:-----------|
| type    | dict        | string   | string     | dict       |
| details |             |          |            |            |

* Samples:

| anndata_ref | positive | negative_1 | negative_2 |
|:------------|:---------|:-----------|:-----------|
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_1f1c5c14-5949-4c81-b28e-b272e271b672_570'} | Stromal cell of ovary, specifically Stroma-2, from a human adult female individual, in S phase of the cell cycle. | Neuron cell type from a 50-year-old male human thalamic complex, specifically from the ventral anterior nucleus of thalamus within the lateral nuclear complex. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_1b9d8702-5af8-4142-85ed-020eb06ec4f6_19663'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_218acb0f-9f2f-4f76-b90b-15a4b7c7f629_34872'} | CD8-positive, alpha-beta T cell sample from a 52-year old Asian female with managed systemic lupus erythematosus (SLE). | Mucosal invariant T cell derived from the spleen of a female in her seventies. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_74cff64f-9da9-4b2a-9b3b-8a04a1598040_4145'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_74cff64f-9da9-4b2a-9b3b-8a04a1598040_7321'} | Hofbauer cell derived from the decidua basalis tissue of a female individual at 8 post conception week (8_PCW). The sample is a nucleus. | Regulatory T cell derived from a lymph node of a male individual with advanced non-small cell lung cancer (NSCLC), stage IV, who has a history of smoking. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_5a73f63f-18a2-49b5-b431-2c469c41a41b_163'} |

* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

#### geo_70k_multiplets_natural_language_annotation

* Dataset: [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation) at [449eb79](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation/tree/449eb79e41b05af4d3e32900144411963f626f8c)
* Size: 63,000 training samples
* Columns: anndata_ref, positive, negative_1, and negative_2
* Approximate statistics based on the first 1000 samples:

|         | anndata_ref | positive | negative_1 | negative_2 |
|:--------|:------------|:---------|:-----------|:-----------|
| type    | dict        | string   | string     | dict       |
| details |             |          |            |            |

* Samples:

| anndata_ref | positive | negative_1 | negative_2 |
|:------------|:---------|:-----------|:-----------|
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX3111576'} | 198Z\_MSCB-067 sample contains primary cells that are neuronal progenitors from patient type WB\_1. | 31-year-old female Caucasian with ntm disease provided a whole blood sample on July 11, 2016. The baseline FEVPP was 89.74 and FVCpp was 129.41. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX6591734'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX7834244'} | CD8+ T cells from a healthy skin sample, labeled C4, from plate rep1, well E6, sequencing batch b7, which passed QC, and clustered as 2\_Resid. | 6-week-old (PCW6) neuronal epithelium tissue from donor HSB325, cultured using C1-72 chip. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX2440281'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX3112138'} | 201Z\_MSCB-083 is a sample of primary neuronal progenitor cells from patient MD1 with no reported treatment. | 48-hour sample from HPV-negative UPCI:SCC131 cell line, a head and neck squamous cell carcinoma (HNSCC) cell line, that has not been irradiated. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX7448263'} |

* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

### Evaluation Datasets

#### cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation

* Dataset: [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation) at [a6241c4](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation/tree/a6241c46b7e108ff9106fd7a1838117096e2c3c6)
* Size: 3,500 evaluation samples
* Columns: anndata_ref, positive, negative_1, and negative_2
* Approximate statistics based on the first 1000 samples:

|         | anndata_ref | positive | negative_1 | negative_2 |
|:--------|:------------|:---------|:-----------|:-----------|
| type    | dict        | string   | string     | dict       |
| details |             |          |            |            |

* Samples:

| anndata_ref | positive | negative_1 | negative_2 |
|:------------|:---------|:-----------|:-----------|
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_842c6f5d-4a94-4eef-8510-8c792d1124bc_6822'} | Non-classical monocyte cell type, derived from a fresh breast tissue sample of an African American female donor with low breast density, obese BMI, and premenopausal status. The cell was obtained through resection procedure and analyzed using single-cell transcriptomics as part of the Human Breast Cell Atlas (HBCA) study. | Plasma cells derived from gingival tissue of a 39-year-old female. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_218acb0f-9f2f-4f76-b90b-15a4b7c7f629_23461'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_b46237d1-19c6-4af2-9335-9854634bad16_9825'} | Enteric neuron cells derived from the ileum tissue at Carnegie stage 22. | Ciliated cell from the trachea of a 6-12 year-old European male with no SARS-CoV-2 infection, who is a non-smoker and healthy. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_2872f4b0-b171-46e2-abc6-befcf6de6306_2871'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_d7d7e89c-c93a-422d-8958-9b4a90b69558_4209'} | Activated CD16-positive, CD56-dim natural killer cell taken from a 26-year-old male, activated with CD3, and found to be in G1 phase. | CD8-positive, alpha-beta thymocyte cell type derived from a 74-year-old male human with European self-reported ethnicity, located in the transition zone of the prostate. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_535e9336-2d8d-43c3-944d-bcbebe20df8a_18'} |

* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

#### geo_70k_multiplets_natural_language_annotation

* Dataset: [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation) at [449eb79](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation/tree/449eb79e41b05af4d3e32900144411963f626f8c)
* Size: 7,000 evaluation samples
* Columns: anndata_ref, positive, negative_1, and negative_2
* Approximate statistics based on the first 1000 samples:

|         | anndata_ref | positive | negative_1 | negative_2 |
|:--------|:------------|:---------|:-----------|:-----------|
| type    | dict        | string   | string     | dict       |
| details |             |          |            |            |

* Samples:

| anndata_ref | positive | negative_1 | negative_2 |
|:------------|:---------|:-----------|:-----------|
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX16033546'} | A549 lung adenocarcinoma cell line with ectopic expression of TPK1 p.G48C mutation. | 3 days after the 4th immunization, blood sample from donor 1033 with low antibody-dependent cellular phagocytosis (ADCP) category. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX10356703'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX8241199'} | Human fibroblasts at the D7 time point during reprogramming into induced pluripotent stem cells (iPSCs) or hiPSCs. | CD14+ monocytes from a healthy control participant (ID 2015). | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX14140416'} |
| {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX17834359'} | Whole blood sample from subject HRV15-017, collected at day 1 in the afternoon. | 59 year old male bronchial epithelial cells with 39 pack years of smoking history and imaging cluster 1. | {'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX5429074'} |

* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 128
- `learning_rate`: 2e-05
- `num_train_epochs`: 8
- `warmup_ratio`: 0.1
- `fp16`: True
- `dataloader_num_workers`: 1

#### All Hyperparameters

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 128
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 8
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 1
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `eval_use_gather_object`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

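
The combination of `learning_rate`: 2e-05, `lr_scheduler_type`: linear and `warmup_ratio`: 0.1 implies a learning rate that ramps up linearly and then decays linearly to zero. The sketch below is a rough, pure-Python illustration of that shape, not the trainer's actual code. The helper name `linear_warmup_decay_lr` is hypothetical, and the figure of 740 steps per epoch is an assumption inferred from the training log (step 3700 is logged at epoch 5.0) rather than a value the card states.

```python
import math

# Values taken from the hyperparameter list above.
BASE_LR = 2e-05
NUM_TRAIN_EPOCHS = 8
WARMUP_RATIO = 0.1

# Assumption: 740 optimizer steps per epoch, inferred from the training log
# (step 3700 is logged at epoch 5.0); the card does not state this directly.
STEPS_PER_EPOCH = 740
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_TRAIN_EPOCHS


def linear_warmup_decay_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS,
                           warmup_ratio=WARMUP_RATIO):
    """Hypothetical helper: learning rate at a given optimizer step.

    Linear ramp from 0 to base_lr over the warmup steps, then a linear
    decay from base_lr down to 0 at total_steps.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: fraction of the warmup completed so far.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fraction of the post-warmup steps still remaining.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    # Print the schedule at a few of the logged steps.
    for step in (100, 600, 1500, 3700, 5900):
        print(step, linear_warmup_decay_lr(step))
```

Under these assumptions, warmup covers roughly the first 0.8 epochs, so the earliest evaluation rows in the training log fall inside the ramp-up phase.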
### Training Logs

| Epoch | Step | Training Loss | cellxgene pseudo bulk 35k multiplets natural language annotation loss | geo 70k multiplets natural language annotation loss | cosine_accuracy |
|:------:|:----:|:-------------:|:---------------------------------------------------------------------:|:---------------------------------------------------:|:---------------:|
| 0.1351 | 100 | - | 19.5545 | 19.6050 | 0.5656 |
| 0.2703 | 200 | 17.2819 | 19.4888 | 17.2415 | 0.7261 |
| 0.4054 | 300 | - | 17.2527 | 14.3099 | 0.7684 |
| 0.5405 | 400 | 13.4122 | 13.1462 | 13.4371 | 0.7976 |
| 0.6757 | 500 | - | 12.6305 | 9.3601 | 0.8474 |
| 0.8108 | 600 | 8.3246 | 11.1233 | 7.6021 | 0.8787 |
| 0.9459 | 700 | - | 8.5871 | 7.6461 | 0.8980 |
| 1.0811 | 800 | 6.1203 | 7.0774 | 7.1605 | 0.9046 |
| 1.2162 | 900 | - | 6.0461 | 6.7694 | 0.9076 |
| 1.3514 | 1000 | 5.1622 | 6.1759 | 6.0741 | 0.9166 |
| 1.4865 | 1100 | - | 6.6497 | 5.3305 | 0.9269 |
| 1.6216 | 1200 | 4.7346 | 7.6330 | 4.9083 | 0.9324 |
| 1.7568 | 1300 | - | 6.5700 | 4.8609 | 0.9349 |
| 1.8919 | 1400 | 4.4577 | 6.9249 | 4.6155 | 0.9401 |
| 2.0270 | 1500 | - | 5.4120 | 5.0721 | 0.9367 |
| 2.1622 | 1600 | 4.2281 | 6.3842 | 4.6481 | 0.9407 |
| 2.2973 | 1700 | - | 5.6970 | 4.9588 | 0.9370 |
| 2.4324 | 1800 | 4.2392 | 6.3306 | 4.6888 | 0.9407 |
| 2.5676 | 1900 | - | 5.3909 | 5.0415 | 0.9364 |
| 2.7027 | 2000 | 4.2237 | 6.0779 | 4.7476 | 0.9394 |
| 2.8378 | 2100 | - | 5.3631 | 5.0280 | 0.9379 |
| 2.9730 | 2200 | 4.2215 | 5.5800 | 4.9418 | 0.9373 |
| 3.1081 | 2300 | - | 6.3898 | 4.6718 | 0.9400 |
| 3.2432 | 2400 | 4.1984 | 4.7118 | 5.4301 | 0.9313 |
| 3.3784 | 2500 | - | 7.2266 | 4.5063 | 0.9419 |
| 3.5135 | 2600 | 4.2538 | 8.1464 | 4.4121 | 0.9426 |
| 3.6486 | 2700 | - | 6.5866 | 4.6253 | 0.9409 |
| 3.7838 | 2800 | 4.2186 | 5.8797 | 4.8671 | 0.9380 |
| 3.9189 | 2900 | - | 5.5591 | 4.9559 | 0.9377 |
| 4.0541 | 3000 | 4.2064 | 6.3420 | 4.7167 | 0.9413 |
| 4.1892 | 3100 | - | 5.9561 | 4.8190 | 0.9387 |
| 4.3243 | 3200 | 4.2248 | 6.3844 | 4.6827 | 0.9410 |
| 4.4595 | 3300 | - | 7.1522 | 4.5193 | 0.9421 |
| 4.5946 | 3400 | 4.2263 | 4.8333 | 5.3410 | 0.9331 |
| 4.7297 | 3500 | - | 4.5820 | 5.5334 | 0.9306 |
| 4.8649 | 3600 | 4.2472 | 6.8254 | 4.5512 | 0.9413 |
| 5.0 | 3700 | - | 6.4904 | 4.6993 | 0.9399 |
| 5.1351 | 3800 | 4.1936 | 4.8578 | 5.3678 | 0.9344 |
| 5.2703 | 3900 | - | 6.4530 | 4.6426 | 0.9413 |
| 5.4054 | 4000 | 4.2345 | 6.6050 | 4.6684 | 0.9409 |
| 5.5405 | 4100 | - | 4.8690 | 5.3172 | 0.9334 |
| 5.6757 | 4200 | 4.2406 | 6.2903 | 4.7100 | 0.9404 |
| 5.8108 | 4300 | - | 6.6273 | 4.6269 | 0.9419 |
| 5.9459 | 4400 | 4.2227 | 5.4572 | 5.0365 | 0.9370 |
| 6.0811 | 4500 | - | 5.0242 | 5.2568 | 0.9341 |
| 6.2162 | 4600 | 4.1997 | 4.7279 | 5.5242 | 0.9316 |
| 6.3514 | 4700 | - | 5.1846 | 5.2246 | 0.9339 |
| 6.4865 | 4800 | 4.2361 | 5.8601 | 4.8249 | 0.9381 |
| 6.6216 | 4900 | - | 6.9398 | 4.5848 | 0.9423 |
| 6.7568 | 5000 | 4.2273 | 6.2977 | 4.6921 | 0.9406 |
| 6.8919 | 5100 | - | 6.9737 | 4.5439 | 0.9421 |
| 7.0270 | 5200 | 4.2052 | 5.3900 | 5.0873 | 0.9370 |
| 7.1622 | 5300 | - | 6.3929 | 4.6474 | 0.9406 |
| 7.2973 | 5400 | 4.2416 | 5.6994 | 4.9590 | 0.9371 |
| 7.4324 | 5500 | - | 6.3184 | 4.6922 | 0.9407 |
| 7.5676 | 5600 | 4.2311 | 5.3932 | 5.0403 | 0.9363 |
| 7.7027 | 5700 | - | 6.0781 | 4.7480 | 0.9394 |
| 7.8378 | 5800 | 4.229 | 5.3664 | 5.0291 | 0.9380 |
| 7.9730 | 5900 | - | 5.5803 | 4.9391 | 0.9371 |

### Framework Versions

- Python: 3.10.10
- Sentence Transformers: 3.5.0.dev0
- Transformers: 4.43.4
- PyTorch: 2.6.0+cu124
- Accelerate: 0.33.0
- Datasets: 2.14.4
- Tokenizers: 0.19.1

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```