jo-mengr committed on
Commit 5596a00 · verified · 1 Parent(s): cfc7ab6

Add new SentenceTransformer model

0_MMContextEncoder/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "text_encoder_name": "pritamdeka/S-BioBert-snli-multinli-stsb",
+   "omics_input_dim": 64,
+   "processor_obsm_key": "X_pca",
+   "freeze_text_encoder": true,
+   "unfreeze_last_n_layers": 1,
+   "adapter_hidden_dim": 512,
+   "adapter_output_dim": 2048
+ }
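To make the adapter settings in this config concrete, here is a small sketch that derives the parameter counts of the two adapters from it. It assumes the layer layout printed in the README's architecture dump (Linear, ReLU, Linear, BatchNorm1d) and takes the text-side input width of 768 from that dump, since `config.json` does not store it. The helper name is illustrative only.

```python
import json

# config.json as added in this commit
CONFIG = json.loads("""
{
  "text_encoder_name": "pritamdeka/S-BioBert-snli-multinli-stsb",
  "omics_input_dim": 64,
  "processor_obsm_key": "X_pca",
  "freeze_text_encoder": true,
  "unfreeze_last_n_layers": 1,
  "adapter_hidden_dim": 512,
  "adapter_output_dim": 2048
}
""")

# Assumption: 768 is the BERT hidden size shown in the README's architecture
# dump; it is not a field of config.json.
TEXT_HIDDEN_DIM = 768


def adapter_param_count(in_dim: int, hidden_dim: int, out_dim: int) -> int:
    """Learnable parameters of Linear -> ReLU -> Linear -> BatchNorm1d."""
    first = in_dim * hidden_dim + hidden_dim    # weights + bias
    second = hidden_dim * out_dim + out_dim     # weights + bias
    batch_norm = 2 * out_dim                    # affine scale and shift
    return first + second + batch_norm


hidden = CONFIG["adapter_hidden_dim"]
out = CONFIG["adapter_output_dim"]
text_params = adapter_param_count(TEXT_HIDDEN_DIM, hidden, out)
omics_params = adapter_param_count(CONFIG["omics_input_dim"], hidden, out)
print(text_params, omics_params)
```

Both adapters share the same hidden and output widths, so they differ only in the first linear layer.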
0_MMContextEncoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:081f0609dda1fd7be3d60a0516c748a50c2f77e7832537f797955f305d73af81
+ size 443446304
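The three lines added for `model.safetensors` are a Git LFS pointer, not the weights themselves. As a rough illustration, the sketch below splits such a pointer into its fields and reports the size of the real file. The function name is hypothetical and the parsing covers only the keys that appear in this commit.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:081f0609dda1fd7be3d60a0516c748a50c2f77e7832537f797955f305d73af81
size 443446304
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
print(f"{pointer['size_bytes'] / 1e6:.1f} MB")
```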
README.md ADDED
@@ -0,0 +1,613 @@
+ ---
+ language:
+ - code
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:94500
+ - loss:MultipleNegativesRankingLoss
+ widget:
+ - source_sentence: Primary CD8+ T cells from a subject identified as CL-MCRL, exposed
+     to the GPR epitope with a dpi (days post-infection) of 87.5.
+   sentences:
+   - Cancer cell line (CCL23) derived from a carcinoma patient.
+   - Primary CD34+ human cells in three-phase in vitro culture, isolated on day 13,
+     with GG1dd zf vector transduction.
+   - 23-year-old primary nonETP leukemic blasts from bone marrow.
+ - source_sentence: Hematopoietic cells with PI-AnnexinV-GFP+CD33+ phenotype from a
+     xenograft strain NRG-3GS.
+   sentences:
+   - H9 embryonic stem cells treated with recombinant Wnt3a for 8 hours in culture.
+   - iCell Hepatocytes that have been treated with 075\_OLBO\_10 in a study involving
+     BO class and dose 10.
+   - 48 hour treatment of colorectal carcinoma cell line HCT116 (colorectal cancer)
+     with control treatment.
+ - source_sentence: Memory B cells derived from a female thoracic lymph node, obtained
+     from a donor in their seventh decade.
+   sentences:
+   - Neuron cell type from the Pulvinar of thalamus, derived from a 42-year-old human
+     individual.
+   - Germinal center B cell derived from the tonsil tissue of a 3-year-old male with
+     recurrent tonsillitis.
+   - B cell sample from a 55-year old female Asian individual with managed systemic
+     lupus erythematosus (SLE). The cell was derived from peripheral blood mononuclear
+     cells (PBMCs).
+ - source_sentence: Pericyte cells, part of the smooth muscle lineage, extracted from
+     the transition zone of a 74-year-old human prostate.
+   sentences:
+   - A CD8-positive, alpha-beta memory T cell, CD45RO-positive, specifically identified
+     as Tem/Effector cytotoxic T cells, as determined by CellTypist prediction. The
+     cell was obtained from the lung tissue of a female individual in her eighth decade.
+   - CD4-positive, alpha-beta T cell sample taken from a 53-year old female Asian individual
+     with managed systemic lupus erythematosus (SLE).
+   - Natural killer cell from a 32-year old female of European descent with managed
+     systemic lupus erythematosus (SLE).
+ - source_sentence: Sample is a basal cell of prostate epithelium, taken from the transition
+     zone of the prostate gland in a 72-year old male. It belongs to the Epithelia
+     lineage and Population BE.
+   sentences:
+   - Neuron cell type from a 42-year old male cerebral cortex tissue, specifically
+     from the rostral gyrus dorsal division of MFC A32, classified as Deep-layer corticothalamic
+     and 6b.
+   - Dendritic cell from the transition zone of prostate of a 29-year-old male, specifically
+     from the EREG+ population.
+   - Neuron from the mediodorsal nucleus of thalamus, which is part of the medial nuclear
+     complex of thalamus (MNC) in the thalamic complex, taken from a 42-year-old male
+     human donor with European ethnicity. The neuron belongs to the Thalamic excitatory
+     supercluster.
+ datasets:
+ - jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation
+ - jo-mengr/geo_70k_multiplets_natural_language_annotation
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy
+ model-index:
+ - name: SentenceTransformer
+   results:
+   - task:
+       type: triplet
+       name: Triplet
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy
+       value: 0.9402857422828674
+       name: Cosine Accuracy
+     - type: cosine_accuracy
+       value: 0.9371428489685059
+       name: Cosine Accuracy
+ ---
+
+ # SentenceTransformer
+
+ This is a [sentence-transformers](https://www.SBERT.net) model trained on the [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation) and [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation) datasets. It maps sentences and paragraphs to a 2048-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
+ - **Maximum Sequence Length:** Not specified (the BERT text encoder has 512 position embeddings)
+ - **Output Dimensionality:** 2048 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Datasets:**
+     - [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation)
+     - [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation)
+ - **Language:** code
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): MMContextEncoder(
+     (text_encoder): BertModel(
+       (embeddings): BertEmbeddings(
+         (word_embeddings): Embedding(28996, 768, padding_idx=0)
+         (position_embeddings): Embedding(512, 768)
+         (token_type_embeddings): Embedding(2, 768)
+         (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
+         (dropout): Dropout(p=0.1, inplace=False)
+       )
+       (encoder): BertEncoder(
+         (layer): ModuleList(
+           (0-11): 12 x BertLayer(
+             (attention): BertAttention(
+               (self): BertSdpaSelfAttention(
+                 (query): Linear(in_features=768, out_features=768, bias=True)
+                 (key): Linear(in_features=768, out_features=768, bias=True)
+                 (value): Linear(in_features=768, out_features=768, bias=True)
+                 (dropout): Dropout(p=0.1, inplace=False)
+               )
+               (output): BertSelfOutput(
+                 (dense): Linear(in_features=768, out_features=768, bias=True)
+                 (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
+                 (dropout): Dropout(p=0.1, inplace=False)
+               )
+             )
+             (intermediate): BertIntermediate(
+               (dense): Linear(in_features=768, out_features=3072, bias=True)
+               (intermediate_act_fn): GELUActivation()
+             )
+             (output): BertOutput(
+               (dense): Linear(in_features=3072, out_features=768, bias=True)
+               (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
+               (dropout): Dropout(p=0.1, inplace=False)
+             )
+           )
+         )
+       )
+       (pooler): BertPooler(
+         (dense): Linear(in_features=768, out_features=768, bias=True)
+         (activation): Tanh()
+       )
+     )
+     (text_adapter): AdapterModule(
+       (net): Sequential(
+         (0): Linear(in_features=768, out_features=512, bias=True)
+         (1): ReLU(inplace=True)
+         (2): Linear(in_features=512, out_features=2048, bias=True)
+         (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+       )
+     )
+     (omics_adapter): AdapterModule(
+       (net): Sequential(
+         (0): Linear(in_features=64, out_features=512, bias=True)
+         (1): ReLU(inplace=True)
+         (2): Linear(in_features=512, out_features=2048, bias=True)
+         (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+       )
+     )
+   )
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("jo-mengr/mmcontext-100k-natural_language_annotation-pca-1024")
+ # Run inference
+ sentences = [
+     'Sample is a basal cell of prostate epithelium, taken from the transition zone of the prostate gland in a 72-year old male. It belongs to the Epithelia lineage and Population BE.',
+     'Neuron cell type from a 42-year old male cerebral cortex tissue, specifically from the rostral gyrus dorsal division of MFC A32, classified as Deep-layer corticothalamic and 6b.',
+     'Neuron from the mediodorsal nucleus of thalamus, which is part of the medial nuclear complex of thalamus (MNC) in the thalamic complex, taken from a 42-year-old male human donor with European ethnicity. The neuron belongs to the Thalamic excitatory supercluster.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 2048]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Triplet
+
+ * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | **cosine_accuracy** | **0.9403** |
+
+ #### Triplet
+
+ * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | **cosine_accuracy** | **0.9371** |
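The `cosine_accuracy` values reported here are the share of evaluation triplets in which the anchor is closer, by cosine similarity, to its positive than to its negative. A minimal pure-Python sketch of that computation follows. The 2-D vectors are made-up stand-ins for real embeddings, and the function names are illustrative rather than part of sentence-transformers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
        for anchor, positive, negative in triplets
    )
    return correct / len(triplets)


# Toy embeddings (hypothetical values, for illustration only).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer: correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: wrong
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```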
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Datasets
+
+ #### cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation
+
+ * Dataset: [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation) at [a6241c4](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation/tree/a6241c46b7e108ff9106fd7a1838117096e2c3c6)
+ * Size: 31,500 training samples
+ * Columns: <code>anndata_ref</code>, <code>positive</code>, <code>negative_1</code>, and <code>negative_2</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anndata_ref | positive | negative_1 | negative_2 |
+   |:--------|:------------|:---------|:-----------|:-----------|
+   | type    | dict | string | string | dict |
+   | details | <ul><li></li></ul> | <ul><li>min: 53 characters</li><li>mean: 163.04 characters</li><li>max: 743 characters</li></ul> | <ul><li>min: 43 characters</li><li>mean: 163.42 characters</li><li>max: 609 characters</li></ul> | <ul><li></li></ul> |
+ * Samples:
+ | anndata_ref | positive | negative_1 | negative_2 |
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_1f1c5c14-5949-4c81-b28e-b272e271b672_570'}</code> | <code>Stromal cell of ovary, specifically Stroma-2, from a human adult female individual, in S phase of the cell cycle.</code> | <code>Neuron cell type from a 50-year-old male human thalamic complex, specifically from the ventral anterior nucleus of thalamus within the lateral nuclear complex.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_1b9d8702-5af8-4142-85ed-020eb06ec4f6_19663'}</code> |
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_218acb0f-9f2f-4f76-b90b-15a4b7c7f629_34872'}</code> | <code>CD8-positive, alpha-beta T cell sample from a 52-year old Asian female with managed systemic lupus erythematosus (SLE).</code> | <code>Mucosal invariant T cell derived from the spleen of a female in her seventies.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_74cff64f-9da9-4b2a-9b3b-8a04a1598040_4145'}</code> |
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_74cff64f-9da9-4b2a-9b3b-8a04a1598040_7321'}</code> | <code>Hofbauer cell derived from the decidua basalis tissue of a female individual at 8 post conception week (8_PCW). The sample is a nucleus.</code> | <code>Regulatory T cell derived from a lymph node of a male individual with advanced non-small cell lung cancer (NSCLC), stage IV, who has a history of smoking.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cZdKEMQFMKGHc6E/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GDgf9MfckNmk2Bf/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/GWrtoRASdZAWdPa/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/FAiRMKztdjLYG23/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TDTo6seSi6qrGTq/download'}}, 'sample_id': 'census_5a73f63f-18a2-49b5-b431-2c469c41a41b_163'}</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
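With `scale` 20.0 and `cos_sim`, this loss treats each anchor's own positive as the correct class among all candidates in the batch: cosine similarities are multiplied by the scale and passed through softmax cross-entropy. The sketch below illustrates that formula in pure Python with toy 2-D vectors. It is not the library's implementation, and all names and values in it are hypothetical. Explicit negatives such as `negative_1`, when used, would simply be appended to `candidates`.

```python
import math


def cos_sim(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def multiple_negatives_ranking_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i].

    Every other candidate in the batch acts as a negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
well_aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their anchors
misaligned = [[0.1, 0.9], [0.9, 0.1]]     # positives swapped

low = multiple_negatives_ranking_loss(anchors, well_aligned)
high = multiple_negatives_ranking_loss(anchors, misaligned)
print(low, high)
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity produce larger differences in loss.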
+
+ #### geo_70k_multiplets_natural_language_annotation
+
+ * Dataset: [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation) at [449eb79](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation/tree/449eb79e41b05af4d3e32900144411963f626f8c)
+ * Size: 63,000 training samples
+ * Columns: <code>anndata_ref</code>, <code>positive</code>, <code>negative_1</code>, and <code>negative_2</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anndata_ref | positive | negative_1 | negative_2 |
+   |:--------|:------------|:---------|:-----------|:-----------|
+   | type    | dict | string | string | dict |
+   | details | <ul><li></li></ul> | <ul><li>min: 21 characters</li><li>mean: 139.4 characters</li><li>max: 696 characters</li></ul> | <ul><li>min: 23 characters</li><li>mean: 142.09 characters</li><li>max: 705 characters</li></ul> | <ul><li></li></ul> |
+ * Samples:
+ | anndata_ref | positive | negative_1 | negative_2 |
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX3111576'}</code> | <code>198Z\_MSCB-067 sample contains primary cells that are neuronal progenitors from patient type WB\_1.</code> | <code>31-year-old female Caucasian with ntm disease provided a whole blood sample on July 11, 2016. The baseline FEVPP was 89.74 and FVCpp was 129.41.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX6591734'}</code> |
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX7834244'}</code> | <code>CD8+ T cells from a healthy skin sample, labeled C4, from plate rep1, well E6, sequencing batch b7, which passed QC, and clustered as 2\_Resid.</code> | <code>6-week-old (PCW6) neuronal epithelium tissue from donor HSB325, cultured using C1-72 chip.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX2440281'}</code> |
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX3112138'}</code> | <code>201Z\_MSCB-083 is a sample of primary neuronal progenitor cells from patient MD1 with no reported treatment.</code> | <code>48-hour sample from HPV-negative UPCI:SCC131 cell line, a head and neck squamous cell carcinoma (HNSCC) cell line, that has not been irradiated.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download'}}, 'sample_id': 'SRX7448263'}</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
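Each `anndata_ref` value in the sample tables is a dictionary holding a `file_record` (a dataset path plus one download link per embedding) and a `sample_id`. The sketch below picks the link for the embedding named by `processor_obsm_key` in `config.json` (`X_pca`). It only illustrates the record layout: the helper name is hypothetical, and how the model's processor actually resolves these links is not shown in this commit.

```python
# One anndata_ref value copied from the first geo_70k training sample row.
record = {
    "file_record": {
        "dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/mwyWK7cTL3j5ydA/download",
        "embeddings": {
            "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/Tg4TMSg8gDtxJ5x/download",
            "X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/QjSE4s5ZHamjwfi/download",
            "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rYEATQXRJsx42Qr/download",
            "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/cWgZaKPJLsgb5Zo/download",
        },
    },
    "sample_id": "SRX3111576",
}


def embedding_link(anndata_ref: dict, obsm_key: str) -> tuple[str, str]:
    """Return (sample_id, download link) for one embedding key of a record."""
    embeddings = anndata_ref["file_record"]["embeddings"]
    if obsm_key not in embeddings:
        raise KeyError(f"{obsm_key!r} not found; available: {sorted(embeddings)}")
    return anndata_ref["sample_id"], embeddings[obsm_key]


sample_id, link = embedding_link(record, "X_pca")
print(sample_id, link)
```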
+
+ ### Evaluation Datasets
+
+ #### cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation
+
+ * Dataset: [cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation) at [a6241c4](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_35k_multiplets_natural_language_annotation/tree/a6241c46b7e108ff9106fd7a1838117096e2c3c6)
+ * Size: 3,500 evaluation samples
+ * Columns: <code>anndata_ref</code>, <code>positive</code>, <code>negative_1</code>, and <code>negative_2</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anndata_ref | positive | negative_1 | negative_2 |
+   |:--------|:------------|:---------|:-----------|:-----------|
+   | type    | dict | string | string | dict |
+   | details | <ul><li></li></ul> | <ul><li>min: 51 characters</li><li>mean: 168.27 characters</li><li>max: 829 characters</li></ul> | <ul><li>min: 57 characters</li><li>mean: 174.27 characters</li><li>max: 804 characters</li></ul> | <ul><li></li></ul> |
+ * Samples:
+ | anndata_ref | positive | negative_1 | negative_2 |
+ |:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_842c6f5d-4a94-4eef-8510-8c792d1124bc_6822'}</code> | <code>Non-classical monocyte cell type, derived from a fresh breast tissue sample of an African American female donor with low breast density, obese BMI, and premenopausal status. The cell was obtained through resection procedure and analyzed using single-cell transcriptomics as part of the Human Breast Cell Atlas (HBCA) study.</code> | <code>Plasma cells derived from gingival tissue of a 39-year-old female.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_218acb0f-9f2f-4f76-b90b-15a4b7c7f629_23461'}</code> |
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_b46237d1-19c6-4af2-9335-9854634bad16_9825'}</code> | <code>Enteric neuron cells derived from the ileum tissue at Carnegie stage 22.</code> | <code>Ciliated cell from the trachea of a 6-12 year-old European male with no SARS-CoV-2 infection, who is a non-smoker and healthy.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_2872f4b0-b171-46e2-abc6-befcf6de6306_2871'}</code> |
332
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_d7d7e89c-c93a-422d-8958-9b4a90b69558_4209'}</code> | <code>Activated CD16-positive, CD56-dim natural killer cell taken from a 26-year-old male, activated with CD3, and found to be in G1 phase.</code> | <code>CD8-positive, alpha-beta thymocyte cell type derived from a 74-year-old male human with European self-reported ethnicity, located in the transition zone of the prostate.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/Zk4EtWao9WKAQKc/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/LET7EG7xi56RqMd/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/5qjxiEJwwdNHTBX/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/z4TQkdxcP3ynBMn/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/6NZ94ZLkLKYyPcY/download'}}, 'sample_id': 'census_535e9336-2d8d-43c3-944d-bcbebe20df8a_18'}</code> |
333
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
334
+ ```json
+ {
+     "scale": 20.0,
+     "similarity_fct": "cos_sim"
+ }
+ ```
340
+
341
+ #### geo_70k_multiplets_natural_language_annotation
342
+
343
+ * Dataset: [geo_70k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation) at [449eb79](https://huggingface.co/datasets/jo-mengr/geo_70k_multiplets_natural_language_annotation/tree/449eb79e41b05af4d3e32900144411963f626f8c)
344
+ * Size: 7,000 evaluation samples
345
+ * Columns: <code>anndata_ref</code>, <code>positive</code>, <code>negative_1</code>, and <code>negative_2</code>
346
+ * Approximate statistics based on the first 1000 samples:
347
+ | | anndata_ref | positive | negative_1 | negative_2 |
348
+ |:--------|:-------------------|:------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------|:-------------------|
349
+ | type | dict | string | string | dict |
350
+ | details | <ul><li></li></ul> | <ul><li>min: 22 characters</li><li>mean: 138.7 characters</li><li>max: 702 characters</li></ul> | <ul><li>min: 22 characters</li><li>mean: 131.79 characters</li><li>max: 702 characters</li></ul> | <ul><li></li></ul> |
351
+ * Samples:
352
+ | anndata_ref | positive | negative_1 | negative_2 |
353
+ |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
354
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX16033546'}</code> | <code>A549 lung adenocarcinoma cell line with ectopic expression of TPK1 p.G48C mutation.</code> | <code>3 days after the 4th immunization, blood sample from donor 1033 with low antibody-dependent cellular phagocytosis (ADCP) category.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX10356703'}</code> |
355
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX8241199'}</code> | <code>Human fibroblasts at the D7 time point during reprogramming into induced pluripotent stem cells (iPSCs) or hiPSCs.</code> | <code>CD14+ monocytes from a healthy control participant (ID 2015).</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX14140416'}</code> |
356
+ | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX17834359'}</code> | <code>Whole blood sample from subject HRV15-017, collected at day 1 in the afternoon.</code> | <code>59 year old male bronchial epithelial cells with 39 pack years of smoking history and imaging cluster 1.</code> | <code>{'file_record': {'dataset_path': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kfjX6LkLewqssdN/download', 'embeddings': {'X_geneformer': 'https://nxc-fredato.imbi.uni-freiburg.de/s/kxd2NqJjnMSArf6/download', 'X_hvg': 'https://nxc-fredato.imbi.uni-freiburg.de/s/zqPbdqn5nCgo7rb/download', 'X_pca': 'https://nxc-fredato.imbi.uni-freiburg.de/s/b7sANypKxGyYQ2J/download', 'X_scvi': 'https://nxc-fredato.imbi.uni-freiburg.de/s/TwFF6TWRp9sMxgc/download'}}, 'sample_id': 'SRX5429074'}</code> |
357
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
358
+ ```json
+ {
+     "scale": 20.0,
+     "similarity_fct": "cos_sim"
+ }
+ ```
364
+
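Both evaluation datasets use the same loss settings. As a rough illustration of what `scale` and `similarity_fct` control, the pure-Python sketch below scores each anchor against every candidate with cosine similarity, multiplies the scores by `scale`, and takes cross-entropy with the matching candidate as the target. The function names and toy vectors are illustrative assumptions; this is not the Sentence Transformers implementation, which works on batched tensors and adds any explicit negative columns to the candidate pool.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i].

    All other candidates act as negatives for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over the scaled similarities.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


vectors = [[1.0, 0.0], [0.0, 1.0]]
print(mnr_loss(vectors, vectors))        # near 0: each anchor matches its own candidate
print(mnr_loss(vectors, vectors[::-1]))  # near 20: each anchor prefers the wrong candidate
```

With `scale` set to 20.0, a cosine gap of 1.0 between the right and wrong candidate becomes a logit gap of 20, which is why the two toy cases land near 0 and near 20.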
365
+ ### Training Hyperparameters
366
+ #### Non-Default Hyperparameters
367
+
368
+ - `eval_strategy`: steps
369
+ - `per_device_train_batch_size`: 128
370
+ - `per_device_eval_batch_size`: 128
371
+ - `learning_rate`: 2e-05
372
+ - `num_train_epochs`: 8
373
+ - `warmup_ratio`: 0.1
374
+ - `fp16`: True
375
+ - `dataloader_num_workers`: 1
376
+
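To make `learning_rate` and `warmup_ratio` concrete, here is a minimal sketch of a linear warmup followed by linear decay, which is how the `linear` scheduler type listed under All Hyperparameters is commonly described. The total of 5,920 steps is an assumption inferred from the training log (step 3700 falls at epoch 5.0, so about 740 steps per epoch over 8 epochs); the function name is hypothetical.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed total: 740 steps per epoch (inferred from the log) times 8 epochs.
TOTAL_STEPS = 740 * 8
for step in (0, 296, 592, 3256, 5920):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate peaks at about step 592, roughly 0.8 epochs in, and decays to zero by the end of training.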
377
+ #### All Hyperparameters
378
+ <details><summary>Click to expand</summary>
379
+
380
+ - `overwrite_output_dir`: False
381
+ - `do_predict`: False
382
+ - `eval_strategy`: steps
383
+ - `prediction_loss_only`: True
384
+ - `per_device_train_batch_size`: 128
385
+ - `per_device_eval_batch_size`: 128
386
+ - `per_gpu_train_batch_size`: None
387
+ - `per_gpu_eval_batch_size`: None
388
+ - `gradient_accumulation_steps`: 1
389
+ - `eval_accumulation_steps`: None
390
+ - `torch_empty_cache_steps`: None
391
+ - `learning_rate`: 2e-05
392
+ - `weight_decay`: 0.0
393
+ - `adam_beta1`: 0.9
394
+ - `adam_beta2`: 0.999
395
+ - `adam_epsilon`: 1e-08
396
+ - `max_grad_norm`: 1.0
397
+ - `num_train_epochs`: 8
398
+ - `max_steps`: -1
399
+ - `lr_scheduler_type`: linear
400
+ - `lr_scheduler_kwargs`: {}
401
+ - `warmup_ratio`: 0.1
402
+ - `warmup_steps`: 0
403
+ - `log_level`: passive
404
+ - `log_level_replica`: warning
405
+ - `log_on_each_node`: True
406
+ - `logging_nan_inf_filter`: True
407
+ - `save_safetensors`: True
408
+ - `save_on_each_node`: False
409
+ - `save_only_model`: False
410
+ - `restore_callback_states_from_checkpoint`: False
411
+ - `no_cuda`: False
412
+ - `use_cpu`: False
413
+ - `use_mps_device`: False
414
+ - `seed`: 42
415
+ - `data_seed`: None
416
+ - `jit_mode_eval`: False
417
+ - `use_ipex`: False
418
+ - `bf16`: False
419
+ - `fp16`: True
420
+ - `fp16_opt_level`: O1
421
+ - `half_precision_backend`: auto
422
+ - `bf16_full_eval`: False
423
+ - `fp16_full_eval`: False
424
+ - `tf32`: None
425
+ - `local_rank`: 0
426
+ - `ddp_backend`: None
427
+ - `tpu_num_cores`: None
428
+ - `tpu_metrics_debug`: False
429
+ - `debug`: []
430
+ - `dataloader_drop_last`: False
431
+ - `dataloader_num_workers`: 1
432
+ - `dataloader_prefetch_factor`: None
433
+ - `past_index`: -1
434
+ - `disable_tqdm`: False
435
+ - `remove_unused_columns`: True
436
+ - `label_names`: None
437
+ - `load_best_model_at_end`: False
438
+ - `ignore_data_skip`: False
439
+ - `fsdp`: []
440
+ - `fsdp_min_num_params`: 0
441
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
442
+ - `fsdp_transformer_layer_cls_to_wrap`: None
443
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
444
+ - `deepspeed`: None
445
+ - `label_smoothing_factor`: 0.0
446
+ - `optim`: adamw_torch
447
+ - `optim_args`: None
448
+ - `adafactor`: False
449
+ - `group_by_length`: False
450
+ - `length_column_name`: length
451
+ - `ddp_find_unused_parameters`: None
452
+ - `ddp_bucket_cap_mb`: None
453
+ - `ddp_broadcast_buffers`: False
454
+ - `dataloader_pin_memory`: True
455
+ - `dataloader_persistent_workers`: False
456
+ - `skip_memory_metrics`: True
457
+ - `use_legacy_prediction_loop`: False
458
+ - `push_to_hub`: False
459
+ - `resume_from_checkpoint`: None
460
+ - `hub_model_id`: None
461
+ - `hub_strategy`: every_save
462
+ - `hub_private_repo`: False
463
+ - `hub_always_push`: False
464
+ - `gradient_checkpointing`: False
465
+ - `gradient_checkpointing_kwargs`: None
466
+ - `include_inputs_for_metrics`: False
467
+ - `eval_do_concat_batches`: True
468
+ - `fp16_backend`: auto
469
+ - `push_to_hub_model_id`: None
470
+ - `push_to_hub_organization`: None
471
+ - `mp_parameters`:
472
+ - `auto_find_batch_size`: False
473
+ - `full_determinism`: False
474
+ - `torchdynamo`: None
475
+ - `ray_scope`: last
476
+ - `ddp_timeout`: 1800
477
+ - `torch_compile`: False
478
+ - `torch_compile_backend`: None
479
+ - `torch_compile_mode`: None
480
+ - `dispatch_batches`: None
481
+ - `split_batches`: None
482
+ - `include_tokens_per_second`: False
483
+ - `include_num_input_tokens_seen`: False
484
+ - `neftune_noise_alpha`: None
485
+ - `optim_target_modules`: None
486
+ - `batch_eval_metrics`: False
487
+ - `eval_on_start`: False
488
+ - `eval_use_gather_object`: False
489
+ - `prompts`: None
490
+ - `batch_sampler`: batch_sampler
491
+ - `multi_dataset_batch_sampler`: proportional
492
+
493
+ </details>
494
+
495
+ ### Training Logs
496
+ | Epoch | Step | Training Loss | cellxgene pseudo bulk 35k multiplets natural language annotation loss | geo 70k multiplets natural language annotation loss | cosine_accuracy |
497
+ |:------:|:----:|:-------------:|:---------------------------------------------------------------------:|:---------------------------------------------------:|:---------------:|
498
+ | 0.1351 | 100 | - | 19.5545 | 19.6050 | 0.5656 |
499
+ | 0.2703 | 200 | 17.2819 | 19.4888 | 17.2415 | 0.7261 |
500
+ | 0.4054 | 300 | - | 17.2527 | 14.3099 | 0.7684 |
501
+ | 0.5405 | 400 | 13.4122 | 13.1462 | 13.4371 | 0.7976 |
502
+ | 0.6757 | 500 | - | 12.6305 | 9.3601 | 0.8474 |
503
+ | 0.8108 | 600 | 8.3246 | 11.1233 | 7.6021 | 0.8787 |
504
+ | 0.9459 | 700 | - | 8.5871 | 7.6461 | 0.8980 |
505
+ | 1.0811 | 800 | 6.1203 | 7.0774 | 7.1605 | 0.9046 |
506
+ | 1.2162 | 900 | - | 6.0461 | 6.7694 | 0.9076 |
507
+ | 1.3514 | 1000 | 5.1622 | 6.1759 | 6.0741 | 0.9166 |
508
+ | 1.4865 | 1100 | - | 6.6497 | 5.3305 | 0.9269 |
509
+ | 1.6216 | 1200 | 4.7346 | 7.6330 | 4.9083 | 0.9324 |
510
+ | 1.7568 | 1300 | - | 6.5700 | 4.8609 | 0.9349 |
511
+ | 1.8919 | 1400 | 4.4577 | 6.9249 | 4.6155 | 0.9401 |
512
+ | 2.0270 | 1500 | - | 5.4120 | 5.0721 | 0.9367 |
513
+ | 2.1622 | 1600 | 4.2281 | 6.3842 | 4.6481 | 0.9407 |
514
+ | 2.2973 | 1700 | - | 5.6970 | 4.9588 | 0.9370 |
515
+ | 2.4324 | 1800 | 4.2392 | 6.3306 | 4.6888 | 0.9407 |
516
+ | 2.5676 | 1900 | - | 5.3909 | 5.0415 | 0.9364 |
517
+ | 2.7027 | 2000 | 4.2237 | 6.0779 | 4.7476 | 0.9394 |
518
+ | 2.8378 | 2100 | - | 5.3631 | 5.0280 | 0.9379 |
519
+ | 2.9730 | 2200 | 4.2215 | 5.5800 | 4.9418 | 0.9373 |
520
+ | 3.1081 | 2300 | - | 6.3898 | 4.6718 | 0.9400 |
521
+ | 3.2432 | 2400 | 4.1984 | 4.7118 | 5.4301 | 0.9313 |
522
+ | 3.3784 | 2500 | - | 7.2266 | 4.5063 | 0.9419 |
523
+ | 3.5135 | 2600 | 4.2538 | 8.1464 | 4.4121 | 0.9426 |
524
+ | 3.6486 | 2700 | - | 6.5866 | 4.6253 | 0.9409 |
525
+ | 3.7838 | 2800 | 4.2186 | 5.8797 | 4.8671 | 0.9380 |
526
+ | 3.9189 | 2900 | - | 5.5591 | 4.9559 | 0.9377 |
527
+ | 4.0541 | 3000 | 4.2064 | 6.3420 | 4.7167 | 0.9413 |
528
+ | 4.1892 | 3100 | - | 5.9561 | 4.8190 | 0.9387 |
529
+ | 4.3243 | 3200 | 4.2248 | 6.3844 | 4.6827 | 0.9410 |
530
+ | 4.4595 | 3300 | - | 7.1522 | 4.5193 | 0.9421 |
531
+ | 4.5946 | 3400 | 4.2263 | 4.8333 | 5.3410 | 0.9331 |
532
+ | 4.7297 | 3500 | - | 4.5820 | 5.5334 | 0.9306 |
533
+ | 4.8649 | 3600 | 4.2472 | 6.8254 | 4.5512 | 0.9413 |
534
+ | 5.0000 | 3700 | - | 6.4904 | 4.6993 | 0.9399 |
535
+ | 5.1351 | 3800 | 4.1936 | 4.8578 | 5.3678 | 0.9344 |
536
+ | 5.2703 | 3900 | - | 6.4530 | 4.6426 | 0.9413 |
537
+ | 5.4054 | 4000 | 4.2345 | 6.6050 | 4.6684 | 0.9409 |
538
+ | 5.5405 | 4100 | - | 4.8690 | 5.3172 | 0.9334 |
539
+ | 5.6757 | 4200 | 4.2406 | 6.2903 | 4.7100 | 0.9404 |
540
+ | 5.8108 | 4300 | - | 6.6273 | 4.6269 | 0.9419 |
541
+ | 5.9459 | 4400 | 4.2227 | 5.4572 | 5.0365 | 0.9370 |
542
+ | 6.0811 | 4500 | - | 5.0242 | 5.2568 | 0.9341 |
543
+ | 6.2162 | 4600 | 4.1997 | 4.7279 | 5.5242 | 0.9316 |
544
+ | 6.3514 | 4700 | - | 5.1846 | 5.2246 | 0.9339 |
545
+ | 6.4865 | 4800 | 4.2361 | 5.8601 | 4.8249 | 0.9381 |
546
+ | 6.6216 | 4900 | - | 6.9398 | 4.5848 | 0.9423 |
547
+ | 6.7568 | 5000 | 4.2273 | 6.2977 | 4.6921 | 0.9406 |
548
+ | 6.8919 | 5100 | - | 6.9737 | 4.5439 | 0.9421 |
549
+ | 7.0270 | 5200 | 4.2052 | 5.3900 | 5.0873 | 0.9370 |
550
+ | 7.1622 | 5300 | - | 6.3929 | 4.6474 | 0.9406 |
551
+ | 7.2973 | 5400 | 4.2416 | 5.6994 | 4.9590 | 0.9371 |
552
+ | 7.4324 | 5500 | - | 6.3184 | 4.6922 | 0.9407 |
553
+ | 7.5676 | 5600 | 4.2311 | 5.3932 | 5.0403 | 0.9363 |
554
+ | 7.7027 | 5700 | - | 6.0781 | 4.7480 | 0.9394 |
555
+ | 7.8378 | 5800 | 4.2290 | 5.3664 | 5.0291 | 0.9380 |
556
+ | 7.9730 | 5900 | - | 5.5803 | 4.9391 | 0.9371 |
557
+
558
+
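The card does not define the `cosine_accuracy` column. Assuming it is the usual triplet accuracy (the fraction of evaluation triplets whose anchor is closer, by cosine similarity, to its positive than to its negative), it could be computed as in the sketch below. The function names and toy vectors are illustrative assumptions.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cos_sim(anchor, positive) > cos_sim(anchor, negative)
    )
    return correct / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]),  # positive is closer: counted as correct
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),  # negative is closer: counted as wrong
]
print(cosine_accuracy(toy_triplets))
```

Read this way, the final logged value of 0.9371 would mean that about 94% of evaluation triplets are ordered correctly.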
559
+ ### Framework Versions
560
+ - Python: 3.10.10
561
+ - Sentence Transformers: 3.5.0.dev0
562
+ - Transformers: 4.43.4
563
+ - PyTorch: 2.6.0+cu124
564
+ - Accelerate: 0.33.0
565
+ - Datasets: 2.14.4
566
+ - Tokenizers: 0.19.1
567
+
568
+ ## Citation
569
+
570
+ ### BibTeX
571
+
572
+ #### Sentence Transformers
573
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
584
+
585
+ #### MultipleNegativesRankingLoss
586
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
596
+
597
+ <!--
598
+ ## Glossary
599
+
600
+ *Clearly define terms in order to be accessible across audiences.*
601
+ -->
602
+
603
+ <!--
604
+ ## Model Card Authors
605
+
606
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
607
+ -->
608
+
609
+ <!--
610
+ ## Model Card Contact
611
+
612
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
613
+ -->
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.5.0.dev0",
+     "transformers": "4.43.4",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
modules.json ADDED
@@ -0,0 +1,8 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "0_MMContextEncoder",
+     "type": "mmcontext.models.MMContextEncoder.MMContextEncoder"
+   }
+ ]