`tokenize_cell` function in tokenizer.py

#133
by janelynn - opened

Hello, thanks for this great work!

I wonder why the `gene_tokens` variable is 2-dimensional. Isn't `gene_tokens` obtained from the pickle file `token_dictionary_file` (a mapping of Ensembl IDs to tokens)?

Why does the last line of the `tokenize_cell` function read `gene_tokens[nonzero_mask][sorted_indices]`?

Geneformer/geneformer/tokenizer.py

import numpy as np

def tokenize_cell(gene_vector, gene_tokens):
    """
    Convert normalized gene expression vector to tokenized rank value encoding.
    """
    # create array of gene vector with token indices
    # mask undetected genes
    nonzero_mask = np.nonzero(gene_vector)[0]
    # sort by median-scaled gene values
    sorted_indices = np.argsort(-gene_vector[nonzero_mask])
    # tokenize
    sentence_tokens = gene_tokens[nonzero_mask][sorted_indices]
    return sentence_tokens
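For illustration, here is a minimal toy run (with made-up expression values and token IDs, not real Ensembl tokens) showing that both inputs and the output are 1-dimensional arrays:

import numpy as np

gene_vector = np.array([0.0, 0.5, 0.0, 4.0, 2.0])  # hypothetical normalized expression for one cell
gene_tokens = np.array([10, 11, 12, 13, 14])       # hypothetical token IDs, aligned by gene position

sentence = tokenize_cell(gene_vector, gene_tokens)
print(sentence)        # [13 14 11] -- tokens of detected genes, highest-ranked first
print(sentence.shape)  # (3,) -- a 1-D vector, one entry per detected gene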

Thank you for your interest in Geneformer! The rank value encoding is a 1-dimensional vector of the genes ranked by their expression in the given single cell, after normalization by each gene's expression across the ~30M cells in Genecorpus-30M to prioritize genes that uniquely distinguish cell state. Please refer to the manuscript Methods for additional discussion.

The nonzero mask serves to encode only genes that are detected in the specific given cell, thereby creating an efficient dense representation. The model also learns from the absence of particular genes to understand the cell's state, similarly to how the absence of certain words in a sentence is informative (e.g. the absence of negative words in a movie review would lead an NLP model to interpret the review as having a more positive sentiment). Adding genes to the encoding, for example with in silico overexpression, changes the model's interpretation of the cell state and is the basis of our in silico reprogramming and differentiation analyses (e.g. Extended Data Fig. 2c,e in the manuscript).
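As a purely illustrative sketch of that normalization (made-up numbers, not the actual corpus medians): each gene's count in the cell is divided by that gene's nonzero median across the corpus, and the genes are then ranked by the scaled values.

import numpy as np

cell_counts    = np.array([0.0, 5.0, 0.0, 2.0, 8.0])   # hypothetical raw counts for one cell
corpus_medians = np.array([1.0, 10.0, 1.0, 0.5, 4.0])  # hypothetical nonzero medians across the corpus

scaled = cell_counts / corpus_medians  # genes highly expressed relative to their norm rise in rank
rank_order = np.argsort(-scaled)       # descending; zeros (undetected genes) fall to the end
print(rank_order)                      # [3 4 1 0 2] -- gene 3 leads despite a lower raw count than gene 4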

Including the zero genes would also be problematic because they inherently have the same rank, so including them in the encoding in an ordered fashion would be meaningless; it would also waste a great deal of computation on zeros, because gene expression matrices are very sparse in their raw form. It would be like appending to every sentence all the words in the English language that it does not use, which would vastly and unnecessarily increase the amount of computation for each sentence (analogous to each cell in this case).

Regarding your question about the number of dimensions, one thing that can be helpful when trying to understand the code is adding print statements at various points and running it on a small dataset so that you can inspect the dimensions.
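As an illustrative sketch, temporary print statements inside the function make the shapes explicit:

import numpy as np

def tokenize_cell(gene_vector, gene_tokens):
    print("gene_vector:", gene_vector.shape)          # one cell, e.g. (N,)
    nonzero_mask = np.nonzero(gene_vector)[0]
    print("detected genes:", nonzero_mask.shape)      # (n_detected,)
    sorted_indices = np.argsort(-gene_vector[nonzero_mask])
    sentence_tokens = gene_tokens[nonzero_mask][sorted_indices]
    print("sentence_tokens:", sentence_tokens.shape)  # also (n_detected,) -- 1-D
    return sentence_tokens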

ctheodoris changed discussion status to closed

Thank you for your quick reply!
Your suggestion makes sense. Would it be convenient for you to provide a sample .loom file so that I can work out the dimension issue?

You can use any .loom file as long as it has the attributes described in the example for transcriptome tokenization. The Human Cell Atlas has .loom files you can use as a test case, for example:
https://data.humancellatlas.org/explore/projects/c4077b3c-5c98-4d26-a614-246d12c2e5d7/project-matrices

Please check the loompy API documentation for more information about how to manipulate .loom files to add the n_counts column attribute if you are unfamiliar with this file format.
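For instance, a minimal sketch (the file path is a placeholder for your local file) that computes per-cell total counts and stores them as the n_counts column attribute:

import loompy
import numpy as np

with loompy.connect("tissue-stability-human-spleen-10XV2.loom") as ds:
    # cells are columns in .loom files; map np.sum over columns
    # to get total counts per cell without loading the full matrix
    ds.ca["n_counts"] = ds.map([np.sum], axis=1)[0]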

Just to be clear though, each cell is a 1-dimensional vector, not 2-dimensional as you suggested.

Thanks for your patience. I used the dataset you recommended: tissue-stability-human-spleen-10XV2.loom. Here's my code:

[screenshots of the tokenization code: 图片1.png, 图片2.png]

The generated .arrow file is only 464 bytes. What is the reason?
After I used git pull to sync to the latest version, I removed the {"cell_names": "cell_names", "n_fragments": "n_fragments"} parameter and got the following error:

[screenshot of the error message: 图片3.png]

Could you tell me what went wrong?

Thank you for your question! I do not encounter that error when running without a custom_attr_name_dict argument. Please diff your local script against the current one in this repository to ensure you have the current version, and also make sure you are pointing to the right version rather than an outdated copy elsewhere in your directories.

Thanks for guiding me through the update. It can now run without the custom_attr_name_dict parameter, but a new problem has arisen. Here is the error message:
[screenshot of the error message: Snipaste_2023-07-24_23-00-01.png]

The first argument of tokenize_data should be the directory containing the .loom file, not the file itself.
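For example, a minimal sketch of the call (directory and prefix names are placeholders):

from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer(nproc=4)  # no custom_attr_name_dict needed
# first argument: the directory containing the .loom file(s), not the file itself
tk.tokenize_data("loom_data_dir/", "tokenized_output_dir/", "output_prefix")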
