I would like to confirm some information in the paper about genome annotation in the embeddings
Hello. I am going to explore GROVER in my graduation thesis and I just wanted to make sure I understand the part where the genome annotation was added to the embeddings. I would like to explain this part in the document.
From what I gathered, after pretraining, the tokens are annotated with features such as GC content, strand info, repeated elements and gene coordinates.
I'm not sure I understood exactly how the annotations are added/appended to the embeddings. Are they also transformed into numerical vectors? And then appended after each word's embedding or after the whole sequence?
Is that about it? I found this most interesting, thank you so much.
Hello,
the annotations are not included in the input of the model. The model is agnostic to any genomic element during training and evaluation.
We extract the embeddings of the whole genome, and the annotations are used only for descriptive purposes. For example, Fig. 5d shows the UMAP of all the embeddings, and in each panel we color only the tokens with the respective annotation.
I hope the explanation is clear, if not, do not hesitate to let us know.
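To illustrate the idea of coloring only the tokens that carry a given annotation, here is a minimal sketch in plain Python. It is not the code used for the paper. It assumes each token has half-open genomic coordinates (start, end) and that an annotation (for example, repeats) is given as a list of half-open intervals. The function names annotation_mask and split_points, and the toy coordinates, are hypothetical.

```python
from bisect import bisect_right


def annotation_mask(token_spans, annotation_intervals):
    """Return one boolean per token: True if it overlaps any annotated interval.

    Both arguments are lists of half-open (start, end) coordinates.
    """
    # Merge overlapping or adjacent intervals so they are disjoint and sorted.
    merged = []
    for start, end in sorted(annotation_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [s for s, _ in merged]

    mask = []
    for tok_start, tok_end in token_spans:
        # Last merged interval that starts before the token ends.
        i = bisect_right(starts, tok_end - 1) - 1
        mask.append(i >= 0 and merged[i][1] > tok_start)
    return mask


def split_points(coords, mask):
    """Split 2-D points (e.g. UMAP coordinates) into highlighted and background."""
    highlighted = [p for p, m in zip(coords, mask) if m]
    background = [p for p, m in zip(coords, mask) if not m]
    return highlighted, background


# Toy data (assumed, not from the paper): four tokens and one annotation track.
token_spans = [(0, 6), (6, 10), (10, 16), (16, 20)]
repeats = [(4, 8), (18, 30)]
coords = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (0.5, 0.5)]

mask = annotation_mask(token_spans, repeats)
highlighted, background = split_points(coords, mask)
```

With the two lists in hand, one panel per annotation can be drawn by plotting the background points in grey and the highlighted points in color. The annotations never enter the model; they are only used to select which points to color.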
Hello, I hope you're doing well. I'm including this in my graduate thesis as well, and I was wondering whether there is a way to generate the UMAP of all token embeddings in each sequence, colored by the respective annotation. Also, is there an ideal way to present the generated sequence in FASTA format with the percentage of each nucleotide? Thanks.
Hello, thank you for your interest in our work.
I am not sure I understood your question. On Zenodo (https://zenodo.org/records/13374192/files/GROVER_R_revised.html), you can find the data for the UMAP and the annotations. Regarding the FASTA question, which percentage do you mean?
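If the percentages meant are the per-sequence nucleotide composition, a minimal sketch could look like the following. This is only one reading of the question. The helper names parse_fasta and nucleotide_percentages and the example records are hypothetical, and the code uses only the Python standard library.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def nucleotide_percentages(seq):
    """Return the percentage of each symbol that occurs in the sequence."""
    total = len(seq)
    if total == 0:
        return {}
    counts = Counter(seq)
    return {base: 100.0 * counts[base] / total for base in sorted(counts)}


# Made-up example records, for illustration only.
fasta_text = ">seq1 demo\nACGT\nACGT\n>seq2\nGGCC\n"
sequences = parse_fasta(fasta_text)
composition = {n: nucleotide_percentages(s) for n, s in sequences.items()}

for name, percentages in composition.items():
    summary = " ".join(f"{b}={p:.1f}%" for b, p in percentages.items())
    print(f">{name} {summary}")
    print(sequences[name])
```

The loop at the end writes the composition into the FASTA header line, which keeps the output a valid FASTA file. Ambiguity codes such as N are counted as their own symbol rather than being dropped.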