Update README.md
README.md
CHANGED
@@ -23,7 +23,7 @@ Our findings show that our DNA model clusters species effectively in the embeddi
 Enter a DNA sequence and the coordinates where you sampled it. (We can easily extend this tool to handle multiple DNA sequences with CSV upload.) You can choose between two methods to predict the most probable genus. 'Cosine' will calculate the cosine similarity between the embeddings of your unidentified eDNA sequence and existing labelled sequences to determine the most probable genuses; this method is not aware of environmental data. 'fine_tuned_model' will output the predictions of a model trained on DNA embeddings and ecological layer data to predict the most probable genuses. A plot of the most probable genuses is shown.
 
 ### DNA Embedding Space Visualization
-Prehaps we have a DNA sequence for which the highest genus probability is very low (this could be because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, doesn't contain any examples), we can still examine the DNA embedding of the sequence in relation to known samples. The t-SNE plot shows the embedding space of the top k most common species in the dataset. Use the slider to choose k. We can see clear group distinctions between species.
+Prehaps we have a DNA sequence for which the highest genus probability is very low (this could be because scientists have not managed to directly sample any specimens of the genus, so our training dataset, BOLD, doesn't contain any examples), we can still examine the DNA embedding of the sequence in relation to known samples. The t-SNE plot on the left shows the embedding space of the top k most common species in the dataset. Use the slider to choose k. We can see clear group distinctions between species. The t-sne plot on the right shows the DNA embedding spaces of the k most likely genera for the DNA sequence you provided compared to your DNA sequence's embedding.
 
 ## BarcodeBERT DNA Embeddings.
 The model we used to train the DNA embeddings is [BarcodeBERT](https://github.com/Kari-Genomics-Lab/BarcodeBERT). We trained the model on the 'nucraw' column of DNA sequences from the latest release of the [BOLD Database](http://www.boldsystems.org/index.php/datapackages/Latest). We followed the preprocessing steps outlined by the [BarcodeBERT paper](https://arxiv.org/pdf/2311.02401).
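
The 'Cosine' method described in the README text above can be illustrated with a small sketch. This is not the Space's actual code: the function names `cosine_similarity` and `rank_genera_by_cosine`, the toy three-dimensional embeddings, and the choice to score each genus by its single most similar labelled sequence are all assumptions made for illustration. Real BarcodeBERT embeddings would be much higher-dimensional.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_genera_by_cosine(query_embedding, labelled_embeddings, top_k=3):
    """Rank genera by similarity to the query embedding.

    labelled_embeddings is a list of (genus, embedding) pairs.
    Assumption: a genus is scored by its best-matching labelled sequence.
    """
    best = {}
    for genus, embedding in labelled_embeddings:
        sim = cosine_similarity(query_embedding, embedding)
        if genus not in best or sim > best[genus]:
            best[genus] = sim
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy reference set (hypothetical values, not taken from BOLD).
reference = [
    ("Salmo", [0.9, 0.1, 0.0]),
    ("Salmo", [0.8, 0.2, 0.1]),
    ("Esox", [0.1, 0.9, 0.2]),
    ("Perca", [0.0, 0.2, 0.9]),
]
query = [0.85, 0.15, 0.05]

ranking = rank_genera_by_cosine(query, reference, top_k=2)
for genus, score in ranking:
    print(f"{genus}: {score:.3f}")
```

As the README notes, a ranking like this uses only the sequence embeddings and ignores the sampling coordinates, which is the gap the 'fine_tuned_model' option is meant to cover.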