Training dataset from article

#3
by jskp - opened

https://www.bioptimus.com/news/bioptimus-launches-h-optimus-1

Hi, this article references a "TCGA-CRC" dataset retrieved from https://portal.gdc.cancer.gov that was used to train H-optimus-1 for the slide-level tasks KRAS-CRC and BRAF-CRC. I cannot find any mention of such a dataset on the portal; I can only find TCGA-COAD and TCGA-READ. Exactly which dataset and split were used?

Thanks!
Peter

Bioptimus org

Hi Peter,

TCGA-CRC refers to the combination of the TCGA-COAD and TCGA-READ cohorts. We will clarify this in the article; thank you for your feedback.
We used all slides from these cohorts available at https://portal.gdc.cancer.gov.
Labels for BRAF and KRAS mutations were taken from cBioPortal.
For MSI, labels were obtained from Table S1 of "Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas", Liu et al., 2018.
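
To make the cohort definition concrete, here is a minimal sketch of how slides from the two projects could be combined and joined with per-patient mutation labels. It is an illustration under stated assumptions, not the pipeline actually used: the function names, slide barcodes, and label values are hypothetical, and dropping slides without a label is an assumption rather than something stated above. The only fact relied on is that a TCGA barcode begins with a 12-character patient identifier (`TCGA-XX-XXXX`).

```python
def patient_id(slide_id: str) -> str:
    """Return the 12-character patient prefix (TCGA-XX-XXXX) of a TCGA barcode."""
    return slide_id[:12]


def build_crc_table(coad_slides, read_slides, mutation_labels):
    """Pool TCGA-COAD and TCGA-READ slides into one "TCGA-CRC" table.

    mutation_labels maps a patient ID to a dict such as {"KRAS": 1, "BRAF": 0}.
    Slides whose patient has no label are skipped (an assumption of this sketch).
    """
    rows = []
    for project, slides in (("TCGA-COAD", coad_slides), ("TCGA-READ", read_slides)):
        for slide_id in slides:
            labels = mutation_labels.get(patient_id(slide_id))
            if labels is None:
                continue  # no mutation label available for this patient
            rows.append({"slide_id": slide_id, "project": project, **labels})
    return rows


# Made-up barcodes and labels, for illustration only.
coad = ["TCGA-AA-0001-01Z-00-DX1", "TCGA-AA-0002-01Z-00-DX1"]
read = ["TCGA-AG-0003-01Z-00-DX1"]
labels = {
    "TCGA-AA-0001": {"KRAS": 1, "BRAF": 0},
    "TCGA-AG-0003": {"KRAS": 0, "BRAF": 1},
}

crc_table = build_crc_table(coad, read, labels)
for row in crc_table:
    print(row)
```

Keeping the `project` column makes it easy to check afterwards how many slides each cohort contributes to the combined set.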

Let me know if anything is unclear,
Thanks,
Charlie

Thank you for your response. Speaking of the article, do you have a preferred way to cite it?

And sorry if this isn't the best place to ask, but the article states that you used only the CLS token embedding for all FMs. Is this also true for your usage of Virchow2? Its authors suggest concatenating the CLS token with the average pool of the patch tokens.

Bioptimus org

You can cite the model as Bioptimus, H-optimus-1. https://huggingface.co/bioptimus/H-optimus-1 (2025), thanks!
Indeed, we used only the [CLS] token for all embedders, including Virchow2. I would say this is still the standard way of comparing pathology foundation model embeddings today. Early on, we did try concatenating the [CLS] token with the average of the patch tokens and did not notice a substantial difference in performance for any model, except on HEST, where all models benefited from the concatenation. This, however, did not change the performance trends. Further analysis should be performed to confirm this initial observation.
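
For readers who want to compare the two pooling strategies discussed here, the following is a small dependency-free sketch. It assumes the encoder returns one sequence of token vectors per tile, with the [CLS] token first, optionally followed by register tokens, then the patch tokens. The function names and the `num_register_tokens` parameter are hypothetical, and the right number of tokens to skip is model-specific and should be checked against each model card. In practice this would be done with tensor operations rather than Python lists.

```python
from statistics import fmean


def cls_embedding(tokens):
    """Tile embedding made of the [CLS] token only (first token in the sequence)."""
    return list(tokens[0])


def cls_plus_mean_patch(tokens, num_register_tokens=0):
    """Concatenate the [CLS] token with the mean of the patch tokens.

    tokens: list of equal-length vectors ordered as
            [CLS, register tokens..., patch tokens...].
    num_register_tokens: how many tokens after [CLS] to skip (model-specific).
    """
    cls = list(tokens[0])
    patches = tokens[1 + num_register_tokens:]
    if not patches:
        raise ValueError("no patch tokens left after skipping CLS/register tokens")
    # Average each embedding dimension across all patch tokens.
    mean_patch = [fmean(dim) for dim in zip(*patches)]
    return cls + mean_patch


# Toy sequence: 1 CLS token, 1 register token, 2 patch tokens, dimension 2.
tokens = [
    [1.0, 2.0],  # CLS
    [9.0, 9.0],  # register token (skipped)
    [0.0, 4.0],  # patch token
    [2.0, 0.0],  # patch token
]

print(cls_embedding(tokens))                               # [1.0, 2.0]
print(cls_plus_mean_patch(tokens, num_register_tokens=1))  # [1.0, 2.0, 1.0, 2.0]
```

Note that the concatenated embedding is twice as wide as the [CLS]-only one, so any downstream linear probe or slide-level aggregator needs its input dimension adjusted accordingly.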
