---
license: mit
tags:
- chemistry
- molecule
- drug
---

# Model Card for Roberta Zinc 480m

### Model Description

`roberta_zinc_480m` is a ~102m parameter Roberta-style masked language model trained on ~480m SMILES strings from the [ZINC database](https://zinc.docking.org/). The model is useful for generating embeddings from SMILES strings.

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

Usage examples are shown below. Note that input SMILES strings should be canonicalized (a canonicalization sketch is included at the end of this card).

With the Transformers library:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")
# the model was not trained with a pooler, so the pooling layer is disabled
roberta_zinc = AutoModel.from_pretrained("entropy/roberta_zinc_480m", add_pooling_layer=False)

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

batch = tokenizer(smiles, return_tensors='pt', padding=True, pad_to_multiple_of=8)

outputs = roberta_zinc(**batch, output_hidden_states=True)
full_embeddings = outputs.hidden_states[-1]  # final-layer token embeddings

# mean pooling over non-padding tokens
mask = batch['attention_mask']
embeddings = (full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1)
```

With Sentence Transformers:

```python
from sentence_transformers import models, SentenceTransformer

transformer = models.Transformer(
    "entropy/roberta_zinc_480m",
    max_seq_length=256,
    model_args={"add_pooling_layer": False},
)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[transformer, pooling])

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = model.encode(smiles, convert_to_tensor=True)
```

### Training Procedure

#### Preprocessing

~480m SMILES strings were randomly sampled from the [ZINC database](https://zinc.docking.org/), weighted by tranche size (i.e., more SMILES were sampled from larger tranches). The SMILES were canonicalized and then used to train the tokenizer.

#### Training Hyperparameters

The model was trained with cross-entropy loss for 150,000 iterations with a batch size of 4096, reaching a validation loss of ~0.122.

### Downstream Models

#### Decoder

There is a [decoder model](https://huggingface.co/entropy/roberta_zinc_decoder) trained to reconstruct input SMILES from embeddings generated with this model.

#### Compression Encoder

There is a [compression encoder model](https://huggingface.co/entropy/roberta_zinc_compression_encoder) trained to compress embeddings generated by this model from their native size of 768 to smaller sizes (512, 256, 128, 64, 32) while preserving cosine similarity between embeddings.

#### Decomposer

There is an [embedding decomposer model](https://huggingface.co/entropy/roberta_zinc_enamine_decomposer) trained to "decompose" a roberta-zinc embedding into two building block embeddings from the Enamine library.

**BibTeX:**

```bibtex
@misc{heyer2023roberta,
  title={Roberta-zinc-480m},
  author={Heyer, Karl},
  year={2023}
}
```

**APA:**

Heyer, K. (2023). Roberta-zinc-480m.

## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz
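
### Canonicalizing SMILES

The usage notes above say that input SMILES should be canonicalized, but the card does not specify how. The sketch below shows one common way to do this with RDKit; RDKit is an assumption here (it is not referenced by this card), and the exact canonicalization used to prepare the training data may differ.

```python
# Minimal sketch: canonicalize SMILES with RDKit before embedding
# (assumed tooling, not specified by this model card).
from rdkit import Chem

def canonicalize(smiles_list):
    canonical = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # returns None for unparsable SMILES
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonical SMILES by default
    return canonical

print(canonicalize(["C1=CC=CC=C1", "OCC"]))  # e.g. ['c1ccccc1', 'CCO']
```

Canonicalizing first ensures that different textual forms of the same molecule map to the same token sequence, and therefore to the same embedding.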