# Model Card for Roberta Zinc 480m

## Model Description
`roberta_zinc_480m` is a ~102m parameter RoBERTa-style masked language model trained on ~480m SMILES strings from the ZINC database. This model is useful for generating embeddings from SMILES strings.
- Developed by: Karl Heyer
- License: MIT
## Direct Use

Usage examples are shown below. Note that input SMILES strings should be canonicalized (a canonicalization sketch follows the examples).
With the Transformers library:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")
roberta_zinc = AutoModel.from_pretrained("entropy/roberta_zinc_480m",
                                         add_pooling_layer=False)  # model was not trained with a pooler

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

batch = tokenizer(smiles, return_tensors='pt', padding=True, pad_to_multiple_of=8)

outputs = roberta_zinc(**batch, output_hidden_states=True)
full_embeddings = outputs[1][-1]  # hidden states of the final layer
mask = batch['attention_mask']

# mean pooling over non-padding tokens
embeddings = (full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1)
```
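The pooled embeddings can be compared directly. The snippet below (not from the original card, and reusing `embeddings` from the example above) computes pairwise cosine similarities with PyTorch:

```python
import torch.nn.functional as F

# pairwise cosine similarity matrix between the five pooled embeddings
sims = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(sims.shape)  # torch.Size([5, 5])
```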
With Sentence Transformers:
```python
from sentence_transformers import models, SentenceTransformer

transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})
pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")
model = SentenceTransformer(modules=[transformer, pooling])

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = model.encode(smiles, convert_to_tensor=True)
```
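Both examples expect canonical SMILES. The card does not state which canonicalization was used during training, but RDKit's canonical SMILES is a common choice; a minimal sketch (assumes RDKit is installed, which is not a stated dependency):

```python
from rdkit import Chem

def canonicalize(smiles):
    # returns the RDKit canonical SMILES, or None if the string fails to parse
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

print(canonicalize("C1=CC=CC=C1Br"))
```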
## Training Procedure

### Preprocessing
~480m SMILES strings were randomly sampled from the ZINC database, weighted by tranche size (i.e., more SMILES were sampled from larger tranches). The SMILES were canonicalized and then used to train the tokenizer.
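To illustrate the tranche-weighted sampling described above, here is a small sketch with hypothetical tranche names and sizes (not the actual preprocessing code):

```python
import random

# hypothetical tranche -> number of SMILES strings in that tranche
tranche_sizes = {"AAAA": 1_200_000, "BBAB": 450_000, "CCAB": 9_800_000}

tranches = list(tranche_sizes)
weights = [tranche_sizes[t] for t in tranches]

# tranches are drawn in proportion to their size, so larger tranches
# contribute more SMILES to the sample
sampled_tranches = random.choices(tranches, weights=weights, k=10)
print(sampled_tranches)
```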
### Training Hyperparameters
The model was trained with a cross-entropy loss for 150,000 iterations with a batch size of 4096, reaching a validation loss of ~0.122.
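For context, masked-language-model training of this kind can be set up with the Transformers `Trainer`. The sketch below is not the original training script: apart from the step count, all values are placeholders, and the tiny dataset merely stands in for the real training data.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          RobertaConfig, RobertaForMaskedLM, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("entropy/roberta_zinc_480m")

# toy dataset standing in for the ~480m canonical SMILES used in training
smiles = ["Brc1cc2c(NCc3ccccc3)ncnc2s1", "Brc1cc2c(NCc3ccccn3)ncnc2s1"]
train_ds = Dataset.from_dict({"smiles": smiles}).map(
    lambda x: tokenizer(x["smiles"], truncation=True, max_length=256),
    remove_columns=["smiles"],
)

# placeholder architecture; the card only states ~102m parameters
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

# standard masked-language-modeling objective trained with cross-entropy
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta_zinc_480m_mlm",
    max_steps=150_000,                 # stated in the card; reduce for a smoke test
    per_device_train_batch_size=512,   # placeholder; the stated effective batch size is 4096
    gradient_accumulation_steps=8,     # placeholder
    learning_rate=1e-4,                # placeholder
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_ds)
trainer.train()
```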
## Downstream Models

### Decoder
There is a decoder model trained to reconstruct inputs from embeddings generated with this model.
### Compression Encoder
There is a compression encoder model trained to compress embeddings generated by this model from their native size of 768 to smaller sizes (512, 256, 128, 64, 32) while preserving cosine similarity between embeddings.
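The idea can be illustrated with a small projection head trained so that pairwise cosine similarities are preserved after compression. This is a conceptual sketch only, not the actual compression encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressionHead(nn.Module):
    # conceptual: project 768-d embeddings down to a smaller size
    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, in_dim), nn.GELU(), nn.Linear(in_dim, out_dim))

    def forward(self, x):
        return self.proj(x)

def cosine_preservation_loss(original, compressed):
    # penalize differences between the pairwise cosine-similarity matrices
    sim_orig = F.cosine_similarity(original.unsqueeze(1), original.unsqueeze(0), dim=-1)
    sim_comp = F.cosine_similarity(compressed.unsqueeze(1), compressed.unsqueeze(0), dim=-1)
    return F.mse_loss(sim_comp, sim_orig)

# toy usage with random tensors standing in for roberta_zinc embeddings
emb = torch.randn(16, 768)
head = CompressionHead(out_dim=64)
loss = cosine_preservation_loss(emb, head(emb))
loss.backward()
```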
### Decomposer
There is an embedding decomposer model trained to "decompose" a roberta_zinc embedding into two building-block embeddings from the Enamine library.
BibTeX:

```bibtex
@misc{heyer2023roberta,
  title={Roberta-zinc-480m},
  author={Heyer, Karl},
  year={2023}
}
```

APA:

Heyer, K. (2023). Roberta-zinc-480m.
## Model Card Authors

Karl Heyer

## Model Card Contact