# Model Card for Roberta Zinc Enamine Decomposer

## Model Description
`roberta_zinc_enamine_decomposer` is trained to "decompose" a molecule SMILES embedding into two "building block embeddings" representing the Enamine building blocks expected to assemble into the input molecule.

The model is trained to convert embeddings from the `roberta_zinc_480m` model, or compressed embeddings from the `roberta_zinc_compression_encoder` model. The decomposer can map from any input size (32, 64, 128, 256, 512, 768) to any output size (same values). For an input of shape `(batch_size, d_in)`, the output has shape `(batch_size, 2, d_out)` (two building block embeddings per input).
- Developed by: Karl Heyer
- License: MIT
## Direct Use
Usage examples. Note that input SMILES strings should be canonicalized.
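If your SMILES are not already canonical, a common approach is RDKit canonicalization. The sketch below is illustrative only: RDKit is assumed to be installed (it is not a stated dependency of this model), and the exact canonicalization procedure used during training is not specified in this card.

```python
# Illustrative SMILES canonicalization with RDKit (assumed installed).
from rdkit import Chem

def canonicalize(smi: str) -> str:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smi}")
    return Chem.MolToSmiles(mol)  # canonical SMILES by default

print(canonicalize("c1ccccc1OC"))
# COc1ccccc1
```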
```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel
import torch

# build the roberta_zinc_480m sentence embedding model (mean pooling)
transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})
pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")
roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

# load the decomposer
decomposer = AutoModel.from_pretrained("entropy/roberta_zinc_enamine_decomposer",
                                       trust_remote_code=True)

# smiles should be canonicalized
smiles = [
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C(C)c1ccon1",
    "C#Cc1cc(C(F)(F)F)ccc1Nc1ccc(OC)c(S(=O)(=O)Cl)c1",
    "COc1ccc(NC(=O)c2ccccc2Nc2ccc(OC)c(S(=O)(=O)Cl)c2)c(OC)c1",
    "COc1ccc(OC(=O)c2noc3c2COCC3)cc1S(=O)(=O)Cl",
    "COc1ccc(N2CCC(C(=O)N3CCCc4ccccc43)CC2)cc1S(=O)(=O)Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C1(C)CCCNS1(=O)=O",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cc(-n2c(C)ccc2C)ccc1Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cnc2c(c1)OCC(=O)N2",
]

# embed smiles
embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([8, 768])

# decompose from 768 to 512
# {input_size: [B, input_size]} -> {output_size: [num_input_sizes, B, 2, output_size]}
output_sizes = [512]
decomposed_embeddings = decomposer.decompose({embeddings.shape[1]: embeddings},
                                             output_sizes)
for k, v in decomposed_embeddings.items():
    print(k, v.shape)
# 512 torch.Size([1, 8, 2, 512])

# compress inputs to all sizes
# [B, input_size] -> {compressed_size: [B, compressed_size]}
sizes = [32, 64, 128, 256, 512, 768]
embedding_dict = decomposer.compress(embeddings, sizes)
for k, v in embedding_dict.items():
    print(k, v.shape)
# 32 torch.Size([8, 32])
# 64 torch.Size([8, 64])
# 128 torch.Size([8, 128])
# 256 torch.Size([8, 256])
# 512 torch.Size([8, 512])
# 768 torch.Size([8, 768])

# decompose all compressed inputs to all output sizes
# {input_size: [B, input_size]} -> {output_size: [num_input_sizes, B, 2, output_size]}
decomposed_embeddings = decomposer.decompose(embedding_dict, sizes)
for k, v in decomposed_embeddings.items():
    print(k, v.shape)
# 32 torch.Size([6, 8, 2, 32])
# 64 torch.Size([6, 8, 2, 64])
# 128 torch.Size([6, 8, 2, 128])
# 256 torch.Size([6, 8, 2, 256])
# 512 torch.Size([6, 8, 2, 512])
# 768 torch.Size([6, 8, 2, 768])

# when routing multiple inputs to multiple outputs, output tensors are
# stacked in the order of `config.comp_sizes`
input_size = 128
input_index = decomposer.config.comp_sizes.index(input_size)
output_size = 512

# outputs at `output_size` that came specifically from the `input_size` input
out1 = decomposed_embeddings[output_size][input_index]

# compute only `input_size` to `output_size`, no stacking/routing
out2 = decomposer.decompose({input_size: embedding_dict[input_size]},
                            [output_size])[output_size]
torch.allclose(out1, out2, atol=5e-6)
```
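Because the decomposed embeddings are trained to match Enamine building block embeddings, a natural follow-up is nearest-neighbor retrieval against a precomputed building block library. The sketch below continues from `out1` above and is illustrative only: `bb_embeddings` is a hypothetical `[N, 512]` tensor (random here so the snippet runs), and cosine-similarity retrieval is an assumption consistent with the training objective rather than a documented API of this model.

```python
import torch
import torch.nn.functional as F

# Hypothetical library of building block embeddings at size 512.
# In practice these would be precomputed from an Enamine building block catalog;
# random values are used here only so the snippet runs.
bb_embeddings = torch.randn(10_000, 512)

preds = out1                                  # [8, 2, 512]: two predicted blocks per molecule
preds = F.normalize(preds, dim=-1)            # unit vectors -> dot product == cosine similarity
library = F.normalize(bb_embeddings, dim=-1)

sims = preds @ library.T                      # [8, 2, N] similarity to every library entry
best = sims.argmax(dim=-1)                    # closest building block index per predicted slot
print(best.shape)
# torch.Size([8, 2])
```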
## Training Procedure

### Preprocessing
A dataset of 50M molecules was created by assembling a set of 80k Enamine building blocks using in-silico forward synthesis. Product molecules and building blocks were canonicalized and embedded with the `roberta_zinc_480m` model.
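As an illustration of in-silico forward synthesis, the sketch below assembles one product from two building blocks using a single RDKit reaction template. The amide-coupling SMARTS and the example reactants are hypothetical; the actual reaction set and building block list used to generate the 50M-product dataset are not described in this card.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical amide-coupling template: carboxylic acid + primary amine -> amide.
# The real dataset generation used an unspecified set of reactions over ~80k
# Enamine building blocks.
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[NH2:3]>>[C:1](=[O:2])[N:3]")

acid = Chem.MolFromSmiles("OC(=O)c1ccccc1")   # benzoic acid (building block 1)
amine = Chem.MolFromSmiles("NCc1ccncc1")      # 4-(aminomethyl)pyridine (building block 2)

product = rxn.RunReactants((acid, amine))[0][0]
Chem.SanitizeMol(product)
print(Chem.MolToSmiles(product))              # canonical SMILES of the assembled product
```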
### Training Hyperparameters
The model was trained for 6 epochs with a batch size of 2048, learning rate of 1e-3, cosine scheduling, weight decay of 0.01 and 10% warmup.
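A minimal sketch of how these hyperparameters could map onto a PyTorch/transformers training setup is shown below. The actual training script is not included in this card; the optimizer choice (AdamW), the placeholder model, and the step counts are assumptions for demonstration.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(768, 768)             # placeholder for the decomposer
num_epochs, batch_size = 6, 2048
steps_per_epoch = 50_000_000 // batch_size    # ~50M training molecules (assumed)
total_steps = num_epochs * steps_per_epoch

# AdamW is an assumption; the card only states lr, weight decay, schedule, warmup
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% warmup
    num_training_steps=total_steps,
)
```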
### Training Loss
During training, the model is loaded with frozen, pre-trained embedding compression heads from the `roberta_zinc_compression_encoder` model and frozen, pre-computed Enamine building block embeddings at all compression sizes.

The training input is a batch of full-size (768) embeddings from the `roberta_zinc_480m` model. The embeddings are first compressed to all compression sizes. The compressed embeddings are then used to predict decomposed embeddings at all compression sizes.

For the loss, the predicted decomposed embeddings are compared to the ground truth via cosine similarity. We also sample 3072 reference embeddings from the pre-computed Enamine building block embeddings. For every output size, we compute the pair-wise cosine similarity between the predictions and the reference embeddings, and between the ground-truth targets and the reference embeddings. We then compute the row-wise Pearson correlation between the two similarity matrices.
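A sketch of this loss for a single compression size is shown below. How the cosine term and the correlation term are weighted and combined (and whether additional terms are used) is not specified in this card, so the sketch simply sums them; during training it would be computed at every compression size and aggregated.

```python
import torch
import torch.nn.functional as F

def decomposer_loss(pred, target, reference):
    """Illustrative loss for one compression size.

    pred, target: [B, 2, d] predicted / ground-truth building block embeddings
    reference:    [R, d] sampled Enamine building block embeddings (e.g. R = 3072)
    """
    # 1) cosine similarity between predicted and ground-truth embeddings
    cos_term = 1 - F.cosine_similarity(pred, target, dim=-1).mean()

    # 2) pairwise cosine similarity of predictions and targets to the references
    p = F.normalize(pred.reshape(-1, pred.shape[-1]), dim=-1)    # [2B, d]
    t = F.normalize(target.reshape(-1, target.shape[-1]), dim=-1)
    r = F.normalize(reference, dim=-1)                           # [R, d]
    sim_p = p @ r.T                                              # [2B, R]
    sim_t = t @ r.T

    # row-wise Pearson correlation between the two similarity matrices
    sim_p = sim_p - sim_p.mean(dim=1, keepdim=True)
    sim_t = sim_t - sim_t.mean(dim=1, keepdim=True)
    pearson = (sim_p * sim_t).sum(dim=1) / (sim_p.norm(dim=1) * sim_t.norm(dim=1) + 1e-8)
    corr_term = 1 - pearson.mean()

    # weighting/combination of the two terms is not stated; sum as a placeholder
    return cos_term + corr_term
```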
## Model Card Authors
Karl Heyer
## Model Card Contact