Model Card for Roberta Zinc Enamine Decomposer

Model Description

roberta_zinc_enamine_decomposer is trained to "decompose" a molecule SMILES embedding into two "building block embeddings" representing Enamine building blocks expected to assemble into the input molecule.

The model is trained to convert embeddings from the roberta_zinc_480m model, or compressed embeddings from the roberta_zinc_compression_encoder model. The decomposer can map from any input size (32, 64, 128, 256, 512, 768) to any output size (same values). For an input of shape (batch_size, d_in), the output has shape (batch_size, 2, d_out): two building block embeddings per input molecule.
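
As a quick sketch of the shape contract (using a random tensor as a stand-in for real embeddings; see Direct Use below for the full pipeline):

import torch
from transformers import AutoModel

decomposer = AutoModel.from_pretrained("entropy/roberta_zinc_enamine_decomposer",
                                       trust_remote_code=True)

x = torch.randn(4, 768)  # stand-in for a batch of 4 full-size embeddings
out = decomposer.decompose({768: x}, [256])[256]
print(out.shape)
# torch.Size([1, 4, 2, 256]) -> one input size, batch of 4, two building block embeddings of size 256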

  • Developed by: Karl Heyer
  • License: MIT

Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.
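
A minimal canonicalization sketch using RDKit (the assumption here is that RDKit canonical SMILES is the appropriate form; adjust if your pipeline uses a different canonicalization):

from rdkit import Chem

def canonicalize(smi: str) -> str:
    # round-trip through RDKit to produce a canonical SMILES string
    return Chem.MolToSmiles(Chem.MolFromSmiles(smi))

print(canonicalize("OC(=O)c1ccccc1"))
# O=C(O)c1ccccc1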

from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel
import torch

transformer = models.Transformer("entropy/roberta_zinc_480m", 
                                 max_seq_length=256, 
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(), 
                         pooling_mode="mean")

roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

decomposer = AutoModel.from_pretrained("entropy/roberta_zinc_enamine_decomposer", 
                                       trust_remote_code=True)

# smiles should be canonicalized
smiles = [
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C(C)c1ccon1",
    "C#Cc1cc(C(F)(F)F)ccc1Nc1ccc(OC)c(S(=O)(=O)Cl)c1",
    "COc1ccc(NC(=O)c2ccccc2Nc2ccc(OC)c(S(=O)(=O)Cl)c2)c(OC)c1",
    "COc1ccc(OC(=O)c2noc3c2COCC3)cc1S(=O)(=O)Cl",
    "COc1ccc(N2CCC(C(=O)N3CCCc4ccccc43)CC2)cc1S(=O)(=O)Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C1(C)CCCNS1(=O)=O",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cc(-n2c(C)ccc2C)ccc1Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cnc2c(c1)OCC(=O)N2"
]

# embed smiles
embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([8, 768])

# decompose from 768 to 512
# {input_size: [B, input_size]} -> {output_size: [num_input_sizes, B, 2, output_size]}
output_sizes = [512]
decomposed_embeddings = decomposer.decompose({embeddings.shape[1]: embeddings}, 
                                             output_sizes)

for k,v in decomposed_embeddings.items():
    print(k,v.shape)
    
# 512 torch.Size([1, 8, 2, 512])


# compress inputs to all sizes
# [B, input_size] -> {compressed_size: [B, compressed_size]}
sizes = [32, 64, 128, 256, 512, 768]
embedding_dict = decomposer.compress(embeddings, sizes)
for k,v in embedding_dict.items():
    print(k, v.shape)
# 32 torch.Size([8, 32])
# 64 torch.Size([8, 64])
# 128 torch.Size([8, 128])
# 256 torch.Size([8, 256])
# 512 torch.Size([8, 512])
# 768 torch.Size([8, 768])

# decompose all compressed inputs to all output sizes
# {input_size: [B, input_size]} -> {output_size: [num_input_sizes, B, 2, output_size]}
decomposed_embeddings = decomposer.decompose(embedding_dict, sizes)
for k,v in decomposed_embeddings.items():
    print(k,v.shape)
# 32 torch.Size([6, 8, 2, 32])
# 64 torch.Size([6, 8, 2, 64])
# 128 torch.Size([6, 8, 2, 128])
# 256 torch.Size([6, 8, 2, 256])
# 512 torch.Size([6, 8, 2, 512])
# 768 torch.Size([6, 8, 2, 768])
    

# when routing multiple inputs to multiple output sizes, the output tensors
# for each output size are stacked along dim 0 in the order of the input
# sizes used (following the order in `config.comp_sizes`)

input_size = 128
input_index = decomposer.config.comp_sizes.index(input_size)
output_size = 512

# outputs at `output_size` that came specifically from the `input_size` input
out1 = decomposed_embeddings[output_size][input_index]

# compute only `input_size` to `output_size`, no stacking/routing
out2 = decomposer.decompose({input_size: embedding_dict[input_size]}, 
                            [output_size])[output_size]

print(torch.allclose(out1, out2, atol=5e-6))  # True
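
A typical next step is to match the predicted building block embeddings against a library of real Enamine building block embeddings by cosine similarity. The sketch below is illustrative only: the two building block SMILES are hypothetical placeholders, and the library should be embedded with the same roberta_zinc / compression pipeline as the inputs.

import torch.nn.functional as F

# hypothetical library of candidate building block SMILES (canonicalized)
building_blocks = ["COc1ccc(F)cc1S(=O)(=O)Cl", "CC(C(=O)O)c1ccon1"]
bb_embeddings = roberta_zinc.encode(building_blocks, convert_to_tensor=True)  # (N, 768)

# predicted building block embeddings at size 768 for each input molecule
pred = decomposer.decompose({768: embeddings}, [768])[768][0]  # (B, 2, 768)

# cosine similarity of the two predicted building blocks of the first molecule
# against the library, then the nearest library entry for each slot
sims = F.cosine_similarity(pred[0].unsqueeze(1), bb_embeddings.unsqueeze(0), dim=-1)  # (2, N)
best = sims.argmax(dim=-1)
print([building_blocks[i] for i in best])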

Training Procedure

Preprocessing

A dataset of 50M molecules was created by assembling a set of 80k Enamine building blocks using in-silico forward synthesis. Product molecules and building blocks were canonicalized and embedded with the roberta_zinc_480m model.
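
The exact reaction set is not specified here, but this kind of in-silico forward synthesis is commonly done with RDKit reaction SMARTS. The sketch below uses an amide coupling purely as a hypothetical example reaction:

from rdkit import Chem
from rdkit.Chem import AllChem

# illustrative amide coupling; not necessarily a reaction used for the actual dataset
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OX2H1].[NX3;H2:3]>>[C:1](=[O:2])[N:3]")
acid = Chem.MolFromSmiles("OC(=O)c1ccccc1")   # benzoic acid
amine = Chem.MolFromSmiles("NCc1ccccc1")      # benzylamine

for products in rxn.RunReactants((acid, amine)):
    product = products[0]
    Chem.SanitizeMol(product)
    print(Chem.MolToSmiles(product))  # canonical SMILES of the enumerated product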

Training Hyperparameters

The model was trained for 6 epochs with a batch size of 2048, a learning rate of 1e-3, cosine learning rate scheduling, a weight decay of 0.01, and 10% warmup.
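
The hyperparameters above map onto a standard Hugging Face TrainingArguments configuration roughly as follows (the training loop itself is not shown in this card, so the exact setup is an assumption):

from transformers import TrainingArguments

# sketch of the reported hyperparameters; optimizer and other defaults are assumptions
args = TrainingArguments(
    output_dir="decomposer_training",
    num_train_epochs=6,
    per_device_train_batch_size=2048,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_ratio=0.1,
)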

Training Loss

During training, the model is loaded with frozen, pre-trained embedding compression heads from the roberta_zinc_compression_encoder model and frozen, pre-computed Enamine building block embeddings at all compression sizes.

The training input is a batch of full-size (768) embeddings from the roberta_zinc_480m model. These are first compressed to every compression size, and each compressed embedding is then used to predict decomposed building block embeddings at every output size.

For the loss, the predicted decomposed embeddings are first compared to the ground-truth building block embeddings via cosine similarity. We then sample 3072 reference embeddings from the pre-computed Enamine building block embeddings. For every output size, we compute the pairwise cosine similarity of both the predictions and the targets against the reference set, then compute the row-wise Pearson correlation between the two resulting similarity matrices.
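
A sketch of this loss for a single output size is shown below; how the cosine term and the correlation term are weighted, and how the reference embeddings are sampled at each step, are assumptions:

import torch
import torch.nn.functional as F

def decomposer_loss(pred, target, bb_reference, n_ref=3072):
    # pred, target: (B, 2, d) predicted / ground-truth building block embeddings
    # bb_reference: (N, d) pre-computed Enamine building block embeddings
    direct = 1 - F.cosine_similarity(pred, target, dim=-1).mean()

    # sample reference embeddings and normalize for cosine similarity
    idx = torch.randperm(bb_reference.size(0))[:n_ref]
    ref = F.normalize(bb_reference[idx], dim=-1)  # (n_ref, d)

    # pairwise cosine similarity of predictions / targets against the reference set
    p = F.normalize(pred.reshape(-1, pred.size(-1)), dim=-1) @ ref.T    # (2B, n_ref)
    t = F.normalize(target.reshape(-1, target.size(-1)), dim=-1) @ ref.T

    # row-wise Pearson correlation between the two similarity matrices
    p = p - p.mean(dim=1, keepdim=True)
    t = t - t.mean(dim=1, keepdim=True)
    corr = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + 1e-8)

    # equal weighting of the two terms is an assumption
    return direct + (1 - corr).mean()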

Model Card Authors

Karl Heyer

Model Card Contact

[email protected]

