---
library_name: transformers
tags:
- chemistry
- molecule
license: mit
---

# Model Card for Roberta Zinc Enamine Decomposer

### Model Description

`roberta_zinc_enamine_decomposer` is trained to "decompose" a molecule SMILES embedding into two "building block embeddings" representing the Enamine building blocks expected to assemble into the input molecule. The model is trained to convert embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model, or compressed embeddings from the [roberta_zinc_compression_encoder](https://huggingface.co/entropy/roberta_zinc_compression_encoder) model.

The decomposer can map from any input size (32, 64, 128, 256, 512, 768) to any output size (the same set of values). For an input of shape `(batch_size, d_in)`, the output has shape `(batch_size, 2, d_out)` (two building block embeddings per input).

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.

```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel
import torch

# embedding model: mean-pooled roberta_zinc_480m
transformer = models.Transformer("entropy/roberta_zinc_480m", max_seq_length=256,
                                 model_args={"add_pooling_layer": False})
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

# decomposer model
decomposer = AutoModel.from_pretrained("entropy/roberta_zinc_enamine_decomposer",
                                       trust_remote_code=True)

# smiles should be canonicalized
smiles = [
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C(C)c1ccon1",
    "C#Cc1cc(C(F)(F)F)ccc1Nc1ccc(OC)c(S(=O)(=O)Cl)c1",
    "COc1ccc(NC(=O)c2ccccc2Nc2ccc(OC)c(S(=O)(=O)Cl)c2)c(OC)c1",
    "COc1ccc(OC(=O)c2noc3c2COCC3)cc1S(=O)(=O)Cl",
    "COc1ccc(N2CCC(C(=O)N3CCCc4ccccc43)CC2)cc1S(=O)(=O)Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C1(C)CCCNS1(=O)=O",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cc(-n2c(C)ccc2C)ccc1Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cnc2c(c1)OCC(=O)N2",
]

# embed smiles
embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([8, 768])

# decompose from 768 to 512
# {input_size: [B, input_size]} -> {output_size: [n_input_sizes, B, 2, output_size]}
output_sizes = [512]
decomposed_embeddings = decomposer.decompose({embeddings.shape[1]: embeddings}, output_sizes)

for k, v in decomposed_embeddings.items():
    print(k, v.shape)
# 512 torch.Size([1, 8, 2, 512])

# compress inputs to all sizes
# [B, input_size] -> {compressed_size: [B, compressed_size]}
sizes = [32, 64, 128, 256, 512, 768]
embedding_dict = decomposer.compress(embeddings, sizes)

for k, v in embedding_dict.items():
    print(k, v.shape)
# 32 torch.Size([8, 32])
# 64 torch.Size([8, 64])
# 128 torch.Size([8, 128])
# 256 torch.Size([8, 256])
# 512 torch.Size([8, 512])
# 768 torch.Size([8, 768])

# decompose all compressed inputs to all output sizes
# {input_size: [B, input_size]} -> {output_size: [n_input_sizes, B, 2, output_size]}
decomposed_embeddings = decomposer.decompose(embedding_dict, sizes)

for k, v in decomposed_embeddings.items():
    print(k, v.shape)
# 32 torch.Size([6, 8, 2, 32])
# 64 torch.Size([6, 8, 2, 64])
# 128 torch.Size([6, 8, 2, 128])
# 256 torch.Size([6, 8, 2, 256])
# 512 torch.Size([6, 8, 2, 512])
# 768 torch.Size([6, 8, 2, 768])

# when routing multiple inputs to multiple outputs, output tensors are
# stacked in the order of the `config.comp_sizes` used
input_size = 128
input_index = decomposer.config.comp_sizes.index(input_size)
output_size = 512

# outputs at `output_size` that came specifically from the `input_size` input
out1 = decomposed_embeddings[output_size][input_index]

# compute only `input_size` to `output_size`, no stacking/routing
out2 = decomposer.decompose({input_size: embedding_dict[input_size]}, [output_size])[output_size]

torch.allclose(out1, out2, atol=5e-6)
```
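The decomposed embeddings are meant to be matched back to real Enamine building blocks. Continuing from the example above, the sketch below shows one way to do that with a simple cosine-similarity lookup. Here `bb_smiles` and `bb_embeddings` are hypothetical placeholders for a building block catalog you embed yourself with the same `roberta_zinc` encoder; they are not assets shipped with this model.

```python
import torch.nn.functional as F

# hypothetical building block catalog, embedded with the same roberta_zinc encoder
bb_smiles = ["CC(C)(C)OC(=O)N1CCNCC1", "O=C(O)c1ccco1"]  # placeholder canonical SMILES
bb_embeddings = roberta_zinc.encode(bb_smiles, convert_to_tensor=True)  # [n_blocks, 768]

# two predicted 768-d building block embeddings per input molecule
decomposed = decomposer.decompose({768: embeddings}, [768])[768][0]  # [8, 2, 768]

# cosine similarity of each predicted embedding against the catalog
sims = F.normalize(decomposed, dim=-1) @ F.normalize(bb_embeddings, dim=-1).T  # [8, 2, n_blocks]

# nearest catalog building block for each of the two predicted embeddings
best = sims.argmax(dim=-1)  # [8, 2]
for mol, (i, j) in zip(smiles, best.tolist()):
    print(mol, "->", bb_smiles[i], "+", bb_smiles[j])
```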
### Training Procedure

#### Preprocessing

A dataset of 50M molecules was created by assembling a set of 80k [Enamine](https://enamine.net/) building blocks using in silico forward synthesis. Product molecules and building blocks were canonicalized and embedded with the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model.

#### Training Hyperparameters

The model was trained for 6 epochs with a batch size of 2048, a learning rate of 1e-3, cosine scheduling, weight decay of 0.01, and 10% warmup.

#### Training Loss

During training, the model is loaded with frozen, pre-trained embedding compression heads from the [roberta_zinc_compression_encoder](https://huggingface.co/entropy/roberta_zinc_compression_encoder) model and frozen, pre-computed Enamine building block embeddings at all compression sizes.

The training input is a batch of full-size (768) embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model. These embeddings are first compressed to all compression sizes, and the compressed embeddings are then used to predict decomposed embeddings at all compression sizes.

For the loss, the predicted decomposed embeddings are compared to the ground truth via cosine similarity. We then sample 3072 reference embeddings from the pre-computed Enamine building block embeddings. At every compression size, we compute the pairwise cosine similarity between the predicted and ground-truth embeddings and the reference embeddings, then compute the row-wise Pearson correlation between the predicted and ground-truth similarity matrices. A minimal sketch of this loss appears at the end of this card.

## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz
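As referenced in the Training Loss section above, the sketch below illustrates how the described objective could be computed for a single compression size: a cosine-similarity term between predicted and ground-truth building block embeddings, plus a row-wise Pearson correlation between the predicted and ground-truth similarity matrices computed against sampled reference embeddings. The function name, shapes, and the way the two terms are combined are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def decomposition_loss(pred, target, reference, eps=1e-8):
    """Illustrative loss for one compression size (assumed combination of terms).

    pred, target: [batch, 2, d]  predicted / ground-truth building block embeddings
    reference:    [n_ref, d]     sampled Enamine building block embeddings (e.g. n_ref=3072)
    """
    # 1) cosine similarity between predictions and ground truth
    cos_term = 1 - F.cosine_similarity(pred, target, dim=-1).mean()

    # 2) pairwise cosine similarity of predictions/targets against the references
    pred_flat = F.normalize(pred.reshape(-1, pred.shape[-1]), dim=-1)        # [batch*2, d]
    target_flat = F.normalize(target.reshape(-1, target.shape[-1]), dim=-1)  # [batch*2, d]
    ref_norm = F.normalize(reference, dim=-1)                                # [n_ref, d]
    sim_pred = pred_flat @ ref_norm.T                                        # [batch*2, n_ref]
    sim_target = target_flat @ ref_norm.T                                    # [batch*2, n_ref]

    # 3) row-wise Pearson correlation between the two similarity matrices
    sp = sim_pred - sim_pred.mean(dim=1, keepdim=True)
    st = sim_target - sim_target.mean(dim=1, keepdim=True)
    pearson = (sp * st).sum(dim=1) / (sp.norm(dim=1) * st.norm(dim=1) + eps)

    return cos_term + (1 - pearson.mean())

# toy shapes: batch of 8 molecules, 768-d embeddings, 3072 sampled references
loss = decomposition_loss(torch.randn(8, 2, 768), torch.randn(8, 2, 768), torch.randn(3072, 768))
```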