---
library_name: transformers
tags:
- chemistry
- molecule
license: mit
---

# Model Card for Roberta Zinc Enamine Decomposer

### Model Description

`roberta_zinc_enamine_decomposer` is trained to "decompose" a molecule SMILES embedding into two "building block embeddings" representing the Enamine building blocks expected to assemble into the input molecule. The model is trained to convert embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model, or compressed embeddings from the [roberta_zinc_compression_encoder](https://huggingface.co/entropy/roberta_zinc_compression_encoder) model.

The decomposer can map from any input size (32, 64, 128, 256, 512, 768) to any output size (the same set of values). For an input of shape `(batch_size, d_in)`, the output has shape `(batch_size, 2, d_out)` (two building block embeddings per input).

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.

```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel
import torch

# embedding model: mean-pooled roberta_zinc_480m
transformer = models.Transformer("entropy/roberta_zinc_480m", max_seq_length=256,
                                 model_args={"add_pooling_layer": False})
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

# decomposer model
decomposer = AutoModel.from_pretrained("entropy/roberta_zinc_enamine_decomposer",
                                       trust_remote_code=True)

# smiles should be canonicalized
smiles = [
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C(C)c1ccon1",
    "C#Cc1cc(C(F)(F)F)ccc1Nc1ccc(OC)c(S(=O)(=O)Cl)c1",
    "COc1ccc(NC(=O)c2ccccc2Nc2ccc(OC)c(S(=O)(=O)Cl)c2)c(OC)c1",
    "COc1ccc(OC(=O)c2noc3c2COCC3)cc1S(=O)(=O)Cl",
    "COc1ccc(N2CCC(C(=O)N3CCCc4ccccc43)CC2)cc1S(=O)(=O)Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)C1(C)CCCNS1(=O)=O",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cc(-n2c(C)ccc2C)ccc1Cl",
    "COc1ccc(F)cc1S(=O)(=O)OC(=O)c1cnc2c(c1)OCC(=O)N2",
]

# embed smiles
embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([8, 768])

# decompose from 768 to 512
# {input_size: [B, input_size]} -> {output_size: [n_input_sizes, B, 2, output_size]}
output_sizes = [512]
decomposed_embeddings = decomposer.decompose({embeddings.shape[1]: embeddings}, output_sizes)

for k, v in decomposed_embeddings.items():
    print(k, v.shape)
# 512 torch.Size([1, 8, 2, 512])

# compress inputs to all sizes
# [B, input_size] -> {compressed_size: [B, compressed_size]}
sizes = [32, 64, 128, 256, 512, 768]
embedding_dict = decomposer.compress(embeddings, sizes)

for k, v in embedding_dict.items():
    print(k, v.shape)
# 32 torch.Size([8, 32])
# 64 torch.Size([8, 64])
# 128 torch.Size([8, 128])
# 256 torch.Size([8, 256])
# 512 torch.Size([8, 512])
# 768 torch.Size([8, 768])

# decompose all compressed inputs to all output sizes
# {input_size: [B, input_size]} -> {output_size: [n_input_sizes, B, 2, output_size]}
decomposed_embeddings = decomposer.decompose(embedding_dict, sizes)

for k, v in decomposed_embeddings.items():
    print(k, v.shape)
# 32 torch.Size([6, 8, 2, 32])
# 64 torch.Size([6, 8, 2, 64])
# 128 torch.Size([6, 8, 2, 128])
# 256 torch.Size([6, 8, 2, 256])
# 512 torch.Size([6, 8, 2, 512])
# 768 torch.Size([6, 8, 2, 768])

# when routing multiple inputs to multiple outputs, output tensors are
# stacked in the order of the `config.comp_sizes` used
input_size = 128
input_index = decomposer.config.comp_sizes.index(input_size)
output_size = 512

# outputs at `output_size` that came specifically from the `input_size` input
out1 = decomposed_embeddings[output_size][input_index]

# compute only `input_size` to `output_size`, no stacking/routing
out2 = decomposer.decompose({input_size: embedding_dict[input_size]}, [output_size])[output_size]

torch.allclose(out1, out2, atol=5e-6)
```
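The decomposed embeddings are meant to be matched back to real Enamine building blocks. Continuing from the example above, the sketch below shows one way to do that with a simple cosine-similarity lookup. Here `bb_smiles` and `bb_embeddings` are hypothetical placeholders for a building block catalog you embed yourself with the same `roberta_zinc` encoder; they are not assets shipped with this model.

```python
import torch.nn.functional as F

# hypothetical building block catalog, embedded with the same roberta_zinc encoder
bb_smiles = ["CC(C)(C)OC(=O)N1CCNCC1", "O=C(O)c1ccco1"]  # placeholder canonical SMILES
bb_embeddings = roberta_zinc.encode(bb_smiles, convert_to_tensor=True)  # [n_blocks, 768]

# two predicted 768-d building block embeddings per input molecule
decomposed = decomposer.decompose({768: embeddings}, [768])[768][0]  # [8, 2, 768]

# cosine similarity of each predicted embedding against the catalog
sims = F.normalize(decomposed, dim=-1) @ F.normalize(bb_embeddings, dim=-1).T  # [8, 2, n_blocks]

# nearest catalog building block for each of the two predicted embeddings
best = sims.argmax(dim=-1)  # [8, 2]
for mol, (i, j) in zip(smiles, best.tolist()):
    print(mol, "->", bb_smiles[i], "+", bb_smiles[j])
```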
### Training Procedure

#### Preprocessing

A dataset of 50M molecules was created by assembling a set of 80k [Enamine](https://enamine.net/) building blocks using in silico forward synthesis. Product molecules and building blocks were canonicalized and embedded with the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model.

#### Training Hyperparameters

The model was trained for 6 epochs with a batch size of 2048, a learning rate of 1e-3, cosine scheduling, weight decay of 0.01, and 10% warmup.

#### Training Loss

During training, the model is loaded with frozen, pre-trained embedding compression heads from the [roberta_zinc_compression_encoder](https://huggingface.co/entropy/roberta_zinc_compression_encoder) model and frozen, pre-computed Enamine building block embeddings at all compression sizes.

The training input is a batch of full-size (768) embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model. These embeddings are first compressed to all compression sizes, and the compressed embeddings are then used to predict decomposed embeddings at all compression sizes.

For the loss, the predicted decomposed embeddings are compared to the ground truth via cosine similarity. We then sample 3072 reference embeddings from the pre-computed Enamine building block embeddings. At every compression size, we compute the pairwise cosine similarity between the predicted and ground-truth embeddings and the reference embeddings, then compute the row-wise Pearson correlation between the predicted and ground-truth similarity matrices. A minimal sketch of this loss appears at the end of this card.

## Model Card Authors

Karl Heyer

## Model Card Contact

karl@darmatterai.xyz
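As referenced in the Training Loss section above, the sketch below illustrates how the described objective could be computed for a single compression size: a cosine-similarity term between predicted and ground-truth building block embeddings, plus a row-wise Pearson correlation between the predicted and ground-truth similarity matrices computed against sampled reference embeddings. The function name, shapes, and the way the two terms are combined are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def decomposition_loss(pred, target, reference, eps=1e-8):
    """Illustrative loss for one compression size (assumed combination of terms).

    pred, target: [batch, 2, d]  predicted / ground-truth building block embeddings
    reference:    [n_ref, d]     sampled Enamine building block embeddings (e.g. n_ref=3072)
    """
    # 1) cosine similarity between predictions and ground truth
    cos_term = 1 - F.cosine_similarity(pred, target, dim=-1).mean()

    # 2) pairwise cosine similarity of predictions/targets against the references
    pred_flat = F.normalize(pred.reshape(-1, pred.shape[-1]), dim=-1)        # [batch*2, d]
    target_flat = F.normalize(target.reshape(-1, target.shape[-1]), dim=-1)  # [batch*2, d]
    ref_norm = F.normalize(reference, dim=-1)                                # [n_ref, d]
    sim_pred = pred_flat @ ref_norm.T                                        # [batch*2, n_ref]
    sim_target = target_flat @ ref_norm.T                                    # [batch*2, n_ref]

    # 3) row-wise Pearson correlation between the two similarity matrices
    sp = sim_pred - sim_pred.mean(dim=1, keepdim=True)
    st = sim_target - sim_target.mean(dim=1, keepdim=True)
    pearson = (sp * st).sum(dim=1) / (sp.norm(dim=1) * st.norm(dim=1) + eps)

    return cos_term + (1 - pearson.mean())

# toy shapes: batch of 8 molecules, 768-d embeddings, 3072 sampled references
loss = decomposition_loss(torch.randn(8, 2, 768), torch.randn(8, 2, 768), torch.randn(3072, 768))
```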