# Model Card for Roberta Zinc Compression Encoder
## Model Description

`roberta_zinc_compression_encoder` contains several MLP-style compression heads trained to compress molecule embeddings from the roberta_zinc_480m model, reducing the native dimension of 768 to smaller dimensions: 512, 256, 128, 64, and 32.
- Developed by: Karl Heyer
- License: MIT
## Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.
```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel

transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})
pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")
roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

compression_encoder = AutoModel.from_pretrained("entropy/roberta_zinc_compression_encoder",
                                                trust_remote_code=True)

# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([5, 768])

compressed_embeddings = compression_encoder.compress(embeddings.cpu(),
                                                     compression_sizes=[32, 64, 128, 256, 512])

for k, v in compressed_embeddings.items():
    print(k, v.shape)
# 32 torch.Size([5, 32])
# 64 torch.Size([5, 64])
# 128 torch.Size([5, 128])
# 256 torch.Size([5, 256])
# 512 torch.Size([5, 512])
```
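Because the compression heads are trained to preserve pairwise similarity structure (see Training Loss below), a quick sanity check is to compare the similarity matrix of the full 768-d embeddings with the one computed from each compressed set. This is a minimal sketch reusing the `embeddings` and `compressed_embeddings` variables from the example above; it is illustrative, not part of the model's code.

```python
import torch.nn.functional as F

# Pairwise cosine similarities in the full 768-d space
full = F.normalize(embeddings.cpu(), dim=-1)
full_sims = full @ full.T

for size, emb in compressed_embeddings.items():
    # Pairwise cosine similarities in the compressed space
    comp = F.normalize(emb, dim=-1)
    comp_sims = comp @ comp.T

    # Mean absolute difference between the two similarity matrices
    print(size, (full_sims - comp_sims).abs().mean().item())
```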
## Training Procedure

### Preprocessing

A dataset of 30M SMILES strings was assembled from the ZINC database and the Enamine REAL space. SMILES were canonicalized and embedded with the roberta_zinc_480m model.
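The card does not state which toolkit was used for canonicalization; RDKit is a common choice, and the sketch below shows one way to canonicalize SMILES before embedding. The `canonicalize` helper is illustrative, not part of the model's code.

```python
from rdkit import Chem

def canonicalize(smiles):
    # Parse and re-emit the SMILES in RDKit's canonical form; return None on parse failure
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

raw_smiles = ["C1=CC=CC=C1Br", "Brc1cc2c(NCc3ccccc3)ncnc2s1"]
canonical_smiles = [s for s in (canonicalize(s) for s in raw_smiles) if s is not None]
print(canonical_smiles)
```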
### Training Hyperparameters

The model was trained for 1 epoch with a learning rate of 1e-3, cosine scheduling, weight decay of 0.01, and 10% warmup.
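For reference, this schedule can be reproduced with a standard PyTorch/transformers setup. Only the learning rate, weight decay, warmup fraction, and cosine schedule come from the card; the optimizer (AdamW assumed), the stand-in model, and the step count below are assumptions.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(768, 64)   # stand-in for the compression encoder (assumption)
num_training_steps = 100_000       # depends on dataset/batch size, not stated in the card

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # AdamW assumed
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup
    num_training_steps=num_training_steps,           # cosine decay over the single epoch
)
```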
### Training Loss

For training, the input batch of embeddings is compressed at all compression sizes via the encoder layers, then reconstructed via the decoder layers.

For the encoder loss, we compute the pairwise similarities of the compressed embeddings and compare them to the pairwise similarities of the input embeddings using row-wise Pearson correlation.

For the decoder loss, we compute the cosine similarity of the reconstructed embeddings to the inputs.
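A minimal sketch of this objective, assuming cosine similarity for the pairwise matrices and a simple sum of the two terms (the exact similarity metric and weighting are not stated in the card):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(x):
    # (batch, dim) -> (batch, batch) cosine similarity matrix
    x = F.normalize(x, dim=-1)
    return x @ x.T

def rowwise_pearson(a, b):
    # Pearson correlation between corresponding rows of two (batch, batch) matrices;
    # centering each row and taking cosine similarity is equivalent to Pearson correlation
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    return F.cosine_similarity(a, b, dim=-1)

def compression_loss(inputs, compressed, reconstructed):
    # Encoder term: compressed embeddings should preserve the input similarity structure
    input_sims = pairwise_cosine(inputs)
    encoder_loss = (1 - rowwise_pearson(pairwise_cosine(compressed), input_sims)).mean()
    # Decoder term: reconstructions should align with the original inputs
    decoder_loss = (1 - F.cosine_similarity(reconstructed, inputs, dim=-1)).mean()
    return encoder_loss + decoder_loss
```

In training, a loss of this form would be computed for each compression size and aggregated (a simple sum is assumed here).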
## Model Card Authors

Karl Heyer

## Model Card Contact