---
library_name: transformers
tags:
- chemistry
- molecule
license: mit
---

# Model Card for Roberta Zinc Compression Encoder

### Model Description

`roberta_zinc_compression_encoder` contains several MLP-style compression heads trained to compress
molecule embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m)
model from their native dimension of 768 to smaller dimensions: 512, 256, 128, 64, and 32.

- **Developed by:** Karl Heyer
- **License:** MIT

### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.

```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel

transformer = models.Transformer("entropy/roberta_zinc_480m",
                                 max_seq_length=256,
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                         pooling_mode="mean")

roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

compression_encoder = AutoModel.from_pretrained("entropy/roberta_zinc_compression_encoder",
                                                trust_remote_code=True)
# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([5, 768])

compressed_embeddings = compression_encoder.compress(embeddings.cpu(),
                                                     compression_sizes=[32, 64, 128, 256, 512])

for k, v in compressed_embeddings.items():
    print(k, v.shape)

# 32 torch.Size([5, 32])
# 64 torch.Size([5, 64])
# 128 torch.Size([5, 128])
# 256 torch.Size([5, 256])
# 512 torch.Size([5, 512])
```
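
Canonicalization itself is not shown above. A minimal sketch using RDKit (an assumption; the card does not specify which tool was used for canonicalization) might look like this:

```python
# Hypothetical helper (not part of this model's API): canonicalize SMILES with RDKit
# before passing them to roberta_zinc.encode.
from rdkit import Chem

def canonicalize(smiles_list):
    canonical = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip SMILES that RDKit cannot parse
        canonical.append(Chem.MolToSmiles(mol))  # canonical SMILES by default
    return canonical

smiles = canonicalize(["Brc1cc2c(NCc3ccccc3)ncnc2s1"])
```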

### Training Procedure

#### Preprocessing

A dataset of 30M SMILES strings was assembled from the [ZINC Database](https://zinc.docking.org/)
and the [Enamine](https://enamine.net/) REAL Space. SMILES were canonicalized and embedded with the
[roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model.

#### Training Hyperparameters

The model was trained for 1 epoch with a learning rate of 1e-3, cosine scheduling, weight decay of 0.01,
and 10% warmup.
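
For reference, these settings map roughly onto a standard 🤗 `TrainingArguments` configuration. This is a sketch only; the actual training script is not included in this repository, and the output path is hypothetical.

```python
# Sketch only: roughly equivalent transformers TrainingArguments for the reported
# hyperparameters. Not the actual training code.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="compression_encoder",  # hypothetical output path
    num_train_epochs=1,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_ratio=0.1,  # 10% warmup
)
```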

#### Training Loss

For training, the input batch of embeddings is compressed at every compression size via
the encoder layers, then reconstructed via the decoder layers.

For the encoder, we compute the pairwise similarities of the compressed embeddings and
compare them to the pairwise similarities of the input embeddings using row-wise Pearson correlation.

For the decoder, we compute the cosine similarity of the reconstructed embeddings to the inputs.
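
A rough PyTorch sketch of this objective is shown below. It is illustrative only: the `encoders`/`decoders` dictionaries and function names are assumptions standing in for the per-size MLP heads, not the actual training code.

```python
# Illustrative sketch of the training objective described above; not the actual training code.
import torch
import torch.nn.functional as F

def pearson_rowwise(a, b):
    # row-wise Pearson correlation between two similarity matrices
    a = a - a.mean(dim=1, keepdim=True)
    b = b - b.mean(dim=1, keepdim=True)
    return F.cosine_similarity(a, b, dim=1)

def compression_loss(embeddings, encoders, decoders):
    # pairwise cosine similarities of the full-size input embeddings
    inputs = F.normalize(embeddings, dim=1)
    target_sims = inputs @ inputs.T

    loss = 0.0
    for size in encoders:
        compressed = encoders[size](embeddings)      # encode to the smaller dimension
        reconstructed = decoders[size](compressed)   # decode back to 768

        # encoder term: compressed pairwise similarities should correlate with the input ones
        comp = F.normalize(compressed, dim=1)
        comp_sims = comp @ comp.T
        loss = loss + (1 - pearson_rowwise(comp_sims, target_sims)).mean()

        # decoder term: reconstructions should align with the original embeddings
        loss = loss + (1 - F.cosine_similarity(reconstructed, embeddings, dim=1)).mean()
    return loss
```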

## Model Card Authors

Karl Heyer

## Model Card Contact

[email protected]