---
library_name: transformers
tags:
- chemistry
- molecule
license: mit
---

# Model Card for Roberta Zinc Compression Encoder

### Model Description

`roberta_zinc_compression_encoder` contains several MLP-style compression heads trained to compress 
molecule embeddings from the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model 
from its native dimension of 768 down to smaller dimensions: 512, 256, 128, 64, and 32.

- **Developed by:** Karl Heyer
- **License:** MIT
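
The compression heads are loaded from this repository's custom modeling code via `trust_remote_code=True`. 
As a rough, illustrative sketch only (the layer widths and activation below are assumptions, not the 
repository's actual implementation), each compression size can be pictured as its own small 
encoder/decoder MLP pair:

```python
import torch
import torch.nn as nn

class CompressionHead(nn.Module):
    """Illustrative sketch: one encoder/decoder MLP pair per target dimension."""
    def __init__(self, input_dim: int = 768, compressed_dim: int = 128):
        super().__init__()
        # encoder: 768-d molecule embedding -> compressed_dim
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, input_dim), nn.GELU(),
            nn.Linear(input_dim, compressed_dim),
        )
        # decoder: compressed_dim -> reconstructed 768-d embedding
        self.decoder = nn.Sequential(
            nn.Linear(compressed_dim, input_dim), nn.GELU(),
            nn.Linear(input_dim, input_dim),
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)       # compressed embedding
        x_hat = self.decoder(z)   # reconstruction of the input embedding
        return z, x_hat

# one head per supported compression size
heads = nn.ModuleDict({str(d): CompressionHead(768, d) for d in (512, 256, 128, 64, 32)})
```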


### Direct Use

Usage examples. Note that input SMILES strings should be canonicalized.

```python
from sentence_transformers import models, SentenceTransformer
from transformers import AutoModel

transformer = models.Transformer("entropy/roberta_zinc_480m", 
                                 max_seq_length=256, 
                                 model_args={"add_pooling_layer": False})

pooling = models.Pooling(transformer.get_word_embedding_dimension(), 
                         pooling_mode="mean")

roberta_zinc = SentenceTransformer(modules=[transformer, pooling])

compression_encoder = AutoModel.from_pretrained("entropy/roberta_zinc_compression_encoder", 
                                                trust_remote_code=True)
# smiles should be canonicalized
smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

embeddings = roberta_zinc.encode(smiles, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([5, 768])

compressed_embeddings = compression_encoder.compress(embeddings.cpu(),
                                                    compression_sizes=[32, 64, 128, 256, 512])

for k,v in compressed_embeddings.items():
    print(k, v.shape)

# 32 torch.Size([5, 32])
# 64 torch.Size([5, 64])
# 128 torch.Size([5, 128])
# 256 torch.Size([5, 256])
# 512 torch.Size([5, 512])
```
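
Since the heads are trained to preserve pairwise similarities (see the training loss below), a quick 
sanity check is to compare the pairwise cosine similarities of the original and compressed embeddings. 
This optional follow-up reuses `embeddings` and `compressed_embeddings` from the example above:

```python
import torch
import torch.nn.functional as F

full = F.normalize(embeddings.cpu(), dim=-1)
sims_full = (full @ full.T).flatten()          # pairwise cosine similarities at 768 dims

for size, comp in compressed_embeddings.items():
    small = F.normalize(comp, dim=-1)
    sims_small = (small @ small.T).flatten()   # pairwise cosine similarities after compression
    corr = torch.corrcoef(torch.stack([sims_full, sims_small]))[0, 1]
    print(f"{size}: similarity correlation {corr.item():.4f}")
```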

### Training Procedure

#### Preprocessing

A dataset of 30M SMILES strings was assembled from the [ZINC Database](https://zinc.docking.org/) 
and the [Enamine](https://enamine.net/) REAL Space. SMILES were canonicalized and embedded with the 
[roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model.
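
Canonicalization is typically done with RDKit; a minimal sketch of that preprocessing step (RDKit is 
assumed here and is not a stated dependency of this repository):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    # Returns RDKit's canonical SMILES, or None if the string cannot be parsed.
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

print(canonicalize("C1=CC=CC=C1O"))  # canonical SMILES for phenol
```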

#### Training Hyperparameters

The model was trained for 1 epoch with a learning rate of 1e-3, a cosine learning rate schedule, 
weight decay of 0.01, and 10% warmup.
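
If reproducing this setup with the Hugging Face `Trainer`, those hyperparameters would map roughly to 
the arguments below (an illustrative sketch; the original training script is not published in this 
card, and the batch size is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="roberta_zinc_compression_encoder",
    num_train_epochs=1,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_ratio=0.1,
    per_device_train_batch_size=1024,  # assumption, not stated in the card
)
```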

#### Training Loss

For training, the input batch of embeddings is compressed at every compression size via the 
encoder layers, then reconstructed via the decoder layers.

For the encoder loss, we compute the pairwise similarities of the compressed embeddings and 
compare them to the pairwise similarities of the input embeddings using row-wise Pearson correlation.

For the decoder loss, we compute the cosine similarity between the reconstructed embeddings and the input embeddings.
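
A rough sketch of that objective in code (illustrative only: cosine similarity matrices, the 
`1 - correlation` / `1 - cosine` loss form, and the simple sum of the two terms are assumptions, 
not the actual training script):

```python
import torch
import torch.nn.functional as F

def row_pearson(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Row-wise Pearson correlation between two (batch, batch) similarity matrices:
    # Pearson correlation equals cosine similarity of mean-centered rows.
    a = a - a.mean(dim=1, keepdim=True)
    b = b - b.mean(dim=1, keepdim=True)
    return F.cosine_similarity(a, b, dim=1)

def compression_loss(x: torch.Tensor, z: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # x: (batch, 768) inputs, z: (batch, d) compressed, x_hat: (batch, 768) reconstructions
    sims_x = F.normalize(x, dim=-1) @ F.normalize(x, dim=-1).T
    sims_z = F.normalize(z, dim=-1) @ F.normalize(z, dim=-1).T
    encoder_loss = 1 - row_pearson(sims_x, sims_z).mean()            # preserve pairwise similarities
    decoder_loss = 1 - F.cosine_similarity(x, x_hat, dim=-1).mean()  # reconstruct the inputs
    return encoder_loss + decoder_loss
```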

## Model Card Authors

Karl Heyer

## Model Card Contact

[email protected]
