Update README.md
README.md
CHANGED
@@ -1,3 +1,84 @@
- ---
- license: apache-2.0
-
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- chemistry
---

# GP-MoLFormer-Uniq

GP-MoLFormer is a class of models pretrained on SMILES string representations of 0.65-1.1B molecules from ZINC and PubChem.
This repository is for the model pretrained on all the _unique_ molecules from both datasets.

It was introduced in the paper [GP-MoLFormer: A Foundation Model For Molecular Generation](https://arxiv.org/abs/2405.04912) by Ross et al. and released in [this repository](https://github.com/IBM/gp-molformer).

## Model Details

### Model Description

GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.

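For reference, a causal language modeling objective of the kind named above is usually written as a negative log-likelihood over next-token predictions. The notation here is an assumption for illustration and is not taken from the paper: $x_t$ is the $t$-th SMILES token, $T$ is the sequence length, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Under this factorization, generation proceeds one token at a time, each sampled from the model's distribution conditioned on the tokens produced so far.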
GP-MoLFormer was evaluated on _de novo_ generation (*at scale*), scaffold-constrained decoration, and molecular property optimization tasks.

## Intended use and limitations

The pretrained model may be used out-of-the-box for unconditional, _de novo_ molecule generation. It can also be prompted with a partial SMILES string for scaffold completion or decoration. We also demonstrate that it can be fine-tuned on a particular dataset to shift the output distribution (e.g., toward more druglike molecules) or tuned for molecular optimization using **pair-tuning**. For details, see the paper and GitHub repository.

This model has not been tested for classification performance, nor for molecules larger than ~200 atoms (i.e., macromolecules). Furthermore, using invalid or noncanonical SMILES may result in worse performance.

## Example code

Use the code below to get started with the model.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers run the custom code hosted in the repository.
model = AutoModelForCausalLM.from_pretrained("ibm-research/GP-MoLFormer-Uniq", trust_remote_code=True)
# The tokenizer is loaded from the MoLFormer-XL repository.
tokenizer = AutoTokenizer.from_pretrained("ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True)

# Sample three molecules unconditionally; top_k=None turns off top-k filtering.
outputs = model.generate(do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
# Decode token IDs back to SMILES strings and print them.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

## Training Details

### Data

We trained GP-MoLFormer on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on all _unique_ molecules from both datasets.

Molecules were canonicalized with RDKit prior to training, and isomeric information was removed. Molecules longer than 202 tokens were dropped.

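The preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the training pipeline itself: the function names are hypothetical, the token regex is a SMILES tokenization pattern in the style commonly used for chemical language models and may not match the released tokenizer exactly, and whether the 202-token limit counts special tokens is not stated here. The RDKit calls (`Chem.MolFromSmiles`, `Chem.MolToSmiles` with `isomericSmiles=False`) are imported lazily so the rest of the sketch runs without RDKit installed.

```python
import re
from typing import Callable, Iterable, List, Optional

# Assumption: an approximate SMILES token pattern; the real tokenizer may differ.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)
MAX_TOKENS = 202  # length limit stated in the model card


def count_tokens(smiles: str) -> int:
    """Count tokens under the approximate regex above."""
    return len(SMILES_TOKEN.findall(smiles))


def canonicalize(smiles: str) -> Optional[str]:
    """Return canonical, non-isomeric SMILES via RDKit, or None if parsing fails."""
    from rdkit import Chem  # lazy import; assumes RDKit is installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=False)


def preprocess(
    smiles_iter: Iterable[str],
    canon: Callable[[str], Optional[str]] = canonicalize,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Canonicalize each molecule, then drop unparsable or overly long ones."""
    kept = []
    for smiles in smiles_iter:
        canonical = canon(smiles)
        if canonical is not None and count_tokens(canonical) <= max_tokens:
            kept.append(canonical)
    return kept
```

With RDKit available, `preprocess(["C[C@H](N)C(=O)O", "not a molecule"])` would canonicalize the first entry without stereo markers and discard the second. Passing a different `canon` callable makes it possible to try the length filter on its own.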
### Hardware

- 16 x NVIDIA A100 80GB GPUs

## Evaluation

We evaluated GP-MoLFormer on various generation metrics. The table below shows the performance of GP-MoLFormer-Uniq compared to baseline models:

| Model             | Val↑       | Uniq@10k↑       | Nov↑       | Frag↑       | Scaf↑       | SNN↑       | IntDiv↑       | FCD↓       |
|-------------------|------------|-----------------|------------|-------------|-------------|------------|---------------|------------|
| CharRNN           | 0.975      | 0.999           | 0.842      | **0.9998**  | 0.9242      | 0.6015     | 0.8562        | 0.0732     |
| VAE               | 0.977      | 0.998           | 0.695      | 0.9984      | **0.9386**  | **0.6257** | 0.8558        | 0.0990     |
| JT-VAE            | **1.000**  | **1.000**       | 0.914      | 0.9965      | 0.8964      | 0.5477     | 0.8551        | 0.3954     |
| LIMO              | **1.000**  | 0.976           | **1.000**  | 0.6989      | 0.0079      | 0.2464     | **0.9039**    | 26.78      |
| MolGen-7B         | **1.000**  | **1.000**       | 0.934      | **0.9999**  | 0.6538      | 0.5138     | 0.8617        | **0.0435** |
| GP-MoLFormer-Uniq | **1.000**  | 0.977           | 0.390      | **0.9998**  | 0.7383      | 0.5045     | 0.8655        | 0.0591     |

We report all metrics using the typical MOSES definitions on each model's respective test set. Note that novelty is measured with respect to each model's own training set.

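To make the first three columns concrete, here is a minimal sketch of validity, uniqueness@k, and novelty in the MOSES style. It is an illustration under assumptions rather than the evaluation code behind the table: the function names are hypothetical, and canonicalization is delegated to a caller-supplied function that returns `None` for strings that fail to parse (in practice this would typically wrap RDKit). The toy lookup table at the bottom stands in for a real canonicalizer.

```python
from typing import Callable, Iterable, List, Optional

Canonicalizer = Callable[[str], Optional[str]]


def valid_canonical(generated: Iterable[str], canonicalize: Canonicalizer) -> List[str]:
    """Canonical forms of the generated strings that parse, in generation order."""
    canonical = (canonicalize(s) for s in generated)
    return [c for c in canonical if c is not None]


def validity(generated: List[str], canonicalize: Canonicalizer) -> float:
    """Fraction of generated strings that parse as molecules."""
    if not generated:
        return 0.0
    return len(valid_canonical(generated, canonicalize)) / len(generated)


def unique_at_k(generated: List[str], canonicalize: Canonicalizer, k: int) -> float:
    """Fraction of distinct molecules among the first k valid generations."""
    first_k = valid_canonical(generated, canonicalize)[:k]
    if not first_k:
        return 0.0
    return len(set(first_k)) / len(first_k)


def novelty(generated: List[str], train_canonical: Iterable[str],
            canonicalize: Canonicalizer) -> float:
    """Fraction of distinct valid generations absent from the training set."""
    gen_set = set(valid_canonical(generated, canonicalize))
    if not gen_set:
        return 0.0
    return len(gen_set - set(train_canonical)) / len(gen_set)


# Toy stand-in for a real canonicalizer (assumption, for illustration only).
TOY_CANONICAL = {"CCO": "CCO", "OCC": "CCO", "c1ccccc1": "c1ccccc1", "CCN": "CCN"}


def toy_canonicalize(smiles: str) -> Optional[str]:
    return TOY_CANONICAL.get(smiles)


generated = ["CCO", "OCC", "c1ccccc1", "C(", "CCN"]
print(validity(generated, toy_canonicalize))         # 4 of 5 strings parse
print(unique_at_k(generated, toy_canonicalize, 10))  # 3 distinct among 4 valid
print(novelty(generated, {"CCO"}, toy_canonicalize)) # 2 of 3 distinct are unseen
```

Because novelty is computed against the training set, a model trained on a very large corpus can score low on it simply because more of chemical space has already been seen.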
## Citation

```
@misc{ross2025gpmolformerfoundationmodelmolecular,
  title={GP-MoLFormer: A Foundation Model For Molecular Generation},
  author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Jiri Navratil and Youssef Mroueh and Payel Das},
  year={2025},
  eprint={2405.04912},
  archivePrefix={arXiv},
  primaryClass={q-bio.BM},
  url={https://arxiv.org/abs/2405.04912},
}
```