shoffman committed on
Commit 6eca879 · verified · 1 Parent(s): 2054227

Update README.md

Files changed (1)
  1. README.md +84 -3
README.md CHANGED
@@ -1,3 +1,84 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - chemistry
+ ---
+
+ # GP-MoLFormer-Uniq
+
+ GP-MoLFormer is a class of models pretrained on SMILES string representations of 0.65-1.1B molecules from ZINC and PubChem.
+ This repository is for the model pretrained on all the _unique_ molecules from both datasets.
+
+ It was introduced in the paper [GP-MoLFormer: A Foundation Model For Molecular Generation](https://arxiv.org/abs/2405.04912) by Ross et al. and released in [this repository](https://github.com/IBM/gp-molformer).
+
+ ## Model Details
+
+ ### Model Description
+
+ GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.
+
+ GP-MoLFormer was evaluated on _de novo_ generation (*at scale*), scaffold-constrained decoration, and molecular property optimization tasks.
+
+ ## Intended use and limitations
+
+ The pretrained model may be used out-of-the-box for unconditional, _de novo_ molecule generation. It can also be prompted with a partial SMILES string to do scaffold completion/decoration. We also demonstrate it can be fine-tuned on a particular dataset to change the output distribution (e.g., more druglike) or tuned for molecular optimization using **pair-tuning**. For details, see the paper and GitHub repository.
+
+ This model is not tested for classification performance. It is also not tested for molecules larger than ~200 atoms (i.e., macromolecules). Furthermore, using invalid or noncanonical SMILES may result in worse performance.
+
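+ ## Example code
+
+ Use the code below to get started with the model.
+
+ ```py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # The model repository ships custom modeling code, hence trust_remote_code=True.
+ model = AutoModelForCausalLM.from_pretrained("ibm-research/GP-MoLFormer-Uniq", trust_remote_code=True)
+ # The tokenizer is loaded from the MoLFormer-XL repository.
+ tokenizer = AutoTokenizer.from_pretrained("ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True)
+
+ # No prompt is given, so this samples three molecules unconditionally.
+ # top_k=None disables top-k filtering; max_length=202 matches the training cut-off.
+ outputs = model.generate(do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
+ # Decode the generated token IDs back into SMILES strings.
+ tokenizer.batch_decode(outputs, skip_special_tokens=True)
+ ```
+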
+ ## Training Details
+
+ ### Data
+
+ We trained GP-MoLFormer on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on all _unique_ molecules from both datasets.
+
+ Molecules were canonicalized with RDKit prior to training and isomeric information was removed. Also, molecules longer than 202 tokens were dropped.
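
The length cut-off described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code. The regular expression is the atom-level SMILES tokenization pattern commonly used for chemical language models; that it matches this model's tokenizer exactly is an assumption, as is the choice not to count special tokens toward the limit. The helper names `tokenize_smiles` and `filter_by_length` are hypothetical. Canonicalization and removal of isomeric information are assumed to have been done beforehand.

```python
import re
from typing import Iterable, List

# Atom-level SMILES tokenization pattern widely used for chemical language
# models. ASSUMPTION: it approximates the tokenizer used for this model;
# it is not taken from the model card.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

# Cut-off stated in the model card. ASSUMPTION: special tokens such as
# begin/end-of-sequence markers are not counted here.
MAX_TOKENS = 202


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


def filter_by_length(smiles_list: Iterable[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Keep only molecules whose token count does not exceed max_tokens."""
    return [s for s in smiles_list if len(tokenize_smiles(s)) <= max_tokens]
```

Counting tokens rather than characters matters because multi-character units such as `Cl`, `Br`, or a bracket atom like `[Na+]` each count as a single token.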
+
+ ### Hardware
+
+ - 16 x NVIDIA A100 80GB GPUs
+
+ ## Evaluation
+
+ We evaluated GP-MoLFormer on various generation metrics. The table below shows the performance of GP-MoLFormer-Uniq compared to baseline models:
+
+ | Model             | Val↑       | Uniq@10k↑       | Nov↑       | Frag↑       | Scaf↑       | SNN↑       | IntDiv↑       | FCD↓       |
+ |-------------------|------------|-----------------|------------|-------------|-------------|------------|---------------|------------|
+ | CharRNN           | 0.975      | 0.999           | 0.842      | **0.9998**  | 0.9242      | 0.6015     | 0.8562        | 0.0732     |
+ | VAE               | 0.977      | 0.998           | 0.695      | 0.9984      | **0.9386**  | **0.6257** | 0.8558        | 0.0990     |
+ | JT-VAE            | **1.000**  | **1.000**       | 0.914      | 0.9965      | 0.8964      | 0.5477     | 0.8551        | 0.3954     |
+ | LIMO              | **1.000**  | 0.976           | **1.000**  | 0.6989      | 0.0079      | 0.2464     | **0.9039**    | 26.78      |
+ | MolGen-7B         | **1.000**  | **1.000**       | 0.934      | **0.9999**  | 0.6538      | 0.5138     | 0.8617        | **0.0435** |
+ | GP-MoLFormer-Uniq | **1.000**  | 0.977           | 0.390      | **0.9998**  | 0.7383      | 0.5045     | 0.8655        | 0.0591     |
+
+ We report all metrics using the typical MOSES definitions on each model's respective test set. Note: novelty is with respect to each model's respective training set.
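
For orientation, the three simplest metrics in the table (Val, Uniq@10k, Nov) can be sketched as below. This is a rough illustration of the usual MOSES-style definitions, not the evaluation code behind the reported numbers. The function names are hypothetical, inputs are assumed to be canonical SMILES already, and validity is delegated to a caller-supplied `is_valid` predicate (in practice typically an RDKit parse check) so that the sketch stays dependency-free.

```python
from typing import Callable, Iterable, Sequence


def validity(generated: Sequence[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of generated strings accepted by the validity predicate.

    Assumes `generated` is non-empty.
    """
    return sum(1 for s in generated if is_valid(s)) / len(generated)


def uniqueness_at_k(valid_smiles: Sequence[str], k: int = 10_000) -> float:
    """Fraction of distinct molecules among the first k valid ones.

    Assumes `valid_smiles` is non-empty and already canonical, so that
    string equality corresponds to molecule identity.
    """
    head = valid_smiles[:k]
    return len(set(head)) / len(head)


def novelty(valid_smiles: Iterable[str], training_smiles: Iterable[str]) -> float:
    """Fraction of distinct valid molecules absent from the training set.

    Assumes at least one valid molecule and canonical SMILES on both sides.
    """
    unique = set(valid_smiles)
    return len(unique - set(training_smiles)) / len(unique)
```

Because novelty is measured against the training set, a model trained on a very large corpus has more opportunities to reproduce a training molecule, so novelty values are not directly comparable across models trained on sets of different sizes.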
+
+ ## Citation
+
+ ```
+ @misc{ross2025gpmolformerfoundationmodelmolecular,
+ title={GP-MoLFormer: A Foundation Model For Molecular Generation},
+ author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Jiri Navratil and Youssef Mroueh and Payel Das},
+ year={2025},
+ eprint={2405.04912},
+ archivePrefix={arXiv},
+ primaryClass={q-bio.BM},
+ url={https://arxiv.org/abs/2405.04912},
+ }
+ ```