GP-MoLFormer-Uniq
GP-MoLFormer is a class of models pretrained on SMILES string representations of 0.65B to 1.1B molecules from the ZINC and PubChem datasets. This repository contains the variant pretrained on all of the unique molecules from both datasets.
It was introduced in the paper GP-MoLFormer: A Foundation Model For Molecular Generation by Ross et al. and released in this repository.
Model Details
Model Description
GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.
GP-MoLFormer was evaluated on de novo generation (at scale), scaffold-constrained decoration, and molecular property optimization tasks.
Intended use and limitations
The pretrained model may be used out of the box for unconditional, de novo molecule generation. It can also be prompted with a partial SMILES string to perform scaffold completion/decoration (a prompting sketch is included under Example code below). We also demonstrate that it can be fine-tuned on a particular dataset to shift the output distribution (e.g., toward more drug-like molecules) or tuned for molecular property optimization using pair-tuning. For details, see the paper and the GitHub repository.
This model has not been tested for classification performance. It has also not been tested on molecules with more than ~200 atoms (i.e., macromolecules). Furthermore, using invalid or noncanonical SMILES may result in worse performance.
Example code
Use the code below to get started with the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained decoder-only model; it shares its tokenizer with MoLFormer-XL.
model = AutoModelForCausalLM.from_pretrained("ibm-research/GP-MoLFormer-Uniq", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True)

# With no prompt, generation starts from the BOS token and samples molecules unconditionally.
outputs = model.generate(do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
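The model can also be prompted with a partial SMILES string for scaffold completion/decoration, as noted under intended use. The sketch below illustrates this; the prompt is only a placeholder, and stripping a trailing end-of-sequence token before generation is an assumption about the tokenizer's behavior, so verify it against your tokenizer's output.

# Scaffold completion/decoration: prompt with a partial SMILES and sample continuations.
prompt = "c1ccccc1"  # hypothetical scaffold prompt; replace with your own partial SMILES
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Assumption: the tokenizer appends an end-of-sequence token, which must be dropped before generating.
if tokenizer.eos_token_id is not None and input_ids[0, -1].item() == tokenizer.eos_token_id:
    input_ids = input_ids[:, :-1]
completions = model.generate(input_ids, do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
print(tokenizer.batch_decode(completions, skip_special_tokens=True))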
Training Details
Data
We trained GP-MoLFormer on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on all unique molecules from both datasets.
Molecules were canonicalized with RDKit prior to training, and isomeric information was removed. Molecules longer than 202 tokens were also dropped.
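A minimal sketch of this preprocessing is shown below, assuming RDKit's standard canonicalization; the exact flags used by the authors and the tokenizer behind the 202-token length filter are not specified here, so plain string length is used as a rough stand-in.

from rdkit import Chem

def preprocess(smiles, max_len=202):
    # Canonicalize with RDKit and drop isomeric (stereochemistry/isotope) information.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # skip unparsable SMILES
    canonical = Chem.MolToSmiles(mol, isomericSmiles=False, canonical=True)
    # The reported filter was on tokenized length (202 tokens); character length
    # is used here only as an approximation.
    return canonical if len(canonical) <= max_len else None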
Hardware
- 16 x NVIDIA A100 80GB GPUs
Evaluation
We evaluated GP-MoLFormer on various generation metrics. The table below shows the performance of GP-MoLFormer-Uniq compared to baseline models:
| Model | Val↑ | Uniq@10k↑ | Nov↑ | Frag↑ | Scaf↑ | SNN↑ | IntDiv↑ | FCD↓ |
|---|---|---|---|---|---|---|---|---|
| CharRNN | 0.975 | 0.999 | 0.842 | 0.9998 | 0.9242 | 0.6015 | 0.8562 | 0.0732 |
| VAE | 0.977 | 0.998 | 0.695 | 0.9984 | 0.9386 | 0.6257 | 0.8558 | 0.0990 |
| JT-VAE | 1.000 | 1.000 | 0.914 | 0.9965 | 0.8964 | 0.5477 | 0.8551 | 0.3954 |
| LIMO | 1.000 | 0.976 | 1.000 | 0.6989 | 0.0079 | 0.2464 | 0.9039 | 26.78 |
| MolGen-7B | 1.000 | 1.000 | 0.934 | 0.9999 | 0.6538 | 0.5138 | 0.8617 | 0.0435 |
| GP-MoLFormer-Uniq | 1.000 | 0.977 | 0.390 | 0.9998 | 0.7383 | 0.5045 | 0.8655 | 0.0591 |
We report all metrics using the standard MOSES definitions, computed on each model's respective test set. Note: novelty is measured with respect to each model's respective training set.
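As a reference point, the sketch below shows how the simplest of these metrics (validity, uniqueness@10k, novelty) can be approximated with RDKit; it follows the spirit of the MOSES definitions rather than reproducing them exactly, and the remaining metrics (Frag, Scaf, SNN, IntDiv, FCD) come from the MOSES benchmarking suite.

from rdkit import Chem

def canonical_or_none(smi):
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None

def basic_generation_metrics(generated, train_smiles, k=10_000):
    # Validity: fraction of generated strings that RDKit can parse.
    canon = [canonical_or_none(s) for s in generated]
    valid = [c for c in canon if c is not None]
    validity = len(valid) / len(generated)
    # Uniqueness@k: fraction of distinct molecules among the first k valid generations.
    uniqueness = len(set(valid[:k])) / max(len(valid[:k]), 1)
    # Novelty: fraction of unique valid molecules absent from the training set.
    unique_valid = set(valid)
    novelty = len(unique_valid - set(train_smiles)) / max(len(unique_valid), 1)
    return {"Val": validity, "Uniq@10k": uniqueness, "Nov": novelty}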
Citation
@misc{ross2025gpmolformerfoundationmodelmolecular,
      title={GP-MoLFormer: A Foundation Model For Molecular Generation},
      author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Jiri Navratil and Youssef Mroueh and Payel Das},
      year={2025},
      eprint={2405.04912},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM},
      url={https://arxiv.org/abs/2405.04912},
}