Improving Chemical Understanding of LLMs via SMILES Parsing
Abstract
CLEANMOL is a framework that improves the structural comprehension of large language models in molecular science by formulating SMILES parsing as a set of structured tasks. Models pre-trained with it match or exceed baselines on Mol-Instructions.
Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, which are commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES and fail even at basic tasks such as counting the rings in a molecule. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing as a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also either achieves the best performance or remains competitive with the baselines on the Mol-Instructions benchmark.
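To make the idea of a "clean and deterministic" task concrete, here is a minimal sketch of how a ring-counting label could be computed directly from a SMILES string. It is an illustration only, not the authors' pipeline: the helper name `count_rings` is hypothetical, and the sketch assumes a syntactically valid SMILES string. It relies on the fact that each ring is opened and closed by a matching pair of ring-closure labels, so the number of pairs equals the number of independent cycles in the molecular graph.

```python
def count_rings(smiles: str) -> int:
    """Count rings in a SMILES string via its ring-closure labels.

    Hypothetical helper for illustration; assumes valid SMILES input.
    The result is the number of independent cycles in the molecular graph.
    """
    closures = 0
    i = 0
    n = len(smiles)
    while i < n:
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside them are isotopes,
            # hydrogen counts, or charges, not ring closures.
            end = smiles.find("]", i)
            if end == -1:
                raise ValueError("unclosed bracket atom")
            i = end + 1
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not all(c in "0123456789" for c in label):
                raise ValueError("malformed two-digit ring-closure label")
            closures += 1
            i += 3
        elif ch in "0123456789":
            # Single-digit ring-closure label.
            closures += 1
            i += 1
        else:
            i += 1
    if closures % 2 != 0:
        raise ValueError("unmatched ring-closure label")
    # Each ring contributes one opening and one closing label.
    return closures // 2


if __name__ == "__main__":
    print(count_rings("CCO"))             # ethanol -> 0
    print(count_rings("c1ccccc1"))        # benzene -> 1
    print(count_rings("c1ccc2ccccc2c1"))  # naphthalene -> 2
```

Because the answer follows mechanically from the string, a label of this kind can be generated at scale without human annotation, which is the property the abstract describes as deterministic supervision. A production pipeline would more likely use a cheminformatics toolkit to validate the input and handle edge cases.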
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization (2025)
- ChemMLLM: Chemical Multimodal Large Language Model (2025)
- Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning (2025)
- MolGround: A Benchmark for Molecular Grounding (2025)
- Leveraging Large Language Models for enzymatic reaction prediction and characterization (2025)
- Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework (2025)
- Benchmarking Retrieval-Augmented Generation for Chemistry (2025)