new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Mar 18

MeLM, a generative pretrained language modeling framework that solves forward and inverse mechanics problems

We report a flexible multi-modal mechanics language model, MeLM, applied to solve various nonlinear forward and inverse problems, that can deal with a set of instructions, numbers and microstructure data. The framework is applied to various examples including bio-inspired hierarchical honeycomb design, carbon nanotube mechanics, and protein unfolding. In spite of the flexible nature of the model-which allows us to easily incorporate diverse materials, scales, and mechanical features-it performs well across disparate forward and inverse tasks. Based on an autoregressive attention-model, MeLM effectively represents a large multi-particle system consisting of hundreds of millions of neurons, where the interaction potentials are discovered through graph-forming self-attention mechanisms that are then used to identify relationships from emergent structures, while taking advantage of synergies discovered in the training data. We show that the model can solve complex degenerate mechanics design problems and determine novel material architectures across a range of hierarchical levels, providing an avenue for materials discovery and analysis. Looking beyond the demonstrations reported in this paper, we discuss other opportunities in applied mechanics and general considerations about the use of large language models in modeling, design, and analysis that can span a broad spectrum of material properties from mechanical, thermal, optical, to electronic.

SELFormer: Molecular Representation Learning via SELFIES Language Models

Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. Majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that, SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notations in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.

ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning

Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, limiting their flexibility when incorporating out-of-domain knowledge into the design process or comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures and obtaining new first-principles data -- natural vibrational frequencies -- via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on one hand, and their capacity in autonomous collaboration through the dynamic LLM-based multi-agent environment on the other hand, unleashes great potentials of LLMs in addressing multi-objective materials problems and opens up new avenues for autonomous materials discovery and design.

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive further advancements in AI-assisted materials science.

SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning

A key challenge in artificial intelligence is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, we present SciAgents, an approach that leverages three core concepts: (1) the use of large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses traditional human-driven research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the intelligent system yields material discoveries, critique and improve existing hypotheses, retrieve up-to-date data about existing research, and highlights their strengths and limitations. Our case studies demonstrate scalable capabilities to combine generative AI, ontological representations, and multi-agent modeling, harnessing a `swarm of intelligence' similar to biological systems. This provides new avenues for materials discovery and accelerates the development of advanced materials by unlocking Nature's design principles.

Matbench Discovery -- An evaluation framework for machine learning crystal stability prediction

Matbench Discovery simulates the deployment of machine learning (ML) energy models in a high-throughput search for stable inorganic crystals. We address the disconnect between (i) thermodynamic stability and formation energy and (ii) in-domain vs out-of-distribution performance. Alongside this paper, we publish a Python package to aid with future model submissions and a growing online leaderboard with further insights into trade-offs between various performance metrics. To answer the question which ML methodology performs best at materials discovery, our initial release explores a variety of models including random forests, graph neural networks (GNN), one-shot predictors, iterative Bayesian optimizers and universal interatomic potentials (UIP). Ranked best-to-worst by their test set F1 score on thermodynamic stability prediction, we find CHGNet > M3GNet > MACE > ALIGNN > MEGNet > CGCNN > CGCNN+P > Wrenformer > BOWSR > Voronoi tessellation fingerprints with random forest. The top 3 models are UIPs, the winning methodology for ML-guided materials discovery, achieving F1 scores of ~0.6 for crystal stability classification and discovery acceleration factors (DAF) of up to 5x on the first 10k most stable predictions compared to dummy selection from our test set. We also highlight a sharp disconnect between commonly used global regression metrics and more task-relevant classification metrics. Accurate regressors are susceptible to unexpectedly high false-positive rates if those accurate predictions lie close to the decision boundary at 0 eV/atom above the convex hull where most materials are. Our results highlight the need to focus on classification metrics that actually correlate with improved stability hit rate.

Synergistic Fusion of Multi-Source Knowledge via Evidence Theory for High-Entropy Alloy Discovery

Discovering novel high-entropy alloys (HEAs) with desirable properties is challenging due to the vast compositional space and complex phase formation mechanisms. Efficient exploration of this space requires a strategic approach that integrates heterogeneous knowledge sources. Here, we propose a framework that systematically combines knowledge extracted from computational material datasets with domain knowledge distilled from scientific literature using large language models (LLMs). A central feature of this approach is the explicit consideration of element substitutability, identifying chemically similar elements that can be interchanged to potentially stabilize desired HEAs. Dempster-Shafer theory, a mathematical framework for reasoning under uncertainty, is employed to model and combine substitutabilities based on aggregated evidence from multiple sources. The framework predicts the phase stability of candidate HEA compositions and is systematically evaluated on both quaternary alloy systems, demonstrating superior performance compared to baseline machine learning models and methods reliant on single-source evidence in cross-validation experiments. By leveraging multi-source knowledge, the framework retains robust predictive power even when key elements are absent from the training data, underscoring its potential for knowledge transfer and extrapolation. Furthermore, the enhanced interpretability of the methodology offers insights into the fundamental factors governing HEA formation. Overall, this work provides a promising strategy for accelerating HEA discovery by integrating computational and textual knowledge sources, enabling efficient exploration of vast compositional spaces with improved generalization and interpretability.

An inorganic ABX3 perovskite materials dataset for target property prediction and classification using machine learning

The reliability with Machine Learning (ML) techniques in novel materials discovery often depend on the quality of the dataset, in addition to the relevant features used in describing the material. In this regard, the current study presents and validates a newly processed materials dataset that can be utilized for benchmark ML analysis, as it relates to the prediction and classification of deterministic target properties. Originally, the dataset was extracted from the Open Quantum Materials Database (OQMD) and contains a robust 16,323 samples of ABX3 inorganic perovskite structures. The dataset is tabular in form and is preprocessed to include sixty-one generalized input features that broadly describes the physicochemical, stability/geometrical, and Density Functional Theory (DFT) target properties associated with the elemental ionic sites in a three-dimensional ABX3 polyhedral. For validation, four different ML models are employed to predict three distinctive target properties, namely: formation energy, energy band gap, and crystal system. On experimentation, the best accuracy measurements are reported at 0.013 eV/atom MAE, 0.216 eV MAE, and 85% F1, corresponding to the formation energy prediction, band gap prediction and crystal system multi-classification, respectively. Moreover, the realized results are compared with previous literature and as such, affirms the resourcefulness of the current dataset for future benchmark materials analysis via ML techniques. The preprocessed dataset and source codes are openly available to download from github.com/chenebuah/ML_abx3_dataset.

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

This study is dedicated to assessing the capabilities of large language models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in extracting structured information from scientific documents in materials science. To this end, we primarily focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities. Due to the evident lack of datasets within Materials Informatics (MI), we evaluated using SuperMat, based on superconductor research, and MeasEval, a generic measurement evaluation corpus. The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline). We introduce a novel methodology for the comparative analysis of intricate material expressions, emphasising the standardisation of chemical formulas to tackle the complexities inherent in materials science information assessment. For NER, LLMs fail to outperform the baseline with zero-shot prompting and exhibit only limited improvement with few-shot prompting. However, a GPT-3.5-Turbo fine-tuned with the appropriate strategy for RE outperforms all models, including the baseline. Without any fine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and relationship extraction capabilities after being provided with merely a couple of examples, surpassing the baseline. Overall, the results suggest that although LLMs demonstrate relevant reasoning skills in connecting concepts, specialised models are currently a better choice for tasks requiring extracting complex domain-specific entities like materials. These insights provide initial guidance applicable to other materials science sub-domains in future work.

MatText: Do Language Models Need More than Text & Scale for Materials Modeling?

Effectively representing materials as text has the potential to leverage the vast advancements of large language models (LLMs) for discovering new materials. While LLMs have shown remarkable success in various domains, their application to materials science remains underexplored. A fundamental challenge is the lack of understanding of how to best utilize text-based representations for materials modeling. This challenge is further compounded by the absence of a comprehensive benchmark to rigorously evaluate the capabilities and limitations of these text representations in capturing the complexity of material systems. To address this gap, we propose MatText, a suite of benchmarking tools and datasets designed to systematically evaluate the performance of language models in modeling materials. MatText encompasses nine distinct text-based representations for material systems, including several novel representations. Each representation incorporates unique inductive biases that capture relevant information and integrate prior physical knowledge about materials. Additionally, MatText provides essential tools for training and benchmarking the performance of language models in the context of materials science. These tools include standardized dataset splits for each representation, probes for evaluating sensitivity to geometric factors, and tools for seamlessly converting crystal structures into text. Using MatText, we conduct an extensive analysis of the capabilities of language models in modeling materials. Our findings reveal that current language models consistently struggle to capture the geometric information crucial for materials modeling across all representations. Instead, these models tend to leverage local information, which is emphasized in some of our novel representations. Our analysis underscores MatText's ability to reveal shortcomings of text-based methods for materials design.

The Open Catalyst 2020 (OC20) Dataset and Community Challenges

Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, Dimenet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leader board to encourage community contributions to solve these important tasks.

Crystal Structure Generation with Autoregressive Large Language Modeling

The generation of plausible crystal structures is often the first step in predicting the structure and properties of a material from its chemical composition. Quickly generating and predicting inorganic crystal structures is important for the discovery of new materials, which can target applications such as energy or electronic devices. However, most current methods for crystal structure prediction are computationally expensive, slowing the pace of innovation. Seeding structure prediction algorithms with quality generated candidates can overcome a major bottleneck. Here, we introduce CrystaLLM, a methodology for the versatile generation of crystal structures, based on the autoregressive large language modeling (LLM) of the Crystallographic Information File (CIF) format. Trained on millions of CIF files, CrystaLLM focuses on modeling crystal structures through text. CrystaLLM can produce plausible crystal structures for a wide range of inorganic compounds unseen in training, as demonstrated by ab initio simulations. The integration with predictors of formation energy permits the use of a Monte Carlo Tree Search algorithm to improve the generation of meaningful structures. Our approach challenges conventional representations of crystals, and demonstrates the potential of LLMs for learning effective 'world models' of crystal chemistry, which will lead to accelerated discovery and innovation in materials science.

MatterGen: a generative model for inorganic materials design

The design of functional materials with desired properties is essential in driving technological advances in areas like energy storage, catalysis, and carbon capture. Generative models provide a new paradigm for materials design by directly generating entirely novel materials given desired property constraints. Despite recent progress, current generative models have low success rate in proposing stable crystals, or can only satisfy a very limited set of property constraints. Here, we present MatterGen, a model that generates stable, diverse inorganic materials across the periodic table and can further be fine-tuned to steer the generation towards a broad range of property constraints. To enable this, we introduce a new diffusion-based generative process that produces crystalline structures by gradually refining atom types, coordinates, and the periodic lattice. We further introduce adapter modules to enable fine-tuning towards any given property constraints with a labeled dataset. Compared to prior generative models, structures produced by MatterGen are more than twice as likely to be novel and stable, and more than 15 times closer to the local energy minimum. After fine-tuning, MatterGen successfully generates stable, novel materials with desired chemistry, symmetry, as well as mechanical, electronic and magnetic properties. Finally, we demonstrate multi-property materials design capabilities by proposing structures that have both high magnetic density and a chemical composition with low supply-chain risk. We believe that the quality of generated materials and the breadth of MatterGen's capabilities represent a major advancement towards creating a universal generative model for materials design.

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.

Scalable Diffusion for Materials Generation

Generative models trained on internet-scale data are capable of generating novel and realistic texts, images, and videos. A natural next question is whether these models can advance science, for example by generating novel stable materials. Traditionally, models with explicit structures (e.g., graphs) have been used in modeling structural relationships in scientific data (e.g., atoms and bonds in crystals), but generating structures can be difficult to scale to large and complex systems. Another challenge in generating materials is the mismatch between standard generative modeling metrics and downstream applications. For instance, common metrics such as the reconstruction error do not correlate well with the downstream goal of discovering stable materials. In this work, we tackle the scalability challenge by developing a unified crystal representation that can represent any crystal structure (UniMat), followed by training a diffusion probabilistic model on these UniMat representations. Our empirical results suggest that despite the lack of explicit structure modeling, UniMat can generate high fidelity crystal structures from larger and more complex chemical systems, outperforming previous graph-based approaches under various generative modeling metrics. To better connect the generation quality of materials to downstream applications, such as discovering novel stable materials, we propose additional metrics for evaluating generative models of materials, including per-composition formation energy and stability with respect to convex hulls through decomposition energy from Density Function Theory (DFT). Lastly, we show that conditional generation with UniMat can scale to previously established crystal datasets with up to millions of crystals structures, outperforming random structure search (the current leading method for structure discovery) in discovering new stable materials.

BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials

The study of biological materials and bio-inspired materials science is well established; however, surprisingly little knowledge has been systematically translated to engineering solutions. To accelerate discovery and guide insights, an open-source autoregressive transformer large language model (LLM), BioinspiredLLM, is reported. The model was finetuned with a corpus of over a thousand peer-reviewed articles in the field of structural biological and bio-inspired materials and can be prompted to recall information, assist with research tasks, and function as an engine for creativity. The model has proven that it is able to accurately recall information about biological materials and is further enhanced with enhanced reasoning ability, as well as with retrieval-augmented generation to incorporate new data during generation that can also help to traceback sources, update the knowledge base, and connect knowledge domains. BioinspiredLLM also has been shown to develop sound hypotheses regarding biological materials design and remarkably so for materials that have never been explicitly studied before. Lastly, the model showed impressive promise in collaborating with other generative artificial intelligence models in a workflow that can reshape the traditional materials design process. This collaborative generative artificial intelligence method can stimulate and enhance bio-inspired materials design workflows. Biological materials are at a critical intersection of multiple scientific fields and models like BioinspiredLLM help to connect knowledge domains.

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Scientific discovery contributes largely to human society's prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can be resulted from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can lead to hypothesis; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or a similar level in 2024 (all papers are only available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus consisting the ground truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity with the ground truth ones, covering the main innovations.

Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.

MatterGPT: A Generative Transformer for Multi-Property Inverse Design of Solid-State Materials

Inverse design of solid-state materials with desired properties represents a formidable challenge in materials science. Although recent generative models have demonstrated potential, their adoption has been hindered by limitations such as inefficiency, architectural constraints and restricted open-source availability. The representation of crystal structures using the SLICES (Simplified Line-Input Crystal-Encoding System) notation as a string of characters enables the use of state-of-the-art natural language processing models, such as Transformers, for crystal design. Drawing inspiration from the success of GPT models in generating coherent text, we trained a generative Transformer on the next-token prediction task to generate solid-state materials with targeted properties. We demonstrate MatterGPT's capability to generate de novo crystal structures with targeted single properties, including both lattice-insensitive (formation energy) and lattice-sensitive (band gap) properties. Furthermore, we extend MatterGPT to simultaneously target multiple properties, addressing the complex challenge of multi-objective inverse design of crystals. Our approach showcases high validity, uniqueness, and novelty in generated structures, as well as the ability to generate materials with properties beyond the training data distribution. This work represents a significant step forward in computational materials discovery, offering a powerful and open tool for designing materials with tailored properties for various applications in energy, electronics, and beyond.

MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling

We present MatSci-NLP, a natural language benchmark for evaluating the performance of natural language processing (NLP) models on materials science text. We construct the benchmark from publicly available materials science text data to encompass seven different NLP tasks, including conventional NLP tasks like named entity recognition and relation classification, as well as NLP tasks specific to materials science, such as synthesis action retrieval which relates to creating synthesis procedures for materials. We study various BERT-based models pretrained on different scientific text corpora on MatSci-NLP to understand the impact of pretraining strategies on understanding materials science text. Given the scarcity of high-quality annotated data in the materials science domain, we perform our fine-tuning experiments with limited training data to encourage the generalize across MatSci-NLP tasks. Our experiments in this low-resource training setting show that language models pretrained on scientific text outperform BERT trained on general text. MatBERT, a model pretrained specifically on materials science journals, generally performs best for most tasks. Moreover, we propose a unified text-to-schema for multitask learning on \benchmark and compare its performance with traditional fine-tuning methods. In our analysis of different training methods, we find that our proposed text-to-schema methods inspired by question-answering consistently outperform single and multitask NLP fine-tuning methods. The code and datasets are publicly available at https://github.com/BangLab-UdeM-Mila/NLP4MatSci-ACL23.

Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning

Leveraging generative Artificial Intelligence (AI), we have transformed a dataset comprising 1,000 scientific papers into an ontological knowledge graph. Through an in-depth structural analysis, we have calculated node degrees, identified communities and connectivities, and evaluated clustering coefficients and betweenness centrality of pivotal nodes, uncovering fascinating knowledge architectures. The graph has an inherently scale-free nature, is highly connected, and can be used for graph reasoning by taking advantage of transitive and isomorphic properties that reveal unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, propose never-before-seen material designs, and predict material behaviors. We compute deep node embeddings for combinatorial node similarity ranking for use in a path sampling strategy links dissimilar concepts that have previously not been related. One comparison revealed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. In another example, the algorithm proposed a hierarchical mycelium-based composite based on integrating path sampling with principles extracted from Kandinsky's 'Composition VII' painting. The resulting material integrates an innovative set of concepts that include a balance of chaos/order, adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across science, technology and art, revealing a nuanced ontology of immanence that reveal a context-dependent heterarchical interplay of constituents. Graph-based generative AI achieves a far higher degree of novelty, explorative capacity, and technical detail, than conventional approaches and establishes a widely useful framework for innovation by revealing hidden connections.

S2SNet: A Pretrained Neural Network for Superconductivity Discovery

Superconductivity allows electrical current to flow without any energy loss, and thus making solids superconducting is a grand goal of physics, material science, and electrical engineering. More than 16 Nobel Laureates have been awarded for their contribution to superconductivity research. Superconductors are valuable for sustainable development goals (SDGs), such as climate change mitigation, affordable and clean energy, industry, innovation and infrastructure, and so on. However, a unified physics theory explaining all superconductivity mechanism is still unknown. It is believed that superconductivity is microscopically due to not only molecular compositions but also the geometric crystal structure. Hence a new dataset, S2S, containing both crystal structures and superconducting critical temperature, is built upon SuperCon and Material Project. Based on this new dataset, we propose a novel model, S2SNet, which utilizes the attention mechanism for superconductivity prediction. To overcome the shortage of data, S2SNet is pre-trained on the whole Material Project dataset with Masked-Language Modeling (MLM). S2SNet makes a new state-of-the-art, with out-of-sample accuracy of 92% and Area Under Curve (AUC) of 0.92. To the best of our knowledge, S2SNet is the first work to predict superconductivity with only information of crystal structures. This work is beneficial to superconductivity discovery and further SDGs. Code and datasets are available in https://github.com/zjuKeLiu/S2SNet

LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation

Reducing hallucination of Large Language Models (LLMs) is imperative for use in the sciences where reproducibility is crucial. However, LLMs inherently lack long-term memory, making it a nontrivial, ad hoc, and inevitably biased task to fine-tune them on domain-specific literature and data. Here we introduce LLaMP, a multimodal retrieval-augmented generation (RAG) framework of multiple data-aware reasoning-and-acting (ReAct) agents that dynamically interact with computational and experimental data on Materials Project (MP). Without fine-tuning, LLaMP demonstrates an ability to comprehend and integrate various modalities of materials science concepts, fetch relevant data stores on the fly, process higher-order data (such as crystal structures and elastic tensors), and summarize multi-step procedures for solid-state synthesis. We show that LLaMP effectively corrects errors in GPT-3.5's intrinsic knowledge, reducing a 5.21% MAPE on frequently-documented bandgaps and a significant 1103.54% MAPE on formation energies -- errors that GPT-3.5 seems to derive from mixed data sources. Additionally, LLaMP substantially reduces the hallucinated volumetric strain in a diamond cubic silicon structure from 66.3% to 0. The proposed framework offers an intuitive and nearly hallucination-free approach to exploring materials informatics and establishes a pathway for knowledge distillation and fine-tuning other language models. We envision the framework as a valuable component for scientific hypotheses and a foundation for future autonomous laboratories where multiple LLM agents communicate and cooperate with robotics to drive material synthesis and chemical reactions without hard-coded human logic and intervention.

Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation

The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally matter engineering -- generally denoted as inverse design -- was based massively on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard strings molecular representation SMILES shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100\% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models.

Solar System Elemental Abundances from the Solar Photosphere and CI-Chondrites

Solar photospheric abundances and CI-chondrite compositions are reviewed and updated to obtain representative solar system abundances of the elements and their isotopes. The new photospheric abundances obtained here lead to higher solar metallicity. Full 3D NLTE photospheric analyses are only available for 11 elements. A quality index for analyses is introduced. For several elements, uncertainties remain large. Protosolar mass fractions are H (X = 0.7060), He (Y = 0.2753), and for metals Li to U (Z = 0.0187). The protosolar (C+N)/H agrees within 13% with the ratio for the solar core from the Borexino experiment. Elemental abundances in CI-chondrites were screened by analytical methods, sample sizes, and evaluated using concentration frequency distributions. Aqueously mobile elements (e.g., alkalis, alkaline earths, etc.) often deviate from normal distributions indicating mobilization and/or sequestration into carbonates, phosphates, and sulfates. Revised CI-chondrite abundances of non-volatile elements are similar to earlier estimates. The moderately volatile elements F and Sb are higher than before, as are C, Br and I, whereas the CI-abundances of Hg and N are now significantly lower. The solar system nuclide distribution curves of s-process elements agree within 4% with s-process predictions of Galactic chemical evolution models. P-process nuclide distributions are assessed. No obvious correlation of CI-chondritic to solar elemental abundance ratios with condensation temperatures is observed, nor is there one for ratios of CI-chondrites/solar wind abundances.

Toward quantitative fractography using convolutional neural networks

The science of fractography revolves around the correlation between topographic characteristics of the fracture surface and the mechanisms and external conditions leading to their creation. While being a topic of investigation for centuries, it has remained mostly qualitative to date. A quantitative analysis of fracture surfaces is of prime interest for both the scientific community and the industrial sector, bearing the potential for improved understanding on the mechanisms controlling the fracture process and at the same time assessing the reliability of computational models currently being used for material design. With new advances in the field of image analysis, and specifically with machine learning tools becoming more accessible and reliable, it is now feasible to automate the process of extracting meaningful information from fracture surface images. Here, we propose a method of identifying and quantifying the relative appearance of intergranular and transgranular fracture events from scanning electron microscope images. The newly proposed method is based on a convolutional neural network algorithm for semantic segmentation. The proposed method is extensively tested and evaluated against two ceramic material systems (Al_2O_3,MgAl_2O_4) and shows high prediction accuracy, despite being trained on only one material system (MgAl_2O_4). While here attention is focused on brittle fracture characteristics, the method can be easily extended to account for other fracture morphologies, such as dimples, fatigue striations, etc.

Biomolecular Analysis of Soil Samples and Rock Imagery for Tracing Evidence of Life Using a Mobile Robot

The search for evidence of past life on Mars presents a tremendous challenge that requires the usage of very advanced robotic technologies to overcome it. Current digital microscopic imagers and spectrometers used for astrobiological examination suffer from limitations such as insufficient resolution, narrow detection range, and lack of portability. To overcome these challenges, this research study presents modifications to the Phoenix rover to expand its capability for detecting biosignatures on Mars. This paper examines the modifications implemented on the Phoenix rover to enhance its capability to detect a broader spectrum of biosignatures. One of the notable improvements comprises the integration of advanced digital microscopic imagers and spectrometers, enabling high-resolution examination of soil samples. Additionally, the mechanical components of the device have been reinforced to enhance maneuverability and optimize subsurface sampling capabilities. Empirical investigations have demonstrated that Phoenix has the capability to navigate diverse geological environments and procure samples for the purpose of biomolecular analysis. The biomolecular instrumentation and hybrid analytical methods showcased in this study demonstrate considerable potential for future astrobiology missions on Mars. The potential for enhancing the system lies in the possibility of broadening the range of detectable biomarkers and biosignatures.

Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning

Identifying reliable synthesis pathways in materials chemistry is a complex task, particularly in polymer science, due to the intricate and often non-unique nomenclature of macromolecules. To address this challenge, we propose an agent system that integrates large language models (LLMs) and knowledge graphs (KGs). By leveraging LLMs' powerful capabilities for extracting and recognizing chemical substance names, and storing the extracted data in a structured knowledge graph, our system fully automates the retrieval of relevant literatures, extraction of reaction data, database querying, construction of retrosynthetic pathway trees, further expansion through the retrieval of additional literature and recommendation of optimal reaction pathways. A novel Multi-branched Reaction Pathway Search (MBRPS) algorithm enables the exploration of all pathways, with a particular focus on multi-branched ones, helping LLMs overcome weak reasoning in multi-branched paths. This work represents the first attempt to develop a fully automated retrosynthesis planning agent tailored specially for macromolecules powered by LLMs. Applied to polyimide synthesis, our new approach constructs a retrosynthetic pathway tree with hundreds of pathways and recommends optimized routes, including both known and novel pathways, demonstrating its effectiveness and potential for broader applications.

Generative Hierarchical Materials Search

Generative models trained at scale can now produce text, video, and more recently, scientific data such as crystal structures. In applications of generative approaches to materials science, and in particular to crystal structures, the guidance from the domain expert in the form of high-level instructions can be essential for an automated system to output candidate crystals that are viable for downstream research. In this work, we formulate end-to-end language-to-structure generation as a multi-objective optimization problem, and propose Generative Hierarchical Materials Search (GenMS) for controllable generation of crystal structures. GenMS consists of (1) a language model that takes high-level natural language as input and generates intermediate textual information about a crystal (e.g., chemical formulae), and (2) a diffusion model that takes intermediate information as input and generates low-level continuous value crystal structures. GenMS additionally uses a graph neural network to predict properties (e.g., formation energy) from the generated crystal structures. During inference, GenMS leverages all three components to conduct a forward tree search over the space of possible structures. Experiments show that GenMS outperforms other alternatives of directly using language models to generate structures both in satisfying user request and in generating low-energy structures. We confirm that GenMS is able to generate common crystal structures such as double perovskites, or spinels, solely from natural language input, and hence can form the foundation for more complex structure generation in near future.

Procedural Generation of Grain Orientations using the Wave Function Collapse Algorithm

Statistics of grain sizes and orientations in metals correlate to the material's mechanical properties. Reproducing representative volume elements for further analysis of deformation and failure in metals, like 316L stainless steel, is particularly important due to their wide use in manufacturing goods today. Two approaches, initially created for video games, were considered for the procedural generation of representative grain microstructures. The first is the Wave Function Collapse (WFC) algorithm, and the second is constraint propagation and probabilistic inference through Markov Junior, a free and open-source software. This study aimed to investigate these two algorithms' effectiveness in using reference electron backscatter diffraction (EBSD) maps and recreating a statistically similar one that could be used in further research. It utilized two stainless steel EBSD maps as references to test both algorithms. First, the WFC algorithm was too constricting and, thus, incapable of producing images that resembled EBSDs. The second, MarkovJunior, was much more effective in creating a Voronoi tessellation that could be used to create an EBSD map in Python. When comparing the results between the reference and the generated EBSD, we discovered that the orientation and volume fractions were extremely similar. With the study, it was concluded that MarkovJunior is an effective machine learning tool that can reproduce representative grain microstructures.

CHGNet: Pretrained universal neural network potential for charge-informed atomistic modeling

The simulation of large-scale systems with complex electron interactions remains one of the greatest challenges for the atomistic modeling of materials. Although classical force fields often fail to describe the coupling between electronic states and ionic rearrangements, the more accurate ab-initio molecular dynamics suffers from computational complexity that prevents long-time and large-scale simulations, which are essential to study many technologically relevant phenomena, such as reactions, ion migrations, phase transformations, and degradation. In this work, we present the Crystal Hamiltonian Graph neural Network (CHGNet) as a novel machine-learning interatomic potential (MLIP), using a graph-neural-network-based force field to model a universal potential energy surface. CHGNet is pretrained on the energies, forces, stresses, and magnetic moments from the Materials Project Trajectory Dataset, which consists of over 10 years of density functional theory static and relaxation trajectories of sim 1.5 million inorganic structures. The explicit inclusion of magnetic moments enables CHGNet to learn and accurately represent the orbital occupancy of electrons, enhancing its capability to describe both atomic and electronic degrees of freedom. We demonstrate several applications of CHGNet in solid-state materials, including charge-informed molecular dynamics in Li_xMnO_2, the finite temperature phase diagram for Li_xFePO_4 and Li diffusion in garnet conductors. We critically analyze the significance of including charge information for capturing appropriate chemistry, and we provide new insights into ionic systems with additional electronic degrees of freedom that can not be observed by previous MLIPs.

Unifying Molecular and Textual Representations via Multi-task Language Modelling

The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose a multi-domain, multi-task language model to solve a wide range of tasks in both the chemical and natural language domains. By leveraging multi-task learning, our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increase with scale, as measured by more than a dozen of relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.

Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery

This paper revisits datasets and evaluation criteria for Symbolic Regression, a task of expressing given data using mathematical equations, specifically focused on its potential for scientific discovery. Focused on a set of formulas used in the existing datasets based on Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling range of values so that our new SRSD datasets can be used for evaluating the potential of SRSD such as whether or not an SR method can (re)discover physical laws from such datasets. As an evaluation metric, we also propose to use normalized edit distances between a predicted equation and the ground-truth equation trees. While existing metrics are either binary or errors between the target values and an SR model's predicted values for a given input, normalized edit distances evaluate a sort of similarity between the ground-truth and predicted equation trees. We have conducted experiments on our new SRSD datasets using five state-of-the-art SR methods in SRBench and a simple baseline based on a recent Transformer architecture. The results show that we provide a more realistic performance evaluation and open up a new machine learning-based approach for scientific discovery. Our datasets and code repository are publicly available.

Knowledge Graph in Astronomical Research with Large Language Models: Quantifying Driving Forces in Interdisciplinary Scientific Discovery

Identifying and predicting the factors that contribute to the success of interdisciplinary research is crucial for advancing scientific discovery. However, there is a lack of methods to quantify the integration of new ideas and technological advancements in astronomical research and how these new technologies drive further scientific breakthroughs. Large language models, with their ability to extract key concepts from vast literature beyond keyword searches, provide a new tool to quantify such processes. In this study, we extracted concepts in astronomical research from 297,807 publications between 1993 and 2024 using large language models, resulting in a set of 24,939 concepts. These concepts were then used to form a knowledge graph, where the link strength between any two concepts was determined by their relevance through the citation-reference relationships. By calculating this relevance across different time periods, we quantified the impact of numerical simulations and machine learning on astronomical research. The knowledge graph demonstrates two phases of development: a phase where the technology was integrated and another where the technology was explored in scientific discovery. The knowledge graph reveals that despite machine learning has made much inroad in astronomy, there is currently a lack of new concept development at the intersection of AI and Astronomy, which may be the current bottleneck preventing machine learning from further transforming the field of astronomy.

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent

HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science

We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct), which we then apply to finetune a LLaMa-based language model targeted for materials science (HoneyBee). MatSci-Instruct helps alleviate the scarcity of relevant, high-quality materials science textual data available in the open literature, and HoneyBee is the first billion-parameter language model specialized to materials science. In MatSci-Instruct we improve the trustworthiness of generated data by prompting multiple commercially available large language models for generation with an Instructor module (e.g. Chat-GPT) and verification from an independent Verifier module (e.g. Claude). Using MatSci-Instruct, we construct a dataset of multiple tasks and measure the quality of our dataset along multiple dimensions, including accuracy against known facts, relevance to materials science, as well as completeness and reasonableness of the data. Moreover, we iteratively generate more targeted instructions and instruction-data in a finetuning-evaluation-feedback loop leading to progressively better performance for our finetuned HoneyBee models. Our evaluation on the MatSci-NLP benchmark shows HoneyBee's outperformance of existing language models on materials science tasks and iterative improvement in successive stages of instruction-data refinement. We study the quality of HoneyBee's language modeling through automatic evaluation and analyze case studies to further understand the model's capabilities and limitations. Our code and relevant datasets are publicly available at https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee.

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structure. We use a extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly available.

What it takes to solve the Origin(s) of Life: An integrated review of techniques

Understanding the origin(s) of life (OoL) is a fundamental challenge for science in the 21st century. Research on OoL spans many disciplines, including chemistry, physics, biology, planetary sciences, computer science, mathematics and philosophy. The sheer number of different scientific perspectives relevant to the problem has resulted in the coexistence of diverse tools, techniques, data, and software in OoL studies. This has made communication between the disciplines relevant to the OoL extremely difficult because the interpretation of data, analyses, or standards of evidence can vary dramatically. Here, we hope to bridge this wide field of study by providing common ground via the consolidation of tools and techniques rather than positing a unifying view on how life emerges. We review the common tools and techniques that have been used significantly in OoL studies in recent years. In particular, we aim to identify which information is most relevant for comparing and integrating the results of experimental analyses into mathematical and computational models. This review aims to provide a baseline expectation and understanding of technical aspects of origins research, rather than being a primer on any particular topic. As such, it spans broadly -- from analytical chemistry to mathematical models -- and highlights areas of future work that will benefit from a multidisciplinary approach to tackling the mystery of life's origin. Ultimately, we hope to empower a new generation of OoL scientists by reviewing how they can investigate life's origin, rather than dictating how to think about the problem.

Predicting Thermoelectric Power Factor of Bismuth Telluride During Laser Powder Bed Fusion Additive Manufacturing

An additive manufacturing (AM) process, like laser powder bed fusion, allows for the fabrication of objects by spreading and melting powder in layers until a freeform part shape is created. In order to improve the properties of the material involved in the AM process, it is important to predict the material characterization property as a function of the processing conditions. In thermoelectric materials, the power factor is a measure of how efficiently the material can convert heat to electricity. While earlier works have predicted the material characterization properties of different thermoelectric materials using various techniques, implementation of machine learning models to predict the power factor of bismuth telluride (Bi2Te3) during the AM process has not been explored. This is important as Bi2Te3 is a standard material for low temperature applications. Thus, we used data about manufacturing processing parameters involved and in-situ sensor monitoring data collected during AM of Bi2Te3, to train different machine learning models in order to predict its thermoelectric power factor. We implemented supervised machine learning techniques using 80% training and 20% test data and further used the permutation feature importance method to identify important processing parameters and in-situ sensor features which were best at predicting power factor of the material. Ensemble-based methods like random forest, AdaBoost classifier, and bagging classifier performed the best in predicting power factor with the highest accuracy of 90% achieved by the bagging classifier model. Additionally, we found the top 15 processing parameters and in-situ sensor features to characterize the material manufacturing property like power factor. These features could further be optimized to maximize power factor of the thermoelectric material and improve the quality of the products built using this material.

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.

MassSpecGym: A benchmark for the discovery and identification of molecules

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

Development of different methods and their efficiencies for the estimation of diffusion coefficients following the diffusion couple technique

The interdiffusion coefficients are estimated either following the Wagner's method expressed with respect to the composition (mol or atomic fraction) normalized variable after considering the molar volume variation or the den Broeder's method expressed with respect to the concentration (composition divided by the molar volume) normalized variable. On the other hand, the relations for estimation of the intrinsic diffusion coefficients of components as established by van Loo and integrated diffusion coefficients in a phase with narrow homogeneity range as established by Wagner are currently available with respect to the composition normalized variable only. In this study, we have first derived the relation proposed by den Broeder following the line of treatment proposed by Wagner. Further, the relations for estimation of the intrinsic diffusion coefficients of the components and integrated interdiffusion coefficient are established with respect to the concentration normalized variable, which were not available earlier. The veracity of these methods is examined based on the estimation of data in Ni-Pd, Ni-Al and Cu-Sn systems. Our analysis indicates that both the approaches are logically correct and there is small difference in the estimated data in these systems although a higher difference could be found in other systems. The integrated interdiffusion coefficients with respect to the concentration (or concentration normalized variable) can only be estimated considering the ideal molar volume variation. This might be drawback in certain practical systems.

Analysis and Applications of Deep Learning with Finite Samples in Full Life-Cycle Intelligence of Nuclear Power Generation

The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.

Generative modeling, design and analysis of spider silk protein sequences for enhanced mechanical properties

Spider silks are remarkable materials characterized by superb mechanical properties such as strength, extensibility and lightweightedness. Yet, to date, limited models are available to fully explore sequence-property relationships for analysis and design. Here we propose a custom generative large-language model to enable design of novel spider silk protein sequences to meet complex combinations of target mechanical properties. The model, pretrained on a large set of protein sequences, is fine-tuned on ~1,000 major ampullate spidroin (MaSp) sequences for which associated fiber-level mechanical properties exist, to yield an end-to-end forward and inverse generative strategy. Performance is assessed through: (1), a novelty analysis and protein type classification for generated spidroin sequences through BLAST searches, (2) property evaluation and comparison with similar sequences, (3) comparison of molecular structures, as well as, and (4) a detailed sequence motif analyses. We generate silk sequences with property combinations that do not exist in nature, and develop a deep understanding the mechanistic roles of sequence patterns in achieving overarching key mechanical properties (elastic modulus, strength, toughness, failure strain). The model provides an efficient approach to expand the silkome dataset, facilitating further sequence-structure analyses of silks, and establishes a foundation for synthetic silk design and optimization.

Cybloids - Creation and Control of Cybernetic Colloids

Colloids play an important role in fundamental science as well as in nature and technology. They have had a strong impact on the fundamental understanding of statistical physics. For example, colloids have helped to obtain a better understanding of collective phenomena, ranging from phase transitions and glass formation to the swarming of active Brownian particles. Yet the success of colloidal systems hinges crucially on the specific physical and chemical properties of the colloidal particles, i.e. particles with the appropriate characteristics must be available. Here we present an idea to create particles with freely selectable properties. The properties might depend, for example, on the presence of other particles (hence mimicking specific pair or many-body interactions), previous configurations (hence introducing some memory or feedback), or a directional bias (hence changing the dynamics). Without directly interfering with the sample, each particle is fully controlled and can receive external commands through a predefined algorithm that can take into account any input parameters. This is realized with computer-controlled colloids, which we term cybloids - short for cybernetic colloids. The potential of cybloids is illustrated by programming a time-delayed external potential acting on a single colloid and interaction potentials for many colloids. Both an attractive harmonic potential and an annular potential are implemented. For a single particle, this programming can cause subdiffusive behavior or lend activity. For many colloids, the programmed interaction potential allows to select a crystal structure at wish. Beyond these examples, we discuss further opportunities which cybloids offer.

Agent Laboratory: Using LLM Agents as Research Assistants

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

Rise and Fall of Anderson Localization by Lattice Vibrations: A Time-Dependent Machine Learning Approach

The intricate relationship between electrons and the crystal lattice is a linchpin in condensed matter, traditionally described by the Fr\"ohlich model encompassing the lowest-order lattice-electron coupling. Recently developed quantum acoustics, emphasizing the wave nature of lattice vibrations, has enabled the exploration of previously uncharted territories of electron-lattice interaction not accessible with conventional tools such as perturbation theory. In this context, our agenda here is two-fold. First, we showcase the application of machine learning methods to categorize various interaction regimes within the subtle interplay of electrons and the dynamical lattice landscape. Second, we shed light on a nebulous region of electron dynamics identified by the machine learning approach and then attribute it to transient localization, where strong lattice vibrations result in a momentary Anderson prison for electronic wavepackets, which are later released by the evolution of the lattice. Overall, our research illuminates the spectrum of dynamics within the Fr\"ohlich model, such as transient localization, which has been suggested as a pivotal factor contributing to the mysteries surrounding strange metals. Furthermore, this paves the way for utilizing time-dependent perspectives in machine learning techniques for designing materials with tailored electron-lattice properties.

Are large language models superhuman chemists?

Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. This is relevant for the chemical sciences, which face the problem of small and diverse datasets that are frequently in the form of text. LLMs have shown promise in addressing these issues and are increasingly being harnessed to predict chemical properties, optimize reactions, and even design and conduct experiments autonomously. However, we still have only a very limited systematic understanding of the chemical reasoning capabilities of LLMs, which would be required to improve models and mitigate potential harms. Here, we introduce "ChemBench," an automated framework designed to rigorously evaluate the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of human chemists. We curated more than 7,000 question-answer pairs for a wide array of subfields of the chemical sciences, evaluated leading open and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. The models, however, struggle with some chemical reasoning tasks that are easy for human experts and provide overconfident, misleading predictions, such as about chemicals' safety profiles. These findings underscore the dual reality that, although LLMs demonstrate remarkable proficiency in chemical tasks, further research is critical to enhancing their safety and utility in chemical sciences. Our findings also indicate a need for adaptations to chemistry curricula and highlight the importance of continuing to develop evaluation frameworks to improve safe and useful LLMs.

NatureLM: Deciphering the Language of Nature for Scientific Discovery

Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.

pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy

The exponential growth of astronomical literature poses significant challenges for researchers navigating and synthesizing general insights or even domain-specific knowledge. We present Pathfinder, a machine learning framework designed to enable literature review and knowledge discovery in astronomy, focusing on semantic searching with natural language instead of syntactic searches with keywords. Utilizing state-of-the-art large language models (LLMs) and a corpus of 350,000 peer-reviewed papers from the Astrophysics Data System (ADS), Pathfinder offers an innovative approach to scientific inquiry and literature exploration. Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context as a complement to currently existing methods that use keywords or citation graphs. It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes. We demonstrate the tool's versatility through case studies, showcasing its application in various research scenarios. The system's performance is evaluated using custom benchmarks, including single-paper and multi-paper tasks. Beyond literature review, Pathfinder offers unique capabilities for reformatting answers in ways that are accessible to various audiences (e.g. in a different language or as simplified text), visualizing research landscapes, and tracking the impact of observatories and methodologies. This tool represents a significant advancement in applying AI to astronomical research, aiding researchers at all career stages in navigating modern astronomy literature.

ChemCrow: Augmenting large-language models with chemistry tools

Over the last decades, excellent computational chemistry tools have been developed. Their full potential has not yet been reached as most are challenging to learn and exist in isolation. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 17 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned the syntheses of an insect repellent, three organocatalysts, as well as other relevant molecules. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and Chemcrow's performance. There is a significant risk of misuse of tools like ChemCrow, and we discuss their potential harms. Employed responsibly, our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry. A subset of the code is publicly available at https://github.com/ur-whitelab/chemcrow-public.

C5T5: Controllable Generation of Organic Molecules with Transformers

Methods for designing organic materials with desired properties have high potential impact across fields such as medicine, renewable energy, petrochemical engineering, and agriculture. However, using generative modeling to design substances with desired properties is difficult because candidate compounds must satisfy multiple constraints, including synthetic accessibility and other metrics that are intuitive to domain experts but challenging to quantify. We propose C5T5, a novel self-supervised pretraining method that enables transformers to make zero-shot select-and-replace edits, altering organic substances towards desired property values. C5T5 operates on IUPAC names -- a standardized molecular representation that intuitively encodes rich structural information for organic chemists but that has been largely ignored by the ML community. Our technique requires no edited molecule pairs to train and only a rough estimate of molecular properties, and it has the potential to model long-range dependencies and symmetric molecular structures more easily than graph-based methods. C5T5 also provides a powerful interface to domain experts: it grants users fine-grained control over the generative process by selecting and replacing IUPAC name fragments, which enables experts to leverage their intuitions about structure-activity relationships. We demonstrate C5T5's effectiveness on four physical properties relevant for drug discovery, showing that it learns successful and chemically intuitive strategies for altering molecules towards desired property values.

The Apache Point Observatory Galactic Evolution Experiment (APOGEE)

The Apache Point Observatory Galactic Evolution Experiment (APOGEE), one of the programs in the Sloan Digital Sky Survey III (SDSS-III), has now completed its systematic, homogeneous spectroscopic survey sampling all major populations of the Milky Way. After a three year observing campaign on the Sloan 2.5-m Telescope, APOGEE has collected a half million high resolution (R~22,500), high S/N (>100), infrared (1.51-1.70 microns) spectra for 146,000 stars, with time series information via repeat visits to most of these stars. This paper describes the motivations for the survey and its overall design---hardware, field placement, target selection, operations---and gives an overview of these aspects as well as the data reduction, analysis and products. An index is also given to the complement of technical papers that describe various critical survey components in detail. Finally, we discuss the achieved survey performance and illustrate the variety of potential uses of the data products by way of a number of science demonstrations, which span from time series analysis of stellar spectral variations and radial velocity variations from stellar companions, to spatial maps of kinematics, metallicity and abundance patterns across the Galaxy and as a function of age, to new views of the interstellar medium, the chemistry of star clusters, and the discovery of rare stellar species. As part of SDSS-III Data Release 12, all of the APOGEE data products are now publicly available.

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

70 years of machine learning in geoscience in review

This review gives an overview of the development of machine learning in geoscience. A thorough analysis of the co-developments of machine learning applications throughout the last 70 years relates the recent enthusiasm for machine learning to developments in geoscience. I explore the shift of kriging towards a mainstream machine learning method and the historic application of neural networks in geoscience, following the general trend of machine learning enthusiasm through the decades. Furthermore, this chapter explores the shift from mathematical fundamentals and knowledge in software development towards skills in model validation, applied statistics, and integrated subject matter expertise. The review is interspersed with code examples to complement the theoretical foundations and illustrate model validation and machine learning explainability for science. The scope of this review includes various shallow machine learning methods, e.g. Decision Trees, Random Forests, Support-Vector Machines, and Gaussian Processes, as well as, deep neural networks, including feed-forward neural networks, convolutional neural networks, recurrent neural networks and generative adversarial networks. Regarding geoscience, the review has a bias towards geophysics but aims to strike a balance with geochemistry, geostatistics, and geology, however excludes remote sensing, as this would exceed the scope. In general, I aim to provide context for the recent enthusiasm surrounding deep learning with respect to research, hardware, and software developments that enable successful application of shallow and deep machine learning in all disciplines of Earth science.

Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to address this challenge. Known for their strong abilities in summarizing texts, LLMs are seen as a potential tool to improve the analysis of scientific literature. However, existing LLMs have their own limits. Scientific literature often includes a wide range of multimodal elements, such as molecular structure, tables, and charts, which are hard for text-focused LLMs to understand and analyze. This issue points to the urgent need for new solutions that can fully understand and analyze multimodal content in scientific literature. To answer this demand, we present Uni-SMART (Universal Science Multimodal Analysis and Research Transformer), an innovative model designed for in-depth understanding of multimodal scientific literature. Through rigorous quantitative evaluation across several domains, Uni-SMART demonstrates superior performance over leading text-focused LLMs. Furthermore, our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts. These applications not only highlight Uni-SMART's adaptability but also its potential to revolutionize how we interact with scientific literature.

Evidence for Widespread Hydrogen Sequestration within the Moon's South Polar Cold Traps

The measured neutron flux from the Moons south polar region shows evidence of locally enhanced hydrogen concentrations, likely in the form of water ice, within most permanently shadowed regions (PSR), poleward of 77 deg S latitude. Results are consistent with the original findings of Watson et al, 1961, which found that the PSRs cryogenic surfaces create exclusive conditions for the sequestration of water ice, due to their extremely low sublimation rates. Widespread PSR hydrogenation is demonstrated in several studies by showing that the contrasting PSR area distribution is being instrumentally blurred. The PSRs expected hydrogen observations are correlated by their area fraction of the fixed 30 km diameter footprint area of the Collimated Sensor for Epithermal Neutrons (CSETN), which is part of the Lunar Exploration Neutron Detector (LEND) onboard the Lunar Reconnaissance Orbiter (LRO). The correlation indicates that the PSRs are similarly hydrogenated, with an expected concentration = 0.27 wt%, relative to that of the anhydrous reference terrain (lower bounds). Hydrogen concentrations are demonstrated to be correlated to maximum temperature distributions within the basins of Haworth, Shoemaker and Faustini PSRs. Cabeus-1 PSR shows an anomalously enhanced hydrogen concentration indicating a second process contributes to its hydrogen budget. Results are consistent with ongoing processes that introduce volatiles to the surface including outgassing, solar wind production with regolith silicates, and mixing from small scale meteor impacts and diurnal temperature variation. We validate the bandpass filter used to subtract CSETNs detection of uncollimated neutrons with profiles of several PSRs neutron suppression before and after processing. Keywords: Moon, Epithermal Neutron, Hydrogen, Water, Ice, Volatiles, LRO, LEND, Diviner, LOLA

Benchmarking Graph Neural Networks

In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises of a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of December 2022, the GitHub repository has reached 2,000 stars and 380 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest of exploring more powerful PE for Transformers and GNNs in a robust experimental setting.

Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratio of 82.58%, 68.03%, and 67.48%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties that have not been encountered. It can comprehend text inputs with various language styles, extending beyond the confines of outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in the domains of drug discovery and materials science.

The Application of Artificial Neural Network Model to Predicting the Acid Mine Drainage from Long-Term Lab Scale Kinetic Test

Acid mine drainage (AMD) is one of the common environmental problems in the coal mining industry that was formed by the oxidation of sulfide minerals in the overburden or waste rock. The prediction of acid generation through AMD is important to do in overburden management and planning the post-mining land use. One of the methods used to predict AMD is a lab-scale kinetic test to determine the rate of acid formation over time using representative samples in the field. However, this test requires a long-time procedure and large amount of chemical reagents lead to inefficient cost. On the other hand, there is potential for machine learning to learn the pattern behind the lab-scale kinetic test data. This study describes an approach to use artificial neural network (ANN) modeling to predict the result from lab-scale kinetic tests. Various ANN model is used based on 83 weeks experiments of lab-scale kinetic tests with 100\% potential acid-forming rock. The model approaches the monitoring of pH, ORP, conductivity, TDS, sulfate, and heavy metals (Fe and Mn). The overall Nash-Sutcliffe Efficiency (NSE) obtained in this study was 0.99 on training and validation data, indicating a strong correlation and accurate prediction compared to the actual lab-scale kinetic tests data. This show the ANN ability to learn patterns, trends, and seasonality from past data for accurate forecasting, thereby highlighting its significant contribution to solving AMD problems. This research is also expected to establish the foundation for a new approach to predict AMD, with time efficient, accurate, and cost-effectiveness in future applications.

ProtSolM: Protein Solubility Prediction with Multi-modal Features

Understanding protein solubility is essential for their functional applications. Computational methods for predicting protein solubility are crucial for reducing experimental costs and enhancing the efficiency and success rates of protein engineering. Existing methods either construct a supervised learning scheme on small-scale datasets with manually processed physicochemical properties, or blindly apply pre-trained protein language models to extract amino acid interaction information. The scale and quality of available training datasets leave significant room for improvement in terms of accuracy and generalization. To address these research gaps, we propose \sol, a novel deep learning method that combines pre-training and fine-tuning schemes for protein solubility prediction. ProtSolM integrates information from multiple dimensions, including physicochemical properties, amino acid sequences, and protein backbone structures. Our model is trained using \data, the largest solubility dataset that we have constructed. PDBSol includes over 60,000 protein sequences and structures. We provide a comprehensive leaderboard of existing statistical learning and deep learning methods on independent datasets with computational and experimental labels. ProtSolM achieved state-of-the-art performance across various evaluation metrics, demonstrating its potential to significantly advance the accuracy of protein solubility prediction.

CycleResearcher: Improving Automated Research via Automated Review

The automation of scientific discovery has been a long-standing goal within the research community, driven by the potential to accelerate knowledge creation. While significant progress has been made using commercial large language models (LLMs) as research assistants or idea generators, the possibility of automating the entire research process with open-source LLMs remains largely unexplored. This paper explores the feasibility of using open-source post-trained LLMs as autonomous agents capable of performing the full cycle of automated research and review, from literature review and manuscript preparation to peer review and paper revision. Our iterative preference training framework consists of CycleResearcher, which conducts research tasks, and CycleReviewer, which simulates the peer review process, providing iterative feedback via reinforcement learning. To train these models, we develop two new datasets, Review-5k and Research-14k, reflecting real-world machine learning research and peer review dynamics. Our results demonstrate that CycleReviewer achieves a 26.89\% improvement in mean absolute error (MAE) over individual human reviewers in predicting paper scores, indicating that LLMs can surpass expert-level performance in research evaluation. In research, the papers generated by the CycleResearcher model achieved a score of 5.36 in simulated peer reviews, surpassing the preprint level of 5.24 from human experts and approaching the accepted paper level of 5.69. This work represents a significant step toward fully automated scientific inquiry, providing ethical safeguards and advancing AI-driven research capabilities. The code, dataset and model weight are released at http://github/minjun-zhu/Researcher.

MSA-3D: Metallicity Gradients in Galaxies at zsim1 with JWST/NIRSpec Slit-stepping Spectroscopy

The radial gradient of gas-phase metallicity is a powerful probe of the chemical and structural evolution of star-forming galaxies, closely tied to disk formation and gas kinematics in the early universe. We present spatially resolved chemical and dynamical properties for a sample of 25 galaxies at 0.5 lesssim z lesssim 1.7 from the \msasd survey. These innovative observations provide 3D spectroscopy of galaxies at a spatial resolution approaching JWST's diffraction limit and a high spectral resolution of Rsimeq2700. The metallicity gradients measured in our galaxy sample range from -0.03 to 0.02 dex~kpc^{-1}. Most galaxies exhibit negative or flat radial gradients, indicating lower metallicity in the outskirts or uniform metallicity throughout the entire galaxy. We confirm a tight relationship between stellar mass and metallicity gradient at zsim1 with small intrinsic scatter of 0.02 dex~kpc^{-1}. Our results indicate that metallicity gradients become increasingly negative as stellar mass increases, likely because the more massive galaxies tend to be more ``disky". This relationship is consistent with the predictions from cosmological hydrodynamic zoom-in simulations with strong stellar feedback. This work presents the effort to harness the multiplexing capability of JWST NIRSpec/MSA in slit-stepping mode to map the chemical and kinematic profiles of high-redshift galaxies in large samples and at high spatial and spectral resolution.

MechGPT, a language-based strategy for mechanics and materials modeling that connects knowledge across scales, disciplines and modalities

For centuries, researchers have sought out ways to connect disparate areas of knowledge. While early scholars (Galileo, da Vinci, etc.) were experts across fields, specialization has taken hold later. With the advent of Artificial Intelligence, we can now explore relationships across areas (e.g., mechanics-biology) or disparate domains (e.g., failure mechanics-art). To achieve this, we use a fine-tuned Large Language Model (LLM), here for a subset of knowledge in multiscale materials failure. The approach includes the use of a general-purpose LLM to distill question-answer pairs from raw sources followed by LLM fine-tuning. The resulting MechGPT LLM foundation model is used in a series of computational experiments to explore its capacity for knowledge retrieval, various language tasks, hypothesis generation, and connecting knowledge across disparate areas. While the model has some ability to recall knowledge from training, we find that LLMs are particularly useful to extract structural insights through Ontological Knowledge Graphs. These interpretable graph structures provide explanatory insights, frameworks for new research questions, and visual representations of knowledge that also can be used in retrieval-augmented generation. Three versions of MechGPT are discussed, featuring different sizes from 13 billion to 70 billion parameters, and reaching context lengths of more than 10,000 tokens. This provides ample capacity for sophisticated retrieval augmented strategies, as well as agent-based modeling where multiple LLMs interact collaboratively and/or adversarially, the incorporation of new data from the literature or web searches, as well as multimodality.

The chemical inventory of the planet-hosting disk PDS 70

As host to two accreting planets, PDS 70 provides a unique opportunity to probe the chemical complexity of atmosphere-forming material. We present ALMA Band 6 observations of the PDS~70 disk and report the first chemical inventory of the system. With a spatial resolution of 0.4''-0.5'' (sim50 au), 12 species are detected, including CO isotopologues and formaldehyde, small hydrocarbons, HCN and HCO+ isotopologues, and S-bearing molecules. SO and CH3OH are not detected. All lines show a large cavity at the center of the disk, indicative of the deep gap carved by the massive planets. The radial profiles of the line emission are compared to the (sub-)mm continuum and infrared scattered light intensity profiles. Different molecular transitions peak at different radii, revealing the complex interplay between density, temperature and chemistry in setting molecular abundances. Column densities and optical depth profiles are derived for all detected molecules, and upper limits obtained for the non detections. Excitation temperature is obtained for H2CO. Deuteration and nitrogen fractionation profiles from the hydro-cyanide lines show radially increasing fractionation levels. Comparison of the disk chemical inventory to grids of chemical models from the literature strongly suggests a disk molecular layer hosting a carbon to oxygen ratio C/O>1, thus providing for the first time compelling evidence of planets actively accreting high C/O ratio gas at present time.

Total Nitrogen Estimation in Agricultural Soils via Aerial Multispectral Imaging and LIBS

Measuring soil health indicators is an important and challenging task that affects farmers' decisions on timing, placement, and quantity of fertilizers applied in the farms. Most existing methods to measure soil health indicators (SHIs) are in-lab wet chemistry or spectroscopy-based methods, which require significant human input and effort, time-consuming, costly, and are low-throughput in nature. To address this challenge, we develop an artificial intelligence (AI)-driven near real-time unmanned aerial vehicle (UAV)-based multispectral sensing (UMS) solution to estimate total nitrogen (TN) of the soil, an important macro-nutrient or SHI that directly affects the crop health. Accurate prediction of soil TN can significantly increase crop yield through informed decision making on the timing of seed planting, and fertilizer quantity and timing. We train two machine learning models including multi-layer perceptron and support vector machine to predict the soil nitrogen using a suite of data classes including multispectral characteristics of the soil and crops in red, near-infrared, and green spectral bands, computed vegetation indices, and environmental variables including air temperature and relative humidity. To generate the ground-truth data or the training data for the machine learning models, we measure the total nitrogen of the soil samples (collected from a farm) using laser-induced breakdown spectroscopy (LIBS).

M^{3}-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery

This paper introduces M^{3}-20M, a large-scale Multi-Modal Molecular dataset that contains over 20 million molecules. Designed to support AI-driven drug design and discovery, M^{3}-20M is 71 times more in the number of molecules than the largest existing dataset, providing an unprecedented scale that can highly benefit training or fine-tuning large (language) models with superior performance for drug design and discovery. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated by using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M^{3}-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, and GPT-4. Our experimental results show that M^{3}-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than the existing single-modal datasets, which validates the value and potential of M^{3}-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.

Estimation of Classical Cepheid's Physical Parameters from NIR Light Curves

Recent space-borne and ground-based observations provide photometric measurements as time series. The effect of interstellar dust extinction in the near-infrared range is only 10% of that measured in the V band. However, the sensitivity of the light curve shape to the physical parameters in the near-infrared is much lower. So, interpreting these types of data sets requires new approaches like the different large-scale surveys, which create similar problems with big data. Using a selected data set, we provide a method for applying routines implemented in R to extract most information of measurements to determine physical parameters, which can also be used in automatic classification schemes and pipeline processing. We made a multivariate classification of 131 Cepheid light curves (LC) in J, H, and K colors, where all the LCs were represented in 20D parameter space in these colors separately. Performing a Principal Component Analysis (PCA), we got an orthogonal coordinate system and squared Euclidean distances between LCs, with 6 significant eigenvalues, reducing the 20-dimension to 6. We also estimated the optimal number of partitions of similar objects and found it to be equal to 7 in each color; their dependence on the period, absolute magnitude, amplitude, and metallicity are also discussed. We computed the Spearman rank correlations, showing that periods and absolute magnitudes correlate with the first three PCs significantly. The first two PC are also found to have a relationship with the amplitude, but the metallicity effects are only marginal. The method shown can be generalized and implemented in unsupervised classification schemes and analysis of mixed and biased samples. The analysis of our Classical Cepheid near-infrared LC sample showed that the J, H, K curves are insufficient for determination of stellar metallicity, with mass being the key factor shaping them.

SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain

Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at https://github.com/AQA6666/SCP-116K-open.