BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Abstract
BiomedSQL evaluates scientific reasoning in text-to-SQL tasks using a large biomedical knowledge base, highlighting performance gaps in existing models.
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to translate qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than relying on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy and our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
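To make the idea of implicit domain criteria concrete, the following is a minimal sketch, not the benchmark's own code. It assumes a hypothetical table `gwas_hits` with made-up rows, uses an in-memory SQLite database as a stand-in for the BigQuery knowledge base, and takes 5e-8 as the conventional genome-wide significance threshold. The question "Which genes are significantly associated with increased risk of Disease X?" never mentions a p-value cutoff or an effect direction, yet a correct query needs both. The `execution_match` helper is a simplified illustration of result-based scoring and may differ from how the paper computes execution accuracy.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical gwas_hits table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gwas_hits (gene TEXT, disease TEXT, p_value REAL, beta REAL)"
    )
    rows = [
        ("GENE_A", "Disease X", 1e-12, 0.30),   # significant, risk-increasing
        ("GENE_B", "Disease X", 3e-6, 0.10),    # not genome-wide significant
        ("GENE_C", "Disease X", 2e-9, -0.20),   # significant, but protective
    ]
    conn.executemany("INSERT INTO gwas_hits VALUES (?, ?, ?, ?)", rows)
    return conn


# Query that encodes the implicit criteria: p < 5e-8 (assumed threshold)
# and a positive effect size for "increased risk".
GOLD_SQL = (
    "SELECT gene FROM gwas_hits "
    "WHERE disease = 'Disease X' AND p_value < 5e-8 AND beta > 0"
)

# Purely syntactic translation that ignores both implicit criteria.
NAIVE_SQL = "SELECT gene FROM gwas_hits WHERE disease = 'Disease X'"


def run(conn, sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same rows (simplified scoring)."""
    try:
        return run(conn, predicted_sql) == run(conn, gold_sql)
    except sqlite3.Error:
        # A query that fails to execute counts as a miss.
        return False


if __name__ == "__main__":
    conn = build_toy_db()
    print("gold :", run(conn, GOLD_SQL))
    print("naive:", run(conn, NAIVE_SQL))
    print("match:", execution_match(conn, NAIVE_SQL, GOLD_SQL))
```

On this toy data the naive query returns all three genes while the criteria-aware query returns only one, so the naive prediction would be scored as incorrect even though it is valid SQL.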
Community
Benchmark Dataset: https://huggingface.co/datasets/NIH-CARD/BiomedSQL
Code: https://github.com/NIH-CARD/biomedsql