Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
Abstract
We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) on test-driven development (TDD) tasks, where test cases serve as both the prompt and the verification for code generation. Unlike traditional approaches that rely on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases, mirroring real-world software development practice. Comprising 1000 diverse challenges across 20 application domains, the benchmark evaluates LLMs on their ability to generate compact, functional code under the constraints of context length and multi-feature complexity. Our findings highlight instruction following and in-context learning as the critical capabilities for TDD success, outweighing general coding proficiency or pretraining knowledge. Through a comprehensive evaluation of 19 frontier models, we reveal performance bottlenecks, such as instruction loss in long prompts, and provide a detailed error analysis spanning multiple root causes. This work underscores the practical value of TDD-specific benchmarks and lays a foundation for advancing LLM capabilities in rigorous, application-driven coding scenarios.
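To make the tests-as-prompt setup concrete, the sketch below shows the kind of Jest/React Testing Library test file that could serve as the entire prompt: the model sees only the tests and must produce an implementation that makes them pass, and the same tests are then executed to verify the generated code. The component name (`BookingForm`), API route, and assertions are illustrative placeholders, not taken from WebApp1K.

```tsx
// BookingForm.test.tsx -- illustrative tests-as-prompt example (hypothetical task).
// The model receives only this file and must write ./BookingForm so all tests pass.
import React from 'react';
import { render, screen, fireEvent, waitFor } from '@testing-library/react';
import '@testing-library/jest-dom';
import BookingForm from './BookingForm'; // implementation to be generated by the model

// Mock the network layer so the tests also specify the expected API contract.
global.fetch = jest.fn() as unknown as typeof fetch;

describe('BookingForm', () => {
  beforeEach(() => {
    (global.fetch as jest.Mock).mockClear();
  });

  test('submits the form and shows a confirmation message', async () => {
    (global.fetch as jest.Mock).mockResolvedValueOnce({
      ok: true,
      json: async () => ({ confirmationId: 'ABC123' }),
    });

    render(<BookingForm />);
    fireEvent.change(screen.getByLabelText(/name/i), { target: { value: 'Ada' } });
    fireEvent.click(screen.getByRole('button', { name: /book/i }));

    // The assertions pin down both the UI behavior and the request shape.
    await waitFor(() =>
      expect(screen.getByText(/confirmation: ABC123/i)).toBeInTheDocument()
    );
    expect(global.fetch).toHaveBeenCalledWith(
      '/api/bookings',
      expect.objectContaining({ method: 'POST' })
    );
  });
});
```

Passing such a prompt requires reproducing exactly the element labels, roles, and request contract encoded in the assertions, which is why instruction following and in-context learning matter more here than broad pretraining knowledge.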
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation (2025)
- CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts (2025)
- Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach (2025)
- A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs (2025)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs (2025)
- ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving (2025)
- FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks (2025)