Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
Abstract
A novel pairwise-comparison framework and the CreataSet dataset are used to train CrEval, an LLM-based evaluator that assesses textual creativity with substantially better alignment to human judgments.
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval, which aligns with human judgments substantially better than existing methods. Experimental results underscore the importance of integrating both human-generated and synthetic data when training robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
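The abstract describes a pairwise-comparison protocol in which two candidate responses to a shared contextual instruction are judged head-to-head by the trained evaluator. As a rough illustration only, the sketch below shows what such a comparison call might look like; the checkpoint path, prompt wording, and answer parsing are all assumptions for this example, since the CrEval model and code have not yet been released.

```python
# Minimal sketch of a pairwise creativity comparison with an LLM evaluator.
# The model path, prompt template, and output parsing are illustrative
# assumptions, not the paper's released implementation.
from transformers import pipeline

# Hypothetical checkpoint; replace with the actual CrEval weights once released.
EVALUATOR_MODEL = "path/to/creval-checkpoint"

PROMPT_TEMPLATE = (
    "You are a creativity evaluator.\n"
    "Shared instruction: {instruction}\n\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n\n"
    "Which response is more creative? Answer with 'A' or 'B'."
)


def compare_creativity(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the evaluator's pairwise judgment."""
    generator = pipeline("text-generation", model=EVALUATOR_MODEL)
    prompt = PROMPT_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    output = generator(prompt, max_new_tokens=4)[0]["generated_text"]
    # Naive parsing: look at the first token generated after the prompt.
    verdict = output[len(prompt):].strip()
    return "A" if verdict.startswith("A") else "B"


if __name__ == "__main__":
    print(compare_creativity(
        "Write a one-line slogan for a paper umbrella.",
        "Stay dry, stay classy.",
        "A roof you can fold into your pocket.",
    ))
```

Sharing the same instruction across both responses, as in the template above, is what the abstract refers to as "shared contextual instructions": it keeps the comparison grounded in a common task so the evaluator judges relative creativity rather than topical differences.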
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach (2025)
- Creative Preference Optimization (2025)
- Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications (2025)
- Flex-Judge: Think Once, Judge Anywhere (2025)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation (2025)
- Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification (2025)
- COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values (2025)