arxiv:2505.00662

DeepCritic: Deliberate Critique with Large Language Models

Published on May 1
· Submitted by Keven16 on May 2
#2 Paper of the day
Abstract

As Large Language Models (LLMs) rapidly evolve, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial for each step, leading to low judgment accuracy and insufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of a math solution. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of the initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model, built on Qwen2.5-7B-Instruct, not only significantly outperforms existing LLM critics (including same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
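For readers curious about the Monte Carlo annotation step mentioned above, here is a minimal sketch of the general idea, not the paper's actual code: the correctness of a reasoning step is estimated by how often independent rollouts from the prefix ending at that step still reach the reference answer. The helper names `sample_completion` and `extract_final_answer` are hypothetical placeholders standing in for an LLM sampling call and an answer parser.

```python
def estimate_step_correctness(problem, steps, i, reference_answer,
                              sample_completion, extract_final_answer,
                              num_rollouts=8):
    """Monte Carlo estimate of whether step i of a solution is correct.

    Completes the solution num_rollouts times from the prefix steps[:i+1]
    and returns the fraction of rollouts whose final answer matches the
    reference answer. A score near zero suggests the prefix (and hence
    step i or an earlier step) contains an error.
    """
    prefix = "\n".join(steps[: i + 1])
    hits = 0
    for _ in range(num_rollouts):
        # Hypothetical helper: sample one full completion from the prefix.
        completion = sample_completion(problem, prefix)
        # Hypothetical helper: parse the final boxed/stated answer.
        if extract_final_answer(completion) == reference_answer:
            hits += 1
    return hits / num_rollouts
```

How the per-step scores are thresholded into correct/incorrect labels, and how many rollouts are used, are design choices the abstract does not specify; the above only illustrates the estimation principle.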

Community

Paper author · Paper submitter

We propose the DeepCritic framework to enable LLM critics to provide judgments after thoughtful and deliberate evaluation. We carefully curate 4.5K long-form critique examples through iterative synthesis for SFT to teach the model how to perform deliberate critique, and subsequently perform RL to fully elicit the model's critique capabilities. Our developed critique model, built on Qwen2.5-7B-Instruct, not only outperforms existing LLM critics (including same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback. The data and models are available at https://github.com/RUCBM/DeepCritic.
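As an illustration, here is a minimal sketch of querying a critic like this through the standard Hugging Face transformers chat API. The checkpoint id below is the base model used as a stand-in, since this page does not list the released checkpoint names, and the critique prompt is an assumed example rather than the paper's exact format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: substitute the released DeepCritic checkpoint from the repository above.
model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the critic to deliberately verify one step of a math solution.
messages = [
    {"role": "user", "content": (
        "Problem: <math problem here>\n"
        "Solution so far:\nStep 1: <reasoning step here>\n\n"
        "Critique Step 1: verify it from multiple perspectives and "
        "state whether it is correct."
    )},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Consult the linked repository for the released checkpoints and the actual critique prompt format.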


Kai is my God in AI.

Big congrats!

Thanks, Xu~

