arxiv:2505.16483

Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Published on May 22 · Submitted by ssz1111 on May 26
Abstract

CANOE improves LLM faithfulness in generation tasks using synthetic QA data and Dual-GRPO reinforcement learning without human annotations.

AI-generated summary

Teaching large language models (LLMs) to remain faithful to the provided context is crucial for building reliable information-seeking systems. We therefore propose CANOE, a systematic framework that improves the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data spanning four diverse tasks to construct high-quality, easily verifiable training data without human annotation. We then propose Dual-GRPO, a rule-based reinforcement learning method with three tailored rule-based rewards derived from the synthesized short-form QA data, which simultaneously optimizes both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data for training reward models and avoids over-optimizing short-form generation, a failure mode that arises when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
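For readers unfamiliar with GRPO-style training, the sketch below shows the core idea of scoring a group of sampled responses with a verifiable rule-based reward and normalizing rewards within the group to obtain advantages. It is a minimal illustration assuming an exact-match rule; all names are hypothetical, and it is not the paper's Dual-GRPO implementation.

```python
# Minimal sketch of a GRPO-style group-relative advantage computation with a
# rule-based reward (exact match against the verifiable gold answer).
# Illustrative only; not the paper's Dual-GRPO implementation.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Sample:
    response: str
    reward: float = 0.0
    advantage: float = 0.0


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Verifiable rule: 1.0 on an exact (case-insensitive) answer match."""
    return 1.0 if response.strip().lower() == gold_answer.strip().lower() else 0.0


def grpo_advantages(group: List[Sample], gold_answer: str) -> List[Sample]:
    """Score each sampled response, then normalize rewards within the group."""
    for s in group:
        s.reward = rule_based_reward(s.response, gold_answer)
    mean = statistics.mean(s.reward for s in group)
    std = statistics.pstdev(s.reward for s in group) or 1.0  # guard: all-equal group
    for s in group:
        s.advantage = (s.reward - mean) / std
    return group


# Example: three sampled responses to a synthesized short-form question.
group = [Sample("Paris"), Sample("paris"), Sample("Lyon")]
for s in grpo_advantages(group, gold_answer="Paris"):
    print(s.response, s.reward, round(s.advantage, 3))
```

Because the reward is a deterministic rule over synthesized, easily verifiable answers, no human-labeled preference data or learned reward model is needed.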

Community

Paper author Paper submitter

The code, data, and models are available at: https://github.com/S1s-Z/CANOE.

Paper author Paper submitter

With only 7B parameters, a CANOE-trained model already exceeds state-of-the-art LLMs such as GPT-4o and OpenAI o1 in faithfulness.


Paper author Paper submitter

CANOE first synthesizes easily verifiable short-form QA data and then applies Dual-GRPO with tailored rule-based rewards to improve the faithfulness of LLMs.
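To make "easily verifiable" concrete: each synthesized example can pair a context with a short-form question whose gold answer is checkable by a simple rule. The schema below is a hypothetical sketch, not the released CANOE data format.

```python
# Hypothetical sketch of an easily verifiable short-form QA training example;
# field names are illustrative, not the released CANOE data format.
from dataclasses import dataclass


@dataclass
class SyntheticQAExample:
    context: str   # passage the model must stay faithful to
    question: str  # short-form question answerable from the context alone
    answer: str    # verifiable gold answer, consumed by the rule-based reward


example = SyntheticQAExample(
    context="The Amazon River is approximately 6,400 km long.",
    question="How long is the Amazon River?",
    answer="approximately 6,400 km",
)
```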

Paper author Paper submitter

Experimental results (%) on eleven datasets. Please find more details in our paper!

