RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Abstract
RICO is a novel iterative framework that improves caption accuracy and completeness by reconstructing captions into images with a text-to-image model and prompting an MLLM to resolve discrepancies, while RICO-Flash improves efficiency by learning this behavior with DPO.
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code is released at https://github.com/wangyuchi369/RICO.
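The following is a minimal sketch of the reconstruct-and-revise loop described above. The helpers `text_to_image`, `mllm_compare`, and `mllm_revise` are hypothetical stand-ins for the text-to-image model and the MLLM reviser; they are not part of the released RICO code and only illustrate the control flow.

```python
def rico_refine(original_image, caption, num_rounds=3):
    """Iteratively refine a caption via visual reconstruction (illustrative sketch)."""
    for _ in range(num_rounds):
        # 1. Reconstruct the current caption into a reference image.
        reconstructed = text_to_image(caption)  # hypothetical T2I call

        # 2. Ask the MLLM to list discrepancies between the original and
        #    reconstructed images (hallucinated or missing details).
        discrepancies = mllm_compare(original_image, reconstructed)  # hypothetical MLLM call
        if not discrepancies:
            break  # caption is already judged faithful and complete

        # 3. Revise the caption so that it resolves the listed discrepancies.
        caption = mllm_revise(caption, discrepancies)  # hypothetical MLLM call
    return caption
```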
Community
We propose RICO, a novel framework that refines captions through visual reconstruction. Conventional recaptioning methods typically map images directly to text without explicitly aligning the semantic spaces of the two modalities, often leading to information loss in the generated captions. In contrast, our approach incorporates visual reconstruction to make this loss more observable. By identifying discrepancies between the original and reconstructed images through the reviser, we refine the caption to produce a more semantically aligned and comprehensive description.
Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap.
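The abstract also introduces RICO-Flash, which avoids the iterative cost at inference time by training with DPO on RICO outputs. A plausible way to assemble such preference pairs, shown purely as an assumption about the data format rather than the paper's actual pipeline, is to treat the RICO-refined caption as the preferred response and the initial single-pass caption as the rejected one:

```python
def build_dpo_pair(image_id, initial_caption, rico_caption):
    # Field names follow common DPO training conventions, not the released RICO code.
    return {
        "prompt": f"Describe image {image_id} in detail.",
        "chosen": rico_caption,       # caption refined via visual reconstruction
        "rejected": initial_caption,  # original single-pass MLLM caption
    }

# `image_ids`, `base_captions`, and `rico_captions` are placeholder collections.
pairs = [
    build_dpo_pair(i, base, refined)
    for i, base, refined in zip(image_ids, base_captions, rico_captions)
]
```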
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception (2025)
- Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models (2025)
- Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training (2025)
- Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation (2025)
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs (2025)
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation (2025)
- OmniCaptioner: One Captioner to Rule Them All (2025)