Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Abstract
Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make basic mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
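The abstract does not spell out the RRPO objective, so the sketch below is only an illustration of the idea it describes: a DPO-style preference term whose implicit reward is restricted to the sub-sequences where the preferred and non-preferred responses differ (a "refined" reward), plus a token-wise regularizer that keeps the policy close to the reference model. The function name, the masking scheme, and the hyperparameters `beta` and `lam` are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rrpo_style_loss(
    logp_policy_w, logp_ref_w,       # per-token log-probs of the preferred response   [T_w]
    logp_policy_l, logp_ref_l,       # per-token log-probs of the non-preferred one     [T_l]
    refined_mask_w, refined_mask_l,  # 1.0 on tokens inside the differing sub-sequences, else 0.0
    beta=0.1, lam=0.01,              # assumed hyperparameters, not from the paper
):
    """Illustrative DPO-style objective with sub-sequence-level ("refined") rewards
    and a token-wise regularizer; the exact RRPO formulation may differ."""
    # Refined rewards: average the policy/reference log-ratio only over the
    # sub-sequences where the two responses actually differ.
    r_w = ((logp_policy_w - logp_ref_w) * refined_mask_w).sum() / refined_mask_w.sum().clamp(min=1)
    r_l = ((logp_policy_l - logp_ref_l) * refined_mask_l).sum() / refined_mask_l.sum().clamp(min=1)

    # Bradley-Terry preference term, as in standard DPO.
    pref_loss = -F.logsigmoid(beta * (r_w - r_l))

    # Token-wise regularization: penalize per-token drift of the preferred response
    # from the reference model (a log-ratio penalty standing in for token-wise KL).
    tok_reg = (logp_policy_w - logp_ref_w).abs().mean()

    return pref_loss + lam * tok_reg
```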
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models (2025)
- TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment (2025)
- Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs (2025)
- Aligning Multimodal LLM with Human Preference: A Survey (2025)
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding (2025)
- SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner (2025)
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (2025)