Where do Large Vision-Language Models Look at when Answering Questions?
Abstract
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complex visual architectures (e.g., multiple encoders and multi-resolution inputs) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answer and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus regions and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
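The abstract does not spell out how "visually relevant tokens" are selected. As a rough illustration only, and not the paper's released implementation, the sketch below ranks answer tokens by how much their generation probability drops when the visual input is withheld (e.g., the image is blurred or removed). The function name select_visual_tokens, the image-ablation baseline, and the toy numbers are all assumptions for illustration.

```python
# Minimal sketch (assumed approach, not the paper's exact procedure):
# rank generated answer tokens by how strongly they depend on the image.
# Input: per-token log-probabilities computed twice, once conditioned on the
# real image and once on a visually uninformative image.

import numpy as np

def select_visual_tokens(tokens, logp_with_image, logp_without_image, top_k=3):
    """Return the top_k answer tokens whose likelihood drops most when the
    image is ablated, i.e., the tokens most grounded in visual evidence."""
    drop = np.asarray(logp_with_image) - np.asarray(logp_without_image)
    order = np.argsort(-drop)  # largest probability drop first
    return [(tokens[i], float(drop[i])) for i in order[:top_k]]

# Toy example: answer "a red stop sign" with made-up log-probabilities.
tokens = ["a", "red", "stop", "sign"]
logp_with_image    = [-0.2, -0.4, -0.3, -0.1]   # conditioned on the image
logp_without_image = [-0.3, -2.5, -2.8, -0.9]   # image blurred / removed

# "stop" and "red" rank highest: their probabilities collapse without the image.
print(select_visual_tokens(tokens, logp_with_image, logp_without_image))
```

Tokens selected this way could then serve as the target for a heatmap method such as iGOS++, attributing the visually grounded part of the answer back to image regions.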
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection (2025)
- Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison (2025)
- Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding (2025)
- Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs (2025)
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries (2025)
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference (2025)
- LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models (2025)