HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models
Abstract
HoPE, a Hybrid of Position Embedding, enhances VLMs' long-context performance in videos through improved frequency allocation and dynamic temporal scaling.
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
Community
Extending Rotary Position Embedding (RoPE) to multimodal scenarios typically involves allocating different frequencies to encode the different positional components (i.e., t, x, y).
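As a concrete illustration, the sketch below shows one common way such an allocation can be implemented: the rotary frequency bands of an attention head are partitioned among the temporal and spatial axes, and each band is rotated by the corresponding coordinate of the token's 3D position. The 2:1:1 interleaved split, the helper names, and the head dimension are illustrative assumptions, not the scheme of any specific method.

```python
# A minimal sketch (assumptions, not any specific method's implementation) of
# 3D frequency allocation in a multimodal RoPE: each rotary frequency band of a
# head is assigned to one positional component (t, x, or y), and a token at
# position (t, x, y) rotates each band by its assigned coordinate.
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies (one per pair of head dimensions)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def allocate_3d_bands(head_dim: int):
    """Illustrative 2:1:1 interleaved split of the bands among (t, x, y)."""
    idx = torch.arange(head_dim // 2)
    t_mask = idx % 4 < 2      # half of the bands encode the temporal index
    x_mask = idx % 4 == 2     # a quarter encode the horizontal patch index
    y_mask = idx % 4 == 3     # a quarter encode the vertical patch index
    return t_mask, x_mask, y_mask

def rotary_angles(t: int, x: int, y: int, head_dim: int) -> torch.Tensor:
    """Per-band rotation angles for a visual token at 3D position (t, x, y)."""
    inv_freq = rope_inv_freq(head_dim)
    t_mask, x_mask, y_mask = allocate_3d_bands(head_dim)
    pos = torch.empty_like(inv_freq)
    pos[t_mask] = float(t)
    pos[x_mask] = float(x)
    pos[y_mask] = float(y)
    return pos * inv_freq     # shape: (head_dim // 2,)

# Example: a token from frame 7 at patch position (3, 5) in a 64-dim head
angles = rotary_angles(t=7, x=3, y=5, head_dim=64)
cos, sin = angles.cos(), angles.sin()   # applied to query/key pairs as in standard RoPE
```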
In this paper, we first investigate how different frequency allocation strategies impact the semantic modeling capabilities of VLMs. Our analysis reveals that current multimodal RoPEs are unreliable for long-term semantic modeling. Moreover, we point out that existing temporal index scaling of visual tokens lacks flexibility and robustness at inference time, where videos proceed at varying speeds and differ significantly in information density.
Guided by our analysis, we propose HoPE. HoPE combines multimodal RoPE and NoPE to enable reliable semantic modeling over extended contexts. Additionally, HoPE introduces dynamic, bidirectional temporal index scaling to improve VLMs' robustness to videos of varying speeds, which are common in real-world scenarios.
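A minimal sketch of these two ideas follows, under simplifying assumptions: the lowest-frequency rotary bands are zeroed out so that those dimensions behave like NoPE (no positional rotation at any distance), and the temporal indices of visual tokens are scaled by a factor drawn from a candidate set during training and chosen freely at inference. The zero ratio, candidate scales, and helper names are illustrative assumptions, not the exact configuration used in the paper (see the official code for that).

```python
# A minimal sketch (assumptions, not the released implementation) of
# (1) a RoPE/NoPE hybrid: some rotary frequency bands are set to zero, so those
#     dimensions carry no positional rotation over arbitrarily long contexts, and
# (2) dynamic, bidirectional temporal scaling: temporal indices of visual tokens
#     are stretched or compressed by a factor sampled during training and picked
#     freely at inference.
import random
import torch

def hybrid_inv_freq(head_dim: int, zero_ratio: float = 0.25, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies with the lowest-frequency bands zeroed (NoPE-style)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    num_zero = int(zero_ratio * inv_freq.numel())
    if num_zero > 0:
        inv_freq[-num_zero:] = 0.0   # zero frequency -> no rotation at any distance
    return inv_freq

def scale_temporal_indices(t_idx: torch.Tensor,
                           scales=(0.5, 1.0, 2.0),
                           training: bool = True,
                           chosen_scale: float = 1.0) -> torch.Tensor:
    """Scale temporal indices: sample a factor during training, pick one at inference."""
    gamma = random.choice(scales) if training else chosen_scale
    return t_idx.float() * gamma

# Example: 16 sampled frames, temporal indices scaled before computing rotation angles
t_idx = torch.arange(16)
scaled_t = scale_temporal_indices(t_idx, training=True)
angles = scaled_t[:, None] * hybrid_inv_freq(head_dim=64)[None, :]   # (frames, head_dim // 2)
```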
Code is available at: https://github.com/hrlics/HoPE
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models (2025)
- Mavors: Multi-granularity Video Representation for Multimodal Large Language Model (2025)
- STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference (2025)
- ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models (2025)
- Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation (2025)
- M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models (2025)
- The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer (2025)