arxiv:2505.20444

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Published on May 26
Submitted by brian13 on May 29

Abstract

AI-generated summary: HoPE, a Hybrid of Position Embedding, enhances VLMs' long-context performance in videos through improved frequency allocation and dynamic temporal scaling.

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
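For readers less familiar with RoPE, the sketch below shows the vanilla 1D rotary embedding that the paper builds on: each pair of channels in a query or key head is rotated by an angle equal to the token position times a per-pair frequency. The function name, head size, and tensor shapes are illustrative assumptions, not taken from the paper's code release.

```python
# Minimal sketch of vanilla 1D RoPE (illustrative; names and shapes are assumptions).
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each channel pair of `x` by angle = position * theta_i.

    x:         (seq_len, head_dim) query or key vectors, head_dim even
    positions: (seq_len,) integer token positions
    """
    head_dim = x.shape[-1]
    # One frequency per channel pair: theta_i = base^(-2i / head_dim)
    freqs = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = positions[:, None].float() * freqs[None, :]      # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # interleaved channel pairs
    # Standard 2D rotation applied to each pair
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = rope_1d(torch.randn(16, 64), torch.arange(16))             # 16 tokens, head_dim = 64
print(q.shape)                                                 # torch.Size([16, 64])
```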

Community

Paper author and submitter:

🔧 Extending Rotary Position Embedding (RoPE) to multimodal scenarios typically involves allocating different frequencies to encode different positional components (i.e., t, x, y).
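As a concrete, simplified illustration of this kind of allocation, the sketch below partitions the channel pairs of one attention head among the temporal and spatial components, in the spirit of Qwen2-VL's M-RoPE. The split sizes and the function name are assumptions for illustration, not the specific strategies analyzed in the paper.

```python
# Illustrative multimodal RoPE allocation: each chunk of channel pairs is
# driven by a different positional component (t, x, y). Split sizes are assumed.
import torch

def mrope_angles(t, x, y, head_dim=64, base=10000.0, split=(32, 16, 16)):
    """Per-channel-pair rotation angles for a visual token at (t, x, y).

    split: channel budget for the temporal, x, and y components (sums to head_dim).
    """
    # Same geometric frequency schedule as 1D RoPE, one frequency per channel pair
    freqs = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    n_t, n_x, n_y = (s // 2 for s in split)
    # Assign each positional component to its chunk of channel pairs
    pos = torch.cat([
        torch.full((n_t,), float(t)),
        torch.full((n_x,), float(x)),
        torch.full((n_y,), float(y)),
    ])
    return pos * freqs   # feed these angles into the usual cos/sin rotation

# Text tokens can reuse the same index for all three components (t = x = y),
# which reduces to standard 1D RoPE.
print(mrope_angles(t=12, x=3, y=5).shape)   # torch.Size([32])
```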

🤔 In this paper, we first investigate how different frequency allocation strategies affect the semantic modeling capabilities of VLMs. Our analysis reveals that current multimodal RoPEs are unreliable for long-term semantic modeling. Moreover, we point out that the existing temporal index scaling of visual tokens lacks flexibility and robustness at inference time, where videos play at varying speeds and differ widely in information density.

✨ Guided by our analysis, we propose HoPE. HoPE combines multimodal RoPE and NoPE to enable reliable semantic modeling over extended contexts. In addition, HoPE introduces dynamic, bidirectional temporal index scaling to make VLMs robust to videos with varying speeds, which are common in real-world scenarios.
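A rough sketch of how these two ideas might be realized, assuming the temporal channels mix rotary and zero (NoPE) frequencies and that temporal indices are rescaled by a per-video factor before encoding. The channel split, the set of zeroed frequencies, and the scaling schedule are illustrative assumptions; see the repository linked below for the official implementation.

```python
# Sketch of a hybrid RoPE/NoPE allocation plus dynamic temporal index scaling.
# All split sizes, the zeroed fraction, and the scale schedule are assumptions.
import torch

def hope_angles(t, x, y, head_dim=64, base=10000.0,
                split=(16, 16, 32), zero_frac=0.5, temporal_scale=1.0):
    """Per-channel-pair rotation angles for one visual token.

    split:          channel budget for (x, y, t); sums to head_dim
    zero_frac:      fraction of temporal channels given zero frequency (NoPE-style)
    temporal_scale: factor applied to the temporal index (chosen per video)
    """
    freqs = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    n_x, n_y, n_t = (s // 2 for s in split)
    # Zero out the lowest frequencies (inside the temporal chunk): those channels
    # carry semantic similarity unchanged over arbitrarily long temporal distances.
    n_zero = int(zero_frac * n_t)
    if n_zero > 0:
        freqs[-n_zero:] = 0.0
    pos = torch.cat([
        torch.full((n_x,), float(x)),                   # spatial components
        torch.full((n_y,), float(y)),
        torch.full((n_t,), temporal_scale * float(t)),  # rescaled temporal index
    ])
    return pos * freqs

# Usage: pick a temporal scale per video (e.g. sampled during training, chosen
# at inference to match playback speed / information density).
angles = hope_angles(t=240, x=5, y=9, temporal_scale=0.5)
print(angles.shape)   # torch.Size([32])
```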

Code is available at: https://github.com/hrlics/HoPE
