arxiv:2505.20444

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Published on May 26
Submitted by brian13 on May 29

Abstract

AI-generated summary: HoPE, a Hybrid of Position Embedding, enhances VLMs' long-context performance in videos through improved frequency allocation and dynamic temporal scaling.

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
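For readers less familiar with RoPE, the sketch below shows the vanilla 1D rotary embedding that the paper builds on: each pair of channels in a query or key head is rotated by an angle equal to the token position times a per-pair frequency. The function name, head size, and tensor shapes are illustrative assumptions, not taken from the paper's code release.

```python
# Minimal sketch of vanilla 1D RoPE (illustrative; names and shapes are assumptions).
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each channel pair of `x` by angle = position * theta_i.

    x:         (seq_len, head_dim) query or key vectors, head_dim even
    positions: (seq_len,) integer token positions
    """
    head_dim = x.shape[-1]
    # One frequency per channel pair: theta_i = base^(-2i / head_dim)
    freqs = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = positions[:, None].float() * freqs[None, :]      # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # interleaved channel pairs
    # Standard 2D rotation applied to each pair
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = rope_1d(torch.randn(16, 64), torch.arange(16))             # 16 tokens, head_dim = 64
print(q.shape)                                                 # torch.Size([16, 64])
```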

Community

Paper author and submitter:

🔧 Extending Rotary Position Embedding (RoPE) to multimodal scenarios typically involves allocating different frequencies to encode different positional components (i.e., t, x, y).
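As a concrete, simplified illustration of this kind of allocation, the sketch below partitions the channel pairs of one attention head among the temporal and spatial components, in the spirit of Qwen2-VL's M-RoPE. The split sizes and the function name are assumptions for illustration, not the specific strategies analyzed in the paper.

```python
# Illustrative multimodal RoPE allocation: each chunk of channel pairs is
# driven by a different positional component (t, x, y). Split sizes are assumed.
import torch

def mrope_angles(t, x, y, head_dim=64, base=10000.0, split=(32, 16, 16)):
    """Per-channel-pair rotation angles for a visual token at (t, x, y).

    split: channel budget for the temporal, x, and y components (sums to head_dim).
    """
    # Same geometric frequency schedule as 1D RoPE, one frequency per channel pair
    freqs = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    n_t, n_x, n_y = (s // 2 for s in split)
    # Assign each positional component to its chunk of channel pairs
    pos = torch.cat([
        torch.full((n_t,), float(t)),
        torch.full((n_x,), float(x)),
        torch.full((n_y,), float(y)),
    ])
    return pos * freqs   # feed these angles into the usual cos/sin rotation

# Text tokens can reuse the same index for all three components (t = x = y),
# which reduces to standard 1D RoPE.
print(mrope_angles(t=12, x=3, y=5).shape)   # torch.Size([32])
```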

🤔 In this paper, we first investigate how different frequency allocation strategies affect the semantic modeling capabilities of VLMs. Our analysis reveals that current multimodal RoPEs are unreliable for long-term semantic modeling. Moreover, we point out that the existing temporal index scaling of visual tokens lacks flexibility and robustness at inference time, where videos play at varying speeds and differ widely in information density.

✨ Guided by our analysis, we propose HoPE. HoPE combines multimodal RoPE and NoPE to enable reliable semantic modeling over extended contexts. In addition, HoPE introduces dynamic, bidirectional temporal index scaling to make VLMs robust to videos with varying speeds, which are common in real-world scenarios.
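A rough sketch of how these two ideas might be realized, assuming the temporal channels mix rotary and zero (NoPE) frequencies and that temporal indices are rescaled by a per-video factor before encoding. The channel split, the set of zeroed frequencies, and the scaling schedule are illustrative assumptions; see the repository linked below for the official implementation.

```python
# Sketch of a hybrid RoPE/NoPE allocation plus dynamic temporal index scaling.
# All split sizes, the zeroed fraction, and the scale schedule are assumptions.
import torch

def hope_angles(t, x, y, head_dim=64, base=10000.0,
                split=(16, 16, 32), zero_frac=0.5, temporal_scale=1.0):
    """Per-channel-pair rotation angles for one visual token.

    split:          channel budget for (x, y, t); sums to head_dim
    zero_frac:      fraction of temporal channels given zero frequency (NoPE-style)
    temporal_scale: factor applied to the temporal index (chosen per video)
    """
    freqs = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    n_x, n_y, n_t = (s // 2 for s in split)
    # Zero out the lowest frequencies (inside the temporal chunk): those channels
    # carry semantic similarity unchanged over arbitrarily long temporal distances.
    n_zero = int(zero_frac * n_t)
    if n_zero > 0:
        freqs[-n_zero:] = 0.0
    pos = torch.cat([
        torch.full((n_x,), float(x)),                   # spatial components
        torch.full((n_y,), float(y)),
        torch.full((n_t,), temporal_scale * float(t)),  # rescaled temporal index
    ])
    return pos * freqs

# Usage: pick a temporal scale per video (e.g. sampled during training, chosen
# at inference to match playback speed / information density).
angles = hope_angles(t=240, x=5, y=9, temporal_scale=0.5)
print(angles.shape)   # torch.Size([32])
```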

Code is available at: https://github.com/hrlics/HoPE
