Abstract
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representations, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 can both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When PS3 is applied to a multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training, such as AnyRes and S^2, while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token-pruning approaches. Finally, we find that current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.
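To make the prompt-aware selection concrete, below is a minimal sketch of how top-k high-resolution patches could be scored against a text prompt using a cheap low-resolution pass, so that only the most relevant regions are encoded at full resolution. This is a hypothetical illustration, not the released PS3 code: the patch size, the 378-pixel low-res pass, and the `lowres_encoder` / `select_highres_patches` names are assumptions.

```python
# Hypothetical sketch of prompt-aware high-res patch selection (not the official
# PS3 implementation). A low-res pass scores each high-res patch against the
# text prompt; only the top-k patches are kept for expensive high-res encoding.
import torch
import torch.nn.functional as F


def select_highres_patches(image_4k, text_emb, lowres_encoder, patch_size=512, top_k=4):
    """Score high-res patches with a low-res pass and keep the top-k.

    image_4k:       (3, H, W) tensor, e.g. a 4K image
    text_emb:       (D,) prompt embedding from a text encoder
    lowres_encoder: maps a (N, 3, 378, 378) batch to (N, D) embeddings
    """
    _, H, W = image_4k.shape
    patches, coords = [], []
    # Tile the image into non-overlapping high-res patches.
    for y in range(0, H - patch_size + 1, patch_size):
        for x in range(0, W - patch_size + 1, patch_size):
            patches.append(image_4k[:, y:y + patch_size, x:x + patch_size])
            coords.append((y, x))
    patch_batch = torch.stack(patches)  # (N, 3, patch_size, patch_size)

    # Cheap pass: downsample each patch to the encoder's low resolution
    # and score its relevance to the prompt.
    low = F.interpolate(patch_batch, size=(378, 378), mode="bilinear", align_corners=False)
    with torch.no_grad():
        patch_emb = lowres_encoder(low)                                   # (N, D)
    scores = F.cosine_similarity(patch_emb, text_emb[None], dim=-1)       # (N,)

    # Keep only the most prompt-relevant patches for high-res encoding.
    keep = scores.topk(min(top_k, len(coords))).indices
    return patch_batch[keep], [coords[i] for i in keep], scores[keep]
```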
Community
TL;DR:
- We propose PS3, which scales up vision pre-training (e.g., CLIP, SigLIP) to 4K resolution with a near-constant cost. PS3 is able to efficiently process high-res images via prompt-aware patch selection.
- We introduce VILA-HD, a state-of-the-art high-res MLLM with PS3 as the vision encoder. VILA-HD w/ PS3 shows great scaling properties and surpasses SOTA MLLMs on high-res benchmarks in both performance and efficiency (a rough token-fusion sketch follows this list).
- We find current benchmarks don't require 4K-res perception although they contain 4K-res images. We propose 4KPro, a QA benchmark that strictly requires 4K-resolution perception.
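As a rough illustration of the second bullet, here is one way PS3 outputs might feed the LLM in a VILA-style model: concatenate the global low-res tokens with tokens from the selected high-res patches and project them into the LLM embedding space. The dimensions and module names below are assumptions, not the actual VILA-HD architecture.

```python
# Hypothetical sketch of fusing PS3 outputs for an MLLM (not the released
# VILA-HD code). Global low-res tokens and selected high-res patch tokens are
# concatenated and projected into the LLM's embedding space.
import torch
import torch.nn as nn


class VisionTokenFusion(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        # Linear projector from vision feature space to LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, global_tokens, selected_patch_tokens):
        """
        global_tokens:         (B, N_global, vision_dim) from the low-res pass
        selected_patch_tokens: (B, N_selected, vision_dim) from chosen high-res regions
        Returns (B, N_global + N_selected, llm_dim) tokens for the LLM.
        """
        fused = torch.cat([global_tokens, selected_patch_tokens], dim=1)
        return self.projector(fused)


# Example: 729 global tokens plus 256 tokens from selected high-res patches.
fusion = VisionTokenFusion()
tokens = fusion(torch.randn(1, 729, 1152), torch.randn(1, 256, 1152))
print(tokens.shape)  # torch.Size([1, 985, 4096])
```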
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression (2025)
- Image Embedding Sampling Method for Diverse Captioning (2025)
- Should VLMs be Pre-trained with Image Data? (2025)
- MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing (2025)
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries (2025)
- Breaking the Encoder Barrier for Seamless Video-Language Understanding (2025)
- Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection (2025)