AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
Abstract
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
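For context on the metric quoted above: the 5-degree threshold refers to the relative camera rotation error between predicted and ground-truth poses. Below is a minimal sketch of how such an error and the corresponding recall are commonly computed, using the standard geodesic-angle formula; the function names are illustrative and not taken from the paper's released code.

```python
# Minimal sketch (assumed, not from the paper's codebase): relative rotation error
# between predicted and ground-truth camera rotations, and recall at a 5-degree threshold.
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_gt                      # relative rotation
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0    # rotation angle via the trace formula
    cos_theta = np.clip(cos_theta, -1.0, 1.0)    # guard against numerical drift
    return float(np.degrees(np.arccos(cos_theta)))

def recall_at_deg(pred_rotations, gt_rotations, threshold_deg=5.0):
    """Fraction of image pairs whose rotation error falls under the threshold."""
    errors = [rotation_error_deg(Rp, Rg) for Rp, Rg in zip(pred_rotations, gt_rotations)]
    return float(np.mean([e < threshold_deg for e in errors]))
```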
Community
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
Khiem Vuong, Anurag Ghosh, Deva Ramanan*, Srinivasa Narasimhan*, Shubham Tulsiani*
CVPR 2025
TL;DR: A scalable data generation framework that combines mesh renderings with real images, enabling robust 3D reconstruction across extreme viewpoint variations (e.g., aerial-ground).
A neat application: merging non-overlapping ground images into a shared 3D scene using aerial views as a "map"/global context (see the sketch below).
See the Twitter (X) thread:
https://x.com/kvuongdev/status/1913290597718462607
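To make the aerial-as-map idea above concrete, here is a small conceptual sketch. It assumes a DUSt3R-style pairwise model wrapped in a hypothetical `predict_pointmaps(img_a, img_b)` call that returns per-pixel 3D points for both images in the first image's camera frame; this interface is illustrative and is not the actual DUSt3R API.

```python
# Conceptual sketch (not the authors' code): register non-overlapping ground photos
# into one scene by pairing each with a shared aerial image. `predict_pointmaps` is
# a hypothetical stand-in for a DUSt3R-style pairwise model that returns per-pixel
# 3D points for both images, expressed in the first image's camera frame.
import numpy as np

def merge_via_aerial(aerial_img, ground_imgs, predict_pointmaps):
    """Fuse all pointmaps into a single point cloud in the aerial camera's frame."""
    scene_points = []
    for i, ground_img in enumerate(ground_imgs):
        # Each ground photo overlaps with the aerial view (the global "map"),
        # even if it shares no overlap with the other ground photos.
        pts_aerial, pts_ground = predict_pointmaps(aerial_img, ground_img)
        if i == 0:
            scene_points.append(pts_aerial.reshape(-1, 3))  # aerial geometry, added once
        scene_points.append(pts_ground.reshape(-1, 3))      # ground geometry, aerial frame
    return np.concatenate(scene_points, axis=0)
```

Because every ground photo is paired with the same aerial view, all predicted geometry lands in one coordinate frame even when the ground photos themselves do not overlap.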
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- VGGT: Visual Geometry Grounded Transformer (2025)
- D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes (2025)
- Regist3R: Incremental Registration with Stereo Foundation Model (2025)
- Multi-view Reconstruction via SfM-guided Monocular Depth Estimation (2025)
- Accenture-NVS1: A Novel View Synthesis Dataset (2025)
- Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors (2025)
- ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration (2025)