EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Abstract
EPiC is a framework for efficient 3D camera control in video diffusion models that constructs high-quality anchor videos through first-frame visibility masking, integrates them using a lightweight ControlNet module, and achieves state-of-the-art performance on I2V tasks with minimal resources.
Recent approaches on 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions to pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, without requiring modifications to the diffusion model backbone typically needed to mitigate rendering misalignments. Although being trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation (2025)
- OmniCam: Unified Multimodal Video Generation via Camera Control (2025)
- MotionPro: A Precise Motion Controller for Image-to-Video Generation (2025)
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors (2025)
- CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models (2025)
- CamContextI2V: Context-aware Controllable Video Generation (2025)
- EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper