Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model
Abstract
Recently, multi-view or 4D video generation has emerged as a significant research topic. Nonetheless, recent 4D generation approaches still face fundamental limitations: they either harness multiple video diffusion models with additional training, or train a full 4D diffusion model at high computational cost and with limited real-world 4D data. To address these challenges, we propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) We designate the edge frames of the spatio-temporal sampling grid as key frames and synthesize them first with a video diffusion model, using depth-based warping for guidance; this ensures structural consistency across the generated frames. (2) We then interpolate the remaining frames with the same video diffusion model, producing a fully populated, spatially and temporally coherent sampling grid. In this way, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Our method is training-free and fully utilizes an off-the-shelf video diffusion model, offering a practical and effective solution for multi-view video generation.
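To make the two-step procedure concrete, below is a minimal sketch of the key-frame-then-interpolation grid filling described in the abstract. It is not the authors' implementation: the grid size, frame resolution, and all function bodies (`warp_with_depth`, `video_diffusion_sample`, `video_diffusion_interpolate`) are hypothetical placeholders standing in for the depth-based warping and the off-the-shelf video diffusion model used in the paper.

```python
# Minimal sketch of Zero4D's two-step grid filling (hypothetical placeholders).
import numpy as np

V, T = 5, 25          # assumed grid size: V camera views x T time steps
H, W, C = 64, 64, 3   # assumed frame resolution

def warp_with_depth(frame, view_idx):
    """Placeholder for depth-based warping of a frame into a target view."""
    return frame  # the real method warps using estimated depth and camera pose

def video_diffusion_sample(guidance_frames):
    """Placeholder for sampling frames from an off-the-shelf video diffusion model."""
    return [np.zeros((H, W, C), dtype=np.float32) for _ in guidance_frames]

def video_diffusion_interpolate(start, end, n):
    """Placeholder for diffusion-based interpolation between two key frames."""
    return [start * (1 - a) + end * a for a in np.linspace(0, 1, n + 2)[1:-1]]

# Input: a single monocular video occupying the first row of the (view, time) grid.
input_video = [np.zeros((H, W, C), dtype=np.float32) for _ in range(T)]
grid = [[None] * T for _ in range(V)]   # None marks frames still to be generated
grid[0] = list(input_video)

# Step 1: synthesize the edge frames of the grid as key frames,
# guided by depth-warped versions of the input frames.
for v in range(1, V):
    for t in (0, T - 1):
        guide = warp_with_depth(input_video[t], v)
        grid[v][t] = video_diffusion_sample([guide])[0]
grid[V - 1][1:T - 1] = video_diffusion_sample(
    [warp_with_depth(f, V - 1) for f in input_video[1:T - 1]]
)

# Step 2: fill the interior of the grid by interpolating between key frames
# along the temporal axis with the same video diffusion model.
for v in range(1, V - 1):
    grid[v][1:T - 1] = video_diffusion_interpolate(grid[v][0], grid[v][T - 1], T - 2)

assert all(frame is not None for row in grid for frame in row)
```

In this reading, the edge frames anchor the spatial and temporal extremes of the grid so that the subsequent interpolation is constrained on both ends, which is how the method keeps the filled-in frames consistent without any fine-tuning.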
Community
We introduce Zero4D, a novel approach for generating synchronized multi-view videos from a single video using an off-the-shelf video diffusion model, without any training.
The following related papers were recommended by the Semantic Scholar API:
- Reangle-A-Video: 4D Video Generation as Video-to-Video Translation (2025)
- TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models (2025)
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency (2025)
- SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation (2025)
- GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control (2025)
- GenFusion: Closing the Loop between Reconstruction and Generation via Videos (2025)
- I2V3D: Controllable image-to-video generation with 3D guidance (2025)