DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Abstract
Spatio-temporal consistency is a critical research topic in video generation. A qualified generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or on a basic combination of the two, such as appending a camera-movement description to a prompt without constraining the outcome of that movement. However, camera movement may introduce new objects into the scene or remove existing ones, thereby interacting with and affecting the preceding narrative. Especially in videos with numerous camera movements, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, as well as the long-term impact of prior content on subsequent generation. Our work spans dataset construction through model development. First, we constructed the DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words that details the camera movements and plot developments. We then developed and trained the DropletVideo model, which excels at preserving spatio-temporal coherence during video generation. The DropletVideo dataset and model are accessible at https://dropletx.github.io.
Community
Over the past year, open-source video generation has made notable progress. Nonetheless, existing models still perform poorly with respect to camera movement. This paper investigates integral spatio-temporal consistency in video generation, concentrating on the influence of camera movement on the narrative, including the entry and exit of objects or scenes and the correlation between newly appearing objects and preceding plot elements.
To advance community exploration, we have released the DropletVideo project's paper, code, weights, and dataset under an open-source license.
Paper: https://arxiv.org/abs/2503.06053
Blog: https://medium.com/@guoguang35_68674/dropletvideo-promoting-spatio-temporal-consistent-video-generation-bd6d1d9b8086
GitHub: https://github.com/IEIT-AGI/DropletVideo
Project: https://dropletx.github.io/
Model Weight: https://huggingface.co/DropletX/DropletVideo-5B
DropletVideo-10M: https://huggingface.co/datasets/DropletX/DropletVideo-10M
DropletVideo-1M: https://huggingface.co/datasets/DropletX/DropletVideo-1M
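For readers who want a feel for what the long, movement-aware captions described above look like programmatically, here is a minimal sketch of inspecting a single annotation record. The field names (`caption`, `camera_movements`) are hypothetical placeholders, not the dataset's documented schema; consult the DropletVideo-10M dataset card on Hugging Face for the actual format.

```python
# Sketch of inspecting a DropletVideo-style annotation record.
# NOTE: the keys "caption" and "camera_movements" are assumed
# placeholders, not the dataset's documented schema.

def caption_stats(record: dict) -> dict:
    """Summarize one annotation record: caption length in words
    and the number of annotated camera movements."""
    return {
        "caption_words": len(record["caption"].split()),
        "num_camera_movements": len(record["camera_movements"]),
    }

# Toy record standing in for one of the 10M annotated videos.
example = {
    "caption": "The camera pans right, revealing a harbor; a sailboat "
               "enters the frame as the fisherman continues casting.",
    "camera_movements": ["pan_right"],
}

print(caption_stats(example))
```

In the released dataset, captions average 206 words and typically describe several camera movements per clip, so a real record would be substantially longer than this toy example.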