arxiv:2502.16427

Fine-Grained Video Captioning through Scene Graph Consolidation

Published on Feb 23

Authors:

Abstract

Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions. To achieve this, we leverage a lightweight graph-to-text model trained solely on text corpora, eliminating the need for video captioning annotations. Experiments on the MSR-VTT and ActivityNet Captions datasets show that our approach outperforms zero-shot video captioning baselines, demonstrating that aggregating frame-level scene graphs yields rich video understanding without requiring large-scale paired data or high inference cost.
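The consolidation step can be illustrated with a minimal sketch. The abstract does not specify how frame-level graphs are merged, so everything below is an assumption made for illustration: a scene graph is reduced to a set of (subject, relation, object) triples, nodes are matched by exact label, and the hypothetical `consolidate` function keeps triples that recur in at least `min_frames` frames. It is not the authors' algorithm.

```python
from collections import Counter

# Hypothetical simplification: a scene graph is a set of
# (subject, relation, object) triples with plain-string node labels.
Triple = tuple[str, str, str]


def consolidate(frame_graphs: list[set[Triple]], min_frames: int = 1) -> list[Triple]:
    """Merge per-frame graphs, keeping triples seen in at least min_frames frames."""
    counts = Counter()
    for graph in frame_graphs:
        counts.update(graph)  # a set contributes each triple at most once per frame
    kept = [t for t, n in counts.items() if n >= min_frames]
    # Order by descending frame count, then alphabetically, for stable output.
    return sorted(kept, key=lambda t: (-counts[t], t))


# Toy frame-level graphs, standing in for graphs parsed from frame captions.
frames = [
    {("man", "holds", "guitar"), ("man", "sits on", "chair")},
    {("man", "holds", "guitar"), ("man", "looks at", "camera")},
    {("man", "holds", "guitar"), ("man", "sits on", "chair")},
]
video_graph = consolidate(frames, min_frames=2)
print(video_graph)
```

In this toy setup, a triple that appears in only one frame is dropped when `min_frames=2`, which is one simple way to suppress per-frame captioning noise before the consolidated graph is handed to a graph-to-text model.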
