
title: Video-to-Audio-and-Piano
emoji: 🔊
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false

Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks

Results

1. Results of Video-to-Audio Synthesis

https://github.com/user-attachments/assets/d6761371-8fc2-427c-8b2b-6d2ac22a2db2

https://github.com/user-attachments/assets/50b33e54-8ba1-4fab-89d3-5a5cc4c22c9a

2. Results of Video-to-Piano Synthesis

https://github.com/user-attachments/assets/b6218b94-1d58-4dc5-873a-c3e8eef6cd67

https://github.com/user-attachments/assets/ebdd1d95-2d9e-4add-b61a-d181f0ae38d0

Installation

1. Create a conda environment

conda create -n v2ap python=3.10
conda activate v2ap

2. Install requirements

pip install -r requirements.txt

Pretrained models

The models are available at https://huggingface.co/lshzhm/Video-to-Audio-and-Piano/tree/main.
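The checkpoints can also be fetched programmatically. The snippet below is a minimal sketch that uses `snapshot_download` from the `huggingface_hub` package. The repository ID is taken from the URL above. The `ckpts` target directory and the `fetch_checkpoints` helper are hypothetical names chosen for illustration; this README does not state where the inference scripts expect the weights, so adjust the path as needed.

```python
from pathlib import Path
from typing import Callable, Optional

# Repository ID taken from the model URL in this README.
REPO_ID = "lshzhm/Video-to-Audio-and-Piano"


def fetch_checkpoints(local_dir: str = "ckpts",
                      downloader: Optional[Callable[..., str]] = None) -> Path:
    """Download every file of the model repository into local_dir.

    local_dir is an assumed location, not one prescribed by the project.
    downloader can be replaced (for example in tests) to avoid network access.
    """
    if downloader is None:
        # Imported lazily so this module loads even without huggingface_hub.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloader(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `fetch_checkpoints()` from the repository root would place the files under `./ckpts`. Downloading the whole repository is the simplest option; a single file could be fetched with `hf_hub_download` instead if only one checkpoint is needed.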

Inference

1. Video-to-Audio inference

python src/inference_v2a.py

2. Video-to-Piano inference

python src/inference_v2p.py
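To run both inference steps in one go, a small driver can call the two scripts in sequence. This is a sketch under the assumption that both scripts run without command-line arguments, as the commands above suggest; `build_commands` and `run_all` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as given in this README, relative to the repository root.
SCRIPTS = ("src/inference_v2a.py", "src/inference_v2p.py")


def build_commands(repo_root: str = ".") -> list[list[str]]:
    """Return one command per inference script, using the current interpreter."""
    root = Path(repo_root).resolve()
    return [[sys.executable, str(root / script)] for script in SCRIPTS]


def run_all(repo_root: str = ".") -> None:
    """Run video-to-audio, then video-to-piano, stopping at the first failure."""
    root = Path(repo_root).resolve()
    for cmd in build_commands(repo_root):
        script = Path(cmd[1])
        if not script.is_file():
            raise FileNotFoundError(f"{script} not found; is {root} the repository root?")
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(cmd, cwd=root, check=True)
```

Running `run_all()` inside the activated `v2ap` environment should be equivalent to issuing the two commands above one after the other.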

Dataset is in progress

Metrics

Acknowledgement

Audeo for video-to-MIDI prediction
E2TTS for the CFM structure and base E2 implementation
FLAN-T5 for text encoding
CLIP for image encoding
AudioLDM Eval for audio evaluation