---
title: Video-to-Audio-and-Piano
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---
## Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization
## Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
## Results
**1. Results of Video-to-Audio Synthesis**
https://github.com/user-attachments/assets/d6761371-8fc2-427c-8b2b-6d2ac22a2db2
https://github.com/user-attachments/assets/50b33e54-8ba1-4fab-89d3-5a5cc4c22c9a
**2. Results of Video-to-Piano Synthesis**
https://github.com/user-attachments/assets/b6218b94-1d58-4dc5-873a-c3e8eef6cd67
https://github.com/user-attachments/assets/ebdd1d95-2d9e-4add-b61a-d181f0ae38d0
## Installation
**1. Create a conda environment**
```bash
conda create -n v2ap python=3.10
conda activate v2ap
```
**2. Install requirements**
```bash
pip install -r requirements.txt
```
**3. Pretrained models**
The models are available at https://huggingface.co/lshzhm/Video-to-Audio-and-Piano/tree/main.
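One way to fetch the checkpoints is the `huggingface-cli download` command from `huggingface_hub` (a minimal sketch; the `./ckpts` target directory is an arbitrary choice, not a path the inference scripts are known to expect):
```bash
# Download every file in the model repo into ./ckpts (requires huggingface_hub >= 0.17).
huggingface-cli download lshzhm/Video-to-Audio-and-Piano --local-dir ./ckpts
```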
## Inference
**1. Video-to-Audio inference**
```bash
python src/inference_v2a.py
```
**2. Video-to-Piano inference**
```bash
python src/inference_v2p.py
```
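**3. Gradio demo**

The front matter above declares `app.py` as the Gradio entry point for this Space; assuming it calls `launch()` as Gradio apps typically do, the same demo can be started locally:
```bash
# Launch the Gradio interface declared in the front matter (app_file: app.py).
python app.py
```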
## Dataset (in progress)
## Metrics
## Acknowledgement
- [Audeo](https://github.com/shlizee/Audeo) for video-to-MIDI prediction
- [E2TTS](https://github.com/lucidrains/e2-tts-pytorch) for the CFM structure and the base E2 TTS implementation
- [FLAN-T5](https://huggingface.co/google/flan-t5-large) for FLAN-T5 text encoding
- [CLIP](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) for CLIP image encoding
- [AudioLDM Eval](https://github.com/haoheliu/audioldm_eval) for audio evaluation