---

title: Video-to-Audio-and-Piano
emoji: 🔊
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---



## Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization
## Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks

## Results

**1. Results of Video-to-Audio Synthesis**

https://github.com/user-attachments/assets/d6761371-8fc2-427c-8b2b-6d2ac22a2db2

https://github.com/user-attachments/assets/50b33e54-8ba1-4fab-89d3-5a5cc4c22c9a

**2. Results of Video-to-Piano Synthesis**

https://github.com/user-attachments/assets/b6218b94-1d58-4dc5-873a-c3e8eef6cd67

https://github.com/user-attachments/assets/ebdd1d95-2d9e-4add-b61a-d181f0ae38d0


## Installation

**1. Create a conda environment**

```bash
conda create -n v2ap python=3.10
conda activate v2ap
```

**2. Install requirements**

```bash
pip install -r requirements.txt
```


**3. Download pretrained models**

The models are available at https://huggingface.co/lshzhm/Video-to-Audio-and-Piano/tree/main.
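
As a minimal sketch, the checkpoints could also be fetched programmatically with `snapshot_download` from the `huggingface_hub` package. The helper name `fetch_pretrained` and the target directory `ckpts` are assumptions made for illustration; this README does not state where the inference scripts expect the weights, so adjust the path to match your setup.

```python
from pathlib import Path

# Repository id taken from the model link above.
REPO_ID = "lshzhm/Video-to-Audio-and-Piano"


def fetch_pretrained(local_dir="ckpts", download_fn=None):
    """Download the whole model repo into local_dir and return its path.

    `local_dir` is a hypothetical location, not one defined by this project.
    `download_fn` exists only so the helper can be exercised without network
    access; by default it is huggingface_hub.snapshot_download.
    """
    if download_fn is None:
        # Imported lazily so this file loads even if huggingface_hub is missing.
        from huggingface_hub import snapshot_download
        download_fn = snapshot_download
    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return download_fn(repo_id=REPO_ID, local_dir=str(target))
```

Calling `fetch_pretrained()` once after installing the requirements should leave a copy of the repository files under `ckpts/`, provided `huggingface_hub` is installed and the machine has network access.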


## Inference

**1. Video-to-Audio inference**

```bash
python src/inference_v2a.py
```

**2. Video-to-Piano inference**

```bash
python src/inference_v2p.py
```

## Dataset is in progress


## Metrics


## Acknowledgement

- [Audeo](https://github.com/shlizee/Audeo) for video-to-MIDI prediction
- [E2TTS](https://github.com/lucidrains/e2-tts-pytorch) for the CFM structure and base E2 implementation
- [FLAN-T5](https://huggingface.co/google/flan-t5-large) for the FLAN-T5 text encoder
- [CLIP](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) for the CLIP image encoder
- [AudioLDM Eval](https://github.com/haoheliu/audioldm_eval) for audio evaluation