upload models
- README.md +76 -3
- diffusion_posiguide_hubertbase_mamba2/args.txt +46 -0
- diffusion_posiguide_hubertbase_mamba2/piano2pose-iter=99999-val_loss=0.0376890823245049.ckpt +3 -0
- diffusion_posiguide_hubertbase_tf2/args.txt +46 -0
- diffusion_posiguide_hubertbase_tf2/piano2pose-iter=99999-val_loss=0.0391630232334137.ckpt +3 -0
- diffusion_posiguide_hubertlarge_tf2/args.txt +46 -0
- diffusion_posiguide_hubertlarge_tf2/piano2pose-iter=90000-val_loss=0.0364401508122683.ckpt +3 -0
- diffusion_posiguide_wav2veclarge_tf2/args.txt +46 -0
- diffusion_posiguide_wav2veclarge_tf2/piano2pose-iter=99000-val_loss=0.0359656028449535.ckpt +3 -0
README.md
CHANGED
@@ -1,3 +1,76 @@
<div align="center">
<h1><img src="assets/favicon.png" width="50"> PianoMotion10M </h1>
<h3>Dataset and Benchmark for Hand Motion Generation in Piano Performance</h3>

[Qijun Gan](https://github.com/agnJason)<sup>1</sup>, [Song Wang](https://songw-zju.github.io/)<sup>1</sup>, Shengtao Wu<sup>2</sup>, [Jianke Zhu](https://scholar.google.cz/citations?user=SC-WmzwAAAAJ)<sup>1</sup> <sup>:email:</sup>

<sup>1</sup> Zhejiang University, <sup>2</sup> Hangzhou Dianzi University

(<sup>:email:</sup>) corresponding author.

[ArXiv Preprint](https://arxiv.org/abs/2406.09326) [Project Page](https://agnjason.github.io/PianoMotion-page/) Dataset[[Google Drive](https://drive.google.com/drive/folders/1JY0zOE0s7v9ZYLlIP1kCZUdNrih5nYEt?usp=sharing)]/[[Hyper.ai](https://hyper.ai/datasets/32494)]

</div>

#
### News
* **`Jun. 14th, 2024`:** Paper is available at [arXiv](https://arxiv.org/abs/2406.09326). ☕️
* **`Jun. 1st, 2024`:** We released our code and datasets! Paper is coming soon. Please stay tuned! ☕️

## Abstract

Recently, artificial intelligence techniques for education have received increasing attention, yet designing effective musical instrument instruction systems remains an open problem. Although key presses can be directly derived from sheet music, the transitional movements between key presses require more extensive guidance in piano performance. In this work, we construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird's-eye view with 10 million annotated hand poses. We also introduce a powerful baseline model that generates hand motions from piano audio through a position predictor and a position-guided gesture generator. Furthermore, a series of evaluation metrics is designed to assess the performance of the baseline model, including motion similarity, smoothness, positional accuracy of left and right hands, and overall fidelity of movement distribution. Although piano key presses with respect to music scores or audio are already accessible, PianoMotion10M aims to provide guidance on piano fingering for instruction purposes.

## Introduction
<div align="center"><h4>PianoMotion10M is a large-scale piano-motion dataset, and we present a benchmark for hand motion generation from piano music.</h4></div>



Overview of our framework. We collect videos of expert piano performances from the internet, then annotate and process them to obtain a large-scale dataset, PianoMotion10M, comprising piano music and hand motions. Building upon this dataset, we establish a benchmark aimed at generating hand movements from piano music.

## Models

Sample results of our generation model.

https://github.com/user-attachments/assets/53378820-4749-4d23-8e8e-a55b544bba98

> Results from the [PianoMotion10M paper]()



| Method        | Backbone   | Decoder     | FID       | Params |
|:-------------:|:----------:|:-----------:|:---------:|:------:|
| EmoTalk       | HuBERT     | Transformer | 4.645     | 308    |
| LivelySpeaker | HuBERT     | Transformer | 4.157     | 321    |
| Our-Base      | Wav2Vec2.0 | SSM         | 3.587     | 320    |
| Our-Base      | Wav2Vec2.0 | Transformer | 3.608     | 323    |
| Our-Base      | HuBERT     | SSM         | 3.412     | 320    |
| Our-Base      | HuBERT     | Transformer | 3.529     | 323    |
| Our-Large     | Wav2Vec2.0 | SSM         | 3.453     | 539    |
| Our-Large     | Wav2Vec2.0 | Transformer | 3.376     | 557    |
| Our-Large     | HuBERT     | SSM         | 3.395     | 539    |
| Our-Large     | HuBERT     | Transformer | **3.281** | 557    |

**Notes**:

- All experiments are performed on a single NVIDIA GeForce RTX 3090 Ti GPU.

## Getting Started
- [Installation](docs/install.md)
- [Prepare Dataset](docs/prepare_dataset.md)
- [Train and Eval](docs/train_eval.md)

## Citation
If you find PianoMotion10M useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
```bibtex
@inproceedings{gan2024pianomotion,
  title={PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance},
  author={Gan, Qijun and Wang, Song and Wu, Shengtao and Zhu, Jianke},
  year={2024},
}
```
diffusion_posiguide_hubertbase_mamba2/args.txt
ADDED
@@ -0,0 +1,46 @@
{
  "bs_dim": 96,
  "train_xy": false,
  "feature_dim": 832,
  "period": 30,
  "max_seq_len": 5000,
  "batch_size": 48,
  "gpu0_bs": 6,
  "valid_batch_size": 32,
  "experiment_name": "diffusion_posiguide_hubertbase_mamba2",
  "data_root": "/mnt/ssd/PianoPose-new",
  "preload": true,
  "tiny": false,
  "adjust": true,
  "use_midiguide": false,
  "is_random": true,
  "return_beta": false,
  "up_list": [
    "1467634",
    "66685747"
  ],
  "continue_train": false,
  "wav2vec_path": "facebook/hubert-large-ls960-ft",
  "piano2posi_path": "logs/piano2posi_hubertbase_mamba",
  "timesteps": 1000,
  "fine_map": 0,
  "unet_dim": 256,
  "xyz_guide": true,
  "remap_noise": true,
  "RAG": false,
  "hidden_type": "audio_f",
  "latest_layer": "tanh",
  "encoder_type": "mamba",
  "num_layer": 8,
  "loss_mode": "naive_l1",
  "weight_rec": 1.0,
  "weight_vel": 1.0,
  "iterations": 100000,
  "train_sec": 8,
  "lr": 5e-05,
  "check_val_every_n_iteration": 1000,
  "limit_val_batches": 0.6,
  "save_every_n_iteration": 1000,
  "save_top_k": 5,
  "logdir": "logs"
}
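
Despite its `.txt` extension, the `args.txt` content above is plain JSON, so a standard JSON parser should be able to read it back. The following is a minimal sketch under that assumption. The helper name `parse_args_text` is hypothetical and not taken from the PianoMotion10M code base, and the embedded string is only an excerpt of the keys shown above.

```python
import json
from types import SimpleNamespace

# Excerpt of the args.txt shown above (keys and values copied from it).
ARGS_TEXT = """
{
  "experiment_name": "diffusion_posiguide_hubertbase_mamba2",
  "encoder_type": "mamba",
  "num_layer": 8,
  "lr": 5e-05,
  "up_list": ["1467634", "66685747"]
}
"""


def parse_args_text(text):
    """Hypothetical helper: parse JSON-formatted args text into a namespace.

    The repository itself may load its arguments differently.
    """
    return SimpleNamespace(**json.loads(text))


args = parse_args_text(ARGS_TEXT)
# Keys become attributes, which mirrors how argparse namespaces are used.
print(args.encoder_type, args.num_layer, args.lr)
```

To read a downloaded file instead of the excerpt, the same helper can be given the file's text, for example via `pathlib.Path(...).read_text()`.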
diffusion_posiguide_hubertbase_mamba2/piano2pose-iter=99999-val_loss=0.0376890823245049.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:53766a6cb5d08e492105517e0a64b4e86e8f0c2703ba8183d92cee0d9b769daf
size 1313128490
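
A Git LFS pointer like the one above stands in for the real checkpoint and records its SHA-256 digest and byte size. The sketch below shows one way a downloaded checkpoint could be checked against such a pointer. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this illustration, not part of Git LFS or of this repository.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:53766a6cb5d08e492105517e0a64b4e86e8f0c2703ba8183d92cee0d9b769daf
size 1313128490
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and decode the oid and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Hash a local file in chunks and compare digest and size to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    total = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("path/to/downloaded.ckpt", pointer)` with an assumed local path would then report whether the download is intact.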
diffusion_posiguide_hubertbase_tf2/args.txt
ADDED
@@ -0,0 +1,46 @@
{
  "bs_dim": 96,
  "train_xy": false,
  "feature_dim": 832,
  "period": 30,
  "max_seq_len": 5000,
  "batch_size": 48,
  "gpu0_bs": 6,
  "valid_batch_size": 8,
  "experiment_name": "diffusion_posiguide_hubertbase_tf2",
  "data_root": "/mnt/ssd/PianoPose-new",
  "preload": true,
  "tiny": false,
  "adjust": true,
  "use_midiguide": false,
  "is_random": true,
  "return_beta": false,
  "up_list": [
    "1467634",
    "66685747"
  ],
  "continue_train": false,
  "wav2vec_path": "facebook/hubert-large-ls960-ft",
  "piano2posi_path": "logs/piano2posi_hubertbase_tf",
  "timesteps": 1000,
  "fine_map": 0,
  "unet_dim": 256,
  "xyz_guide": true,
  "remap_noise": true,
  "RAG": false,
  "hidden_type": "audio_f",
  "latest_layer": "tanh",
  "encoder_type": "transformer",
  "num_layer": 4,
  "loss_mode": "naive_l1",
  "weight_rec": 1.0,
  "weight_vel": 1.0,
  "iterations": 100000,
  "train_sec": 8,
  "lr": 5e-05,
  "check_val_every_n_iteration": 1000,
  "limit_val_batches": 0.6,
  "save_every_n_iteration": 1000,
  "save_top_k": 5,
  "logdir": "logs"
}
diffusion_posiguide_hubertbase_tf2/piano2pose-iter=99999-val_loss=0.0391630232334137.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7896108645d8ee8d773456a11dc74dc88531fb0a232b9921d3b8363409e526e8
size 1520681914
diffusion_posiguide_hubertlarge_tf2/args.txt
ADDED
@@ -0,0 +1,46 @@
{
  "bs_dim": 96,
  "train_xy": false,
  "feature_dim": 832,
  "period": 30,
  "max_seq_len": 5000,
  "batch_size": 20,
  "gpu0_bs": 2,
  "valid_batch_size": 32,
  "experiment_name": "diffusion_posiguide_hubertlarge_tf2",
  "data_root": "/mnt/ssd/PianoPose-new",
  "preload": false,
  "tiny": false,
  "adjust": true,
  "use_midiguide": false,
  "is_random": true,
  "return_beta": false,
  "up_list": [
    "1467634",
    "66685747"
  ],
  "continue_train": false,
  "wav2vec_path": "facebook/hubert-large-ls960-ft",
  "piano2posi_path": "logs/piano2posi2/",
  "timesteps": 1000,
  "fine_map": 0,
  "unet_dim": 256,
  "xyz_guide": true,
  "remap_noise": true,
  "RAG": false,
  "hidden_type": "audio_f",
  "latest_layer": "tanh",
  "encoder_type": "transformer",
  "num_layer": 8,
  "loss_mode": "naive_l1",
  "weight_rec": 1.0,
  "weight_vel": 1.0,
  "iterations": 100000,
  "train_sec": 8,
  "lr": 5e-05,
  "check_val_every_n_iteration": 1000,
  "limit_val_batches": 0.6,
  "save_every_n_iteration": 1000,
  "save_top_k": 5,
  "logdir": "logs"
}
diffusion_posiguide_hubertlarge_tf2/piano2pose-iter=90000-val_loss=0.0364401508122683.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e18a6a7b5332cb0dbcf3388d36dee49d57eb2dbfa4f8877a3f929d550421cd3e
size 2970320157
diffusion_posiguide_wav2veclarge_tf2/args.txt
ADDED
@@ -0,0 +1,46 @@
{
  "bs_dim": 96,
  "train_xy": false,
  "feature_dim": 832,
  "period": 30,
  "max_seq_len": 5000,
  "batch_size": 20,
  "gpu0_bs": 2,
  "valid_batch_size": 32,
  "experiment_name": "diffusion_posiguide_wav2veclarge_tf2",
  "data_root": "/mnt/ssd/PianoPose-new",
  "preload": false,
  "tiny": false,
  "adjust": true,
  "use_midiguide": false,
  "is_random": true,
  "return_beta": false,
  "up_list": [
    "1467634",
    "66685747"
  ],
  "continue_train": false,
  "wav2vec_path": "facebook/hubert-large-ls960-ft",
  "piano2posi_path": "logs/piano2posi_wav2veclarge_tf",
  "timesteps": 1000,
  "fine_map": 0,
  "unet_dim": 256,
  "xyz_guide": true,
  "remap_noise": true,
  "RAG": false,
  "hidden_type": "audio_f",
  "latest_layer": "tanh",
  "encoder_type": "transformer",
  "num_layer": 8,
  "loss_mode": "naive_l1",
  "weight_rec": 1.0,
  "weight_vel": 1.0,
  "iterations": 100000,
  "train_sec": 8,
  "lr": 5e-05,
  "check_val_every_n_iteration": 1000,
  "limit_val_batches": 0.6,
  "save_every_n_iteration": 1000,
  "save_top_k": 5,
  "logdir": "logs"
}
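
The four `args.txt` files in this commit share most keys and differ in only a few values. The sketch below compares two of them, using excerpts copied from the `hubertbase_tf2` and `wav2veclarge_tf2` files. `diff_configs` is a hypothetical helper written for this illustration, not repository code.

```python
def diff_configs(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ.

    A key missing from one side is reported with None on that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Excerpts of the two args.txt files (values copied from the diff).
hubertbase_tf2 = {
    "batch_size": 48,
    "gpu0_bs": 6,
    "valid_batch_size": 8,
    "preload": True,
    "wav2vec_path": "facebook/hubert-large-ls960-ft",
    "encoder_type": "transformer",
    "num_layer": 4,
}
wav2veclarge_tf2 = {
    "batch_size": 20,
    "gpu0_bs": 2,
    "valid_batch_size": 32,
    "preload": False,
    "wav2vec_path": "facebook/hubert-large-ls960-ft",
    "encoder_type": "transformer",
    "num_layer": 8,
}

changed = diff_configs(hubertbase_tf2, wav2veclarge_tf2)
for key, (left, right) in changed.items():
    print(f"{key}: {left} -> {right}")
```

In these excerpts, `wav2vec_path` and `encoder_type` are identical, so only the batch sizes, `preload`, and `num_layer` are reported.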
diffusion_posiguide_wav2veclarge_tf2/piano2pose-iter=99000-val_loss=0.0359656028449535.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce042fe9fbd4b0020040033f5748c34675354955bf0280f1e13b52d133923850
size 2970320221
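
The checkpoint file names in this commit embed the training iteration and the validation loss. The sketch below extracts both fields from the four paths listed in the commit. The regular expression and the helper `parse_checkpoint` are assumptions based only on those names, and the validation loss is not necessarily comparable across models or equivalent to the FID reported in the README.

```python
import re

# Checkpoint paths copied from the commit's file list.
CHECKPOINTS = [
    "diffusion_posiguide_hubertbase_mamba2/piano2pose-iter=99999-val_loss=0.0376890823245049.ckpt",
    "diffusion_posiguide_hubertbase_tf2/piano2pose-iter=99999-val_loss=0.0391630232334137.ckpt",
    "diffusion_posiguide_hubertlarge_tf2/piano2pose-iter=90000-val_loss=0.0364401508122683.ckpt",
    "diffusion_posiguide_wav2veclarge_tf2/piano2pose-iter=99000-val_loss=0.0359656028449535.ckpt",
]

# Assumed naming pattern: <experiment>/piano2pose-iter=<int>-val_loss=<float>.ckpt
PATTERN = re.compile(
    r"^(?P<experiment>[^/]+)/piano2pose-"
    r"iter=(?P<iteration>\d+)-val_loss=(?P<val_loss>\d+\.\d+)\.ckpt$"
)


def parse_checkpoint(path):
    """Hypothetical helper: split a checkpoint path into its named fields."""
    match = PATTERN.match(path)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {path}")
    return {
        "experiment": match["experiment"],
        "iteration": int(match["iteration"]),
        "val_loss": float(match["val_loss"]),
    }


parsed = [parse_checkpoint(p) for p in CHECKPOINTS]
# Pick the entry with the lowest validation loss recorded in its file name.
best = min(parsed, key=lambda c: c["val_loss"])
print(best["experiment"], best["iteration"], best["val_loss"])
```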