	---
	license: mit
	pipeline_tag: image-to-text
	datasets:
	- antoniorv6/grandstaff
	tags:
	- omr
	- camera_grandstaff
	arxiv: 2402.07596
	---

	# Sheet Music Transformer (base model, fine-tuned on the Camera GrandStaff dataset)

	This is the SMT model fine-tuned on the _Camera_ GrandStaff dataset for pianoform transcription.
	The model code is hosted in [this repository](https://github.com/antoniorv6/SMT).

	## Model description

	The SMT model consists of a vision encoder (ConvNeXt) and a text decoder (classic Transformer).
	Given an image of a music system, the encoder first encodes the image into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`. The decoder then autoregressively generates text, conditioned on the encoder output.
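
The encode-then-decode flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the SMT implementation: `flatten_feature_map`, `greedy_decode`, `toy_next_token`, and the `<bos>`/`<eos>` tokens are hypothetical names, the row-by-row flattening of the feature map into a sequence is assumed, and greedy selection is just one possible decoding strategy (the text above only says that generation is autoregressive).

```python
from typing import Callable, List, Sequence

# Hypothetical special tokens; the real SMT vocabulary may differ.
BOS, EOS = "<bos>", "<eos>"

Features = List[List[float]]  # one image: seq_len x hidden_size


def flatten_feature_map(feature_map: Sequence[Sequence[Sequence[float]]]) -> Features:
    """Flatten an (H, W, hidden_size) feature map into (H * W, hidden_size)."""
    return [list(cell) for row in feature_map for cell in row]


def greedy_decode(
    features: Features,
    next_token: Callable[[Features, List[str]], str],
    max_len: int = 32,
) -> List[str]:
    """Generate tokens one at a time, each conditioned on the encoder
    features and on every token produced so far."""
    tokens = [BOS]
    for _ in range(max_len):
        token = next_token(features, tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start token


def toy_next_token(features: Features, prefix: List[str]) -> str:
    """Stand-in for the decoder: one dummy token per encoder position, then stop."""
    step = len(prefix) - 1
    return f"tok{step}" if step < len(features) else EOS


# A 2 x 3 feature map with hidden_size 4 becomes a sequence of 6 embeddings.
feature_map = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
features = flatten_feature_map(feature_map)
decoded = greedy_decode(features, toy_next_token)
```

In a real model, `next_token` would be replaced by a decoder forward pass that attends over `features` and picks the next symbol from its output distribution. The `max_len` cap guards against sequences that never emit the end token.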

	<img src="https://github.com/antoniorv6/SMT/raw/master/graphics/SMT.jpg" alt="drawing" width="720"/>

	## Intended uses & limitations

	This model is fine-tuned on the GrandStaff dataset, so its use is limited to transcribing pianoform images.

	### BibTeX entry and citation info

	```bibtex
	@misc{RiosVila2024,
	title={Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription},
	author={Antonio Ríos-Vila and Jorge Calvo-Zaragoza and Thierry Paquet},
	year={2024},
	eprint={2402.07596},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2402.07596},
	}
	```