The third edition of the International Workshop on Reading Music Systems (WoRMS) was held in a hybrid live/virtual setting on the 23rd of July. WoRMS is a novel workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition (OMR), with other researchers and practitioners who could benefit from such systems, like librarians or musicologists. This edition brought together many researchers working in OMR, as well as participants from industry. Eleven papers covering a broad range of OMR topics were presented, and Anthony Wilkes (Organum Ltd) gave an outstanding keynote on the design of ReadScoreLib.
One of the presented papers was work done by me (Elona Shatri) and George Fazekas on the newly created DoReMi dataset. We presented some of the challenges of OMR, specifically the problems posed by the lack of a well-annotated dataset that supports more than one stage of OMR, and how DoReMi moves closer to such a dataset. We also presented statistics on the dataset and baseline experiments on object detection using Faster R-CNN models. DoReMi is a product of our collaborative work with Steinberg’s Dorico team. Using Dorico, we generated six types of data that could be used in different steps of OMR: PNG images of scores (binary or colour), MusicXML files, XML files with metadata such as the bounding boxes of each object together with musical information, original Dorico projects, MEI files and MIDI files. The DoReMi documentation and the dataset download are linked from the original article.
Below you can find a summary of some of the papers presented at the workshop.
Hybrid Annotation Systems for Music Transcription
This paper explores the idea of bringing human annotation and automated methods together for music transcription. In other words, how can a non-specialist carry out music transcription through carefully designed tasks supported by automated AI methods? Among the 144 workers who completed tasks on MTurk, those with formal knowledge of music were rare. Audio extracts of the target music scores were offered to increase performance, especially for short segments of one or two measures. For longer segments, audio extracts gave better results than textual measures, but a combination of the two was preferred.
Implementation and evaluation of a neural network for the recognition of handwritten melodies
This research grew out of a current need for digital archiving and digitisation of music at the University Library of Regensburg. It evaluates whether existing state-of-the-art deep learning architectures can recognise handwritten monophonic scores for digitisation. Based on existing work, the architecture includes two neural networks: a stave recognition network using autoencoders and an end-to-end note recognition network using recurrent convolutional networks. One limitation mentioned is the small amount of annotated data available for this research.
The Challenge of Reconstructing Digits in Music Scores
Pacha presented focused research he is currently conducting at enote on recognising and reconstructing the digit elements in sheet music. He showed the main challenges: the ambiguity caused by variation within digit classes, their contextual nature, and further computer vision issues. He then showed the results of using deep learning to recognise digits. The network was trained on synthetic samples and achieved a validation accuracy of 95%, which does not carry over to real-world scores. To address this, it was fine-tuned on 7000 manually annotated real scores, but even then accuracy does not reach 60%. This opened up a long discussion in the workshop on why this happens and how to tackle it.
Detecting Staves and Measures in Music Scores with Deep Learning
This paper investigates strategies for detecting measures, staves and system measures using machine learning, with the aim of detecting structural elements as a basis for an OMR system. A neural network is trained on handwritten music scores to generate annotations for typeset music. Detectron2 was used as the framework and Faster R-CNN as the model to predict the bounding boxes in images. The datasets used for training were the MUSCIMA++ and AudioLabs datasets. The model was applied in three settings: single-class models (system measures, stave measures, staves), a two-class model (system measures and staves) and a three-class model (system measures, stave measures and staves). The first setting performs best. However, since that model lacks diversity, it might not work well for every kind of sheet music.
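Predicted bounding boxes such as these are commonly compared with ground-truth boxes using intersection over union (IoU). The sketch below is only an illustration of that measure and is not taken from the paper: the function name `box_iou` and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Assumed layout (not from the paper): x1 < x2 and y1 < y2, in pixels.
    """
    # Corners of the overlapping rectangle, if there is one.
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter_area = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area

    return inter_area / union_area if union_area > 0 else 0.0


# Two hypothetical measure boxes that overlap by half their width.
predicted_measure = (0, 0, 10, 10)
annotated_measure = (5, 0, 15, 10)
overlap = box_iou(predicted_measure, annotated_measure)
```

A detection is then usually counted as correct when its IoU with an annotated box exceeds a chosen threshold; the threshold itself is a design choice and is not stated here.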
Unsupervised Neural Document Analysis for Music Score Images
Given the lack of large annotated training sets, this study suggests using Domain Adaptation (DA) based on adversarial training. The authors propose combining DA and Selectional Auto-Encoders for unsupervised document analysis. They use three corpora with manually labelled layers, SALZINNES, EINSIEDELN and CAPITAN, and the F-score as the evaluation metric. The results show that the proposed method slightly improves on the state of the art, but that such adaptation should not be carried out for every type of layer.
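For readers unfamiliar with the metric, the F-score is the harmonic mean of precision and recall. A minimal sketch, assuming binary per-pixel labels for a single layer (the function name `f_score` and the flat-list input are illustrative choices, not the paper's code):

```python
def f_score(predicted, target, beta=1.0):
    """F-score of binary predictions against binary ground truth.

    `predicted` and `target` are equal-length sequences of 0/1 values,
    for example the flattened pixels of one document layer (an assumption).
    """
    pairs = list(zip(predicted, target))
    true_pos = sum(1 for p, t in pairs if p and t)
    false_pos = sum(1 for p, t in pairs if p and not t)
    false_neg = sum(1 for p, t in pairs if not p and t)

    # With no true positives, precision and recall are both zero.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
score = f_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With `beta=1.0` this is the usual F1 score, which weights precision and recall equally.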
Multimodal Audio and Image Music Transcription
This paper draws attention to the similarities between Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) and exploits them so that each field can assist the other. The paper presents a proof of concept that combines the predictions of end-to-end AMT and OMR systems over a set of monophonic scores. Using the Symbol Error Rate (SER), the authors show that a fusion model of the two can slightly improve the error rate in OMR.
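The Symbol Error Rate is usually computed as the edit distance between the predicted and reference symbol sequences, divided by the total number of reference symbols. The sketch below follows that common definition; the function names and the example symbol labels are made up for illustration and are not taken from the paper.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two symbol sequences."""
    # previous[j] holds the distance between the processed part of
    # `reference` and the first j symbols of `hypothesis`.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def symbol_error_rate(references, hypotheses):
    """Total edit distance divided by the total number of reference symbols."""
    total_symbols = sum(len(ref) for ref in references)
    if total_symbols == 0:
        raise ValueError("references must contain at least one symbol")
    total_edits = sum(
        edit_distance(ref, hyp) for ref, hyp in zip(references, hypotheses)
    )
    return total_edits / total_symbols


# One hypothetical monophonic excerpt with a single wrong pitch.
reference = [["clef-G2", "note-C4_quarter", "note-D4_quarter", "barline"]]
prediction = [["clef-G2", "note-C4_quarter", "note-E4_quarter", "barline"]]
ser = symbol_error_rate(reference, prediction)  # one substitution in four symbols
```

A lower SER is better, so a fusion model that improves on OMR alone would yield a smaller value on the same test excerpts.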
Sequential Next-Symbol Prediction for Optical Music Recognition
This study proposes to address the lack of large training sets with a sequential, classification-based approach to reading music scores. This is done by predicting the symbol locations and their respective music-notation labels using Convolutional Neural Networks (CNNs).
Completing Optical Music Recognition with Agnostic Transcription and Machine Translation
This work focuses on the last stage of OMR, encoding, where the output recognised from images is converted to a score encoding format. The paper investigates implementations of recognition pipelines that use Machine Translation to do the encoding.
Original article: https://elonashatri.github.io/WoRMS2021.html
Author: Elona Shatri