Automating Sequence-to-Sequence Piano Transcription With Transformers

Automatic music transcription (AMT) is a core task in Music Information Retrieval (MIR) that converts raw audio into a symbolic representation. In the context of piano transcription, this usually means producing a series of note events with precise onset/offset timings and velocities, rather than a full sheet-music score aligned to a metrical grid.
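To make that target representation concrete, here is a minimal sketch of what such note events might look like. The `NoteEvent` class and its field names are illustrative assumptions for this post, not a schema from any particular toolkit.

```python
from dataclasses import dataclass

@dataclass
class NoteEvent:
    """One transcribed piano note (hypothetical minimal schema)."""
    pitch: int        # MIDI pitch number, 21-108 for the 88 piano keys
    onset: float      # note start time in seconds
    offset: float     # note end time in seconds
    velocity: int     # MIDI velocity, 1-127

# A toy transcription target: a C-major triad struck together.
transcription = [
    NoteEvent(pitch=60, onset=0.50, offset=1.20, velocity=80),  # C4
    NoteEvent(pitch=64, onset=0.50, offset=1.20, velocity=75),  # E4
    NoteEvent(pitch=67, onset=0.50, offset=1.20, velocity=78),  # G4
]
```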

Much recent progress on piano transcription has been driven by two factors: the construction and release of datasets that pair piano audio with accurately aligned MIDI annotations (most notably MAPS and MAESTRO), and the use of domain-specific deep neural network architectures that build in knowledge about how piano notes behave (e.g., the Onsets and Frames architecture, which models note onsets and sustained frames with separate prediction heads). However, these model designs often add significant complexity to the output stages of the network, which may limit their applicability to other MIR tasks and other musical instruments.
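As a rough illustration of that kind of domain-specific design, the sketch below shows an acoustic model with separate onset and frame heads over the 88 piano keys. The layer sizes and names here are assumptions made for illustration, not the published Onsets and Frames configuration.

```python
import torch
import torch.nn as nn

class OnsetsAndFramesSketch(nn.Module):
    """Toy two-head acoustic model: one head predicts note onsets,
    the other predicts which keys are sounding in each frame."""

    def __init__(self, n_mels=229, hidden=256, n_keys=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * n_mels, hidden)
        self.onset_head = nn.Linear(hidden, n_keys)   # P(onset at frame t, key k)
        self.frame_head = nn.Linear(hidden, n_keys)   # P(key k active at frame t)

    def forward(self, mel):
        # mel: (batch, frames, n_mels) log-mel spectrogram
        x = self.conv(mel.unsqueeze(1))               # (batch, 32, frames, n_mels)
        x = x.permute(0, 2, 1, 3).flatten(2)          # (batch, frames, 32 * n_mels)
        h = torch.relu(self.proj(x))
        return torch.sigmoid(self.onset_head(h)), torch.sigmoid(self.frame_head(h))
```

The point of the two heads is that onsets and sustained frames have very different statistics, so giving each its own classifier is one way of baking instrument knowledge into the architecture.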

AMT is challenging because of the variety of information that must be recovered from piano audio. While pitch content is at least partially visible in the spectrum, quantities such as tempo, note durations, and dynamics are much harder to infer. A piece also interleaves melody and harmony, and the relationship between melodic and harmonic structure is highly non-linear. A transcription system must therefore combine many cues to decide which notes are sounding in a given passage, while also modeling how notes relate to one another over time.
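As a concrete example of the kind of input feature most transcription systems start from, the snippet below computes a log-scaled mel spectrogram with librosa. The file name and the parameter values (16 kHz sample rate, 229 mel bins) are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np
import librosa

# Load a piano recording (file name is a placeholder).
y, sr = librosa.load("piano_recording.wav", sr=16000)

# Compute a mel spectrogram and convert it to a log (dB) scale.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=229)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames)
```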

Existing methods for AMT typically extract time-frequency features such as spectrograms and feed them to a neural acoustic model. A common family of approaches trains a convolutional recurrent network, sometimes with a connectionist temporal classification (CTC) loss, but such models can be slow to train and struggle to converge stably on very long musical sequences. They also tend to rely on substantial data augmentation to learn the relationships between musical symbols, which makes it harder for them to perform well on real-world recordings of realistic length.
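For orientation, here is a minimal sketch of that prior recipe: a convolutional-recurrent model over spectrogram frames trained with a CTC objective. The layer sizes, the 90-symbol vocabulary, and the placeholder data are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRNNSketch(nn.Module):
    """Toy convolutional-recurrent model over spectrogram frames."""

    def __init__(self, n_mels=229, hidden=128, n_symbols=90):  # 88 keys + blank + rest
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_symbols)

    def forward(self, mel):
        # mel: (batch, frames, n_mels)
        x = F.relu(self.conv(mel.transpose(1, 2))).transpose(1, 2)
        x, _ = self.rnn(x)
        return F.log_softmax(self.out(x), dim=-1)   # (batch, frames, n_symbols)

# CTC training step on random placeholder data.
model = CRNNSketch()
mel = torch.randn(2, 400, 229)                  # 2 clips, 400 frames each
targets = torch.randint(1, 90, (2, 50))         # 50 target symbols per clip (0 = blank)
log_probs = model(mel).permute(1, 0, 2)         # CTC expects (frames, batch, symbols)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 400),
    target_lengths=torch.full((2,), 50))
loss.backward()
```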

We propose to leverage the attention mechanism of the Transformer architecture to overcome these challenges by learning to directly translate spectrogram inputs into MIDI-like output events. Trained only on paired audio and MIDI annotations, the resulting system learns to recognize piano notes, their onset and offset timings, and their velocities, including notes that sound simultaneously in chords. This makes it possible to train a generic end-to-end seq2seq system that reaches state-of-the-art performance on standard piano transcription benchmarks without task-specific architectural engineering.
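The sketch below shows the general shape of such a spectrogram-to-token seq2seq model using a stock encoder-decoder Transformer. All sizes, the vocabulary layout, and the omission of positional encodings are simplifying assumptions; this is not the configuration of the published system.

```python
import torch
import torch.nn as nn

class Spec2EventsTransformer(nn.Module):
    """Minimal sketch of a spectrogram-to-event-token seq2seq model."""

    def __init__(self, n_mels=229, vocab_size=1024, d_model=512,
                 nhead=8, num_layers=6):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)         # embed spectrogram frames
        self.token_emb = nn.Embedding(vocab_size, d_model)   # embed MIDI-like event tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, frames, n_mels); tokens: (batch, seq_len) of event ids.
        # A real model would also add positional information to both streams.
        src = self.input_proj(mel)
        tgt = self.token_emb(tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        dec = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.head(dec)                                # (batch, seq_len, vocab) logits

# Teacher-forced training step on placeholder data.
model = Spec2EventsTransformer()
mel = torch.randn(2, 200, 229)
tokens = torch.randint(0, 1024, (2, 64))
logits = model(mel, tokens[:, :-1])                          # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
loss.backward()
```

At inference time, decoding would proceed autoregressively from a start token, with the emitted tokens then parsed back into note events like the ones sketched earlier.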

By leveraging the Transformer's built-in attention mechanism, the system outperforms previous end-to-end transcription systems that pair a recurrent network with a separate attention module, while remaining simple enough that we expect it to generalize to other musical instruments and other transcription tasks.
