SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training
Authors
- Jiaxing Yu (Zhejiang University) yujx@zju.edu.cn
- Xinda Wu (Zhejiang University) wuxinda@zju.edu.cn
- Yunfei Xu (OPPO) xuyunfei@oppo.com
- Tieyao Zhang (Zhejiang University) kreutzer0421@zju.edu.cn
- Songruoyao Wu (Zhejiang University) wsry@zju.edu.cn
- Le Ma (Zhejiang University) maller@zju.edu.cn
- Kejun Zhang* (Zhejiang University) zhangkejun@zju.edu.cn
* Corresponding Author
Abstract
Lyric-to-melody generation aims to automatically create melodies based on given lyrics, which requires capturing the complex and subtle correlations between them. However, previous works usually suffer from two main challenges: 1) lyric-melody alignment modeling, which is often simplified to one-syllable/word-to-one-note alignment, while other approaches suffer from low alignment accuracy; 2) lyric-melody harmony modeling, which usually relies heavily on intermediates or strict rules, limiting the model's capabilities and generative diversity. In this paper, we propose SongGLM, a lyric-to-melody generation system that leverages 2D alignment encoding and multi-task pre-training based on the General Language Model (GLM) to guarantee the alignment and harmony between lyrics and melodies. Specifically, 1) we introduce a unified symbolic song representation for lyrics and melodies with word-level and phrase-level (2D) alignment encoding to capture the lyric-melody alignment; 2) we design a multi-task pre-training framework with hierarchical blank infilling objectives (n-gram, phrase, and long span), and incorporate lyric-melody relationships into the extraction of harmonized n-grams to ensure the lyric-melody harmony. We also construct a large-scale lyric-melody paired dataset comprising over 200,000 English song pieces for pre-training and fine-tuning. Both objective and subjective results indicate that SongGLM can generate melodies from lyrics with significant improvements in both alignment and harmony, outperforming all previous baseline methods.
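To make the word-level and phrase-level (2D) alignment encoding concrete, the toy sketch below attaches a word index and a phrase index to every token of a unified lyric-melody sequence, so that a lyric word and the notes it is sung on share the same pair of indices. The `Token` layout, the example song fragment, and the token names are illustrative assumptions, not the released SongGLM representation.

```python
from dataclasses import dataclass

@dataclass
class Token:
    content: str    # lyric word/syllable or note token
    word_id: int    # word-level alignment index
    phrase_id: int  # phrase-level alignment index

def encode_alignment(phrases):
    """phrases: list of phrases, each a list of (word, notes_aligned_to_it).
    Returns a flat token sequence where a word token and its notes share the
    same word_id, and all tokens of a phrase share the same phrase_id."""
    tokens, word_id = [], 0
    for phrase_id, phrase in enumerate(phrases):
        for word, notes in phrase:
            tokens.append(Token(word, word_id, phrase_id))
            for note in notes:                  # one word may be sung over several notes
                tokens.append(Token(note, word_id, phrase_id))
            word_id += 1
    return tokens

example = [
    [("twinkle", ["C4", "C4"]), ("twinkle", ["G4", "G4"])],   # phrase 0
    [("little", ["A4", "A4"]), ("star", ["G4"])],             # phrase 1
]
for tok in encode_alignment(example):
    print(tok)
```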
Lyric-to-Melody Generation Challenges
SongGLM Overview
Given the paired lyric-melody dataset, we first establish two relationships between lyrics and melodies based on their representative features, and incorporate these relationships into n-gram extraction to select the most harmonized n-grams. Then, we introduce a unified symbolic song representation with 2D alignment encoding and adopt a multi-task pre-training framework that employs hierarchical blank infilling objectives for lyric-to-melody generation.
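The hierarchical blank infilling objectives follow the GLM idea of cutting contiguous spans out of the input, replacing each with a mask token, and regenerating the removed spans autoregressively. The snippet below is a minimal sketch of that corruption step at three span granularities (n-gram, phrase, and long span); the special symbols and span positions are hypothetical and do not reproduce the paper's actual pre-training code.

```python
def blank_infill(tokens, spans):
    """Replace each (start, end) span with [MASK] and append the removed
    span contents (each prefixed with [START]) as the generation target."""
    corrupted, targets, cursor = [], [], 0
    for start, end in sorted(spans):            # spans are assumed non-overlapping
        corrupted.extend(tokens[cursor:start])
        corrupted.append("[MASK]")
        targets.extend(["[START]"] + tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])
    return corrupted + ["[SEP]"] + targets

song = [f"tok_{i}" for i in range(12)]
print(blank_infill(song, [(2, 4)]))             # n-gram objective: one short span
print(blank_infill(song, [(3, 9)]))             # phrase objective: one mid-length span
print(blank_infill(song, [(6, 12)]))            # long-span objective: mask the tail
```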
Detailed Framework
Lyric-Melody Dataset
| Public Corpus | Website | Raw | Processed |
| --- | --- | --- | --- |
| NES | https://www.kaggle.com/datasets/imsparsh/nes-mdb-dataset <br> https://github.com/chrisdonahue/nesmdb | | |
| POP909 | https://github.com/music-x-lab/POP909-Dataset | | |
| MTCL | https://www.liederenbank.nl/mtc | | |
| Wikifonia | http://www.wikifonia.org <br> http://www.synthzone.com/files/Wikifonia/Wikifonia.zip | | |
| Session | https://thesession.org | | |
| LMD | https://colinraffel.com/projects/lmd | | |
| SymphonyNet | https://symphonynet.github.io | | |
| MetaMIDI | https://zenodo.org/record/5142664 | | |
| Web Collections | Website | Raw | Processed |
| --- | --- | --- | --- |
| MuseScore | https://musescore.org <br> https://github.com/Xmader/musescore-dataset | | |
| Hooktheory | https://www.hooktheory.com <br> https://github.com/wayne391/lead-sheet-dataset | | |
| BitMidi | https://bitmidi.com | | |
| FreeMidi | https://freemidi.org <br> https://github.com/josephding23/Free-Midi-Library | | |
| KernScores | http://kern.ccarh.org | | |
| Kunstderfuge | https://www.kunstderfuge.com | | |
| ABC Notation | https://abcnotation.com | | |
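For readers who want to build a similar lyric-melody paired corpus from the MIDI sources above, the sketch below shows one possible way to pair embedded lyric events with melody notes using the pretty_midi library. The file path, the assumption that track 0 carries the melody, and the 50 ms tolerance are illustrative choices, not the preprocessing used in the paper.

```python
import pretty_midi

def extract_lyric_note_pairs(midi_path, tolerance=0.05):
    """Pair each embedded lyric event with the melody note starting closest
    to it in time (within `tolerance` seconds)."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    if not pm.lyrics or not pm.instruments:
        return []
    notes = sorted(pm.instruments[0].notes, key=lambda n: n.start)  # assume track 0 is the melody
    pairs = []
    for lyric in pm.lyrics:
        nearest = min(notes, key=lambda n: abs(n.start - lyric.time))
        if abs(nearest.start - lyric.time) <= tolerance:
            pairs.append((lyric.text.strip(), nearest.pitch, nearest.end - nearest.start))
    return pairs

# Example usage (the path is a placeholder):
# for syllable, pitch, duration in extract_lyric_note_pairs("song.mid"):
#     print(syllable, pitch, round(duration, 3))
```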