Deep learning has significantly advanced melody generation, but generating melodies with coherent long-term structure remains an important challenge. Previous studies typically employ single-stage end-to-end models or two-stage frameworks that leverage structural similarity (e.g., repetition and variation) to guide melody generation, and both approaches tend to treat every musical event as equally important. In this paper, we present WuYun, a novel skeleton-guided two-stage melody generation framework built on transformers. It first generates the most structurally important notes to construct a melodic skeleton, and then infills this skeleton with decorative notes to produce a full-fledged melody. Specifically, we propose a combined knowledge-based and data-driven method to extract melodic skeletons along three dimensions: meter, rhythm, and harmony. To mitigate error accumulation in the two-stage pipeline, we introduce a skeleton ranker and apply pre-training to enhance the model's robustness. Both subjective and objective results demonstrate that WuYun generates melodies with improved long-term structure and musicality, significantly outperforming other state-of-the-art methods.
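To make the skeleton-then-infill idea concrete, the following is a minimal Python sketch. The `Note` type, the downbeat heuristic in `metrical_skeleton`, and the `skeleton_model` / `infilling_model` interfaces are illustrative assumptions for exposition only; they are not the paper's actual extraction algorithm or model API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    onset: float     # onset time in beats
    pitch: int       # MIDI pitch number
    duration: float  # duration in beats


def metrical_skeleton(notes: List[Note], beats_per_bar: int = 4) -> List[Note]:
    """Toy metrical reduction: keep only notes that start on a downbeat.

    WuYun's extraction also considers rhythmic accents and harmonic
    stability; this sketch illustrates only the metrical criterion.
    """
    return [n for n in notes if n.onset % beats_per_bar == 0]


def generate_melody(skeleton_model, infilling_model, prompt):
    """Two-stage pipeline: skeleton first, then decorative infilling.

    `skeleton_model` and `infilling_model` stand in for the two
    transformer stages described above; their interfaces are hypothetical.
    """
    skeleton = skeleton_model.generate(prompt)            # stage 1: structural notes
    melody = infilling_model.generate(skeleton=skeleton)  # stage 2: decorative notes
    return melody


if __name__ == "__main__":
    notes = [Note(0.0, 60, 1.0), Note(1.5, 62, 0.5), Note(4.0, 64, 2.0), Note(6.5, 65, 0.5)]
    print([n.pitch for n in metrical_skeleton(notes)])  # -> [60, 64]
```

The sketch's only purpose is to show how structurally important notes are separated from decorative ones before the second stage fills in the remaining material.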
WuYun Overview