WuYun: Skeleton-Guided Melody Generation with Long-Term Structure

Abstract

Deep learning has significantly advanced melody generation, but generating melodies with coherent long-term structure remains an important challenge. Previous studies typically employ single-stage end-to-end models or utilize two-stage frameworks that leverage structural similarity (e.g., repetition and variation) to guide the melody generation process, with both approaches often treating each musical event equally. In this paper, we present WuYun, a novel skeleton-guided two-stage melody generation framework with transformers. It first generates the most structurally important notes to construct a melodic skeleton and then infills this skeleton with decorative notes to create a full-fledged melody. Specifically, we propose a knowledge-based and data-driven method to effectively extract melodic skeletons from three aspects (meter, rhythm, and harmony). To mitigate the problem of error accumulation in the two-stage method, we introduce a skeleton ranker and use pre-training methods to enhance our model's robustness. Both subjective and objective results demonstrate that WuYun generates melodies with improved long-term structure and musicality, significantly outperforming other state-of-the-art methods.
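To make the skeleton idea concrete, here is a minimal sketch of how structurally important notes might be picked out using the three aspects named above (meter, rhythm, and harmony). It is not the extraction method used in WuYun. The `Note` class, the three accent rules, and the `min_votes` threshold are all hypothetical choices made for illustration, assuming a monophonic melody in 4/4 over a single known chord.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Note:
    pitch: int       # MIDI pitch number
    start: float     # onset in beats from the start of the piece
    duration: float  # length in beats


def is_metrical_accent(note: Note, beats_per_bar: int = 4) -> bool:
    # Toy meter rule (assumption): onsets on the downbeat or the mid-bar beat.
    position = note.start % beats_per_bar
    return position in (0.0, beats_per_bar / 2)


def is_rhythmic_accent(note: Note, prev: Optional[Note]) -> bool:
    # Toy rhythm rule (assumption): a note longer than its predecessor.
    return prev is not None and note.duration > prev.duration


def is_chord_tone(note: Note, chord_pitch_classes: Set[int]) -> bool:
    # Toy harmony rule (assumption): pitch class belongs to the current chord.
    return note.pitch % 12 in chord_pitch_classes


def extract_skeleton(notes: List[Note],
                     chord_pitch_classes: Set[int],
                     min_votes: int = 2) -> List[Note]:
    """Keep notes that satisfy at least `min_votes` of the three toy rules."""
    skeleton = []
    prev = None
    for note in notes:
        votes = sum([
            is_metrical_accent(note),
            is_rhythmic_accent(note, prev),
            is_chord_tone(note, chord_pitch_classes),
        ])
        if votes >= min_votes:
            skeleton.append(note)
        prev = note
    return skeleton


# Usage: a two-bar fragment over a C major chord (pitch classes C, E, G).
melody = [
    Note(60, 0.0, 1.0),
    Note(62, 1.0, 0.5),
    Note(64, 1.5, 0.5),
    Note(67, 2.0, 2.0),
    Note(65, 4.0, 0.5),
    Note(64, 4.5, 1.5),
]
skeleton = extract_skeleton(melody, chord_pitch_classes={0, 4, 7})
print([n.pitch for n in skeleton])
```

In a two-stage setup of the kind described above, a sequence like `skeleton` would be the target of the first stage, and the remaining notes would be the decorative material that the second stage infills.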


WuYun Overview

WuYun Samples

Generated Melody with Chord    Generated Melody w/o Chord
midi                           midi
midi                           midi
midi                           midi

Baseline Models

Here are melodies generated by the baseline models:
MT            midi    midi    midi
CWT           midi    midi    midi
Melons        midi    midi    midi
WuYun-Base    midi    midi    midi