Deep learning has significantly advanced melody generation, but generating melodies with coherent long-term structure remains an important challenge. Previous studies typically employ single-stage end-to-end models or two-stage frameworks that leverage structural similarity (e.g., repetition and variation) to guide melody generation, and both approaches tend to treat every musical event as equally important. In this paper, we present WuYun, a novel skeleton-guided two-stage melody generation framework built on transformers. It first generates the most structurally important notes to construct a melodic skeleton, and then infills this skeleton with decorative notes to produce a full-fledged melody. Specifically, we propose a combined knowledge-based and data-driven method to extract melodic skeletons along three dimensions: meter, rhythm, and harmony. To mitigate error accumulation in the two-stage pipeline, we introduce a skeleton ranker and apply pre-training to enhance the model's robustness. Both subjective and objective results demonstrate that WuYun generates melodies with improved long-term structure and musicality, significantly outperforming other state-of-the-art methods.
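To make the skeleton-then-infill idea concrete, the following is a minimal Python sketch. The `Note` type, the downbeat heuristic in `metrical_skeleton`, and the `skeleton_model` / `infilling_model` interfaces are illustrative assumptions for exposition only; they are not the paper's actual extraction algorithm or model API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    onset: float     # onset time in beats
    pitch: int       # MIDI pitch number
    duration: float  # duration in beats


def metrical_skeleton(notes: List[Note], beats_per_bar: int = 4) -> List[Note]:
    """Toy metrical reduction: keep only notes that start on a downbeat.

    WuYun's extraction also considers rhythmic accents and harmonic
    stability; this sketch illustrates only the metrical criterion.
    """
    return [n for n in notes if n.onset % beats_per_bar == 0]


def generate_melody(skeleton_model, infilling_model, prompt):
    """Two-stage pipeline: skeleton first, then decorative infilling.

    `skeleton_model` and `infilling_model` stand in for the two
    transformer stages described above; their interfaces are hypothetical.
    """
    skeleton = skeleton_model.generate(prompt)            # stage 1: structural notes
    melody = infilling_model.generate(skeleton=skeleton)  # stage 2: decorative notes
    return melody


if __name__ == "__main__":
    notes = [Note(0.0, 60, 1.0), Note(1.5, 62, 0.5), Note(4.0, 64, 2.0), Note(6.5, 65, 0.5)]
    print([n.pitch for n in metrical_skeleton(notes)])  # -> [60, 64]
```

The sketch's only purpose is to show how structurally important notes are separated from decorative ones before the second stage fills in the remaining material.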
WuYun Overview