
Generation of Musical Timbres Using a Text-Guided Diffusion Model


Abstract

In recent years, text-to-audio systems have achieved remarkable success, enabling the generation of complete audio segments directly from text descriptions. While these systems also facilitate music creation, they often leave little room for human creativity and deliberate expression. In contrast, the present work lets composers, arrangers, and performers create the basic building blocks of music: audio of individual musical notes for use in electronic instruments and DAWs. Through text prompts, the user can specify the timbre characteristics of the audio. We introduce a system that combines a latent diffusion model with multi-modal contrastive learning to generate musical timbres conditioned on text descriptions. By jointly generating the magnitude and phase of the spectrogram, our method eliminates the subsequent phase-retrieval step that related methods require.
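The point about phase retrieval can be illustrated concretely. The following is a minimal PyTorch sketch, not the paper's code: it assumes a hypothetical decoder that outputs a magnitude and a phase spectrogram, and shows that such a joint prediction can be inverted to a waveform with a single inverse STFT, whereas a magnitude-only prediction would need an iterative phase-retrieval algorithm such as Griffin-Lim.

```python
import torch

# STFT parameters (illustrative values, not taken from the paper).
n_fft, hop = 1024, 256
freq, frames = n_fft // 2 + 1, 200

# Placeholder for a decoder's jointly predicted magnitude and phase.
magnitude = torch.rand(freq, frames)
phase = (torch.rand(freq, frames) * 2 - 1) * torch.pi

# Magnitude + phase combine into a complex spectrogram, which the
# inverse STFT maps directly back to audio -- no phase retrieval needed.
complex_spec = torch.polar(magnitude, phase)
waveform = torch.istft(
    complex_spec,
    n_fft=n_fft,
    hop_length=hop,
    window=torch.hann_window(n_fft),
)
print(waveform.shape)  # 1-D tensor of audio samples
```

With magnitude-only methods, the `torch.istft` call above is impossible, and something like `torchaudio.transforms.GriffinLim` must iteratively estimate a plausible phase first, which adds cost and typically degrades audio quality.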



Preprint

Apr. 2025

Authors

W. Yuan • Q. Khan • V. Golkov

Links

GitHub

Research Area

 B1 | Computer Vision

BibTeX Key: YKG25
