¹University of Maryland, College Park, MD, USA  ²Adobe Research, San Francisco, CA, USA
sonalkum@umd.edu
Abstract: The field of text-to-audio generation has seen remarkable advances, yet fine-grained control over the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a simple yet novel approach to generating sound effects that allows creators and sound designers to manipulate key acoustic parameters such as loudness, pitch, reverb, fade, brightness, and noise during the generation process. These controls extend beyond traditional Digital Signal Processing (DSP) techniques: they incorporate learned representations that capture the subtleties of how sound characteristics are shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning to disentangle audio from its acoustic features. It not only enhances the versatility and expressiveness of text-to-audio generation but also opens new avenues for creative audio production and sound design. Our results demonstrate that the approach produces high-quality, customizable audio that aligns closely with user specifications.
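To make the controlled parameters concrete, below is a minimal sketch of how the signal-level versions of four of these descriptors (loudness, pitch, brightness, noise) could be measured from an audio clip with standard DSP tools via librosa. This is illustrative only: the paper's approach augments such measures with learned representations, and the `describe_audio` helper and its return keys are hypothetical names, not part of the released method.

```python
# Hypothetical sketch: DSP-style estimates of acoustic descriptors
# (loudness, pitch, brightness, noise). The paper's learned
# representations go beyond these plain signal measures.
import numpy as np
import librosa

def describe_audio(path: str) -> dict:
    y, sr = librosa.load(path, sr=None, mono=True)

    # Loudness: mean RMS level expressed in dBFS.
    rms = librosa.feature.rms(y=y)[0]
    loudness_db = float(20 * np.log10(max(rms.mean(), 1e-8)))

    # Pitch: median fundamental frequency over voiced frames (pYIN).
    f0, voiced, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    pitch_hz = float(np.nanmedian(f0)) if np.any(voiced) else None

    # Brightness: mean spectral centroid in Hz.
    brightness_hz = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())

    # Noise: mean spectral flatness (near 1.0 for white noise, near 0.0 for a pure tone).
    noisiness = float(librosa.feature.spectral_flatness(y=y).mean())

    return {
        "loudness_db": loudness_db,
        "pitch_hz": pitch_hz,
        "brightness_hz": brightness_hz,
        "noisiness": noisiness,
    }
```

Descriptors like these could be binned into words (e.g., "quiet", "high-pitched", "bright") and appended to a text prompt, which is the style of descriptor-augmented prompting the examples below illustrate.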
Text Prompt with Descriptors & Generated Audio Examples
[Audio table: text prompts with descriptors alongside the corresponding generated audio.]

Comparison of Baseline (in red) and SILA (in green) on "Reverb"
[Audio table: paired baseline and SILA generations for reverb prompts.]

Comparison of Baseline (in red) and SILA (in green) on "Noise"
[Audio table: paired baseline and SILA generations for noise prompts.]

Comparison of Baseline (in red) and SILA (in green) on "Pitch"
[Audio table: paired baseline and SILA generations for pitch prompts.]

Comparison of Baseline (in red) and SILA (in green) on "Duration"
[Audio table: paired baseline and SILA generations for duration prompts.]