SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

Sonal Kumar1,2, Prem Seetharaman2, Justin Salamon2, Dinesh Manocha1, Oriol Nieto2

1University of Maryland, College Park, MD, USA
2Adobe Research, San Francisco, CA, USA
sonalkum@umd.edu



Abstract: Text-to-audio generation has advanced rapidly, yet fine-grained control over the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a simple approach to sound-effect generation that lets creators and sound designers manipulate key acoustic parameters such as loudness, pitch, reverb, fade, brightness, and noise during the generation process. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques by incorporating learned representations that capture how sound characteristics can be shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning the disentanglement between audio and its acoustic features. It enhances the versatility and expressiveness of text-to-audio generation and opens new avenues for creative audio production and sound design. Our results demonstrate that the approach produces high-quality, customizable audio that aligns closely with user specifications.
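As a rough illustration of what "signal-to-language" augmentation could look like, the sketch below measures two simple properties of a waveform (RMS level and duration) and appends them to a caption as text descriptors. This is a minimal sketch under stated assumptions, not the pipeline used in the paper: the function names, the descriptor wording, and the dB thresholds are all hypothetical, and only two of the characteristics named in the abstract are covered.

```python
import math

def rms_dbfs(samples):
    """RMS level in dB relative to full scale (samples assumed to lie in [-1, 1])."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0) on silence

def describe(samples, sample_rate):
    """Turn measured features into text descriptors.
    The -12 dB and -30 dB thresholds are illustrative assumptions only."""
    level = rms_dbfs(samples)
    duration = len(samples) / sample_rate
    if level > -12.0:
        loudness = "loud"
    elif level < -30.0:
        loudness = "soft"
    else:
        loudness = "moderate"
    return f"loudness: {loudness}, duration: {duration:.1f} seconds"

def augment_caption(caption, samples, sample_rate):
    """Append signal-derived descriptors to an existing text caption."""
    return f"{caption}, {describe(samples, sample_rate)}"

# Synthetic 2-second, 440 Hz test tone at half of full-scale amplitude.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(2 * sample_rate)]
print(augment_caption("a dog barking", tone, sample_rate))
```

Captions augmented this way pair each audio clip with an explicit description of its measured characteristics, which is the kind of audio-descriptor pairing a text-to-audio model could be trained on.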

Figure 1: Illustration of key components of SILA.

Text Prompts with Descriptors and Generated Audio Examples

# Text Prompt Generated Audio
Figure 2: Comparison of average acoustic characteristic values. SILA outperforms the baseline, with scores within the expected range for each feature, indicating improved disentanglement between audio and its characteristics.

Comparison of Baseline (in red) and SILA (in green) on "Reverb"

# Text Prompt Generated Audio

Comparison of Baseline (in red) and SILA (in green) on "Noise"

# Text Prompt Generated Audio

Comparison of Baseline (in red) and SILA (in green) on "Pitch"

# Text Prompt Generated Audio

Comparison of Baseline (in red) and SILA (in green) on "Duration"

# Text Prompt Generated Audio