SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

Sonal Kumar¹,², Prem Seetharaman², Justin Salamon², Dinesh Manocha¹, Oriol Nieto²

¹University of Maryland, College Park, MD, USA
²Adobe Research, San Francisco, CA, USA
sonalkum@umd.edu

Abstract: The field of text-to-audio generation has seen remarkable advancements, yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach for generating sound effects that allows creators and sound designers to manipulate key acoustic parameters such as loudness, pitch, reverb, fade, brightness, and noise during the generation process. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques by incorporating learned representations that capture the subtleties of how sound characteristics can be shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning the disentanglement between audio and its acoustic features. It not only enhances the versatility and expressiveness of text-to-audio generation but also opens new avenues for creative audio production and sound design. Our results demonstrate the effectiveness of this approach in producing high-quality, customizable audio outputs that align closely with user specifications.
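
To make the signal-to-language idea concrete, the following is a minimal sketch, not the authors' pipeline: it measures two acoustic properties of a waveform (RMS level for loudness, spectral centroid for brightness) and appends matching descriptors to a text caption. The function names, the descriptor vocabulary ("loud"/"soft", "bright"/"dull"), the caption format, and both thresholds are assumptions chosen purely for illustration.

```python
import numpy as np


def rms_db(signal: np.ndarray) -> float:
    """Root-mean-square level in dB relative to full scale (amplitude 1.0)."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return float(20.0 * np.log10(max(rms, 1e-10)))  # floor avoids log(0)


def spectral_centroid_hz(signal: np.ndarray, sample_rate: int) -> float:
    """Magnitude-weighted mean frequency of the whole signal, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = magnitudes.sum()
    if total == 0.0:
        return 0.0
    return float((freqs * magnitudes).sum() / total)


def describe(signal: np.ndarray, sample_rate: int) -> list[str]:
    """Map measured features to text descriptors.

    Both thresholds are hypothetical values picked for this sketch,
    not values taken from the paper.
    """
    loudness = "loud" if rms_db(signal) > -20.0 else "soft"
    centroid = spectral_centroid_hz(signal, sample_rate)
    brightness = "bright" if centroid > 2000.0 else "dull"
    return [f"loudness: {loudness}", f"brightness: {brightness}"]


def augment_caption(caption: str, signal: np.ndarray, sample_rate: int) -> str:
    """Append signal-derived descriptors to an existing caption."""
    return caption + ", " + ", ".join(describe(signal, sample_rate))


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    low_tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz sine
    print(augment_caption("a steady tone", low_tone, sr))
```

Captions augmented this way could, under these assumptions, be paired with their audio as training data for any text-to-audio model, which is one way to read the model-agnostic claim above.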

**Figure 1:** Illustration of key components of SILA.

Text Prompt with descriptors & Generated Audio Examples

Comparison of Baseline (in red) and SILA (in green) on "Reverb"

Comparison of Baseline (in red) and SILA (in green) on "Noise"

Comparison of Baseline (in red) and SILA (in green) on "Pitch"

Comparison of Baseline (in red) and SILA (in green) on "Duration"