SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

Sonal Kumar1,2, Prem Seetharaman2, Justin Salamon2, Dinesh Manocha1, Oriol Nieto2

1University of Maryland, College Park, MD, USA
2Adobe Research, San Francisco, CA, USA
sonalkum@umd.edu



Abstract: Text-to-audio generation has advanced remarkably, yet fine control over the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach for generating sound effects that lets creators and sound designers manipulate key acoustic parameters such as loudness, pitch, reverb, fade, brightness, and noise during the generation process. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques: they incorporate learned representations that capture how sound characteristics can be shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning to disentangle audio from its acoustic features. It enhances the versatility and expressiveness of text-to-audio generation and opens new avenues for creative audio production and sound design. Our results demonstrate that the approach produces high-quality, customizable audio outputs that align closely with user specifications.


Figure 1: Illustration of key components of SILA.
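
The abstract describes control through text descriptors of acoustic parameters such as loudness. As a rough illustration of how a measured signal property could be turned into such a descriptor, the sketch below computes an RMS level and maps it to a label. This is not the authors' pipeline: the dBFS thresholds, the `LOUDNESS_BINS` table, and the function names are all hypothetical, chosen only to make the idea concrete.

```python
import numpy as np

# Hypothetical upper bounds in dBFS for each label; not taken from the paper.
LOUDNESS_BINS = [(-40.0, "very soft"), (-25.0, "soft"), (-12.0, "loud")]


def rms_dbfs(signal):
    """RMS level in dB relative to full scale (amplitude 1.0)."""
    samples = np.asarray(signal, dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    # Floor the RMS so that digital silence does not produce -inf.
    return 20.0 * np.log10(max(rms, 1e-12))


def loudness_descriptor(signal):
    """Map the measured level to a text descriptor (assumed label set)."""
    level = rms_dbfs(signal)
    for upper_bound, label in LOUDNESS_BINS:
        if level < upper_bound:
            return f"loudness: {label}"
    return "loudness: very loud"


# Synthetic stand-ins for real recordings: one second of a 440 Hz tone.
t = np.arange(16000) / 16000.0
quiet = 0.001 * np.sin(2 * np.pi * 440 * t)
loud = 0.9 * np.sin(2 * np.pi * 440 * t)

print(loudness_descriptor(quiet))
print(loudness_descriptor(loud))
```

In practice the bin edges would have to be chosen from the statistics of the training data; the values here are placeholders.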

Text Prompt with descriptors & Generated Audio Examples

#   Text Prompt                                                  Generated Audio
1   Explosion, & loudness: very loud
2   Explosion, & loudness: very soft
3   The deep rumble of the storm echoes through the sky, & loudness: soft.
4   The deep rumble of the storm echoes through the sky, & loudness: very loud.
5   Futuristic sci-fi swish and whoosh, swish & fade: out
6   Big metal object hitting a large metal tank with a , deep sound & pitch: low, & duration: 5 seconds & reverb: slightly wet
7   Consecutive footsteps in dress shoes, echoing on a hard floor surface & fade: in & noise: silent background & reverb: wet
8   A dog barking nearby, & reverb: dry.
9   A dog barking nearby, & reverb: wet.
10  Gunshots being fired, & reverb: dry.
11  Gunshots being fired, & reverb: wet.
12  Footsteps on a wooden floor, & reverb: dry.
13  Footsteps on a wooden floor, & reverb: very wet.
14  A joyful man giggling, & reverb: very wet.
15  Continuous pouring of rain, & reverb: wet.
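
The prompts in the table above share a visible pattern: a free-text caption followed by `& name: value` descriptors. A minimal sketch of composing and parsing prompts in that shape is given below. The helper names `build_prompt` and `parse_prompt` are hypothetical, and since the table itself varies in punctuation (some rows have a comma before the first `&` or a trailing period), the exact separators chosen here are an assumption.

```python
from typing import Dict, Tuple


def build_prompt(caption: str, descriptors: Dict[str, str]) -> str:
    """Append '& name: value' descriptors to a caption (assumed format)."""
    base = caption.strip().rstrip(",.")
    if not descriptors:
        return base
    tags = " & ".join(f"{name}: {value}" for name, value in descriptors.items())
    return f"{base}, & {tags}"


def parse_prompt(prompt: str) -> Tuple[str, Dict[str, str]]:
    """Split a prompt back into its caption and descriptor mapping."""
    caption, *tags = prompt.split("&")
    descriptors = {}
    for tag in tags:
        name, _, value = tag.partition(":")
        # Tolerate the stray commas and periods seen in some rows.
        descriptors[name.strip()] = value.strip(" ,.")
    return caption.strip(" ,."), descriptors


print(build_prompt("A dog barking nearby", {"reverb": "dry"}))
print(parse_prompt("Explosion, & loudness: very loud"))
```

Note that this parser splits on every `&`, so it would mis-handle a caption that itself contains an ampersand.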

Figure 2: Comparison of average acoustic characteristic values. SILA outperforms the baseline, with scores within the expected range for each feature, indicating improved disentanglement between audio and its characteristics.
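
Figure 2 compares averaged acoustic characteristic values against an expected range. One way such a check could be run for a single feature is sketched below, using the spectral centroid as a common proxy for brightness. The choice of feature, the expected range, and the helper names are all assumptions for illustration; the source does not state how the values in the figure were computed.

```python
import numpy as np


def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency in Hz, a common brightness proxy."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = magnitudes.sum()
    return float((freqs * magnitudes).sum() / total) if total > 0 else 0.0


def average_in_range(clips, sample_rate, expected_range):
    """Average a feature over clips and test it against a (low, high) range."""
    mean_value = float(np.mean([spectral_centroid(c, sample_rate) for c in clips]))
    low, high = expected_range
    return mean_value, low <= mean_value <= high


def tone(freq, sample_rate=16000, seconds=1.0):
    """Pure sine tone used here as a stand-in for generated audio."""
    t = np.arange(int(sample_rate * seconds)) / sample_rate
    return np.sin(2 * np.pi * freq * t)


SR = 16000
dull_clips = [tone(300), tone(500)]
# Hypothetical expected range (Hz) for clips requested with low brightness.
mean_hz, ok = average_in_range(dull_clips, SR, expected_range=(0.0, 1000.0))
print(mean_hz, ok)
```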

Comparison of Baseline (in red) and SILA (in green) on "Reverb"

#   Text Prompt                       Generated Audio
1   Dog barking, & reverb: dry.
2   Dog barking, & reverb: dry.
3   Dog barking, & reverb: wet.
4   Dog barking, & reverb: wet.
5   Gun shot, & reverb: dry.
6   Gun shot, & reverb: dry.
7   Gun shot, & reverb: wet.
8   Gun shot, & reverb: wet.
9   Cat meow, & reverb: dry.
10  Cat meow, & reverb: dry.
11  Cat meow, & reverb: very wet.
12  Cat meow, & reverb: very wet.

Comparison of Baseline (in red) and SILA (in green) on "Noise"

#   Text Prompt                                  Generated Audio
1   Metal utensils clanking, & noise: silent
2   Metal utensils clanking, & noise: silent
3   Metal utensils clanking, & noise: noisy
4   Metal utensils clanking, & noise: noisy
5   Futuristic sci-fi swish, & noise: silent
6   Futuristic sci-fi swish, & noise: silent
7   Futuristic sci-fi swish, & noise: noisy
8   Futuristic sci-fi swish, & noise: noisy
9   Glass explosion, & noise: silent
10  Glass explosion, & noise: silent
11  Glass explosion, & noise: noisy
12  Glass explosion, & noise: noisy
13  whoosh, & noise: silent
14  whoosh, & noise: silent
15  whoosh, & noise: noisy
16  whoosh, & noise: noisy

Comparison of Baseline (in red) and SILA (in green) on "Pitch"

#   Text Prompt                 Generated Audio
1   Car honk, & pitch: low
2   Car honk, & pitch: low
3   Car honk, & pitch: high
4   Car honk, & pitch: high
5   Gun shot, & pitch: low
6   Gun shot, & pitch: low
7   Gun shot, & pitch: high
8   Gun shot, & pitch: high

Comparison of Baseline (in red) and SILA (in green) on "Duration"

#   Text Prompt                                      Generated Audio
1   Flowing water stream, & duration: 3 seconds
2   Flowing water stream, & duration: 3 seconds
3   Flowing water stream, & duration: 5 seconds
4   Flowing water stream, & duration: 5 seconds
5   Car honk, & duration: 3 seconds
6   Car honk, & duration: 3 seconds
7   Car honk, & duration: 5 seconds
8   Car honk, & duration: 5 seconds
9   Baby crying, & duration: 3 seconds
10  Baby crying, & duration: 3 seconds
11  Baby crying, & duration: 5 seconds
12  Baby crying, & duration: 5 seconds
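
For the duration prompts above, adherence can be checked directly from the sample count. The sketch below reads the requested duration out of a prompt and compares it with the length of a clip. The function names and the 0.25-second tolerance are hypothetical, and the zero-filled arrays stand in for generated audio.

```python
import re

import numpy as np


def requested_duration(prompt):
    """Return the duration in seconds named in the prompt, or None."""
    match = re.search(r"duration:\s*(\d+(?:\.\d+)?)\s*seconds?", prompt)
    return float(match.group(1)) if match else None


def actual_duration(signal, sample_rate):
    """Clip length in seconds, from the number of samples."""
    return len(signal) / sample_rate


def duration_matches(prompt, signal, sample_rate, tolerance=0.25):
    """True if the clip length is within `tolerance` seconds of the request."""
    target = requested_duration(prompt)
    if target is None:
        return False
    return abs(actual_duration(signal, sample_rate) - target) <= tolerance


SR = 16000
prompt = "Car honk, & duration: 3 seconds"
three_second_clip = np.zeros(3 * SR)
five_second_clip = np.zeros(5 * SR)

print(duration_matches(prompt, three_second_clip, SR))
print(duration_matches(prompt, five_second_clip, SR))
```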