¹University of Maryland, College Park, MD, USA
²Adobe Research, San Francisco, CA, USA
sonalkum@umd.edu
Abstract: Text-to-audio generation has advanced rapidly, yet fine-grained control over the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to sound-effect generation that lets creators and sound designers manipulate key acoustic parameters, such as loudness, pitch, reverb, fade, brightness, and noise, during the generation process. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques by incorporating learned representations that capture how sound characteristics can be shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning to disentangle audio from its acoustic features. It enhances the versatility and expressiveness of text-to-audio generation and opens new avenues for creative audio production and sound design. Our results show that the approach produces high-quality, customizable audio that aligns closely with user specifications.
Figure 1: Illustration of key components of SILA.
# | Text Prompt | Generated Audio |
---|---|---|
1 | Explosion, & loudness: very loud | |
2 | Explosion, & loudness: very soft | |
3 | The deep rumble of the storm echoes through the sky, & loudness: soft. | |
4 | The deep rumble of the storm echoes through the sky, & loudness: very loud. | |
5 | Futuristic sci-fi swish and whoosh, swish & fade: out | |
6 | Big metal object hitting a large metal tank with a deep sound & pitch: low, & duration: 5 seconds & reverb: slightly wet | |
7 | Consecutive footsteps in dress shoes, echoing on a hard floor surface & fade: in & noise: silent background & reverb: wet | |
8 | A dog barking nearby, & reverb: dry. | |
9 | A dog barking nearby, & reverb: wet. | |
10 | Gunshots being fired, & reverb: dry. | |
11 | Gunshots being fired, & reverb: wet. | |
12 | Footsteps on a wooden floor, & reverb: dry. | |
13 | Footsteps on a wooden floor, & reverb: very wet. | |
14 | A joyful man giggling, & reverb: very wet. | |
15 | Continuous pouring of rain, & reverb: wet. | |
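
The prompts in the table above attach control tags of the form `& key: value` to a plain caption. The following is a minimal Python sketch of building and parsing such prompts. The helper names `build_prompt` and `parse_prompt` and the `CONTROL_KEYS` list are hypothetical, inferred only from the example prompts on this page; they are not part of any released SILA code.

```python
import re

# Control keys that appear in the example prompts on this page
# (an assumed list, not an official specification).
CONTROL_KEYS = ("loudness", "pitch", "reverb", "fade", "noise", "duration")

# Matches the "&" (optionally preceded by a comma) that introduces a known tag.
_TAG_SPLIT = re.compile(r"\s*,?\s*&\s*(?=(?:%s)\s*:)" % "|".join(CONTROL_KEYS))


def build_prompt(caption, controls):
    """Append '& key: value' tags to a caption."""
    tags = " & ".join(f"{key}: {value}" for key, value in controls.items())
    return f"{caption}, & {tags}" if tags else caption


def parse_prompt(prompt):
    """Split a tagged prompt into (caption, {key: value})."""
    caption, *tags = _TAG_SPLIT.split(prompt.strip())
    controls = {}
    for tag in tags:
        key, value = tag.split(":", 1)
        # Drop trailing punctuation such as the final period of a prompt.
        controls[key.strip()] = value.strip(" .,")
    return caption.strip(), controls


if __name__ == "__main__":
    print(build_prompt("A dog barking nearby", {"reverb": "wet"}))
    print(parse_prompt("Footsteps on a wooden floor, & reverb: very wet."))
```

Splitting only on an `&` that is followed by a known key keeps commas inside the caption intact, which matters for prompts such as row 7.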
Figure 2: Comparison of average acoustic characteristic values. SILA outperforms the baseline, with scores within the expected range for each feature, indicating improved disentanglement between audio and its characteristics.
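
This page does not state how the acoustic characteristic values in Figure 2 were measured. As a rough illustration of how one might check a generated clip against a `loudness` or `duration` tag, the sketch below computes an RMS level in dBFS and a clip length from raw samples. The function names and the synthetic sine-wave input are assumptions for illustration only, not the evaluation protocol used for SILA.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (amplitude 1.0)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    # Floor the RMS to avoid log10(0) on digital silence.
    return 20.0 * math.log10(max(math.sqrt(mean_square), 1e-12))


def duration_seconds(samples, sample_rate):
    """Clip length in seconds."""
    return len(samples) / sample_rate


def sine(freq_hz, amplitude, seconds, sample_rate):
    """Synthetic test tone standing in for a generated clip."""
    n = int(seconds * sample_rate)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


if __name__ == "__main__":
    sr = 8000
    loud = sine(440.0, 0.8, 3.0, sr)  # stand-in for "loudness: very loud"
    soft = sine(440.0, 0.1, 3.0, sr)  # stand-in for "loudness: very soft"
    print(rms_dbfs(loud), rms_dbfs(soft), duration_seconds(loud, sr))
```

With measurements like these, a pair of clips generated from the same caption with `very loud` and `very soft` tags would be expected to show a clear gap in RMS level, and a `duration: 3 seconds` clip a length close to 3 seconds.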
# | Text Prompt | Generated Audio |
---|---|---|
1 | Dog barking, & reverb: dry. | |
2 | Dog barking, & reverb: dry. | |
3 | Dog barking, & reverb: wet. | |
4 | Dog barking, & reverb: wet. | |
5 | Gun shot, & reverb: dry. | |
6 | Gun shot, & reverb: dry. | |
7 | Gun shot, & reverb: wet. | |
8 | Gun shot, & reverb: wet. | |
9 | Cat meow, & reverb: dry. | |
10 | Cat meow, & reverb: dry. | |
11 | Cat meow, & reverb: very wet. | |
12 | Cat meow, & reverb: very wet. | |
# | Text Prompt | Generated Audio |
---|---|---|
1 | Metal utensils clanking, & noise: silent | |
2 | Metal utensils clanking, & noise: silent | |
3 | Metal utensils clanking, & noise: noisy | |
4 | Metal utensils clanking, & noise: noisy | |
5 | Futuristic sci-fi swish, & noise: silent | |
6 | Futuristic sci-fi swish, & noise: silent | |
7 | Futuristic sci-fi swish, & noise: noisy | |
8 | Futuristic sci-fi swish, & noise: noisy | |
9 | Glass explosion, & noise: silent | |
10 | Glass explosion, & noise: silent | |
11 | Glass explosion, & noise: noisy | |
12 | Glass explosion, & noise: noisy | |
13 | whoosh, & noise: silent | |
14 | whoosh, & noise: silent | |
15 | whoosh, & noise: noisy | |
16 | whoosh, & noise: noisy | |
# | Text Prompt | Generated Audio |
---|---|---|
1 | Car honk, & pitch: low | |
2 | Car honk, & pitch: low | |
3 | Car honk, & pitch: high | |
4 | Car honk, & pitch: high | |
5 | Gun shot, & pitch: low | |
6 | Gun shot, & pitch: low | |
7 | Gun shot, & pitch: high | |
8 | Gun shot, & pitch: high | |
# | Text Prompt | Generated Audio |
---|---|---|
1 | Flowing water stream, & duration: 3 seconds | |
2 | Flowing water stream, & duration: 3 seconds | |
3 | Flowing water stream, & duration: 5 seconds | |
4 | Flowing water stream, & duration: 5 seconds | |
5 | Car honk, & duration: 3 seconds | |
6 | Car honk, & duration: 3 seconds | |
7 | Car honk, & duration: 5 seconds | |
8 | Car honk, & duration: 5 seconds | |
9 | Baby crying, & duration: 3 seconds | |
10 | Baby crying, & duration: 3 seconds | |
11 | Baby crying, & duration: 5 seconds | |
12 | Baby crying, & duration: 5 seconds | |