Timestamped Audio Captioning

Sonal Kumar1,2,*, Prem Seetharaman1,*, Ke Chen1, Oriol Nieto1, Jiaqi Su1,
Zhepei Wang1, Rithesh Kumar3, Dinesh Manocha2, Nicholas J. Bryan1, Zeyu Jin1, Justin Salamon1
1Adobe Research, USA 2University of Maryland, College Park, USA 3OpenAI, USA (work done while at Adobe)
*Equal contribution
Correspondence to: Sonal Kumar, Prem Seetharaman

TAC produces timestamped captions for any audio or audiovisual source, spanning general sound, music, and speech.

TAC: Timestamped Audio Captioning

Structured, timestamped descriptions of overlapping sound events with type tags ([music], [sfx], [speech]) and precise temporal boundaries.
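
To make the output format concrete, here is a minimal sketch of how such a record could be represented once parsed into Python. The field names, tag strings, and example events are illustrative assumptions, not TAC's actual output schema.

# Illustrative only: a hypothetical parsed form of a timestamped caption.
# The tag set and start/end fields mirror the description above, but the
# real TAC output format may differ.
from dataclasses import dataclass

@dataclass
class CaptionedEvent:
    start: float   # event onset in seconds
    end: float     # event offset in seconds
    tag: str       # "music", "sfx", or "speech"
    caption: str   # free-text description of the event

scene = [
    CaptionedEvent(0.0, 4.2, "music", "soft piano underscore"),
    CaptionedEvent(1.3, 2.1, "sfx", "door creaks open"),
    CaptionedEvent(2.0, 4.0, "speech", "a man greets someone calmly"),
]

# Overlap is allowed, so "what is active at time t?" is a simple filter.
def active_at(events, t):
    return [e.tag for e in events if e.start <= t <= e.end]

print(active_at(scene, 2.05))  # ['music', 'sfx', 'speech']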

TAC-V: Timestamped Audio-Visual Captioning

Fuses TAC outputs with vision-language models to produce temporally dense audio-visual captions with hallucination correction and visual grounding.
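
As a rough sketch of the fusion idea, the snippet below combines timestamped audio events with frame-level captions from a vision-language model and asks a text model to reconcile them. The function names, prompt wording, and reconciliation strategy are assumptions for illustration, not the actual TAC-V pipeline.

# Hypothetical audio-visual fusion step. `llm` stands in for any callable
# that maps a prompt string to a completion string.
def fuse_av_captions(audio_events, frame_captions, llm):
    """audio_events: list of (start, end, tag, caption) tuples from TAC.
    frame_captions: list of (timestamp, caption) tuples from a VLM."""
    audio_lines = [f"[{s:.1f}-{e:.1f}s][{tag}] {cap}"
                   for s, e, tag, cap in audio_events]
    visual_lines = [f"[{t:.1f}s] {cap}" for t, cap in frame_captions]
    prompt = (
        "Audio events:\n" + "\n".join(audio_lines) + "\n\n"
        "Visual frame captions:\n" + "\n".join(visual_lines) + "\n\n"
        "Merge these into one timestamped audio-visual description. "
        "Flag and correct audio events that the visuals contradict."
    )
    return llm(prompt)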

TAC→LLM Cascade

TAC (and TAC-V) captions serve as a "semantic bridge" for text-only reasoners, achieving SOTA on MMAU-Pro, MMSU, MMAR, Daily-Omni, and Video-Holmes.
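
A minimal sketch of such a cascade is shown below: timestamped captions are rendered as plain text and passed, together with the question, to a text-only LLM. The prompt wording and helper names are assumptions, not the exact setup behind the benchmark numbers reported further down.

# Hypothetical TAC -> LLM cascade. `tac_caption` and `llm` are placeholders
# for the actual model calls.
def answer_audio_question(audio_path, question, choices, tac_caption, llm):
    events = tac_caption(audio_path)  # -> list of (start, end, tag, caption)
    transcript = "\n".join(
        f"{s:.1f}-{e:.1f}s [{tag}]: {cap}" for s, e, tag, cap in events
    )
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        "You are given a timestamped description of an audio clip.\n"
        f"{transcript}\n\n"
        f"Question: {question}\nOptions:\n{options}\n"
        "Answer with the letter of the best option."
    )
    return llm(prompt)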

Benchmark Performance

Comparison with previous state-of-the-art models across audio and audio-visual reasoning tasks

Audio-Visual Reasoning Benchmarks

Audio-Only Reasoning Benchmarks

Captioning Examples

Explore our model's audio captioning capabilities across diverse content types

Video Captioning Examples

Dense audio-visual captions with timestamped events: video with embedded captions on the left, event timeline on the right

Audio-Visual Benchmark Results

Performance on challenging AV understanding benchmarks with reasoning traces

Audio-Only Benchmark Results

Performance on challenging audio understanding benchmarks with reasoning traces

About TAC

Paper Abstract

Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC→LLM and TAC-V→LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (Daily-Omni, Video-Holmes) understanding and reasoning, respectively.
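
The synthetic pipeline mentioned above builds labeled polyphonic mixtures from real recordings, so the placement of each source clip yields ground-truth timestamps for free. The toy sketch below illustrates that general idea; the random gains, offsets, and two-line mixing rule are assumptions, not the paper's recipe.

import random
import numpy as np

def make_mixture(sources, sr=16000, duration=10.0, seed=0):
    """Toy mixture builder: overlay labeled mono clips at random offsets.
    sources: list of (waveform, tag, caption) with float numpy waveforms.
    Returns the mixture and the event list implied by the placement."""
    rng = random.Random(seed)
    mix = np.zeros(int(duration * sr), dtype=np.float32)
    events = []
    for wav, tag, caption in sources:
        start = rng.uniform(0.0, max(0.0, duration - len(wav) / sr))
        i = int(start * sr)
        gain = rng.uniform(0.3, 1.0)              # random relative level
        mix[i:i + len(wav)] += gain * wav[:len(mix) - i]
        events.append((start, start + len(wav) / sr, tag, caption))
    return mix, events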

Results

Table 1a: Training Ablations & Baselines

Training ablations showing the impact of data sources and hyperparameters, plus baseline comparisons. ✓ = enabled, ✗ = disabled.

Configuration | Multitask | Pretrained | Templates | Acoustic Sim | TACOS | Iters | LoRA | TS Wt | EvtF1 ↑ | SegF1 | Hal% ↓ | Conf | Spec
TAC (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .50 | .71 | 4.9 | 0.89 | 0.74
Ablations
✗ Multitask | ✗ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .45 | .72 | 7.0 | 0.87 | 0.70
(merge=0.1) | ✗ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .41 | .71 | 13.8 | 0.80 | 0.70
✗ Pretrained | ✓ | ✗ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .49 | .70 | 8.8 | 0.85 | 0.70
✗ Templates | ✓ | ✓ | ✗ | ✓ | ✓ | 5k | 128 | 5.0 | .47 | .71 | 2.2 | 0.93 | 0.78
✗ Acoustic Sim | ✓ | ✓ | ✓ | ✗ | ✓ | 5k | 128 | 5.0 | .49 | .71 | 5.3 | 0.89 | 0.75
✗ TACOS | ✓ | ✓ | ✓ | ✓ | ✗ | 5k | 128 | 5.0 | .42 | .68 | 7.6 | 0.85 | 0.70
LoRA Rank
Rank 256 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 256 | 5.0 | .48 | .70 | 3.5 | 0.90 | 0.75
Rank 64 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 64 | 5.0 | .49 | .71 | 4.8 | 0.89 | 0.74
Rank 8 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 8 | 5.0 | .19 | .66 | 36.0 | 0.58 | 0.54
Timestamp Weight
Weight 1.0 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 1.0 | .48 | .71 | 4.2 | 0.91 | 0.76
Weight 10.0 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 10.0 | .48 | .71 | 5.8 | 0.88 | 0.73
Iterations
10k iterations | ✓ | ✓ | ✓ | ✓ | ✓ | 10k | 128 | 5.0 | .47 | .70 | 5.2 | 0.89 | 0.75
2.5k iterations | ✓ | ✓ | ✓ | ✓ | ✓ | 2.5k | 128 | 5.0 | .46 | .70 | 8.0 | 0.85 | 0.72
Baselines
Gemini 3 Pro | — | — | — | — | — | — | — | — | .42 | .64 | 6.1 | 0.84 | 0.66
Qwen3-Omni | — | — | — | — | — | — | — | — | .37 | .66 | 7.3 | 0.84 | 0.62
Audio Flamingo 3 | — | — | — | — | — | — | — | — | .27 | .55 | 11.6 | 0.73 | 0.59
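
For context on the metrics, EvtF1 in Table 1a is an event-level F1 score; in sound event detection this is commonly computed by matching predicted and reference events under an onset collar. The sketch below shows that style of scoring with a greedy matcher; the 0.5 s collar, label-equality check, and the decision to ignore offsets are assumptions that may differ from the paper's exact protocol.

def event_f1(ref, est, collar=0.5):
    """Toy event-based F1: an estimated event (start, end, label) counts as a
    true positive if an unmatched reference event shares the label and its
    onset lies within `collar` seconds. Offsets are ignored for brevity."""
    matched = set()
    tp = 0
    for s, e, lab in est:
        for j, (r_start, r_end, r_lab) in enumerate(ref):
            if j in matched:
                continue
            if lab == r_lab and abs(s - r_start) <= collar:
                matched.add(j)
                tp += 1
                break
    prec = tp / len(est) if est else 0.0
    rec = tp / len(ref) if ref else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0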

Table 1b: Inference Parameter Sweeps

Inference parameter sweeps on the TAC checkpoint; the top row is the selected configuration, and each group below varies one parameter at a time.

Style | Merge (δmerge) | Activity | Resolution (δres) | EvtF1 ↑ | SegF1 | Hal% ↓ | Conf | Spec
brief | 0.25 | 0.05 | 0.10 | .50 | .71 | 4.5 | 0.89 | 0.77
Style Variations
detailed | 0.25 | 0.05 | 0.10 | .49 | .71 | 8.0 | 0.86 | 0.72
keywords | 0.25 | 0.05 | 0.10 | .47 | .66 | 1.3 | 0.89 | 0.78
Merge Threshold (δmerge)
brief | 0.10 | 0.05 | 0.10 | .31 | .66 | 20.2 | 0.73 | 0.67
brief | 0.50 | 0.05 | 0.10 | .48 | .72 | 4.0 | 0.90 | 0.74
brief | 1.00 | 0.05 | 0.10 | .42 | .72 | 4.7 | 0.89 | 0.69
Activity Threshold
brief | 0.25 | 0.01 | 0.10 | .49 | .72 | 4.7 | 0.89 | 0.74
brief | 0.25 | 0.10 | 0.10 | .49 | .70 | 5.5 | 0.88 | 0.76
brief | 0.25 | 0.20 | 0.10 | .45 | .70 | 4.5 | 0.90 | 0.76
Resolution Threshold (δres)
brief | 0.25 | 0.05 | 0.01 | .43 | .71 | 11.8 | 0.83 | 0.73
brief | 0.25 | 0.05 | 0.50 | .48 | .70 | 5.4 | 0.88 | 0.77
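
The δmerge sweep above suggests a post-processing step that fuses nearby detections of the same event before scoring. The snippet below is one plausible reading of such a merge rule, based on the silent gap between consecutive events; it is an illustrative guess, not the documented behavior of δmerge.

def merge_events(events, delta_merge=0.25):
    """Illustrative merge step: fuse consecutive events that share a caption
    when the gap between them is at most delta_merge seconds.
    events: list of (start, end, caption) tuples."""
    merged = []
    for start, end, cap in sorted(events):
        if merged and merged[-1][2] == cap and start - merged[-1][1] <= delta_merge:
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, max(end, merged[-1][1]), cap)
        else:
            merged.append((start, end, cap))
    return merged

print(merge_events([(0.0, 1.0, "dog barks"), (1.1, 2.0, "dog barks"),
                    (3.0, 4.0, "dog barks")], delta_merge=0.25))
# [(0.0, 2.0, 'dog barks'), (3.0, 4.0, 'dog barks')]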

Table 2: Downstream Reasoning Benchmarks

Comparison of native multimodal LLMs against our cascade approach: TAC/TAC-V captions fed to a text-only reasoner.

Audio Understanding & Reasoning
Benchmark | Native LALM | Score | TAC + Qwen3 | TAC + Gemini3
MMAU | Audio Thinker | 75.9 | 73.9 | 72.2
  Sound |  | 78.8 | 79.7 | 79.6
  Music |  | 73.8 | 62.6 | 63.4
  Speech |  | 75.2 | 79.3 | 73.6
MMAR | Audio Flamingo 3 | 60.1 | 60.1 | 71.9
MMSU | Audio Flamingo 3 | 62.3 | 65.0 | 72.4
MMAU-Pro | Gemini 2.5 Flash | 59.2 | 62.5 | 62.9

Audio-Visual Understanding & Reasoning
Benchmark | Native MLLM | Score | VLM + Qwen3 | TAC-V + Qwen3 | TAC-V + Gemini3
Daily-Omni | Qwen3-Omni | 76.2 | 51.5 | 72.9 | 77.9
World-Sense | Gemini 2.5 Pro | 65.1 | 37.4 | 45.7 | 58.6
Video-Holmes | Qwen3-Omni | 57.3 | 45.6 | 47.7 | 59.2
AVHBench (AVH) | PandaGPT | 58.5 | 70.8 | 79.8 | 81.7
AVHBench (VAH) | PandaGPT | 61.3 | 51.8 | 76.1 | 76.6
AVHBench (AVM) | OneLLM | 60.1 | 50.5 | 56.7 | 61.6
AVHBench (AVC) | Video-LLaMA | 14.0 | 12.9 | 22.6 | 20.6

Citation


@misc{kumar2026tactimestampedaudiocaptioning,
  title={TAC: Timestamped Audio Captioning},
  author={Sonal Kumar and Prem Seetharaman and Ke Chen and Oriol Nieto and Jiaqi Su and Zhepei Wang and Rithesh Kumar and Dinesh Manocha and Nicholas J. Bryan and Zeyu Jin and Justin Salamon},
  year={2026},
  eprint={2602.15766},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2602.15766},
}