Timestamped Audio Captioning
TAC produces timestamped captions for any audio or audiovisual source, across general sound, music, and speech.
TAC: Timestamped Audio Captioning
Structured, timestamped descriptions of overlapping sound events with type tags ([music], [sfx], [speech]) and precise temporal boundaries.
TAC-V: Timestamped Audio-Visual Captioning
Fuses TAC outputs with visual language models for temporally dense audio-visual captions with hallucination correction and visual grounding.
TAC→LLM Cascade
TAC serves as a "semantic bridge" for text-only reasoners, achieving SOTA on MMAU-Pro, MMSU, MMAR, Daily-Omni, and Video-Holmes.
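The cascade idea above is simple composition: TAC's timestamped captions become the textual context handed to an off-the-shelf LLM. A minimal sketch of that flow, where the caption serialization and prompt wording are illustrative assumptions rather than the paper's exact interface:

```python
# Sketch of a TAC -> LLM cascade: timestamped captions become the
# textual context for a text-only reasoner. The caption tuple format
# and prompt template here are illustrative assumptions.

def build_reasoning_prompt(captions, question):
    """Serialize timestamped captions into a prompt for a text-only LLM.

    captions: list of (start_s, end_s, tag, description) tuples, where
    tag is one of "music", "sfx", "speech".
    """
    lines = [
        f"[{start:05.1f}-{end:05.1f}] [{tag}] {desc}"
        for start, end, tag, desc in captions
    ]
    context = "\n".join(lines)
    return (
        "You are given a timestamped description of an audio clip:\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer concisely."
    )

captions = [
    (0.0, 4.2, "music", "upbeat jazz piano"),
    (2.1, 3.0, "sfx", "glass clinking"),
    (4.2, 9.8, "speech", "two people discussing a menu"),
]
prompt = build_reasoning_prompt(
    captions, "What setting is this likely recorded in?"
)
```

The resulting string would then be sent to any text-only model (e.g. Qwen3 or Gemini 3 in the tables below); no audio tokens ever reach the reasoner.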
Benchmark Performance
Comparison with previous state-of-the-art models across audio and audio-visual reasoning tasks
Audio-Visual Reasoning Benchmarks
Audio-Only Reasoning Benchmarks
Captioning Examples
Explore our model's audio captioning capabilities across diverse content types
Video Captioning Examples
Dense audio-visual captions with timestamped events: video with embedded captions on the left, event timeline on the right
Audio-Visual Benchmark Results
Performance on challenging AV understanding benchmarks with reasoning traces
Audio-Only Benchmark Results
Performance on challenging audio understanding benchmarks with reasoning traces
About TAC
Paper Abstract
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC→LLM and TAC-V→LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (Daily-Omni, Video-Holmes) understanding and reasoning, respectively.
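The abstract describes structured, timestamped descriptions with type tags. To make that concrete, here is a sketch of turning such output into typed events; the exact serialization TAC emits is not specified on this page, so the line format used below ("start-end [tag] description") is an assumption for illustration:

```python
import re
from dataclasses import dataclass

# Sketch of parsing TAC-style output into structured events.
# The "start-end [tag] description" line format is an assumption,
# not TAC's documented output format.

@dataclass
class AudioEvent:
    start: float       # onset in seconds
    end: float         # offset in seconds
    tag: str           # one of "music", "sfx", "speech"
    description: str

LINE_RE = re.compile(
    r"^\s*([\d.]+)\s*-\s*([\d.]+)\s*\[(music|sfx|speech)\]\s*(.+)$"
)

def parse_caption(text):
    """Parse one event per line; silently skip lines that do not match."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            start, end, tag, desc = m.groups()
            events.append(AudioEvent(float(start), float(end), tag, desc.strip()))
    return events

raw = """0.0-4.2 [music] upbeat jazz piano
2.1-3.0 [sfx] glass clinking
4.2-9.8 [speech] two people discussing a menu"""
events = parse_caption(raw)
# Overlapping events (music and sfx from 2.1-3.0) coexist in the list,
# matching the polyphonic scenes the abstract describes.
```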
Results
Table 1a: Training Ablations & Baselines
Training ablations showing the impact of data sources and hyperparameters, plus baseline comparisons. ✓ = enabled, ✗ = disabled.
| Configuration | Multitask | Pretrained | Templates | Acoustic Sim | TACOS | Iters | LoRA | TS Wt | EvtF1 ↑ | SegF1 | Hal% ↓ | Conf | Spec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TAC (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .50 | .71 | 4.9 | 0.89 | 0.74 |
| Ablations | |||||||||||||
| − Multitask | ✗ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .45 | .72 | 7.0 | 0.87 | 0.70 |
| (merge=0.1) | ✗ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .41 | .71 | 13.8 | 0.80 | 0.70 |
| − Pretrained | ✓ | ✗ | ✓ | ✓ | ✓ | 5k | 128 | 5.0 | .49 | .70 | 8.8 | 0.85 | 0.70 |
| − Templates | ✓ | ✓ | ✗ | ✓ | ✓ | 5k | 128 | 5.0 | .47 | .71 | 2.2 | 0.93 | 0.78 |
| − Acoustic Sim | ✓ | ✓ | ✓ | ✗ | ✓ | 5k | 128 | 5.0 | .49 | .71 | 5.3 | 0.89 | 0.75 |
| − TACOS | ✓ | ✓ | ✓ | ✓ | ✗ | 5k | 128 | 5.0 | .42 | .68 | 7.6 | 0.85 | 0.70 |
| LoRA Rank | |||||||||||||
| Rank 256 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 256 | 5.0 | .48 | .70 | 3.5 | 0.90 | 0.75 |
| Rank 64 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 64 | 5.0 | .49 | .71 | 4.8 | 0.89 | 0.74 |
| Rank 8 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 8 | 5.0 | .19 | .66 | 36.0 | 0.58 | 0.54 |
| Timestamp Weight | |||||||||||||
| Weight 1.0 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 1.0 | .48 | .71 | 4.2 | 0.91 | 0.76 |
| Weight 10.0 | ✓ | ✓ | ✓ | ✓ | ✓ | 5k | 128 | 10.0 | .48 | .71 | 5.8 | 0.88 | 0.73 |
| Iterations | |||||||||||||
| 10k iterations | ✓ | ✓ | ✓ | ✓ | ✓ | 10k | 128 | 5.0 | .47 | .70 | 5.2 | 0.89 | 0.75 |
| 2.5k iterations | ✓ | ✓ | ✓ | ✓ | ✓ | 2.5k | 128 | 5.0 | .46 | .70 | 8.0 | 0.85 | 0.72 |
| Baselines | |||||||||||||
| Gemini 3 Pro | – | – | – | – | – | – | – | – | .42 | .64 | 6.1 | 0.84 | 0.66 |
| Qwen3-Omni | – | – | – | – | – | – | – | – | .37 | .66 | 7.3 | 0.84 | 0.62 |
| Audio Flamingo 3 | – | – | – | – | – | – | – | – | .27 | .55 | 11.6 | 0.73 | 0.59 |
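EvtF1 in Table 1a is an event-level F1. A common way to compute this in sound event detection is to count a prediction as a true positive when its onset falls within a fixed collar of an unmatched reference event with the same label; the sketch below implements that standard convention, and the paper's exact matching rule (collar width, offset criteria) may differ:

```python
def event_f1(refs, preds, collar=0.5):
    """Event-level F1 with an onset collar, as in common SED evaluation.

    refs, preds: lists of (onset_s, offset_s, label). A prediction matches
    an unused reference with the same label if |onset difference| <= collar.
    The 0.5 s collar is a conventional default, not TAC's documented setting.
    """
    used = [False] * len(refs)
    tp = 0
    for p_on, p_off, p_lab in preds:
        for i, (r_on, r_off, r_lab) in enumerate(refs):
            if not used[i] and p_lab == r_lab and abs(p_on - r_on) <= collar:
                used[i] = True
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(refs) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

refs = [(0.0, 4.2, "music"), (2.1, 3.0, "sfx")]
preds = [(0.3, 4.0, "music"), (5.0, 6.0, "speech")]
score = event_f1(refs, preds)  # tp=1, fp=1, fn=1 -> P = R = F1 = 0.5
```

Under this metric, hallucinated events (the Hal% column) show up directly as false positives that depress precision.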
Table 1b: Inference Parameter Sweeps
Inference parameter sweeps on the TAC checkpoint. Best configuration shown in bold.
| Style | Merge (δmerge) | Activity | Resolution (δres) | EvtF1 ↑ | SegF1 | Hal% ↓ | Conf | Spec |
|---|---|---|---|---|---|---|---|---|
| **brief** | **0.25** | **0.05** | **0.10** | **.50** | **.71** | **4.5** | **0.89** | **0.77** |
| Style Variations | ||||||||
| detailed | 0.25 | 0.05 | 0.10 | .49 | .71 | 8.0 | 0.86 | 0.72 |
| keywords | 0.25 | 0.05 | 0.10 | .47 | .66 | 1.3 | 0.89 | 0.78 |
| Merge Threshold (δmerge) | ||||||||
| brief | 0.10 | 0.05 | 0.10 | .31 | .66 | 20.2 | 0.73 | 0.67 |
| brief | 0.50 | 0.05 | 0.10 | .48 | .72 | 4.0 | 0.90 | 0.74 |
| brief | 1.00 | 0.05 | 0.10 | .42 | .72 | 4.7 | 0.89 | 0.69 |
| Activity Threshold | ||||||||
| brief | 0.25 | 0.01 | 0.10 | .49 | .72 | 4.7 | 0.89 | 0.74 |
| brief | 0.25 | 0.10 | 0.10 | .49 | .70 | 5.5 | 0.88 | 0.76 |
| brief | 0.25 | 0.20 | 0.10 | .45 | .70 | 4.5 | 0.90 | 0.76 |
| Resolution Threshold (δres) | ||||||||
| brief | 0.25 | 0.05 | 0.01 | .43 | .71 | 11.8 | 0.83 | 0.73 |
| brief | 0.25 | 0.05 | 0.50 | .48 | .70 | 5.4 | 0.88 | 0.77 |
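The δmerge sweep above controls how aggressively nearby detections of the same event are fused; note how the smallest threshold (0.10) fragments events and sharply inflates the hallucination rate. A sketch of gap-based merging consistent with that reading, where the exact rule is an assumption for illustration:

```python
def merge_events(events, delta_merge=0.25):
    """Merge same-label events whose temporal gap is below delta_merge.

    events: list of (onset_s, offset_s, label) tuples. Events are grouped
    by label, then adjacent detections separated by a gap <= delta_merge
    seconds are fused into one longer event. This is an interpretation of
    the table's merge threshold, not TAC's documented implementation.
    """
    merged = []
    for on, off, lab in sorted(events, key=lambda e: (e[2], e[0])):
        if merged and merged[-1][2] == lab and on - merged[-1][1] <= delta_merge:
            prev_on, prev_off, _ = merged[-1]
            merged[-1] = (prev_on, max(prev_off, off), lab)
        else:
            merged.append((on, off, lab))
    return merged

events = [(0.0, 1.0, "speech"), (1.1, 2.0, "speech"), (4.0, 5.0, "speech")]
merged = merge_events(events, delta_merge=0.25)
# -> [(0.0, 2.0, "speech"), (4.0, 5.0, "speech")]
```

With a tiny threshold, the two utterances at 0.0-1.0 and 1.1-2.0 would stay split, doubling the event count the evaluator must match, which is one plausible mechanism behind the Hal% spike at δmerge = 0.10.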
Table 2: Downstream Reasoning Benchmarks
Comparison of native multimodal LLMs against our cascade approach: TAC/TAC-V captions fed to a text-only reasoner.
Audio Understanding & Reasoning
| Benchmark | Native LALM | Score | TAC + Qwen3 | TAC + Gemini3 |
|---|---|---|---|---|
| MMAU | Audio Thinker | 75.9 | 73.9 | 72.2 |
| – Sound | | 78.8 | 79.7 | 79.6 |
| – Music | | 73.8 | 62.6 | 63.4 |
| – Speech | | 75.2 | 79.3 | 73.6 |
| MMAR | Audio Flamingo 3 | 60.1 | 60.1 | 71.9 |
| MMSU | Audio Flamingo 3 | 62.3 | 65.0 | 72.4 |
| MMAU-Pro | Gemini 2.5 Flash | 59.2 | 62.5 | 62.9 |
Audio-Visual Understanding & Reasoning
| Benchmark | Native MLLM | Score | VLM + Qwen3 | TAC-V + Qwen3 | TAC-V + Gemini3 |
|---|---|---|---|---|---|
| Daily-Omni | Qwen3-Omni | 76.2 | 51.5 | 72.9 | 77.9 |
| World-Sense | Gemini 2.5 Pro | 65.1 | 37.4 | 45.7 | 58.6 |
| Video-Holmes | Qwen3-Omni | 57.3 | 45.6 | 47.7 | 59.2 |
| AVHBench (AVH) | PandaGPT | 58.5 | 70.8 | 79.8 | 81.7 |
| AVHBench (VAH) | PandaGPT | 61.3 | 51.8 | 76.1 | 76.6 |
| AVHBench (AVM) | OneLLM | 60.1 | 50.5 | 56.7 | 61.6 |
| AVHBench (AVC) | Video-LLaMA | 14.0 | 12.9 | 22.6 | 20.6 |
Resources
Citation
@misc{kumar2026tactimestampedaudiocaptioning,
title={TAC: Timestamped Audio Captioning},
author={Sonal Kumar and Prem Seetharaman and Ke Chen and Oriol Nieto and Jiaqi Su and Zhepei Wang and Rithesh Kumar and Dinesh Manocha and Nicholas J. Bryan and Zeyu Jin and Justin Salamon},
year={2026},
eprint={2602.15766},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2602.15766},
}