A Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Overview of the MMAU-Pro benchmark. MMAU-Pro provides comprehensive coverage across all three core audio domains (speech, sound, and music) and extends evaluation to their mixtures. It further includes multi-audio reasoning, long-form audio (up to 10 minutes), voice-chat QA, spatial audio understanding, open-ended QA, and multimodal instruction following, offering a broad and realistic assessment of audio intelligence.
(Left) Distribution of the audio perception skills required by questions in MMAU-Pro across the domains of sound, speech, and music. (Right) Distribution of the auditory reasoning skills required by questions in MMAU-Pro. Each question in MMAU-Pro requires the model to apply one or more of these perception and reasoning skills to produce a reliable and accurate response.
| # | Model | Size | Average (%) |
|---|---|---|---|
Medals: 🥇 🥈 🥉 for top 3 (excluding Human/Random baselines).
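The medal rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule (rank by average, skip the Human and Random baselines, mark the top three). The row layout, the `assign_medals` helper, the baseline labels, and the scores are all hypothetical placeholders and are not taken from the actual leaderboard data.

```python
from typing import Dict, List

# Assumed baseline labels; the leaderboard's actual row names may differ.
BASELINES = {"Human", "Random"}
MEDALS = ["🥇", "🥈", "🥉"]


def assign_medals(rows: List[dict]) -> Dict[str, str]:
    """Map model name -> medal for the three highest averages, skipping baselines."""
    ranked = sorted(
        (r for r in rows if r["model"] not in BASELINES),
        key=lambda r: r["average"],
        reverse=True,  # highest average first
    )
    # zip stops after three entries because MEDALS has three items.
    return {r["model"]: medal for r, medal in zip(ranked, MEDALS)}


# Placeholder rows with made-up scores, not actual benchmark results.
rows = [
    {"model": "Human", "average": 90.0},
    {"model": "model-a", "average": 50.0},
    {"model": "model-b", "average": 60.0},
    {"model": "model-c", "average": 40.0},
    {"model": "model-d", "average": 30.0},
    {"model": "Random", "average": 25.0},
]
medals = assign_medals(rows)
```

Because Python's sort is stable, tied averages keep their input order in this sketch; how the real leaderboard breaks ties is not stated.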
@article{kumar2025mmau,
  title={MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence},
  author={Kumar, Sonal and Sedl{\'a}{\v{c}}ek, {\v{S}}imon and Lokegaonkar, Vaibhavi and L{\'o}pez, Fernando and Yu, Wenyi and Anand, Nishit and Ryu, Hyeonggon and Chen, Lichang and Pli{\v{c}}ka, Maxim and Hlav{\'a}{\v{c}}ek, Miroslav and others},
  journal={arXiv preprint arXiv:2508.13992},
  year={2025}
}