A Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Overview of the MMAU-Pro benchmark. MMAU-Pro provides comprehensive coverage across all three core audio domains-speech, sound, and music-and extends evaluation to their mixtures. It further includes multi-audio reasoning, long-form audio (up to 10 minutes), voice-chat QA, spatial audio understanding, open-ended QA, and multimodal instruction following, offering a broad and realistic assessment of audio intelligence.
(Left) Distribution of audio perception skills required for questions in the MMAU-Pro across the domains of sound, speech, and music. (Right) Distribution of auditory reasoning skills required for questions in MMAU-Pro. Each question in MMAU-Pro demands the model to apply one or more of the perception and reasoning skills to generate a reliable and accurate response.
# | Model | Size | Average (%) |
---|
Medals: 🥇 🥈 🥉 for top 3 (excluding Human/Random baselines).
results.json
Coming Soon }