
MMAU‑Pro

A Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

Authors

Examples

Overview of the MMAU-Pro benchmark. MMAU-Pro covers all three core audio domains (speech, sound, and music) and extends evaluation to their mixtures. It also includes multi-audio reasoning, long-form audio (up to 10 minutes), voice-chat QA, spatial audio understanding, open-ended QA, and multimodal instruction following, offering a broad and realistic assessment of audio intelligence.

Abstract

Highlights

    Skill Coverage

    (Left) Distribution of the audio perception skills required by questions in MMAU-Pro across the sound, speech, and music domains. (Right) Distribution of the auditory reasoning skills required by questions in MMAU-Pro. Each question requires the model to apply one or more of these perception and reasoning skills to produce a reliable and accurate response.
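Because each question can require more than one skill, a distribution like the one described above is a multi-label count: a single question contributes to every skill it is tagged with, so the shares can sum to more than 1. The sketch below shows one way to compute such shares. It is only an illustration under stated assumptions: the `skills` field and the skill labels are hypothetical, and the sample questions are made up rather than taken from the benchmark.

```python
from collections import Counter

def skill_distribution(questions):
    """Return the fraction of questions that require each skill.

    `questions` is a list of dicts with a hypothetical 'skills' key holding
    the skill labels for that question. A question with several skills is
    counted once for each distinct skill.
    """
    counts = Counter(
        skill for q in questions for skill in set(q["skills"])
    )
    total = len(questions)
    return {skill: n / total for skill, n in counts.items()}

# Made-up questions with placeholder skill labels (not benchmark data).
sample_questions = [
    {"id": 1, "skills": ["perception_a", "reasoning_x"]},
    {"id": 2, "skills": ["perception_a"]},
    {"id": 3, "skills": ["perception_b", "reasoning_x"]},
    {"id": 4, "skills": ["reasoning_x"]},
]

for skill, share in sorted(skill_distribution(sample_questions).items()):
    print(f"{skill}: {share:.2f}")
```

Dividing by the number of questions rather than the number of skill tags keeps each share interpretable as "fraction of questions needing this skill".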


    What the skills test

    • Speech: ASR‑plus reasoning (semantics, coreference, intent).
    • Sound: non‑speech events; causal and physical reasoning.
    • Music: instruments, rhythm, theory descriptors.
    • Spatial: binaural cues, relative positions, motion.
    • Multi‑audio: mixture attribution, stream segregation.
    • Voice‑chat: persona, prosody, multi‑turn QA.
    • Instruction following: constrained multi‑step tasks.

    At a Glance

    Breakdown

    Model Performance — Leaderboard

    Sorted by overall average (descending).
    Columns: #, Model, Size, Average (%)

    Medals: 🥇 🥈 🥉 for top 3 (excluding Human/Random baselines).
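The ranking rule stated above (sort by overall average in descending order, then give medals to the top three entries that are not Human or Random baselines) can be sketched as follows. This is a minimal illustration, not the page's actual script: the keys `model`, `average`, and `is_baseline` are hypothetical, since the schema of results.json is not shown here, and the sample rows and scores are invented placeholders. It also assumes that baselines keep their position in the sorted list and are simply skipped when medals are handed out.

```python
MEDALS = ["🥇", "🥈", "🥉"]

def rank_with_medals(rows):
    """Sort rows by average (descending) and tag the top 3 non-baseline rows.

    Each row is a dict with hypothetical keys: 'model', 'average',
    'is_baseline'. Baseline rows are ranked but never receive a medal.
    """
    ranked = sorted(rows, key=lambda r: r["average"], reverse=True)
    medals = iter(MEDALS)
    out = []
    for position, row in enumerate(ranked, start=1):
        # Only non-baseline rows consume a medal; later rows get "".
        medal = "" if row["is_baseline"] else next(medals, "")
        out.append({**row, "rank": position, "medal": medal})
    return out

# Made-up rows for illustration only; these are not benchmark results.
sample = [
    {"model": "model_a", "average": 41.0, "is_baseline": False},
    {"model": "Human", "average": 90.0, "is_baseline": True},
    {"model": "model_b", "average": 55.5, "is_baseline": False},
    {"model": "Random", "average": 25.0, "is_baseline": True},
    {"model": "model_c", "average": 48.2, "is_baseline": False},
    {"model": "model_d", "average": 30.1, "is_baseline": False},
]

for row in rank_with_medals(sample):
    print(row["rank"], row["medal"], row["model"], row["average"])
```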

    Full Results Table

    Data source: results.json

    BibTeX

      Coming Soon
    

    Links & Contact

    • 📄 Paper (PDF): Open
    • 🐙 Code: Repository
    • 🔗 Dataset: Landing page
    • ✉️ Corresponding author: sonalkum@umd.edu
    • Affiliations: University of Maryland, Brno University of Technology, Universidad Autónoma de Madrid, Tsinghua University, KAIST, and others.