Sonal Kumar | Research

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

We propose GAMA, a novel Large Audio-Language Model (LALM) that is capable of responding accurately to complex questions about input audio. GAMA benefits from a mixture of encoders and from synthetic data generated using a new data generation pipeline. GAMA currently stands as the state-of-the-art LALM on various audio understanding, reasoning, and hallucination benchmarks.

arXiv Homepage Code GAMA Demo GAMA-IT Demo

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

We introduce MMAU (Massive Multi-Task Audio Understanding and Reasoning Benchmark), a comprehensive benchmark designed to evaluate Large Audio-Language Models (LALMs) on tasks that demand expert-level knowledge and complex reasoning. MMAU includes 10,000 meticulously curated audio clips paired with human-annotated natural language questions and answers, covering speech, environmental sounds, and music. The benchmark features information extraction and reasoning questions that require models to demonstrate 27 distinct skills across unique and challenging tasks. Notably, even the advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves 52.50%, underscoring significant potential for improvement.

arXiv Homepage Code EvalAI

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

We present Synthio, a novel method for generating synthetic data specifically for audio classification. Our approach first involves aligning a Text-to-Audio generation model with the target dataset through preference optimization. We then introduce an iterative prompting method with large language models (LLMs) to generate diverse and consistent audio captions, which are used to prompt the Text-to-Audio generation model for synthetic data creation. By augmenting small-scale audio classification datasets with data generated by Synthio, we achieve up to a 39% performance improvement on benchmark datasets.

arXiv Code Demo

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

We introduce CompA, a benchmark specifically designed to address gaps in compositional reasoning in audio-language models (ALMs). CompA includes two expert-annotated benchmarks: CompA-order, which evaluates how well an ALM understands the sequence of acoustic events, and CompA-attribute, which tests the model’s ability to associate attributes with specific sounds. Each test instance contains audio-caption pairs with the same events but in varying compositions, challenging the model to match audio accurately to captions. Using CompA, we demonstrate that current ALMs, including CLAP, struggle with complex compositional reasoning. To improve performance, we propose CompA-CLAP, a fine-tuned model that leverages compositionally-aware hard negatives and a new modular contrastive learning objective, significantly enhancing compositional reasoning capabilities across both benchmarks.
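As a rough illustration of why hard negatives matter in contrastive training, the sketch below computes a standard InfoNCE-style loss for a single audio clip. It is a generic formulation, not CompA-CLAP's modular objective; the function name, the similarity values, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def info_nce_with_hard_negatives(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE-style loss for one audio clip (illustrative sketch only).

    pos_sim:  similarity between the audio and its correct caption.
    neg_sims: similarities between the audio and negative captions,
              e.g. captions with the same events in a different order.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive caption.
    return log_denominator - logits[0]


# A negative that scores almost as high as the positive (a "hard" negative)
# yields a larger loss than one that is clearly different (an "easy" one).
hard_loss = info_nce_with_hard_negatives(0.80, [0.78])
easy_loss = info_nce_with_hard_negatives(0.80, [0.10])
```

Under these assumptions, the hard negative contributes a much larger training signal than the easy one, which is the intuition behind mining captions that differ only in composition.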

arXiv Homepage Code

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

We introduce EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised approach for speech representation learning. EH-MAM enables better learning from unsupervised data by using an adaptive masking strategy that gradually increases the difficulty of the pretext SSL task and by selectively reconstructing challenging regions within the speech input. EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%.
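The idea of an easy-to-hard masking schedule can be sketched as follows. This is a minimal illustration, not the EH-MAM implementation: it assumes per-frame difficulty scores are already available (for example, a predicted reconstruction loss), and the linear mix between random and hardest-frame masking, the function name, and all parameters are hypothetical.

```python
import random


def easy_to_hard_mask(difficulty, progress, mask_ratio=0.5, seed=0):
    """Return the sorted indices of frames to mask (illustrative sketch).

    difficulty: per-frame difficulty scores (hypothetical input).
    progress:   training progress in [0, 1]; 0 masks random frames only,
                1 masks only the hardest frames.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    n = len(difficulty)
    n_mask = int(round(n * mask_ratio))
    n_hard = int(round(n_mask * progress))
    # Frame indices ordered from hardest to easiest.
    by_difficulty = sorted(range(n), key=lambda i: difficulty[i], reverse=True)
    hard = by_difficulty[:n_hard]
    # Fill the rest of the mask budget with randomly chosen remaining frames.
    rng = random.Random(seed)
    rest = rng.sample(by_difficulty[n_hard:], n_mask - n_hard)
    return sorted(hard + rest)


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
early_mask = easy_to_hard_mask(scores, progress=0.0)  # random frames only
late_mask = easy_to_hard_mask(scores, progress=1.0)   # hardest frames only
```

Early in training the mask is random, so the reconstruction task is comparatively easy; as `progress` approaches 1, the mask concentrates on the frames scored as hardest.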

arXiv Code Checkpoint