I am a Ph.D. student in Computer Science at the University of Maryland, College Park, working under Prof. Dinesh Manocha and Prof. Ramani Duraiswami. As a member of the Gamma Lab and the PIRL Lab, I work on audio-language models, multimodal learning, and advanced reasoning in speech and audio processing.
Previously, I was a Software Engineer II at Cisco Systems, and a Research Scientist Intern at Adobe, where I focused on audio generative models. I graduated with a B.Tech in Computer Science from Christ University in 2021, where I served as President of Neuron, the university's first AI club.
Audio, speech & multimodal research · Full Google Scholar profile →
We introduce TAC, a system for generating audio captions with precise timestamps, enabling fine-grained alignment between audio events and their textual descriptions. TAC advances the state of the art in temporal audio understanding and supports downstream applications like audio retrieval and audio question answering.
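To make the output format concrete, here is a toy sketch of timestamped captions as (onset, offset, text) tuples scored with temporal IoU; the tuple format and matching rule are illustrative, not TAC's actual implementation.

```python
# Toy sketch: timestamped captions as (onset_s, offset_s, text) tuples,
# scored with temporal intersection-over-union. Illustrative only.

def temporal_iou(pred, ref):
    """IoU of two [onset, offset] intervals in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / union if union > 0 else 0.0

predicted = [(0.0, 2.1, "a dog barks twice"), (2.5, 6.0, "rain on a window")]
reference = [(0.2, 2.0, "dog barking"), (2.4, 6.3, "rainfall")]

for p, r in zip(predicted, reference):
    print(f"{p[2]!r} vs {r[2]!r}: IoU = {temporal_iou(p[:2], r[:2]):.2f}")
```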
Audio Flamingo 2 is a large audio-language model that understands long, minutes-scale audio and performs complex multi-step reasoning about its content. By combining a powerful audio encoder with a long-context language model, AF2 achieves state-of-the-art performance on multiple benchmarks while maintaining efficiency.
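As a rough illustration of the encoder-plus-LM pattern (shapes only; the modules below are placeholders, not AF2's architecture), audio frames can be projected into the language model's embedding space and prepended to the text tokens:

```python
# Placeholder sketch of an encoder-to-LM interface: project audio
# frames to the LM width and prepend them to the text embeddings.
import torch, torch.nn as nn

audio_encoder = nn.Linear(64, 512)          # stand-in: mel frames -> LM width
lm_embed = nn.Embedding(32000, 512)         # stand-in token embeddings

mel = torch.randn(1, 3000, 64)              # long, pooled audio frame sequence
tokens = torch.randint(0, 32000, (1, 16))   # the text prompt

prefix = audio_encoder(mel)                              # (1, 3000, 512)
inputs = torch.cat([prefix, lm_embed(tokens)], dim=1)    # (1, 3016, 512)
print(inputs.shape)   # would be fed to the LM, e.g. as inputs_embeds
```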
Audio Flamingo 3 presents a fully open-source suite of large audio language models, advancing audio intelligence through improved training recipes, open model weights, and competitive performance across diverse audio understanding and reasoning benchmarks.
MMAU-Pro extends the original MMAU benchmark with significantly more challenging questions requiring deeper expert-level knowledge and multi-step reasoning across speech, music, and environmental sound. It provides a harder evaluation frontier for the next generation of audio general intelligence systems.
We systematically investigate the robustness of audio-language models to linguistic variations in text queries — including paraphrasing, negation, and syntactic transformations — revealing significant brittleness in current ALMs and proposing evaluation protocols to benchmark this understudied dimension of model robustness.
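A toy version of such a probe, with a hard-coded stand-in for an ALM's audio-text matching score, shows the failure mode we measure: the score barely moves under negation when it should flip.

```python
# Toy robustness probe. score() is a hard-coded stand-in for a real
# ALM's audio-text matching score, chosen to illustrate the brittleness
# we observe; the numbers are fabricated for illustration only.
STAND_IN_SCORES = {
    "a baby is crying": 0.91,       # original query
    "an infant cries": 0.88,        # paraphrase: should stay high (it does)
    "a baby is not crying": 0.86,   # negation: should drop sharply (it doesn't)
}

def score(audio_path, text):
    return STAND_IN_SCORES[text]    # replace with a real model's similarity

for text in STAND_IN_SCORES:
    print(f"{text!r}: {score('clip.wav', text):.2f}")
```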
SILA introduces a data augmentation framework for text-to-audio generation that bridges the gap between signal-level audio properties and language descriptions. By automatically generating rich, controllable textual descriptions from audio signals, SILA enables more faithful and controllable audio synthesis from text prompts.
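A minimal sketch of the underlying idea, deriving signal-level properties and rendering them into a caption; the thresholds and wording are illustrative assumptions, not SILA's pipeline:

```python
# Derive simple signal-level properties (duration, loudness) and render
# them into a caption. Thresholds and phrasing are illustrative.
import numpy as np

sr = 16000
t = np.linspace(0, 3.0, 3 * sr, endpoint=False)
audio = 0.2 * np.sin(2 * np.pi * 440 * t)      # stand-in for a loaded clip

duration_s = len(audio) / sr
rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)) + 1e-12)
loudness = "quiet" if rms_db < -30 else "moderately loud" if rms_db < -12 else "loud"

caption = f"a {loudness} sound lasting about {duration_s:.1f} seconds"
print(caption)   # "a moderately loud sound lasting about 3.0 seconds"
```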
MultiVox is a comprehensive benchmark for evaluating voice assistants in real-world multimodal interaction scenarios that combine speech, visual, and audio signals. It exposes limitations of current voice assistants when handling complex, concurrent modalities.
We introduce a multi-domain audio question answering challenge spanning environmental sounds, music, and speech. The task requires models to reason about acoustic content across diverse domains, pushing the frontier of audio general intelligence beyond category classification.
ReClap improves zero-shot audio classification by augmenting class labels with rich natural language descriptions of the corresponding sounds. By training CLAP-style models with these enriched descriptions, ReClap significantly improves zero-shot transfer across diverse audio classification benchmarks.
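The zero-shot recipe itself is simple; in this sketch, embed_audio and embed_text stand in for a trained CLAP model's encoders, and the enriched descriptions are illustrative:

```python
# Zero-shot classification, CLAP-style: pick the label whose enriched
# description embeds closest to the audio. The random embeddings keep
# the sketch self-contained, so the prediction here is arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def embed_audio(path):                 # placeholder audio encoder
    return rng.standard_normal(512)

def embed_text(description):           # placeholder text encoder
    return rng.standard_normal(512)

# enriched descriptions (illustrative) instead of bare class labels
descriptions = {
    "dog": "the sharp, repeated bark of a dog, often in short bursts",
    "rain": "the soft, continuous patter of raindrops on a hard surface",
}

a = embed_audio("clip.wav")
a /= np.linalg.norm(a)
scores = {}
for label, desc in descriptions.items():
    t = embed_text(desc)
    scores[label] = float(a @ t / np.linalg.norm(t))   # cosine similarity

print(max(scores, key=scores.get))
```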
We introduce MMAU — Massive Multi-Task Audio Understanding and Reasoning Benchmark — a comprehensive benchmark designed to evaluate Large Audio-Language Models on tasks requiring expert-level knowledge and complex reasoning. MMAU includes 10,000 meticulously curated audio clips paired with human-annotated questions and answers spanning speech, environmental sounds, and music. The benchmark covers 27 distinct skills across unique and challenging tasks. Even state-of-the-art systems like Gemini Pro v1.5 and Qwen2-Audio achieve only ~53% accuracy, underscoring substantial headroom for improvement.
We propose GAMA, a novel Large Audio-Language Model (LALM) capable of responding accurately to complex questions about input audio. GAMA benefits from a mixture of encoders and synthetic data generated using a novel data generation pipeline. GAMA achieves state-of-the-art performance on various audio understanding, reasoning, and hallucination benchmarks, and was selected for an Oral presentation at EMNLP 2024.
We present Synthio, a novel method for augmenting small-scale audio classification datasets with synthetic data. Our approach aligns a Text-to-Audio generation model with the target dataset through preference optimization, then uses an iterative LLM prompting method to generate diverse audio captions. By augmenting datasets with data generated by Synthio, we achieve up to a 39% performance improvement on benchmark datasets.
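A compressed, runnable outline of the loop, in which every component is a trivial stand-in for the paper's actual modules (preference-aligned text-to-audio generator, LLM caption refiner):

```python
# Outline of the augmentation loop with trivial stand-in components.
def refine_captions(captions):            # stand-in for iterative LLM prompting
    return [c + ", with background ambience" for c in captions]

def t2a_generate(caption):                # stand-in for the aligned T2A model
    return f"<audio for: {caption}>"

dataset = [("<clip0>", "dog bark")]       # the small target dataset
captions = [label for _, label in dataset]
for _ in range(2):                        # two augmentation rounds
    dataset += [(t2a_generate(c), c) for c in captions]
    captions = refine_captions(captions)

print(len(dataset))                       # 3: original + 2 synthetic examples
```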
EH-MAM introduces an adaptive masked acoustic modeling strategy that gradually increases masking difficulty during self-supervised pre-training. By selectively reconstructing challenging regions, EH-MAM enables better speech representations and outperforms state-of-the-art baselines across low-resource speech recognition and SUPERB benchmarks by 5–10%.
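As an illustration of difficulty-ramped masking (the linear schedule and hardest-first selection rule below are my simplifying assumptions, not EH-MAM's exact procedure):

```python
# Mask ratio grows over training; frames with the highest per-frame
# reconstruction loss from the previous pass are masked first.
import numpy as np

def select_mask(per_frame_loss, step, total_steps,
                start_ratio=0.15, end_ratio=0.50):
    """Return indices of frames to mask at this training step."""
    ratio = start_ratio + (end_ratio - start_ratio) * step / total_steps
    k = max(1, int(ratio * len(per_frame_loss)))
    # hardest-first: mask the frames the model reconstructed worst
    return np.argsort(per_frame_loss)[-k:]

losses = np.random.default_rng(1).random(100)          # stand-in losses
print(len(select_mask(losses, step=0, total_steps=1000)))     # 15 early on
print(len(select_mask(losses, step=1000, total_steps=1000)))  # 50 at the end
```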
LipGER is a visually conditioned generative approach to ASR error correction that uses lip movements as an additional conditioning signal to disambiguate acoustically confusable speech segments. By incorporating visual speech information, LipGER substantially reduces word error rates in noisy acoustic environments.
RECAP introduces retrieval-augmented generation for automated audio captioning. By retrieving semantically similar audio captions at inference time and conditioning caption generation on them, RECAP produces more accurate, diverse, and grounded audio descriptions compared to standard encoder-decoder baselines.
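The inference-time recipe, sketched here with placeholder embeddings: retrieve the captions of the nearest clips in a datastore and fold them into the decoder's prompt.

```python
# Retrieve captions of similar clips and condition generation on them.
# Embeddings and the datastore are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(2)
datastore = [("seagulls near waves", rng.standard_normal(256)),
             ("crowd chatter at a beach", rng.standard_normal(256)),
             ("engine idling", rng.standard_normal(256))]

def retrieve(query_emb, k=2):
    """Return the k captions whose clips embed closest to the query."""
    sims = [(cap, float(query_emb @ emb /
                        (np.linalg.norm(query_emb) * np.linalg.norm(emb))))
            for cap, emb in datastore]
    return [cap for cap, _ in sorted(sims, key=lambda x: -x[1])[:k]]

query = rng.standard_normal(256)          # stand-in audio embedding
context = "; ".join(retrieve(query))
prompt = f"Similar clips were described as: {context}. Describe this clip:"
print(prompt)                             # fed to the caption decoder
```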
AV-RIR leverages both audio and visual cues to estimate room impulse responses — the acoustic fingerprint of a space. By jointly modeling visual scene geometry and captured audio, AV-RIR achieves significantly more accurate RIR estimation than audio-only methods, enabling better downstream spatial audio applications.
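For context on what an RIR buys you: convolving dry audio with an impulse response yields the signal as it would sound in that room, as in this toy example with a synthetic decaying RIR.

```python
# Apply a room impulse response by convolution. The RIR here is a
# synthetic exponential decay, purely for illustration.
import numpy as np

sr = 16000
dry = np.random.default_rng(3).standard_normal(sr)   # 1 s of "dry" audio
rir = np.exp(-np.linspace(0.0, 8.0, sr // 4))        # 0.25 s decaying tail

wet = np.convolve(dry, rir)[: len(dry)]              # reverberant version
print(dry.shape, wet.shape)                          # (16000,) (16000,)
```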
We introduce CompA, a benchmark for compositional reasoning in audio-language models. CompA includes CompA-order (evaluating understanding of sequential acoustic events) and CompA-attribute (testing attribute-sound association). We demonstrate that models like CLAP struggle with compositional reasoning, and propose CompA-CLAP with compositionally aware hard negatives and modular contrastive learning to significantly improve performance.
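The training signal can be sketched as InfoNCE with the hard negatives added to the candidate set; the encoders here are placeholders and the loss shape is the standard formulation, not CompA-CLAP's full modular objective.

```python
# Contrastive loss with compositional hard negatives: for a clip of
# "a dog barks, then a car passes", a hard negative swaps the event
# order. The positive caption sits at index 0 of the candidate set.
import torch
import torch.nn.functional as F

def loss_with_hard_negs(audio_emb, pos_text_emb, hard_neg_embs, tau=0.07):
    """audio_emb: (d,), pos_text_emb: (d,), hard_neg_embs: (n, d)."""
    a = F.normalize(audio_emb, dim=-1)
    texts = F.normalize(torch.cat([pos_text_emb[None], hard_neg_embs]), dim=-1)
    logits = (texts @ a) / tau                 # audio-to-caption similarities
    return F.cross_entropy(logits[None], torch.tensor([0]))  # positive = 0

d = 128
loss = loss_with_hard_negs(torch.randn(d), torch.randn(d), torch.randn(4, d))
print(loss.item())
```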
The IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2025. We proposed the Audio Question Answering (AQA) task, which focuses on advancing question-answering capabilities in "interactive audio understanding," covering both general acoustic events and knowledge-heavy sound information. The task encourages systems that can accurately interpret and respond to complex multiple-choice questions about audio, requiring models to reason across diverse audio types. Reproducible baselines are provided, including a resource-efficient setting that runs within a single 8 GB memory budget; direct use of enterprise APIs is prohibited.
The 2025 Jelinek Workshop on Speech and Language Technologies (JSALT) is an eight-week residential summer research workshop bringing together international teams to work intensively on challenging problems in speech and language engineering, ML, and AI. We proposed a workshop focused on advancing expert-level understanding and complex reasoning in audio-language models, drawing team members from several universities and industry in the US, Europe, and Asia.
SALMA-2 is the second edition of the Workshop on Speech and Audio Language Models, co-located with EMNLP 2026. Building on the success of the first SALMA workshop, this edition continues to bring together researchers working at the intersection of speech, audio, and language modeling to share recent advances and foster new collaborations.