Sonal Kumar
PhD Candidate · UMD College Park

Hi, I'm Sonal Kumar

I am a Ph.D. student in Computer Science at the University of Maryland, College Park, advised by Prof. Dinesh Manocha and Prof. Ramani Duraiswami. As a member of the GAMMA Lab and PIRL Lab, I work on audio-language models, multimodal learning, and advanced reasoning in speech and audio processing.

Previously, I was a Software Engineer II at Cisco Systems and a Research Scientist Intern at Adobe, where I focused on audio generative models. I graduated with a B.Tech in Computer Science from Christ University in 2021, where I served as President of Neuron, the university's first AI club.


Publications

Audio, speech & multimodal research  ·  Full Google Scholar profile →

2026
arXiv

TAC: Timestamped Audio Captioning

S Kumar, P Seetharaman, K Chen, O Nieto, J Su, Z Wang, R Kumar, ...

arXiv:2602.15766 · 2026

We introduce TAC, a system for generating audio captions with precise timestamps, enabling fine-grained alignment between audio events and their textual descriptions. TAC advances the state of the art in temporal audio understanding and supports downstream applications like audio retrieval and audio question answering.

2025
NeurIPS

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

A Goel, S Ghosh, J Kim, S Kumar, Z Kong, S Lee, CHH Yang, ...

NeurIPS 2025

Audio Flamingo 3 presents a fully open-source suite of large audio language models, advancing audio intelligence through improved training recipes, open model weights, and competitive performance across diverse audio understanding and reasoning benchmarks.

AAAI

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

S Kumar, Š Sedláček, V Lokegaonkar, F López, W Yu, N Anand, H Ryu, ...

AAAI 2026

MMAU-Pro extends the original MMAU benchmark with significantly more challenging questions requiring deeper expert-level knowledge and multi-step reasoning across speech, music, and environmental sound. It provides a harder evaluation frontier for the next generation of audio general intelligence systems.

NAACL

Do Audio-Language Models Understand Linguistic Variations?

R Selvakumar*, S Kumar*, HK Giri, N Anand, A Seth, S Ghosh, D Manocha

NAACL 2025

We systematically investigate the robustness of audio-language models to linguistic variations in text queries — including paraphrasing, negation, and syntactic transformations — revealing significant brittleness in current ALMs and proposing evaluation protocols to benchmark this understudied dimension of model robustness.

WASPAA

SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

S Kumar, P Seetharaman, J Salamon, D Manocha, O Nieto

WASPAA 2025

SILA introduces a data augmentation framework for text-to-audio generation that bridges the gap between signal-level audio properties and language descriptions. By automatically generating rich, controllable textual descriptions from audio signals, SILA enables more faithful and controllable audio synthesis from text prompts.

EMNLP

MultiVox: Benchmarking Voice Assistants for Multimodal Interactions

R Selvakumar, A Seth, N Anand, U Tyagi, S Kumar, S Ghosh, D Manocha

EMNLP 2025

MultiVox is a comprehensive benchmark for evaluating voice assistants in real-world multimodal interaction scenarios that combine speech, visual, and audio signals. It exposes limitations of current voice assistants when handling complex, concurrent modalities.

DCASE

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning

CH Huck Yang, S Ghosh, Q Wang, J Kim, H Hong, S Kumar, ...

DCASE 2025 Challenge

We introduce a multi-domain audio question answering challenge spanning environmental sounds, music, and speech. The task requires models to reason about acoustic content across diverse domains, pushing the frontier of audio general intelligence beyond category classification.

ICASSP

ReClap: Improving Zero-Shot Audio Classification by Describing Sounds

S Ghosh, S Kumar, CKR Evuru, O Nieto, R Duraiswami, D Manocha

ICASSP 2025

ReClap improves zero-shot audio classification by augmenting class labels with rich natural language descriptions of the corresponding sounds. Training CLAP-style models with these enriched descriptions yields substantial gains in zero-shot transfer across diverse audio classification benchmarks.

2024
EMNLP

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

A Seth, R Selvakumar, S Sakshi, S Kumar, S Ghosh, D Manocha

EMNLP 2024

EH-MAM introduces an adaptive masked acoustic modeling strategy that gradually increases masking difficulty during self-supervised pre-training. By selectively reconstructing challenging regions, EH-MAM enables better speech representations and outperforms state-of-the-art baselines across low-resource speech recognition and SUPERB benchmarks by 5–10%.

Interspeech

LipGER: Visually-Conditioned Generative Error Correction for Robust ASR

S Ghosh, S Kumar, A Seth, P Chiniya, U Tyagi, R Duraiswami, ...

Interspeech 2024

LipGER is a visually-conditioned generative approach to ASR error correction that uses lip movements as an additional conditioning signal to disambiguate acoustically confusing speech segments. By incorporating visual speech information, LipGER substantially reduces word error rates in noisy acoustic environments.

ICASSP

RECAP: Retrieval-Augmented Audio Captioning

S Ghosh, S Kumar, CKR Evuru, R Duraiswami, D Manocha

ICASSP 2024

RECAP introduces retrieval-augmented generation for automated audio captioning. By retrieving semantically similar audio captions at inference time and conditioning caption generation on them, RECAP produces more accurate, diverse, and grounded audio descriptions compared to standard encoder-decoder baselines.

CVPR

AV-RIR: Audio-Visual Room Impulse Response Estimation

A Ratnarajah, S Ghosh, S Kumar, P Chiniya, D Manocha

CVPR 2024

AV-RIR leverages both audio and visual cues to estimate room impulse responses — the acoustic fingerprint of a space. By jointly modeling visual scene geometry and captured audio, AV-RIR achieves significantly more accurate RIR estimation than audio-only methods, enabling better downstream spatial audio applications.


Service

Co-Organizing

DCASE 2025

Audio Question Answering Task at DCASE 2025

Challenge: Apr 1 – Jun 15, 2025  ·  Workshop: Oct 30–31, 2025  ·  Barcelona, Spain

As part of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2025, we proposed the Audio Question Answering (AQA) task, which focuses on advancing question-answering capabilities for interactive audio understanding, covering both general acoustic events and knowledge-heavy sound information. The task encourages systems that can accurately interpret and respond to complex multiple-choice questions about audio, requiring models to reason across diverse audio types. Reproducible baselines include a resource-efficient setting restricted to 8 GB of RAM; direct use of enterprise APIs is prohibited.

JSALT 2025

Advancing Expert-Level Reasoning in Large Audio Language Models at JSALT 2025

Jun 9 – Aug 1, 2025  ·  Brno, Czechia

The 2025 Jelinek Workshop on Speech and Language Technologies (JSALT) is an eight-week residential summer research workshop bringing together international teams to work intensively on challenging problems in speech and language engineering, ML, and AI. We proposed a workshop focused on advancing expert-level understanding and complex reasoning in audio-language models, drawing team members from several universities and industry in the US, Europe, and Asia.

EMNLP 2026

Workshop on Speech and Audio Language Models (SALMA-2)

Co-located with EMNLP 2026

SALMA-2 is the second edition of the Workshop on Speech and Audio Language Models, co-located with EMNLP 2026. Building on the success of the first SALMA workshop, this edition continues to bring together researchers working at the intersection of speech, audio, and language modeling to share recent advances and foster new collaborations.

Talks & Presentations

PhD Proposal Talk

Advancing Expert-Level Audio Understanding and Reasoning in Large Audio Language Models

Slides

Introduction to Large Audio Language Models

An overview of large audio-language models, benchmarks, and open research challenges in audio intelligence

Slides

Program Committee & Reviewer

AAAI 2026
ACL 2023, 2024, 2025
ARR 2023 – Present
CVPR 2025
EMNLP 2023, 2024, 2025
ICASSP 2024, 2025, 2026
ICLR 2025, 2026
ICML 2025, 2026
NAACL 2024, 2025
NeurIPS 2024, 2025

News & Updates

May 2025: Joined Adobe as Research Scientist Intern for the summer.
May 2025: Audio Flamingo 2 accepted at ICML 2025 🎉
Jan 2025: 3 papers accepted at ICLR 2025 (incl. one Spotlight).
Jan 2025: 3 papers accepted at NAACL 2025.
Sept 2024: Released MMAU, the most comprehensive audio reasoning benchmark.
Sept 2024: GAMA accepted as Oral at EMNLP 2024.
Sept 2024: 2 papers accepted to EMNLP 2024.
Aug 2024: SALMA workshop accepted at ICASSP 2025.
June 2024: Released GAMA, a state-of-the-art audio-language model.
May 2024: Joined Adobe in San Francisco as Research Scientist Intern.
May 2024: 2 papers at ACL 2024 · 1 paper at ICML 2024.
Mar 2024: 2 papers at NAACL 2024 · 1 paper at CVPR 2024.
Jan 2024: 1 paper at ICLR 2024 · 1 paper at ICASSP 2024.
Oct 2023: 2 papers accepted to EMNLP 2023.
May 2023: Papers at Interspeech 2023 · ACL 2023 · SIGIR 2023.

Resume