I am a Ph.D. student in Computer Science at the University of Maryland, College Park, working under Prof. Dinesh Manocha and Prof. Ramani Duraiswami. As a member of the Gamma Lab and the PIRL Lab, I work on audio-language models, multimodal learning, and advanced reasoning in speech and audio processing.
Previously, I was a Software Engineer II at Cisco Systems, and a Research Scientist Intern at Adobe, where I focused on audio generative models. I graduated with a B.Tech in Computer Science from Christ University in 2021, where I served as President of Neuron, the university's first AI club.
Audio, speech & multimodal research · Full Google Scholar profile →
We introduce TAC, a system for generating audio captions with precise timestamps, enabling fine-grained alignment between audio events and their textual descriptions. TAC advances the state of the art in temporal audio understanding and supports downstream applications like audio retrieval and audio question answering.
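To make the output format concrete, here is a toy sketch of timestamped captions as (onset, offset, text) tuples scored with temporal IoU; the tuple format and matching rule are illustrative, not TAC's actual implementation.

```python
# Toy sketch: timestamped captions as (onset_s, offset_s, text) tuples,
# scored with temporal intersection-over-union. Illustrative only.

def temporal_iou(pred, ref):
    """IoU of two [onset, offset] intervals in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / union if union > 0 else 0.0

predicted = [(0.0, 2.1, "a dog barks twice"), (2.5, 6.0, "rain on a window")]
reference = [(0.2, 2.0, "dog barking"), (2.4, 6.3, "rainfall")]

for p, r in zip(predicted, reference):
    print(f"{p[2]!r} vs {r[2]!r}: IoU = {temporal_iou(p[:2], r[:2]):.2f}")
```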
Audio Flamingo 2 is a large audio-language model that understands long, minutes-scale audio and performs complex multi-step reasoning about its content. By combining a powerful audio encoder with a long-context language model, AF2 achieves state-of-the-art performance on multiple benchmarks while maintaining efficiency.
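As a rough illustration of the encoder-plus-LM pattern (shapes only; the modules below are placeholders, not AF2's architecture), audio frames can be projected into the language model's embedding space and prepended to the text tokens:

```python
# Placeholder sketch of an encoder-to-LM interface: project audio
# frames to the LM width and prepend them to the text embeddings.
import torch, torch.nn as nn

audio_encoder = nn.Linear(64, 512)          # stand-in: mel frames -> LM width
lm_embed = nn.Embedding(32000, 512)         # stand-in token embeddings

mel = torch.randn(1, 3000, 64)              # long, pooled audio frame sequence
tokens = torch.randint(0, 32000, (1, 16))   # the text prompt

prefix = audio_encoder(mel)                              # (1, 3000, 512)
inputs = torch.cat([prefix, lm_embed(tokens)], dim=1)    # (1, 3016, 512)
print(inputs.shape)   # would be fed to the LM, e.g. as inputs_embeds
```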
Audio Flamingo 3 presents a fully open-source suite of large audio language models, advancing audio intelligence through improved training recipes, open model weights, and competitive performance across diverse audio understanding and reasoning benchmarks.
MMAU-Pro extends the original MMAU benchmark with significantly more challenging questions requiring deeper expert-level knowledge and multi-step reasoning across speech, music, and environmental sound. It provides a harder evaluation frontier for the next generation of audio general intelligence systems.
We systematically investigate the robustness of audio-language models to linguistic variations in text queries — including paraphrasing, negation, and syntactic transformations — revealing significant brittleness in current ALMs and proposing evaluation protocols to benchmark this understudied dimension of model robustness.
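A toy version of such a probe, with a hard-coded stand-in for an ALM's audio-text matching score, shows the failure mode we measure: the score barely moves under negation when it should flip.

```python
# Toy robustness probe. score() is a hard-coded stand-in for a real
# ALM's audio-text matching score, chosen to illustrate the brittleness
# we observe; the numbers are fabricated for illustration only.
STAND_IN_SCORES = {
    "a baby is crying": 0.91,       # original query
    "an infant cries": 0.88,        # paraphrase: should stay high (it does)
    "a baby is not crying": 0.86,   # negation: should drop sharply (it doesn't)
}

def score(audio_path, text):
    return STAND_IN_SCORES[text]    # replace with a real model's similarity

for text in STAND_IN_SCORES:
    print(f"{text!r}: {score('clip.wav', text):.2f}")
```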
SILA introduces a data augmentation framework for text-to-audio generation that bridges the gap between signal-level audio properties and language descriptions. By automatically generating rich, controllable textual descriptions from audio signals, SILA enables more faithful and controllable audio synthesis from text prompts.
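A minimal sketch of the underlying idea, deriving signal-level properties and rendering them into a caption; the thresholds and wording are illustrative assumptions, not SILA's pipeline:

```python
# Derive simple signal-level properties (duration, loudness) and render
# them into a caption. Thresholds and phrasing are illustrative.
import numpy as np

sr = 16000
t = np.linspace(0, 3.0, 3 * sr, endpoint=False)
audio = 0.2 * np.sin(2 * np.pi * 440 * t)      # stand-in for a loaded clip

duration_s = len(audio) / sr
rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)) + 1e-12)
loudness = "quiet" if rms_db < -30 else "moderately loud" if rms_db < -12 else "loud"

caption = f"a {loudness} sound lasting about {duration_s:.1f} seconds"
print(caption)   # "a moderately loud sound lasting about 3.0 seconds"
```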
MultiVox is a comprehensive benchmark for evaluating voice assistants in real-world multimodal interaction scenarios that combine speech, visual, and audio signals. It exposes limitations of current voice assistants when handling complex, concurrent modalities.
We introduce a multi-domain audio question answering challenge spanning environmental sounds, music, and speech. The task requires models to reason about acoustic content across diverse domains, pushing the frontier of audio general intelligence beyond category classification.
ReClap improves zero-shot audio classification by augmenting class labels with rich natural language descriptions of the corresponding sounds. By training CLAP-style models with these enriched descriptions, ReClap significantly improves zero-shot transfer across diverse audio classification benchmarks.
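The zero-shot recipe itself is simple; in this sketch, embed_audio and embed_text stand in for a trained CLAP model's encoders, and the enriched descriptions are illustrative:

```python
# Zero-shot classification, CLAP-style: pick the label whose enriched
# description embeds closest to the audio. The random embeddings keep
# the sketch self-contained, so the prediction here is arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def embed_audio(path):                 # placeholder audio encoder
    return rng.standard_normal(512)

def embed_text(description):           # placeholder text encoder
    return rng.standard_normal(512)

# enriched descriptions (illustrative) instead of bare class labels
descriptions = {
    "dog": "the sharp, repeated bark of a dog, often in short bursts",
    "rain": "the soft, continuous patter of raindrops on a hard surface",
}

a = embed_audio("clip.wav")
a /= np.linalg.norm(a)
scores = {}
for label, desc in descriptions.items():
    t = embed_text(desc)
    scores[label] = float(a @ t / np.linalg.norm(t))   # cosine similarity

print(max(scores, key=scores.get))
```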
We introduce MMAU — Massive Multi-Task Audio Understanding and Reasoning Benchmark — a comprehensive benchmark designed to evaluate Large Audio-Language Models on tasks requiring expert-level knowledge and complex reasoning. MMAU includes 10,000 meticulously curated audio clips paired with human-annotated questions and answers spanning speech, environmental sounds, and music. The benchmark covers 27 distinct skills across unique and challenging tasks. Even state-of-the-art systems like Gemini Pro v1.5 and Qwen2-Audio achieve only ~53% accuracy, underscoring substantial headroom for improvement.
We propose GAMA, a novel Large Audio-Language Model (LALM) capable of responding accurately to complex questions about input audio. GAMA benefits from a mixture of encoders and synthetic data generated using a novel data generation pipeline. GAMA achieves state-of-the-art performance on various audio understanding, reasoning, and hallucination benchmarks, and was selected for an Oral presentation at EMNLP 2024.
We present Synthio, a novel method for augmenting small-scale audio classification datasets with synthetic data. Our approach aligns a Text-to-Audio generation model with the target dataset through preference optimization, then uses an iterative LLM prompting method to generate diverse audio captions. By augmenting datasets with data generated by Synthio, we achieve up to a 39% performance improvement on benchmark datasets.
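A compressed, runnable outline of the loop, in which every component is a trivial stand-in for the paper's actual modules (preference-aligned text-to-audio generator, LLM caption refiner):

```python
# Outline of the augmentation loop with trivial stand-in components.
def refine_captions(captions):            # stand-in for iterative LLM prompting
    return [c + ", with background ambience" for c in captions]

def t2a_generate(caption):                # stand-in for the aligned T2A model
    return f"<audio for: {caption}>"

dataset = [("<clip0>", "dog bark")]       # the small target dataset
captions = [label for _, label in dataset]
for _ in range(2):                        # two augmentation rounds
    dataset += [(t2a_generate(c), c) for c in captions]
    captions = refine_captions(captions)

print(len(dataset))                       # 3: original + 2 synthetic examples
```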
EH-MAM introduces an adaptive masked acoustic modeling strategy that gradually increases masking difficulty during self-supervised pre-training. By selectively reconstructing challenging regions, EH-MAM enables better speech representations and outperforms state-of-the-art baselines across low-resource speech recognition and SUPERB benchmarks by 5–10%.
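As an illustration of difficulty-ramped masking (the linear schedule and hardest-first selection rule below are my simplifying assumptions, not EH-MAM's exact procedure):

```python
# Mask ratio grows over training; frames with the highest per-frame
# reconstruction loss from the previous pass are masked first.
import numpy as np

def select_mask(per_frame_loss, step, total_steps,
                start_ratio=0.15, end_ratio=0.50):
    """Return indices of frames to mask at this training step."""
    ratio = start_ratio + (end_ratio - start_ratio) * step / total_steps
    k = max(1, int(ratio * len(per_frame_loss)))
    # hardest-first: mask the frames the model reconstructed worst
    return np.argsort(per_frame_loss)[-k:]

losses = np.random.default_rng(1).random(100)          # stand-in losses
print(len(select_mask(losses, step=0, total_steps=1000)))     # 15 early on
print(len(select_mask(losses, step=1000, total_steps=1000)))  # 50 at the end
```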
LipGER is a visually conditioned generative approach to ASR error correction that uses lip movements as an additional conditioning signal to disambiguate acoustically confusable speech segments. By incorporating visual speech information, LipGER substantially reduces word error rates in noisy acoustic environments.
RECAP introduces retrieval-augmented generation for automated audio captioning. By retrieving semantically similar audio captions at inference time and conditioning caption generation on them, RECAP produces more accurate, diverse, and grounded audio descriptions compared to standard encoder-decoder baselines.
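The inference-time recipe, sketched here with placeholder embeddings: retrieve the captions of the nearest clips in a datastore and fold them into the decoder's prompt.

```python
# Retrieve captions of similar clips and condition generation on them.
# Embeddings and the datastore are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(2)
datastore = [("seagulls near waves", rng.standard_normal(256)),
             ("crowd chatter at a beach", rng.standard_normal(256)),
             ("engine idling", rng.standard_normal(256))]

def retrieve(query_emb, k=2):
    """Return the k captions whose clips embed closest to the query."""
    sims = [(cap, float(query_emb @ emb /
                        (np.linalg.norm(query_emb) * np.linalg.norm(emb))))
            for cap, emb in datastore]
    return [cap for cap, _ in sorted(sims, key=lambda x: -x[1])[:k]]

query = rng.standard_normal(256)          # stand-in audio embedding
context = "; ".join(retrieve(query))
prompt = f"Similar clips were described as: {context}. Describe this clip:"
print(prompt)                             # fed to the caption decoder
```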
AV-RIR leverages both audio and visual cues to estimate room impulse responses — the acoustic fingerprint of a space. By jointly modeling visual scene geometry and captured audio, AV-RIR achieves significantly more accurate RIR estimation than audio-only methods, enabling better downstream spatial audio applications.
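For context on what an RIR buys you: convolving dry audio with an impulse response yields the signal as it would sound in that room, as in this toy example with a synthetic decaying RIR.

```python
# Apply a room impulse response by convolution. The RIR here is a
# synthetic exponential decay, purely for illustration.
import numpy as np

sr = 16000
dry = np.random.default_rng(3).standard_normal(sr)   # 1 s of "dry" audio
rir = np.exp(-np.linspace(0.0, 8.0, sr // 4))        # 0.25 s decaying tail

wet = np.convolve(dry, rir)[: len(dry)]              # reverberant version
print(dry.shape, wet.shape)                          # (16000,) (16000,)
```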
We introduce CompA, a benchmark for compositional reasoning in audio-language models. CompA includes CompA-order (evaluating understanding of sequential acoustic events) and CompA-attribute (testing attribute-sound association). We demonstrate that models like CLAP struggle with compositional reasoning, and propose CompA-CLAP with compositionally aware hard negatives and modular contrastive learning to significantly improve performance.
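The training signal can be sketched as InfoNCE with the hard negatives added to the candidate set; the encoders here are placeholders and the loss shape is the standard formulation, not CompA-CLAP's full modular objective.

```python
# Contrastive loss with compositional hard negatives: for a clip of
# "a dog barks, then a car passes", a hard negative swaps the event
# order. The positive caption sits at index 0 of the candidate set.
import torch
import torch.nn.functional as F

def loss_with_hard_negs(audio_emb, pos_text_emb, hard_neg_embs, tau=0.07):
    """audio_emb: (d,), pos_text_emb: (d,), hard_neg_embs: (n, d)."""
    a = F.normalize(audio_emb, dim=-1)
    texts = F.normalize(torch.cat([pos_text_emb[None], hard_neg_embs]), dim=-1)
    logits = (texts @ a) / tau                 # audio-to-caption similarities
    return F.cross_entropy(logits[None], torch.tensor([0]))  # positive = 0

d = 128
loss = loss_with_hard_negs(torch.randn(d), torch.randn(d), torch.randn(4, d))
print(loss.item())
```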
The IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2025. We proposed the Audio Question Answering (AQA) task, which focuses on advancing question-answering capabilities in "interactive audio understanding," covering both general acoustic events and knowledge-heavy sound information. The task encourages systems that can accurately interpret and respond to complex multiple-choice questions about audio, requiring models to reason across diverse audio types. Reproducible baselines are provided, including a resource-efficient setting that runs within a single 8 GB memory budget; direct use of enterprise APIs is prohibited.
The 2025 Jelinek Workshop on Speech and Language Technologies (JSALT) is an eight-week residential summer research workshop bringing together international teams to work intensively on challenging problems in speech and language engineering, ML, and AI. We proposed a workshop focused on advancing expert-level understanding and complex reasoning in audio-language models, drawing team members from several universities and industry in the US, Europe, and Asia.
SALMA-2 is the second edition of the Workshop on Speech and Audio Language Models, co-located with EMNLP 2026. Building on the success of the first SALMA workshop, this edition continues to bring together researchers working at the intersection of speech, audio, and language modeling to share recent advances and foster new collaborations.