UMD Team Advances AI Audio Systems with New Training Data and Benchmarks

December 2, 2025

Artificial intelligence (AI) systems that understand and produce text and images have surged into mainstream use, flooding the internet and social media feeds with AI-generated content.

But the same can’t be said for audio-based AI systems, known as Large Audio-Language Models (LALMs), which use machine learning, speech recognition, and natural language processing to convert spoken words into data, or to synthesize realistic human-like voices and music from text or other audio.

This deficiency stems in part from the limited access that researchers and software developers working on LALMs have had to large-scale audio training datasets and measurable benchmarks for assessing their work.

University of Maryland researchers, collaborating with scientists at tech giant NVIDIA, are working to close this gap, developing open-source AI platforms that offer a rich array of training data and a comprehensive set of benchmarks to advance the efficiency and reliability of LALMs.

The researchers are presenting their findings as a spotlight paper at the Conference on Neural Information Processing Systems (NeurIPS), taking place in December in both San Diego and Mexico City. 

In their paper, the researchers introduce Audio Flamingo 3, a fully open state-of-the-art LALM that advances reasoning and understanding across speech, sound, and music using a unified audio encoder trained with a novel strategy for joint representation learning.

The research team also developed several training datasets curated with novel strategies, achieving state-of-the-art results on more than 20 benchmarks used to measure audio understanding and reasoning. These results surpassed both open-weight and closed-source models that were trained on much larger datasets.

The value of these benchmarks and the open-source training data is that they will spur further research in this domain, says Ramani Duraiswami, a University of Maryland professor of computer science who was a co-author of the paper.

“The performance of both academic and commercial LALMs on several of the 41 categories covered in this benchmark is far behind human performance, providing the community with valuable directions for future discoveries,” says Duraiswami, who has an appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS) and is a core faculty member in the University of Maryland Center for Machine Learning.

Other UMD researchers working on this project include Dinesh Manocha, a Distinguished University Professor of Computer Science with an appointment in UMIACS; and computer science graduate students Sreyan Ghosh and Sonal Kumar, who are co-advised by Duraiswami and Manocha.

Much of the early research for the published paper took place during a six-week workshop held over the summer at the Brno University of Technology in the Czech Republic.

The researchers attending the workshop were divided into four teams to address scientific challenges within machine learning applied to human speech, with Duraiswami leading the research thrust on Auditory General Intelligence.

—Story by Maria Herd, UMIACS communications group
