Although researchers are making steady progress in understanding the models that power modern AI, significant gaps remain in our knowledge of how these systems work internally: how they develop skills, generate specific outputs, adapt to individual users, and faithfully reflect their reasoning processes. As AI capabilities advance rapidly, that gap in understanding continues to widen, with researchers racing to keep pace.
Sarah Wiegreffe, who joined the University of Maryland last fall as an assistant professor of computer science, is one of those researchers. With an affiliate appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS), Wiegreffe is working to uncover the internal mechanisms behind these systems and develop models that are both more reliable and more trustworthy.
Wiegreffe earned her Ph.D. in computer science from the Georgia Institute of Technology before completing a postdoctoral fellowship with the Allen Institute for AI (AI2) and the University of Washington.
Her interest in interpretability began early in graduate school. As a first-year doctoral student working on machine learning for health care, she encountered a fundamental challenge while writing her first paper involving electronic health records: researchers still had limited insight into how AI systems reach their conclusions. That made it difficult to guarantee that AI-generated explanations were truly faithful to a model’s internal reasoning, limiting their effectiveness and utility to physicians.
“That realization is what got me moving from the explainability side to the interpretability side,” Wiegreffe says. Her research focus shifted from studying how AI explanations affect people’s interactions with models to investigating how AI systems learn capabilities and why they produce specific outcomes.
Today, her work centers on natural language processing and large language models in English. A major focus of her research is what she calls “actionable interpretability”—turning interpretability from a diagnostic tool for researchers into a practical capability that gives both developers and users meaningful control over how models behave.
For example, instead of repeatedly rewriting prompts to guide an AI system toward a desired response, users might one day be able to directly adjust model behavior.
“Given a trained model, can we develop techniques that give you some level of control—maybe via a slider—to change the extent to which the output has a specific attribute, like agreeability?” Wiegreffe asks.
Her work also addresses key challenges in AI reliability and safety. This includes studying how an AI system’s refusal to answer dangerous queries is encoded within its internal representations. By understanding these mechanisms, researchers hope to strengthen safety safeguards and reduce the risk of “jailbreaks” that attempt to bypass them.
To pursue this work, Wiegreffe sought an academic environment that combined strong technical expertise with interdisciplinary collaboration. She first became interested in the University of Maryland after participating in the UMD Center for Machine Learning’s Rising Stars program in December 2024.
“I really liked the collegiality of the department and the fact that the department is strong in multiple areas,” she says. “It’s kind of a jack-of-all-trades place.”
Because her work sits at the intersection of machine learning and natural language processing, Wiegreffe was also drawn to the opportunity to collaborate with researchers in UMD’s Computational Linguistics and Information Processing (CLIP) Lab.
UMIACS resources also played an important role in her decision. She points specifically to the institute’s shared computing infrastructure.
“It’s a shared model that’s way more effective and efficient than each professor or lab managing their own computational resources,” she says.
Beyond faculty collaborations and technical resources, Wiegreffe says the students she advises and teaches have been one of the most rewarding parts of her time at Maryland so far. Reflecting on a recent graduate seminar she taught on language model interpretability, she emphasizes how much she values mentorship.
“I have great students, and they really inspire me,” she says. “Not only was the course a learning experience for the students, but teaching it was also a great experience for me to remind myself why I got into this field and why I love it.”
—Story by Diya Sharma, UMIACS communications group