As artificial intelligence systems become more powerful, one of the field’s most pressing questions is whether they will remain understandable to the humans who build and oversee them.
Sarah Wiegreffe, an assistant professor of computer science with an affiliate appointment at the University of Maryland Institute for Advanced Computer Studies (UMIACS), is leading a new research effort aimed at answering that question. Her work will test whether the reasoning processes used by advanced AI systems will remain transparent—or whether future models might learn to bypass the explanation-generating step designed to keep them accountable.
The project is supported by a $270,000 grant from Coefficient Giving, formerly known as Open Philanthropy, a philanthropic organization that helps direct funding toward high-impact global causes.
At the center of Wiegreffe’s research is a widely used technique known as chain-of-thought reasoning, in which AI models generate step-by-step explanations of how they reach their answers. Today, those explanations can help researchers and other AI systems monitor whether a model is behaving properly. For example, oversight systems can scan reasoning traces for warning signs such as “reward hacking,” when a model exploits unintended shortcuts to accomplish a task, or potential policy violations.
But Wiegreffe says researchers have not yet tested whether these safeguards will continue to work as AI systems grow more capable.
“Researchers often assume that dangerous tasks are complex enough to require visible reasoning, and that future models won’t simply bypass these explanations,” Wiegreffe says. “These assumptions are plausible, but we don’t yet have strong empirical evidence that they’ll continue to hold as models become more capable.”
To investigate, Wiegreffe and her collaborators will build an experimental framework that treats task complexity, and thus reasoning necessity, as a spectrum rather than a simple yes-or-no feature. The team will measure how many reasoning tokens—the basic units of text that language models process—are needed for a model to solve problems of increasing difficulty. By systematically limiting the amount of reasoning tokens from a system that a human or AI supervisor can observe, the researchers hope to identify the point at which a supervisor’s ability to monitor the system begins to break down.
The project will also examine a more subtle risk: encoded reasoning. In this scenario, a model might hide intermediate calculations within stylistic patterns or specific word choices rather than clearly articulated explanations, making its reasoning harder for humans or automated monitors to detect.
To counter that possibility, Wiegreffe plans to explore training techniques designed to make transparent reasoning structurally necessary. One approach would act like a “digital blindfold,” selectively disabling certain internal pathways so that a model must rely on explicit reasoning steps to solve complex tasks.
Ultimately, the research aims to help scientists understand how human oversight can scale alongside increasingly capable AI systems—a core challenge in the field of AI safety.
“We need to understand not just when models explain themselves,” Wiegreffe says, “but when those explanations are actually trustworthy by virtue of being faithful to the model’s internal computations—and when they’re not.”
The work contributes to ongoing efforts at UMIACS’ Computational Linguistics and Information Processing Laboratory, where Wiegreffe is a core faculty member, to ensure that as artificial intelligence and language processing technologies advance, they remain transparent, trustworthy and aligned with human goals.
—Story by Melissa Brachfeld, UMIACS Communications