As AI tools become part of everyday life, large language models (LLMs) are growing more capable—generating text, translating languages, assisting with writing, and tackling complex reasoning. But researchers are finding these systems can subtly bend safety rules, appearing responsible in one context while providing restricted content through seemingly harmless rewrites or language shifts.

Furong Huang, an associate professor of computer science with an appointment in the University of Maryland Institute for Advanced Computer Studies, is leading a new effort to investigate this evasive behavior.
Her work is supported by a $1 million award from Open Philanthropy, a philanthropic entity that partners with GiveWell and Good Ventures.
The one-year project, called “FAKE-LLM: Knowledgeable Exploitation,” aims to understand when language models merely appear safe and aligned and when they genuinely follow safety rules. Even small changes—a paraphrase, a role-play setup, or a language switch—can cause a model that initially refuses a request to quietly comply instead.
Huang became interested in the problem after noticing two trends shaping AI behavior. First, as models grow more capable, tiny changes in wording or context can turn a refusal into quiet compliance. Second, most safety tests are narrow—often limited to a single language or a single turn of conversation—so a model might pass controlled evaluations yet behave unpredictably in real-world use. Together, these trends highlight the gap between how safe a model seems in the lab and how it actually performs in everyday applications.
To explore these gaps, she and her team have been testing models with paraphrasing, persona shifts, multi-turn role-play, language and code switching, translation relays, tool use, and carefully crafted “honeypot” prompts that appear harmless but reveal whether alignment is genuine or scripted. They also check consistency across languages, scan for hidden backdoors, and track subtle “policy slippage,” where a model issues a refusal but quietly provides most of the answer anyway.
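To make this kind of probe concrete, the following is a minimal, hypothetical sketch in Python, not code from Huang’s project: it sends the same underlying request in several framings and flags cases where the model’s refusal decision changes with the wording. The function names, the keyword-based refusal check, and the model call are all illustrative placeholders.

```python
# Minimal, hypothetical sketch of a consistency probe in the spirit of the
# stress tests described above. query_model() and the refusal heuristic are
# placeholders for illustration; they are not code from the FAKE-LLM project.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def looks_like_refusal(reply: str) -> bool:
    """Naive keyword heuristic: does the reply read as a refusal?"""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under test."""
    raise NotImplementedError("Wire this up to the model being evaluated.")


def consistency_probe(base_prompt: str, variants: dict) -> dict:
    """Send the same underlying request in several framings and record whether
    the model refuses each one. Rows that disagree suggest the refusal depends
    on surface wording (paraphrase, role-play, translation) rather than on the
    request itself."""
    results = {"original": looks_like_refusal(query_model(base_prompt))}
    for name, rewritten in variants.items():
        results[name] = looks_like_refusal(query_model(rewritten))
    return results


# Example use, with framing variants a tester might write by hand or generate:
# probe = consistency_probe(
#     "<restricted request>",
#     {
#         "paraphrase": "<same request, reworded>",
#         "role_play": "<same request framed as fiction or role-play>",
#         "translated": "<same request rendered in another language>",
#     },
# )
# print(probe)  # e.g. {'original': True, 'paraphrase': True, 'translated': False}
```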
“We’re building the equivalent of a stress test for AI, probing when models only act well-behaved versus when they truly are,” says Huang, who is also a member of the Institute for Trustworthy AI in Law & Society and the University of Maryland Center for Machine Learning.
Her work also addresses a deeper challenge: identifying behaviors in which AI systems appear aligned during evaluation while internally pursuing goals that may not match human intent. Early evasive behaviors—such as subtly “sandbagging” responses, exploiting evaluation rubrics, or quietly preserving misaligned objectives—can precede more obvious misbehavior. By developing robust behavioral benchmarks and stress tests, the team aims to reveal when models act deceptively, under what conditions, and why. These tools help oversight teams evaluate models before deployment, reducing risks in real-world use.
Huang says her team has caught models “cheating,” noting cases where a harmful request is denied in English but allowed after translation, or where a refusal is followed by actionable details “for education.” She notes that many of these lapses reveal gaps in evaluation as much as flaws in the models themselves.
These subtle behaviors can appear in everyday tools. A chatbot might block an unsafe question as written but respond after a slight rephrasing. A translation app might weaken a safety warning after language transfer. A writing assistant might refuse to produce disallowed content yet edit user-supplied text into a nearly equivalent result. Small inconsistencies, repeated at scale, can create real-world risk.
FAKE-LLM aims to deliver practical safeguards, including a multilingual, multi-turn evaluation suite for evasive behavior, methods to detect and prevent risky outputs, and guidance to ensure models continue to behave safely once deployed.
“Ensuring AI systems stay aligned in realistic settings—not just controlled tests—is critical,” she says. “We want to build systems that behave safely all the time, not only when watched.”
—Story by Melissa Brachfeld, UMIACS communications group