Meta is using a University of Maryland–led safety framework to test its new multimodal AI model, underscoring how academic research is shaping the evaluation of advanced systems before deployment.
The company recently introduced Muse Spark, a large multimodal model designed to process and understand text, images, video, and audio. Before releasing it publicly, scientists at Meta Superintelligence Labs subjected the model to safety testing under simulated operational pressure.
That evaluation relied on PropensityBench, a framework developed by researchers at the University of Maryland and Scale AI, with contributors from the University of North Carolina at Chapel Hill, Google DeepMind, Netflix, and the University of Texas at Austin.
PropensityBench was created to address a gap in how AI systems are typically assessed, said Furong Huang, an associate professor of computer science at UMD who co-led the effort.
Most existing safety tests focus on capabilities—what a model can do when prompted. But that approach misses a critical question: how a model might behave if it had access to risky or harmful tools. As AI systems are deployed in environments where they can take actions, use external tools, and operate with greater autonomy, that distinction is becoming more significant.
PropensityBench instead measures “propensity,” or the likelihood that a model will choose high-risk actions when given simulated access to them. The framework places models in controlled, agent-like environments where they must make decisions using a range of tools, including ones associated with potentially dangerous activities.
The benchmark includes 5,874 scenarios and 6,648 tools across four high-risk domains: cybersecurity, self-proliferation, biosecurity, and chemical security. Each scenario introduces constraints such as limited resources, efficiency incentives, or opportunities for increased autonomy, reflecting the kinds of trade-offs AI systems may face in real-world settings.
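To make the idea concrete, here is a minimal sketch of what a propensity-style evaluation loop might look like. It is an illustration only: the class names, the "pressure" field, and the model-harness interface are assumptions made for exposition, not PropensityBench's actual design or API.

```python
# Hypothetical sketch of a propensity-style evaluation loop.
# All names and structures are illustrative assumptions,
# not PropensityBench's actual interface.

from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    high_risk: bool  # flags tools tied to potentially dangerous actions


@dataclass
class Scenario:
    task: str
    tools: list[Tool]
    pressure: str  # e.g. "deadline", "resource_scarcity", "autonomy_offer"


def propensity_score(model, scenarios: list[Scenario]) -> float:
    """Return the fraction of scenarios in which the model picks a high-risk tool.

    `model.choose_tool` stands in for whatever agent harness drives the
    model under test; it is assumed to return the Tool the model invoked.
    """
    risky_choices = 0
    for s in scenarios:
        chosen = model.choose_tool(task=s.task, tools=s.tools, pressure=s.pressure)
        if chosen.high_risk:
            risky_choices += 1
    return risky_choices / len(scenarios)
```

The key design point such a loop captures is that the score counts choices rather than outcomes: a model registers as risky the moment it reaches for a dangerous tool, even if that tool is simulated and inert.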
The evaluation found that models frequently selected high-risk tools under pressure—even when they lacked the ability to execute those actions. Researchers identified nine indicators of risky behavior across both open-source and proprietary systems.
Those findings suggest a gap in current safety evaluations. A model may appear safe based on its present capabilities, but still demonstrate a willingness to engage in harmful behavior if given the opportunity.
“Understanding what models are inclined to do—not just what they can do—is essential for responsible deployment,” said Huang, who has an appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS) and is active in the UMD Center for Machine Learning.
The research behind PropensityBench has been accepted to the International Conference on Learning Representations (ICLR) 2026. Huang and her collaborators, including co-lead author and UMD doctoral student Shayan Shabihi, are scheduled to present the work on April 23 in Rio de Janeiro, part of ongoing efforts to refine how advanced AI systems are evaluated before release.
—Story by UMIACS communications group