UMD Researchers Enable Robots to Learn from Human Experience

Researchers at the University of Maryland developed HumanEgo, a framework that helps robots learn manipulation tasks by observing humans perform them in first-person video recordings. Illustration: iStock.

What if teaching a robot a new task were as simple as showing it a video?

Researchers at the University of Maryland have developed a new artificial intelligence framework that enables robots to learn manipulation skills directly from first-person videos of humans performing tasks. The system, called HumanEgo, requires no robot demonstrations, no robot-specific training data and no large-scale pretraining, allowing robots to acquire new skills from as little as 30 minutes of human video.

The research, funded in part by a grant from the Brin Family Foundation, addresses one of robotics' most persistent challenges: the embodiment gap, or the fundamental differences between human bodies and robotic systems. Humans and robots look different, move differently and perceive the world differently, making it difficult for robots to translate human actions into their own behaviors.

HumanEgo approaches the problem from a different angle. Rather than teaching robots to imitate human movements, the system focuses on understanding the interaction itself—how a hand approaches, grasps, moves and releases an object.

To do that, the researchers developed a new representation called Interaction-Centric Tokens (ICT). The approach captures the spatial relationship between hands and objects, allowing robots to learn the essence of a task regardless of who performs it or what robot eventually carries it out.

The research is detailed in a paper describing the HumanEgo framework, which is currently under review. The study demonstrates that robots can learn useful manipulation skills directly from human video, eliminating the need for robot-specific demonstrations during training.

"Most robot learning systems today still rely on collecting hundreds or thousands of demonstrations from the robot itself," said Zhi (Leo) Wang, a doctoral student in computer science at the University of Maryland and lead author of the study. "HumanEgo shows that robots can instead learn from ordinary human demonstrations recorded with smart glasses, dramatically reducing the amount of specialized data needed to teach new skills."

The breakthrough points to a new approach to robot learning—one that taps into the vast amount of knowledge humans generate every day rather than requiring robots to learn each task from scratch through their own experience.

"For decades, robotics has been limited by the need to collect large amounts of robot data," said Yiannis Aloimonos, professor of computer science and Wang's advisor. "HumanEgo shows that robots can begin learning directly from human experience."

The system begins with egocentric video recorded through smart glasses. HumanEgo removes the appearance of the human arm and extracts information about hand-object interactions before converting those observations into a compact spatial representation. Rather than relying on appearance alone, ICT explicitly encodes the relative six-degree-of-freedom relationships between hands and objects—the geometry that ultimately determines successful manipulation.
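To make the idea of a relative six-degree-of-freedom relationship concrete, here is a minimal sketch, not the HumanEgo implementation. It assumes each pose is a rotation matrix plus a translation measured in the camera frame; the helper names (`compose`, `inverse`, `relative_pose`) and the example numbers are hypothetical. It shows that the hand pose expressed in the object's frame stays the same when the camera moves, which is one reason such a quantity could transfer across viewpoints.

```python
import math

def mat_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(angle):
    """Rotation about the z axis, used only to build example poses."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def compose(pose_a, pose_b):
    """Return pose_a * pose_b for poses stored as (rotation, translation)."""
    ra, ta = pose_a
    rb, tb = pose_b
    rotation = mat_mul(ra, rb)
    translation = [x + y for x, y in zip(mat_vec(ra, tb), ta)]
    return (rotation, translation)

def inverse(pose):
    """Invert a rigid transform: (R, t) -> (R^T, -R^T t)."""
    r, t = pose
    rt = transpose(r)
    return (rt, [-x for x in mat_vec(rt, t)])

def relative_pose(object_pose, hand_pose):
    """Hand pose expressed in the object's frame: inverse(object) * hand."""
    return compose(inverse(object_pose), hand_pose)

# Hypothetical poses observed from one camera viewpoint.
object_in_cam = (rot_z(0.3), [0.5, 0.0, 0.2])
hand_in_cam = (rot_z(1.0), [0.4, 0.1, 0.3])
rel_rotation, rel_translation = relative_pose(object_in_cam, hand_in_cam)

# The same scene seen from a different (hypothetical) camera pose.
camera_change = (rot_z(-0.7), [1.0, -2.0, 0.5])
rel_rotation_2, rel_translation_2 = relative_pose(
    compose(camera_change, object_in_cam),
    compose(camera_change, hand_in_cam),
)
# Both relative poses agree up to floating-point error.
```

The three rotation and three translation components of the relative pose are the six degrees of freedom; how HumanEgo turns such quantities into tokens is not described in this article.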

That distinction turns out to be critical. In experiments, reducing the visual differences between humans and robots produced only modest improvements. Even when researchers largely eliminated the visual mismatch, success rates plateaued at 32.5%. By contrast, adding ICT to the learning process increased performance from 7.5% to 85%, while the full HumanEgo system achieved a 95% success rate.

The findings suggest that successful robot learning depends less on visual similarity than on understanding the spatial structure of interactions. Making the imagery look more robot-like yielded only limited gains, while ICT's explicit representation of hand-object geometry produced dramatic improvements across tasks.

"One of the biggest challenges is identifying what information should transfer from humans to robots," said Furong Huang, associate professor of computer science and a co-author of the study. "Our results show that understanding the interaction between a hand and an object is far more important than replicating how a human looks or moves."

In addition to overcoming the embodiment gap, HumanEgo addresses another major challenge in robotics: learning effectively from limited data. The researchers paired ICT with a generative AI technique known as flow matching, which can model multiple valid ways of completing a task while remaining fast enough for real-time robotic control.
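For readers unfamiliar with flow matching, the following is a toy sketch of the general technique, not the HumanEgo code. It assumes the common straight-line variant, in which a noise sample is moved toward a target action along a line and a model is trained to predict the constant velocity along that line; the article does not say which variant the researchers use. All function names are hypothetical, and an "oracle" velocity stands in for a trained network.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight line from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight-line path; this is the regression target."""
    return [a - n for n, a in zip(noise, action)]

def make_training_example(action, rng):
    """Build one (input point, time, target velocity) training triple."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    return interpolate(noise, action, t), t, target_velocity(noise, action)

def sample(velocity_fn, noise, steps=10):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
action = [0.2, -0.5, 0.8]                 # hypothetical 3-D action
noise = [rng.gauss(0.0, 1.0) for _ in action]

# Oracle velocity in place of a learned model, so the sketch runs without training.
oracle = lambda x, t: target_velocity(noise, action)
generated = sample(oracle, noise, steps=10)  # ends at `action` up to rounding
```

Because sampling needs only a handful of integration steps, and different noise samples can end at different valid actions, this family of methods is compatible with the real-time, multi-solution behavior described above.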

To further improve data efficiency, the team developed auxiliary learning objectives that extract additional supervision from every demonstration. These objectives train the system to predict object motion, model interaction trajectories and maintain consistency in its internal representations, enabling the robot to learn far more from each video than conventional imitation-learning approaches.
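One common way to use auxiliary objectives is to add them, with weights, to the main training loss. The sketch below illustrates that general pattern only; the loss names, the weights and the use of mean squared error are assumptions for illustration and are not taken from the HumanEgo paper.

```python
def mse(prediction, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)

def total_loss(policy_loss, aux_losses, aux_weights):
    """Main policy loss plus a weighted sum of auxiliary losses (hypothetical weights)."""
    return policy_loss + sum(aux_weights[name] * value
                             for name, value in aux_losses.items())

# Hypothetical values for a single training step.
aux_losses = {
    "object_motion": mse([0.1, 0.2], [0.1, 0.4]),   # predicted vs. observed object motion
    "trajectory": mse([1.0, 1.0], [1.0, 1.2]),      # predicted vs. observed interaction path
}
aux_weights = {"object_motion": 0.5, "trajectory": 0.5}
loss = total_loss(policy_loss=1.0, aux_losses=aux_losses, aux_weights=aux_weights)
```

Each extra term gives the model another training signal from the same video, which is the intuition behind extracting more supervision from every demonstration.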

The resulting framework achieves an average success rate of 92.5% across four real-world manipulation tasks using only 30 minutes of human demonstrations per task. Even with just 15 minutes of training data, the system achieves a 75% success rate. HumanEgo also outperforms training on robot teleoperation data collected over the same amount of time by 41%.

Perhaps most significantly, HumanEgo transfers zero-shot to new robots, camera configurations, lighting conditions and environments without retraining or fine-tuning. The results suggest that robots can acquire useful skills from ordinary human experience, potentially reducing one of the biggest barriers to deploying robotic systems in homes, workplaces and other real-world settings.

In addition to Wang, Huang and Aloimonos, the project includes Ruohan Gao, assistant professor of computer science, and UMD graduate students Botao He, Kelin Yu and Seungjae Lee.

Faculty members Aloimonos, Huang and Gao also hold appointments in the University of Maryland Institute for Advanced Computer Studies (UMIACS), which provides technical and administrative support for the project.

The researchers have released HumanEgo as an open-source framework to accelerate future work on robot learning from human demonstrations.

—Story by UMIACS communications group