PROVIDENCE, R.I. [Brown University] — Whether in the kitchen or on a workshop floor, robot assistants that can fetch items for people could be extremely useful. Now, a team of Brown University researchers has developed a way of making robots better at figuring out exactly which items a user might want them to retrieve.
The new approach enables robots to use inputs from both human language and gesture as they reason about how to locate and retrieve target objects. In a study that will be presented on Tuesday, March 17, during the International Conference on Human-Robot Interaction in Edinburgh, Scotland, the researchers show that the approach had an 89% success rate in finding the correct object in complex environments, outperforming other object retrieval approaches.
“Searching for things requires a robot to navigate large environments,” said Ivy He, a graduate student at Brown and the study’s lead author. “With current technology, robots are pretty good at identifying objects, but when the environment is cluttered, things are moving around or things are hidden by other objects, that makes things much more difficult. So this work is about using both language and gesture to help in that search task.”
The research makes use of an approach to robot planning called a POMDP (partially observable Markov decision process), a mathematical framework that allows a robot to reason under uncertainty. In the real world, robots rarely have a perfect understanding of the world. Different types of objects can look similar. There may be more than one of a particular object in a room. Items might be partially or completely hidden from view.
To succeed in a search, a robot has to act even when it isn’t sure what it’s seeing. Without a way to manage that uncertainty, it might freeze. Or worse, it might make overconfident final decisions based on incomplete information. A POMDP handles those ambiguities within a probabilistic framework that helps the robot track how confident it is about what’s in the world and update those beliefs as new information arrives, including information from large vision and language models. In the process, it can choose actions that help it learn more, such as moving to get a better view, before committing to a final decision.
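As a rough illustration of the belief-tracking idea described above, here is a minimal Bayesian-update sketch over a handful of candidate object locations. The function, the prior and the likelihood values are assumptions made for illustration; they are not the system described in the study.

```python
# Minimal sketch of a POMDP-style belief update over candidate object locations.
# Illustrative only: the prior and observation likelihoods below are invented,
# not taken from the study.

def update_belief(belief, observation_likelihoods):
    """Bayesian update: belief[i] is the probability the target is at location i;
    observation_likelihoods[i] is P(observation | target at location i)."""
    posterior = [b * l for b, l in zip(belief, observation_likelihoods)]
    total = sum(posterior)
    if total == 0:
        return belief  # the observation ruled everything out; keep the prior
    return [p / total for p in posterior]

# Three candidate locations, uniform prior.
belief = [1 / 3, 1 / 3, 1 / 3]

# A detector (for example, a vision model) reports the target is probably at location 0.
belief = update_belief(belief, [0.7, 0.2, 0.1])
print(belief)  # belief concentrates on location 0 but keeps some uncertainty
```

In a full POMDP, the robot would also weigh which action, such as moving for a better view, most reduces the remaining uncertainty before it commits to a final choice.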
The innovation in this latest research is a POMDP that incorporates inputs from both language and human gestures, such as pointing toward the object of interest. To incorporate the gesture component, He drew on insights from a Brown laboratory led by Associate Professor of Cognitive and Psychological Sciences Daphna Buchsbaum into how dogs, the undisputed world champions of fetch, interpret human pointing.
Building on this expertise, He and Ph.D. student Madeline Pelgrim performed a study of the finer points of human pointing, as well as how dogs interpret pointing gestures. The study helped He to model the target of a pointing gesture within a cone of probability.
“What we have found is that humans use eye gaze to align with what they’re pointing to,” He said. “So it was natural to create a cone based on a line connecting the eye to the elbow to the wrist. That turns out to be a fairly good approximation of where someone is pointing.”
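The article describes modeling the target of a pointing gesture as a cone of probability anchored on the line from the eye through the arm. The sketch below shows one way such a cone could score candidate objects, assuming a Gaussian falloff with the angle from the pointing ray; the function name, the eye-to-wrist ray, the falloff shape and the 15-degree width are illustrative assumptions, not the parameters used in the study.

```python
import numpy as np

# Hedged sketch: score candidate objects against a pointing "cone" defined by the
# ray from the eye through the wrist. The Gaussian angular falloff and its width
# are assumptions for illustration, not the study's fitted parameters.

def pointing_scores(eye, wrist, candidates, sigma_deg=15.0):
    ray = (wrist - eye) / np.linalg.norm(wrist - eye)  # unit pointing direction
    scores = []
    for obj in candidates:
        to_obj = (obj - eye) / np.linalg.norm(obj - eye)
        angle = np.degrees(np.arccos(np.clip(ray @ to_obj, -1.0, 1.0)))
        scores.append(np.exp(-0.5 * (angle / sigma_deg) ** 2))  # falls off away from the ray
    scores = np.array(scores)
    return scores / scores.sum()  # normalize into a probability over candidates

eye = np.array([0.0, 0.0, 1.6])    # rough eye height, in meters
wrist = np.array([0.4, 0.1, 1.2])  # extended arm
objects = [np.array([2.0, 0.5, 0.8]), np.array([1.5, -1.5, 0.8])]
print(pointing_scores(eye, wrist, objects))
```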
Buchsbaum added, “Our work in the Brown Dog Lab has shown just how sophisticated dogs are in their communication with humans, solving many of the cooperation problems we want robots to solve. This makes them a natural model for intuitive human-non-human cooperation. This work translates the dog’s intuitive understanding of human gaze and pointing into a probabilistic model, which allows the robot to handle the ambiguity inherent in human communication. It moves us closer to truly intuitive robotic assistants.”
He then combined the gesture model with a vision-language model, or VLM, an AI system designed to interpret visual scenes together with natural language descriptions. The result was a POMDP capable of incorporating both language and gesture for robot planning.
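One simple way to picture that combination is to treat the spoken description and the pointing gesture as two separate observations and fold both into the same belief update. The sketch below does that under an assumed independence between the two cues; the function name and the fusion rule are illustrative, not the paper’s exact model.

```python
# Hedged sketch of fusing language and gesture evidence into one belief update.
# Assumes, for illustration, that the two cues are independent given the target.

def fuse_cues(belief, language_scores, gesture_scores):
    """language_scores could come from a vision-language model rating how well each
    detected object matches a phrase like "the red mug"; gesture_scores could come
    from a pointing-cone model like the one sketched above."""
    fused = [b * l * g for b, l, g in zip(belief, language_scores, gesture_scores)]
    total = sum(fused)
    return [f / total for f in fused] if total > 0 else belief
```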
In lab experiments, the researchers asked a quadruped robot to find various objects scattered around the lab space. The experiments showed that the robot was able to locate the correct object nearly 90% of the time using combined gesture and language, far better than with either input alone.
For He and her coauthors, the research is a step toward robots that are able to operate side-by-side with people at home and in the workplace.
“The framework we developed helps pave the way for seamless multimodal human-robot interaction,” said research co-author Jason Liu, a postdoctoral researcher at MIT who worked on the project while completing his Ph.D. at Brown. “In the future, we can communicate with our assistant robots the same way people interact through language, gestures, eye gazes, demonstrations and much more.”
The work was supported through Brown’s AI Research Institute on Interaction for AI Assistants (ARIA), which is funded by the National Science Foundation.
"This is a really great illustration of how we can enable more natural and effective human-machine interaction by strengthening collaborations between computer science and cognitive science,” said Ellie Pavlick, an associate professor of computer science at Brown who leads ARIA. “Embracing what we know about how humans naturally want to communicate, and building systems aligned with those human tendencies and intuitions about behavior, is the right way forward.”
The work was supported by the National Science Foundation (2433429) and the Long-Term Autonomy for Ground and Aquatic Robotics program (GR5250131), and by the Office of Naval Research (N00014-24-1-2784, N00014-24-1-2603).
LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments
17-Mar-2026