What Can a Computer Learn from a Baby?

Image courtesy of Catherine Kwon.

Before we turn three months old, we have already developed an intuitive sense of how the physical world works. If an infant knocks over a block tower, they know the blocks will tumble spectacularly down to the ground. If the blocks float in the air or fall straight through the floor, the infant might cry out in surprise.

This intuition is dubbed our ‘intuitive physics engine,’ and it is fundamental to both biological and artificial intelligent systems operating in the real world. Understanding how a system’s physical actions will affect the world around it can guide the decisions and movements it must execute to carry out its intentions. In biological systems, the intuitive physics engine develops extremely rapidly, suggesting its importance for survival in the physical world.

Despite the ubiquity of intuitive physics in intelligent biological organisms, the best artificial intelligence (AI) systems still struggle to replicate the understanding of physics that even very young children have mastered. This difficulty persists despite great progress in the field: AI systems easily best humans in complex games, like chess and Go, and have solved some of the most complicated scientific problems, like protein folding. The challenge of teaching intuitive physics to an artificial system lies in its pervasiveness. “Intuitive physics is everywhere, and when something is everywhere, it becomes hard to analyze because it’s interacting with so many things,” said Luis Piloto, a research scientist at DeepMind, a subsidiary of Alphabet, Google’s parent company. However, Piloto’s team has recently made great strides toward finding a solution by taking inspiration from the methods and findings of developmental psychology, ultimately creating an AI system that learns intuitive physics from visual data.

Perhaps the most important novelty of Piloto’s work was how their AI model’s understanding of intuitive physics was probed and evaluated. In developmental psychology, intuitive physics is separated into several distinct concepts, such as object permanence or object solidity. For each concept tested individually, human subjects are shown relevant scenes that are either consistent or inconsistent with the concept of interest. If subjects show surprise after seeing inconsistent scenes, which is usually measured by gaze duration, there is evidence that the subject understands that concept. This method of evaluating intuitive physics knowledge is known as the violation-of-expectation (VoE) paradigm.

Inspired by these methods used in developmental psychology, Piloto and his team constructed the Physical Concepts dataset. This dataset contains videos generated by a physics engine, each consistent or inconsistent with one of five distinct concepts from intuitive physics. These concepts are object permanence (objects will not simply disappear), object solidity (objects will not pass through one another), continuity (objects will have continuous paths and cannot teleport from one place to another), unchangeableness (objects retain their properties over time), and directional inertia (objects will stay in their path unless a force acts upon them). Every video that abided by a concept was paired with a visually similar one that violated that concept, both starting with identical scenes but deviating over the course of the video. The amount of ‘surprise’ that a model exhibited was determined by the model’s prediction error, that is, how much a future scene predicted by the model differs from the real future scene in the accompanying video. Thus, the model’s understanding of a concept can be evaluated by comparing the model’s surprise in response to the physically plausible and implausible videos in each pair.
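To make the idea of surprise as prediction error concrete, here is a minimal Python sketch. It is a toy illustration, not the team's model or dataset: the "videos" are just lists of one object's position over time, and `predict_next` is a hypothetical stand-in predictor that assumes constant velocity. The two clips start identically, and the implausible one violates continuity by teleporting.

```python
def predict_next(prev, curr):
    """Hypothetical stand-in predictor: assume the object keeps its current velocity."""
    return curr + (curr - prev)


def surprise(positions):
    """Sum of squared errors between predicted and observed positions over a clip."""
    total = 0.0
    for t in range(2, len(positions)):
        predicted = predict_next(positions[t - 2], positions[t - 1])
        total += (predicted - positions[t]) ** 2
    return total


# A matched pair of toy clips: identical at first, then one breaks continuity.
plausible = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
implausible = [0.0, 1.0, 2.0, 3.0, 7.0, 8.0]  # object "teleports" after the fourth frame

s_plausible = surprise(plausible)
s_implausible = surprise(implausible)

# A positive difference means the model was more surprised by the violation.
voe_effect = s_implausible - s_plausible
print(s_plausible, s_implausible, voe_effect)
```

A model that had not picked up on continuity would show little or no difference between the two clips, which is why the comparison within a matched pair, rather than the raw error on either clip, is the quantity of interest.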

This VoE paradigm is a departure from standard methods of evaluating AI’s performance on intuitive physics tasks. One common approach uses video prediction on physically plausible situations alone to evaluate learning progress. In the VoE paradigm, by contrast, the model should, in theory, make incorrect predictions about physically implausible videos, enabling researchers to better understand whether a concept is truly being learned. Another common approach employs reinforcement learning tasks, whereby models plan actions to interact with the environment around them to receive a reward. However, the complexity of these tasks makes it difficult to isolate the true cause of failure because success requires both intuitive physics knowledge and knowledge of how to navigate the given space.

“If we want to evaluate intuitive physics knowledge, let’s break it down into these different concepts, and let’s build stimuli that are really about the concepts… You can do well on benchmarks, but if they don’t reflect the capabilities that you’re actually trying to measure, then increasing your performance on those benchmarks doesn’t necessarily get you closer to the capabilities that you want,” Piloto explained.

The next key insight from developmental psychology incorporated by Piloto’s team was an object-based conception of physics. Infant intuitive physics behavior involves segmenting the visual field into distinct objects with their own properties (object individuation), tracking these objects across space and time (object tracking), and then processing how these objects interact with each other (relational processing). These three processes were implemented in a model called Physics Learning through Auto-encoding and Tracking Objects, nicknamed PLATO. Rather than only looking at patterns of pixels in a visual scene as other visual prediction models do, each frame that PLATO processes is broken down by masking specific parts of the scene so that the model can learn representations of individual objects. Indices assigned to each object enable them to be tracked through time. Lastly, a separate module is used to process how these objects interact with one another and predict future scenes.
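The three stages described above can be sketched in a few lines of Python. This is only a toy illustration of the flow from individuation to tracking to relational processing, not PLATO's neural architecture: the mask format, the centroid representation, the constant-velocity prediction, and all function names are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def individuate(masks):
    """Object individuation: reduce each object's binary mask to a centroid (row, col).

    `masks` maps an object index to a 2D list of 0/1 values (a hypothetical format).
    """
    objects = {}
    for index, mask in masks.items():
        cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
        n = len(cells)
        objects[index] = (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)
    return objects


def track(frames_of_masks):
    """Object tracking: follow each index across frames to build trajectories."""
    tracks = {}
    for masks in frames_of_masks:
        for index, centroid in individuate(masks).items():
            tracks.setdefault(index, []).append(centroid)
    return tracks


def predict_next(tracks):
    """Crude prediction step: extrapolate each object at constant velocity."""
    predicted = {}
    for index, path in tracks.items():
        (r0, c0), (r1, c1) = path[-2], path[-1]
        predicted[index] = (r1 + (r1 - r0), c1 + (c1 - c0))
    return predicted


def pairwise_distances(predicted):
    """Stand-in for relational processing: how close will each pair of objects be?"""
    return {
        (a, b): dist(predicted[a], predicted[b])
        for a, b in combinations(sorted(predicted), 2)
    }


# Two frames of a 1x4 scene: object 0 moves right, object 1 stays put.
frames = [
    {0: [[1, 0, 0, 0]], 1: [[0, 0, 0, 1]]},
    {0: [[0, 1, 0, 0]], 1: [[0, 0, 0, 1]]},
]

tracks = track(frames)
predicted = predict_next(tracks)
distances = pairwise_distances(predicted)
print(predicted, distances)
```

The point of the sketch is the representation: once a scene is described as a handful of indexed objects rather than a grid of pixels, questions such as whether two objects are about to occupy the same space become simple comparisons between objects.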

After just twenty-eight hours of visual training with physically plausible videos from the Physical Concepts dataset, PLATO demonstrated a grasp of all five concepts by exhibiting greater surprise in response to physically implausible videos. In this respect, it outperformed AI models that do not rely on object-based representations. The model also performed well on a different dataset developed independently by a team at the Massachusetts Institute of Technology, suggesting that PLATO’s understanding of intuitive physics is robust.

The quest to build an AI system that can learn intuitive physics is far from over. While Piloto’s team at DeepMind has taken a great step forward, there is still much room for improvement. In particular, PLATO did not learn how to segment and label the visual field into distinct objects by itself. Instead, the researchers spoon-fed the model a series of masks that told it where each object was. Other recent work has successfully tackled this challenge, introducing methods for object discovery in an unsegmented visual field. Integrating this research with the relational processing module of PLATO would result in a seamless model that can understand intuitive physics with nothing but a video. 

PLATO’s success as a machine learning model inspired by biological brains speaks to artificial intelligence’s close relationship with neuroscience and psychology. “AI and neuroscience are attacking the same problem from different sides,” Piloto said. “The analogy that I like to use is that AI is building intelligence from scratch, and neuroscientists are saying, ‘Wait, hold on, we’ve got this intelligent system right here, why don’t we try and reverse-engineer what’s going on?’”

Although PLATO is far from being an accurate model of intuitive physics learning in children, the study still has important implications for developmental psychology. PLATO’s success suggests that intuitive physics knowledge is not necessarily innate: it can be rapidly acquired through visual learning. Additionally, Piloto’s team proposes using models like PLATO to investigate the order in which different intuitive physics concepts are acquired throughout development. This study demonstrates that the brain sciences and artificial intelligence have much to gain through work at the intersection of the two fields.