The Case Against Intelligent Computer Vision

Convolutional neural networks, or CNNs, are deep learning networks trained with millions of images. Designed to imitate primate brains, they proved highly adept at object recognition, sparking media excitement over the future of computer vision—the use of AI to interpret visual input. Moreover, researchers hoped CNNs could offer a shortcut to studying the primate brain. Rather than undertake copious MRI scans and patient trials, scientists could run a simulation through a CNN to predict how the human brain would respond.

But the research of Yaoda Xu, a senior research scientist at Yale, suggests otherwise. Ten years ago, a perfect visual recognition system and its manifold implications (think driverless cars!) seemed to loom on the brink of discovery. Today, those possibilities seem as distant as ever. “People got excited about using the CNN to model the brain,” Xu said. “But my findings have been that, no, it doesn’t look like the brain. It’s maybe a primitive, overdeveloped, early visual area of the brain.”

As her recent paper published in NeuroImage clarifies, CNNs fall short at forming transformation-tolerant representations. The process sounds complex, but it’s something humans carry out every day: when a person walks toward a table and sees it enlarge in their field of vision, they know it’s the same table as before. The same goes for objects viewed from different perspectives or positions: even when an object's appearance changes, the brain maps it onto the same visual identity.

This intuitive process proves much more challenging for neural networks. In her project, Xu took images of eight real-world objects, ranging from a pair of scissors to an elephant, and altered them in various ways. Some she transformed geometrically, shifting them up or down or changing their size. Others she subjected to non-Euclidean transformations, changing the contrast and resolution. She then tested the images on eight different CNNs. In each case, the networks showed weaker consistency and tolerance for these images than the human brain does.
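To make the idea of transformation tolerance concrete, here is a toy sketch in Python. It is not Xu's analysis and involves no CNN or brain data. It assumes a tiny hand-made binary image, a horizontal shift as the only transformation, and a correlation score as a stand-in measure of tolerance. All function names (`shift_right`, `pixel_features`, `pooled_features`, `tolerance`) are hypothetical and chosen for illustration only.

```python
from math import sqrt

def shift_right(image, dx):
    """Circularly shift every row dx pixels to the right (a change of position)."""
    width = len(image[0])
    dx %= width
    return [row[width - dx:] + row[:width - dx] for row in image]

def pixel_features(image):
    """Position-bound representation: every pixel is its own feature."""
    return [float(p) for row in image for p in row]

def pooled_features(image):
    """Position-pooled representation: sum each row, discarding horizontal position."""
    return [float(sum(row)) for row in image]

def pearson(a, b):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def tolerance(features, image, dx):
    """Assumed tolerance score: how similar the representation stays after a shift."""
    return pearson(features(image), features(shift_right(image, dx)))

# A toy 4x6 "object": three lit pixels in the left part of the frame.
image = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print("pixel tolerance: ", tolerance(pixel_features, image, 3))
print("pooled tolerance:", tolerance(pooled_features, image, 3))
```

In this toy case the pooled representation scores 1.0, since row sums do not change when the object moves sideways, while the pixel representation scores about -0.14, since the shifted object no longer overlaps the original. The contrast is only meant to illustrate what "tolerant" means. It says nothing about how real CNNs or brains compute their representations.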

The implication is striking: a process trivial for primate brains remains elusive for the complex, pre-trained machines meant to model them. Xu attributes this discrepancy to the mechanics of human cognition versus machine learning. The primate brain processes visual information through two streams: dorsal, which recognizes the object’s spatial location (the “where”), and ventral, which recognizes the object’s identity (the “what”). Though the two streams may seem redundant, together they give rise to the ability to identify the same object in different contexts.

The CNN, in contrast, employs a sub-optimal approach. “In my view, it basically has a huge amount of memory,” Xu said. “It memorizes each instance of each object it was exposed to, without making a connection among these different objects.” Scientists are unsure precisely how this algorithm works, which is why it is often called a “black box.” But Xu is hopeful about cracking it, provided the scientific community reframes its approach. She plans to delve deeper into neuroscience research, seeing where and how primate vision diverges from neural networks, to shed light on the CNN algorithm and identify stages for improvement. Importantly, she believes the key lies in crafting a comprehensive biological understanding of vision rather than tackling the problem solely through computer engineering.

She compared this pursuit to trying to replicate flight: someone can blindly tweak the wing, fold a new flap, and throw everything against the wall until something flies or falls off a cliff. Or someone can investigate how flight works, learning the fundamental aerodynamics and physics that drive movement to find inspiration for an airplane. “What is vision trying to achieve? What is the problem you’re trying to solve?” Xu asked. She expressed aversion toward the trial-and-error experimentation employed by many computer science labs. “I’m showing you that, hey, this is the algorithm and computation that’s happening in the brain. If the system you’re building can have the same principles, maybe you can do a lot better than what you have right now.”

Xu looks toward a future where artificial networks could perfectly mimic human vision. She recalled how, as a student growing up in China, she spent entire weekends hand-washing her clothes. When the washing machine mechanized the process, her free time could be put toward more valuable endeavors, like advancing her research career. “There’s a lot of human potential that is untapped,” Xu said. “If some of our boring tasks can be done by a machine efficiently with this kind of visual intelligence, it could lead to another leap in human development. We could have the creativity to be who we want or to be the best version of ourselves.”