
Eye, Robot: The Turing Test for Computer Vision

It's like "Guess Who," but for computers.

Along with being waterproof, one other thing that people can do better than computers is look at a picture and identify what's going on in it. But in an effort to expedite our irrelevance, researchers from Brown University just published a paper in the Proceedings of the National Academy of Sciences that outlines a sort of visual Turing Test that will help scientists chart how well computers are "understanding" what it is that they're seeing.


As a great Wired article from January outlined, computers and the mathematical frameworks they operate on can capably identify dogs in hats, but even the latest state-of-the-art artificial intelligence can mistake alternating black and yellow solid stripes for a school bus.

One of the interesting things that this article laid out is how much was going on that the computer scientists themselves didn't understand, because the models are increasingly complicated and increasingly learning on their own.

"There's millions of neurons and they're all doing their own thing. And we don't have a lot of understanding about how they're accomplishing these amazing feats," Jeff Clune, head of the Evolving Artificial Intelligence Laboratory at the University of Wyoming, told Wired.

In order to better chart progress, researchers from Brown and Johns Hopkins University set out to make what they're calling a "visual Turing test." It scores image-identifying software models based on how well they do on a series of yes-or-no questions that ask not only what's in the image, but also how whatever's in the image relates to other people and objects in the image.

So rather than just testing whether a computer can see someone, the visual Turing test works its way through a series of yes-or-no questions, sort of like it's playing the board game "Guess Who?"

"The tests we have [for visual learning] don't really scale up. The bar is too low," Donald Geman, one of the study's coauthors, told me. "A lot of the competitions will just be based on how well you can detect and put a box around bicycles or cars or people."

That's all well and good if you just need identifying software. But to get to anything like "artificial intelligence," the goal is to get closer to how humans actually describe what they see.

"When people parse images, they do so at a much deeper and richer level—attributes people have, relationships—we tell a story about the picture," Geman said. "People say an image is worth a thousand words, it's not just labels and nouns. It's description of what's going on on a deep semantic level."

Each question in the new test belongs to one of four categories: existence questions (is it there?); uniqueness questions (is something else there?); attribute questions; or relationship questions.

"The goal of the existence and uniqueness questions is to instantiate objects, which are then labeled (person 1, vehicle 3,…) and subsequently available, by reference to the label, in attribute and relationship questions ('Is person 1 partially occluding vehicle 3?')," the study states.
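To make that dependency concrete, here is a minimal Python sketch of how the four question categories and the object labels might fit together. It is an illustration only, not the study's code: the names `Kind`, `Question`, `refers_to`, and `is_askable` are all hypothetical, and the only rule it encodes is the one quoted above, that a question can refer to an object only after that object has been instantiated and labeled.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """The four question categories named in the article."""
    EXISTENCE = "existence"
    UNIQUENESS = "uniqueness"
    ATTRIBUTE = "attribute"
    RELATIONSHIP = "relationship"


@dataclass(frozen=True)
class Question:
    """A yes-or-no question (hypothetical structure, for illustration)."""
    kind: Kind
    text: str
    refers_to: tuple = ()  # labels of objects the question mentions


def is_askable(question, instantiated):
    """A question is askable once every label it mentions has been instantiated."""
    return all(label in instantiated for label in question.refers_to)


# Only "person 1" has been instantiated so far.
instantiated = {"person 1"}
q = Question(
    Kind.RELATIONSHIP,
    "Is person 1 partially occluding vehicle 3?",
    ("person 1", "vehicle 3"),
)
print(is_askable(q, instantiated))  # False: "vehicle 3" is not labeled yet

# After an existence or uniqueness question instantiates the vehicle:
instantiated.add("vehicle 3")
print(is_askable(q, instantiated))  # True
```

In this toy version, existence and uniqueness questions mention no labels, so they can always be asked; attribute and relationship questions have to wait for their objects.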

"The questions are adaptive," Geman said. "The next question in the series is based not only on the answers, but also on the previous questions that have come up."

If the idea of computers quizzing each other makes you feel threatened, don't worry; there's still room for humans in this visual Turing test.

"You need a human in the loop to provide the true answers," Geman said. "The vision system answers and it could be right or wrong and at some point the vision system is generally incapable of providing answers because the questions go beyond their capability."
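As a rough illustration of that human-in-the-loop setup, the Python sketch below compares a vision system's yes-or-no answers with the answers a human supplies and reports the fraction that match. This is an assumption-heavy toy, not the scoring used in the study: the function `score`, the sample questions, and the idea of scoring by simple agreement are all made up here for illustration.

```python
def score(questions, system_answer, human_answer):
    """Return the fraction of questions where the system matches the human.

    `system_answer` and `human_answer` are callables that map a question
    to True (yes) or False (no). The human's answer is treated as the truth.
    """
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if system_answer(q) == human_answer(q))
    return correct / len(questions)


# Hypothetical questions and answers, for illustration only.
questions = [
    "Is there a person in the image?",
    "Is person 1 carrying something?",
    "Is person 1 partially occluding vehicle 3?",
]
human = dict(zip(questions, [True, False, True]))    # true answers from a person
system = dict(zip(questions, [True, False, False]))  # the vision system's guesses

result = score(questions, system.get, human.get)
print(f"{result:.2f}")  # 0.67: the system got two of three right
```

Here the system handles the existence and attribute questions but misses the relationship question, the kind of gap the test is meant to expose.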

So humans remain the standard by which computer vision systems are judged, at least for a little while longer. The visual Turing test, though, is another step towards making computers better able to see and describe. At least, for now, we'll still be more waterproof.