In 2012, a "neural network" running on an array of 16,000 computer processors taught itself to recognize a cat. The technology has come a long way since then. Networks of digital neurons are now able to analyze and caption not just single objects, but entire bustling scenes.
On Monday, researchers at Stanford's Computer Vision Lab and Google Brain—the unofficial title for the artificial intelligence branch of Google X—separately announced that they've trained neural networks to describe complex photos with impressive accuracy and depth using machine learning and pattern recognition. The Stanford team's paper can be read here, and Google's here.
There was apparently no coordination between the teams, even though their results were similar. Andrej Karpathy, the lead author of the Stanford paper, told me he hadn't heard of the Google effort, or any of the many similar research projects under way at various universities, until very recently.
It's an insane coincidence that the research announcements came so hot on each other's heels, but in many ways an understandable one. Neural networks are poised to improve how we catalogue, annotate, and search data—hence the interest from Google and Chinese search engine giant Baidu.
Many experts are exploring linking multiple neural networks to get the most out of their specialized capabilities.
"Neural networks can be plugged into one another in a very natural way," Karpathy told me. "So we simply take a convolutional neural network, which understands the content of images, and then we take a recurrent neural network, which is very good at processing language, and we plug one into the other. They speak to each other—they can take an image and describe it in a sentence."
Linking two neural networks is a new and clever improvement over previous methods of image recognition that relied on pre-programmed classifiers merely being fed into one network or another. The convolutional neural network, which consists of 60 million nodes, learns the names of objects in a photo, and the recurrent neural network, which consists of 50 million nodes, learns how to spin those terms into a sentence. The researchers really only had to provide the training data.
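To make the "plug one into the other" idea concrete, here is a toy sketch in Python. It is not the Stanford or Google code: the vocabulary, the layer sizes, the random untrained weights, and every function name (`encode_image`, `decode_caption`, `caption_image`) are assumptions made up for illustration. A single dense layer stands in for the convolutional network, and a bare-bones recurrent loop stands in for the language model. The only point is the wiring: the image features become the starting state of the sentence generator.

```python
import math
import random

# Hypothetical toy vocabulary; real systems learn thousands of words.
VOCAB = ["<end>", "a", "man", "is", "playing", "guitar"]


def make_matrix(rows, cols, rng):
    """Random weights; a trained system would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_image(pixels, w_img):
    """Stand-in for the convolutional network: pixels -> feature vector."""
    return [math.tanh(v) for v in matvec(w_img, pixels)]


def decode_caption(features, w_hh, w_xh, w_out, embed, max_len=8):
    """Stand-in for the recurrent network: greedy, word-by-word decoding."""
    hidden = list(features)          # image features seed the hidden state
    x = [0.0] * len(embed[0])        # "start of sentence" input
    caption = []
    for _ in range(max_len):
        pre = [a + b for a, b in zip(matvec(w_hh, hidden), matvec(w_xh, x))]
        hidden = [math.tanh(p) for p in pre]
        scores = matvec(w_out, hidden)
        best = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[best] == "<end>":
            break
        caption.append(VOCAB[best])
        x = embed[best]              # feed the chosen word back in
    return caption


def caption_image(pixels, seed=0):
    """Wire the two toy networks together and return a list of words."""
    rng = random.Random(seed)
    hidden_size, embed_size = 4, 3
    w_img = make_matrix(hidden_size, len(pixels), rng)
    w_hh = make_matrix(hidden_size, hidden_size, rng)
    w_xh = make_matrix(hidden_size, embed_size, rng)
    w_out = make_matrix(len(VOCAB), hidden_size, rng)
    embed = make_matrix(len(VOCAB), embed_size, rng)
    return decode_caption(encode_image(pixels, w_img), w_hh, w_xh, w_out, embed)


if __name__ == "__main__":
    print(caption_image([0.1, 0.5, 0.9, 0.3]))
```

Because the weights are random rather than trained, the words that come out are gibberish. In the systems described here, training on captioned photos is what turns that same wiring into sensible sentences.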
But the method still has some issues.
Given a photo of a baby and its mother playing with blocks, for example, the Stanford researchers' neural network described the scene as, "Two young girls are playing with legos [sic] toy." Close, but no e-cigar.
When the same photo was given to real people to caption on Mechanical Turk, Amazon's crowdsourced labour platform, the baby and its mother were correctly identified.
Similarly, given a photo of a Dave Matthews-looking dude tuning up his acoustic guitar, the system provided this caption: "Man in black shirt is playing guitar." The human caption makers noted that the man in the photo was in fact tuning, not playing.
Google's neural network displayed the same kinds of hiccups: a yellow car was described as a school bus, a hot pink scooter became a red motorcycle, and a BMX rider was mistaken for a skateboarder.
"Right now, all of these methods look at the entire image in a single time, and they generate a description from it. They don't see too much detail in the image, they only get the gist, or the idea, of an image," Karpathy said. "We really have a long way to go to fully understanding and breaking down all the objects and understanding their connections."
The approaches can be improved by expanding the set of training images, Karpathy said. With more data to work from, neural networks can make more accurate inferences about what they're "seeing" in a new image. But the size and number of training data sets are only part of the problem.
Computational ability is one severe limitation of neural networks at the moment, and one that is currently being addressed by AI firm DeepMind Technologies with its "Neural Turing Machine." By combining a deep learning neural network with a read-write memory store, DeepMind researchers created a digital facsimile of working memory. (Although DeepMind is also financed by Google, the project is unrelated to Google Brain's image recognition research.)
The ability to store complex variables like nouns, names, and verbs en masse for later use could vastly improve a neural network's ability to learn and reproduce complex relationships between words and objects.
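For a rough sense of what a read-write memory store can look like, here is a minimal Python sketch. It is not DeepMind's implementation. It assumes one common design, in which memory is a small table of number rows, a lookup key is compared against every row, and reads and writes are spread across rows according to how well they match. The function names (`address`, `read`, `write`) and the `sharpness` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def address(memory, key, sharpness=10.0):
    """Turn a lookup key into soft weights over memory rows (softmax)."""
    scores = [sharpness * cosine(row, key) for row in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, weights):
    """Weighted blend of the rows: mostly the best-matching row."""
    width = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]


def write(memory, weights, value):
    """Blend a new value into each row in proportion to its weight."""
    return [
        [(1 - w) * old + w * new for old, new in zip(row, value)]
        for w, row in zip(weights, memory)
    ]


if __name__ == "__main__":
    memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    weights = address(memory, [0.0, 1.0, 0.0])
    print(read(memory, weights))
    print(write(memory, weights, [5.0, 5.0, 5.0]))
```

Because every step is a smooth blend rather than a hard lookup, a network attached to a store like this can in principle learn by trial and error what to save and what to retrieve.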
Recurrent neural networks, the linguistic processor in the system, currently have a limited ability to store information, Karpathy said, and DeepMind's advance could very well allow neural networks to expand their knowledge about objects and their connections.
Despite decades of work, neural networks are still at a fairly early stage of their development, no matter what Elon Musk says. For the reasons why, refer to the title and ethos of Motherboard's new series, Science Is Really Hard.
"These are really the baby steps," Karpathy said.