Samim Winiger spends a lot of his time playing around with neural networks, primitive artificial brains trained to produce some kind of output on their own, whether that's image recognition or text generation. In his latest project, he's been working with the Image Captions Network to identify events in videos. If you're scared of the rise of AI, this is the kind of thing that can put your mind at rest for a few years.
Convolutional neural networks, the kind of model behind Google's Deep Dream, process an image and search for objects they have been trained to recognize. In the Generating Captions project, Winiger pointed one at videos instead of still images. The Image Captions Network also has a recurrent neural network designed specifically to generate captions for what it sees in the video.
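For readers who want to see the shape of that two-part setup, here is a rough sketch in Python. It is not Winiger's code or the Image Captions Network itself: the vocabulary, the weights, the tiny fake "frame" and every function name are made up for illustration, and nothing is trained, so the caption it prints is nonsense. It only shows the flow: a convolutional step turns a frame into a handful of numbers, and a recurrent step turns those numbers into words, one at a time.

```python
import math
import random

# Hypothetical toy vocabulary; a real captioning model learns thousands of words.
VOCAB = ["<start>", "<end>", "a", "man", "dog", "riding", "skateboard"]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def conv2d_relu(frame, kernel):
    """Slide a square kernel over a 2D frame and clip negative responses to zero."""
    k = len(kernel)
    out = []
    for i in range(len(frame) - k + 1):
        row = []
        for j in range(len(frame[0]) - k + 1):
            s = sum(kernel[a][b] * frame[i + a][j + b]
                    for a in range(k) for b in range(k))
            row.append(max(0.0, s))
        out.append(row)
    return out


def encode_frame(frame, kernel):
    """Convolutional half: frame -> feature vector (mean of each feature-map row)."""
    return [sum(row) / len(row) for row in conv2d_relu(frame, kernel)]


def make_weights(feature_len, hidden=6, seed=0):
    """Random, untrained weights (an assumption; training is what makes captions sensible)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    return {
        "img": rand(hidden, feature_len),   # image features -> initial hidden state
        "hid": rand(hidden, hidden),        # hidden state -> next hidden state
        "emb": rand(hidden, len(VOCAB)),    # one column per input word
        "out": rand(len(VOCAB), hidden),    # hidden state -> score per word
    }


def generate_caption(features, weights, max_len=8):
    """Recurrent half: emit the highest-scoring word at each step (greedy decoding)."""
    h = [math.tanh(v) for v in matvec(weights["img"], features)]
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        recurrent = matvec(weights["hid"], h)
        h = [math.tanh(r + weights["emb"][i][word]) for i, r in enumerate(recurrent)]
        scores = matvec(weights["out"], h)
        word = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption


# A fake 6x6 grayscale "frame" and a vertical-edge kernel, both invented for the demo.
frame = [[(x * y) % 5 / 4.0 for x in range(6)] for y in range(6)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
features = encode_frame(frame, kernel)
caption = generate_caption(features, make_weights(len(features)))
print(" ".join(caption))
```

Swap in trained weights and real video frames and you have the basic idea; the gap between this toy and a system that captions a scene correctly is exactly the gap the project puts on display.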
It's a convoluted explanation, but the output is this:
In other words, it's the sort of thing that's not ready for prime time. But as Winiger pointed out in an email to Motherboard, the algorithms are already being used in drones and other forms of surveillance, for both military and private-sector purposes (like insurance claims).
"'True scene understanding' is a hard problem, and shows the capabilities that we humans have. We can easily pick up on ambiguity and understand meaning," Winiger says. "For machines to reach this level, we have a long way to go."
Winiger is quick to point out that he chose scenes like Yoda riding on Luke's back during training in The Empire Strikes Back as a sort of inverse of, say, a TED Talk (another target of Winiger's AI shenanigans), which shows only the best examples in its video demonstrations. Winiger deliberately chose some of the most challenging.
"This 'blackbox' trend is often observed and leads to wrong perceptions of A.I.," Winiger says. "When designing with intelligent systems, the human creativity is a crucial partner."
In other words, we only think the machines are taking over because it's in the best interest of AI researchers to present the most idealized version of their technology. There still needs to be the guiding hand of a human telling the machine what it's seeing. Next-generation video recognition may well correctly identify Kanye West, but it won't do so without being taught on some fundamental level first.