This Algorithm Taught Itself to Animate a Still Photo
This is the first time that a machine has been able to generate multi-frame video from a static image.
The team had two neural nets compete against each other, with one trying to fool the other into thinking the videos it generated were 'real'. Image: MIT CSAIL/YouTube.
A team of researchers at MIT's Computer Science and Artificial Intelligence Lab (CSAIL) has created a deep-learning algorithm that is able to generate its own videos and predict the future of a video based on a single frame.
As detailed in a paper to be presented next week at the Conference on Neural Information Processing Systems in Barcelona, the CSAIL team trained their algorithm by having it watch 2 million videos, which would last for over a year if played back to back. These videos consisted of banal moments in day-to-day life, chosen to better accustom the machine to normal human interactions. Importantly, these videos were found "in the wild," meaning they were unlabeled and thus didn't offer the algorithm any clues as to what was happening in the video.
Drawing from this video data set, the algorithm would attempt to generate videos from scratch that mimicked human motion based on what it had observed in the 2 million videos. It was then pitted against another deep-learning algorithm, which tried to discriminate between the videos that were machine-generated and those that were real. This method of training machines is called adversarial learning.
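To make the tug-of-war concrete, here is a minimal sketch of the two competing objectives in Python. The article does not give the team's actual loss functions, so this assumes the standard cross-entropy formulation commonly used in adversarial training; the function names and the example scores are hypothetical and purely for illustration. A "score" is the discriminator's estimated probability that a clip is real.

```python
import math

EPS = 1e-12  # keeps log() away from zero


def discriminator_loss(real_scores, fake_scores):
    """Low when real clips score near 1 and generated clips score near 0."""
    real_term = sum(-math.log(p + EPS) for p in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - p + EPS) for p in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Low when the discriminator mistakes generated clips for real ones."""
    return sum(-math.log(p + EPS) for p in fake_scores) / len(fake_scores)


# Hypothetical scores (assumed values, not data from the paper).
real = [0.90, 0.95]      # discriminator's scores on real clips
caught = [0.10, 0.20]    # generated clips the discriminator sees through
fooled = [0.90, 0.80]    # generated clips the discriminator falls for

print("generator loss when caught:", generator_loss(caught))
print("generator loss when fooling:", generator_loss(fooled))
print("discriminator loss when catching fakes:", discriminator_loss(real, caught))
print("discriminator loss when fooled:", discriminator_loss(real, fooled))
```

Each network improves by lowering its own loss, and anything that lowers one tends to raise the other, which is what pushes the generated videos toward looking real.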
"What we found in early prototypes of this model was the generator [network] would try to fool the other network by warping the background or having these unusual motions in the background," Carl Vondrick, a PhD candidate at CSAIL and lead author of the paper, told Motherboard. "What we needed to give the model was the notion that the world is mostly static."
To rectify this issue, Vondrick and his colleagues created a "two-stream architecture" which forces the generative network to render a static background while objects in the foreground move. This two-stream model generated much more realistic videos, albeit short ones with really low resolutions. The videos produced by the algorithm were 64 x 64 pixels and composed of 32 frames (standard movies shoot at 24 frames per second, which means these videos are just over one second long), depicting things like beaches, train stations, and the faces of newborn babies (these are particularly terrifying).
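The idea of a static background with a moving foreground can be sketched in a few lines of Python. The article says only that the background is held still while foreground objects move; blending the two streams with a per-pixel mask, as done here, is an assumption for illustration, and `composite`, `moving_dot_demo` and `clip_seconds` are hypothetical names rather than anything from the team's code. The frame count, resolution and frame rate are the figures quoted above.

```python
FRAMES = 32   # frames per generated clip, as reported
SIZE = 64     # clips are 64 x 64 pixels, as reported
FPS = 24      # standard film frame rate cited in the article


def composite(foreground, mask, background):
    """Blend a moving foreground over one static background image.

    foreground: [frame][row][col] pixel values
    mask:       [frame][row][col] weights in [0, 1]; 1 means "use foreground"
    background: [row][col] pixel values, reused unchanged for every frame
    """
    video = []
    for fg_frame, m_frame in zip(foreground, mask):
        frame = [
            [m * fg + (1.0 - m) * bg for fg, m, bg in zip(fg_row, m_row, bg_row)]
            for fg_row, m_row, bg_row in zip(fg_frame, m_frame, background)
        ]
        video.append(frame)
    return video


def moving_dot_demo(frames=FRAMES, size=SIZE):
    """Toy clip: a bright dot slides right across a flat grey background."""
    background = [[0.2] * size for _ in range(size)]
    foreground = [[[1.0] * size for _ in range(size)] for _ in range(frames)]
    mask = [[[0.0] * size for _ in range(size)] for _ in range(frames)]
    for t in range(frames):
        mask[t][size // 2][t % size] = 1.0  # dot moves one pixel per frame
    return composite(foreground, mask, background)


def clip_seconds(frames=FRAMES, fps=FPS):
    """Playback length of a clip at the given frame rate."""
    return frames / fps


video = moving_dot_demo()
print(len(video), "frames of", len(video[0]), "x", len(video[0][0]), "pixels")
print("clip length in seconds:", clip_seconds())
```

Because every frame reuses the same background image, the only way for anything to change between frames is through the foreground and its mask, which is one way to build in the notion that the world is mostly static. The duration helper is simply 32 frames divided by 24 frames per second, or about 1.33 seconds.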
While the ability to generate a second of video from scratch may not sound like much, this far surpasses previous work in the field, which was only able to generate a few frames of video and placed much stricter limits on the content. The main pitfall of the machine-generated videos is that the objects in motion, particularly people, were often rendered as "blobs," although the researchers still found it "promising that our model can generate plausible motion."
Indeed, this motion was so plausible that when the researchers showed a machine-generated video and a 'real' video to workers hired through Amazon's Mechanical Turk and asked them which they found to be more realistic, the workers chose the machine-generated videos about 20 percent of the time.
Beyond generating original videos, one of the more promising results of this work is the ability to apply it to videos and photos that already exist. When the researchers applied their deep-learning algorithm to a still frame, the algorithm was able to discriminate among objects in the photo and animate them for 32 frames, producing "fairly reasonable motions" for the objects. To Vondrick's knowledge, this is the first time that a machine has been able to generate multi-frame video from a static image.
This ability to anticipate the motion of an object or person is crucial to the future integration of machines in the real world, since it will allow machines to avoid actions that might hurt people and to help people avoid hurting themselves. According to Vondrick, it will also help the field of unsupervised machine learning, since this type of machine vision algorithm received all of its input data from unlabeled videos. If machines are really going to get good at recognizing and classifying objects, they will need to be able to do this without labeled data for every single object.
But for Vondrick, one of the most exciting possibilities contained in his research has little scientific or real-world value.
"I sort of fantasize about a machine creating a short movie or TV show," Vondrick said. "We're generating just one second of video, but as we start scaling up maybe it can generate a few minutes of video where it actually tells a coherent story. We're not near being able to do that, but I think we're taking a first step."