This Algorithm Makes Movies from Street View Stills

A deep learning tool from Google researchers synthesises new images to fill in the gaps between existing frames.
July 7, 2015, 11:45am

This looks like a movie, but it's made entirely of stills from Google Street View—and a little deep learning trickery.

Google researchers made the video by building an algorithm that can synthesise new scenes from existing images. Feed in two consecutive images, say a couple taken a step apart along a street, and it can fill in the blank between them to eliminate any clunky stop-motion effect.

They introduce their tool, which they call DeepStereo, in a paper published on arXiv.

The difficulty with rendering a new image like this is that the system doesn't know exactly what's supposed to be in the unseen view. That problem can be made worse by objects blocking each other in the existing frames, which can lead to a jarring effect or parts of the image looking mutated.

But as you can see in the video, the new solution gives a pretty smooth effect.

"When uncertainty cannot be avoided our method prefers to blur detail which generates much more visually pleasing results compared to tearing or repeating, especially when animated," the researchers write. The new images might lose some fine detail as a result, but the transition is graceful.

MIT Technology Review explains the process in a nutshell: the computer determines the depth and color of each pixel in the new scene from the depth and color of the corresponding pixels in the images before and after it.
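
To make that description concrete, here is a minimal Python sketch of the general idea, not the DeepStereo network itself. It assumes two one-dimensional grayscale scanlines taken on either side of the missing view, tries a handful of candidate depths (expressed as integer pixel shifts), and weights each candidate by how well the two inputs agree there. The function name `synthesise_middle_view`, the `disparities` list and the `temperature` parameter are all invented for illustration; the real system learns this mapping with a deep network rather than relying on a hand-written agreement score.

```python
import math


def synthesise_middle_view(left, right, disparities, temperature=0.05):
    """Blend two 1-D grayscale scanlines into a view halfway between them.

    Illustrative only: a hand-written stand-in for what a learned model does.
    Returns (colors, depths), one value per output pixel.
    """
    width = len(left)
    colors, depths = [], []
    for x in range(width):
        costs, candidates = [], []
        for d in disparities:
            # A point at shift d appears at x + d in the left view and
            # x - d in the right view; clamp so samples stay in range.
            l = left[min(max(x + d, 0), width - 1)]
            r = right[min(max(x - d, 0), width - 1)]
            costs.append(abs(l - r))          # disagreement at this depth
            candidates.append((l + r) / 2.0)  # color if this depth is right
        # Softmax over negative cost: depths where the views agree dominate.
        best = min(costs)
        weights = [math.exp(-(c - best) / temperature) for c in costs]
        total = sum(weights)
        weights = [w / total for w in weights]
        colors.append(sum(w * c for w, c in zip(weights, candidates)))
        depths.append(sum(w * d for w, d in zip(weights, disparities)))
    return colors, depths


if __name__ == "__main__":
    # Two views of the same brightness ramp, offset by one pixel each way.
    left = [0.1 * i for i in range(8)]
    right = [0.1 * (i + 2) for i in range(8)]
    colors, depths = synthesise_middle_view(left, right, [0, 1, 2])
    print(colors)
    print(depths)
```

In this sketch, when several candidate depths fit equally well, the weights spread out and the output pixel becomes an average of the candidates, which gives a soft blur rather than a hard tear.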

The researchers tested their model by getting it to make new images of a scene they actually had a Street View image of—but that the computer hadn't seen. "Overall, our model produces plausible outputs that are difficult to immediately distinguish from the original imagery," they write.

They conclude that their work shows it's possible to get a deep network to synthesise new views, but concede that their method currently has a couple of drawbacks: it's slow, and it needs a fixed number of input images.

But ultimately, they envisage applications in cinema, virtual reality, and teleconferencing.