Imagine if your favorite picture could automatically be converted into a short video and labeled. Sound like a fantasy? Maybe not for much longer.

Using a deep learning algorithm, MIT’s Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba recently generated one second of predictive video based on a single still frame.

Called Scene Dynamics, the software was trained on roughly two million unlabeled videos. When fed a new image, the system runs two competing neural networks. The first generates the predictive video, while the second tries to tell whether a given video is real or generated. Beyond predicting an impressive number of frames based on assumed motion, the algorithm also classifies the specific action occurring. While clearly not perfect, the results are already impressive.
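For readers curious what "two competing neural networks" means in practice, here is a minimal sketch of the opposing objectives in a standard adversarial setup. It is an illustration, not the researchers' code: the function names are hypothetical, the scores are made-up probabilities standing in for the second network's output, and the use of binary cross-entropy is an assumption about how such systems are commonly trained.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(real_scores, fake_scores):
    """The judging network wants real clips scored near 1, generated clips near 0."""
    losses = [bce(p, 1) for p in real_scores] + [bce(p, 0) for p in fake_scores]
    return sum(losses) / len(losses)

def generator_loss(fake_scores):
    """The generating network wants its clips to be scored as real (near 1)."""
    return sum(bce(p, 1) for p in fake_scores) / len(fake_scores)

# Hypothetical scores: probability that each clip is "real".
sharp_judge = discriminator_loss(real_scores=[0.9, 0.8], fake_scores=[0.1, 0.2])
confused_judge = discriminator_loss(real_scores=[0.5, 0.5], fake_scores=[0.5, 0.5])

print(f"judge loss when it separates real from fake: {sharp_judge:.3f}")
print(f"judge loss when it cannot tell them apart:   {confused_judge:.3f}")
print(f"generator loss when it fools the judge:      {generator_loss([0.9]):.3f}")
print(f"generator loss when it is caught:            {generator_loss([0.1]):.3f}")
```

The two losses pull in opposite directions: whatever lowers the generator's loss (fake clips scored as real) raises the judge's, which is what makes the networks "compete."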

Notably, the software learned from unlabeled videos. Deep learning programs are usually fed masses of meticulously labeled data (images, for example). This takes a lot of time and effort and limits learning to tailored experiences. The researchers hope their work will advance less laborious “unsupervised learning,” reducing the need for special data sets and allowing machines to learn from messier information.

This isn’t the only project aimed at predictive video, either.

Visual Dynamics is a similar project (also out of MIT) that generates predicted future frames from a source frame. The difference? Visual Dynamics predicts short snippets of what may theoretically happen next, while Scene Dynamics creates entirely new, longer sequences of video that didn’t exist before. Scene Dynamics can also separate background from subjects and generate new content for each.
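One way to picture separate background and subject streams is a per-pixel mask that decides, frame by frame, which stream shows through. The sketch below is only an illustration under that assumption: `compose_video` is a hypothetical helper, the "frames" are tiny grayscale grids, and the article does not confirm that the software blends its streams exactly this way.

```python
def compose_video(foreground, mask, background):
    """Blend each frame as mask * foreground + (1 - mask) * background.

    foreground, mask: lists of frames, each frame a list of pixel rows.
    background: a single static frame reused for every time step.
    """
    video = []
    for fg_frame, mask_frame in zip(foreground, mask):
        frame = [
            [m * f + (1.0 - m) * b for f, m, b in zip(fg_row, m_row, bg_row)]
            for fg_row, m_row, bg_row in zip(fg_frame, mask_frame, background)
        ]
        video.append(frame)
    return video

# Toy example: a bright "subject" pixel moves one step right over a dark,
# unchanging background (one row of three pixels, two frames).
background = [[0.2, 0.2, 0.2]]
foreground = [[[0.9, 0.9, 0.9]], [[0.9, 0.9, 0.9]]]
mask = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]]

video = compose_video(foreground, mask, background)
for t, frame in enumerate(video):
    print(f"frame {t}: {frame}")
```

Because the background is a single frame reused at every step, only the masked subject pixels change over time, which is the intuition behind treating a static scene and moving subjects separately.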

Predictive video from stills has a variety of immediate applications, most notably creating video “out of thin air.” And there might even be room for more creative endeavors down the road.

“I sort of fantasize about a machine creating a short movie or TV show,” lead author Carl Vondrick told Motherboard. “We’re generating just one second of video, but as we start scaling up maybe it can generate a few minutes of video where it actually tells a coherent story. We’re not near being able to do that, but I think we’re taking a first step.”

Beyond video creation, similar motion prediction capabilities might be integrated into computer vision systems, allowing robots to better guess how people and objects in front of them will move. Such powers might help them avoid damaging themselves or hurting others around them.

More speculatively, if software like this can predict motion, what else might it be trained to predict?

One possible use in the future could be predicting what blurry or distorted pixels in videos should look like if sharpened. Low-resolution, compressed, or artifact-laden video would then be automatically upgraded to high resolution.

The researchers also see use cases in security and self-driving technology. But the dark side of multimedia manipulation is clear too. We may eventually see it power propaganda or generate falsified evidence (assuming fakeness can’t be easily detected).

Thankfully, we still have quite a way to go before this becomes a realistic concern. But for better or worse, as media manipulation becomes more flexible and widespread, video as a medium will shift into something more fluid than static. Ultimately, how such technology is used will depend on the motivation of each user.

The code is already available on GitHub for anyone who wants to start playing around today, and the original video data set is available on the Scene Dynamics website.


Image Credit: MIT

Andrew operates as a media producer and archivist. Generating backups of critical cultural data, he has worked across various industries — entertainment, art, and technology — telling emerging stories via recording and distribution.
