On Thursday, Facebook’s parent company Meta announced Make-A-Video, a tool that generates short video clips from text descriptions—an unsettling, albeit inevitable, next step for the world of AI image generation.
The tool follows Make-A-Scene, which the company launched in July to generate still images from text descriptions. While comparable tools like DALL-E and Midjourney have taken over the internet, Make-A-Video is the first text-to-video tool that will soon be available to the public.
“Generative AI research is pushing creative expression forward by giving people tools to quickly and easily create new content,” Meta’s press release said. “With just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors, characters, and landscapes. The system can also create videos from images or take existing videos and create new ones that are similar.”
“It's much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they'll change over time,” Meta CEO Mark Zuckerberg wrote in a Facebook post. “Make-A-Video solves this by adding a layer of unsupervised learning that enables the system to understand motion in the physical world and apply it to traditional text-to-image generation.”
The example videos on the Make-A-Video site include “a dog wearing a Superhero outfit with red cape flying through the sky” and “a teddy bear painting a portrait.” The clips are clearly AI-generated, with the blurry, painterly quality typical of AI-generated images. Yet they nonetheless show the fast-moving progress of AI art systems, which only a few years ago were the stuff of memes and science fiction.
Meta seems to be aware of the dangers behind AI art-generating systems, and claims it is “openly sharing this generative AI research and results with the community for their feedback, and will continue to use our responsible AI framework to refine and evolve our approach to this emerging technology.”
But according to the Make-A-Video research paper, the image models were trained using a subset of the LAION dataset, which is known for scraping unfiltered web data that produces biased results. Motherboard recently reported that within this dataset were images of ISIS executions, nonconsensual nudes, and photoshopped nudes of celebrities. Meta seems to address this issue by paring the original dataset of over 5.8 billion images down to 2.3 billion, with the paper’s authors claiming, “We filter out sample pairs with NSFW images, toxic words in the text, images with a watermark probability larger than 0.5.”
Meanwhile, AI ethics researchers have pushed back against the use of these large language models, warning that their sheer size creates fundamental problems of harmful bias that cannot be easily solved. Even Facebook’s own researchers have admitted that their language models have a “high propensity” for producing racist and harmful results.
The introduction of text-to-video as a tool for artists and creators also complicates the ongoing debate over whether AI-generated art should be considered legitimate. In August, a man named Jason Allen won an art competition using an AI-generated image, sparking intense backlash online, with artists accusing Allen of expediting the death of creative jobs.
Meta’s announcement follows OpenAI’s release of DALL-E 2 to the public on Wednesday. OpenAI recently removed the system’s waitlist, allowing anyone to generate images from text prompts. But even as the public gains access to more and more AI art-generating tools, some of the most fundamental ethical questions about their use remain unanswered.