READ MORE: Introducing Make-A-Video: An AI system that generates videos from text (Meta)
The inevitable has happened, albeit a little sooner than expected. After all the hoopla surrounding text-to-image AI generators in recent months, Meta is first out of the gate with a text-to-video version.
Perhaps Meta simply wanted to establish some headline leadership in this space, because the results certainly aren’t ready for primetime.
But as developments in text-to-image generation have shown, by the time you read this the technology will already have advanced.
Meta is only giving the public a glimpse of the tech it calls Make-A-Video. It’s still in the research stage, with no hint of a commercial release.
“Generative AI research is pushing creative expression forward by giving people tools to quickly and easily create new content,” Meta stated in a blog post announcing the new AI tool. “With just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors and landscapes.”
In a Facebook post, Meta CEO Mark Zuckerberg described the work as “amazing progress,” adding, “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.”
Examples on Make-A-Video’s announcement page include “a young couple walking in heavy rain” and “a teddy bear painting a portrait.” It also showcases Make-A-Video’s ability to take a static source image and animate it. For example, a still photo of a sea turtle, once processed through the AI model, can appear to be swimming.
The key technology behind Make-A-Video, and the reason it has arrived sooner than some experts anticipated, is that it builds on existing text-to-image synthesis work of the kind used by image generators like OpenAI’s DALL-E. Meta announced its own text-to-image AI model in July.
According to Benj Edwards at Ars Technica, instead of training the Make-A-Video model on labeled video data (for example, captioned descriptions of the actions depicted), Meta took image synthesis data (still images paired with captions) and applied unlabeled video training data so the model learns a sense of where a text or image prompt might exist in time and space. It can then predict what comes after the image and display the scene in motion for a short period.
READ MORE: Meta announces Make-A-Video, which generates video from text (Ars Technica)
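To make the mechanism Edwards describes a little more concrete, here is a minimal, hypothetical sketch in PyTorch. It is not Meta’s actual code (the details are in the white paper); it only illustrates the split the reporting describes: a spatial layer of the kind learned from image-caption pairs is applied to each frame, while a separate temporal layer, trained on unlabeled video, links frames across time.

```python
# A minimal, hypothetical sketch (not Meta's actual code) of the idea described
# above: a per-frame "spatial" layer of the kind learned from image-caption
# data, plus a new "temporal" layer trained on unlabeled video to link frames.
import torch
import torch.nn as nn

class PseudoSpatioTemporalBlock(nn.Module):
    """Toy block: a 2D (per-frame) convolution followed by a 1D convolution
    along the time axis. Names and structure are illustrative assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Start the temporal conv as an identity, so the block initially behaves
        # like the image-only model before any video training takes place.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, time, height, width)
        b, c, t, h, w = video.shape
        # 1) Apply the image-trained spatial conv to every frame independently.
        frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        frames = self.spatial(frames)
        # 2) Apply the video-trained temporal conv along the time axis.
        seq = frames.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1)  # (b, h, w, c, t)
        seq = self.temporal(seq.reshape(b * h * w, c, t))
        return seq.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)    # (b, c, t, h, w)

# Example: a 2-frame, 8x8 "video" with 16 channels keeps its shape.
block = PseudoSpatioTemporalBlock(channels=16)
print(block(torch.randn(1, 16, 2, 8, 8)).shape)  # torch.Size([1, 16, 2, 8, 8])
```

In the real system the spatial and temporal components are layers inside a diffusion model; the toy block above only shows why image-caption training and unlabeled video training can be kept separate.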
In Meta’s white paper, “Make-A-Video: Text-To-Video Generation Without Text-Video Data,” the researchers note that Make-A-Video is trained on pairs of images and captions as well as on unlabeled video footage. Training content was sourced from two datasets which, together, contain millions of videos spanning hundreds of thousands of hours of footage. This includes stock video footage created by sites like Shutterstock and scraped from the web.
The Verge’s James Vincent shares other examples, but notes that they were all provided by Meta. “That means the clips could have been cherry-picked to show the system in its best light,” he says. “The videos are clearly artificial, with blurred subjects and distorted animation, but still represent a significant development in the field of AI content generation.”
The clips are no longer than five seconds (16 frames of video) and are generated at a resolution of 64 by 64 pixels, then upscaled to 768 by 768 using a separate AI model. They contain no audio but span a huge range of prompts.
READ MORE: Meta’s new text-to-video AI generator is like DALL-E for video (The Verge)
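For a sense of what that resolution jump means in practice, the stand-in below uses plain bicubic interpolation in place of Meta’s unpublished super-resolution network, purely to illustrate the 64-by-64 to 768-by-768 step on a 16-frame clip.

```python
# Illustrative only: the article says 64x64 frames are upscaled to 768x768 by a
# separate AI model. That model isn't public, so this stand-in uses bicubic
# interpolation just to show the size of the resolution jump.
import torch
import torch.nn.functional as F

frames = torch.rand(16, 3, 64, 64)   # 16 RGB frames at 64x64 (a ~5-second clip)
upscaled = F.interpolate(frames, size=(768, 768), mode="bicubic",
                         align_corners=False)
print(upscaled.shape)                # torch.Size([16, 3, 768, 768])
```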
The researchers note that the model has many technical limitations beyond blurry footage and disjointed animation. For example, their training methods are unable to learn information that could only be inferred by a human watching a video, such as whether a waving hand is moving left to right or right to left. The model also struggles to generate videos longer than five seconds, videos with multiple scenes and events, and output at higher resolutions.
The researchers are also aware that they are walking into a minefield of controversy. Make-A-Video has “learnt and likely exaggerated social biases, including harmful ones,” while all video content generated by the model contains a watermark to “help ensure viewers know the video was generated with AI and is not a captured video.”
READ MORE: Make-A-Video: Text-To-Video Generation Without Text-Video Data (Meta)
Cracking the code to create photorealistic video on demand — and then drive it with a narrative — is exercising other minds too.
Chinese researchers are behind another text-to-video model named CogVideo, OpenAI is also thought to be working on one, and no doubt there are numerous other initiatives in the works.
EXPLORING ARTIFICIAL INTELLIGENCE:
With nearly half of all media and media tech companies incorporating Artificial Intelligence into their operations or product lines, AI and machine learning tools are rapidly transforming content creation, delivery and consumption. Find out what you need to know with these essential insights curated from the NAB Amplify archives:
- This Will Be Your 2032: Quantum Sensors, AI With Feeling, and Life Beyond Glass
- Learn How Data, AI and Automation Will Shape Your Future
- Where Are We With AI and ML in M&E?
- How Creativity and Data Are a Match Made in Hollywood/Heaven
- How to Process the Difference Between AI and Machine Learning