
WHY THIS MATTERS IN BRIEF

Imagine being able to create videos by just typing what you want them to be. That’s what Meta’s latest tech does, and it’s only going to get better in the future.

 


As I first predicted back in 2017, have repeated continuously ever since, and even wrote a book about, Meta today unveiled an Artificial Intelligence (AI) system that generates short videos from text prompts alone. Just a few years ago these Text-to-Video (T2V) AIs were fairly awful — interesting, but awful — and you can now see just how much they’ve progressed in a relatively short space of time.

 


 

Make-A-Video, as it’s known, lets you type in a string of words, like “A dog wearing a superhero outfit with a red cape flying through the sky,” and then generates a five-second clip that, while pretty accurate, has the aesthetics of a trippy old home video. And, just like those interestingly awful early videos from the likes of Stanford University, fast forward a few years and these newer synthetic videos will be longer and higher quality.

 

 

The Future of Synthetic Content, by Futurist Matthew Griffin

 

Although the effect is still rather crude, the system offers an early glimpse of what’s coming next for generative artificial intelligence, with synthetic video generation being the obvious next step from the static images created by the Text-to-Image AI systems that I’ve talked about over the past few years from companies such as Google, OpenAI, and Nvidia.

Meta’s announcement of Make-A-Video, which is not yet available to the public, will likely prompt other AI labs to release their own versions. It also raises some big ethical questions.

 


 

In the last month alone, AI lab OpenAI has made its latest Text-to-Image AI system DALL-E available to everyone, and AI startup Stability.ai launched Stable Diffusion, yet another open-source Text-to-Image system.

 

 

But Text-to-Video AI comes with even greater challenges. For one, these models need a vast amount of computing power. They are an even bigger computational lift than the largest Text-to-Image AI models, which use millions of images to train, because putting together just one short video requires the AI not only to generate hundreds of images but also to keep them coherent from frame to frame.

That means it’s really only large tech companies that can afford to build these systems for the foreseeable future. They’re also trickier to train, because there aren’t currently any large-scale data sets of high-quality videos paired with text metadata.

 


 

To work around this, Meta combined data from three open-source image and video data sets to train its model. Standard text-image data sets of labelled still images helped the AI learn what objects are called and what they look like. And a database of videos helped it learn how those objects are supposed to move in the world. The combination of the two approaches helped Make-A-Video, which is described in a non-peer-reviewed paper published today, generate videos from text at scale.
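Meta’s actual model is a large diffusion-based system, but the high-level recipe described above — learn what things look like from labelled still images, learn how things move from unlabelled video — can be sketched in toy form. Everything below is hypothetical illustration, not Meta’s code: the data, function names, and the one-number “features” stand in for real image embeddings and neural networks.

```python
# Toy sketch of the two-source training recipe: labelled stills teach
# appearance, unlabelled video clips teach motion. Features are single
# floats here purely for illustration.

def learn_appearance(image_text_pairs):
    """Associate each caption word with the visual features it appeared with."""
    appearance = {}
    for caption, image_features in image_text_pairs:
        for word in caption.lower().split():
            appearance.setdefault(word, []).append(image_features)
    return appearance

def learn_motion(videos):
    """Estimate an average frame-to-frame change from unlabelled clips."""
    deltas = []
    for clip in videos:                      # clip = list of per-frame values
        for prev, nxt in zip(clip, clip[1:]):
            deltas.append(nxt - prev)
    return sum(deltas) / len(deltas)

def generate(prompt, appearance, motion, n_frames=5):
    """'Render' a clip: start from the prompt's appearance, extrapolate motion."""
    known = [feats for word in prompt.lower().split()
             for feats in appearance.get(word, [])]
    start = sum(known) / len(known) if known else 0.0
    return [start + motion * t for t in range(n_frames)]

appearance = learn_appearance([("a red dog", 1.0), ("a dog flying", 3.0)])
motion = learn_motion([[0.0, 1.0, 2.0], [1.0, 3.0, 5.0]])
video = generate("dog flying", appearance, motion)   # five "frames"
```

The point of the sketch is the division of labour: no caption ever describes a video, yet the generator still produces moving output, because motion was learned separately from the text-to-appearance mapping.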

Tanmay Gupta, a computer vision research scientist at the Allen Institute for Artificial Intelligence, says Meta’s results are promising. The videos it’s shared show that the model can capture 3D shapes as the camera rotates. The model also has some notion of depth and understanding of lighting. Gupta says some details and movements, as we saw recently with Nvidia’s AI, are decently done and convincing.

However, “there’s plenty of room for the research community to improve on, especially if these systems are to be used for video editing and professional content creation,” he adds. “In particular, it’s still tough to model complex interactions between objects and characters, for example.”

 


 

In the video generated by the text prompt “An artist’s brush painting on a canvas,” the brush moves over the canvas, but strokes on the canvas aren’t realistic.

“I would love to see these models succeed at generating a sequence of interactions, such as ‘The man picks up a book from the shelf, puts on his glasses, and sits down to read it while drinking a cup of coffee,’” Gupta says.

For its part, Meta promises that the technology could “open new opportunities for creators and artists” and democratise content creation. But as the technology develops, there are fears it could be harnessed as a powerful tool to create and disseminate misinformation and deepfakes.

It will also make it even more difficult to differentiate between real and fake content online. Meta’s model ups the stakes for generative AI both technically and creatively but also “in terms of the unique harms that could be caused through generated video as opposed to still images,” says Henry Ajder, an expert on synthetic media.

 


 

“At least today, creating factually inaccurate content that people might believe in requires some effort,” Gupta says. “In the future, it may be possible to create misleading content with a few keystrokes.”

And I myself say “will” not “may.”

The researchers who built Make-A-Video filtered out offensive images and words, but with data sets consisting of millions of words and images, it is almost impossible to fully remove biased and harmful content.

A spokesperson for Meta says it is not making the model available to the public yet, and that “as part of this research, we will continue to explore ways to further refine and mitigate potential risks.”

Source: Facebook

About author

Matthew Griffin

Matthew Griffin, described as “The Adviser behind the Advisers” and a “Young Kurzweil,” is the founder and CEO of the World Futures Forum and the 311 Institute, a global Futures and Deep Futures consultancy working between the dates of 2020 to 2070, and is an award winning futurist, and author of “Codex of the Future” series. Regularly featured in the global media, including AP, BBC, Bloomberg, CNBC, Discovery, RT, Viacom, and WIRED, Matthew’s ability to identify, track, and explain the impacts of hundreds of revolutionary emerging technologies on global culture, industry and society, is unparalleled. Recognised for the past six years as one of the world’s foremost futurists, innovation and strategy experts Matthew is an international speaker who helps governments, investors, multi-nationals and regulators around the world envision, build and lead an inclusive, sustainable future. A rare talent Matthew’s recent work includes mentoring Lunar XPrize teams, re-envisioning global education and training with the G20, and helping the world’s largest organisations envision and ideate the future of their products and services, industries, and countries. Matthew's clients include three Prime Ministers and several governments, including the G7, Accenture, Aon, Bain & Co, BCG, Credit Suisse, Dell EMC, Dentons, Deloitte, E&Y, GEMS, Huawei, JPMorgan Chase, KPMG, Lego, McKinsey, PWC, Qualcomm, SAP, Samsung, Sopra Steria, T-Mobile, and many more.
