Google responds to Meta's video-generating AI with its own model, dubbed Imagen Video

Not to be outdone by Meta's Make-A-Video, Google today detailed its work on Imagen Video, an AI system capable of generating video clips from a text prompt (for example, "a teddy bear washes the dishes"). While the results aren't perfect - the looping clips the system generates tend to have artifacts and noise - Google says Imagen Video is a step towards a system with a "high degree of controllability" and knowledge of the world, including the ability to generate sequences in a range of art styles.

As my colleague Devin Coldewey noted in his article on Make-A-Video, video synthesis systems are nothing new. Earlier this year, a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence released CogVideo, which can translate text into reasonably high-fidelity short clips. But Imagen Video appears to be a significant leap over the previous state of the art, showing an ability to animate captions that existing systems would have trouble handling.

"It's definitely an improvement," Matthew Guzdial, an assistant professor at the University of Alberta who studies AI and machine learning, told TechCrunch via email. "As you can see in the sample videos, even though the communications team selects the best releases, there's still some weird fuzziness and artifice. So this definitely won't be used directly in animation or television. any time soon. But that, or something like it, could definitely be integrated into tools to help speed up some things."

Google Imagen Video

Image credits: Google

Imagen Video builds on Google's Imagen, an image generation system comparable to OpenAI's DALL-E 2 and Stability AI's Stable Diffusion. Imagen is what's called a "diffusion" model, generating new data (e.g. videos) by learning to "destroy" and "recover" many existing data samples. As it trains on existing samples, the model gets better at recovering the data it had previously destroyed, which is what lets it create new works.
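The "destroy" half of that loop can be sketched in a few lines. This is a generic toy illustration of a diffusion forward (noising) step under a standard variance schedule, not Google's actual implementation:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """'Destroy' a sample: mix x0 with Gaussian noise per the schedule.

    After t steps of the variance schedule `betas`, the closed form is
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    """
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, noise

# Training then teaches a network to "recover" the sample by predicting
# the noise that was added; generation runs that recovery in reverse,
# starting from pure noise to produce a brand-new sample.
```

Early in the schedule the sample is barely perturbed; by the final step it is essentially indistinguishable from noise, which is what makes the reverse (recovery) direction generative.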

As the Google research team behind Imagen Video explained in a paper, the system takes a text description and generates a 16-frame video at three frames per second and a resolution of 24×48 pixels. The system then upscales and "predicts" additional frames, producing a final 128-frame video at 24 frames per second and 720p (1280×768).
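Taking the figures quoted above at face value, the cascade's scale-up factors work out as follows (a back-of-the-envelope sketch based on the article's numbers, not on Google's paper):

```python
# Base sample from the text-conditioned model (figures from the article).
base_frames, base_fps = 16, 3
base_w, base_h = 48, 24

# Final output after the spatial/temporal super-resolution stages.
final_frames, final_fps = 128, 24
final_w, final_h = 1280, 768

temporal_factor = final_frames / base_frames            # 8x more frames
fps_factor = final_fps / base_fps                       # 8x smoother playback
pixel_factor = (final_w * final_h) / (base_w * base_h)  # ~853x more pixels per frame

# Notably, the clip length stays the same (~5.3 s): the temporal
# stages fill in intermediate frames rather than extending the video.
duration_base = base_frames / base_fps
duration_final = final_frames / final_fps
```

In other words, the cascade does all of its work on density (frames per second and pixels per frame), not on duration.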

Google says Imagen Video was trained on 14 million video-text pairs and 60 million image-text pairs, as well as...
