Google has had a dedicated video generation model for some time now. Veo has been powering text-to-video features in the Gemini app and Flow for the past year. But at Google I/O 2026, the company introduced something that it says is a step beyond that. Gemini Omni is a new multimodal model that doesn’t just take a text prompt and return a clip. It combines text, images, audio and video to produce a single, coherent output.
Instead of describing a scene from scratch, you can feed Gemini Omni an existing video clip, a voice memo, a sketch, or any mix of those and build from there. You can refine the results through conversation, swapping characters, backgrounds or objects just by asking. Google’s claymation protein folding demo at I/O showed what this looks like in practice: a stop-motion style lecturer generated from a prompt, complete with voice narration.
The first version, Gemini Omni Flash, is rolling out now and produces clips up to 10 seconds long. Google says this is a product choice rather than a model limitation, and that longer videos are coming. An invisible SynthID watermark is baked into each clip, and it can be verified through the Gemini app or Google Search.
The feature Google is holding back
Here’s one thing Gemini Omni won’t do yet. Audio and speech editing, meaning changing what someone says in an existing video, is absent from this launch. Google said explicitly that it is still working on how to bring it to users responsibly. This is a notable omission, considering how close the demo came to that territory. You can create a personalized digital avatar to appear in videos, but this requires first recording yourself speaking a series of numbers as an anti-deepfake step.
Gemini Omni Flash is now live in the Gemini app and Google Flow for Plus, Pro and Ultra subscribers. YouTube Shorts users will get it for free this weekend. A developer API is due in the coming weeks, and a higher-end Omni Pro model was teased, with details to follow. Speech editing will arrive eventually, but Google is not ready to ship it yet.
