Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Use Stable Diffusion to generate videos without any finetuning
✅ Generates videos with zero training, using only a T2I base model (Stable Diffusion).
Motivation: How to use Stable Diffusion for video generation without finetuning?
- Start from noise with a similar pattern across frames
- Make the intermediate features of different frames similar
Step 1
- Start from noise with a similar pattern: given the first frame’s noise, define a global scene motion and use it to translate the first frame’s noise, producing similar initial noise for the other frames
✅ Edit the content at the noise level: define the first frame’s noise, then give the later frames’ noise a motion trend derived from it.
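Step 1 can be sketched as follows. This is a minimal NumPy illustration, not the paper’s exact warping: a simple per-frame translation (`np.roll`) stands in for the global scene motion, and the function name and `(dx, dy)` parameters are assumptions for this sketch.

```python
import numpy as np

def warped_initial_noise(num_frames, height, width, channels=4,
                         dx=2, dy=1, seed=0):
    """Sketch of Step 1: all frames start from the SAME noise pattern.
    Sample the first frame's latent noise once, then reuse it for each
    later frame translated by a global motion of k * (dx, dy).
    (np.roll is a stand-in for the translation used in the paper.)"""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((channels, height, width))
    noises = [base]
    for k in range(1, num_frames):
        # translate the first frame's noise by k steps of the global motion
        shifted = np.roll(base, shift=(k * dy, k * dx), axis=(1, 2))
        noises.append(shifted)
    return np.stack(noises)  # (num_frames, channels, height, width)
```

Because every frame’s initial noise is just a shifted copy of the first frame’s, the denoised frames share coarse structure, which is what makes the video temporally coherent without any finetuning.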
Step 2
- Make the intermediate features of different frames similar: in self-attention, every frame always uses the K and V computed from the first frame

✅ Keeps the intermediate features of all frames as similar as possible.
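A minimal sketch of this cross-frame attention, assuming per-frame Q/K/V tensors of shape `(num_frames, tokens, dim)` (the shapes and function name are illustrative, not the actual Stable Diffusion layer API):

```python
import numpy as np

def cross_frame_attention(q_frames, k_frames, v_frames):
    """Sketch of Step 2: each frame's queries attend to the Keys and
    Values of the FIRST frame only, so intermediate features of all
    frames stay anchored to frame 0."""
    k0, v0 = k_frames[0], v_frames[0]      # reuse the first frame's K and V
    d = q_frames.shape[-1]
    outputs = []
    for q in q_frames:
        scores = q @ k0.T / np.sqrt(d)               # (tokens, tokens)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v0)                  # values come from frame 0
    return np.stack(outputs)               # (num_frames, tokens, dim)
```

Note that for frame 0 this reduces to ordinary self-attention; later frames differ only through their own queries, which keeps content consistent while still allowing motion.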
Step 3
- Optional background smoothing: regenerate the background and average it with the first frame’s background
✅ Mask out the background and smooth it.
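The optional smoothing step can be sketched as a masked blend in latent space. The function name, mask convention (1 = foreground), and the `alpha` weight are assumptions for this illustration; the paper obtains the foreground mask from a salient-object detector.

```python
import numpy as np

def smooth_background(latents, fg_masks, alpha=0.5):
    """Sketch of Step 3: blend each frame's background latent toward
    the first frame's latent, leaving the foreground (mask == 1)
    untouched. `alpha` is an assumed blending weight."""
    first = latents[0]
    out = []
    for lat, m in zip(latents, fg_masks):
        # average the background with the first frame's background
        bg_blend = alpha * lat + (1 - alpha) * first
        out.append(m * lat + (1 - m) * bg_blend)
    return np.stack(out)
```

This suppresses background flicker between frames while the foreground object remains free to move.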