Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Use Stable Diffusion to generate videos without any finetuning

✅ No training at all: videos are generated directly with a T2I base model (Stable Diffusion).

Motivation: How to use Stable Diffusion for video generation without finetuning?

  • Start from noise with a similar pattern
  • Make the intermediate features of different frames similar

Step 1

  • Start from noise with a similar pattern: given the first frame’s noise, define a global scene motion and use it to translate the first frame’s noise, producing similar initial noise for the other frames (see the sketch below)

✅ The content is manipulated in noise space: the first frame’s noise is fixed, and a motion trend applied to it defines the noise of the following frames.
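Below is a minimal sketch of Step 1, assuming a Stable-Diffusion-style latent of shape (C, H, W) and a constant per-frame translation (dx, dy) as the global scene motion. The wrap-around shift via `torch.roll` and the helper name `translate_noise` are illustrative simplifications, not the paper’s exact warping procedure.

```python
import torch

def translate_noise(first_frame_noise: torch.Tensor, num_frames: int,
                    dx: int = 2, dy: int = 0) -> torch.Tensor:
    """Shift the first frame's latent noise by a growing global offset so that
    every frame starts from initial noise with the same underlying pattern."""
    _, h, w = first_frame_noise.shape
    frames = [first_frame_noise]
    for k in range(1, num_frames):
        # Global scene motion: frame k reuses frame 0's noise shifted by k*(dy, dx).
        shifted = torch.roll(first_frame_noise,
                             shifts=((k * dy) % h, (k * dx) % w),
                             dims=(1, 2))
        frames.append(shifted)
    return torch.stack(frames)  # (num_frames, C, H, W)

# Usage: 8 initial latents that share one noise pattern (4x64x64 SD latent).
eps_1 = torch.randn(4, 64, 64)
initial_latents = translate_noise(eps_1, num_frames=8)
```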

Step 2

  • Make the intermediate features of different frames similar: replace self-attention with cross-frame attention, so every frame always uses the K and V computed from the first frame (see the sketch below)

✅ Ensures the intermediate frames stay as similar as possible.
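A minimal sketch of this cross-frame attention, assuming frame latents batched along dim 0 and `to_q`/`to_k`/`to_v` linear projections as in a Stable Diffusion attention block; the single-head formulation and the function name are illustrative, not the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(hidden_states: torch.Tensor,
                          to_q: torch.nn.Linear,
                          to_k: torch.nn.Linear,
                          to_v: torch.nn.Linear) -> torch.Tensor:
    """Attention where every frame queries the keys/values of the first frame,
    tying the appearance of all frames to frame 0."""
    num_frames = hidden_states.shape[0]
    q = to_q(hidden_states)                               # (F, tokens, dim)
    # K and V come from frame 0 only and are broadcast to every frame.
    k_first = to_k(hidden_states[:1]).expand(num_frames, -1, -1)
    v_first = to_v(hidden_states[:1]).expand(num_frames, -1, -1)
    return F.scaled_dot_product_attention(q, k_first, v_first)
```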

Step 3

  • Optional background smoothing: regenerate the background and average it with the first frame’s background (see the sketch below)

✅ The background is segmented out and smoothed.
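A minimal sketch of the background smoothing, assuming decoded frames in (F, C, H, W), a binary foreground mask per frame (e.g. from a salient-object detector), and a blend weight `alpha`; the mask source, the blend formula, and doing this on pixels rather than (warped) latents are simplifying assumptions.

```python
import torch

def smooth_background(frames: torch.Tensor, fg_masks: torch.Tensor,
                      alpha: float = 0.6) -> torch.Tensor:
    """Average each frame's background with the first frame's background;
    the generated foreground (mask == 1) is kept untouched."""
    first_bg = frames[:1]                          # frame 0 as background reference
    blended_bg = alpha * first_bg + (1.0 - alpha) * frames
    return fg_masks * frames + (1.0 - fg_masks) * blended_bg

# Usage: frames of shape (8, 3, 512, 512), fg_masks of shape (8, 1, 512, 512) in {0, 1}.
```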