Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Use Stable Diffusion to generate videos without any finetuning
✅ Generates videos with zero training, using only a T2I base model (Stable Diffusion).
Motivation: How to use Stable Diffusion for video generation without finetuning?
- Start from noise with a similar pattern across frames
- Make the intermediate features of different frames similar
Step 1
- Start from noise with a similar pattern: given the first frame’s noise, define a global scene motion and use it to translate the first frame’s noise, producing similar initial noise for the other frames
✅ Edit the content at the noise level: define the first frame’s noise, then give the later frames’ noise a motion trend derived from it.
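Step 1 can be sketched as follows. This is a minimal NumPy illustration, not the paper’s exact warping: a simple per-frame translation (`np.roll`) stands in for the global scene motion, and the function name and `(dx, dy)` parameters are assumptions for this sketch.

```python
import numpy as np

def warped_initial_noise(num_frames, height, width, channels=4,
                         dx=2, dy=1, seed=0):
    """Sketch of Step 1: all frames start from the SAME noise pattern.
    Sample the first frame's latent noise once, then reuse it for each
    later frame translated by a global motion of k * (dx, dy).
    (np.roll is a stand-in for the translation used in the paper.)"""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((channels, height, width))
    noises = [base]
    for k in range(1, num_frames):
        # translate the first frame's noise by k steps of the global motion
        shifted = np.roll(base, shift=(k * dy, k * dx), axis=(1, 2))
        noises.append(shifted)
    return np.stack(noises)  # (num_frames, channels, height, width)
```

Because every frame’s initial noise is just a shifted copy of the first frame’s, the denoised frames share coarse structure, which is what makes the video temporally coherent without any finetuning.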
Step 2
- Make the intermediate features of different frames similar: in self-attention, every frame always uses the K and V computed from the first frame

✅ Keeps the intermediate features of all frames as similar as possible.
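A minimal sketch of this cross-frame attention, assuming per-frame Q/K/V tensors of shape `(num_frames, tokens, dim)` (the shapes and function name are illustrative, not the actual Stable Diffusion layer API):

```python
import numpy as np

def cross_frame_attention(q_frames, k_frames, v_frames):
    """Sketch of Step 2: each frame's queries attend to the Keys and
    Values of the FIRST frame only, so intermediate features of all
    frames stay anchored to frame 0."""
    k0, v0 = k_frames[0], v_frames[0]      # reuse the first frame's K and V
    d = q_frames.shape[-1]
    outputs = []
    for q in q_frames:
        scores = q @ k0.T / np.sqrt(d)               # (tokens, tokens)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v0)                  # values come from frame 0
    return np.stack(outputs)               # (num_frames, tokens, dim)
```

Note that for frame 0 this reduces to ordinary self-attention; later frames differ only through their own queries, which keeps content consistent while still allowing motion.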
Step 3
- Optional background smoothing: regenerate the background and average it with the first frame’s background
✅ Mask out the background and smooth it.
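The optional smoothing step can be sketched as a masked blend in latent space. The function name, mask convention (1 = foreground), and the `alpha` weight are assumptions for this illustration; the paper obtains the foreground mask from a salient-object detector.

```python
import numpy as np

def smooth_background(latents, fg_masks, alpha=0.5):
    """Sketch of Step 3: blend each frame's background latent toward
    the first frame's latent, leaving the foreground (mask == 1)
    untouched. `alpha` is an assumed blending weight."""
    first = latents[0]
    out = []
    for lat, m in zip(latents, fg_masks):
        # average the background with the first frame's background
        bg_blend = alpha * lat + (1 - alpha) * first
        out.append(m * lat + (1 - m) * bg_blend)
    return np.stack(out)
```

This suppresses background flicker between frames while the foreground object remains free to move.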