Structure and Content-Guided Video Synthesis with Diffusion Models

  • Inflate Stable Diffusion to a 3D model, finetune on pretrained weights
  • Insert temporal convolution/attention layers
  • Finetune to take per-frame depth as conditions

✅ 特点:(1) 不需要训练。 (2) 能保持前后一致性。

P60

P61

  • Condition on structure (depth) and content (CLIP) information.
  • Depth maps are passed with latents as input conditions.
  • CLIP image embeddings are provided via cross-attention blocks.
  • During inference, CLIP text embeddings are converted to CLIP image embeddings.

✅ 用 depth estimator 从源视频提取 struct 信息,用 CLIP 从文本中提取 content 信息。
✅ depth 和 content 分别用两种形式注入。depth 作为条件,与 lantent concat 到一起。content 以 cross attention 的形式注入。