Structure and Content-Guided Video Synthesis with Diffusion Models

Inflate Stable Diffusion into a 3D (video) model, initialized from the pretrained weights and then finetuned
Insert temporal convolution/attention layers
Finetune to take per-frame depth as conditions
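
To make the "inflate, then insert temporal layers" idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation: `spatial_layer`, `temporal_conv`, and `inflated_block` are hypothetical names, the spatial layer is reduced to a 1x1 convolution, and the residual wiring with a zero-initialized temporal kernel is an assumption chosen so that the inflated block starts out identical to the per-frame image model.

```python
import numpy as np

def spatial_layer(frames, weight):
    """Stand-in for a pretrained 2D layer: a 1x1 conv applied to each frame.
    frames: (B*T, C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,nchw->nohw", weight, frames)

def temporal_conv(x, kernel):
    """1D convolution along the time axis, per pixel and per channel.
    x: (B, T, C, H, W), kernel: (K,) with K odd; zero padding keeps T unchanged."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0), (0, 0), (0, 0)))
    t = x.shape[1]
    return sum(kernel[i] * xp[:, i:i + t] for i in range(k))

def inflated_block(video, weight, kernel):
    """Apply the pretrained spatial layer frame by frame, then add a temporal
    residual branch (assumed wiring, for illustration only)."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)            # fold time into the batch
    y = spatial_layer(frames, weight).reshape(b, t, -1, h, w)
    return y + temporal_conv(y, kernel)               # new temporal layer

rng = np.random.default_rng(0)
video = rng.normal(size=(2, 5, 4, 3, 3))              # (B, T, C, H, W)
weight = rng.normal(size=(6, 4))
out = inflated_block(video, weight, np.zeros(3))      # zero kernel: image model behavior
print(out.shape)
```

With an all-zero temporal kernel the block reproduces the per-frame output exactly, so finetuning can start from the pretrained image weights and gradually learn to mix information across frames.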

✅ Features: (1) No training is needed at edit time (no per-video finetuning). (2) Consistency across frames is preserved.

Condition on structure (depth) and content (CLIP) information.
Depth maps are passed with latents as input conditions.
CLIP image embeddings are provided via cross-attention blocks.
During inference, CLIP text embeddings are converted to CLIP image embeddings.

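✅ A depth estimator extracts the structure information from the source video, and CLIP extracts the content information from the text.
✅ Depth and content are injected in two different ways: depth is a condition concatenated with the latents, while content is injected through cross-attention.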
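
The two injection paths can be sketched in NumPy as follows. This is an illustrative sketch only: `concat_structure` and `cross_attention` are hypothetical names, the depth maps are assumed to be already resized to the latent resolution with a single channel, and the attention is single-head with random projection matrices in place of learned ones.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)           # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concat_structure(latents, depth):
    """Structure path: concatenate per-frame depth with the latents on the channel axis.
    latents: (T, C, H, W), depth: (T, 1, H, W) (assumed latent-resolution depth)."""
    return np.concatenate([latents, depth], axis=1)

def cross_attention(features, content, w_q, w_k, w_v):
    """Content path: feature tokens attend to the CLIP embedding tokens.
    features: (N, D), content: (M, E); w_q: (D, A), w_k: (E, A), w_v: (E, D)."""
    q = features @ w_q                                # queries from the U-Net features
    k = content @ w_k                                 # keys from the CLIP embedding
    v = content @ w_v                                 # values from the CLIP embedding
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M) attention weights
    return features + attn @ v                        # residual update

rng = np.random.default_rng(0)
z = concat_structure(rng.normal(size=(8, 4, 16, 16)), rng.normal(size=(8, 1, 16, 16)))
print(z.shape)                                        # one extra channel for depth

feats = rng.normal(size=(10, 6))                      # 10 feature tokens of one frame
clip_emb = rng.normal(size=(1, 12))                   # a single CLIP embedding token
out = cross_attention(feats, clip_emb, rng.normal(size=(6, 7)),
                      rng.normal(size=(12, 7)), rng.normal(size=(12, 6)))
print(out.shape)
```

Concatenation keeps the depth signal spatially aligned with each latent frame, which suits structure, whereas cross-attention lets every location read from a global embedding, which suits content.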
