AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

T2I -> T2V

Transform domain-specific T2I models to T2V models

Domain-specific (personalized) models are widely available for image
- Domain-specific finetuning methodologies: LoRA, DreamBooth…
- Communities: Hugging Face, CivitAI…
Task: turn these image models into T2V models, without specific finetuning

✅ (1) 用同一个 patten 生成 noise，得到的 image 可能更有一致性。
✅ (2) 中间帧的特征保持一致。

P99

Methodology

Train a motion modeling module (some temporal layers) together with frozen base T2I model
Plug it into a domain-specific T2I model during inference

✅ 优势：可以即插即用到各种用户定制化的模型中。
✅ 在 noise 上对内容进行编辑，即定义第一帧的 noise，以及后面帧的 noise 运动趋势。

P100

Training

Train on WebVid-10M, resized at 256x256 (experiments show can generalize to higher res.)

✅ 在低分辨率数据上训练，但结果可以泛化到高分辨率。

✅ 保证中间帧尽量相似。

P101

✅ 扣出背景并 smooth.