AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

T2I -> T2V

Transform domain-specific T2I models to T2V models

  • Domain-specific (personalized) models are widely available for images
    • Domain-specific finetuning methodologies: LoRA, DreamBooth…
    • Communities: Hugging Face, CivitAI…
  • Task: turn these image models into T2V models, without specific finetuning

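✅ (1) Generating the noise for all frames from the same pattern may make the resulting images more consistent.
✅ (2) Keep the features of the intermediate frames consistent.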
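
Note (1) does not say how the shared noise pattern is built. A minimal sketch of one common option, assuming a mix of one clip-wide noise tensor and an independent per-frame tensor; the function name `correlated_frame_noise` and the mixing weight `alpha` are hypothetical and not taken from the slides:

```python
import numpy as np

def correlated_frame_noise(num_frames, shape, alpha=0.5, seed=0):
    """Return Gaussian noise of shape (num_frames, *shape) with a shared part.

    alpha is a hypothetical mixing weight in [0, 1]:
    0 gives independent noise per frame, 1 gives identical noise for all frames.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(shape)                      # one pattern for the whole clip
    independent = rng.standard_normal((num_frames, *shape))  # per-frame residual
    # Square-root weights keep every element at unit variance.
    return np.sqrt(alpha) * shared[None] + np.sqrt(1.0 - alpha) * independent

# Example: 16 frames of 4x32x32 latent noise, half of the variance shared.
noise = correlated_frame_noise(num_frames=16, shape=(4, 32, 32), alpha=0.5)
```

With this construction the correlation between any two frames is roughly `alpha`, so a larger value pushes the frames toward the same starting point while each frame still looks like standard Gaussian noise to the denoiser.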

P99

Methodology

  • Train a motion modeling module (some temporal layers) together with a frozen base T2I model
  • Plug it into a domain-specific T2I model during inference

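✅ Advantage: the module is plug-and-play and can be inserted into all kinds of user-customized models.
✅ Edit the content at the noise level, i.e., define the noise of the first frame and the motion trend of the noise in the following frames.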
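
The two methodology bullets can be illustrated with a toy NumPy sketch. It assumes the temporal layer is a single-head self-attention over the frame axis with a residual connection and a zero-initialized output projection, and it omits positional encodings. All names (`init_motion_module`, `motion_module`, `frozen_spatial_layer`) and the random weights are hypothetical stand-ins, not the actual AnimateDiff code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_motion_module(channels, seed=0):
    """Trainable temporal-layer weights; none of them depends on height or width."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(channels)
    return {
        "w_q": rng.standard_normal((channels, channels)) * scale,
        "w_k": rng.standard_normal((channels, channels)) * scale,
        "w_v": rng.standard_normal((channels, channels)) * scale,
        # Assumption: zero init, so the module starts as an identity mapping.
        "w_out": np.zeros((channels, channels)),
    }

def motion_module(x, params):
    """Self-attention across frames. x: (batch, frames, channels, height, width)."""
    b, f, c, h, w = x.shape
    # Fold space into the batch so that attention only mixes the frame axis.
    tokens = x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
    q, k, v = (tokens @ params[name] for name in ("w_q", "w_k", "w_v"))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))  # (b*h*w, f, f)
    out = (attn @ v) @ params["w_out"]
    out = out.reshape(b, h, w, f, c).transpose(0, 3, 4, 1, 2)
    return x + out  # residual connection

def frozen_spatial_layer(x, weight):
    """Stand-in for a frozen T2I layer: every frame is processed as a separate image."""
    b, f, c, h, w = x.shape
    frames = x.reshape(b * f, c, h, w)
    frames = np.einsum("oc,nchw->nohw", weight, frames)  # 1x1 convolution
    return frames.reshape(b, f, -1, h, w)

channels = 8
rng = np.random.default_rng(1)
params = init_motion_module(channels)                    # the only trainable part
base_weight = rng.standard_normal((channels, channels))  # frozen base T2I weights
# A personalized variant of the same layer (e.g. after LoRA or DreamBooth tuning).
personalized_weight = base_weight + 0.1 * rng.standard_normal((channels, channels))

x = rng.standard_normal((1, 16, channels, 4, 4))
# Inference: the same motion-module weights are reused with the personalized layer.
y = motion_module(frozen_spatial_layer(x, personalized_weight), params)
```

During training only `params` would receive gradients while `base_weight` stays fixed. Because the attention runs along the frame axis independently at every spatial location, the same `params` accept feature maps of any height and width.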

P100

Training

  • Train on WebVid-10M, resized to 256x256 (experiments show the result can generalize to higher resolutions)

✅ Trained on low-resolution data, but the results generalize to higher resolutions.

✅ Keep the intermediate frames as similar as possible.

P101

✅ Cut out (matte) the background and smooth it.