Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
What is the core problem?
Purpose
Existing methods
Method in this paper
Results
What are the core contributions?
What is the overall approach?
Stage I: Image Pretraining
- Initialize weights from Stable Diffusion 2.1 (text-to-image model)
Stage II: Curating a Video Pretraining Dataset
- Systematic Data Curation
- Curate subsets filtered by various criteria (CLIP, OCR, optical-flow, and aesthetic scores, …)
- Assess human preferences on models trained on different subsets
- Choose optimal filtering thresholds via Elo rankings of human preference votes (see the sketch below)
- A well-curated pretraining dataset beats an un-curated one
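The threshold-selection step can be illustrated with a small Elo computation. The sketch below is not the authors' code: it derives Elo ratings from pairwise human preference votes between models pretrained on differently filtered subsets, and the vote format, K-factor, and subset names are assumptions.

```python
from collections import defaultdict

def elo_rankings(votes, k=32.0, initial=1000.0):
    """Elo ratings from an iterable of (winner, loser) human preference votes."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        r_w, r_l = ratings[winner], ratings[loser]
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10.0 ** ((r_l - r_w) / 400.0))
        ratings[winner] = r_w + k * (1.0 - expected_w)
        ratings[loser] = r_l - k * (1.0 - expected_w)
    return dict(ratings)

# Hypothetical votes comparing models pretrained on subsets filtered at
# different aesthetic-score thresholds.
votes = [
    ("aesthetic>5.0", "unfiltered"),
    ("aesthetic>5.0", "aesthetic>4.5"),
    ("aesthetic>4.5", "unfiltered"),
]
print(sorted(elo_rankings(votes).items(), key=lambda kv: -kv[1]))
```

The filtering threshold whose model ends up ranked highest is the one kept for assembling the pretraining dataset.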
Stage III: High-Quality Finetuning
- Finetune base model (pretrained from Stages I-II) on high-quality video data
- High-Resolution Text-to-Video Generation
- ~1M samples. Finetune for 50K iterations at 576x1024 (in contrast to 320x576 base resolution)
- High Resolution Image-to-Video Generation
- Frame Interpolation
- Multi-View Generation
- High-Resolution Text-to-Video Generation
- Performance gains from curation persist after finetuning
Training
Dataset
Scaling latent video diffusion models to large datasets
Data Processing and Annotation
- Cut Detection and Clipping
- Detect cuts/transitions at multiple FPS levels
- Extract clips precisely using keyframe timestamps (see the sketch after this list)
- Synthetic Captioning
- Use the CoCa image captioner to caption the mid-frame of each clip
- Use V-BLIP to obtain a video-based caption
- Use an LLM to summarise the image- and video-based captions
- Compute CLIP similarities and aesthetic scores (see the sketch after this list)
- Static Scene Filtering
- Use dense optical flow magnitudes to filter static scenes
- Text Detection
- Use OCR to detect and remove clips with excess text
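To make the cut-detection and filtering steps referenced above concrete, here are two minimal sketches. First, shot-boundary detection and clipping with PySceneDetect and its ffmpeg splitter; the detector choice and threshold are assumptions, and unlike the paper's cascade over multiple FPS levels this runs only at the native frame rate.

```python
# Minimal sketch: shot-boundary detection and clipping (assumed tooling).
from scenedetect import detect, ContentDetector
from scenedetect.video_splitter import split_video_ffmpeg

video_path = "raw_video.mp4"  # hypothetical input file

# Detect cuts/transitions; ContentDetector and its threshold are illustrative choices.
scenes = detect(video_path, ContentDetector(threshold=27.0))

# Cut the source video into one clip per detected scene at the reported timecodes.
split_video_ffmpeg(video_path, scenes)
```

Second, a sketch of the per-clip filtering signals: mean dense optical-flow magnitude for the static-scene filter, OCR text coverage, and CLIP similarity between the synthetic caption and the mid-frame. Library choices (OpenCV, pytesseract, open_clip) and all thresholds are assumptions rather than the paper's setup; the aesthetic score (typically a small predictor on top of CLIP image embeddings) is omitted for brevity.

```python
# Minimal sketch of per-clip filtering signals (assumed libraries and thresholds).
import cv2
import numpy as np
import pytesseract
import torch
import open_clip
from PIL import Image


def mean_flow_magnitude(frames_gray):
    """Average Farneback optical-flow magnitude over consecutive frame pairs."""
    mags = []
    for prev, nxt in zip(frames_gray[:-1], frames_gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))


def text_coverage(frame_rgb):
    """Fraction of the frame area covered by OCR-detected words."""
    h, w = frame_rgb.shape[:2]
    data = pytesseract.image_to_data(frame_rgb, output_type=pytesseract.Output.DICT)
    boxes = zip(data["text"], data["width"], data["height"])
    return sum(bw * bh for txt, bw, bh in boxes if txt.strip()) / float(h * w)


def clip_similarity(mid_frame_rgb, caption):
    """Cosine similarity between CLIP embeddings of the mid-frame and its caption."""
    # Model is loaded per call for brevity; cache it in practice.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = preprocess(Image.fromarray(mid_frame_rgb)).unsqueeze(0)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(tokenizer([caption]))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())


def keep_clip(frames_gray, mid_frame_rgb, caption):
    """Hypothetical thresholds: keep clips that actually move, contain little
    burned-in text, and whose synthetic caption matches the mid-frame."""
    return (mean_flow_magnitude(frames_gray) > 2.0
            and text_coverage(mid_frame_rgb) < 0.05
            and clip_similarity(mid_frame_rgb, caption) > 0.2)
```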
Blattmann et al., “Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets,” 2023.
✅ SVD: constructing the dataset
✅ (1) Splitting videos into short clips makes the captions more accurate
✅ (2) Use existing models to generate video captions
Loss
Training strategy
Experiments and conclusions
What works
Limitations
Takeaways
✅ Finetuning on a small amount of high-quality data yields a large quality improvement.