Show-1

Better text-video alignment? Generation in both pixel- and latent-domain

✅ Problem with the Stable Diffusion model: when the text prompt becomes complex, the text and the generated content are poorly aligned.
✅ Show-1 improves on this alignment.

P76

Motivation

Pixel vs. latent: alignment

Pixel-based VDM achieves better text-video alignment than latent-based VDM

✅ Experimental finding: pixel space is better at alignment than latent space.
✅ Reason: in latent space, the text has weaker control over the individual pixels.

P77

Pixel vs. latent: memory

Pixel-based VDM achieves better text-video alignment than latent-based VDM
Pixel-based VDM takes much larger memory than latent-based VDM
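
To make the memory gap concrete, here is a rough back-of-the-envelope sketch that counts the elements in the tensor being denoised. The clip size (16 frames at 256x256) is an illustrative assumption, and the 8x spatial downsampling with 4 latent channels is assumed from the Stable Diffusion VAE. These are not figures from the slides. Real memory use also depends on the network's activations, so this only indicates the scale of the difference.

```python
def video_tensor_elements(frames: int, channels: int, height: int, width: int) -> int:
    """Number of scalar values in a (frames, channels, height, width) video tensor."""
    return frames * channels * height * width


# Illustrative clip size (assumption, not taken from the slides).
FRAMES, HEIGHT, WIDTH = 16, 256, 256

# Assumed latent layout: 8x spatial downsampling, 4 channels (Stable Diffusion VAE).
LATENT_DOWNSAMPLE = 8
LATENT_CHANNELS = 4

# Pixel-based VDM denoises the RGB video directly.
pixel_elems = video_tensor_elements(FRAMES, 3, HEIGHT, WIDTH)

# Latent-based VDM denoises a spatially compressed representation.
latent_elems = video_tensor_elements(
    FRAMES,
    LATENT_CHANNELS,
    HEIGHT // LATENT_DOWNSAMPLE,
    WIDTH // LATENT_DOWNSAMPLE,
)

ratio = pixel_elems / latent_elems

print(f"pixel tensor:  {pixel_elems:,} elements")
print(f"latent tensor: {latent_elems:,} elements")
print(f"pixel / latent ratio: {ratio:.0f}x")
```

Under these assumptions the pixel-space tensor is 48 times larger than the latent one at the same output resolution.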

P78

Proposed method

Use Pixel-based VDM in low-res stage
Use latent-based VDM in high-res stage
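
The split above can be sketched as a two-stage plan that tracks the shape of the tensor each stage denoises. This is only a hypothetical illustration: the `Stage` class, the `working_shape` helper, the stage names, and the 64x64 and 512x512 resolutions are assumptions made for the example, and the latent layout (8x downsampling, 4 channels) is assumed from the Stable Diffusion VAE. It is not the Show-1 implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of a hypothetical hybrid pipeline (illustrative only)."""
    name: str
    space: str   # "pixel" or "latent"
    height: int  # output height of this stage, in pixels
    width: int   # output width of this stage, in pixels


def working_shape(stage: Stage, frames: int = 16,
                  latent_channels: int = 4, latent_downsample: int = 8):
    """Shape (frames, channels, height, width) of the tensor the stage denoises."""
    if stage.space == "pixel":
        return (frames, 3, stage.height, stage.width)
    if stage.space == "latent":
        return (frames, latent_channels,
                stage.height // latent_downsample,
                stage.width // latent_downsample)
    raise ValueError(f"unknown space: {stage.space!r}")


# Assumed resolutions: pixel space only where the video is small,
# latent space where the video is large.
PIPELINE = [
    Stage("low-res generation", "pixel", 64, 64),
    Stage("high-res upsampling", "latent", 512, 512),
]

shapes = {stage.name: working_shape(stage) for stage in PIPELINE}

for stage in PIPELINE:
    print(f"{stage.name:20s} [{stage.space:6s}] works on {shapes[stage.name]}")
```

With these assumed numbers, both stages operate on a 64x64 spatial grid: the pixel stage handles text-video alignment while the video is still small, and the latent stage reaches high resolution without ever denoising a full-resolution pixel tensor.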

P79

Result

https://github.com/showlab/Show-1

Better text-video alignment
Can synthesize large motion
Memory-efficient