Show-1

Better text-video alignment? Generation in both pixel- and latent-domain

✅ Problem with Stable Diffusion models: when the text prompt becomes complex, the generated content aligns poorly with the text.
✅ Show-1 improves on this alignment.

P76

Motivation

pixel VS latent: alignment

  • Pixel-based VDM achieves better text-video alignment than latent-based VDM

✅ Experimental finding: pixel space is better at alignment than latent space.
✅ Reason: in latent space, the text has weaker control over the pixels.

P77

pixel VS latent: memory

  • Pixel-based VDM achieves better text-video alignment than latent-based VDM
  • Pixel-based VDM takes much larger memory than latent-based VDM
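
To get a feel for the memory gap, the number of values the denoiser has to process per video can be compared for the two domains. This is only a rough sketch: the clip size (16 frames at 256×256) is an illustrative choice, and the latent shape assumes a Stable-Diffusion-style autoencoder with 8× spatial downsampling and 4 latent channels, which is not stated on the slide. Real memory use also depends on the network's activations, not just the input tensor.

```python
def pixel_elements(frames, height, width, channels=3):
    """Number of values in a pixel-space video tensor (RGB by default)."""
    return frames * channels * height * width


def latent_elements(frames, height, width, downsample=8, latent_channels=4):
    """Number of values in the latent video tensor.

    Assumption: a Stable-Diffusion-style autoencoder that shrinks each
    spatial side by `downsample` and outputs `latent_channels` channels.
    """
    return frames * latent_channels * (height // downsample) * (width // downsample)


# Illustrative clip size (hypothetical, not taken from the slides).
frames, height, width = 16, 256, 256

pix = pixel_elements(frames, height, width)
lat = latent_elements(frames, height, width)
ratio = pix / lat

print(f"pixel-space values : {pix}")
print(f"latent-space values: {lat}")
print(f"ratio              : {ratio:.0f}x")
```

Under these assumptions the pixel-space tensor is 48 times larger than the latent one, and the gap in absolute terms grows with resolution, which is why running a pixel-based model at high resolution is expensive.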

P78

Proposed method

  • Use Pixel-based VDM in low-res stage
  • Use latent-based VDM in high-res stage
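
The two bullets above can be sketched as a simple stage plan: low-resolution stages are assigned to a pixel-based model, high-resolution stages to a latent-based one. This is a minimal illustration of the idea, not the Show-1 implementation. The `Stage` class, the `plan_stages` function, the `pixel_max_side` threshold, and the resolutions are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    height: int
    width: int
    domain: str  # "pixel" or "latent"


def plan_stages(resolutions, pixel_max_side=128):
    """Assign a diffusion domain to each stage of a coarse-to-fine cascade.

    Stages whose longer side is at most `pixel_max_side` run in pixel
    space (better text-video alignment, affordable at low resolution).
    Larger stages run in latent space (lower memory at high resolution).
    The threshold is a hypothetical knob, not a value from the paper.
    """
    stages = []
    for i, (height, width) in enumerate(resolutions):
        domain = "pixel" if max(height, width) <= pixel_max_side else "latent"
        stages.append(Stage(f"stage_{i}", height, width, domain))
    return stages


# Illustrative cascade: resolutions are made up for the example.
plan = plan_stages([(64, 64), (128, 128), (512, 512)])
for stage in plan:
    print(f"{stage.name}: {stage.height}x{stage.width} -> {stage.domain}-based VDM")
```

The design intent is that the text-conditioned content and motion are fixed in the cheap low-resolution pixel stages, so the latent stage only needs to add detail while upscaling.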

P79

Result

https://github.com/showlab/Show-1

  • Better text-video alignment
  • Can synthesize large motion
  • Memory-efficient