Learning Transferable Visual Models From Natural Language Supervision
CLIP: Encoders bridge vision and language
- CLIP text-/image-embeddings are commonly used in diffusion models for conditional generation
❓ How is the text condition fed into the denoiser?
✅ CLIP embedding (202, OpenAI): CLIP is trained on paired image-text data; use CLIP to convert the text into a feature.
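
The notes do not say how the text feature enters the denoiser once CLIP has produced it. One common option is cross-attention, where the denoiser's image tokens attend to the per-token text features. The sketch below is a toy illustration under that assumption: random arrays stand in for the CLIP text encoder output, and every name, shape, and weight matrix (`cross_attention`, `w_q`, `w_k`, `w_v`, the dimensions) is hypothetical rather than taken from any specific model.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Image tokens (queries) attend to text tokens (keys and values)."""
    q = image_tokens @ w_q  # (n_img, d_attn)
    k = text_tokens @ w_k   # (n_txt, d_attn)
    v = text_tokens @ w_v   # (n_txt, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_img, n_txt)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not values from the notes).
n_img, n_txt = 16, 77
d_img, d_txt, d_attn = 32, 64, 32

# Stand-in for the per-token output of a CLIP text encoder.
text_tokens = rng.normal(size=(n_txt, d_txt))
# Stand-in for the denoiser's intermediate feature map, flattened to tokens.
image_tokens = rng.normal(size=(n_img, d_img))

# Hypothetical projection weights; in a real denoiser these are learned.
w_q = rng.normal(size=(d_img, d_attn)) * 0.1
w_k = rng.normal(size=(d_txt, d_attn)) * 0.1
w_v = rng.normal(size=(d_txt, d_attn)) * 0.1

attended, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)

# Residual connection: the text-conditioned update is added to the image tokens.
# This works here because d_attn was chosen equal to d_img.
conditioned = image_tokens + attended
print(conditioned.shape, weights.shape)
```

In this sketch the text only influences the denoiser through `attended`, so changing the prompt changes the keys and values while the image tokens still supply the queries.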