PROGRESSIVE DISTILLATION FOR FAST SAMPLING OF DIFFUSION MODELS

https://readpaper.com/pdf-annotate/note?pdfId=4667185955594059777

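What are the core contributions?

Speeding up the DDPM sampling (generation) process.

It proposes new parameterizations of the diffusion model that are more stable when sampling with only a few steps.
It proposes a knowledge distillation method that turns a sampler needing many iterations into one needing far fewer.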

What is the general approach?

Distill a deterministic ODE sampler to the same model architecture.
At each stage, a “student” model is learned to distill two adjacent sampling steps of the “teacher” model to one sampling step.
At the next stage, the “student” model from the previous stage serves as the new “teacher” model.

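✅ Assume we have a solver that predicts $x_{t-1}$ from $x_t$.
✅ Calling the solver twice takes $x_t$ to $x_{t-2}$; learning this mapping directly yields a solver that covers 2 steps in one call.
✅ The former solver is called the teacher, and the latter the student.
✅ The student then becomes the new teacher and is used to train a new student.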
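
The teacher/student loop above can be sketched with a toy example. This is only an illustration under strong assumptions: `teacher_step` is a hypothetical scalar update standing in for a real sampler step, and `distill` builds an idealized student by composing the teacher twice, whereas the actual method trains a network to approximate that two-step mapping.

```python
def teacher_step(x):
    # Hypothetical toy "solver": one deterministic update x_t -> x_{t-1}.
    return 0.9 * x + 0.1


def distill(step_fn):
    # Idealized student: reproduces two teacher steps in a single call.
    # A real student would be a network trained to match this mapping.
    def student_step(x):
        return step_fn(step_fn(x))
    return student_step


def run(step_fn, x, num_steps):
    # Apply the solver num_steps times, starting from x.
    for _ in range(num_steps):
        x = step_fn(x)
    return x


num_steps = 8
step_fn = teacher_step
reference = run(teacher_step, 1.0, num_steps)  # full 8-step teacher result

history = []
while num_steps > 1:
    step_fn = distill(step_fn)  # the student becomes the new teacher
    num_steps //= 2             # each stage halves the number of steps
    history.append((num_steps, run(step_fn, 1.0, num_steps)))
```

In this idealized setting every entry in `history` reproduces `reference` with half as many calls as the stage before it. With a trained student the match is only approximate, which is why sample quality can degrade as the step count shrinks.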

On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps.

Limitations

Limitation	Possible improvement
In the current work we limited ourselves to setups where the student model has the same architecture and number of parameters as the teacher model:	in future work we hope to relax this constraint and explore settings where the student model is smaller, potentially enabling further gains in test time computational requirements.
	In addition, we hope to move past the generation of images and also explore progressive distillation of diffusion models for different data modalities such as e.g. audio (Chen et al., 2021).
	In addition to the proposed distillation procedure, some of our progress was realized through different parameterizations of the diffusion model and its training loss. We expect to see more progress in this direction as the community further explores this model class.

Validation

Takeaways

The resulting target value $\tilde x(z_t)$ is fully determined given the teacher model and starting point $z_t$, which allows the student model to make a sharp prediction when evaluated at $z_t$. In contrast, the original data point $x$ is not fully determined given $z_t$, since multiple different data points $x$ can produce the same noisy data $z_t$: this means that the original denoising model is predicting a weighted average of possible $x$ values, which produces a blurry prediction.
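
A minimal sketch of how such a deterministic target could be computed, based on my reading of the paper's distillation algorithm: run two DDIM steps of the teacher, then solve for the $x$-prediction that would reach the same point in a single DDIM step. The cosine schedule and the scalar `toy_teacher` (the posterior mean for standard-normal data) are assumptions for illustration, not the paper's trained model.

```python
import math


def alpha_sigma(t):
    # Assumed cosine schedule: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2).
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(x_hat_fn, z, t, s):
    # One deterministic DDIM update from time t to an earlier time s.
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    x_hat = x_hat_fn(z, t)
    eps_hat = (z - a_t * x_hat) / s_t  # noise implied by the x-prediction
    return a_s * x_hat + s_s * eps_hat


def distillation_target(teacher, z_t, t, n_student_steps):
    # Two teacher steps of half the student's step size.
    t_mid = t - 0.5 / n_student_steps
    t_end = t - 1.0 / n_student_steps
    z_mid = ddim_step(teacher, z_t, t, t_mid)
    z_end = ddim_step(teacher, z_mid, t_mid, t_end)
    # Solve for the x-prediction that reaches z_end in one DDIM step.
    a_t, s_t = alpha_sigma(t)
    a_e, s_e = alpha_sigma(t_end)
    ratio = s_e / s_t
    x_tilde = (z_end - ratio * z_t) / (a_e - ratio * a_t)
    return x_tilde, z_end


def toy_teacher(z, t):
    # Hypothetical denoiser: E[x | z_t] when x ~ N(0, 1) under this schedule.
    a_t, _ = alpha_sigma(t)
    return a_t * z


z_t, t, n = 1.3, 0.75, 4
x_tilde, z_end = distillation_target(toy_teacher, z_t, t, n)
# A student that outputs x_tilde lands on z_end in a single step.
one_step = ddim_step(lambda z, tt: x_tilde, z_t, t, t - 1.0 / n)
```

Nothing here is sampled: given the teacher and $z_t$, the target `x_tilde` is a plain function value, which is the sense in which it is "fully determined".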
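An L2 loss on the noise can be viewed as a weighted L2 reconstruction loss on $x$; see Equation 9 of the paper for the derivation. During distillation, however, predicting the noise is not suitable, and the model should predict (reconstruct) $x$ instead.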
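
For reference, the equivalence can be reconstructed as follows, assuming the usual forward process $z_t = \alpha_t x + \sigma_t \epsilon$ and the implied noise prediction $\hat\epsilon_\theta(z_t) = (z_t - \alpha_t \hat x_\theta(z_t)) / \sigma_t$ (this is my own restatement, so check the notation against the paper):

$$
\begin{aligned}
\|\epsilon - \hat\epsilon_\theta(z_t)\|_2^2
&= \left\| \tfrac{1}{\sigma_t}(z_t - \alpha_t x) - \tfrac{1}{\sigma_t}\bigl(z_t - \alpha_t \hat x_\theta(z_t)\bigr) \right\|_2^2 \\
&= \frac{\alpha_t^2}{\sigma_t^2} \, \|x - \hat x_\theta(z_t)\|_2^2 .
\end{aligned}
$$

The weight $\alpha_t^2 / \sigma_t^2$ is the signal-to-noise ratio. It goes to zero as $\alpha_t \to 0$, and in the other direction $\hat x = (z_t - \sigma_t \hat\epsilon) / \alpha_t$ amplifies small errors in $\hat\epsilon$ at low SNR. As far as I can tell, this is why noise prediction is a poor fit for distillation, where the model is evaluated at increasingly noisy inputs.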
In practice, the choice of loss weighting also has to take into account how $\alpha_t$, $\sigma_t$ are sampled during training, as this sampling distribution strongly determines the weight the expected loss gives to each signal-to-noise ratio.
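
To make the interaction between weighting and time sampling concrete, here is a small sketch that tabulates three weights on the $x$-space reconstruction loss at a few time points. The cosine schedule and the chosen time points are assumptions, and the names `snr`, `truncated_snr` and `snr_plus_one` are my labels for the weightings I believe the paper discusses (plain SNR, $\max(\mathrm{SNR}, 1)$, and $\mathrm{SNR} + 1$).

```python
import math


def snr(t):
    # Assumed cosine schedule: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2).
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return alpha ** 2 / sigma ** 2


def x_loss_weight(t, kind):
    # Weight applied to ||x - x_hat||^2 at time t.
    s = snr(t)
    if kind == "snr":            # same as an L2 loss on the noise
        return s
    if kind == "truncated_snr":  # never drops below 1 at high noise
        return max(s, 1.0)
    if kind == "snr_plus_one":   # stays above 1 for every t
        return s + 1.0
    raise ValueError(f"unknown weighting: {kind}")


kinds = ("snr", "truncated_snr", "snr_plus_one")
ts = [0.1, 0.5, 0.9]  # illustrative time points from low to high noise
table = {kind: [x_loss_weight(t, kind) for t in ts] for kind in kinds}
```

If $t$ is drawn uniformly, the plain SNR weight nearly ignores the high-noise end (`t = 0.9`), while the other two keep a weight of at least 1 there. Sampling $t$ from a different distribution shifts this balance just as much as changing the weight function does.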

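Open questions

Many details are still unclear to me, for example: the relationship between predicting $x$ and predicting the noise; how the loss weight is defined; the parameterizations of the denoising diffusion model; and DDIM.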
https://caterpillarstudygroup.github.io/ImportantArticles/diffusion-tutorial-part/diffusiontutorialpart1.html
