- Revisiting the Hidden Secrets in Diffusion Models
- Is Rectified Flow theoretically better than Diffusion Model? How to "cook" a good diffusion model in practice
- Discussions on Some Interesting Topics
It's a blog for beginners in generative modeling. It introduces the DDPM paper in detail, covering the discrete decoder and its connection to the VAE.
In this blog, we analyze why the diffusion model performs worse than rectified flow in practice. Then, we discuss the key aspects to consider when cooking a diffusion model.
1. Why is Pre-norm more frequently used than Post-norm in transformers? Links: 1, 2, 3
Conclusion: Pre-norm keeps LayerNorm off the residual (identity) path, so it does not disturb gradient flow through the skip connections, and it scales more stably with model size and learning rate.
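A minimal PyTorch sketch of the structural difference (attention is omitted and all names are illustrative, not taken from the linked posts):

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy residual block showing where LayerNorm sits in pre-norm vs. post-norm."""
    def __init__(self, dim: int, pre_norm: bool = True):
        super().__init__()
        self.pre_norm = pre_norm
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_norm:
            # Pre-norm: the identity path x is left untouched, so gradients reach
            # early layers through the skip connections regardless of depth.
            return x + self.mlp(self.norm(x))
        # Post-norm: LayerNorm sits on the residual stream itself, so every block
        # rescales the signal (and hence the gradient) flowing through it.
        return self.norm(x + self.mlp(x))
```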
2. The shape regularity of the diffusion sampling trajectories, Links: 1, 2
Conclusion: The sampling trajectories share a common shape with a linear-nonlinear-linear structure. Moreover, the high-dimensional trajectories can be well represented in a 3D subspace, because each trajectory deviates only slightly from the straight line joining its starting point (the initial noise) and its end point (the denoised output).
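A rough numerical sketch of the two measurements behind this claim (illustrative only, not the papers' code): how much of a trajectory's variance the top-3 principal components capture, and how far it strays from the straight chord between its endpoints.

```python
import numpy as np

def trajectory_stats(traj: np.ndarray):
    """traj: (T, D) states along one sampling trajectory;
    traj[0] is the initial noise, traj[-1] the denoised output."""
    # Fraction of variance captured by the top-3 principal components.
    centered = traj - traj.mean(axis=0, keepdims=True)
    svals = np.linalg.svd(centered, compute_uv=False)
    var_3d = (svals[:3] ** 2).sum() / (svals ** 2).sum()

    # Relative deviation from the straight line joining the two endpoints.
    chord = traj[-1] - traj[0]
    direction = chord / np.linalg.norm(chord)
    offsets = traj - traj[0]
    proj = offsets @ direction
    deviation = np.linalg.norm(offsets - np.outer(proj, direction), axis=1)
    return var_3d, deviation.max() / np.linalg.norm(chord)

# Sanity check on a perfectly straight synthetic path: var_3d is ~1.0 and the
# relative deviation is ~0.0; the cited papers report that real sampling
# trajectories come close to this regime.
t = np.linspace(0.0, 1.0, 50)[:, None]
path = (1 - t) * np.random.randn(1, 4096) + t * np.random.randn(1, 4096)
print(trajectory_stats(path))
```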
3. Flexible patch size for ViT: How should we resize patch embedding? Links: 1
Conclusion: Starting from principled experiments, the authors propose PI-Resize (a pseudo-inverse resize built from the bilinear-interpolation transformation matrices) for "losslessly" upsampling the patch-embedding weights.
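A minimal sketch of the idea as I read it (function names are mine, not the paper's): materialize the bilinear resize as an explicit matrix $B$, then solve for new weights that preserve the token value $\langle x, w \rangle$ for every patch $x$. Check the paper for the exact formulation.

```python
import torch
import torch.nn.functional as F

def resize_matrix(old: int, new: int) -> torch.Tensor:
    """Matrix B of shape (new*new, old*old) realizing bilinear patch resizing,
    built column by column by resizing each basis patch."""
    basis = torch.eye(old * old).reshape(old * old, 1, old, old)
    resized = F.interpolate(basis, size=(new, new), mode="bilinear", align_corners=False)
    return resized.reshape(old * old, new * new).T

def pi_resize(weight: torch.Tensor, old: int, new: int) -> torch.Tensor:
    """Resize one flattened patch-embedding filter w (old*old,) to (new*new,)
    by solving B^T w_new = w, so that <resize(x), w_new> matches <x, w>
    (exactly when upsampling, in the least-squares sense otherwise)."""
    B = resize_matrix(old, new)
    return torch.linalg.pinv(B.T) @ weight
```

In practice this would be applied independently to every (output-channel, input-channel) filter of the patch-embedding layer.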
4. Why scale qk by $\frac{1}{\sqrt{d}}$ in the Transformer? Links: 1, 2
Conclusion: To keep the qk dot product from growing with the embedding dimension (its standard deviation scales as $\sqrt{d}$ for unit-variance entries), which would otherwise saturate the softmax and yield a low-rank attention matrix.
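A quick numerical check (illustrative only): for unit-variance entries, the dot product $q \cdot k$ has standard deviation about $\sqrt{d}$, and dividing by $\sqrt{d}$ keeps it O(1) regardless of dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (64, 256, 1024):
    # q, k entries ~ N(0, 1): the dot product q.k is a sum of d terms with
    # variance 1 each, so its standard deviation grows like sqrt(d).
    q = rng.standard_normal((10_000, d))
    k = rng.standard_normal((10_000, d))
    dots = (q * k).sum(axis=1)
    print(d, round(dots.std(), 1), round((dots / np.sqrt(d)).std(), 2))
```

Without the $\frac{1}{\sqrt{d}}$ factor, logits of magnitude $\sqrt{d}$ push the softmax into its saturated region, where gradients vanish and each attention row collapses toward a one-hot distribution.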