Jiachen's Home Page
  • Revisiting the Hidden Secrets in Diffusion Models
  • A blog for beginners in generative modeling. It walks through the DDPM paper in detail, including the discrete decoder and its connection to the VAE.

  • Is Rectified Flow theoretically better than Diffusion Models? How to "cook" a good diffusion model in practice
  • In this blog, we analyze why diffusion models perform worse than rectified flow in practice. We then discuss the key aspects to consider when cooking a diffusion model.

  • Discussions on Some Interesting Topics
  • 1. Why is Pre-norm more frequently used than Post-norm in transformers? Links: 1, 2, 3

    Conclusion: Pre-norm leaves the residual (identity) path untouched, so gradients are not rescaled layer by layer, and training stays stable as model size and learning rate grow.
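    As a minimal PyTorch sketch (module names are mine, not from the linked posts), the two variants differ only in whether LayerNorm is applied to the sub-layer input or to the residual sum:

    ```python
    import torch
    import torch.nn as nn

    class Block(nn.Module):
        """Transformer block; pre_norm=True normalizes sub-layer inputs,
        pre_norm=False normalizes the residual sum (Post-norm)."""
        def __init__(self, d_model=256, n_heads=4, pre_norm=True):
            super().__init__()
            self.pre_norm = pre_norm
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            if self.pre_norm:
                # The identity path x is never rescaled: gradients flow
                # through the residual sum untouched at any depth.
                h = self.ln1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
                x = x + self.mlp(self.ln2(x))
            else:
                # LayerNorm sits on the residual path and rescales gradients
                # at every layer, which is why deep Post-norm stacks need
                # warmup and careful initialization.
                x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
                x = self.ln2(x + self.mlp(x))
            return x
    ```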


    2. The shape regularity of the diffusion sampling trajectories. Links: 1, 2

    Conclusion: The sampling trajectories share a common linear-nonlinear-linear shape. Moreover, the high-dimensional trajectories are well represented by a 3D subspace, because each trajectory deviates only slightly from the straight line joining its start point (initial noise) and its end point (denoised output).
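    Both claims are easy to check on one's own sampler output. A NumPy sketch (the `traj` array of intermediate states is a hypothetical input, not taken from the linked papers):

    ```python
    import numpy as np

    def trajectory_stats(traj):
        """traj: (T, D) array of states x_T, ..., x_0 along one sampling run.

        Returns (a) the largest deviation from the straight line joining the
        endpoints, relative to that line's length, and (b) the fraction of
        variance captured by the top-3 principal components.
        """
        start, end = traj[0], traj[-1]
        chord = end - start
        length = np.linalg.norm(chord)
        unit = chord / length
        # Perpendicular offset of each state from the endpoint-to-endpoint line.
        offsets = traj - start
        perp = offsets - (offsets @ unit)[:, None] * unit[None, :]
        rel_dev = np.linalg.norm(perp, axis=1).max() / length
        # SVD of the centered trajectory: energy in the top 3 directions.
        s = np.linalg.svd(traj - traj.mean(0), compute_uv=False)
        var3 = (s[:3] ** 2).sum() / (s ** 2).sum()
        return rel_dev, var3
    ```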


    3. Flexible patch size for ViT: How should we resize patch embedding? Links: 1

    Conclusion: Starting from principled experiments, the authors propose PI-Resize (transformation matrices derived from the pseudo-inverse of bilinear interpolation) for "losslessly" upsampling the patch-embedding weights.
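    The key identity: for a linear resize $x \mapsto Bx$, choosing $\hat{w} = (B^T)^{+} w$ preserves every token embedding when upsampling, since $\langle Bx, \hat{w}\rangle = \langle x, w\rangle$. A PyTorch sketch of this idea (bilinear resize assumed as the operator; the function names are mine, not the paper's API):

    ```python
    import torch
    import torch.nn.functional as F

    def resize_op_matrix(old, new):
        # Materialize the linear map B (new*new x old*old) realized by
        # bilinear interpolation, by resizing each basis "pixel" image.
        eye = torch.eye(old * old).reshape(old * old, 1, old, old)
        B = F.interpolate(eye, size=(new, new), mode="bilinear",
                          align_corners=False)
        return B.reshape(old * old, new * new).T  # (new*new, old*old)

    def pi_resize(w, new):
        # w: (old, old) patch-embedding filter for one input channel.
        # Solve B^T w_hat = w for the least-norm w_hat, so that
        # <Bx, w_hat> == <x, w> for every patch x.
        old = w.shape[0]
        B = resize_op_matrix(old, new)
        w_hat = torch.linalg.pinv(B.T) @ w.reshape(-1)
        return w_hat.reshape(new, new)
    ```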


    4. Why scale q·k by $\frac{1}{\sqrt{d}}$ in the Transformer? Links: 1, 2

    Conclusion: To keep the q·k dot products from growing with the embedding dimension $d$: for queries and keys with unit-variance entries, $q \cdot k$ has variance $d$, and unscaled logits saturate the softmax and push the attention matrix toward low rank.
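    A quick NumPy check of the variance argument (sample counts and dimensions chosen arbitrarily): the unscaled logit standard deviation grows like $\sqrt{d}$, while the scaled one stays near 1.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (16, 256, 4096):
        q = rng.standard_normal((2000, d))
        k = rng.standard_normal((2000, d))
        logits = (q * k).sum(axis=-1)
        # Unscaled std grows like sqrt(d); scaled std stays ~1, keeping the
        # softmax away from its saturated (near one-hot) regime.
        print(f"d={d:5d}  std={logits.std():7.2f}  "
              f"scaled std={(logits / np.sqrt(d)).std():.2f}")
    ```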