Latest Posts

External Discussions

1. Why is Pre-norm more frequently used than Post-norm?

Conclusion: Pre-norm keeps the residual path as a pure identity, so it does not disturb gradient flow and scales cleanly to deep models.
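A minimal PyTorch sketch of the two placements (class names are illustrative, not from the post). The only difference is where the LayerNorm sits relative to the residual connection:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """x -> LayerNorm(x + sublayer(x)): the norm sits on the residual path."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """x -> x + sublayer(LayerNorm(x)): the residual path is a pure identity."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

Stacking $N$ pre-norm blocks gives $x_N = x_0 + \sum_i f_i(\mathrm{LN}(x_i))$, so the gradient always reaches every layer through the identity term; post-norm routes every gradient through $N$ LayerNorms.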

2. The shape regularity of diffusion sampling trajectories

Conclusion: Trajectories follow a linear-nonlinear-linear structure within an approximately 3D subspace.
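The subspace claim is easy to probe numerically. The sketch below is a toy placeholder: `trajectory` stands in for the stacked intermediate latents $x_t$ of a real sampler, and for the samplers studied in the post the reported ratio should be close to 1.

```python
import numpy as np

# Placeholder trajectory: replace with the (num_steps, dim) stack of
# intermediate latents x_t produced by an actual diffusion sampler.
rng = np.random.default_rng(0)
trajectory = np.cumsum(rng.normal(size=(50, 1024)), axis=0)

# PCA via SVD on the centered trajectory.
centered = trajectory - trajectory.mean(axis=0, keepdims=True)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Fraction of the trajectory's variance captured by a 3D subspace.
ratio = (singular_values[:3] ** 2).sum() / (singular_values ** 2).sum()
print(f"variance explained by top-3 directions: {ratio:.1%}")
```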

3. Flexible patch size for ViT: How should we resize the patch embedding?

Conclusion: Starting from principled experiments, the authors propose PI-Resize (a transformation matrix derived from the pseudo-inverse of bilinear interpolation) for "losslessly" upsampling the patch-embedding weights.
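A NumPy sketch of the idea (my own reconstruction, not the authors' code): build the bilinear-resize matrix $B$ explicitly, then solve $B^\top w^\ast = w$ via the pseudo-inverse, so that the embedding of a bilinearly resized patch, $\langle Bx, w^\ast \rangle$, matches the original $\langle x, w \rangle$ exactly when upsampling.

```python
import numpy as np

def bilinear_resize_matrix(old, new):
    """Matrix B with B @ x.ravel() == bilinear_resize(x), built column by
    column by resizing each one-hot basis image."""
    def resize_one(img):
        ys = np.linspace(0, old - 1, new)
        xs = np.linspace(0, old - 1, new)
        y0 = np.clip(np.floor(ys).astype(int), 0, old - 2)
        x0 = np.clip(np.floor(xs).astype(int), 0, old - 2)
        wy = (ys - y0)[:, None]
        wx = (xs - x0)[None, :]
        tl = img[np.ix_(y0, x0)]
        tr = img[np.ix_(y0, x0 + 1)]
        bl = img[np.ix_(y0 + 1, x0)]
        br = img[np.ix_(y0 + 1, x0 + 1)]
        return (1 - wy) * ((1 - wx) * tl + wx * tr) + wy * ((1 - wx) * bl + wx * br)

    cols = []
    for i in range(old * old):
        basis = np.zeros(old * old)
        basis[i] = 1.0
        cols.append(resize_one(basis.reshape(old, old)).ravel())
    return np.stack(cols, axis=1)          # shape (new*new, old*old)

# PI-Resize: pick new weights w* with B^T w* = w, so the patch embedding
# <B x, w*> equals <x, w> for every patch x.
old_p, new_p = 8, 16
B = bilinear_resize_matrix(old_p, new_p)
w = np.random.default_rng(0).normal(size=old_p * old_p)   # one flattened kernel
w_star = np.linalg.pinv(B.T) @ w

x = np.random.default_rng(1).normal(size=old_p * old_p)   # a flattened patch
print(np.allclose(w @ x, w_star @ (B @ x)))               # True: lossless
```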

4. Why scale $q^\top k$ by $\frac{1}{\sqrt{d}}$ in the Transformer?

Conclusion: To keep the dot product $q^\top k$ from growing with the embedding dimension $d$ (for unit-variance entries its variance scales linearly with $d$), which would otherwise saturate the softmax and drive the attention matrix toward low rank.
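A quick NumPy check of both halves of that argument (illustrative, assuming i.i.d. unit-variance entries for $q$ and $k$):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# With i.i.d. unit-variance entries, Var(q . k) = d, so the raw logits'
# spread grows like sqrt(d); dividing by sqrt(d) pins it at 1.
for d in (16, 256, 4096):
    q, k = rng.normal(size=(2, 10_000, d))
    scores = (q * k).sum(axis=-1)
    print(f"d={d:5d}  std={scores.std():7.1f}  "
          f"scaled std={(scores / np.sqrt(d)).std():.2f}")

# Logits with std sqrt(4096) = 64 saturate the softmax into a near one-hot
# row, killing gradients and pushing the attention matrix toward rank 1;
# the scaled version stays well-spread.
logits = rng.normal(size=8) * np.sqrt(4096)
print(softmax(logits).round(3))
print(softmax(logits / np.sqrt(4096)).round(3))
```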