Residual networks can be viewed as a discretized denoising process. Training each block with a score-matching objective removes the need for inter-block communication and greatly reduces memory pressure.

MeNews
DiffusionBlocks: block-wise training can cut neural network training memory to 1/B, with performance validated across multiple architectures
DiffusionBlocks splits a Transformer-based network into B independently trainable blocks. It treats the residual network as a sequence of discretized denoising steps and uses score matching to train each block on its own, with no inter-block communication, which significantly reduces training memory. Experiments show the approach is effective across multiple architectures. During inference, only one block is active per step, which reduces the computational cost of a 12-layer DiT (B=3) to one-third of the original. The method is suitable for ViT, DiT, MDM, and AR Transformers, but it requires matching input and output dimensions and cannot be used with U-Net.
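The 1/B and one-third figures above follow from simple bookkeeping, which the sketch below illustrates. It is not the DiffusionBlocks implementation: the function names are hypothetical, the layers are assumed to split evenly into contiguous blocks, and the uniform mapping from a noise level in [0, 1] to a block is an assumption made only for illustration, since the brief does not say how steps are assigned to blocks.

```python
def partition_layers(num_layers: int, num_blocks: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, equal-sized blocks."""
    if num_blocks <= 0 or num_layers % num_blocks != 0:
        raise ValueError("this sketch assumes num_layers is divisible by num_blocks")
    size = num_layers // num_blocks
    return [range(b * size, (b + 1) * size) for b in range(num_blocks)]


def block_for_noise_level(t: float, num_blocks: int) -> int:
    """Pick the single block responsible for noise level t in [0, 1].

    Assumption: noise levels are divided uniformly among the blocks.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Clamp so that t == 1.0 falls into the last block.
    return min(int(t * num_blocks), num_blocks - 1)


def active_fraction(num_layers: int, num_blocks: int) -> float:
    """Fraction of layers that run when only one block is active."""
    blocks = partition_layers(num_layers, num_blocks)
    return len(blocks[0]) / num_layers


if __name__ == "__main__":
    # The 12-layer DiT with B=3 from the text: three blocks of four layers.
    for index, block in enumerate(partition_layers(12, 3)):
        print(f"block {index}: layers {list(block)}")
    print("active fraction per step:", active_fraction(12, 3))
```

With 12 layers and B=3, each block holds four layers, so a step that activates one block runs 4 of the 12 layers. The same ratio applies to training memory when only one block is trained at a time.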