DiffusionBlocks: block-wise training cuts neural network training memory to about 1/B, with results validated across multiple architectures
DiffusionBlocks partitions a Transformer-based network into B independently trainable blocks. It treats the network's residual updates as discretized denoising steps and trains each block with a score-matching objective, so the blocks need no communication with one another during training and training memory drops significantly, to roughly 1/B of the original. Experiments show the approach is effective across multiple architectures. At inference, only one block is active per denoising step, which cuts the compute cost of a 12-layer DiT with B=3 to one-third of the original. The method applies to ViT, DiT, MDM, and AR Transformers, but it requires matching input and output dimensions and therefore cannot be used with U-Net.
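
The one-third figure can be checked with a small sketch. This is only an illustration under stated assumptions, not the DiffusionBlocks implementation: it assumes the 12 layers are split into equal contiguous blocks and that each block is responsible for one contiguous range of noise levels. The helper names `partition_layers` and `active_block`, and the noise cut points, are hypothetical.

```python
from bisect import bisect_right


def partition_layers(num_layers, num_blocks):
    """Split layer indices into equal, contiguous blocks (assumed layout)."""
    if num_layers % num_blocks != 0:
        raise ValueError("this sketch assumes num_layers is divisible by num_blocks")
    size = num_layers // num_blocks
    return [list(range(i * size, (i + 1) * size)) for i in range(num_blocks)]


def active_block(sigma, boundaries):
    """Return the index of the block that handles noise level `sigma`.

    `boundaries` holds the ascending interior cut points between blocks,
    so it has num_blocks - 1 entries.
    """
    return bisect_right(boundaries, sigma)


num_layers, num_blocks = 12, 3
blocks = partition_layers(num_layers, num_blocks)

# Hypothetical cut points on the noise axis; real values would come from
# whatever partitioning rule the method uses.
boundaries = [0.5, 2.0]

# Only the block owning the current noise level runs at a denoising step.
block_index = active_block(1.0, boundaries)
layers_run = len(blocks[block_index])
cost_fraction = layers_run / num_layers

print(blocks)
print(block_index, layers_run, cost_fraction)
```

Under the same equal-split assumption, only one block's layers are held for backpropagation at a time during training, which is where the roughly 1/B memory figure comes from.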